Site Reliability Engineer - SRE

August 9

🏡 Remote – New York

StarCompliance

We are Reputation Guardians, on a mission to make compliance simple and easy.

Insider Trading • Political Activity • Gifts & Entertainment • Outside Activity • Reporting

201 - 500

💰 Venture Round on 2020-12

Description

• Maintain and improve the platform's reliability, availability, and performance, leveraging Azure as the core cloud platform.
• Work closely with cross-functional teams to design, implement, and maintain resilient systems.
• Automate wherever possible to streamline operations and minimize downtime.
• Proactively identify and resolve potential issues before they impact customers.
• Contribute to the continuous improvement of our infrastructure and processes.
• Analyze reliability challenges and develop automated solutions for incident resolution.
• Work with development teams to improve applications' operational features for faster MTTD, MTTR, and auto-recovery.
• Lead the establishment of SLIs, SLOs, error budgets, and policies.
• Identify, track, and address toil.
• Conduct postmortems.
• Identify and implement continuous improvements across production operations.
• Offer advanced technical support for cross-product issues and incidents.
• Leverage SRE tooling to develop, implement, and deliver on the SRE mission.
• Conduct chaos testing.
• Identify, define, and implement new tools and technologies to improve the quality and efficiency of distributed platforms.
• Drive the reliability and supportability aspects of the cloud service, including change management, triage of customer escalations, remediation plans, playbooks, and automation.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.

Requirements

• 4+ years of experience in reliability engineering.
• 2+ recent years of experience with Azure systems.
• Advanced knowledge of the New Relic ecosystem.
• Working knowledge of monitoring and APM tools such as Azure App Insights, Grafana, and Selenium.
• Knowledge of networking and of troubleshooting latency, connectivity, and performance.
• Experience with IaC using Terraform and CaC using Ansible.
• Familiarity with one or more databases: SQL Server, MongoDB, or PostgreSQL.
• Hands-on experience with SRE practices and with writing and running chaos engineering experiments.
• Preferred: experience with C#, .NET, and PowerShell, Python, or Golang.
• Experience with containerization.
• Experience with high availability and distributed systems.
• Proficiency in Linux and Windows administration, troubleshooting, and support.
• Experience with Azure DevOps.
• Excellent debugging skills across a variety of integrated platforms.
