Lead Site Reliability Engineer

August 14

🏡 Remote – New York


hims & hers

hims & hers offers a modern approach to health and wellness.

201 - 500

Description

• Design and implement SRE practices that ensure the availability, scalability, and observability of production systems, with a strong focus on excellent customer experience
• Actively seek and identify opportunities to improve the availability and performance of the system by applying what is learned from monitoring and observation
• Use automation extensively to design, configure, manage, and monitor systems in support of our product development teams
• Demonstrate an understanding of infrastructure and infrastructure automation (Infrastructure as Code)
• Manage incidents and emergency response, track outages, ensure data integrity, and engineer releases to promote safe, efficient, and rapid deployments
• Handle emergency response either by being on-call or by reacting to symptoms according to monitoring, escalating when needed
• Improve the codebase by resolving logic issues, deprecating unused code, etc.
• Implement monitoring, logging, alerting, and SLO reporting
• Identify Service Level Indicators (SLIs) that will align the team to meet the availability and performance objectives
• Run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent recurrence
• Provide reviews on design documents from internal and external teams
• Perform more complex tasks using highly specialized knowledge and advanced business experience
• Resolve complex tickets creatively
• Develop and lead large, highly complex cross-functional projects or programs
• Determine solutions to blockers, identify tasks, and develop solutions as appropriate
• Take responsibility for at least one major delivery domain and be accountable for all aspects of SRE for that domain
• Develop standards, tools, and knowledge requirements for skill and career development

Requirements

• 10+ years as a software engineer, shipping production code
• 5+ years of experience as a Site Reliability Engineer or Production Support Engineer
• Bachelor's degree in Computer Science, Engineering, or a related field, or relevant years of work experience
• Experience with service-oriented architectures and microservices at scale
• Strong proficiency with relational databases (PostgreSQL, MySQL, SQL Server, etc.)
• Strong proficiency in SQL scripting
• Proficiency developing in one or more languages such as Java, Kotlin, Python, and/or others
• Ability to use containers and orchestration frameworks (Kubernetes, Docker, container registries, etc.)
• Knowledge of CDNs, TypeScript frameworks, and GQL
• Knowledge and good understanding of pub/sub or queue messaging systems
• Proficiency in Git or another VCS
• Experience with configuring, customizing, and extending monitoring tools (Datadog, Prometheus, New Relic, etc.)
• Excellent debugging and troubleshooting skills
• Strong technical competency, with a data-driven analytical approach to solving complex challenges
• A systematic problem-solving approach, coupled with strong and effective communication skills and a sense of drive
• Nice-to-have: Experience with Terraform or other IaC tools such as Chef, Puppet, or Ansible

Benefits

• Competitive salary & equity compensation for full-time roles
• Unlimited PTO, company holidays, and quarterly mental health days
• Comprehensive health benefits including medical, dental & vision, and parental leave
• Employee Stock Purchase Program (ESPP)
• Employee discounts on hims & hers & Apostrophe online products
• 401k benefits with employer matching contribution
• Offsite team retreats

