Site Reliability Engineering Manager | London (2 days Hybrid)
We're partnering with one of the UK's most recognised and high-traffic consumer tech platforms to find an Engineering Manager to lead their Site Reliability function.
The Role
This role blends people leadership with technical ownership of operational excellence, observability and reliability at scale, on a platform that serves millions of users. You'll own incident management processes, drive reliability engineering standards, and ensure the business meets its exceptionally high availability targets.
Key Responsibilities
- Own the monitoring, alerting and observability strategy, so that product teams can trust the reliability of their services and detect and resolve incidents quickly
- Lead and standardise incident management processes, maintaining a culture of accountability, transparency and continuous learning
- Define reliability patterns and standards to reduce cascading failures across distributed systems
- Own and manage the reliability roadmap, OKR delivery and alignment with wider business goals
- Lead, develop and grow a team of engineers: set objectives and growth plans, and foster a psychologically safe, inclusive environment
What You'll Need
- Proven experience managing SRE teams in production environments, covering observability, monitoring and service delivery
- Strong understanding of reliability in distributed microservices and cloud-based architectures
- Experience with modern SRE tooling, incident management workflows and SLO/SLI frameworks
- Familiarity with platform engineering concepts and reducing friction for product teams
- Strong leadership, communication and stakeholder management skills