Site Reliability Engineer
Lead Site Reliability Engineer
Job Type: Contract
Location: London, UK
Key Responsibilities:
- Act as the Technical Authority (SME) for both Azure and Terraform, guide teams on SRE practices, and approve production changes.
- This role is platform-focused, not application-specific, and requires deep expertise in SRE principles, Azure Landing Zones (Hub-and-Spoke), Terraform, DevOps enablement, monitoring/observability, and incident management.
- Address long-term reliability and operational risks while building and mentoring SRE teams.
- Design, implement, and operate Azure Hub-and-Spoke Landing Zone architectures.
- Reduce operational toil through automation and platform improvements.
- Own and evangelize SRE principles including availability, reliability, scalability, resilience, and operational maturity.
- Define Terraform best practices, state management, drift detection, and CI/CD integration.
- Build and maintain CI/CD foundations using GitHub Actions.
- Design and standardize monitoring and observability across the Azure platform.
- Lead and participate in major incident management following ITIL processes.
- Partner with security teams to implement least-privilege access and secure-by-default architectures.
- Enforce governance using Azure Policy and standardized platform controls.
- Lead, mentor, and grow a high-performing SRE/platform engineering team.
- Drive SRE culture across the organization and set technical direction, standards, and operational maturity goals.
- Clearly explain and apply SRE concepts (SLIs, SLOs, error budgets, toil reduction, blameless postmortems).
- Define and track platform-level SLIs/SLOs and ensure alignment with business objectives.
- Strong hands-on knowledge of RBAC and IAM in Azure, Managed Identities, and Azure Key Vault for secrets, keys, and certificates.