Senior Site Reliability Engineer (SRE)
London/Hybrid
12-month contract (high chance of extension)
Job description:
Join a global pioneer in the video game industry, shaping the future of digital entertainment for millions of players worldwide. As a Senior Site Reliability Engineer, you'll sit at the heart of a high-impact Technical Operations team, driving reliability, scalability, and performance across revenue-critical commerce platforms powering subscriptions and personalised experiences.
You'll collaborate closely with product and engineering teams, influencing architecture, improving deployment safety, and elevating observability to ensure seamless experiences for a global gaming community.
This is a role where your decisions directly impact live services at massive scale.
Responsibilities:
Reliability & Engineering Excellence
- Identify and eliminate systemic reliability risks
- Define and evolve SLIs, SLOs, and error budgets aligned to user and business outcomes
- Lead major incident response, post-mortems, and long-term remediation
Architecture & Scalability
- Influence system design for high availability and resilience
- Drive strategies for multi-region failover and disaster recovery
- Balance performance, cost, and operational risk
CI/CD & Deployment Safety
- Enhance pipelines to enable faster, safer releases
- Implement modern deployment strategies (canary, blue/green, progressive delivery)
- Build robust rollback and recovery mechanisms
Observability & Performance
- Develop advanced monitoring across metrics, logs, and tracing
- Improve signal quality and reduce alert fatigue
- Troubleshoot and resolve performance bottlenecks
Infrastructure & Automation
- Operate large-scale cloud-native, containerised systems
- Build Infrastructure as Code solutions for resilient environments
- Automate away toil and improve operational efficiency
Leadership & Collaboration
- Mentor engineers and champion SRE best practices
- Partner cross-functionally with engineering, product, and security teams
- Drive a culture of reliability across the organisation
Experience:
- 7+ years in Site Reliability Engineering, Production Engineering, or Systems Engineering
- Strong expertise in distributed systems, including failure modes and fault tolerance
- Proven experience operating cloud platforms (AWS, GCP, or Azure) in multi-region environments
- Deep knowledge of Kubernetes and container orchestration at scale
- Strong programming skills (Go, Python, Java) with a focus on automation and tooling
- Hands-on experience building and managing CI/CD pipelines with safety guardrails
- Demonstrated success leading high-severity incidents and driving systemic improvements
- Excellent stakeholder management and ability to influence technical decisions
Preferred experience:
- Multi-cloud or advanced resilience architecture experience
- Familiarity with tools like Prometheus, Grafana, or Datadog
- Experience with Terraform, CloudFormation, or similar IaC tools
- Exposure to AI-assisted tooling for operations or observability
If you are interested in this role, please submit your CV.