Senior Site Reliability Engineer (SRE)

London/Hybrid

12-month contract (high chance of extension)

Job description:

Join a global pioneer in the video game industry, shaping the future of digital entertainment for millions of players worldwide. As a Senior Site Reliability Engineer, you'll sit at the heart of a high-impact Technical Operations team, driving reliability, scalability, and performance across revenue-critical commerce platforms powering subscriptions and personalised experiences.

You'll collaborate closely with product and engineering teams, influencing architecture, improving deployment safety, and elevating observability to ensure seamless experiences for a global gaming community.

This is a role where your decisions directly impact live services at massive scale.

Responsibilities:

Reliability & Engineering Excellence

Identify and eliminate systemic reliability risks
Define and evolve SLIs, SLOs, and error budgets aligned to user and business outcomes
Lead major incident response, post-mortems, and long-term remediation

Architecture & Scalability

Influence system design for high availability and resilience
Drive strategies for multi-region failover and disaster recovery
Balance performance, cost, and operational risk

CI/CD & Deployment Safety

Enhance pipelines to enable faster, safer releases
Implement modern deployment strategies (canary, blue/green, progressive delivery)
Build robust rollback and recovery mechanisms

Observability & Performance

Develop advanced monitoring across metrics, logs, and tracing
Improve signal quality and reduce alert fatigue
Troubleshoot and resolve performance bottlenecks

Infrastructure & Automation

Operate large-scale cloud-native, containerised systems
Build Infrastructure as Code solutions for resilient environments
Automate away toil and improve operational efficiency

Leadership & Collaboration

Mentor engineers and champion SRE best practices
Partner cross-functionally with engineering, product, and security teams
Drive a culture of reliability across the organisation

Experience:

7+ years in Site Reliability Engineering, Production Engineering, or Systems Engineering
Strong expertise in distributed systems, including failure modes and fault tolerance
Proven experience operating cloud platforms (AWS, GCP, or Azure) in multi-region environments
Deep knowledge of Kubernetes and container orchestration at scale
Strong programming skills (Go, Python, Java) with a focus on automation and tooling
Hands-on experience building and managing CI/CD pipelines with safety guardrails
Demonstrated success leading high-severity incidents and driving systemic improvements
Excellent stakeholder management and ability to influence technical decisions

Preferred experience:

Multi-cloud or advanced resilience architecture experience
Familiarity with tools like Prometheus, Grafana, or Datadog
Experience with Terraform, CloudFormation, or similar IaC tools
Exposure to AI-assisted tooling for operations or observability

If you are interested in this role, please feel free to submit your CV!
