Senior Site Reliability Engineer (SRE)

London/Hybrid

12-month contract (high chance of extension)

Job description:

Join a global pioneer in the video game industry, shaping the future of digital entertainment for millions of players worldwide. As a Senior Site Reliability Engineer, you'll sit at the heart of a high-impact Technical Operations team, driving reliability, scalability, and performance across revenue-critical commerce platforms powering subscriptions and personalised experiences.

You'll collaborate closely with product and engineering teams, influencing architecture, improving deployment safety, and elevating observability to ensure seamless experiences for a global gaming community.

This is a role where your decisions directly impact live services at massive scale.

Responsibilities:

Reliability & Engineering Excellence

  • Identify and eliminate systemic reliability risks
  • Define and evolve SLIs, SLOs, and error budgets aligned to user and business outcomes
  • Lead major incident response, post-mortems, and long-term remediation

Architecture & Scalability

  • Influence system design for high availability and resilience
  • Drive strategies for multi-region failover and disaster recovery
  • Balance performance, cost, and operational risk

CI/CD & Deployment Safety

  • Enhance pipelines to enable faster, safer releases
  • Implement modern deployment strategies (canary, blue/green, progressive delivery)
  • Build robust rollback and recovery mechanisms

Observability & Performance

  • Develop advanced monitoring across metrics, logs, and tracing
  • Improve signal quality and reduce alert fatigue
  • Troubleshoot and resolve performance bottlenecks

Infrastructure & Automation

  • Operate large-scale cloud-native, containerised systems
  • Build Infrastructure as Code solutions for resilient environments
  • Automate away toil and improve operational efficiency

Leadership & Collaboration

  • Mentor engineers and champion SRE best practices
  • Partner cross-functionally with engineering, product, and security teams
  • Drive a culture of reliability across the organisation

Experience:

  • 7+ years in Site Reliability Engineering, Production Engineering, or Systems Engineering
  • Strong expertise in distributed systems, including failure modes and fault tolerance
  • Proven experience operating cloud platforms (AWS, GCP, or Azure) in multi-region environments
  • Deep knowledge of Kubernetes and container orchestration at scale
  • Strong programming skills (Go, Python, Java) with a focus on automation and tooling
  • Hands-on experience building and managing CI/CD pipelines with safety guardrails
  • Demonstrated success leading high-severity incidents and driving systemic improvements
  • Excellent stakeholder management and ability to influence technical decisions

Preferred experience:

  • Multi-cloud or advanced resilience architecture experience
  • Familiarity with tools like Prometheus, Grafana, or Datadog
  • Experience with Terraform, CloudFormation, or similar IaC tools
  • Exposure to AI-assisted tooling for operations or observability

If you are interested in this role, please feel free to submit your CV!

Job Details

Company: CBSbutler Holdings Limited trading as CBSbutler
Location: London, United Kingdom
Employment Type: Contract