Senior Site Reliability Engineer

Client: Globally known technology and entertainment company
Location: London, hybrid working
Contract Dates: ASAP start, 12-month contract
Pay: TBC (day rate)
Role Overview
We are seeking a Senior Site Reliability Engineer to drive the reliability, scalability, and operational excellence of Network commerce systems, including subscriptions and personalization services. You will collaborate closely with product and engineering teams to enhance system architecture, deployment safety, observability, and overall performance. As part of the centralized Technical Operations team, you will have horizontal responsibility for production reliability, participate in the on-call rotation for critical commerce services, and influence systems that serve millions of users globally.
Key Responsibilities
Reliability & Risk Engineering
  • Identify systemic reliability risks and implement preventative solutions.
  • Define and maintain SLIs, SLOs, and error budgets aligned with business and user outcomes.
  • Lead incident management, post-incident reviews, and remediation planning.
Architecture & Resilience
  • Review and advise on system architecture to improve scalability, availability, and fault isolation.
  • Design strategies for high availability, graceful degradation, and disaster recovery across multi-region environments.
  • Quantify trade-offs between performance, cost, and operational risk.
CI/CD & Deployment Safety
  • Enhance deployment pipelines and implement automation to reduce risk and accelerate delivery.
  • Apply safe deployment patterns such as canary, blue/green, and progressive delivery.
  • Ensure robust rollback and recovery mechanisms.
Observability & Performance
  • Build and evolve monitoring, logging, and tracing solutions to provide actionable insights.
  • Collaborate to reduce alert fatigue and improve signal quality.
  • Diagnose performance bottlenecks across infrastructure and applications.
Infrastructure & Automation
  • Operate cloud-native and containerized workloads at scale.
  • Use Infrastructure as Code tools to deploy and manage resilient platforms.
  • Develop automation frameworks to reduce manual toil and operational risk.
Leadership & Mentorship
  • Mentor mid-level engineers and advocate SRE best practices across teams.
  • Partner with engineering, product, and security teams to embed reliability into system design.
Required Qualifications
  • Bachelor's degree in Computer Science, Engineering, or equivalent experience.
  • 7+ years in site reliability, production engineering, or systems engineering roles.
  • Strong understanding of distributed systems, consistency models, failure modes, and fault isolation strategies.
  • Hands-on experience with AWS, GCP, or Azure, including multi-region deployments.
  • Proficiency in Kubernetes and large-scale container orchestration.
  • Programming experience in Go, Python, or Java, building automation or reliability systems.
  • Experience designing and operating CI/CD pipelines with deployment safety guardrails.
  • Proven track record leading high-severity incidents and driving systemic remediation.
  • Excellent interpersonal skills with experience influencing cross-team decisions.
Preferred Qualifications
  • Experience with multi-cloud or multi-region resilience architecture.
  • Proficiency in monitoring and observability tools (Prometheus, Grafana, Datadog).
  • Prior mentorship or technical leadership experience.
  • Familiarity with Infrastructure as Code tools (Terraform, CloudFormation).
  • Experience using AI-assisted tools for incident analysis, operational efficiency, or observability.
If this sounds like you, apply now by sending your CV to

Job Details

Company: Outsource
Location: London, South East, England, United Kingdom
Hybrid / Remote Options
Employment Type: Contractor
Salary: Negotiable