Senior Site Reliability Engineer

Client: Globally known technology and entertainment company
Location: London, hybrid working
Contract Dates: ASAP start, 12-month contract
Pay: TBC (day rate)

Role Overview

We are seeking a Senior Site Reliability Engineer to drive the reliability, scalability, and operational excellence of Network commerce systems, including subscriptions and personalization services. You will collaborate closely with product and engineering teams to enhance system architecture, deployment safety, observability, and overall performance. As part of the centralized Technical Operations team, you will have horizontal responsibility for production reliability, participate in the on-call rotation for critical commerce services, and influence systems that serve millions of users globally.

Key Responsibilities

Reliability & Risk Engineering

Identify systemic reliability risks and implement preventative solutions.
Define and maintain SLIs, SLOs, and error budgets aligned with business and user outcomes.
Lead incident management, post-incident reviews, and remediation planning.

Architecture & Resilience

Review and advise on system architecture to improve scalability, availability, and fault isolation.
Design strategies for high availability, graceful degradation, and disaster recovery across multi-region environments.
Quantify trade-offs between performance, cost, and operational risk.

CI/CD & Deployment Safety

Enhance deployment pipelines and implement automation to reduce risk and accelerate delivery.
Apply safe deployment patterns such as canary, blue/green, and progressive delivery.
Ensure robust rollback and recovery mechanisms.

Observability & Performance

Build and evolve monitoring, logging, and tracing solutions to provide actionable insights.
Collaborate to reduce alert fatigue and improve signal quality.
Diagnose performance bottlenecks across infrastructure and applications.

Infrastructure & Automation

Operate cloud-native and containerized workloads at scale.
Use Infrastructure as Code tools to deploy and manage resilient platforms.
Develop automation frameworks to reduce manual toil and operational risk.

Leadership & Mentorship

Mentor mid-level engineers and advocate SRE best practices across teams.
Partner with engineering, product, and security teams to embed reliability into system design.

Required Qualifications

Bachelor's degree in Computer Science, Engineering, or equivalent experience.
7+ years in site reliability, production engineering, or systems engineering roles.
Strong understanding of distributed systems, consistency models, failure modes, and fault isolation strategies.
Hands-on experience with AWS, GCP, or Azure, including multi-region deployments.
Proficiency in Kubernetes and large-scale container orchestration.
Programming experience in Go, Python, or Java, building automation or reliability systems.
Experience designing and operating CI/CD pipelines with deployment safety guardrails.
Proven track record leading high-severity incidents and driving systemic remediation.
Excellent interpersonal skills with experience influencing cross-team decisions.

Preferred Qualifications

Experience with multi-cloud or multi-region resilience architecture.
Proficiency in monitoring and observability tools (Prometheus, Grafana, Datadog).
Prior mentorship or technical leadership experience.
Familiarity with Infrastructure as Code tools (Terraform, CloudFormation).
Experience using AI-assisted tools for incident analysis, operational efficiency, or observability.

If this sounds like you, apply now by sending your CV to
