Site Reliability Engineer (Contractor)
THE COMPANY
Join one of the world's most recognisable entertainment brands as they continue to scale their global digital commerce platform. You'll be part of a centralised Technical Operations function supporting revenue-critical services used by millions of players worldwide. This is a high-impact role where reliability, scalability, and engineering excellence are at the heart of everything you do.

THE ROLE
As a Senior Site Reliability Engineer, you will take ownership of reliability strategy and operational excellence across key commerce domains, including subscriptions and personalisation services on the PlayStation Network. Operating in a centralised SRE model, you'll partner closely with product and engineering teams while maintaining horizontal responsibility for production health, resilience, and scalability.

You'll lead incident response, define reliability standards, influence architectural decisions, and build automation that elevates deployment safety and operational efficiency.
This is a hands-on, senior technical role with direct influence on systems that generate significant global revenue.

Key Responsibilities

Reliability & Risk Engineering
* Identify systemic reliability risks and drive long-term preventative improvements
* Define and refine SLIs, SLOs, and error budgets aligned to customer and business outcomes
* Lead high-severity incident response, post-incident reviews, and remediation planning

Architecture & Resilience
* Influence system architecture to improve scalability, availability, and failure isolation
* Design multi-region HA, graceful degradation, and disaster recovery strategies
* Evaluate trade-offs between performance, cost, and operational risk

CI/CD & Deployment Safety
* Enhance CI/CD pipelines with automation and safety guardrails
* Implement safe deployment patterns (canary, blue/green, progressive delivery)
* Ensure robust rollback and recovery mechanisms

Observability & Performance
* Build and evolve observability tooling across metrics, logs, and traces
* Reduce alert fatigue and improve signal quality
* Diagnose performance bottlenecks across infrastructure and applications

Infrastructure & Automation
* Design and operate cloud-native and containerised workloads at scale
* Use IaC to build resilient, repeatable platforms
* Develop automation frameworks that eliminate manual toil

Leadership & Collaboration
* Mentor mid-level engineers and champion SRE best practices
* Partner with engineering, product, and security teams to embed reliability into system design

YOUR SKILLS & EXPERIENCE

Required:
* Degree in Computer Science, Engineering, or equivalent experience
* 7+ years in SRE, Production Engineering, or Systems Engineering roles
* Strong understanding of distributed systems, failure modes, and consistency models
* Hands-on experience operating production workloads in AWS, GCP, or Azure (multi-region)
* Strong Kubernetes experience in large-scale production environments
* Proficiency in Go, Python, or Java for tooling and automation
* Proven experience leading high-severity incidents and driving cross-team remediation
* Experience designing and operating CI/CD systems with deployment safety guardrails

Preferred:
* Multi-cloud or multi-region resilience experience
* Experience with Prometheus, Grafana, Datadog, or similar observability stacks
* Prior mentorship or technical leadership experience
* Experience with Terraform, CloudFormation, or other IaC tools
* Exposure to AI-assisted tooling for incident analysis or operational insights

BENEFITS
* Competitive salary + bonus
* Hybrid working model
* Excellent pension, healthcare, and wellbeing benefits
* Opportunity to influence global-scale systems used by millions

HOW TO APPLY
If you're a Senior SRE who thrives in high-scale, high-impact environments and wants to shape the reliability of globally distributed systems, we'd love to hear from you. Apply now or contact Harnham for more information.