Site Reliability Engineer (GCP) - 12-Month Contract - Bristol, Manchester, Leeds, Halifax

Site Reliability Engineer (GCP) - 12-Month Contract - Bristol, Manchester, Leeds, Halifax (Inside IR35)

We are seeking an experienced Site Reliability Engineer (SRE) to join our engineering organisation and drive the reliability, availability, performance, and operational excellence of critical cloud-native services running on Google Cloud Platform (GCP).

This role is focused on reliability engineering principles rather than platform engineering or traditional DevOps activities. The successful candidate will be responsible for defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, incident management processes, and observability frameworks to ensure highly resilient and scalable services.

Key Responsibilities:

  • Define, manage, and report on Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets for critical services.
  • Lead the response to major incidents, driving rapid restoration of services and conducting post-incident reviews to identify and address root causes.
  • Design, implement, and continuously improve observability solutions using Grafana, Prometheus, logging, metrics, and distributed tracing.
  • Monitor production systems proactively, ensuring high availability, performance, and reliability of customer-facing services.
  • Support and troubleshoot Kubernetes workloads running on Google Kubernetes Engine (GKE).
  • Partner with engineering teams to improve application resilience, scalability, and operational readiness.

What You Will Ideally Bring:

  • Proven experience working as a Site Reliability Engineer supporting large-scale production environments.
  • Strong understanding of SRE principles, including SLIs, SLOs, Error Budgets, availability, reliability, and operational excellence.
  • Hands-on experience with Google Cloud Platform (GCP), including Google Kubernetes Engine (GKE).
  • Strong Kubernetes operational and troubleshooting skills within production environments.
  • Expertise in observability and monitoring using Grafana, Prometheus, logging, metrics, alerting, and distributed tracing.
  • Experience managing and resolving critical incidents, conducting root cause analysis (RCA), and driving post-incident improvements.
  • Strong understanding of system performance, scalability, capacity planning, and resilience engineering.
  • Experience automating operational processes using scripting or programming languages such as Python, Go, or Bash.

Contract Details:

  • Duration: 12 months
  • Rate: £500-525 per day (Inside IR35)
  • Location: Bristol, Manchester, Leeds, or Halifax (Hybrid - 2 days per week)
  • Start Date: ASAP

Job Details

Company: Hamilton Barnes
Location: United Kingdom (Hybrid / Remote Options)
Employment Type: Contract
Salary: GBP 500 - 525 Daily
Posted