Site Reliability Engineer (SRE)

Site Reliability Engineer – (SRE, Site Reliability Engineer, Terraform, AKS, Azure, Kubernetes, PowerShell, Python, Bash, Datadog, Monitoring Tools) – Permanent – Remote

Charles Simon Associates are currently recruiting for a Site Reliability Engineer (SRE) on a permanent basis. This role is for a global business headquartered in the City of London.

Candidates will need to be British Citizens due to Security Clearance requirements.

Location: Remote, with some travel to London

Salary: Up to £125,000 per annum

Skills/Requirements for the Site Reliability Engineer:

Extensive SRE experience within previous roles
Strong Terraform skills
Proven Kubernetes and AKS experience
Experience creating and modifying Terraform deployments in live environments
Experience with monitoring solutions, ideally Datadog, although Azure Application Insights, Log Analytics or Grafana are also relevant
Scripting skills for automation in PowerShell, Python or Bash
Experience with web-based applications

Desirable Skills:

Knowledge or commercial experience of Microservices Architecture
Kanban
Any prior experience of working with Puppet and Chef would be advantageous

Start date: ASAP

The Site Reliability Engineer will be responsible for:

Designing and enforcing service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs) to ensure reliability targets are measurable and aligned with business expectations
Implementing incident response frameworks, including runbooks, postmortems, and blameless RCA processes to drive continuous improvement
Integrating observability tooling (e.g. Prometheus, Grafana, Datadog, OpenTelemetry) to enable proactive detection and resolution of system anomalies
Managing infrastructure as code (IaC) using tools like Terraform, Pulumi, or CloudFormation to ensure repeatable, auditable deployments
Optimizing cost and resource utilization across cloud environments through rightsizing, autoscaling, and lifecycle policies
Driving chaos engineering initiatives to test system resilience under failure conditions and validate recovery strategies
Championing security best practices within infrastructure, e.g. secrets management, IAM policies, and vulnerability scanning
Collaborating with DevOps and platform teams to build paved-road deployment patterns and internal developer portals
Leading capacity planning and load testing efforts to anticipate scaling needs and prevent bottlenecks
Contributing to architectural decisions that impact reliability, latency, and fault domains across distributed systems

Please send an up-to-date copy of your CV to be considered for the Site Reliability Engineer role.
