Site Reliability Engineer (SRE)
Site Reliability Engineer – (SRE, Site Reliability Engineer, Terraform, AKS, Azure, Kubernetes, PowerShell, Python, Bash, Datadog, Monitoring Tools) – Permanent – Remote
Charles Simon Associates are currently recruiting for a Site Reliability Engineer on a permanent basis. This role is for a global business headquartered in the City of London.
Candidates will need to be British Citizens due to Security Clearance requirements.
Location: Remote, with some travel to London
Salary: Up to £125,000 per annum
Skills/Requirements for the Site Reliability Engineer:
- Extensive SRE experience within previous roles
- Strong Terraform skills
- Proven Kubernetes and AKS experience
- Experience creating and modifying Terraform deployments in live environments
- Experience with monitoring solutions, ideally Datadog; Azure Application Insights, Log Analytics or Grafana would also be considered
- Scripting skills for automation in PowerShell, Python or Bash
- Experience with web-based applications
Desirable Skills:
- Knowledge or commercial experience of Microservices Architecture
- Kanban
- Prior experience of working with Puppet and Chef would be advantageous
Start date is ASAP for the Site Reliability Engineer
The Site Reliability Engineer will be responsible for:
- Designing and enforcing service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs) to ensure reliability targets are measurable and aligned with business expectations
- Implementing incident response frameworks, including runbooks, postmortems, and blameless RCA processes to drive continuous improvement
- Integrating observability tooling (e.g. Prometheus, Grafana, Datadog, OpenTelemetry) to enable proactive detection and resolution of system anomalies
- Managing infrastructure as code (IaC) using tools like Terraform, Pulumi, or CloudFormation to ensure repeatable, auditable deployments
- Optimizing cost and resource utilization across cloud environments through rightsizing, autoscaling, and lifecycle policies
- Driving chaos engineering initiatives to test system resilience under failure conditions and validate recovery strategies
- Championing security best practices within infrastructure, e.g. secrets management, IAM policies, and vulnerability scanning
- Collaborating with DevOps and platform teams to build paved-road deployment patterns and internal developer portals
- Leading capacity planning and load testing efforts to anticipate scaling needs and prevent bottlenecks
- Contributing to architectural decisions that impact reliability, latency, and fault domains across distributed systems
Please send an up-to-date copy of your CV to be considered for the Site Reliability Engineer role.