Site Reliability Engineer (SRE) - Cloud & Automation

London, Docklands (hybrid)

£80,000 - £90,000 per annum + annual discretionary bonus

On behalf of a leading financial services organisation, I'm looking for a highly capable Site Reliability Engineer (SRE) to drive the adoption of SRE methodologies across their cloud-hosted environment and act as the central point of expertise for automation within the Platform Operations function. This role is ideal for someone who thrives in complex, regulated environments and is passionate about building reliable, scalable, and automated cloud platforms.

The organisation offers the role on a hybrid basis, with 2 days per week in their Canary Wharf office, so you must be within a reasonable commute of London.

Responsibilities:

Lead the implementation of SRE practices across the organisation, working closely with infrastructure teams to optimise deployment processes and embed automation and operational excellence.
Enhance observability and reliability, defining and implementing SLAs, SLOs and SLIs to improve alerting, monitoring, and capacity planning.
Identify and eliminate toil, developing frameworks to analyse recurring issues and automate remediation wherever possible.
Develop secure, production-ready code, while reviewing and debugging code produced by others.
Build and mature GitOps capabilities using tools such as Terraform and Ansible Automation Platform to support multi-environment, multi-region cloud platforms.
Provide on-call support for Cloud and Automation services, ensuring production stability remains the top priority.
Drive post-incident improvements, ensuring risks and stability issues are understood and addressed through SRE best practices.

Experience/Skills required:

Strong operational support experience within an infrastructure services team, including on-call responsibilities, incident ownership, and root-cause analysis.
2+ years applying SRE methodologies, with a solid understanding of service-level metrics and reliability engineering principles.
Proficiency in at least one scripting language - ideally Python or Ansible (PowerShell also beneficial).
Experience supporting and building multi-environment, multi-region cloud platforms (AWS or GCP), using IaC and GitOps workflows.
Hands-on experience with observability/APM tooling such as Grafana, Datadog or Dynatrace.
Background working in regulated financial services or banking environments.
Excellent troubleshooting, analytical and communication skills, able to work effectively with both technical and non-technical stakeholders.

Nice to have:

Software development background.
Familiarity with the ITIL framework.
Experience with Ansible Automation Platform.
Strong service-oriented mindset with the ability to work proactively and keep stakeholders informed.
