Lead Site Reliability Engineer

Own Reliability. Shape the Platform. Empower Millions.

At Holland & Barrett, we're transforming into a truly product- and platform-led technology organisation — and we're looking for a Lead Site Reliability Engineer who's excited by scale, complexity, and impact.

Our mission? Build and evolve the resilient, high-performance systems that power health and wellness for millions of customers. If you're obsessed with reliability, driven by automation, and thrive in high-ownership engineering cultures, this is your opportunity to lead from the front.

What You'll Lead & Deliver

Reliability & Performance at Scale

Architect and improve cloud-native systems with reliability as a first-class principle.
Shape SLIs/SLOs, error budgets, capacity planning, and performance strategies.
Continuously improve availability, efficiency, and resilience across our platforms.

Technical Leadership That Raises the Bar

Mentor SREs, platform engineers, and developers across the organisation.
Champion automation, observability, DevSecOps, and modern operational practices.
Influence engineering culture and architectural direction.

Operational Excellence

Own and lead high-severity incident response with calm, clarity, and technical depth.
Run world-class post-incident reviews and drive meaningful, measurable improvements.
Strengthen monitoring, alerting, on-call practices, and reliability processes.
Support resilience validation through load testing, stress testing, and chaos engineering.

Automation, Tooling & Engineering Efficiency

Build tools and automation that remove toil and accelerate teams.
Develop CI/CD pipelines and Infrastructure-as-Code environments.
Drive consistency, repeatability, and self-service across engineering.

Cross-Team Collaboration

Partner with Security, Platform, and Engineering teams to align reliability with security and resilience goals.
Lead teams toward better design, operational readiness, and measurable service health.
Contribute to documentation, runbooks, and operational processes that scale.

Key requirements:

5+ years of experience in SRE, Platform, Cloud Infrastructure, or operational engineering roles.
Hands-on experience architecting and improving large-scale, distributed systems.
Strong coding proficiency in Python, Go, Bash, or similar automation-focused languages.
Expertise with observability stacks: Datadog, Prometheus, Grafana, OpenTelemetry.
Deep AWS experience across EC2, EKS, Lambda, VPC, DynamoDB, S3, CloudFront, RDS, IAM, KMS, and more.
Proficiency with Terraform, CloudFormation, or AWS CDK.
Incident response leadership and root-cause analysis expertise.
Excellent documentation and communication skills.
Strong analytical and troubleshooting abilities.

Bonus

Experience mentoring or leading engineers within SRE or platform teams.
Experience with load testing, stress testing, and chaos engineering.
A passion for uplifting engineering culture through tooling, automation, and reliability-first thinking.

Why Build the Future with Holland & Barrett?

Technology is at the heart of our mission to make health & wellness accessible to everyone. As a Lead SRE, you won't just keep systems running; you'll design the reliability, resilience, and operational maturity that accelerate our entire business.

We offer:

A modern engineering culture built on autonomy, experimentation, and learning.
The chance to create real impact across critical customer and internal platforms.
A collaborative team that values innovation, continuous improvement, and technical excellence.

If you're ready to lead reliability for platforms with massive real-world impact, we'd love to meet you.

Apply now and help shape the future of H&B Technology.
