Site Reliability Engineer

Senior / Staff Site Reliability Engineer | £136k–£180k + equity | Remote Europe or London

We're partnering with a fast-growing developer infrastructure startup on a senior SRE hire at a pivotal moment in their growth.

The platform runs AI agents and background workflows in production at massive scale, handling hundreds of millions of executions per month on infrastructure they run themselves. The team is ~13 people. No engineering managers. Engineers own large parts of the system and work directly with the founders.

The core challenge right now is scale. Execution volume is growing faster than the team can build, which means the next hires are walking into genuine distributed systems problems — not a greenfield rebuild or a dashboard feature.

What you'll be working on

Owning observability across the platform: OpenTelemetry, metrics, logs, traces, and making them genuinely useful at 3am
Designing and operating distributed systems primitives under real production load — queues, schedulers, checkpoints, backpressure
Architecting and tuning auto-scaling infrastructure that runs untrusted customer code at high throughput
Hardening multi-tenant sandbox isolation, secrets handling, network policy, and supply chain security
Owning Terraform and IaC as a first principle across a cloud-native footprint
Running on-call practice: SLOs, runbooks, blameless postmortems, paging hygiene

What they're looking for

Strong observability background: production experience with OpenTelemetry, Prometheus or equivalent
Distributed systems experience: you've designed or operated systems with non-trivial failure modes
Strong in TypeScript and/or Go: the codebase is TypeScript-heavy, with Go emerging as a second language
Self-managed Kubernetes in production, not just managed control planes
Performance and scaling instincts: you've chased real bottlenecks across app, database, and infra layers
Terraform as a first principle, run at meaningful scale
Security mindset — multi-tenant isolation, least privilege, threat modelling
Postgres and Redis under load; AWS strongly preferred

The process

Screening call, hiring manager conversation, technical interview with roughly a 10% pass rate, then a final with the wider team. The bar is high, but if you find that motivating rather than off-putting, that's probably a good sign.
