Site Reliability Engineer

Senior / Staff Site Reliability Engineer | £136k–£180k + equity | Remote Europe or London

We're partnering with a fast-growing developer infrastructure startup on a senior SRE hire at a pivotal moment in their growth.

The platform runs AI agents and background workflows in production at massive scale, handling hundreds of millions of executions per month on infrastructure they run themselves. The team is ~13 people. No engineering managers. Engineers own large parts of the system and work directly with the founders.

The core challenge right now is scale. Execution volume is growing faster than the team can build, which means the next hires are walking into genuine distributed systems problems — not a greenfield rebuild or a dashboard feature.

What you'll be working on

  • Owning observability across the platform: OpenTelemetry, metrics, logs, traces, and making them genuinely useful at 3am
  • Designing and operating distributed systems primitives under real production load — queues, schedulers, checkpoints, backpressure
  • Architecting and tuning auto-scaling infrastructure that runs untrusted customer code at high throughput
  • Hardening multi-tenant sandbox isolation, secrets handling, network policy, and supply chain security
  • Owning Terraform and IaC as a first principle across a cloud-native footprint
  • Running on-call practice: SLOs, runbooks, blameless postmortems, paging hygiene

What they're looking for

  • Strong observability background: production experience with OpenTelemetry, Prometheus or equivalent
  • Distributed systems experience: you've designed or operated systems with non-trivial failure modes
  • Strong in TypeScript and/or Go: the codebase is TypeScript-heavy, with Go emerging as a second language
  • Self-managed Kubernetes in production, not just managed control planes
  • Performance and scaling instincts: you've chased real bottlenecks across app, database, and infra layers
  • Terraform as a first principle, run at meaningful scale
  • Security mindset — multi-tenant isolation, least privilege, threat modelling
  • Postgres and Redis under load; AWS strongly preferred

The process

Screening call, hiring manager conversation, a technical round with roughly a 10% pass rate, then a final with the wider team. The bar is high, but if you find that motivating rather than off-putting, that's probably a good sign.

Job Details

Company: Wave Talent
Location: Greater London, England, United Kingdom