Site Reliability Engineer

Job Title: Lead Site Reliability Engineer (SRE) – Observability

Location: Remote options available

About the Role

We are looking for a Lead SRE to design, scale, and operate the large-scale observability systems that keep our global services online and performant. You will join an autonomous team of software engineers focused on solving complex data infrastructure challenges.

Key Responsibilities

  • Scale Prometheus metrics infrastructure to handle 100+ million active series.
  • Operate large Elasticsearch clusters holding 2,000+ TB of data.
  • Grow high-throughput Kafka data pipelines processing hundreds of thousands of events per second.
  • Build custom alerting workflows and self-service APIs for internal engineering teams.
  • Provision cloud and private infrastructure using Terraform.

Requirements

  • 5+ years operating mid- to large-scale distributed systems on Linux VMs or bare-metal machines.
  • 2+ years developing in Go, Python, Ruby, Scala, or Bash.
  • Hands-on experience with Prometheus/Thanos/Cortex, Kafka, the ELK stack, Ansible, or Consul.
  • Comfortable diving into unfamiliar codebases and participating in an on-call rotation.

Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, Elasticsearch, ELK, Prometheus, Kafka, Terraform, Linux, Bare Metal

Job Details

Company
Randstad Digital UK
Location
United Kingdom
Posted