Site Reliability Engineer
Job Title: Lead Site Reliability Engineer (SRE) – Observability
Location: Remote options available
About the Role
We are looking for a Lead SRE to design, scale, and operate the large-scale observability systems that keep our global services online and performant. You will join an autonomous team of software engineers focused on solving complex data infrastructure challenges.
Key Responsibilities
- Scale Prometheus metrics infrastructure to handle 100+ million active series.
- Operate large Elasticsearch clusters holding 2,000+ TB of data.
- Grow high-throughput Kafka data pipelines processing hundreds of thousands of events per second.
- Build custom alerting workflows and self-service APIs for internal engineering teams.
- Provision cloud and private infrastructure using Terraform.
Requirements
- 5+ years operating mid-size to large distributed systems on Linux VMs or bare-metal machines.
- 2+ years developing in Go, Python, Ruby, Scala, or Bash.
- Hands-on experience with Prometheus/Thanos/Cortex, Kafka, the ELK stack, Ansible, or Consul.
- Comfortable diving into unfamiliar codebases and participating in an on-call rotation.
Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, Elasticsearch, ELK, Prometheus, Kafka, Terraform, Linux, Bare Metal