Principal Observability & Cloud Platform Engineer
Most observability engineers run someone else's stack. This role is for the person who builds it.
Our client is re-architecting observability and cloud infrastructure at a scale very few engineers ever touch: a ~3,000-node Kubernetes estate, 50 TB of logs a day (around 600k log lines per second) and up to 80 million active time series, running multi-region and multi-cloud across AWS and GCP.
You'll own the architecture: metrics, logs, traces, telemetry pipelines, service mesh and developer experience for thousands of services and millions of devices. You'll overhaul core open-source components, storage layers and query paths for performance, cost and reliability, and push improvements back upstream to CNCF projects. This is hands-on architecture, not stack-sitting.
What you'll need:
- Strong, hands-on Go in production, plus Python or Shell.
- Real scale: PB-level ingestion and hundreds of millions of active series, and you built or scaled it, not just watched it run.
- Depth across the open-source observability stack: Prometheus, Grafana, and large-scale metrics (Thanos, Mimir, Cortex or VictoriaMetrics); logs (Loki / ELK / OpenSearch); traces (Tempo).
- Kubernetes at multi-cluster scale, service mesh (Istio / Envoy), Terraform, and AWS and/or GCP.
- A track record of evolving storage and query architectures (TSDB, Parquet, distributed processing) for cost, scale and latency.
Nice to have:
- OpenTelemetry / OpenMetrics standards work, CNCF open-source contributions, security-in-platform experience, and experience using AI tooling to cut toil.
If you've built observability at scale and want to do it again with proper ownership, send your CV and a line on the largest system you've personally architected.