Principal Observability & Cloud Platform Engineer

Most observability engineers run someone else's stack. This role is for the person who builds it.

Our client is re-architecting observability and cloud infrastructure at a scale very few engineers ever touch: a ~3,000-node Kubernetes estate, 50TB of logs a day (around 600k logs/second) and up to 80 million active time series, running multi-region and multi-cloud across AWS and GCP.

You'll own the architecture: metrics, logs, traces, telemetry pipelines, service mesh and developer experience for thousands of services and millions of devices. You'll overhaul core open-source components, storage layers and query paths for performance, cost and reliability, and push improvements back upstream to CNCF projects. This is hands-on architecture, not stack-sitting.

What you'll need:

Strong, hands-on Go in production, plus Python or Shell.
Real scale: PB-level ingestion and hundreds of millions of active series, on systems you built or scaled yourself, not just watched run.
Depth across the open-source observability stack: Prometheus, Grafana, and large-scale metrics (Thanos, Mimir, Cortex or VictoriaMetrics); logs (Loki / ELK / OpenSearch); traces (Tempo).
Kubernetes at multi-cluster scale, service mesh (Istio / Envoy), Terraform, and AWS and/or GCP.
A track record of evolving storage and query architectures (TSDB, Parquet, distributed processing) for cost, scale and latency.

Nice to have:

OpenTelemetry / OpenMetrics standards work, CNCF open-source contributions, security-in-platform experience, and using AI tooling to cut toil.

If you've built observability at scale and want to do it again with proper ownership, send your CV and a line on the largest system you've personally architected.
