Telemetry SRE Engineer
Technical Skills
Must‐Have
Observability & Reliability Engineering
· Strong hands‐on experience across core observability pillars including metrics, traces, service health and distributed systems visibility
· Practical experience implementing OpenTelemetry across application, platform and infrastructure layers
· Ability to design, deploy and operate end‐to‐end observability pipelines (collector‐to‐backend, agent management, data flows, routing and filtering)
· Strong understanding of SLI/SLO frameworks, error budgets and reliability‐focused operating models
· Experience defining alerting strategy, tuning thresholds and reducing operational noise through effective signal engineering
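As an illustration of the SLI/SLO and error‐budget reasoning listed above, the arithmetic might look like the minimal sketch below. The 99.9% target, the 30‐day window and the function names are assumptions chosen for the example, not figures or tooling specific to this role.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (assumed example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO, 1,000,000 requests, 250 failures.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A candidate would typically apply the same reasoning inside an observability platform or alerting rule rather than in standalone code; the sketch only shows the underlying calculation.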
Observability Platforms & Tooling
· Hands‐on expertise in one or more enterprise‐grade observability platforms (Dynatrace, Splunk Observability, Datadog or equivalent)
· Proficiency with the Prometheus ecosystem, including Alertmanager
· Experience designing clear, insightful dashboards and visualisations using Grafana
· Strong troubleshooting capability using metrics, traces and dependency insights to diagnose performance and availability issues
Cloud & Platform Monitoring
· Strong technical experience with at least one major public cloud (AWS, Azure or GCP)
· Understanding of monitoring fundamentals across cloud‐native services including compute, storage, networking, load balancers and managed services
· Solid understanding of cloud networking constructs (VPC/VNet, subnets, routing, NAT, firewalls and security groups)
Containers & Kubernetes
· Working knowledge of Kubernetes objects (pods, services, deployments) and their operational lifecycle
· Experience monitoring containerised and app‐modernisation workloads
· Basic experience with Helm or Kustomize for packaging, configuration and deployment
· Ability to troubleshoot application behaviour and platform‐level issues within container environments
Programming & Automation
· Proficiency in one or more languages (Python, Go, Java) to support automation and tooling
· Experience writing automation scripts and utilities supporting observability and SRE practices
· Awareness of how to integrate observability checks into CI/CD pipelines
· Comfort with shell scripting for diagnostics and operational tasks
Data & Analytics
· Strong understanding of time‐series data and telemetry characteristics
· Hands‐on experience with query languages and tools such as PromQL, SignalFlow, Metrics Explorer or equivalents
· Ability to analyse latency percentiles (p95/p99), error rates and throughput metrics
· Working knowledge of SQL for querying telemetry backends or data stores
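As an illustration of the latency, error‐rate and throughput analysis listed above, the calculations might be sketched as follows. This uses the nearest‐rank percentile method on raw samples, which is an assumption made for the example; query languages such as PromQL usually estimate percentiles from histogram buckets instead. The function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of raw samples, for 0 < p <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def throughput(total: int, window_seconds: float) -> float:
    """Average requests per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return total / window_seconds


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds: 1, 2, ..., 100.
    latencies_ms = list(range(1, 101))
    print(percentile(latencies_ms, 95), percentile(latencies_ms, 99))
    print(error_rate(5, 1000), throughput(1200, 60))
```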