Observability Engineering Lead
Observability Engineering Lead, Prometheus, Grafana, 2 days onsite.
The role:
We're looking for a highly skilled Observability Engineering Lead to drive the uplift, resilience, and effectiveness of our monitoring ecosystem. You'll play a pivotal role in shaping how we detect, diagnose, and prevent issues across our critical applications, partnering with engineering teams to deliver world-class insights through metrics, dashboards, alerts, and automation.
This is a hands-on technical leadership position where you'll influence standards, modernise tooling, and enhance our visibility across complex distributed systems.
- Collaborate with Application Stewards and SREs to validate critical assets in scope for monitoring verification and uplift.
- Work with EMAS to analyse Prometheus scrape coverage, exporter deployment, and Grafana dashboard availability for critical services.
- Identify and implement improvements across monitoring configurations, alert quality, data models, dashboards, KPIs, SLIs, and SLOs.
- Review roles and responsibilities across observability functions and recommend enhancements aligned to Operational Resilience standards.
- Contribute to delivering automated, end-to-end business flow visibility, surfaced in Grafana through service maps, dependency visualisation, or topology integrations.
- Ensure alerting configurations are reliable, actionable, and low-noise, following Alertmanager best practices.
Deep expertise in designing, implementing, and configuring modern observability stacks, specifically Prometheus, Grafana, and associated tooling.
Prometheus:
- Strong instrumentation strategy (exporters, service discovery, custom metrics).
- Advanced PromQL skills for complex querying and performance analysis.
- Experience building recording/alerting rules and optimising metric ingestion.
- Knowledge of HA architectures, federation, sharding, and long-term storage (Thanos, Cortex, Mimir).
Grafana:
- Dashboard and panel design focused on performance and operator clarity.
- Best-practice alert configuration and routing.
- Experience with synthetic monitoring (Grafana Synthetic Monitoring, Blackbox exporter).
- Log ingestion/analysis (Loki).
- Familiarity with Real User Monitoring tooling (e.g., Grafana Faro).
- Strong API and automation skills for dashboard provisioning, alert management, and data ingestion.
- Experience integrating the Grafana/Prometheus ecosystem with logging, tracing, and event platforms (Loki, Tempo, OpenTelemetry).
McGregor Boyall is an equal opportunity employer and does not discriminate on any grounds.