Observability Engineering Lead

Observability Engineering Lead, Prometheus, Grafana, 2 days onsite.

The role:

We're looking for a highly skilled Observability Engineering Lead to drive the uplift, resilience, and effectiveness of our monitoring ecosystem. You'll play a pivotal role in shaping how we detect, diagnose, and prevent issues across our critical applications, partnering with engineering teams to deliver world-class insights through metrics, dashboards, alerts, and automation.

This is a hands-on technical leadership position where you'll influence standards, modernise tooling, and enhance our visibility across complex distributed systems.

Collaborate with Application Stewards and SREs to validate critical assets in scope for monitoring verification and uplift.
Work with EMAS to analyse Prometheus scrape coverage, exporter deployment, and Grafana dashboard availability for critical services.
Identify and implement improvements across monitoring configurations, alert quality, data models, dashboards, KPIs, SLIs, and SLOs.
Review roles and responsibilities across observability functions and recommend enhancements aligned to Operational Resilience standards.
Contribute to delivering automated, end-to-end business flow visibility, surfaced in Grafana through service maps, dependency visualisation, or topology integrations.
Ensure alerting configurations are reliable, actionable, and low-noise, following Alertmanager best practices.

Skills required:

Deep expertise in designing, implementing, and configuring modern observability stacks, specifically Prometheus, Grafana, and associated tooling.

Prometheus

Strong instrumentation strategy (exporters, service discovery, custom metrics).
Advanced PromQL skills for complex querying and performance analysis.
Experience building recording/alerting rules and optimising metric ingestion.
Knowledge of HA architectures, federation, sharding, and long-term storage (Thanos, Cortex, Mimir).

Grafana

Dashboard and panel design focused on performance and operator clarity.
Best-practice alert configuration and routing.
Experience with synthetic monitoring (Grafana Synthetic Monitoring, Blackbox exporter).
Log ingestion/analysis (Loki).
Familiarity with Real User Monitoring tooling (e.g., Grafana Faro).

Ecosystem & Integrations

Strong API and automation skills for dashboard provisioning, alert management, and data ingestion.
Experience integrating the Grafana/Prometheus ecosystem with logging, tracing, and event platforms (Loki, Tempo, OpenTelemetry).

McGregor Boyall is an equal opportunity employer and does not discriminate on any grounds.