Engineering Lead
NEW CONTRACT ROLE - ENGINEERING LEAD (OBSERVABILITY / SRE) | ASAP START | UK (Remote / Hybrid) | 6-Month Contract | Possible Extension | London, Manchester, Birmingham or Edinburgh
THE OPPORTUNITY
We're looking for an experienced Engineering Lead to support a critical enterprise observability and operational resilience programme.
This role focuses on leading the uplift of monitoring, alerting, and end-to-end service visibility across business-critical applications. It's ideal for a senior, hands-on engineering lead with deep Prometheus and Grafana expertise who can guide best practice across SRE, platform, and application teams.
THE ROLE
- Lead collaboration with Application Stewards and Site Reliability Engineers (SREs) to confirm which critical services and assets are in scope for monitoring verification and uplift
- Work with EMAS to analyse Prometheus scrape coverage, exporter deployment, and Grafana dashboard availability for critical applications
- Drive improvements across monitoring configuration, alert quality, metrics, dashboards, KPIs, SLIs, and SLOs
- Lead the optimisation of alerting so that alerts are reliable, actionable, and low-noise, applying Alertmanager best practices
- Oversee delivery of automated end-to-end business flow visibility through Grafana service maps, dependency visualisation, and topology integrations
- Review observability roles and responsibilities and recommend improvements aligned to Operational Resilience standards
- Champion automation and API-driven approaches for dashboard provisioning, alert management, and data ingestion
- Ensure clear documentation of standards, configurations, and improvements delivered
TECHNICAL SKILLS / REQUIREMENTS
Strong hands-on and leadership experience with:
- Prometheus: instrumentation strategy, exporters, service discovery, custom metrics, PromQL, recording rules, alerting rules, HA architectures (Thanos, Cortex, Mimir)
- Grafana: dashboard and panel design, alerting and routing, synthetic monitoring, Loki, real user monitoring (e.g. Grafana Faro)
- Observability Ecosystem: integration of metrics, logs, and traces (Loki, Tempo, OpenTelemetry), APIs and automation
PROFILE
- Proven experience as a Senior Engineer, Technical Lead, or Engineering Lead within SRE, Observability, DevOps, or Platform Engineering
- Comfortable leading technical direction while remaining hands-on
- Strong stakeholder engagement and communication skills
- Experience operating in complex, enterprise-scale or regulated environments
- Typically 6+ years' experience in reliability engineering, monitoring, or observability-focused roles
KEYWORDS
Engineering Lead, Observability Engineering, Site Reliability Engineering, SRE, Prometheus, Grafana, Alertmanager, PromQL, Monitoring, Operational Resilience, DevOps, Platform Engineering, Metrics, Logging, Tracing, OpenTelemetry, Loki, Tempo, Thanos, Cortex, Mimir