Site Reliability Engineer
Location - Glasgow, Scotland (Hybrid - 2-3 days weekly onsite)
3-month employment contract - can be extended
Government Sector
We are looking for an L7 Senior Site Reliability Engineer (6-9 years overall experience) who can assess existing monitoring, reliability, and operational practices and define a comprehensive observability model across the application landscape. The ideal candidate will drive improvements that enhance availability, performance, resilience, and operational readiness across critical systems. This role requires a strong advisory mindset, deep production engineering expertise, and the ability to influence development, platform, and operations teams to deliver measurable reliability and operational outcomes.
Technical Skills:
- Strong experience as a Senior Site Reliability Engineer, Reliability Engineer, or Platform Engineer operating at the L7 level.
- Deep expertise in application monitoring, observability, alerting, incident management, and production reliability.
- Hands-on experience assessing, selecting, and implementing monitoring and observability tools, frameworks, and integration approaches.
- Strong understanding of SRE principles including SLIs, SLOs, error budgets, and resilience engineering.
- Experience designing and operating highly available, fault-tolerant, multi-region systems.
- Advanced skills in capacity planning, load modeling, and traffic forecasting.
- Deep expertise in metrics, logs, traces, and event-based telemetry.
Process Skills:
- Assess current monitoring, alerting, and incident management mechanisms to identify gaps and improvement opportunities.
- Define and implement an end-to-end application monitoring and observability model aligned across the SDLC.
- Identify risks related to reliability, performance, availability, and operational readiness and recommend mitigation strategies.
- Establish SRE best practices including proactive alerting, error budgets, operational runbooks, and reliability metrics.
- Articulate expected operational benefits such as improved system stability, faster incident resolution, reduced operational risk, and improved customer experience.