Site Reliability Engineer

Technical Skills:

  • Strong experience as a Senior Site Reliability Engineer, Reliability Engineer, or Platform Engineer operating at L7 level.
  • Deep expertise in application monitoring, observability, alerting, incident management, and production reliability.
  • Hands-on experience assessing, selecting, and implementing monitoring and observability tools, frameworks, and integration approaches.
  • Strong understanding of SRE principles including SLIs, SLOs, error budgets, and resilience engineering.
  • Experience designing and operating highly available, fault-tolerant, multi-region systems.
  • Advanced skills in capacity planning, load modeling, and traffic forecasting.
  • Deep expertise in metrics, logs, traces, and event-based telemetry.

Process Skills:

  • Assess current monitoring, alerting, and incident management mechanisms to identify gaps and improvement opportunities.
  • Define and implement an end-to-end application monitoring and observability model aligned across the SDLC.
  • Identify risks related to reliability, performance, availability, and operational readiness and recommend mitigation strategies.
  • Establish SRE best practices including proactive alerting, error budgets, operational runbooks, and reliability metrics.
  • Articulate expected operational benefits such as improved system stability, faster incident resolution, reduced operational risk, and improved customer experience.

Job Details

Company
Ubique Systems
Location
Glasgow, Scotland, United Kingdom
Posted