ensuring governance, security, compliance, and control. Experience Requirements: Proven experience in a senior SRE role or similar. Strong knowledge of cloud technologies and SLA/SLO/SLI management. Experience leading teams and implementing SCRUM processes. Excellent communication and leadership skills. Experience line managing, mentoring, and coaching. Responsibilities: Collaborate with the Principal …
valued. What You'll Do: Key responsibilities in this role will include (but not be limited to): Leveraging core SRE values - measuring (SLI/SLO/SLA), testing, and eliminating toil via automation with appropriate Disaster Recovery planning. Refining KPIs to enable data-driven decision making for availability and reliability …
deployment) of e.g. ELK, CloudWatch, Fluentd, to enable forensic log analysis and system tuning as well as data-driven performance analysis (i.e. SLI/SLO) and capacity planning. You are a competent Linux & Windows systems administrator (for multiple distributions), including storage management (e.g. LVM, RAID) and security best practices, e.g. …
and training engineers up to Staff standard. Operational Stability: Demonstrate a production-first attitude, continuously considering observability and maintaining Service Level Objectives, while delivering change at pace. Research & Innovation: Embrace emerging technologies and trends, and share insights with the organisation, while …
This will involve: Defining and implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and maintain system and application performance, ensuring services meet agreed reliability targets. Instrumenting applications to collect … principles, including the creation and management of Service Level Indicators (SLIs), Service Level Objectives (SLOs) and error budgets, ensuring reliability and performance. Experience in implementing observability, instrumenting applications to provide insights into system …
City of London, London, United Kingdom Hybrid / WFH Options
Sanderson Recruitment
As our Site Reliability Engineer, you'll work closely with our feature team and other colleagues to meet defined service level objectives and continually improve systems and environments. You'll define error budgets that support finding the right balance between risk …
systems and third-party solutions. Network Health Management: Define and implement prediction pipelines for long-term network health, availability, and service-level objectives. Operations Automation: Lead initiatives to automate and optimize network operations focusing on scalability and reliability. Collaborative Development: Work closely …
quality. The Service Delivery Manager will be responsible for ensuring our technical teams meet their service level objectives, driving operational excellence, and maintaining strong relationships with internal and external stakeholders. You will play a vital part in …
and EMEA time zones. Preferred (Bonus) Skills: Hands-on experience with tools like PagerDuty, OpsGenie, ServiceNow, CloudWatch, Chronosphere, or similar. Understanding of SLA/SLO implementation and performance tracking. Exposure to incident management frameworks, automated remediation, and runbook automation. Background in DevOps or SRE culture and tooling. Prior people leadership …
your ideas to technical and non-technical audiences. Additional Desired Skills: Experience with incident management platforms like PagerDuty, OpsGenie, or similar tools. Understanding of SLO/SLA management and implementations. Knowledge of industry standard incident management frameworks and best practices. Familiarity with automated remediation and runbook automation. Experience with DevOps …
Level Agreements (SLA) through Service Level Objectives (SLO) and Service Level Indicators (SLI). Liaise with client technical and business teams as needed to ensure …
members to resolve complex problems * Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers * Supports the adoption of site reliability engineering best practices within your … technical discipline (e.g., Cloud, artificial intelligence, Android, etc.) * Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others * Experience with continuous integration …