London, England, United Kingdom Hybrid / WFH Options
Global Screening Services
culture where your ideas are valued. What You’ll Do Key responsibilities in this role will include (but not be limited to): Leveraging core SRE values - measuring (SLI/SLO/SLA), testing, and eliminating toil via automation with appropriate Disaster Recovery planning Refining KPIs to enable data-driven decision making for availability and reliability Proactively analysing monitoring data to …
London, England, United Kingdom Hybrid / WFH Options
9fin
a service catalog to the engineering team, and also author many other useful DevOps plugins. Contributing to observability best practices and providing key SLI/SLO metric reporting, so that the engineering team can balance velocity and reliability. Develop inner/open source projects to help provide a world-class developer experience to the engineering team. …
London, England, United Kingdom Hybrid / WFH Options
BlackRock, Inc
the usage (and, desirably, the deployment) of e.g. ELK, CloudWatch, Fluentd, to enable forensic log analysis and system tuning as well as data-driven performance analysis (i.e. SLI/SLO) and capacity planning. You are a competent Linux & Windows systems administrator (for multiple distributions), including storage management (e.g. LVM, RAID) and security best-practices e.g. SSH, SSL/TLS, HMAC …
London, England, United Kingdom Hybrid / WFH Options
Cencora, Inc
of containerization technologies, including Docker and Kubernetes. Excellent understanding of networking principles (IP addressing, virtual networks, network security and networking models). Understanding of observability and site-reliability principles (SLOs, SLIs) and working with engineering teams to improve the applications and platform. Good understanding of SQL and working with relational databases. Experience working in a production environment to …
London, England, United Kingdom Hybrid / WFH Options
Prima
operating their applications in the cloud. Your Key Responsibilities Will Include System Reliability: Develop and sustain reliable, scalable, and efficient systems, establishing and monitoring Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to enhance system reliability Infrastructure Automation: Lead the automation of …
London, England, United Kingdom Hybrid / WFH Options
Amed Commercial Refrigeration Equipment Co., Ltd
level consistency in how services are built, deployed, and monitored Collaborate with security and DevOps to embed automated compliance and runtime protection 4. Observability and SLO Strategy Define telemetry standards across logs, metrics, and distributed tracing (Example: driving correlation between API errors, latency spikes, and infrastructure metrics) Work with engineering and SRE teams to define …
London, England, United Kingdom Hybrid / WFH Options
IG Group
continuously work towards making systems/processes better Ensure system observability, security, data integrity and compliance with regulatory requirements. Establish a metrics-based organization, develop key operational metrics (preferably SLO) and push for continuous improvement. To take highly complex and manual processes and work to simplify and automate them To oversee the SRE team to ensure they are involved in …
London, England, United Kingdom Hybrid / WFH Options
NatWest Group
pipelines and automation to help manage our product and services. You’ll work closely with our feature team and other colleagues to meet defined service level objectives and continually improve systems and environments. You’ll define error budgets that support finding the right balance between risk and reliability. You’ll also …
issues. Experience managing and contributing to mid-large projects related to system reliability improvements. Knowledge of Site Reliability Engineering (SRE) Practices: including error budgeting, service level objectives (SLOs), and service level indicators (SLIs). Demonstrated ability to collaborate with cross-functional teams, including …
London, England, United Kingdom Hybrid / WFH Options
Attio Ltd
will have the following attributes: Proven experience with Google Cloud and Kubernetes Contribute across the stack, including TypeScript, Node.js, and Google Cloud Platform Champion operational excellence and resilience (99.99% SLO) Manage CI/CD pipelines to improve deployment speed and reliability Support backup, disaster recovery, and security Experience with Google Spanner is a nice to have Hiring Process An introductory …
London, England, United Kingdom Hybrid / WFH Options
NatWest Group
office What you'll do As our Site Reliability Engineer, you’ll work closely with our feature team and other colleagues to meet defined service level objectives and continually improve systems and environments. You’ll define error budgets that support finding the right balance between risk and reliability. You’ll also …
operational insights. Collaborate with SRE teams on building and enhancing tooling and automation solutions Work with customers to understand pain points around Supportability and SLO attainment Be the single point of contact for enterprise customer service escalations Implement changes to service telemetry for automation consumption Enhance customer …
London, England, United Kingdom Hybrid / WFH Options
Anthropic
of Anthropic’s mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way. Responsibilities Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity Design and implement monitoring systems including availability, latency and … distributed systems observability and monitoring at scale Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence) Have experience with chaos engineering and …
London, England, United Kingdom Hybrid / WFH Options
Steamship Insurance Management Services Ltd
as well as reports to Senior Management on all aspects of service performance, ensuring transparency and timely communication on any issues. Manage SLAs/SLOs and develop service excellence by regularly attending internal and external service review meetings. Document meeting minutes and oversee or assign follow …