in automating/scripting to remove toil. Ability to quickly understand, update and write code in languages such as Python, Java, Golang, Bash and PowerShell; working experience monitoring SLOs, SLIs and SLAs, and logging updates and alerting where appropriate; strong DevOps understanding and familiarity, including experience of Infrastructure as Code and CI/CD pipelines, such …
Manchester, England, United Kingdom Hybrid / WFH Options
bet365
and management of effective Service Level Indicators (SLI) and Service Level Objectives (SLO) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Excellent knowledge of programming languages including Python, Golang …
Stoke-on-Trent, England, United Kingdom Hybrid / WFH Options
bet365
and management of effective Service Level Indicators (SLI) and Service Level Objectives (SLO) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Excellent knowledge of programming languages including Python, Golang …
and team talks, helping them improve their C#/.NET Core skills. Support and enhance current systems and initiatives during office hours, ensuring that service level objectives are met. Maintain a strong focus on quality, reusability, clean architectures, security, and resilience across the full application lifecycle. Collaborate with the Lead Developer …
and team talks, helping them improve their C#/.NET Core skills. Support and enhance current systems and initiatives during office hours, ensuring that service level objectives are met. Maintain a strong focus on quality, reusability, clean architectures, security, and resilience across the full application lifecycle. Collaborate with the Lead Developer …
and management of effective Service Level Indicators (SLI) and Service Level Objectives (SLO) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Knowledge and experience of modern software development techniques …
Cardiff, South Glamorgan, United Kingdom Hybrid / WFH Options
RVU Co UK
s perspective by sharing your experience, knowledge & expertise in a continuous learning environment. As a member of the platform engineering team you will be accountable for the following: objective setting, feature ideation, development and measurement; architectural decisions and designs of the platform, domains and systems; defining, evolving, and applying team processes; building efficient CI/CD pipelines and … well-architected principles. Solid understanding of platform and reliability engineering approaches (SRE), including observability, performance optimisation, capturing analytics and security best practices. Experience implementing Service Level Objectives and using them to drive error budgets, risk management and alerting. Knowledge and experience with operating containers at scale within the Kubernetes ecosystem. Experience …
issues. Experience managing and contributing to mid-large projects related to system reliability improvements. Knowledge of Site Reliability Engineering (SRE) practices, including error budgeting, service level objectives (SLOs), and service level indicators (SLIs). Demonstrated ability to collaborate with cross-functional teams, including …
Northern Ireland, United Kingdom Hybrid / WFH Options
Jobgether
Terraform or Pulumi, and observability tools such as Datadog or CloudWatch. Experience in implementing AI-powered tools for workflow optimization and operational improvements. Proven success in setting up scalable, SLO-driven monitoring strategies in 24/7 environments. Ability to manage distributed teams, foster innovation, and drive results in a collaborative, inclusive setting. Strong communication and mentorship skills, with a …
Newcastle Upon Tyne, Tyne and Wear, North East, United Kingdom Hybrid / WFH Options
Develop
platform's core value streams. Key Responsibilities: Technical Leadership & Strategy. Champion engineering best practices, system reliability, and architectural integrity. Define and track progress toward Service Level Objectives (SLOs). Collaborate with product stakeholders to shape robust and scalable solutions. Take responsibility for non-functional areas such as performance, maintainability, and security. Provide …
of Anthropic's mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way. Responsibilities: Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity. Design and implement monitoring systems including availability, latency and … distributed systems observability and monitoring at scale. Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines. Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services. Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence). Have experience with chaos engineering and …