Bristol, Avon, England, United Kingdom Hybrid / WFH Options
Robert Walters
to manage Kubernetes clusters in production environments Competence in scripting and development using languages such as Python, Java, Go, Bash, or PowerShell Strong understanding of service-level objectives (SLOs), indicators (SLIs), and monitoring practices Hands-on experience with infrastructure as code (e.g., Terraform) and CI/CD tools (e.g., Jenkins, Azure …
critical detail to your mentees Production Kubernetes experience and debugging all services that run within the K8s ecosystem, including Istio service mesh SRE mentality (SLI, SLO & SLA) using Observability, Logging, Monitoring & Alerting (Dynatrace) Ideally coming from a software engineering or exceptional scripting skill background and have moved into SRE/DevOps while gaining a wider understanding …
Main responsibilities We are looking for people with a passion to learn, and who bring a continuous improvement mentality to our team! SREs maintain Service Level Objectives for the systems they own. Constantly measuring and improving availability, latency, and overall system health is at the core of our team's purpose.
observability platforms to support real-time decision-making. Support incident prevention, root cause analysis, and continuous improvement through data-driven insights. Define and enforce service level objectives (SLOs) and key performance indicators (KPIs) for SACM health and value. Governance, Compliance & Asset Management: Ensure accurate, complete, and up-to-date asset and …
automated response. Apply SRE principles to improve reliability, performance, and maintainability of security services. Lead platform health, patching automation, and vulnerability remediation workflows. Define service level objectives (SLOs) and key performance indicators (KPIs) for all security services. Compliance, Governance & Risk Management: Ensure alignment with global compliance requirements such as ISO …
Cloud Platform (GCP). This role will involve working closely with development, platform engineering, and security teams to implement DevOps best practices, define and enforce service-level objectives, and build a scalable monitoring and alerting platform. Key Responsibilities Automate deployment, monitoring, and incident response processes using GCP-native tools and technologies. … in Onyx to operate with a DevOps ethos. Collaborate with development teams to optimise application performance, reliability, and observability on GCP. Implement and enforce Service Level Objectives (SLOs) and Error Budgets to ensure a balance between reliability and feature development. Develop and maintain a comprehensive monitoring and alerting platform to detect …
culture of innovation, collaboration, and continuous improvement. Ensure network automation complies with relevant regulatory requirements, security requirements and industry standards. Establish Key Performance Indicators and Service-Level Objectives to measure operational effectiveness. Build relationships with CTO, Application Production Support & Engineering, CIO organizations and other stakeholders. Communicate effectively with technical and non …
development lifecycle to ensure reliability, scalability, and operational stability are maintained across all supported platforms. Define, create, and monitor application analytics to support improved service level objectives and drive data-informed decision making. Ensure strict adherence to change management release processes while accelerating automation initiatives for these workflows. Lead resiliency management … e.g., RDS/Aurora) and non-relational databases equips you to support diverse data storage requirements. Previous exposure to site reliability engineering concepts, including service level objectives (SLOs), service level agreements (SLAs), service level indicators …
Quality, Stability & Standards: Establish quality standards to meet performance, reliability, and maintainability of the systems. With a strong production-first mindset, drive observability, maintain Service Level Objectives (SLOs), and ensure efficient incident resolution. Oversee the maintenance of existing systems, ensuring continuous improvements and prompt resolution of issues. Agile Delivery & Collaboration: Working …
a code concept is desirable. Experience with build automation, test-driven development, continuous integration and delivery Experience with relational and non-relational databases Previous SRE experience, including knowledge of SLO/SLA/SLI and error budgets, is advantageous Experience working or familiarity with one public cloud (AWS, Google or Azure) If this is of interest and you have the …
Actively participate in the development life cycle, ensuring reliability and scalability and operational stability Define, create and track application analytics in support of better service level objectives Ensure adherence to change management release processes, accelerate automation of these processes Run resiliency management planning, scheduling and execution of disaster recovery tests & seek … a code concept is desirable. Experience with build automation, test-driven development, continuous integration and delivery Experience with relational and non-relational databases Previous SRE experience, including knowledge of SLO/SLA/SLI and error budgets, is advantageous Experience working or familiarity with one public cloud (AWS, Google or Azure) Preferred skills – what’ll get you noticed! Experience in …
teamwork. Build rapport with each member of the team and support them as they level up their skills. Define and maintain company-wide practices around SLO definition and management, incident management, postmortem analysis, and disaster testing and recovery. Generate informed insights regarding service quality and interface directly with executive leadership to communicate …
Newcastle Upon Tyne, Tyne and Wear, North East, United Kingdom Hybrid / WFH Options
Develop
platform's core value streams. Key Responsibilities Technical Leadership & Strategy Champion engineering best practices, system reliability, and architectural integrity Define and track progress toward Service Level Objectives (SLOs) Collaborate with product stakeholders to shape robust and scalable solutions Take responsibility for non-functional areas such as performance, maintainability, and security Provide …
their full potential through the Microsoft Cloud. We are a fast-growing team, but we make sure we remain agile. Customer first, nurturing trust, high responsiveness, automation, SLO/SLI/SLA, blameless post-mortem, observability, monitoring, alerting, and toil reduction form the foundations of our code and we work with teams across Microsoft and external customers to … Baseline Personnel Security Standards; UK Security Clearance Responsibilities Collaborating closely with the existing SRE teams on building and enhancing tooling and automation solutions for faster resolution of issues impacting SLOs and averting incidents altogether when possible. Collaborating with the customers to understand their pain points around Supportability and SLO attainment and formulate strategies for addressing recurring issues in a …
of Anthropic's mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way. Responsibilities: Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity Design and implement monitoring systems including availability, latency and … distributed systems observability and monitoring at scale Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence) Have experience with chaos engineering and …