Bristol, Avon, England, United Kingdom Hybrid / WFH Options
Robert Walters
to manage Kubernetes clusters in production environments Competence in scripting and development using languages such as Python, Java, Go, Bash, or PowerShell Strong understanding of service-level objectives (SLOs), indicators (SLIs), and monitoring practices Hands-on experience with infrastructure as code (e.g., Terraform) and CI/CD tools (e.g., Jenkins, Azure More ❯
critical detail to your mentees Production Kubernetes experience and debugging all services that run within the K8s ecosystem, including Istio service mesh SRE mentality (SLI, SLO & SLA) using Observability, Logging, Monitoring & Alerting (Dynatrace) Ideally coming from a software engineering or exceptional scripting skill background and have moved into SRE/DevOps while gaining a wider understanding More ❯
the usage (and, desirably, the deployment) of e.g. ELK, CloudWatch, Fluentd, to enable forensic log analysis and system tuning as well as data-driven performance analysis (i.e. SLI/SLO) and capacity planning. You are a competent Linux & Windows systems administrator (for multiple distributions), including storage management (e.g. LVM, RAID) and security best-practices e.g. SSH, SSL/TLS, HMAC More ❯
Bristol, Gloucestershire, United Kingdom Hybrid / WFH Options
TwinStream Limited
consistent and correctly configured. The system is designed to be highly observable and available. The team will use monitoring tools to verify that all components are meeting SLA/SLO requirements. If any problems are identified, the team will take preventive actions to minimise customer impact and restore service as quickly as possible. This role is More ❯
Bristol, Gloucestershire, United Kingdom Hybrid / WFH Options
TwinStream
consistent and correctly configured. The system is designed to be highly observable and available. The team will use monitoring tools to verify that all components are meeting SLA/SLO requirements. If any problems are identified, the team will take preventive actions to minimise customer impact and restore service as quickly as possible. This role is More ❯
be able to talk through the key principles of managing a large infrastructure estate. Monitoring infrastructure and applications hosted using taking into consideration: Observability, Alerting, Uptime SLAs and SLOs Azure DevOps pipeline management. Strong collaboration with both engineering teams and colleagues in customer-facing teams. Excellent communicator both in written and verbal forms. Comfortable breaking down big tasks More ❯
Cardiff, South Glamorgan, United Kingdom Hybrid / WFH Options
RVU Co UK
s perspective by sharing your experience, knowledge & expertise in a continuous learning environment. As a member of the platform engineering team you will be accountable for the following: Objective setting, feature ideation, development and measurement Architectural decisions and designs of the platform, domains and systems Defining, evolving, and applying team processes Building efficient CI/CD pipelines and … well architected principles Solid understanding of platform and reliability engineering approaches (SRE), including observability, performance optimisation, capturing analytics and security best practices Experience implementing Service Level Objectives and using them to drive error budgets, risk management and alerting Knowledge and experience with operating containers at scale within the Kubernetes ecosystem Experience More ❯
RVUs perspective by sharing your experience, knowledge & expertise in a continuous learning environment. As a member of the platform engineering team you will be accountable for the following: Objective setting, feature ideation, development and measurement Architectural decisions and designs of the platform, domains and systems Defining, evolving, and applying team processes Building efficient CI/CD pipelines and … well architected principles Solid understanding of platform and reliability engineering approaches (SRE), including observability, performance optimisation, capturing analytics and security best practices Experience implementing Service Level Objectives and using them to drive error budgets, risk management and alerting Knowledge and experience with operating containers at scale within the Kubernetes ecosystem Experience More ❯
Cardiff, South Glamorgan, Wales, United Kingdom Hybrid / WFH Options
Confused.com
RVUs perspective by sharing your experience, knowledge & expertise in a continuous learning environment. As a member of the platform engineering team you will be accountable for the following: Objective setting, feature ideation, development and measurement Architectural decisions and designs of the platform, domains and systems Defining, evolving, and applying team processes Building efficient CI/CD pipelines and … well architected principles Solid understanding of platform and reliability engineering approaches (SRE), including observability, performance optimisation, capturing analytics and security best practices Experience implementing Service Level Objectives and using them to drive error budgets, risk management and alerting Knowledge and experience with operating containers at scale within the Kubernetes ecosystem Experience More ❯
issues. Experience managing and contributing to mid-large projects related to system reliability improvements. Knowledge of Site Reliability Engineering (SRE) Practices: including error budgeting, service level objectives (SLOs), and service level indicators (SLIs). Demonstrated ability to collaborate with cross-functional teams, including More ❯
operational insights. Collaborate with SRE teams on building and enhancing tooling and automation solutions Work with customers to understand pain points around Supportability and SLO attainment Be the single point of contact for enterprise customer service escalations Implement changes to service telemetry for automation consumption Enhance customer More ❯
teamwork. Build rapport with each member of the team and support them as they level up their skills. Define and maintain company-wide practices around SLO definition and management, incident management, postmortem analysis, and disaster testing and recovery. Generate informed insights regarding service quality and interface directly with executive leadership to communicate More ❯
Newcastle Upon Tyne, Tyne and Wear, North East, United Kingdom Hybrid / WFH Options
Develop
platform's core value streams. Key Responsibilities Technical Leadership & Strategy Champion engineering best practices, system reliability, and architectural integrity Define and track progress toward Service Level Objectives (SLOs) Collaborate with product stakeholders to shape robust and scalable solutions Take responsibility for non-functional areas such as performance, maintainability, and security Provide More ❯
Sunderland, Tyne and Wear, UK Hybrid / WFH Options
Develop
platform's core value streams. Key Responsibilities Technical Leadership & Strategy Champion engineering best practices, system reliability, and architectural integrity Define and track progress toward Service Level Objectives (SLOs) Collaborate with product stakeholders to shape robust and scalable solutions Take responsibility for non-functional areas such as performance, maintainability, and security Provide More ❯
of Anthropic's mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way. Responsibilities: Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity Design and implement monitoring systems including availability, latency and … distributed systems observability and monitoring at scale Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence) Have experience with chaos engineering and More ❯