Bristol, Avon, England, United Kingdom Hybrid / WFH Options
Robert Walters
to manage Kubernetes clusters in production environments. Competence in scripting and development using languages such as Python, Java, Go, Bash, or PowerShell. Strong understanding of service-level objectives (SLOs), indicators (SLIs), and monitoring practices. Hands-on experience with infrastructure as code (e.g., Terraform) and CI/CD tools (e.g., Jenkins, Azure
critical detail to your mentees. Production Kubernetes experience and debugging all services that run within the K8s ecosystem, including Istio service mesh. SRE mentality (SLI, SLO & SLA) using Observability, Logging, Monitoring & Alerting (Dynatrace). Ideally coming from a software engineering or exceptional scripting background, having moved into SRE/DevOps while gaining a wider understanding
a service catalog to the engineering team, and also author many other useful DevOps plugins. Contributing to observability best practices and providing key SLI/SLO metric reporting, so that the engineering team can balance velocity and reliability. Develop inner/open source projects to help provide a world-class developer experience to the engineering team.
Main responsibilities: We are looking for people with a passion to learn, and who bring a continuous improvement mentality to our team! SREs maintain Service Level Objectives for the systems they own. Constantly measuring and improving availability, latency, and overall system health is at the core of our team's purpose.
Bristol, Gloucestershire, United Kingdom Hybrid / WFH Options
TwinStream Limited
consistent and correctly configured. The system is designed to be highly observable and available. The team will use monitoring tools to verify that all components are meeting SLA/SLO requirements. If any problems are identified, the team will take preventive actions to minimise customer impact and restore service as quickly as possible. This role is
include: To form part of a critical operations function that is responsible for the monitoring, availability and performance of production services. Responding to stakeholder requests within agreed timescales or SLOs. Drive automation to reduce failures and manual tasks, thereby improving overall application performance and availability. Perform systems administration activities to ensure the smooth operation of applications across multiple platforms. Coordinate
configurations across legacy and modern applications to ensure their continued performance and reliability. System Monitoring & Performance: Maintain and improve logging, monitoring, and alerting systems. Define service-level objectives and indicators for business applications. Continuously review performance metrics against SLOs/SLIs and proactively address performance bottlenecks or underperforming systems. Manage system
Cloud Platform (GCP). This role will involve working closely with development, platform engineering, and security teams to implement DevOps best practices, define and enforce service-level objectives, and build a scalable monitoring and alerting platform. Key Responsibilities: Automate deployment, monitoring, and incident response processes using GCP-native tools and technologies. … in Onyx to operate with a DevOps ethos. Collaborate with development teams to optimise application performance, reliability, and observability on GCP. Implement and enforce Service Level Objectives (SLOs) and Error Budgets to ensure a balance between reliability and feature development. Develop and maintain a comprehensive monitoring and alerting platform to detect
culture of innovation, collaboration, and continuous improvement. Ensure network automation complies with relevant regulatory requirements, security requirements and industry standards. Establish Key Performance Indicators and Service-Level Objectives to measure operational effectiveness. Build relationships with CTO, Application Production Support & Engineering, CIO organizations and other stakeholders. Communicate effectively with technical and non
observability platforms to support real-time decision-making. Support incident prevention, root cause analysis, and continuous improvement through data-driven insights. Define and enforce service level objectives (SLOs) and key performance indicators (KPIs) for SACM health and value. Governance, Compliance & Asset Management: Ensure accurate, complete, and up-to-date asset and
billion events per day. To ensure the reliability of this environment for our customers, SREs work closely with developers and product managers to understand service level objectives, think through failure scenarios, and design systems which balance cost with reliability objectives. Additionally, SREs collaborate with the Information Security team to ensure that
automated response. Apply SRE principles to improve reliability, performance, and maintainability of security services. Lead platform health, patching automation, and vulnerability remediation workflows. Define service level objectives (SLOs) and key performance indicators (KPIs) for all security services. Compliance, Governance & Risk Management: Ensure alignment with global compliance requirements such as ISO
initiatives from design through deployment and operations. Write maintainable, well-tested, high-quality code and uphold engineering best practices. Focus on observability and maintain Service Level Objectives. Take operational responsibility for the Payments Platform, including joining the on-call rota. Foster a strong engineering culture through mentorship, code reviews, and collaboration
Quality, Stability & Standards: Establish quality standards to meet performance, reliability, and maintainability of the systems. With a strong production-first mindset, drive observability, maintain Service Level Objectives (SLOs), and ensure efficient incident resolution. Oversee the maintenance of existing systems, ensuring continuous improvements and prompt resolution of issues. Agile Delivery & Collaboration: Working
development lifecycle to ensure reliability, scalability, and operational stability are maintained across all supported platforms. Define, create, and monitor application analytics to support improved service level objectives and drive data-informed decision making. Ensure strict adherence to change management release processes while accelerating automation initiatives for these workflows. Lead resiliency management … e.g., RDS/Aurora) and non-relational databases equips you to support diverse data storage requirements. Previous exposure to site reliability engineering concepts, including service level objectives (SLOs), service level agreements (SLAs), service level indicators
a code concept is desirable. Experience with build automation, test-driven development, continuous integration and delivery. Experience with relational and non-relational databases. Previous SRE experience, including knowledge of SLO/SLA/SLI and error budgets, is advantageous. Experience working or familiarity with one public cloud (AWS, Google or Azure). If this is of interest and you have the
network. Enhance existing monitoring and observability frameworks, integrating intelligent alerting and self-remediation capabilities to reduce manual intervention and improve incident response. Define and measure service-level objectives (SLOs) to track infrastructure performance and reliability. Write software utilizing orchestration systems to automate tasks and interact with other systems. Provide mentorship to