in automating/scripting to remove toil. Ability to quickly understand, update and write code in languages such as Python, Java, Golang, Bash and PowerShell; working experience monitoring SLOs, SLIs and SLAs, and logging updates and alerting where appropriate; strong DevOps understanding and familiarity, including experience of Infrastructure as Code and CI/CD pipelines, such …
performance engineering. Work with Technology teams to resolve major incidents. Conduct root cause analysis (RCA) for incidents and implement preventive measures. Define and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets. Continuously improve automated remediation tasks to ensure the …
App Gateway. 2+ years of experience with Reliability concepts to ensure high performance and high service availability; able to define, implement and improve business performance SLOs. 2+ years of experience with Production operations including 24x7 on-call support, escalation/paging with OpsGenie, incident management, RCA (Root Cause Analysis) and retrospective analysis. 2+ or more … enhance the Observability and Reliability of applications and services running on IaaS and PaaS in Microsoft Azure. AWS and GCP are nice to have. Service Level Objectives and indicators focused on improving business workflow performance and availability. Technical and business dashboards, metrics, and actionable alerting. Processes and automation for increasing uptime …
as code for the applications and platforms in your remit. Understand service level indicators and use service level objectives to proactively resolve issues before they impact customers. Support the adoption of site reliability engineering best practices within your team (metrics, alerting, logging, automation …
the SRE organisation as a key member of the SRE leadership team. Lead the definition and tracking of Service Level Objectives (SLOs) to measure service availability, in combination with service, product and engineering communities. Collaborate with product and engineering senior managers to ensure delivery …
and principles to strengthen focus, behaviours, and culture. You will support the POs and ELs to ensure our products and services are sufficiently resilient and to address any SLA or SLO failures or increased incident levels. SREs contribute to the Product Engineering backlog with a focus on reliability and performance; you will work with Product Owners and Engineering Leads using … lifecycle and experience in end-to-end delivery of software products, with emphasis on operational aspects. The ability to define, implement and achieve relevant Service Level Objectives for the products in the lab you are aligned to. Experience with agile development methods (Scrum, Kanban) and tooling (Jira and Confluence) and experience …
Software Engineer - Site Reliability Engineering, London. About Neo4j: Neo4j is the leader in Graph Database & Analytics, helping organizations uncover hidden patterns and relationships across billions of data connections deeply, easily, and quickly. Customers use Neo4j to gain a deeper understanding …
include: To form part of a critical operations function that is responsible for the monitoring, availability and performance of production services. Respond to stakeholder requests within agreed timescales or SLO. Drive automation to reduce failures and manual tasks, thereby improving overall application performance and availability. Perform systems administration activities to ensure the smooth operation of applications across multiple platforms. Coordinate …
Manchester, England, United Kingdom Hybrid / WFH Options
bet365
and management of effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Excellent knowledge of programming languages including Python, Golang …
Stoke-on-Trent, England, United Kingdom Hybrid / WFH Options
bet365
and management of effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Excellent knowledge of programming languages including Python, Golang …
and team talks, helping them improve their C#/.NET Core skills. Support and enhance current systems and initiatives during office hours, ensuring that service level objectives are met. Maintain a strong focus on quality, reusability, clean architectures, security, and resilience across the full application lifecycle. Collaborate with the Lead Developer …
billion events per day. To ensure the reliability of this environment for our customers, SREs work closely with developers and product managers to understand service level objectives, think through failure scenarios, and design systems which balance cost with reliability objectives. Additionally, SREs collaborate with the Information Security team to ensure that …
and management of effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Knowledge and experience of modern software development techniques …
Cardiff, South Glamorgan, United Kingdom Hybrid / WFH Options
RVU Co UK
s perspective by sharing your experience, knowledge & expertise in a continuous learning environment. As a member of the platform engineering team you will be accountable for the following: objective setting, feature ideation, development and measurement; architectural decisions and designs of the platform, domains and systems; defining, evolving, and applying team processes; building efficient CI/CD pipelines and … well architected principles. Solid understanding of platform and reliability engineering approaches (SRE), including observability, performance optimisation, capturing analytics and security best practices. Experience implementing Service Level Objectives and using them to drive error budgets, risk management and alerting. Knowledge and experience with operating containers at scale within the Kubernetes ecosystem. Experience …
and more. Lead incident management, capacity planning, and performance tuning initiatives. Guide engineers in observability, cost optimisation, and security best practices. Define and track service level objectives (SLOs) to improve engineering outcomes. Champion a DevOps mindset with “you build it, you run it” accountability. We’re Looking For: Proven background in …
issues. Experience managing and contributing to mid-large projects related to system reliability improvements. Knowledge of Site Reliability Engineering (SRE) practices, including error budgeting, service level objectives (SLOs), and service level indicators (SLIs). Demonstrated ability to collaborate with cross-functional teams, including …
Quality, Stability & Standards: Establish quality standards to meet performance, reliability, and maintainability of the systems. With a strong production-first mindset, drive observability, maintain Service Level Objectives (SLOs), and ensure efficient incident resolution. Oversee the maintenance of existing systems, ensuring continuous improvements and prompt resolution of issues. Agile Delivery & Collaboration: Working …
tools, and workflows, integrating internal systems and third-party solutions. Network Health Management: Define and implement prediction pipelines for long-term network health, availability, and service-level objectives. Operations Automation: Lead initiatives to automate and optimize network operations focusing on scalability and reliability. Collaborative Development: Work closely with teams on requirements analysis …
Northern Ireland, United Kingdom Hybrid / WFH Options
Jobgether
Terraform or Pulumi, and observability tools such as Datadog or CloudWatch. Experience in implementing AI-powered tools for workflow optimization and operational improvements. Proven success in setting up scalable, SLO-driven monitoring strategies in 24/7 environments. Ability to manage distributed teams, foster innovation, and drive results in a collaborative, inclusive setting. Strong communication and mentorship skills, with a …
ensure that solutions are designed with customer experience, scalability, and performance in mind. Analyze system performance and reliability, offering recommendations for enhancement. Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services. Participate in on-call …
network. Enhance existing monitoring and observability frameworks, integrating intelligent alerting and self-remediation capabilities to reduce manual intervention and improve incident response. Define and measure service-level objectives (SLOs) to track infrastructure performance and reliability. Write software utilizing orchestration systems to automate tasks and interact with other systems. Provide mentorship to …