observability platforms to support real-time decision-making. Support incident prevention, root cause analysis, and continuous improvement through data-driven insights. Define and enforce service level objectives (SLOs) and key performance indicators (KPIs) for SACM health and value. Governance, Compliance & Asset Management: Ensure accurate, complete, and up-to-date asset and …
demonstrated ability to implement site reliability within an application or platform. Advanced knowledge and experience in observability such as white and black box monitoring, service level objectives, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc. Ability to communicate data-based solutions with complex reporting and …
automated response. Apply SRE principles to improve reliability, performance, and maintainability of security services. Lead platform health, patching automation, and vulnerability remediation workflows. Define service level objectives (SLOs) and key performance indicators (KPIs) for all security services. Compliance, Governance & Risk Management: Ensure alignment with global compliance requirements such as ISO …
closely with our SecOps teams to ensure timely vulnerability management. Educating teams in SRE practices and maintaining high standards of compliance. Implementing world-class observability standards utilising SLI/SLO/Error Budgets. Continually evolving our observability platforms for greater coverage. Liaising with Product & Engineering teams for constant evolution of metrics. Aligning SRE Sprints & Backlog with our roadmaps to meet …
London, England, United Kingdom Hybrid / WFH Options
NatWest Group
pipelines and automation to help manage our product and services. You’ll work closely with our feature team and other colleagues to meet defined service level objectives and continually improve systems and environments. You’ll define error budgets that support finding the right balance between risk and reliability. You’ll also …
debugging distributed systems issues across Edge, Network, Compute, and Storage layers. Experience with observability stacks (metrics, logs, tracing) and tools like Splunk and New Relic. Familiarity with SRE practices: SLO, SLA, etc. Excellent English communication skills, verbal and written (German not required). A collaborative mindset: you’re helpful, respectful, and enjoy sharing knowledge. Ability to context switch, work through …
configurations across legacy and modern applications to ensure their continued performance and reliability. System Monitoring & Performance: Maintain and improve logging, monitoring, and alerting systems. Define service-level objectives and indicators for business applications. Continuously review performance metrics against SLOs/SLIs and proactively address performance bottlenecks or underperforming systems. Manage system …
performance. What You’ll Do: Define strategies for Application Performance Monitoring, Unit Cost, and Chaos Engineering. Continuously optimize production environments to enhance reliability and efficiency. Implement and apply MTTR, SLO, and SLI principles to ensure high service standards. Respond to incidents, analyze root causes, and drive long-term improvements. Maintain fault-tolerant, scalable, and cost-effective … role in shaping the core technology layers that drive our platform’s success. What You Need: Proven experience implementing SRE principles at scale, including deep knowledge of SLI/SLO/SLA differences. A product engineering background with strong coding skills in Python, C#, or similar. Experience with incident management frameworks and evolving them for efficiency. Expertise in cloud platforms …
issues. Experience managing and contributing to mid-to-large projects related to system reliability improvements. Knowledge of Site Reliability Engineering (SRE) practices, including error budgeting, service level objectives (SLOs), and service level indicators (SLIs). Demonstrated ability to collaborate with cross-functional teams, including …
Quality, Stability & Standards: Establish quality standards to meet performance, reliability, and maintainability of the systems. With a strong production-first mindset, drive observability, maintain Service Level Objectives (SLOs), and ensure efficient incident resolution. Oversee the maintenance of existing systems, ensuring continuous improvements and prompt resolution of issues. Agile Delivery & Collaboration: Working …
tools, and workflows, integrating internal systems and third-party solutions. Network Health Management: Define and implement prediction pipelines for long-term network health, availability, and service-level objectives. Operations Automation: Lead initiatives to automate and optimize network operations focusing on scalability and reliability. Collaborative Development: Work closely with teams on requirements analysis …
graphs, service maps, and transaction breakdowns in APM UI. Dashboarding & Visualization: Develop Kibana dashboards, Canvas presentations, and Lens visualizations for SREs and Dev teams. Implement SLO/SLI monitoring and alerting using Kibana Alerting API and Watcher where needed. Performance Tuning & Scaling: Advise on shard sizing, index rollover policies, and hot-warm architecture for efficient storage. …
ensure that solutions are designed with customer experience, scalability, and performance in mind. Analyze system performance and reliability, offering recommendations for enhancement. Develop and uphold service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for our services. Participate in on-call …
network. Enhance existing monitoring and observability frameworks, integrating intelligent alerting and self-remediation capabilities to reduce manual intervention and improve incident response. Define and measure service-level objectives (SLOs) to track infrastructure performance and reliability. Write software utilizing orchestration systems to automate tasks and interact with other systems. Provide mentorship to …
in projects for new Infrastructure Services. Optimize data center infrastructure to improve performance, utilisation, availability and security, whilst controlling costs. Maintain observability best practices to measure compliance with Rimes SLOs. Implement efficient monitoring systems to measure performance and reliability of production systems. Participate in Capacity and Performance planning. Who you are: Core requirements: Bachelor's degree in Computer Science …
London, England, United Kingdom Hybrid / WFH Options
Attio Ltd
will have the following attributes: Proven experience with Google Cloud and Kubernetes. Contribute across the stack, including TypeScript, Node.js, and Google Cloud Platform. Champion operational excellence and resilience (99.99% SLO). Manage CI/CD pipelines to improve deployment speed and reliability. Support backup, disaster recovery, and security. Experience with Google Spanner is a nice to have. Hiring Process: An introductory …
and pair programming. Drive DevOps practices to automate the Product development life cycle. Foster a culture of experimentation and innovation to drive solutions. Ensure products meet their SLI and SLO targets and are fully supported by teams both in and out of hours. Lead development of Product Group OKRs and Product health, and demonstrate responsibility for the entire Product Group …
driving technical, business, and cultural change to improve the reliability, performance, and efficiency excite you? The AWS Managed Operations (MO) organization was founded in April 2023, with the objective to reduce operational load and toil through long-term engineering projects. MO is building the best-in-class engineering and operations team that will own the day-to-day … fixes yesterday, and designed a solution to that class of problem, seeking feedback from your team. On Wednesday you investigated a Service Level Objective (SLO) that recently became less than useful. You dove deep, talked with the partner team, and found out the thresholds no longer make sense, so you …
cloud-hosted environments in Amazon Web Services with Terraform. You have experience programming React (or other JavaScript frameworks). You have experience setting and maintaining service level objectives and service level indicators around enterprise platforms. You have experience participating in incident response and engineering …