critical detail to your mentees. Production Kubernetes experience and debugging all services that run within the K8s ecosystem, including Istio service mesh. SRE mentality (SLI, SLO & SLA) using Observability, Logging, Monitoring & Alerting (Dynatrace). Ideally coming from a software engineering or exceptional scripting skill background and have moved into SRE/DevOps while gaining a wider understanding…
the usage (and, desirably, the deployment) of e.g. ELK, CloudWatch, Fluentd, to enable forensic log analysis and system tuning as well as data-driven performance analysis (i.e. SLI/SLO) and capacity planning. You are a competent Linux & Windows systems administrator (for multiple distributions), including storage management (e.g. LVM, RAID) and security best-practices e.g. SSH, SSL/TLS, HMAC…
London, South East England, United Kingdom (Hybrid / WFH Options)
BlackRock, Inc
the usage (and, desirably, the deployment) of e.g. ELK, CloudWatch, Fluentd, to enable forensic log analysis and system tuning as well as data-driven performance analysis (i.e. SLI/SLO) and capacity planning. You are a competent Linux & Windows systems administrator (for multiple distributions), including storage management (e.g. LVM, RAID) and security best-practices e.g. SSH, SSL/TLS, HMAC…
highly available. Use best practices for AWS services, automation, and monitoring. SRE Practices Implementation: Establish and lead the implementation of SRE principles, such as Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, to drive the team's focus on reliability…
configurations across legacy and modern applications to ensure their continued performance and reliability. System Monitoring & Performance: Maintain and improve logging, monitoring, and alerting systems. Define service-level objectives and indicators for business applications. Continuously review performance metrics against SLOs/SLIs and proactively address performance bottlenecks or underperforming systems. Manage system…
issues. Experience managing and contributing to mid-to-large projects related to system reliability improvements. Knowledge of Site Reliability Engineering (SRE) practices, including error budgeting, service level objectives (SLOs), and service level indicators (SLIs). Demonstrated ability to collaborate with cross-functional teams, including…
Quality, Stability & Standards: Establish quality standards to meet performance, reliability, and maintainability of the systems. With a strong production-first mindset, drive observability, maintain Service Level Objectives (SLOs), and ensure efficient incident resolution. Oversee the maintenance of existing systems, ensuring continuous improvements and prompt resolution of issues. Agile Delivery & Collaboration: Working…
network. Enhance existing monitoring and observability frameworks, integrating intelligent alerting and self-remediation capabilities to reduce manual intervention and improve incident response. Define and measure service-level objectives (SLOs) to track infrastructure performance and reliability. Write software utilizing orchestration systems to automate tasks and interact with other systems. Provide mentorship to…
their full potential through the Microsoft Cloud. We are a fast-growing team, but we are committed to remaining agile. Customer first, nurturing trust, high responsiveness, automation, SLO/SLI/SLA, blameless post-mortem, observability, monitoring, alerting, and toil reduction form the foundations of our code and we work with teams across Microsoft and external customers to … Baseline Personnel Security Standards; UK Security Clearance. Responsibilities: Collaborating closely with the existing SRE teams on building and enhancing tooling and automation solutions for faster resolution of issues impacting SLOs and averting incidents altogether when possible. Collaborating with the customers to understand their pain points around Supportability and SLO attainment and formulate strategies for addressing recurring issues in a…
on input from stakeholders, market analysis, and user feedback. Provide clarity and guidance to the development and run teams on product requirements, acceptance criteria, service level objectives and desired outcomes. Drive a culture of continuous improvement by implementing best practices, fostering innovation, and promoting experimentation within the value stream. Lead and…
and legal counsel. Establish a collaborative environment for sharing data on machine timelines and suspicious events. Create operational metrics, key performance indicators (KPIs), and service level objectives to measure team competence. Qualifications: 4+ years' experience working with incident investigations utilizing EDRs, SIEMs, and containment procedures. 4+ years' experience with…
operational insights. Last updated 5 days ago. Collaborate with SRE teams on building and enhancing tooling and automation solutions. Work with customers to understand pain points around Supportability and SLO attainment. Be the single point of contact for enterprise customer service escalations. Implement changes to service telemetry for automation consumption. Enhance customer…
willing to present and defend your ideas to technical and non-technical audiences. Additional Desired Skills: Experience with incident management platforms like PagerDuty, OpsGenie, or similar tools. Understanding of SLO/SLA management and implementations. Knowledge of industry standard incident management frameworks and best practices. Familiarity with automated remediation and runbook automation. Experience with DevOps and SRE practices. Cultural Fit…
teamwork. Build rapport with each member of the team and support them as they level up their skills. Define and maintain company-wide practices around SLO definition and management, incident management, postmortem analysis, and disaster testing and recovery. Generate informed insights regarding service quality and interface directly with executive leadership to communicate…
Required: Provisioning and maintaining cloud-hosted environments in Amazon Web Services with Terraform. Programming experience with React (or other JavaScript frameworks). Setting and maintaining service level objectives and service level indicators. Qualities We're Looking For: Kind, passionate, and collaborative problem-solver who…
of Anthropic's mission to bring the capabilities of groundbreaking AI technologies to benefit humanity in a safe and reliable way. Responsibilities: Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity. Design and implement monitoring systems including availability, latency and … distributed systems observability and monitoring at scale. Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines. Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services. Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence). Have experience with chaos engineering and…
ensure cost-effective utilisation of all available resources, within budget. Developing and operating all Products, to agreed timescales, scope and quality (including security and service level objectives). Monitoring competition and ensuring product features remain competitive across the Product unit. You Will Also Collaborate With Other ICT Leads To: Develop and communicate…
to build and enhance tooling and automation solutions, enabling faster resolution of issues impacting SLOs and preventing incidents when possible. Engage with customers to understand their supportability challenges and SLO attainment concerns, developing sustainable strategies to address recurring issues. Serve as the primary technical contact for interfacing with large enterprise customers, managing service escalations, and driving…
Security Analyst, Security Operations and Incident Response. Meta is seeking a Security Analyst to join the Global Security Operations and Incident Response team. The Analyst will serve on the front lines of Meta's Security team and will lead and…