have: Experience with Observability-Driven Development (ODD). Experience designing and scaling monitoring and alerting infrastructure for distributed systems, ideally in Kubernetes and AWS. Familiarity with distributed tracing (e.g. OpenTelemetry). Experience with cost optimisation strategies for observability tools and infrastructure. Knowledge of microservices architectures and how to monitor complex, interdependent services. Experience with SRE (Site Reliability Engineering) principles and More ❯
Manchester, England, United Kingdom Hybrid / WFH Options
bet365
Who we are looking for: A Site Reliability Engineer who will enhance system reliability, observability and performance through a strong engineering approach and assist with incident resolution and best practices. You will have software engineering skills, focusing on system reliability More ❯
Staff Software Engineer - AI In-Market Engineering role at DDN, France Lynch, England, United Kingdom. 1 day ago Be More ❯
London, England, United Kingdom Hybrid / WFH Options
WunderGraph, Inc
engineering in Go. You will not only architect our internal systems for scale but also build and operate key product infrastructure, including our customer-facing telemetry pipeline (built on OpenTelemetry and ClickHouse) and the AI pipeline that empowers our products. We are looking for a hands-on technical leader, driven by the challenge of solving ambiguous, 'eBay-scale' problems—whether … on, but is not limited to: Architecting, building, and operating the core cloud-native infrastructure for WunderGraph Cosmo, primarily using Go and Kubernetes. Owning and evolving our observability stack (OpenTelemetry, Prometheus, ClickHouse) and the infrastructure supporting our AI-driven features to ensure deep, actionable insights into our systems. Building and optimizing CI/CD pipelines to improve build times, automate … system architecture, distributed systems, and the challenges of running high-performance API gateways. Familiarity with GraphQL Federation is a significant plus. Experience building or managing modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, ClickHouse). A self-starter attitude and a leader’s mindset: you are comfortable with ambiguity, can identify and solve ill-defined problems, and don’t need hand More ❯
performance DevOps function (building and optimising CI/CD pipelines etc.!) A brand new observability platform dealing with over 6 million requests per second and working on tracing and OpenTelemetry (as well as exploring ML for Observability) Automate anything and everything with Python & config tools This is one of the best opportunities for a passionate Linux infrastructure enthusiast out there More ❯
or similar) or Unix/Linux systems Excellent collaboration, communication, and problem-solving skills Nice to Have Experience with: Cybersecurity or DLP products Incident, problem, and change management tools OpenTelemetry or telemetry pipeline tooling Automation and scripting for monitoring Working in Agile or operational environments Why Join? Work on a globally distributed, high-impact security team Learn and grow in More ❯
years in a management role Background in security engineering, DevSecOps or a strong understanding of security best practices in cloud-native environments Familiarity with CNCF tools such as Prometheus, OpenTelemetry, and ArgoCD Cisco values the perspectives and skills that emerge from employees with diverse backgrounds. That's why Cisco is expanding the boundaries of discovering top talent by not only More ❯
The CoE Lead - Observability & Tools at JD Sports Fashion Plc is a critical, hands-on technical role focused on designing, building, and maintaining the company's Observability platform. This role ensures that our technology platforms operate efficiently and reliably, providing early More ❯
The Role The company we are working with here is migrating to Dynatrace, and you will be building Splunk pipelines. Design and implement monitoring pipelines using Splunk, Dynatrace, and OpenTelemetry (OTel). Automate the deployment of monitoring tools using Terraform, Ansible, and Jenkins. Manage configuration and version control with Bitbucket and Artifactory. Ensure seamless integration of monitoring solutions into CI More ❯
programming language, ideally C# Understand the .NET runtime and system fundamentals like networking Our current tools include: C# on .NET Core Azure DevOps Pipelines and AWS Grafana, Splunk, Prometheus, OpenTelemetry Docker and Kubernetes If you're familiar with similar technologies, training is available. Trayport is committed to a respectful, diverse, and inclusive work environment. We provide accommodations for applicants and More ❯
on practical experience delivering enterprise-level cybersecurity solutions and controls Advanced in one or more programming languages, ideally one or more of: *NIX Scripting, Python, SQL & GraphQL, Splunk, Grafana & OpenTelemetry Proficiency in automation and continuous delivery methods Proficiency in all aspects of the Software Development Life Cycle Advanced understanding of agile methodologies such as continuous integration and delivery, application resiliency More ❯
technical experience in Cloud DevOps, SaaS, or observability, with 5+ years in leadership roles. Strong hands-on experience with AWS, GCP, Azure, K8S, Terraform and observability tools: Prometheus, Grafana, OpenTelemetry, ELK, Splunk, Datadog, and similar. Proficiency with metrics, logs, traces and APM. Leadership & Global Operations Proven success leading multi-regional or global technical teams with direct management of managers. Demonstrated More ❯
OpenAI Evals, LangSmith, LLM-Harness) and a track record of turning eval results into product-ready gating. Observability chops: you've wired up tracing/metrics for distributed systems (OpenTelemetry, Prometheus, Grafana) and know how to set SLOs that actually matter. Prompt-engineering fluency (few-shot, function-calling, RAG orchestration) and an instinct for spotting ambiguity or jailbreak vectors. Production More ❯
Site Reliability Engineer (SRE), Apple Pay London, England, United Kingdom Software and Services Description As an SRE in WPC, you'll need to solve problems using data, teamwork, and your own expertise. You will own the full stack and our More ❯
London, England, United Kingdom Hybrid / WFH Options
Birdie
About Birdie Birdie is the leading home healthcare technology platform that aims to radically transform the lives of older adults. Its all-in-one solution supports around 4.8 million (and growing) care visits every month, equipping care providers with the More ❯
London, England, United Kingdom Hybrid / WFH Options
Bjak
services. Conduct performance and load testing for distributed systems. Work with DevOps Engineers to integrate tests into CI/CD pipelines. Ensure observability and logging for test executions, e.g. OpenTelemetry, ELK. Collaborate with Software Engineers to enforce quality in system refactoring efforts. Bachelor's Degree in Computer Science, Software Engineering, or related fields. 3+ years of experience in QA Automation More ❯
experience in technical customer-facing roles (e.g. Solutions Architect, Sales Engineering, Pre-Sales). Background in enterprise SaaS, especially in infrastructure monitoring, analytics, or APM. Hands-on expertise with OpenTelemetry, Kubernetes, and modern cloud-native observability stacks. Familiarity with streaming data and real-time metric processing. Experience working in Agile environments and across the full software development lifecycle. Join a More ❯
Stafford, England, United Kingdom Hybrid / WFH Options
JR United Kingdom
A Site Reliability Engineer, who will enhance system reliability, observability and performance through a strong engineering approach and assist with incident resolution and best practices. You will have software engineering skills More ❯
Stoke-on-Trent, England, United Kingdom Hybrid / WFH Options
JR United Kingdom
Who we are looking for A Site Reliability Engineer who will enhance system reliability, observability, and performance through a strong engineering approach, and assist with incident resolution and best practices. You will have software engineering skills, focusing on system reliability More ❯