Infrastructure as Code (Terraform/Terragrunt). Kubernetes expertise in container orchestration and cluster management. Network engineering skills including load balancers, CDN, Istio, and security patterns. Experience with observability platforms (OpenTelemetry) and distributed systems. Nice-to-have skills: Python programming and Linux system debugging. Database administration (SQL, MongoDB, Redis). Message broker and event streaming experience (Kafka). Database performance optimisation skills. At …
artifact promotion, and release gating into the SDLC. Ensure pipeline scalability and governance while maintaining developer velocity. Observability & Troubleshooting: Lead the implementation and usage of modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Datadog). Establish SLOs, SLIs, and error budgets with product and engineering teams. Drive root cause identification using distributed tracing, advanced log analysis, and anomaly detection. Security …
have: Experience with Observability-Driven Development (ODD). Experience designing and scaling monitoring and alerting infrastructure for distributed systems, ideally in Kubernetes and AWS. Familiarity with distributed tracing (e.g. OpenTelemetry). Experience with cost optimisation strategies for observability tools and infrastructure. Knowledge of microservices architectures and how to monitor complex, interdependent services. Experience with SRE (Site Reliability Engineering) principles and …
meet internal SLOs/SLAs and reliability targets. Support the migration from our legacy ELK stack to a modern observability platform using Prometheus, Mimir, Grafana, Honeycomb, Loki, Quickwit, and OpenTelemetry. Contribute to knowledge sharing and the ongoing development of best practices in observability across the organisation. What you'll need: 4+ years of professional experience as a software engineer … to collaborate effectively across teams and explain complex technical concepts clearly. A proactive mindset focused on long-term impact, sustainable engineering practices, and continuous improvement. Preferred Qualifications: Experience with OpenTelemetry or distributed tracing systems. Understanding of observability-driven development and service reliability principles (e.g. SRE, MTTR, SLIs/SLOs). Experience optimising observability systems for cost and performance at scale.
London, England, United Kingdom Hybrid / WFH Options
Durlston Partners
on with Docker (Kubernetes is a plus), infrastructure-as-code, and CI/CD tooling. Strong scripting and automation experience in Python and Bash. Familiarity with observability stacks (Prometheus, OpenTelemetry, eBPF). Cloud infrastructure experience (AWS/GCP/Azure), with attention to IAM and software supply chain security. Curious, persistent, and comfortable experimenting at the lowest levels of the stack …
level based on candidate experience. Qualifications Preferred Requirements: Experience with query languages such as SQL, SPL, or KQL. Experience with observability and log collectors/pipelines such as FluentBit, OpenTelemetry, Cribl, and Logstash. Experience with web technologies such as HTML, CSS, and JavaScript. Experience with programming/scripting side technologies such as Java, .NET, PHP, Go, Node.js and database. Advanced …
between Google's Load Balancer and the HTTP server in our main Elixir application causing HTTP 5XX responses to be returned to our customers. - Debugging an issue in our OpenTelemetry pipelines causing us to silently drop spans. - An enthusiasm for both software development and systems engineering. - A high bar for code and configuration quality and readability. - A good understanding of … to managing our Kubernetes configuration, using ArgoCD and Helm. - We manage a high-availability metrics collection system using Grafana, Thanos & Prometheus. We're in the process of transitioning to OpenTelemetry and Honeycomb for our application telemetry (traces and metrics). - We manage a data pipeline using Pub/Sub, Airbyte, and dbt. Our Current Focus: We're currently driving a … how we think about and monitor reliability across the engineering organisation, with a focus on early detection of customer-impacting issues. We're extending and standardising our use of OpenTelemetry, and introducing Honeycomb as the single place for engineers to understand how our applications are operating in production. This project involves both technical work, on the application libraries and infrastructure …
management, performance optimization, security, and cost control. Experience with tools like Helm, Karpenter, and k9s is essential. Ideal candidates will have experience with collecting logs, traces, and metrics via OpenTelemetry, and making these available through AWS services like X-Ray and CloudWatch. These insights should be used to ensure Nexus meets high standards for performance and reliability, and to guide … cloud-based systems where uptime and reliability are critical. Desirable Skills: Experience with PostgreSQL. Experience in continuous deployment environments. Debugging and troubleshooting code. Professional experience with Python. Familiarity with OpenTelemetry SDKs and standards. What We Offer: Work alongside a talented team in the quantum computing industry. We offer a competitive salary, equity, 28 days of paid holiday (plus public holidays …
is the bulk of the headcount. They would also want good knowledge of: Cloud (AWS, on-prem), Microservices (K8s, Kafka), IaC (Terraform), CI/CD (GitOps, GitHub Actions, ArgoCD), Monitoring (OpenTelemetry, Prometheus, Grafana), Security (Vault, IAM, OPA, SOC2, GDPR). What’s in it for you? Annual bonus, Share Options, L&D Fund, Private Medical, Hybrid/Flexi Working. The chance to …
Generative AI integration. Programming experience in Go (Golang), Java, Kotlin, JavaScript/Node.js, or Python. Strong hands-on experience with Kubernetes (K8s) and OpenShift. Experience with MongoDB, Kafka, Prometheus, OpenTelemetry, Grafana. Familiarity with tools like Helm, Kustomize, Terraform, and Vault. Proven experience with hybrid cloud environments (on-prem + public cloud). Ability to explain complex ideas clearly to both technical …
edge team driving the future of observability! We're looking for a Monitoring & Observability Engineer to lead the design and deployment of robust monitoring solutions using Splunk, Dynatrace, and OpenTelemetry (OTel) in a fast-paced, tech-forward environment. Key Responsibilities: Design and implement end-to-end monitoring pipelines (Splunk, Dynatrace, OTel). Build and maintain dashboards and queries in Splunk.
London, England, United Kingdom Hybrid / WFH Options
Ocho
warehouses like Snowflake • Day-2 operations experience including observability, debugging, and triage Desirable Skills: • Experience with Auth0, AWS Cognito, or similar identity platforms • Familiarity with Helm, Prometheus, Grafana, or OpenTelemetry • Exposure to other cloud platforms (GCP, Azure) • CI/CD pipeline development for containerised/serverless apps Why Join • Shape a cutting-edge FinOps platform with real impact on how …
London, England, United Kingdom Hybrid / WFH Options
Xtremepush
the API to Application to Database layer of the platform. Strong communication skills and ability to explain complex technical solutions simply to others. Strong understanding of PHP, Go, MySQL, OpenTelemetry, Prometheus. Experience with Cloud and DevOps technologies (AWS, Terraform, CI/CD etc.). Experience with specific technologies in our stack: ClickHouse, Kafka, Pulsar, Python. Experience with networking and security concepts …
elastic and resilient to failure. Participate in and improve our 24x7 incident response and on-call rotation. Use and expand our existing CNCF solutions like Kubernetes, Service Mesh, Prometheus, OpenTelemetry, and ArgoCD to increase platform reliability. Automate production operations to provide guardrails and continuous platform operation. Develop automation solutions for scalable service and platform operations, including deployment, scale testing, graceful …
with cross-functional engineering teams. Experience working in Linux-based environments. Bonus/Nice-to-Have Skills: Experience deploying Grafana instances via code (provisioning dashboards, datasources). Familiarity with OpenTelemetry, metric instrumentation, and telemetry pipelines. Background in data center environments, infrastructure monitoring, or SRE practices. Exposure to CI/CD workflows, containers (Podman/Docker), and cloud-native systems.
Driven Architecture using AWS services (SNS, SQS, EventBridge). Knowledge of GraphQL, WebSockets, or real-time data streaming. Exposure to DevOps and observability practices (e.g., Prometheus, Datadog, AWS CloudWatch, OpenTelemetry). Prior experience in leading distributed engineering teams.
applications, and back-end services. Technical Skills: Strong proficiency in Python, React.js, TypeScript, and cloud-based technologies (AWS, Terraform, Docker, Kubernetes, etc.). Familiarity with PostgreSQL, GitHub Actions, and OpenTelemetry is also highly beneficial. Cloud & Infrastructure Experience: Strong experience working on cloud-based projects with a focus on scalable infrastructure. AI/ML Knowledge: Experience working with AI/ML …
London, England, United Kingdom Hybrid / WFH Options
TechBiz Global
and garbage collection. Build fault-tolerant systems with strong recovery mechanisms and failover strategies to maintain service continuity. Implement comprehensive logging and tracing using tools such as zap, klog, OpenTelemetry, and Jaeger to enhance monitoring and troubleshooting. Apply Test-Driven Development (TDD) and engage in Pair Programming to ensure high code quality and promote team collaboration. Participate actively in code …