Proven experience in building and scaling observability platforms in a cloud-native environment. Observability Expertise: Deep understanding of observability pillars (metrics, logs, traces) and related tools (e.g., Prometheus, Grafana, OpenTelemetry, Jaeger, Kibana Elastic Stack). AI/ML Proficiency: Hands-on experience integrating ML/AI models into observability systems to drive advanced insights, anomaly detection, and predictive analysis. Distributed…
Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFX, Instana, Splunk, Honeycomb, Jaeger. Hands-on experience with Infrastructure as Code (Terraform/Ansible). Hands-on experience in technical integrations (OpenTelemetry/fluentd/fluentbit/filebeat/logstash). Hands-on experience with complex troubleshooting of Kubernetes and Docker containers. Good knowledge of Regex, Lucene, PromQL. Good knowledge of Linux. Experience…
Datadog, Sumo Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFX, Instana, Splunk, Honeycomb, Jaeger. Hands-on experience with Infrastructure as Code (Terraform/Ansible). Hands-on experience in technical integrations (OpenTelemetry/fluentd/fluentbit/filebeat/logstash). Hands-on experience with complex troubleshooting of Kubernetes and Docker containers. Good knowledge of RegEx, Lucene, PromQL. Good knowledge of Linux. Experience…
London, England, United Kingdom Hybrid / WFH Options
Circadia Health
Experience orchestrating GPU/AI workloads, MLOps, or large‐language‐model serving. Knowledge of edge/IoT deployments and over‐the‐air update strategies. Exposure to observability stacks (OpenTelemetry, Loki) and security tooling (Falco, Aqua, Wiz). What We Offer: Base salary £100,000 – £170,000 plus meaningful equity. Gym membership. Comprehensive health, dental & vision coverage (UK & global travel…
Secure Innovation is part of CGI's Space, Defence, and Intelligence business unit, focused primarily on the delivery of contemporary and innovative technical solutions for the government agencies' most challenging problems. Our teams work alongside our clients to help them…
City of London, London, United Kingdom Hybrid / WFH Options
Stealth AI Startup
Who we are: We are a seed-stage AI start-up backed by leading European and US funds. Our founders previously built and deployed cutting-edge AI systems at world-class research labs and high-growth technology companies. We apply…
Infrastructure as Code (Terraform/Terragrunt). Kubernetes expertise in container orchestration and cluster management. Network engineering skills including load balancers, CDN, Istio, and security patterns. Experience with observability platforms (OpenTelemetry) and distributed systems. Nice-to-have skills: Python programming and Linux system debugging; database administration (SQL, MongoDB, Redis); message broker and event streaming experience (Kafka); database performance optimisation skills. At…
between Google's Load Balancer and the HTTP server in our main Elixir application causing HTTP 5XX responses to be returned to our customers. - Debugging an issue in our OpenTelemetry pipelines causing us to silently drop spans. - An enthusiasm for both software development and systems engineering. - A high bar for code and configuration quality and readability. - A good understanding of … to managing our Kubernetes configuration, using ArgoCD and Helm. - We manage a high-availability metrics collection system using Grafana, Thanos & Prometheus. We're in the process of transitioning to OpenTelemetry and Honeycomb for our application telemetry (traces and metrics). - We manage a data pipeline using Pub/Sub, Airbyte, and dbt. Our Current Focus: We're currently driving a … how we think about and monitor reliability across the engineering organisation, with a focus on early detection of customer-impacting issues. We're extending and standardising our use of OpenTelemetry, and introducing Honeycomb as the single place for engineers to understand how our applications are operating in production. This project involves both technical work, on the application libraries and infrastructure…
is the bulk of the headcount. They would also want good knowledge of: Cloud (AWS, OnPrem); Microservices (K8s, Kafka); IaC (Terraform); CI/CD (GitOps, GitHub Actions, ArgoCD); Monitoring (OpenTelemetry, Prometheus, Grafana); Security (Vault, IAM, OPA, SOC2, GDPR). What’s in it for you? Annual bonus, share options, L&D fund, private medical, hybrid/flexi working. The chance to…
Generative AI integration. Programming experience in Go (Golang), Java, Kotlin, JavaScript/Node.js, or Python. Strong hands-on experience with Kubernetes (K8s) and OpenShift. Experience with MongoDB, Kafka, Prometheus, OpenTelemetry, Grafana. Familiarity with tools like Helm, Kustomize, Terraform, and Vault. Proven experience with hybrid cloud environments (on-prem + public cloud). Ability to explain complex ideas clearly to both technical…
edge team driving the future of observability! We're looking for a Monitoring & Observability Engineer to lead the design and deployment of robust monitoring solutions using Splunk, Dynatrace, and OpenTelemetry (OTel) in a fast-paced, tech-forward environment. Key Responsibilities: Design and implement end-to-end monitoring pipelines (Splunk, Dynatrace, OTel). Build and maintain dashboards and queries in Splunk.
London, England, United Kingdom Hybrid / WFH Options
Ocho
warehouses like Snowflake • Day-2 operations experience including observability, debugging, and triage. Desirable Skills: • Experience with Auth0, AWS Cognito, or similar identity platforms • Familiarity with Helm, Prometheus, Grafana, or OpenTelemetry • Exposure to other cloud platforms (GCP, Azure) • CI/CD pipeline development for containerised/serverless apps. Why Join: • Shape a cutting-edge FinOps platform with real impact on how…
with cross-functional engineering teams. Experience working in Linux-based environments. Bonus/Nice-to-Have Skills: Experience deploying Grafana instances via code (provisioning dashboards, datasources). Familiarity with OpenTelemetry, metric instrumentation, and telemetry pipelines. Background in data center environments, infrastructure monitoring, or SRE practices. Exposure to CI/CD workflows, containers (Podman/Docker), and cloud-native systems.
Driven Architecture using AWS services (SNS, SQS, EventBridge). Knowledge of GraphQL, WebSockets, or real-time data streaming. Exposure to DevOps and observability practices (e.g., Prometheus, Datadog, AWS CloudWatch, OpenTelemetry). Prior experience in leading distributed engineering teams.