set targets will be expected. VARIED DAY TO DAY RESPONSIBILITIES Ensuring system reliability, performance, and scalability through monitoring and automation Building and maintaining observability solutions using Grafana, Prometheus, Loki, OpenTelemetry Proactively identifying and resolving performance bottlenecks and infrastructure issues Automating infrastructure provisioning, configuration management, and deployments Implementing effective logging, monitoring, and alerting strategies Managing incident response and post-mortem processes … robust observability, monitoring and logging solutions Strong proficiency with observability and monitoring tools such as Grafana, Prometheus, and Loki Strong experience with distributed tracing and telemetry tools such as OpenTelemetry An understanding of cloud networking architecture and load balancing techniques Experience with container orchestration platforms like Kubernetes Proficiency in infrastructure as code (IaC) tools such as Terraform or Ansible Strong More ❯
Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFX, Instana, Splunk, Honeycomb, Jaeger Hands-on experience with Infrastructure as Code (Terraform/Ansible) Hands-on experience in technical integrations (OpenTelemetry/Fluentd/Fluent Bit/Filebeat/Logstash) Hands-on experience with complex troubleshooting of Kubernetes and Docker containers Good knowledge of Regex, Lucene, PromQL Good knowledge of Linux Experience More ❯
Bring: Strong hands-on experience with cloud platforms (AWS, GCP, Azure) and DevOps tooling Familiarity with observability stacks like Grafana, Prometheus, Datadog, Splunk, Kibana, etc. Experience with technical integrations (OpenTelemetry, Fluentd, Fluentbit, Filebeat, etc.) Skilled in troubleshooting Kubernetes and containerised environments Strong communication skills — able to engage with technical teams and senior stakeholders Comfortable working in fast-paced environments and More ❯
Cambridgeshire, England, United Kingdom Hybrid / WFH Options
SoCode Recruitment
Management and DevOps Pipelines and AWS including EKS, Lambda and CloudFormation Modern C#/.NET Infrastructure as Code and GitOps: Terraform, Bicep, Pulumi, ArgoCD and FluxCD Observability: Prometheus, Grafana, OpenTelemetry and Datadog Security and Compliance: HashiCorp Vault, Azure Key Vault, AWS KMS, OPA Gatekeeper and Drata or similar AI Coding Tools: GitHub Copilot, Cursor, Claude Code 📩 Interested in exploring this More ❯
systems administration combined with strong SQL skills and proficiency in scripting languages such as Python or Java. Demonstrated experience with monitoring and observability tools including Prometheus, Grafana, Splunk, Geneos, OpenTelemetry or Corvil is highly desirable. Familiarity with cloud platforms as well as containerisation technologies like Kubernetes or Docker alongside CI/CD pipeline management is important for this role. Comprehensive More ❯
CD. Requirements: Expert-level scripting/coding skills in one or more languages (Python/Golang etc.). Expert knowledge of observability systems (Prometheus/ELK/Jaeger/OpenTelemetry/Service Meshes etc.). Experience with configuration management tools (Ansible/Puppet/Kapitan/Terraform). Experience with distributed data platforms (Kafka/Flink/Airflow). Comfortable More ❯
experience. What will help you succeed Preferred Requirements: Experience with query languages such as SQL, SPL, or KQL. Experience with observability and log collectors/pipelines such as FluentBit, OpenTelemetry, Cribl, and Logstash. Experience with web technologies such as HTML, CSS, and JavaScript. Experience with programming/scripting side technologies such as Java, .NET, PHP, Go, Node.js and database. Advanced More ❯
Chesterfield, England, United Kingdom Hybrid / WFH Options
JR United Kingdom
on with Docker (Kubernetes is a plus), infrastructure-as-code, and CI/CD tooling Strong scripting and automation experience in Python and Bash Familiarity with observability stacks (Prometheus, OpenTelemetry, eBPF) Cloud infrastructure experience (AWS/GCP/Azure), with attention to IAM and software supply chain security Curious, persistent, and comfortable experimenting at the lowest levels of the stack More ❯
Sheffield, England, United Kingdom Hybrid / WFH Options
JR United Kingdom
on with Docker (Kubernetes is a plus), infrastructure-as-code, and CI/CD tooling Strong scripting and automation experience in Python and Bash Familiarity with observability stacks (Prometheus, OpenTelemetry, eBPF) Cloud infrastructure experience (AWS/GCP/Azure), with attention to IAM and software supply chain security Curious, persistent, and comfortable experimenting at the lowest levels of the stack More ❯
ML lifecycle tools, model monitoring, and versioning Exposure to tools like KServe, Ray Serve, Triton, or vLLM is a big plus Bonus Points: Experience with observability frameworks like Prometheus or OpenTelemetry Knowledge of ML libraries: TensorFlow, PyTorch, HuggingFace Exposure to Azure or GCP Passion for financial services Requirements: Degree in Computer Science, Engineering, Data Science or similar What We Offer A More ❯
lifecycle tools, model monitoring, and versioning Exposure to tools like KServe, Ray Serve, Triton, or vLLM is a big plus Bonus Points Experience with observability frameworks like Prometheus or OpenTelemetry Knowledge of ML libraries: TensorFlow, PyTorch, HuggingFace Exposure to Azure or GCP Passion for financial services Qualifications Degree in Computer Science, Engineering, Data Science, or similar What We Offer A More ❯
Belfast, Northern Ireland, United Kingdom Hybrid / WFH Options
CME Group Inc
with Linux-based systems and Cloud-based platform(s). Experience and knowledge of working with distributed systems and working with Docker & Kubernetes Exposure to working with metrics & monitoring, OpenTelemetry, Splunk, Prometheus, Grafana, etc. Experience working with Infrastructure as Code Competent programming/scripting skills (Python, Bash, etc.). Strong problem-solving and analytical abilities. Excellent communication and teamwork skills. More ❯
Belfast, Northern Ireland, United Kingdom Hybrid / WFH Options
CME Group
with Linux-based systems and Cloud-based platform(s). Experience and knowledge of working with distributed systems and working with Docker & Kubernetes Exposure to working with metrics & monitoring, OpenTelemetry, Splunk, Prometheus, Grafana, etc. Experience working with Infrastructure as Code Competent programming/scripting skills (Python, Bash, etc.). Strong problem-solving and analytical abilities. Excellent communication and teamwork skills. More ❯
Leeds, England, United Kingdom Hybrid / WFH Options
ZipRecruiter
infrastructure with a strong emphasis on AWS (EKS, MSK, DynamoDB, RDS) Driving automation with Terraform/OpenTofu, scripting (Python, PowerShell), and GitLab CI Enabling observability across services using Prometheus, OpenTelemetry, and custom tooling Implementing secure deployment practices, performance tuning, and cost optimisation strategies Collaborating with engineers and data scientists to ensure platform reliability and rapid iteration What We’re Looking More ❯
Driven Architecture using AWS services (SNS, SQS, EventBridge). Knowledge of GraphQL, WebSockets, or real-time data streaming. Exposure to DevOps and observability practices (e.g., Prometheus, Datadog, AWS CloudWatch, OpenTelemetry). Prior experience in leading distributed engineering teams. Carbon60, Lorien & SRG - The Impellam Group STEM Portfolio are acting as an Employment Business in relation to this vacancy. More ❯
Northern Ireland, United Kingdom Hybrid / WFH Options
Ocho
warehouses like Snowflake • Day-2 operations experience including observability, debugging, and triage Desirable Skills: • Experience with Auth0, AWS Cognito, or similar identity platforms • Familiarity with Helm, Prometheus, Grafana, or OpenTelemetry • Exposure to other cloud platforms (GCP, Azure) • CI/CD pipeline development for containerised/serverless apps Why Join • Shape a cutting-edge FinOps platform with real impact on how More ❯
under test: Containerisation (e.g. Docker), Virtualisation and Provisioning, Workload and job scheduling (e.g. Kubernetes, Ray) on high core-count machines and rack-scale installations, Management and Observability (e.g. Prometheus, OpenTelemetry, DataDog, Splunk, etc.). 10+ years of relevant experience related to quality assurance/testing teams. Experience with the Atlassian suite and CI/CD platforms such as Jenkins; GitHub More ❯
Some other highly valued skills may include: Consumer-Driven Contract Testing experience with tools such as Pact, Spring Cloud Contract. Experience in Cell-Based Architecture. Observability Engineering: Tools & Practices OpenTelemetry, Prometheus, Grafana, distributed tracing, structured logging, service level indicators (SLIs), service level objectives (SLOs). You may be assessed on the key critical skills relevant for success in role More ❯
experience, some of which should have focus on Observability. Excellent knowledge and hands-on experience with monitoring, logging, and tracing tools such as Prometheus, VictoriaMetrics, Grafana, Datadog, New Relic, OpenTelemetry, ELK Stack, or similar. Experience with high volume data storage (Structured and unstructured). A strong technical background, with current capabilities and willingness to get hands on when needed. Excellent More ❯
Lead Engineer Join a digital first bank that's powered by people. Our technology team builds innovative digital solutions rapidly and at scale to deliver the next generation of banking services for our customers around the world. You'll be More ❯
Lead Engineer Join a digital-first bank that’s powered by people. Our technology team builds innovative digital solutions rapidly and at scale to deliver the next generation of banking services More ❯
Stoke-on-Trent, England, United Kingdom Hybrid / WFH Options
ZipRecruiter
Job Description Who we are looking for A Site Reliability Engineer, who will enhance system reliability, observability and performance through a strong engineering approach and assist with incident resolution and best practices. You will have software engineering skills, focusing on More ❯
and cost management of these instances. Experience with tools like Helm, Karpenter, and k9s is essential. The ideal candidate will have experience with collecting logs, traces, and metrics via OpenTelemetry and making these available through AWS products like X-Ray and CloudWatch. These readings should be used to ensure Nexus meets high standards for performance and reliability, or to guide … where uptime and reliability are critical Professional experience with Python Desirable skills include: Experience with PostgreSQL Experience in continuous deployment environments Experience triaging and debugging code issues Familiarity with OpenTelemetry standards and SDKs What is in it for you? Work alongside a talented team in the quantum computing industry. We offer a competitive package, equity, 28 days of paid holiday More ❯