help safeguard our enterprise systems and support secure digital transformation. Dynatrace exists to make the world's software work perfectly. Our unified software intelligence platform combines broad and deep observability and continuous runtime application security with the most advanced AIOps to provide answers and intelligent automation from data at an enormous scale. This enables innovators to modernize and automate cloud …
the future of our compute estate, including adoption and optimisation of next-generation infrastructure. Building and maintaining robust systems to run models in live 24/5 environments, with observability, reliability and operational control at their core. Developing and embedding best-in-class MLOps practices. Helping to define our vision for the future of ML and quantitative research, and mapping …
thinking, factoring in Conduktor's deeply technical nature to drive product success. The role would suit someone naturally customer-centric, with experience in enterprise software, big data platforms, or observability products, and who is excited about technologies within the real-time streaming data space. This is a hybrid role and we are looking for folks to join us onsite …
and technical implementation, ensuring that our service offerings are scalable, cost-effective, and aligned with industry best practices. What You'll Love: Architect and enhance Asda's enterprise observability solutions, including Application Performance Monitoring, Logging, Monitoring & Alerting, and Dashboarding. Define and optimize incident management processes, focusing on tooling, integrations, automation, and evolving ServiceNow solution design. Support the …
with our existing systems. Background in human-computer interaction, computational creativity, or writing research. Experience with A/B testing, statistical analysis, and experimental design. Familiarity with modern AI observability and monitoring tools. Published research or deep interest in AI evaluation methodologies. Interest in writing (fiction, non-fiction, essays). However, you are NOT expected to: be a senior software engineer …
DNS, DHCP, Exchange Server 2019, SQL Server, SharePoint 2019. VMware: VMware Cloud Foundation, vSphere, ESXi, NSX-T, and vSAN. Endpoint & Configuration: Windows 10 & 11, Microsoft Endpoint Configuration Manager (MECM). Monitoring & Observability: Microsoft System Center Operations Manager (SCOM). PKI Technologies: Microsoft Certificate Services, Hardware Security Modules (HSMs), and key lifecycle management. Security Clearance: This role is subject to pre-employment screening in …
Commercial teams to continuously improve the user experience, performance, conversion and retention of our global store. Champion engineering excellence across the cluster, from clean architecture and automated testing to observability, security, accessibility, etc. Foster an inclusive, collaborative and high-performing engineering culture. Coach your manager-level reports to strengthen their leadership and help them unlock the full potential of their …
Sheffield, South Yorkshire, Yorkshire, United Kingdom
Experis
opportunities in the evolving world of cloud, digital and platforms. Role purpose/summary: We are seeking an experienced OpenTelemetry Developer to lead the design, development, and deployment of observability solutions in on-premises environments. The ideal candidate will have strong expertise in Go programming, OpenTelemetry instrumentation, and CI/CD automation tailored for enterprise infrastructure. Key Responsibilities: Develop and … diverse infrastructure setups. Design and implement CI/CD pipelines for automated rollout and updates of OTel agents and collectors. Collaborate with infrastructure, DevOps, and application teams to integrate observability into legacy and modern systems. Optimize telemetry data collection, processing, and storage for performance and reliability. Troubleshoot and resolve issues related to observability pipelines and instrumentation. Contribute to internal documentation … Required Skills & Experience: Strong proficiency in Go (Golang), especially for writing modular and reusable code. Hands-on experience with the OpenTelemetry Collector, agents, and SDKs. Proven experience deploying and managing observability tools in on-premises infrastructure. Familiarity with CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions). Experience with Linux systems, networking, and containerization (Docker). Understanding of monitoring …
INSIDE IR35. Start Date: 26/08/2025. Job Type: Contract. Company Introduction: We are seeking an experienced OpenTelemetry Developer to lead the design, development, and deployment of observability solutions in on-premises environments. The ideal candidate will have strong expertise in Go programming, OpenTelemetry instrumentation, and CI/CD automation tailored for enterprise infrastructure. Key Responsibilities: Develop and … diverse infrastructure setups. Design and implement CI/CD pipelines for automated rollout and updates of OTel agents and collectors. Collaborate with infrastructure, DevOps, and application teams to integrate observability into legacy and modern systems. Optimize telemetry data collection, processing, and storage for performance and reliability. Troubleshoot and resolve issues related to observability pipelines and instrumentation. Contribute to internal documentation … Required Skills & Experience: Strong proficiency in Go (Golang), especially for writing modular and reusable code. Hands-on experience with the OpenTelemetry Collector, agents, and SDKs. Proven experience deploying and managing observability tools in on-premises infrastructure. Familiarity with CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions). Experience with Linux systems, networking, and containerization (Docker). Understanding of monitoring …
Handsworth, Yorkshire and the Humber, United Kingdom
Experis
opportunities in the evolving world of cloud, digital and platforms. Role purpose/summary: We are seeking an experienced OpenTelemetry Developer to lead the design, development, and deployment of observability solutions in on-premises environments. The ideal candidate will have strong expertise in Go programming, OpenTelemetry instrumentation, and CI/CD automation tailored for enterprise infrastructure. Key Responsibilities: Develop and … diverse infrastructure setups. Design and implement CI/CD pipelines for automated rollout and updates of OTel agents and collectors. Collaborate with infrastructure, DevOps, and application teams to integrate observability into legacy and modern systems. Optimize telemetry data collection, processing, and storage for performance and reliability. Troubleshoot and resolve issues related to observability pipelines and instrumentation. Contribute to internal documentation … Required Skills & Experience: Strong proficiency in Go (Golang), especially for writing modular and reusable code. Hands-on experience with the OpenTelemetry Collector, agents, and SDKs. Proven experience deploying and managing observability tools in on-premises infrastructure. Familiarity with CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions). Experience with Linux systems, networking, and containerization (Docker). Understanding of monitoring …
McDonald's is on an exciting journey to revolutionize our observability and AIOps capabilities. Our mission is to transform data into actionable insights and drive operational excellence across our organization. We are seeking a dedicated team member who will play a pivotal role in bridging the gap between our technology stakeholders and the observability and AIOps teams. In this role … crucial liaison, representing the voice of technology during the development of processes and governance frameworks that will bring our new function to life. Your expertise in IT operations and observability, combined with your excellent communication skills, will be essential in informing the design of critical capabilities such as incident management, monitoring, and automation. You will ensure that future-state processes … potential issues before they impact operations. Data Pipeline Management: Develop and maintain ETL pipelines within Cribl to ensure data collection and processing. Collaboration: Work closely with development, operations, and observability teams to integrate AI solutions into the everyday workflow. Incident Management: Employ AIOps for proactive incident management and resolution. Performance Optimization: Continuously optimize system performance using AI insights and recommendations. …
and help shape how platform engineering is done as the team continues to scale. Tech stack: AWS (core services: EC2, RDS, S3, IAM, etc.); configuration management: Ansible; monitoring and observability: Grafana, Prometheus; Kubernetes (building and managing production clusters); Terraform (IaC provisioning); GitHub Actions (CI/CD pipelines). What They’re Looking For: Experience in AWS cloud infrastructure (ideally in a … regulated or high-traffic environment). Previous experience working with monitoring and observability tools. Hands-on Kubernetes know-how, specifically with EKS. Solid IaC experience with Terraform. Experience with containerisation (Docker, Helm) and CI/CD (GitHub Actions or similar). A good communicator who enjoys working collaboratively across product and engineering. The client is willing to take someone that doesn't …
of our team's purpose. Write automation to scale systems sustainably, prevent service issues, or, when they occur, quickly recover service. Partner with development teams to improve system reliability, observability, and release velocity. Participate in on-call rotations, incident response, postmortems, and root cause analysis and resolution. Be a vocal advocate of strong/sound engineering practices that allow us … Azure, AWS, or GCP. Preferred qualifications: Minimum 8-10 years in the industry. Experience with DevOps concepts and ways of working. Experience with algorithms and data structures. Experience in observability practices with logging, metrics, tracing, and alerting. Experience with Infrastructure as Code. Understanding of identity and access management, and application security. We use Datadog and BigPanda for our observability stack …
We are seeking a Production Engineer with strong systems engineering fundamentals and an operational mindset to join our global technology team. You will focus on the resilience, automation, and observability of production systems that power a mission-critical quantitative trading platform. The role is based in London and forms part of a follow-the-sun global support model. This is … and recovery processes. Key Responsibilities (Primary Duties). Platform Engineering & Automation (Core SRE Focus, 50%): Build and maintain automated tools for deployment, health checks, alerts, and runbooks. Lead efforts in observability, including metrics instrumentation, logging, and dashboards. Develop self-healing mechanisms for recurring production issues. Continuously reduce manual operational work ("toil") through scripting. Reliability Engineering & Incident Management (30%): Monitor health of … and procedures. Skills, Knowledge and Expertise (Experience, Knowledge & Skills): 5+ years in a production-facing engineering role within finance or other mission-critical tech domains. Proven experience with automation, observability, and incident response in distributed systems. Comfort with scripting and systems programming (Python, Bash). Experience with config management and container orchestration tools. Strong communication and debugging skills, especially under …
London, South East, England, United Kingdom Hybrid / WFH Options
Harnham - Data & Analytics Recruitment
SaaS hosting. Implement GitOps deployment workflows using ArgoCD. Create and manage infrastructure as code with Terraform. Set up CI/CD pipelines for infrastructure and application deployment. Implement monitoring, observability, and cloud cost optimisation (FinOps). Collaborate with ML engineers to fine-tune infrastructure for large-scale model training. What You'll Bring: 5+ years in cloud infrastructure/DevOps roles … GitOps tools, and CI/CD (GitHub Actions preferred). Proficiency in Python and scripting for automation. Solid understanding of cloud networking, security, and cross-cloud connectivity. Experience in monitoring, observability, and cost optimisation. Nice to Have: Experience with ML tooling (MLflow, Kubeflow). Knowledge of FastAPI, Databricks, or Snowflake. Exposure to SRE practices or cloud security certifications. Familiarity with Prometheus, Grafana …
automation. Ensure end-to-end network automation to improve operational efficiency, agility, and reliability. Drive zero-trust network security principles, ensuring compliance and proactive threat mitigation. Establish a global observability and telemetry framework for real-time network insights. Align network strategies with business growth, cloud-first initiatives, and digital transformation. Network Infrastructure & Cloud Networking: Oversee global network architecture, spanning data … response using AI-driven network analytics. Ensure high availability, network resilience, and 24x7 operational support. Develop a follow-the-sun support model, ensuring global network performance optimization. Implement network observability and predictive analytics to proactively prevent outages. Security, Compliance & Risk Management: Drive zero-trust security frameworks, ensuring secure and resilient network access. Ensure adherence to ISO 27001, NIST, SOC … role, managing large-scale global network environments. Deep expertise in cloud networking (AWS, Azure, GCP), SD-WAN, and network automation. Proven track record in end-to-end network automation, observability, and self-healing networks. Experience in AI-driven networking, predictive analytics, and network telemetry. Strong understanding of zero-trust networking, compliance frameworks, and security policies. Excellent leadership, communication, and stakeholder …
Engineer, you will: Build and automate IaaS and PaaS platforms across public, private, and hybrid cloud environments. Create and manage solutions such as landing zones, container platforms, DevSecOps pipelines, observability stacks, and integration layers. Use modern tooling like Terraform, CI/CD pipelines, and cloud-native security frameworks. Collaborate with product teams, cloud architects, and stakeholders to rapidly deliver working … consultancy mindset: adaptable, delivery-focused, and comfortable with ambiguity. Experience in the following areas is highly desirable: designing and building multi-cloud or hybrid platforms; implementing cloud-native operations, observability, or SRE practices; working with Kubernetes, container orchestration, and modern networking patterns; securing cloud infrastructure and deploying secure coding practices (DevSecOps); migrating legacy workloads to the cloud using agile methodologies …
and services, and own them completely, from architecture to operation. Drive the architectural vision for our platform, making key decisions on technologies like Kubernetes, Infrastructure as Code, and our observability stack. Bring deep platform expertise to the table, leveling up the entire team through mentorship, architectural guidance, and by championing best practices. Grow with WunderGraph as we scale, expanding your … role focuses on, but is not limited to: Architecting, building, and operating the core cloud-native infrastructure for WunderGraph Cosmo, primarily using Go and Kubernetes. Owning and evolving our observability stack (OpenTelemetry, Prometheus, ClickHouse) and the infrastructure supporting our AI-driven features to ensure deep, actionable insights into our systems. Building and optimizing CI/CD pipelines to improve build … strong understanding of system architecture, distributed systems, and the challenges of running high-performance API gateways. Familiarity with GraphQL Federation is a significant plus. Experience building or managing modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, ClickHouse). A self-starter attitude and a leader's mindset: you are comfortable with ambiguity, can identify and solve ill-defined problems, and don …
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
Suits Me Limited
across multiple squads to ensure our platform is scalable, secure, and designed for rapid deployment and operational excellence. You'll contribute to the development and automation of cloud infrastructure, observability systems, CI/CD pipelines, and event-based services that power key parts of our product ecosystem. About Suits Me: Suits Me is a multi-award-winning, ethical fintech dedicated … pipelines (e.g. GitHub Actions) to enable rapid and reliable delivery of services. Contributing to the design of scalable and secure platform components that enable developer productivity. Building and improving observability tooling (e.g. CloudWatch, Grafana) to support rapid detection and resolution of issues. Collaborating with developers and stakeholders across squads to understand infrastructure needs and ensure best practices are applied. Writing …
Zopa with UK retailers and marketplaces. In this role, you'll ensure our systems are reliable, scalable, and secure. You'll help automate deployments, evolve our cloud infrastructure, and improve observability and developer experience, making it easier for product teams to deliver quality software quickly and safely. Why Zopa Manchester? We're building a new tech hub right in the heart … platform and developer experience teams. Ensuring our container platforms (including Kubernetes) are reliable, secure, and up to date. Designing scalable, self-service tools to reduce operational toil. Supporting infrastructure observability through metrics, tracing, and alerting. Working closely with product teams to foster a culture of reliability engineering. About you: 4+ years in a Platform/Site Reliability Engineering or similar …
management for Windows workloads. Create tooling and automation around the deployment of a customer-specific Windows-based SaaS product. Ensure high availability, reliability, and scalability of Windows services. Integrate observability tooling (metrics, logs, traces) into IIS-hosted services. Harden Windows infrastructure for security, compliance, and operational best practices. Lead incident response for Windows-related systems. Contribute to internal documentation and … Windows internals. Proven ability to build infrastructure-as-code and CI/CD for Windows environments. Comfort wrapping a Windows software product with the surrounding infrastructure, services, automation, and observability required to run it as a SaaS offering. Hands-on experience administering cloud infrastructure or building cloud-native applications (preferably on AWS). Comfortable using AWS EC2. Proficiency with command-line …
CI/CD pipelines with GitLab CI or Jenkins to enable fast, secure, and reliable software delivery. o Champion Kubernetes-based platforms using Amazon EKS and Istio Service Mesh to build scalable, service-oriented architectures. o Drive observability and reliability engineering through proactive monitoring, alerting, and incident response strategies. o Mentor and guide DevOps engineers, fostering a culture of continuous improvement, automation, and operational excellence. o Collaborate cross-functionally with … We're looking for someone with deep expertise in: o Infrastructure as Code: Terraform, CloudFormation. o Security best practices: IAM, KMS, encryption in transit/at rest, DevSecOps. o Monitoring & observability: Datadog, Prometheus, Grafana, ELK, or similar. What You Bring: o 6+ years in DevOps or platform engineering, with experience in a technical lead role. o Proven experience designing and operating …
Position Summary: We are looking for an experienced Systems Engineer with strong Linux and Kubernetes experience to join our Group Engineering - Systems team. You will help design, build and operate modern infrastructure platforms that support continually evolving applications and services. …