Observability Jobs in England

601 to 625 of 687 Observability Jobs in England

Senior Agentic AI Engineer

Birmingham, England, United Kingdom
Method Resourcing
act? This is a chance to design and deliver agentic AI systems on Azure that automate real business workflows through tool use, retrieval, and reasoning, with the reliability and observability of true production engineering. In this position you’ll take ownership of designing and scaling end-to-end agentic solutions on Azure, combining LLMs, APIs, and orchestration frameworks to deliver … Productionise on Azure using AI Foundry/OpenAI, Azure ML, Functions, Event Grid/Service Bus, and Kubernetes. Build LLMOps pipelines for evaluation, monitoring, safety, and cost control. Define observability standards across prompts, tools, and data flows. Establish governance patterns, safety, privacy, and auditability. Stay hands-on with critical code paths while guiding architecture and best practice. 🧠Required Skills/

Linux Production Engineer

London Area, United Kingdom
Autonomai Recruitment
experience building technology 0→1, owning systems end-to-end, and working close to the metal. They will operate across everything from bare-metal Linux to modern build and observability stacks. Linux Platform Engineer – Trading Infrastructure Overview The firm is seeking a Linux Platform Engineer to join a small, high-impact engineering group supporting ML/AI-driven trading. … latency. Contribute to kernel-level debugging and system improvements. Automate Linux fleet builds, creating consistent, reproducible systems. Manage Kubernetes cluster infrastructure, networking, and container orchestration. Enhance observability. Analyze and optimize networking across the full TCP/IP stack. Investigate core dumps, memory bottlenecks, and CPU performance issues across distributed systems. Develop Python tooling for internal automation

Linux Production Engineer

City of London, London, United Kingdom
Autonomai Recruitment
experience building technology 0→1, owning systems end-to-end, and working close to the metal. They will operate across everything from bare-metal Linux to modern build and observability stacks. Linux Platform Engineer – Trading Infrastructure Overview The firm is seeking a Linux Platform Engineer to join a small, high-impact engineering group supporting ML/AI-driven trading. … latency. Contribute to kernel-level debugging and system improvements. Automate Linux fleet builds, creating consistent, reproducible systems. Manage Kubernetes cluster infrastructure, networking, and container orchestration. Enhance observability. Analyze and optimize networking across the full TCP/IP stack. Investigate core dumps, memory bottlenecks, and CPU performance issues across distributed systems. Develop Python tooling for internal automation

Head of Platform Engineering

Surrey, England, United Kingdom
Hybrid/Remote Options
La Fosse
scale. This is a pivotal, visible role reporting directly to the CTO. The Opportunity You’ll shape the operational strategy and modernise how the platform is managed, driving reliability, observability, automation, and cost efficiency. You’ll manage the Head of DevOps and work closely with Product, Engineering and Finance to ensure the platform is secure, resilient, scalable and commercially efficient. … IT, Security and Platform Operations Reliability, performance and availability of a cloud-native SaaS platform (AWS/serverless) Cost-to-Serve ownership and cloud cost visibility/optimisation Maturing observability, incident management & operational governance Uplifting DevOps engineering practices and platform automation, use of AI Vendor & outsourced IT partner management Supporting a high-change organisation scaling for enterprise success What You

Director - Performance and Reliability

Cambridgeshire, England, United Kingdom
Sanderson
lead performance testing and chaos engineering initiatives, and embed reliability best practices across engineering, DevOps, and infrastructure teams. This is a senior, strategic leadership role focused on system excellence, observability, and continuous improvement. Ideal Candidate: Proven experience leading Performance Engineering, Reliability, or SRE functions Deep expertise in performance testing methodologies (load, stress, spike, soak) Strong hands-on background with LoadRunner … strategy across critical platforms and services Oversee load, stress, and chaos testing initiatives to ensure systems perform and recover under real-world conditions Define and drive best practices for observability, monitoring, and APM adoption using tools like Dynatrace Drive incident reduction, faster recovery (MTTR), and continuous reliability improvements Champion a culture of performance ownership, ensuring teams build with scalability, stability
Employment Type: Full-Time
Salary: £84,000 - £95,000 per annum, Negotiable, Inc benefits

Director - Performance and Reliability

Cambridgeshire, East Anglia, United Kingdom
Sanderson Recruitment
lead performance testing and chaos engineering initiatives, and embed reliability best practices across engineering, DevOps, and infrastructure teams. This is a senior, strategic leadership role focused on system excellence, observability, and continuous improvement. Ideal Candidate: Proven experience leading Performance Engineering, Reliability, or SRE functions Deep expertise in performance testing methodologies (load, stress, spike, soak) Strong hands-on background with LoadRunner … strategy across critical platforms and services Oversee load, stress, and chaos testing initiatives to ensure systems perform and recover under real-world conditions Define and drive best practices for observability, monitoring, and APM adoption using tools like Dynatrace Drive incident reduction, faster recovery (MTTR), and continuous reliability improvements Champion a culture of performance ownership, ensuring teams build with scalability, stability
Employment Type: Permanent
Salary: £95,000

Software Developer

Hammersmith, England, United Kingdom
OpenSource
data sources using advanced web scraping and reverse-engineering techniques. Developing and maintaining low-latency, real-time data feeds to support internal systems and strategies. Improving internal visibility and observability tooling to help diagnose integration issues and identify improvements. Contributing across the full lifecycle of your work: design, development, testing, review, deployment, and ongoing support. Working within an agile, flexible … a rotational basis. Tech Stack Languages: Python (3.10+), plus TypeScript/JavaScript for frontend work, and occasional Go for infrastructure tasks. Messaging: RabbitMQ, Kafka Storage: PostgreSQL, Redis Environment: Linux Observability: OpenTelemetry, Prometheus, Grafana, Zabbix Requirements Must-haves Strong software development experience, especially with Python. A degree in Computer Science or another numerical discipline. Clear communication skills, able to explain technical

Software Developer

Alderley Edge, Cheshire, United Kingdom
Transunion
Day You'll Be: Design and build reliable backend systems and infrastructure tooling Use TDD to write high-quality, maintainable code and build out automated test suites Own reliability, observability, and performance of key services Collaborate with clients to understand requirements, debug issues, and propose solutions Drive improvements to system architecture, automation, and deployment processes Mentor junior developers and contribute … in writing and on calls Desirable Skills & Experience: Experience owning backend systems in production environments Experience with Cloud Platforms AWS or GCP Infrastructure-as-code, CI/CD, and observability tooling Experience scaling systems under sustained load Contributions to internal tooling or open source Experience with large datasets and machine learning models Impact You'll Make: What's In It
Employment Type: Permanent

Senior DevSecOps engineer

England, United Kingdom
Hybrid/Remote Options
Seccl Technology Limited
handling, JWK publishing, and SSO connection setup. Utilising Infrastructure as Code (Terraform) and CI/CD (GitHub Actions) to manage Auth0 configuration and ensure safe, repeatable deployments. Implementing comprehensive observability for authentication paths with structured logs, monitoring dashboards, alerts, and SLOs. Collaborating closely with product, engineering, and support teams on migration timelines, communications, and incident response. This role's for … and identity configurations, including secure secrets management. Solid understanding of core AWS services relevant to modern authentication patterns, such as API Gateway, Lambda authorisers, and CloudWatch. A commitment to observability, with hands-on experience implementing structured logging, dashboards, and SLOs for critical services. Excellent collaboration skills, demonstrated through participation in design reviews, pairing, and writing clear technical documentation (e.g., runbooks
Employment Type: Permanent
Salary: GBP Annual

Senior DevSecOps engineer

Bath, Somerset, United Kingdom
Hybrid/Remote Options
Seccl Technology Limited
handling, JWK publishing, and SSO connection setup. Utilising Infrastructure as Code (Terraform) and CI/CD (GitHub Actions) to manage Auth0 configuration and ensure safe, repeatable deployments. Implementing comprehensive observability for authentication paths with structured logs, monitoring dashboards, alerts, and SLOs. Collaborating closely with product, engineering, and support teams on migration timelines, communications, and incident response. This role's for … and identity configurations, including secure secrets management. Solid understanding of core AWS services relevant to modern authentication patterns, such as API Gateway, Lambda authorisers, and CloudWatch. A commitment to observability, with hands-on experience implementing structured logging, dashboards, and SLOs for critical services. Excellent collaboration skills, demonstrated through participation in design reviews, pairing, and writing clear technical documentation (e.g., runbooks
Employment Type: Permanent
Salary: GBP Annual

Head of Cloud – Contract (Outside IR35)

Derby, England, United Kingdom
Hybrid/Remote Options
Experis UK
Head of Cloud – Contract (Outside IR35) Location: Hybrid (East Midlands/London 1-2 days/week onsite) Rate: Up to £700/day Contract Type: Outside IR35 Duration: 3-6 months (initial), with potential extension Start Date: ASAP About

Staff Data Engineer

London, United Kingdom
Hybrid/Remote Options
Fruition Group
Job Title: Staff Data Engineer Location: London, Hybrid Salary: c.£140,000 + bonus + share options Why Apply? This is a unique opportunity to take a leading role in shaping the data strategy of a fast-growing Insurtech scale
Employment Type: Permanent

Staff Site Reliability Engineer - Observability

London Area, United Kingdom
Hybrid/Remote Options
Motive Group
Senior/Staff Site Reliability Engineer - Observability | London (Hybrid) If you care deeply about building and operating world-class infrastructure for AI at scale, this one’s worth your time. We’re working with a company that builds the backbone powering some of the most demanding AI workloads on the planet. Think large-scale GPU clusters, global telemetry systems, and … distributed training environments used by leading research and enterprise teams. They’re looking for a Senior or Staff SRE with deep experience in observability at massive scale - someone who’s tuned Prometheus/Mimir, Loki, or Tempo clusters beyond 100M+ series or 10TB/day logs, and who thrives in highly technical, fast-moving environments. You’ll be working on … Designing and scaling observability for globally distributed GPU infrastructure Building automation that cuts operational toil and improves reliability Partnering with platform and infrastructure teams to deliver true visibility across complex AI systems If you’ve built or operated telemetry stacks for large-scale, GPU-heavy, or multi-tenant environments - and want to work on cutting-edge problems in a business

Staff Site Reliability Engineer - Observability

City of London, London, United Kingdom
Hybrid/Remote Options
Motive Group
Senior/Staff Site Reliability Engineer - Observability | London (Hybrid) If you care deeply about building and operating world-class infrastructure for AI at scale, this one’s worth your time. We’re working with a company that builds the backbone powering some of the most demanding AI workloads on the planet. Think large-scale GPU clusters, global telemetry systems, and … distributed training environments used by leading research and enterprise teams. They’re looking for a Senior or Staff SRE with deep experience in observability at massive scale - someone who’s tuned Prometheus/Mimir, Loki, or Tempo clusters beyond 100M+ series or 10TB/day logs, and who thrives in highly technical, fast-moving environments. You’ll be working on … Designing and scaling observability for globally distributed GPU infrastructure Building automation that cuts operational toil and improves reliability Partnering with platform and infrastructure teams to deliver true visibility across complex AI systems If you’ve built or operated telemetry stacks for large-scale, GPU-heavy, or multi-tenant environments - and want to work on cutting-edge problems in a business

Platform Engineer

England, United Kingdom
Hybrid/Remote Options
Harnham
AND RESPONSIBILITIES Support and enhance the company's infrastructure and production systems across GCP. Contribute to a major replatforming project to GKE, ensuring scalability, automation, and security. Improve reliability, observability, and CI/CD pipelines (GitLab CI, Argo, Flux). Work closely with developers to embed best practices and elevate the internal developer experience (IDP). Collaborate within a small … with GCP (Google Cloud Platform). Deep understanding of CI/CD pipelines - GitLab CI, Argo, or Flux. Experience with HashiCorp Vault and open-source tooling. Background in automation, observability, and platform reliability. Excellent problem-solving skills and a collaborative, pragmatic mindset. THE DETAILS Day rate: £550-£650/day (Outside IR35) Contract: 3 months, with scope for extension

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

East London, London, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

City of London, London, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

Altrincham, Greater Manchester, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

Leigh, Greater Manchester, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

Bury, Greater Manchester, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

Bolton, Greater Manchester, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

Leeds, West Yorkshire, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

Central London / West End, London, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Site Reliability Engineer - AWS - Grafana - Cloudwatch - ELK - UK Remote

Ashton-Under-Lyne, Greater Manchester, United Kingdom
Hybrid/Remote Options
Opus Recruitment Solutions
AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability Are you looking for a genuinely Remote opportunity? Somewhere you're part of something bigger, working on a global product within a close-knit SRE team? I've partnered with a web app that provides end-to-end event management for some of the … planet's biggest artists and they're now looking for an SRE. Someone that knows their way around classic Observability with Grafana, ELK stack, and cost optimisation for the product as they continue scaling. Working across the globe, their multi-tenanted AWS environments require someone who is able to reverse engineer product faults, or post-incident audits to ensure future … like to hear more, send over a CV to robin.shaw@opusrs.com or apply! AWS | GCP | SRE | Site Reliability Engineer | Terraform | Cloudformation | ECS | ELK | Elasticsearch | Logstash | Kibana | Cloudwatch | Grafana | Windows | Observability

Senior Data Engineer

London Area, United Kingdom
Hybrid/Remote Options
Identify Solutions
the past year and aggressive expansion across the UK, US, and EU, the company is scaling at pace. Data is the backbone: from APIs and pipelines to governance and observability, their data platform directly powers customer-facing products and AI-driven insights. They’re now hiring a Senior Data Engineer to own and shape this platform, building scalable, production-grade … systems that become the foundation for global brands. Why join? ✨ Greenfield impact – inherit a live but early platform, define best practice across structure, testing, observability, and governance. ✨ Direct product impact – your APIs, pipelines, and orchestration power the platform that 1,000+ brands rely on every day. ✨ AI at the core – work on infrastructure that enables machine learning and intelligent decision … doing: API strategy & development – own and scale FastAPI endpoints that deliver real-time access to platform data. Data pipeline development – build ingestion and replication pipelines with best-in-class observability, latency, and resilience. Platform technical vision – influence architecture and orchestration, shaping how the business handles data at scale. Data quality & governance – embed testing, freshness, lineage, and monitoring to ensure reliability
Observability
England
10th Percentile: £56,250
25th Percentile: £67,500
Median: £80,000
75th Percentile: £105,000
90th Percentile: £146,000