Observability Job Vacancies

301 to 325 of 2,262 Observability Jobs

Senior Software Engineer (TypeScript / React) - Content

Bristol, Avon, South West, United Kingdom
Hybrid / WFH Options
Hargreaves Lansdown
Excited to grow your career? Our purpose is to empower people to save and invest with confidence. We are looking for great people to join us, so please come and invest in YOUR future at HL. We know that sometimes …
Employment Type: Permanent, Part Time
Salary: £80,000

Principal Support Engineer

London, England, United Kingdom
Hybrid / WFH Options
EDB
EDB provides a data and AI platform that enables organizations to harness the full power of Postgres for transactional, analytical, and AI workloads across any cloud, anywhere. EDB empowers enterprises to …

Senior Director – Operations and Reliability Engineering

City of London, England, United Kingdom
The Boston Consulting Group GmbH
Locations: Canary Wharf | Boston. Who We Are: Boston Consulting Group partners with leaders in business and society to tackle their most important challenges and capture their greatest opportunities. BCG was the pioneer in business strategy when it was founded in …

Lead DevOps Engineer

London, England, United Kingdom
Hybrid / WFH Options
Sprout.ai LTD
Salary banding: £90,000 - £110,000, dependent on experience. Working pattern: 1-2 days per week in office. Location: London. About our Engineering Team: As a business which has AI at its core, we need to have a reliable, scalable …

Senior Director - Operations and Reliability Engineering

London, United Kingdom
The Boston Consulting Group GmbH
Locations: Canary Wharf | Boston. Who We Are: Boston Consulting Group partners with leaders in business and society to tackle their most important challenges and capture their greatest opportunities. BCG was the pioneer in business strategy when it was founded in …
Employment Type: Permanent
Salary: GBP Annual

Senior DevOps Engineer

United Kingdom
Hybrid / WFH Options
Complexio Limited
Complexio's Foundational AI works to automate business activities by ingesting whole company data - both structured and unstructured - and making sense of it. Using proprietary models and algorithms, Complexio forms a deep understanding of how humans are interacting and using it. Automation can then replicate and improve these …
Employment Type: Permanent
Salary: GBP Annual

Observability/Monitoring Engineer - Grafana Dashboarding

London Area, United Kingdom
Levy Global
We’re seeking an experienced contractor to support the delivery of observability solutions for a new, large-scale infrastructure environment. This role focuses on developing insightful and automated Grafana dashboards, with a strong emphasis on data integration and actionable telemetry. Required Skills: Excellent, concise communication skills - essential for collaborating with technical teams to shape observability outputs. Deep experience with Grafana … dashboard creation, templating, and performance optimization. Strong understanding of PromQL, VictoriaMetrics, or VictoriaLogs query languages. Ability to interpret and map RESTful API data into observability pipelines and dashboards. Familiarity with IaC outputs and tooling (e.g., Terraform) as data sources for observability. Solid programming ability in Golang (preferred) or Python for automation and integration. Strong collaboration skills to work with cross …

Observability/Monitoring Engineer - Grafana Dashboarding

City of London, London, United Kingdom
Levy Global
We’re seeking an experienced contractor to support the delivery of observability solutions for a new, large-scale infrastructure environment. This role focuses on developing insightful and automated Grafana dashboards, with a strong emphasis on data integration and actionable telemetry. Required Skills: Excellent, concise communication skills - essential for collaborating with technical teams to shape observability outputs. Deep experience with Grafana … dashboard creation, templating, and performance optimization. Strong understanding of PromQL, VictoriaMetrics, or VictoriaLogs query languages. Ability to interpret and map RESTful API data into observability pipelines and dashboards. Familiarity with IaC outputs and tooling (e.g., Terraform) as data sources for observability. Solid programming ability in Golang (preferred) or Python for automation and integration. Strong collaboration skills to work with cross …

Platform Observability Engineer

Bristol, Gloucestershire, United Kingdom
Hybrid / WFH Options
Just Eat Takeaway.com
customers with hundreds of thousands of restaurant, grocery and convenience partners across the globe. About the role: Just Eat Takeaway is seeking an aspiring Engineer to join the Platform Observability team. The team sits within the Platform & Reliability department, which exists to provide global engineering a magnifying glass into their services while driving commercial availability and optimization. The team is … responsible for looking after a wide range of Observability capabilities that underpin our global platforms. As a Platform Engineer, you will support the implementation and continual evolution of these areas, following guidance from senior engineers within the department. In this role, you will be expected to have a passion for technology and a desire to learn. You will have the …
Employment Type: Permanent
Salary: GBP Annual

Senior Site Reliability Engineer

London, United Kingdom
Hybrid / WFH Options
Randstad Technologies Recruitment
Job Title: Senior SRE - Site Reliability Engineering for Observability. Location: London (Mostly Remote | 1 Day/Week in Office). Pay Rate: £50 - £62 per hour (Inside IR35). Contract Duration: Initial 12 Months. Working Hours: 11:00 AM - 7:00 PM. About the Role: We're looking for a Senior Site Reliability Engineer (SRE) to join a high-impact Observability team … monitoring and logging platforms that ensure service reliability, performance, and visibility. If you're passionate about distributed systems, high-throughput data pipelines, and enabling engineering teams with top-tier observability tooling - this is the role for you. What You'll Be Doing: Designing and operating observability platforms (logging, monitoring, alerting) at scale. Managing large, high-performance Elasticsearch clusters and Prometheus … deployments. Building scalable data pipelines using Kafka to process millions of events per second. Developing tools, APIs, and dashboards to enable self-service observability for engineering teams. Automating infrastructure using Terraform and configuration with Ansible. Participating in on-call rotations to ensure platform uptime and responsiveness. What We're Looking For: 5+ years of experience in SRE/DevOps …
Employment Type: Contract
Rate: £50 - £62/hour

Principal Engineer - Reliability Engineering

London, England, United Kingdom
Just Eat Takeaway.com
thousands of restaurant, grocery and convenience partners across the globe. About this role: We are seeking a seasoned Principal Engineer to lead the design, development, and evolution of our Observability Platform, ensuring it meets the needs of our rapidly scaling systems and engineering teams. This role will also focus on leveraging Machine Learning (ML) and Artificial Intelligence (AI) to deliver … system health and drive down Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR). The ideal candidate will be a visionary technologist with deep expertise in observability, monitoring, and distributed systems, capable of driving strategy, architecture, and execution for a world-class platform. These are some of the key ingredients to the role: Architect, design, and implement … a cutting-edge Observability Platform to support metrics, logs, traces, and events at scale. Integrate ML/AI-driven solutions to enhance anomaly detection, root cause analysis, and predictive insights. Lead the development and adoption of platform capabilities to ensure system health, reliability, and performance. Establish and evolve platform standards and best practices to align with the company’s overall …

Software Performance Engineer - UK

London, England, United Kingdom
CluePoints
DEPARTMENT: Product. REPORTS TO: Engineering Manager or Director. The Software Performance Engineer will be responsible for the performance observability and testing of a product domain (i.e., software products within the same business domain). She/He will work closely together with the Domain Architect and the Engineering Director …/Manager to develop an observability/test strategy and operational approach. He will coach and support the different squads in their continuous performance improvement activities (e.g., troubleshooting, bottleneck identification). Responsibilities: Ownership of the Performance Testing & Observability framework and tooling within a product domain. Set up, coach and promote best practices in Performance (testing & observability) across the squads within … a product domain. Design and maintain K6 Test scripts and framework development. Design and maintain performance observability dashboards. Execute Performance Tests for software products within a domain – to identify issues and bottlenecks which may affect performance. Ensure that software products meet performance requirements. Work closely with the Domain Architect and the Engineering Director/Manager to develop an observability …

Linux System Engineer

London Area, United Kingdom
Caspian One
throughput applications. Develop and refine automation solutions using Ansible, Python, and Terraform. Troubleshoot hardware, networking, and performance issues in various environments. Deploy monitoring and log aggregation tools to improve observability. Collaborate with teams to identify bottlenecks and deploy scalable, automated solutions. What We're Looking For: 6+ years of Linux system administration and engineering experience in performance-critical environments. Proficiency … in Python and Bash scripting, with hands-on Ansible experience. Solid networking fundamentals: IP addressing, VLANs, etc. Familiarity with observability tools like Prometheus, Grafana, and ELK. Infrastructure-as-code experience with Terraform and CI/CD pipelines. Proven ability to resolve complex system-level issues and performance challenges. Knowledge of container orchestration tools (Docker/containers, Kubernetes). Desirable: Experience with …

Linux System Engineer

City of London, London, United Kingdom
Caspian One
throughput applications. Develop and refine automation solutions using Ansible, Python, and Terraform. Troubleshoot hardware, networking, and performance issues in various environments. Deploy monitoring and log aggregation tools to improve observability. Collaborate with teams to identify bottlenecks and deploy scalable, automated solutions. What We're Looking For: 6+ years of Linux system administration and engineering experience in performance-critical environments. Proficiency … in Python and Bash scripting, with hands-on Ansible experience. Solid networking fundamentals: IP addressing, VLANs, etc. Familiarity with observability tools like Prometheus, Grafana, and ELK. Infrastructure-as-code experience with Terraform and CI/CD pipelines. Proven ability to resolve complex system-level issues and performance challenges. Knowledge of container orchestration tools (Docker/containers, Kubernetes). Desirable: Experience with …

Site Reliability Engineer

London, United Kingdom
Hybrid / WFH Options
NinjaOne, LLC
SRE team in the Platform Engineering organization and help us scale our products to millions of end-users. We are looking for individuals with a passion for automation and observability, ensuring the quality and availability of our services. Location - We are flexible on remote working from home, if you are based in the UK or Germany. This is a fully … our 24x7 on-call rotation, SCRUM, and deployment planning. Perform Root Cause Analysis (RCA) and provide recommendations for application teams. Improve availability and reduce customer impact using industry-best observability tools. Ensure best-practice and security-minded architecture by influencing design decisions. Create and maintain technical documentation and SOPs. Develop software, scripts, or tooling to improve efficiency and reduce … time of applications and infrastructure. Other duties as needed. About You: 5+ years' experience in Site Reliability Engineer roles. Expert+ level Linux administration, scripting, and troubleshooting. Demonstrable knowledge of Observability tools (Prometheus/Grafana, New Relic, Splunk, DataDog). Comprehensive experience with AWS (Amazon Web Services) and its core capabilities (VPC, EC2, ECS, Route53, Fargate, ALB/NLB distributions, etc). Extensive …
Employment Type: Permanent
Salary: GBP Annual

Infrastructure Specialist

City of London, London, United Kingdom
Ascendion
service mesh solutions across our distributed systems. In this role, you will lead the design and operation of Kong Mesh (based on Kuma) for managing microservices communication, security, and observability at scale. You’ll play a crucial role in defining service-to-service architecture and ensuring platform reliability, scalability, and security. Key Responsibilities: • Lead the design and deployment of Kong … Mesh across our environments (on-prem and cloud). • Define and enforce best practices for service mesh architecture, traffic routing, zero-trust security, observability, and policy enforcement. • Collaborate with infrastructure, security, and development teams to integrate Kong Mesh with CI/CD, monitoring, and logging solutions. • Develop custom policies, plugins, and automation scripts to enhance Kong Mesh capabilities. • Monitor mesh …

Infrastructure Specialist

London Area, United Kingdom
Ascendion
service mesh solutions across our distributed systems. In this role, you will lead the design and operation of Kong Mesh (based on Kuma) for managing microservices communication, security, and observability at scale. You’ll play a crucial role in defining service-to-service architecture and ensuring platform reliability, scalability, and security. Key Responsibilities: • Lead the design and deployment of Kong … Mesh across our environments (on-prem and cloud). • Define and enforce best practices for service mesh architecture, traffic routing, zero-trust security, observability, and policy enforcement. • Collaborate with infrastructure, security, and development teams to integrate Kong Mesh with CI/CD, monitoring, and logging solutions. • Develop custom policies, plugins, and automation scripts to enhance Kong Mesh capabilities. • Monitor mesh …

Senior Engineering Team Lead - Platform

London, England, United Kingdom
dojo
closely with engineers and stakeholders to align platform capabilities with business needs, prioritise and maximise impact across the organisation. Stay at the cutting edge, exploring and advocating for modern observability practices, cloud-native technologies, and industry best practices to push the boundaries of developer experience. What you will bring: A proven track record of leading and developing high-performing engineering … teams, providing mentorship, support, and opportunities for growth. Strong knowledge of software engineering best practices, system design, observability, resilience, and expertise in Telemetry, Prometheus and Grafana. Experience with cloud platforms like GCP, Azure, AWS, etc. Drive the implementation of observability pipelines for different systems and applications across dojo. Strong understanding of containerisation and orchestration technologies like Docker, Kubernetes, etc. Identify … areas of improvement and propose capabilities to enhance the observability platform. Excellent communication and stakeholder management skills, with the ability to advocate for technical solutions and drive adoption across diverse teams. Dojo home and away: We believe our best work happens when we collaborate in-person. These “together days” foster communication, drive innovation and spark our brightest ideas. That's …

Senior Solutions Engineer - Logs

Maidenhead, Berkshire, United Kingdom
dynaTrace software GmbH
a key member of the Dynatrace sales engine and will be responsible for providing excellent technical support to the sales team. You will be the expert on Dynatrace and observability, with a specialization in Log Management and Analytics. Within this exciting role, you will be responsible for executing great demos which demonstrate the Dynatrace unique approach in solving the customer … be filled at a higher level based on candidate experience. What will help you succeed Preferred Requirements: Experience with query languages such as SQL, SPL, or KQL. Experience with observability and log collectors/pipelines such as FluentBit, OpenTelemetry, Cribl, and Logstash. Experience with web technologies such as HTML, CSS, and JavaScript. Experience with programming/scripting side technologies such … OpenShift, Serverless functions, and CI/CD pipelines. Experience with automation like Ansible, Puppet, Terraform, etc. Why you will love being a Dynatracer Dynatrace is a leader in unified observability and security. We provide a culture of excellence with competitive compensation packages designed to recognize and reward performance. Our employees work with the largest cloud providers, including AWS, Microsoft, and …
Employment Type: Permanent
Salary: GBP Annual

Head of Platform Engineering

London, England, United Kingdom
Yolo Group
functions, championing a culture of proactive readiness, efficient release pipelines, robust incident response, and continuous infrastructure improvement. This role ensures maximum uptime, enables safe and frequent deployments, establishes comprehensive observability, and drives effective postmortem practices. They will work closely with Engineering, QA, and Security leadership to embed operational excellence across the software development lifecycle and support the platform’s growth … distributed team of DevOps engineers, SREs, and incident responders; Foster a culture of ownership, continuous improvement, and operational excellence; Define and execute the long-term strategy for system reliability, observability, performance, and incident management; Champion the adoption of modern tooling, technologies, and best practices to enhance resilience and agility; Own and continuously evolve incident response processes, including SLOs, SLAs, and …

Systems Developer – E-Commerce Integrations (Cloud-Native, AI-Driven)

Liverpool, England, United Kingdom
Protein Works
and refine queue-based processing to support asynchronous workflows and event-driven architecture. Work collaboratively with cross-functional teams, including DevOps, Infrastructure, and Product, to deliver robust systems. Leverage observability tools to monitor, alert, and troubleshoot application and integration health. Stay current on AI-driven software development practices (e.g., GPT-assisted development, Agentic AI workflows) and suggest practical implementations. Participate … Prior experience building middleware for data sync, order processing, and internal APIs in a multi-system e-commerce environment. Understanding of architecture patterns: Microservices, SOA, Hexagonal, Modular Monolith. Monitoring & Observability: Grafana, Prometheus, CloudWatch, New Relic, Datadog, etc. Solid grasp of AI trends in software development, particularly in using GPT tools and agentic systems. Education: Mathematics or Computer Science degree (or …

Site Reliability Engineer

Bristol, Gloucestershire, United Kingdom
Hybrid / WFH Options
TwinStream
services. You will be working with multiple feature development teams and the BAU/Support team to define and evolve our cloud & on-prem infrastructure & delivery pipelines, improving system observability, demonstrating performance and capacity improvements and proactively identifying and mitigating reliability risks. Key Responsibilities of the Site Reliability Engineer: Collaborate with Software Engineers to improve reliability and performance in their … subsystems Partner with System Administrators in automating toil and eliminating alerts Evolve observability and monitoring capabilities to identify and solve problems before they impact the business Support development environments to help us achieve our delivery and quality goals Research and evaluate technologies, tools and services to influence buy-vs-build decisions Develop expertise in diverse technical and business domains Expand … in one of our platform languages (Java, Go, Python or similar) Knowledge of cross domain principles & technologies Experience of working in a service management environment Practical applications of using observability patterns in previous systems Creating and monitoring system availability metrics and using those to drive work that reduces downtime There are many great reasons to join our team! Pension Plan …
Employment Type: Permanent
Salary: GBP Annual

Site Reliability Engineer

Bristol, England, United Kingdom
Hybrid / WFH Options
TwinStream
services. You will be working with multiple feature development teams and the BAU/Support team to define and evolve our cloud & on-prem infrastructure & delivery pipelines, improving system observability, demonstrating performance and capacity improvements and proactively identifying and mitigating reliability risks. Key Responsibilities of the Site Reliability Engineer: Collaborate with Software Engineers to improve reliability and performance in their … subsystems Partner with System Administrators in automating toil and eliminating alerts Evolve observability and monitoring capabilities to identify and solve problems before they impact the business Support development environments to help us achieve our delivery and quality goals Research and evaluate technologies, tools and services to influence buy-vs-build decisions Develop expertise in diverse technical and business domains Expand … in one of our platform languages (Java, Go, Python or similar) Knowledge of cross domain principles & technologies Experience of working in a service management environment Practical applications of using observability patterns in previous systems Creating and monitoring system availability metrics and using those to drive work that reduces downtime There are many great reasons to join our team! Pension Plan …

Senior Software Engineer, Fleet

Fleet, England, United Kingdom
Hayden AI Technologies, Inc
global transportation agencies. As a senior engineer, you will play a critical role in designing, building, and scaling cloud services that enable remote device management, over-the-air updates, observability, and high-availability operations for our mobile perception platform. We tackle complex challenges related to scalability, performance, and security to enable smarter and safer cities through cutting-edge innovation. As … future of intelligent transportation systems. Responsibilities: Participate in incident prevention, response, and remediation efforts, learning and applying best practices. Design, build, and maintain scalable cloud services that support device observability, OTA updates, and fleet operations. Lead efforts to improve the reliability, security, and performance of multi-region AWS infrastructure using Infrastructure as Code (IaC) tools. Own CI/CD pipelines …

Junior Cloud Operations Engineer

Reading, England, United Kingdom
Objective Corporation
elevate the end-user experience. This position is designed to fuel your hands-on growth, giving you the chance to master cloud architectures, Continuous Integration/Continuous Deployment pipelines, observability tools, and incident management processes - all while working in a fast-paced, ever-evolving environment. You'll report directly to the Cloud Operations Director in this role, with no people … pipelines and CI/CD processes to streamline releases. Troubleshoot production issues and drive initiatives to prevent future disruptions, keeping systems stable and available. Set up and maintain powerful observability tools (logging, monitoring, alerting) to ensure fast incident detection and resolution. Take part in an on-call rotation, gaining invaluable real-time experience in incident management and root cause analysis. …

Observability
10th Percentile: £57,500
25th Percentile: £65,000
Median: £80,000
75th Percentile: £97,500
90th Percentile: £120,000