GPS). Our teams operate across the UK, Germany, France, and India, delivering complex, enterprise-grade IT solutions and consultancy across infrastructure, cloud, and modern operations. As a Monitoring & Observability Engineer, you'll work in high-impact delivery teams that support some of the worlds most well-known organisations. Youll play a key role in helping our customers achieve greater … visibility, performance, and reliability across their IT estatescontributing to their operational success through proactive insight and incident prevention. What you'll do Design, implement, and manage observability solutions using industry-leading tools such as Dynatrace (primary), Grafana, and Splunk Collect and analyse telemetry data (metrics, logs, traces, events) to diagnose and resolve system and application performance issues Integrate monitoring platforms … with ITSM tools (e.g. ServiceNow) and CI/CD pipelines to enable proactive alerting and resolution workflows Act as a Monitoring & Observability SME within customer delivery teams Support incident response activities and postmortems by identifying patterns, root causes, and optimisation opportunities Work collaboratively with cross-functional teams to define and implement best practices in observability and monitoring Attend customer and More ❯
Select how often (in days) to receive an alert: Monitoring & Observability Engineer Life on the team At Computacenter, you'll be joining a world-class team of over 1,000 skilled professionals within Group Professional Services (GPS). Our teams operate across the UK, Germany, France, and India, delivering complex, enterprise-grade IT solutions and consultancy across infrastructure, cloud, and … modern operations. As a Monitoring & Observability Engineer, you'll work in high-impact delivery teams that support some of the world's most well-known organisations. You'll play a key role in helping our customers achieve greater visibility, performance, and reliability across their IT estates-contributing to their operational success through proactive insight and incident prevention. What you'll … do Design, implement, and manage observability solutions using industry-leading tools such as Dynatrace (primary), Grafana, and Splunk Collect and analyse telemetry data (metrics, logs, traces, events) to diagnose and resolve system and application performance issues Integrate monitoring platforms with ITSM tools (e.g. ServiceNow) and CI/CD pipelines to enable proactive alerting and resolution workflows Act as a Monitoring More ❯
one or more public cloud providers such as Azure, AWS or GCP Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar. Proven track record of maintaining highly-available and performant production environments. Ability to identify and implement effective mitigation strategies and More ❯
to L3 networking Programming languages, such as C#, Python, Perl, Java, C++ CICD tools such as Azure DevOps, GitHub Actions, Gitlab, Jenkins, TeamCity Scripting languages such as PowerShell, bash Observability/Monitoring: Prometheus, Grafana, Splunk Containerisation tools such as Docker, K8S, OpenShift, EC, containers Hosting technologies such as IIS, nginx, Apache, App Service, LightSail Analytical and creative approach to problem More ❯
will involve designing robust software solutions that enhance system performance while ensuring high availability for critical applications. You will work hand-in-hand with product engineering teams to improve observability tools and telemetry systems, driving forward automation initiatives that reduce manual intervention. By participating in incident management processes-facilitating transparent communication with stakeholders and leading blameless post-mortems-you will … a focus on automating these activities wherever possible.* Provide on-call support during production incidents outside standard working hours as required by the business needs.* Contribute to enhancing product observability and telemetry by supporting ongoing modernisation efforts within the infrastructure.* Collaborate closely with engineering teams to brainstorm ideas that simplify infrastructure management and streamline SRE practices. What you bring: * Proficiency More ❯
our enterprise messaging infrastructure, ensuring high availability, optimal performance, and reliability across production and non-production environments. This includes working on incident response, capacity planning, network optimization, and system observability using industry-standard monitoring tools. Required Skills & Qualifications: 3+ years of experience administering enterprise-grade messaging systems. Strong background in production support, preferably in a 24x7 enterprise environment. Experience working More ❯
our enterprise messaging infrastructure, ensuring high availability, optimal performance, and reliability across production and non-production environments. This includes working on incident response, capacity planning, network optimization, and system observability using industry-standard monitoring tools. Required Skills & Qualifications: 3+ years of experience administering enterprise-grade messaging systems. Strong background in production support, preferably in a 24x7 enterprise environment. Experience working More ❯
access to the best tools available. We combine problem-solving skills with software and systems engineering to take a proactive approach in building fault-tolerant and secure systems, improving observability and zealously automating away toil. In this role you will: Use your site reliability expertise to design, operate and support Preqin's infrastructure, middleware and internal services. Improving their performance More ❯
london, south east england, united kingdom Hybrid / WFH Options
BlackRock, Inc
access to the best tools available. We combine problem-solving skills with software and systems engineering to take a proactive approach in building fault-tolerant and secure systems, improving observability and zealously automating away toil. In this role you will: Use your site reliability expertise to design, operate and support Preqin's infrastructure, middleware and internal services. Improving their performance More ❯
to architect secure, performant, and highly available cloud solutions. Proficiency with monitoring and log analytics tools such as AWS CloudWatch, ELK Stack, Prometheus, Datadog, or New Relic, to maintain observability and ensure operational excellence. Demonstrated leadership skills in managing complex, high-pressure situations and guiding teams through incident resolution. Exceptional communication and presentation skills, with proven experience engaging with senior More ❯
reliability of cloud and hybrid infrastructure powering some of the most critical client-facing applications in financial services. You will be the strategic and operational leader for platform reliability, observability, incident response, CI/CD modernisation, and developer productivity. You will drive automation, lead with metrics, and build systems and teams that proactively address issues before they impact clients. Key … and services. Implement a comprehensive incident management lifecycle (on-call, escalation, RCA, blameless postmortems). Reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) through automated observability, alerting, and playbooks. CI/CD and Platform Engineering Oversee the development and evolution of CI/CD pipelines for all GIDS products using GitHub Actions, ArgoCD, TeamCity, Octopus Deploy … and GitOps principles. Integrate static and dynamic code analysis, vulnerability scanning, artifact promotion, and release gating into the SDLC. Ensure pipeline scalability and governance while maintaining developer velocity. Observability & Troubleshooting Lead the implementation and usage of modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Datadog). Establish SLOs, SLIs, and error budgets with product and engineering teams. Drive root cause More ❯
reliability of cloud and hybrid infrastructure powering some of the most critical client-facing applications in financial services. You will be the strategic and operational leader for platform reliability, observability, incident response, CI/CD modernisation, and developer productivity. You will drive automation, lead with metrics, and build systems and teams that proactively address issues before they impact clients. Key … and services. Implement a comprehensive incident management lifecycle (on-call, escalation, RCA, blameless postmortems). Reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) through automated observability, alerting, and playbooks. CI/CD and Platform Engineering Oversee the development and evolution of CI/CD pipelines for all GIDS products using GitHub Actions, ArgoCD, TeamCity, Octopus Deploy … and GitOps principles. Integrate static and dynamic code analysis, vulnerability scanning, artifact promotion, and release gating into the SDLC. Ensure pipeline scalability and governance while maintaining developer velocity. Observability & Troubleshooting Lead the implementation and usage of modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Datadog). Establish SLOs, SLIs, and error budgets with product and engineering teams. Drive root cause More ❯
IT workflows. Your responsibilities will also include developing CI/CD pipelines tailored for IT infrastructure, enhancing deployment efficiency, and integrating robust network security measures. You will establish comprehensive observability and proactive issue resolution strategies. We are seeking individuals passionate about network automation, security, and scalable IT solutions that enhance both campus and cloud network operations. You should possess extensive More ❯
or DevOps Expertise in microservices and API design Docker, and container runtime platforms such as Kubernetes, EKS, ECS etc Strong understand of operational concepts on AWS, particularly monitoring and observability, FinOps Utilising CI/CD tools, such as Bamboo, Jenkins, TeamCity, Bitbucket, in order to streamline delivery of new features and fixes Continual testing of code using Automated Testing Frameworks More ❯
ensure code quality and reliability; Experience of work with Docker for containerisation and application packaging; Experience of implementing and managing monitoring solutions, with experience in Prometheus and Grafana for observability and alerting. Experience of implementing and managing robust security practices, including Encryption (TLS) and Secret Management in the Cloud; Experience of leveraging GitLab API for advanced automation, integration, and reporting More ❯
upgrades, ensuring comprehensive testing across third-party and custom-built applications. Establish Advanced Performance Engineering: Establish a robust performance engineering strategy, integrating advanced tools for application performance monitoring (APM), observability, and telemetry. Focus on early identification of performance bottlenecks and quality assurance measures tailored for large-scale enterprise systems, ensuring seamless functionality across platforms. Collaborate Across Cross-Functional Teams/ More ❯
CI/CD pipelines, infrastructure as code (IaC), and automated testing. Experience with industry-standard monitoring tools (ITRS or similar) Proficiency in managing Kubernetes clusters, including deployment, scaling, storage, observability, and lifecycle management Understanding of financial regulations and reporting requirements in Europe such as MiFID II Person Profile The role will suit someone who relishes the prospect of supporting an More ❯
Tuesdays, Thursdays WFH) Pay: negotiable, inside IR35 We're looking for an experienced DevOps Engineer to join our team on a contract basis, with a focus on AWS infrastructure, observability tooling, and CI/CD automation. This is a hands-on role supporting high-availability systems, rapid deployments, and production incident response. Key Responsibilities - Manage and monitor AWS infrastructure for … performance and security - Respond to production incidents, perform root cause analysis, and implement fixes - Maintain observability tools (Prometheus, Grafana, Splunk) and write PromQL queries - Improve and operate CI/CD pipelines using GitHub Actions and Kubernetes - Automate infrastructure tasks with Python, Bash, Go or SQL - Work with Git-based workflows for infrastructure as code - Troubleshoot Kubernetes workloads and containerised services More ❯
South West London, London, England, United Kingdom
Oscar Technology
experienced Site Reliability Engineer (SRE) to join them on a 6-month contract (outside IR35) You'll be leading efforts acriss AWS and Azure Cloud environments, focusing on automation, observability, infrastructure as code and performance at scale. Stakeholder engagements and strong communication is essential in this role, so if you've been in a start-up/smaller team- this … scripting (Python, Bash, PowerShell), and cloud architecture Comfortable with containerisation and orchestration ( Docker, Kubernetes ) Understanding of networking, DNS, IAM, and load balancing in cloud environments Hands-on experience with observability tooling and production-level troubleshooting If this sounds like you, it's a great opportunity so apply now! Site Reliability Engineer - AWS/Azure | Outside IR35 | £450-500/day More ❯
infrastructure provisioning and tooling to enhance development efficiency. You will manage Platform Reliability and Infrastructure ensuring a reliable and stable platform. You will oversee YouLend's the Security and Observability frameworks , focusing on platform security, maintaining observability, and providing dashboards for developers to monitor service health. The ideal candidate is someone who has successfully built and scaled platform architectures, led … the ability to work across technical and non-technical teams. Excellent communication skills, with the ability to translate complex technical concepts to business stakeholders. Operational Focus: Expertise in platform observability, monitoring, incident management, and creating highly reliable systems. Experience implementing SLAs, SLOs, and SLIs is a plus. Security & Compliance: In-depth understanding of platform security, data privacy, and regulatory compliance More ❯
infrastructure provisioning and tooling to enhance development efficiency. You will manage Platform Reliability and Infrastructure ensuring a reliable and stable platform. You will oversee YouLend's the Security and Observability frameworks , focusing on platform security, maintaining observability, and providing dashboards for developers to monitor service health. The ideal candidate is someone who has successfully built and scaled platform architectures, led … the ability to work across technical and non-technical teams. Excellent communication skills, with the ability to translate complex technical concepts to business stakeholders. Operational Focus: Expertise in platform observability, monitoring, incident management, and creating highly reliable systems. Experience implementing SLAs, SLOs, and SLIs is a plus. Security & Compliance: In-depth understanding of platform security, data privacy, and regulatory compliance More ❯
incidents using data-driven decision making to minimise downtime and financial impact while leading root cause analysis and conducting blameless post-mortems.* Enhance application health monitoring by implementing robust observability solutions and automating manual processes to improve system resilience.* Drive cost optimisation initiatives and manage capacity resources to ensure efficient and scalable operations across all FX trading platforms.* Collaborate with … Deep technical expertise in Linux/Unix systems administration combined with strong SQL skills and proficiency in scripting languages such as Python or Java.* Demonstrated experience with monitoring and observability tools including Prometheus, Grafana, Splunk, Geneos, OpenTelemetry or Corvil is highly desirable.* Familiarity with cloud platforms as well as containerisation technologies like Kubernetes or Docker alongside CI/CD pipeline More ❯
teams to build cost-effective solutions on GCP while maintaining agility and fostering innovation. This position is perfect for engineers who are passionate about optimising cloud usage, enhancing cost observability, and championing a FinOps culture. What you'll do Partner with engineering, finance and product teams to drive cost-efficiency across GCP Design and implement automation to boost cost optimisation … had GCP certifications (e.g. Professional Cloud DevOps Engineer, Professional Cloud Architect) FinOps Foundation certifications (e.g. Practitioner, Engineer) Familiarity with security tools e.g. Hashicorp Vault, Aquasec, Nexus IQ. Knowledge of observability tools e.g. Dynatrace. Experience in cost management tools e.g. Cloudability. About working for us Our focus is to ensure we're inclusive every day, building an organisation that reflects modern More ❯