City of London, London, United Kingdom Hybrid / WFH Options
SoTalent
QA teams to ensure robust and scalable integrations. Drive continuous improvement, automation, and cost-optimization across engineering platforms. Provide advanced troubleshooting and 3rd-line production support using tools like Prometheus, Grafana, and ELK Stack. Maintain detailed technical documentation, system diagrams, and operational runbooks. Ensure compliance with data security and regulatory standards (e.g., GDPR, ISO 27001). Contribute to disaster More ❯
South East London, England, United Kingdom Hybrid / WFH Options
SoTalent
QA teams to ensure robust and scalable integrations. Drive continuous improvement, automation, and cost-optimization across engineering platforms. Provide advanced troubleshooting and 3rd-line production support using tools like Prometheus, Grafana, and ELK Stack. Maintain detailed technical documentation, system diagrams, and operational runbooks. Ensure compliance with data security and regulatory standards (e.g., GDPR, ISO 27001). Contribute to disaster More ❯
Southampton, Hampshire, South East, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment Limited
tools such as Jenkins, GitLab CI/CD, or CircleCI Strong understanding of containerisation (e.g., Docker, Kubernetes) and microservices architecture Skilled in using observability and monitoring tools such as Prometheus, Grafana, ELK stack, or AWS CloudWatch Excellent analytical and troubleshooting abilities, especially within complex distributed systems Proven experience handling incident management and conducting blameless postmortems, including leading cross-functional teams More ❯
Hampshire, England, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment
tools such as Jenkins, GitLab CI/CD, or CircleCI Strong understanding of containerisation (e.g., Docker, Kubernetes) and microservices architecture Skilled in using observability and monitoring tools such as Prometheus, Grafana, ELK stack, or AWS CloudWatch Excellent analytical and troubleshooting abilities, especially within complex distributed systems Proven experience handling incident management and conducting blameless postmortems, including leading cross-functional teams More ❯
Portsmouth, England, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment
tools such as Jenkins, GitLab CI/CD, or CircleCI Strong understanding of containerisation (e.g., Docker, Kubernetes) and microservices architecture Skilled in using observability and monitoring tools such as Prometheus, Grafana, ELK stack, or AWS CloudWatch Excellent analytical and troubleshooting abilities, especially within complex distributed systems Proven experience handling incident management and conducting blameless postmortems, including leading cross-functional teams More ❯
London, England, United Kingdom Hybrid / WFH Options
Stratospherec Limited
as Azure, AWS or GCP Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as DataDog, Prometheus, Grafana, or similar. Proven track record of maintaining highly-available and performant production environments. Ability to identify and implement effective mitigation strategies and operational playbooks. Useful/Bonus Skills to More ❯
London, England, United Kingdom Hybrid / WFH Options
ZipRecruiter
tools such as Jenkins, GitLab CI/CD, or CircleCI Strong understanding of containerisation (e.g., Docker, Kubernetes) and microservices architecture Skilled in using observability and monitoring tools such as Prometheus, Grafana, ELK stack, or AWS CloudWatch Excellent analytical and troubleshooting abilities, especially within complex distributed systems Proven experience handling incident management and conducting blameless postmortems, including leading cross-functional teams More ❯
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
Arm Limited
To Have" Skills and Experience: Experience in a GitOps solution such as ArgoCD, Flux or Fleet Implementation of the Security Development Lifecycle (SDL) in infrastructure Monitoring and observability using Prometheus and Grafana, ELK stack or equivalent Use of Kubernetes management systems such as Rancher Familiarity with open source project development cycles and contribution processes, particularly around CI/CD infrastructure More ❯
City of London, London, United Kingdom Hybrid / WFH Options
Amber Labs
teams using tools like Git, Jira, and Confluence Eligible for SC and NPPV3 clearance Desirable: Container orchestration with Kubernetes HashiCorp tools: Vault, Consul, Packer Monitoring and observability with Grafana, Prometheus, or similar Familiarity with cloud networking, VPCs, NAT Gateways, security groups, etc. Personal Attributes: Proactive and self-driven with a passion for technology Strong problem-solving mindset Collaborative team player More ❯
South East London, England, United Kingdom Hybrid / WFH Options
Amber Labs
teams using tools like Git, Jira, and Confluence Eligible for SC and NPPV3 clearance Desirable: Container orchestration with Kubernetes HashiCorp tools: Vault, Consul, Packer Monitoring and observability with Grafana, Prometheus, or similar Familiarity with cloud networking, VPCs, NAT Gateways, security groups, etc. Personal Attributes: Proactive and self-driven with a passion for technology Strong problem-solving mindset Collaborative team player More ❯
London, England, United Kingdom Hybrid / WFH Options
Amber Labs Limited
teams using tools like Git, Jira, and Confluence Eligible for SC and NPPV3 clearance Desirable Container orchestration with Kubernetes HashiCorp tools: Vault, Consul, Packer Monitoring and observability with Grafana, Prometheus, or similar Familiarity with cloud networking, VPCs, NAT Gateways, security groups, etc. Personal Attributes Proactive and self-driven with a passion for technology Strong problem-solving mindset Collaborative team player More ❯
and alerts Building multi-cluster Kubernetes structures all across the globe Supporting a globally dispersed frontend with local APIs in each country API orchestration Cloud level performance monitoring: Backstage.io, Prometheus and Grafana As part of the Platform team, you will be responsible for: Attending the Team meetings, discussing and communicating what platform requirements are needed by the Squad and how More ❯
London, England, United Kingdom Hybrid / WFH Options
Protegrity
Experiential learning on Infrastructure as Code tools (like Terraform, Ansible). Hands-On experience on Container & Container Orchestration tools like Docker, AWS ECS, Kubernetes and infrastructure monitoring tools like Prometheus and Grafana. Experience with designing, building, and maintaining cloud-native applications across major cloud platforms such as AWS, Azure or GCP is a strong plus. Excellent communication and collaboration skills More ❯
London, England, United Kingdom Hybrid / WFH Options
Mistral AI
call rotations...) • Experience working against reliability KPIs (observability, alerting, SLAs) • Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...), monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...), infrastructure-as-code tools (Terraform, CloudFormation...) • Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices • Understanding of networking, security, and system More ❯
such as: Docker, OpenShift, Kubernetes etc. Infrastructure as Code and CI/CD paradigms and systems such as: Ansible, Terraform, Jenkins, Bamboo, Concourse etc. Monitoring utilising products such as: Prometheus, Grafana, ELK, filebeat etc. Observability - SRE Big Data solutions (ecosystems) and technologies such as: Apache Spark and the Hadoop Ecosystem Edge technologies e.g. NGINX, HAProxy etc. Excellent knowledge of YAML More ❯
long-running services and analytics in C#. We use Airflow for workflow management, Kafka for data pipelines, Bitbucket for source control, Jenkins for continuous integration, ELK for logs, Grafana, Prometheus & InfluxDB for metrics, Docker and Kubernetes for containerisation, OpenStack for our private cloud, Ansible and Terraform for architecture automation, and Slack for internal communication. We heavily utilise ArcticDB () our in More ❯
collaborate with DevOps to optimize build times, parallelize tests, and reduce pipeline flakiness. Result Analysis & Root Cause • Analyze test outputs, system logs, and metrics (e.g., via ELK Stack or Prometheus/Grafana) to pinpoint failures and performance regressions. • Lead root-cause investigations for infrastructure incidents, producing clear post-mortem reports and remediation recommendations. Defect Management • Log, triage, and track defects More ❯
long-running services and analytics in C#. We use Airflow for workflow management, Kafka for data pipelines, Bitbucket for source control, Jenkins for continuous integration, ELK for logs, Grafana, Prometheus & InfluxDB for metrics, Docker and Kubernetes for containerisation, OpenStack for our private cloud, Ansible and Terraform for architecture automation, and Slack for internal communication. We heavily utilise ArcticDB ( https:/ More ❯
London, England, United Kingdom Hybrid / WFH Options
SAP
We help the world run better At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better. How? We focus every day on building More ❯
At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better. How? We focus every day on building the foundation for tomorrow and creating More ❯