through initiatives to remove single points of failure and improve autoscaling, high availability and managed service adoption across the platform. Collaborate with SRE, Security and Engineering teams to enhance observability, monitoring and alerting through tools like Prometheus, Grafana and CloudWatch. Partner with Security to embed best practices for IAM, secrets management, WAF, and posture management. Optimise performance and cloud spend … CodePipeline). Strong knowledge of Kubernetes operations on AWS (EKS), including cluster scaling, deployment automation, and monitoring. Solid background in Linux administration, networking, and cloud security principles. Familiarity with observability tools (Prometheus, Grafana, Loki) and structured alerting practices. Experience with database migrations, HA configurations, backups, and DR strategies. Strong scripting and automation skills (Terraform, Python, Bash, or similar). Excellent
Shefford, Bedfordshire, South East, United Kingdom
Stackstudio Digital Ltd
Deep knowledge of REST API design, scalability, fault tolerance, and performance optimization. Demonstrated ability to own projects end-to-end and mentor junior engineers. Significant experience with production infrastructure, observability, and incident management. Strong collaboration and communication skills across disciplines and teams. Clear understanding of engineering best practices and architectural principles. Desirable Skills/Knowledge/Experience: 5+ years of
St. Albans, Hertfordshire, England, United Kingdom
Method Resourcing
ensuring quality, velocity, and resilience. Shape engineering culture: reliability, ownership, automation, and continuous improvement. Manage budgets, supplier relationships, and resource planning. Ensure modern, efficient practices across CI/CD, observability, security, and cloud operations. What You Bring Proven leadership of engineering teams within scaling or transforming environments. Strong technical background in distributed systems, cloud-native design, and modern .NET or
Stevenage, Hertfordshire, England, United Kingdom Hybrid/Remote Options
MBDA
stakeholders to meet the ever-evolving challenges of the cyber threat landscape. Key responsibilities include: Act as the subject matter expert (SME) for Splunk across all cyber security and observability use cases. Lead SOC automation initiatives using scripting and SOAR tools, optimising processes through AI and ML technologies. Support alert tuning, connectivity, and visibility across monitored networks and infrastructure. Maintain
Luton, England, United Kingdom Hybrid/Remote Options
easyJet
and platforms to automate and optimise data management steps and gateways into data and analytical pipelines. • Expertise in implementing and managing statistical process controls for data quality measurement, continuous observability, and data quality remediation. • Strong SQL background – comfortable writing efficient SQL (Transact-SQL, Hive HQL) to meet requirements, with exposure to working with large datasets on a distributed
lead performance testing and chaos engineering initiatives, and embed reliability best practices across engineering, DevOps, and infrastructure teams. This is a senior, strategic leadership role focused on system excellence, observability, and continuous improvement. Ideal Candidate: Proven experience leading Performance Engineering, Reliability, or SRE functions Deep expertise in performance testing methodologies (load, stress, spike, soak) Strong hands-on background with LoadRunner … strategy across critical platforms and services Oversee load, stress, and chaos testing initiatives to ensure systems perform and recover under real-world conditions Define and drive best practices for observability, monitoring, and APM adoption using tools like Dynatrace Drive incident reduction, faster recovery (MTTR), and continuous reliability improvements Champion a culture of performance ownership, ensuring teams build with scalability, stability