APIs • Experience of writing performance critical code • Experience of using Git or similar to track changes • Experience of both the full .NET Framework and .NET Core • Experience of using observability systems such as Elastic APM or DataDog to track and diagnose issues in production • A solid understanding of security principles and secure coding including OWASP Top 10 Nice to haves More ❯
that process massive amounts of data? Do you thrive on designing and implementing innovative solutions to empower engineering teams with actionable insights? Are you excited to advance open-source observability at a massive scale? Join us to extend (open source) observability tools and build new capabilities that help teams manage data better and get actionable insights. About the Team The … Observability team is part of Roku's Cloud Technology Infrastructure organisation and plays a critical role in our platform. We are a high-performing, fast-moving international team that thrives on ownership, effective communication, and delivering impactful engineering solutions. Our mission is to advance Roku's observability platform, which operates at an impressive scale by ingesting terabytes of data daily … and fast iteration, and we emphasise solving meaningful engineering problems, collaboration, and continuous improvement of how we work as a team. About the Role You will work on core observability systems (metrics, logs, traces) while also developing robust data pipelines and storage solutions optimized for high throughput, performance, and cost. You'll leverage technologies such as time-series databases, columnar More ❯
cambridge, east anglia, united kingdom Hybrid / WFH Options
Speechmatics
postmortems and ensuring the same incident doesn't happen twice. Managing and improving GitOps release workflows and CI/CD pipelines. Monitoring system performance and troubleshooting production environments. Implementing observability improvements using OpenTelemetry tooling. Automating processes that reduce manual effort and create self-healing systems. Taking part in an on-call rota for production systems that has a generous daily pay … dive deep into new technologies; you thrive on learning as you go. Prior experience with on-call rotations and incident response is a plus. Familiarity with OpenTelemetry and related observability tooling is advantageous. We encourage you to apply even if you do not feel you match all of the requirements exactly. The list of requirements is intended to show the More ❯
Chelmsford, Essex, United Kingdom Hybrid / WFH Options
Brooks Automation, Inc
infrastructure and security services, ensuring operational excellence and incident response readiness. Partner with the CISO to shape long-term strategy and roadmap for secure, resilient IT services. Drive automation, observability, and scalability across the infrastructure and security stack. Serve as a key escalation point for technical troubleshooting and security event resolution. Guide vendor selection, contract negotiations, and service-level adherence More ❯
AI platform designs leveraging Dell's product and partner ecosystem e.g. NVAIE, Run.ai, H2O.ai, ClearML, OpenShift, etc. Provide expert guidance on modern data stack components: data quality, metadata management, observability, data products, feature stores, with governance and Dell's maturity model frameworks. Stay current on emerging AI and associated Data Management technologies. Actively contribute field feedback to Dell's product More ❯
lead performance testing and chaos engineering initiatives, and embed reliability best practices across engineering, DevOps, and infrastructure teams. This is a senior, strategic leadership role focused on system excellence, observability, and continuous improvement. Ideal Candidate: Proven experience leading Performance Engineering, Reliability, or SRE functions Deep expertise in performance testing methodologies (load, stress, spike, soak) Strong hands-on background with LoadRunner … strategy across critical platforms and services Oversee load, stress, and chaos testing initiatives to ensure systems perform and recover under real-world conditions Define and drive best practices for observability, monitoring, and APM adoption using tools like Dynatrace Drive incident reduction, faster recovery (MTTR), and continuous reliability improvements Champion a culture of performance ownership, ensuring teams build with scalability, stability More ❯
Employment Type: Full-Time
Salary: £84,000 - £95,000 per annum, Negotiable, Inc benefits
hertfordshire, east anglia, united kingdom Hybrid / WFH Options
Rightmove
scientists to take models from development to production-grade systems, ensuring scalability, reproducibility, and robustness. Automating feature engineering and data pipeline processes, ensuring reproducibility and auditability. Implementing monitoring and observability to detect drift, bias, and performance degradation, and setting up rollback/recovery processes. Using MLOps tools (e.g., Vertex Pipelines, Kubeflow, Weights & Biases) for experiment tracking, model registry, and automated … distributed systems). 3+ years of experience as an ML Engineer, MLOps Engineer, Data Engineer, or similar, in a larger-scale, production-focused environment. Hands-on with model monitoring, observability, and retraining pipelines. Exposure to feature stores, registries, and experimentation frameworks. Familiarity with business-driven metrics and experience balancing ML performance with commercial goals. Experience with generative AI and LLM More ❯
Peterborough, Cambridgeshire, England, United Kingdom Hybrid / WFH Options
Noir
Performance & Reliability Director - Software House - Peterborough/Hybrid (Key skills: Performance Engineering, Reliability Engineering, SRE, Load Testing, Observability, Chaos Testing, Cloud Platforms, Microservices, Leadership, CI/CD, APM Tools) Are you a technology leader passionate about driving performance, scalability, and reliability across complex software platforms? Do you thrive in high-growth environments where innovation, engineering excellence, and resilience are core … lifecycle. You'll oversee system profiling, capacity planning, and test strategies - ensuring every release meets the highest standards for speed, scalability, and reliability. You'll drive the adoption of observability and monitoring frameworks, leveraging platforms like Datadog and Dynatrace to build a proactive performance culture. You'll champion continuous improvement, implement chaos testing programmes, and ensure teams deliver fault-tolerant More ❯
St. Albans, Hertfordshire, England, United Kingdom
Method Resourcing
Software Engineering Lead £90,000 + Equity Method is working with a purpose-driven technology company on a multi-year transformation to rebuild its core platform into a modern, event-driven microservices architecture. Their mission is to improve safety, efficiency More ❯
St. Albans, Hertfordshire, South East, United Kingdom
Method-Resourcing
Software Engineering Lead £90,000 + Equity Method is working with a purpose-driven technology company on a multi-year transformation to rebuild its core platform into a modern, event-driven microservices architecture. Their mission is to improve safety, efficiency More ❯