cambridge, east anglia, united kingdom Hybrid / WFH Options
Speechmatics
postmortems and ensuring the same incident doesn't happen twice. Managing and improving GitOps release workflows and CI/CD pipelines. Monitoring system performance and troubleshooting production environments. Implementing observability improvements using OpenTelemetry tooling. Automating processes that reduces manual efforts and creates self-healing systems. Taking part in on-call rota for production systems that has a generous daily pay … dive deep into new technologies; you thrive on learning as you go. Prior experience with on-call rotations and incident response is a plus. Familiarity with OpenTelemetry and related observability tooling is advantageous. We encourage you to apply even if you do not feel you match all of the requirements exactly. The list of requirements is intended to show the More ❯
lead performance testing and chaos engineering initiatives, and embed reliability best practices across engineering, DevOps, and infrastructure teams. This is a senior, strategic leadership role focused on system excellence, observability, and continuous improvement. Ideal Candidate: Proven experience leading Performance Engineering, Reliability, or SRE functions Deep expertise in performance testing methodologies (load, stress, spike, soak) Strong hands-on background with LoadRunner … strategy across critical platforms and services Oversee load, stress, and chaos testing initiatives to ensure systems perform and recover under real-world conditions Define and drive best practices for observability, monitoring, and APM adoption using tools like Dynatrace Drive incident reduction, faster recovery (MTTR) , and continuous reliability improvements Champion a culture of performance ownership , ensuring teams build with scalability, stability More ❯
lead performance testing and chaos engineering initiatives, and embed reliability best practices across engineering, DevOps, and infrastructure teams. This is a senior, strategic leadership role focused on system excellence, observability, and continuous improvement. Ideal Candidate: Proven experience leading Performance Engineering, Reliability, or SRE functions Deep expertise in performance testing methodologies (load, stress, spike, soak) Strong hands-on background with LoadRunner … strategy across critical platforms and services Oversee load, stress, and chaos testing initiatives to ensure systems perform and recover under real-world conditions Define and drive best practices for observability, monitoring, and APM adoption using tools like Dynatrace Drive incident reduction, faster recovery (MTTR) , and continuous reliability improvements Champion a culture of performance ownership , ensuring teams build with scalability, stability More ❯
lead performance testing and chaos engineering initiatives, and embed reliability best practices across engineering, DevOps, and infrastructure teams. This is a senior, strategic leadership role focused on system excellence, observability, and continuous improvement. Ideal Candidate: Proven experience leading Performance Engineering, Reliability, or SRE functions Deep expertise in performance testing methodologies (load, stress, spike, soak) Strong hands-on background with LoadRunner … strategy across critical platforms and services Oversee load, stress, and chaos testing initiatives to ensure systems perform and recover under real-world conditions Define and drive best practices for observability, monitoring, and APM adoption using tools like Dynatrace Drive incident reduction, faster recovery (MTTR) , and continuous reliability improvements Champion a culture of performance ownership , ensuring teams build with scalability, stability More ❯
hertfordshire, east anglia, united kingdom Hybrid / WFH Options
Rightmove
scientists to take models from development to production-grade systems, ensuring scalability, reproducibility, and robustness. Automating feature engineering and data pipeline processes, ensuring reproducibility and auditability. Implementing monitoring and observability to detect drift, bias, and performance degradation, and setting up rollback/recovery processes. Using MLOps tools (e.g., Vertex Pipelines, Kubeflow, Weights & Biases) for experiment tracking, model registry, and automated … distributed systems). 3+ years of experience as an ML Engineer, MLOps Engineer, Data Engineer, or similar, in a larger-scale, production-focused environment. Hands-on with model monitoring, observability, and retraining pipelines. Exposure to feature stores, registries, and experimentation frameworks. Familiarity with business-driven metrics and experience balancing ML performance with commercial goals. Experience with generative AI and LLM More ❯