performance cloud infra for ML workloads. Build and manage GPU clusters, storage systems, and distributed training environments. Set up and optimise containerised workflows (Docker, Kubernetes, Terraform). Implement robust monitoring, incident response, and CI/CD practices. Collaborate closely with researchers to integrate and scale experiments. This person must have experience building ML infrastructure and cloud architecture from scratch
Reading, Berkshire, United Kingdom Hybrid / WFH Options
Tenth Revolution Group
RAG, and prompt engineering. Familiarity with Azure services and cloud ecosystems. Excellent communication and presentation skills. A passion for mentoring and developing engineering talent. Experience with distributed systems and incident response. Benefits: flexible remote working; competitive salary; 25 days holiday; private health insurance (after 1 year); enhanced parental leave; and more. Please note: this is a permanent role
Reading, Berkshire, United Kingdom Hybrid / WFH Options
Tenth Revolution Group
RAG, and prompt engineering. Familiarity with Azure services and cloud ecosystems. Excellent communication and presentation skills. A passion for mentoring and developing engineering talent. Experience with distributed systems and incident response. Benefits: flexible remote working; competitive salary; 25 days holiday; private health insurance (after 1 year); enhanced parental leave; and more. Please note: this is a permanent role
and manage the day-to-day operations of a hyperscale data centre, ensuring high availability and reliability. Oversee mechanical and electrical systems, ensuring optimal performance, preventative maintenance, and incident response. Manage on-site teams, including engineers and technical staff, fostering a culture of safety, accountability, and continuous improvement. Coordinate with contractors, vendors, and stakeholders to ensure projects
Own and evolve CI/CD pipelines and Kubernetes environments to enable engineers and researchers to ship securely and recover quickly. Design deployment workflows, rollout strategies, observability, alerting and incident response. Experiment with new approaches to automation and environment management that boost speed and reliability. Work closely with engineering, research and product teams to turn breakthrough detection models into
London (City of London), South East England, United Kingdom
rmg digital
Own and evolve CI/CD pipelines and Kubernetes environments to enable engineers and researchers to ship securely and recover quickly. Design deployment workflows, rollout strategies, observability, alerting and incident response. Experiment with new approaches to automation and environment management that boost speed and reliability. Work closely with engineering, research and product teams to turn breakthrough detection models into
across GPU clusters, networking fabrics, Kubernetes (NKS/NKS Lite), and Slurm orchestration. Establish and track reliability metrics (SLIs, SLOs, error budgets) to guide service health. Integrate observability with incident management and fleet automation. Drive down MTTD and MTTR through proactive monitoring and automated remediation. Deliver executive-level reporting on system health, capacity, and reliability trends. Stay ahead of … understanding of distributed systems, networking, and cloud-native architectures. Proficiency in automation and scripting (Python, Go, Bash). Hands-on experience with Kubernetes and container orchestration. Experience in improving incident response processes and operational reliability. Nice to have: experience with GPU/AI workload observability (e.g. DCGM, model telemetry, prompt analytics). Familiarity with HPC environments (Slurm, RDMA
London (City of London), South East England, United Kingdom
Nscale
across GPU clusters, networking fabrics, Kubernetes (NKS/NKS Lite), and Slurm orchestration. Establish and track reliability metrics (SLIs, SLOs, error budgets) to guide service health. Integrate observability with incident management and fleet automation. Drive down MTTD and MTTR through proactive monitoring and automated remediation. Deliver executive-level reporting on system health, capacity, and reliability trends. Stay ahead of … understanding of distributed systems, networking, and cloud-native architectures. Proficiency in automation and scripting (Python, Go, Bash). Hands-on experience with Kubernetes and container orchestration. Experience in improving incident response processes and operational reliability. Nice to have: experience with GPU/AI workload observability (e.g. DCGM, model telemetry, prompt analytics). Familiarity with HPC environments (Slurm, RDMA
London, South East England, United Kingdom Hybrid / WFH Options
Oho Group Ltd
infrastructure. A thoughtful, pragmatic engineering approach. Curiosity about security and detection (no prior experience required). Bonus if you’ve worked with: event-driven or distributed systems; security tooling or incident response workflows. Why join? Work on hard, meaningful problems in cybersecurity. Be part of a fast, technical, remote-first team. Competitive salary and meaningful equity. Founding Engineer - London
London (City of London), South East England, United Kingdom Hybrid / WFH Options
Oho Group Ltd
infrastructure. A thoughtful, pragmatic engineering approach. Curiosity about security and detection (no prior experience required). Bonus if you’ve worked with: event-driven or distributed systems; security tooling or incident response workflows. Why join? Work on hard, meaningful problems in cybersecurity. Be part of a fast, technical, remote-first team. Competitive salary and meaningful equity. Founding Engineer - London
Slough, South East England, United Kingdom Hybrid / WFH Options
Oho Group Ltd
infrastructure. A thoughtful, pragmatic engineering approach. Curiosity about security and detection (no prior experience required). Bonus if you’ve worked with: event-driven or distributed systems; security tooling or incident response workflows. Why join? Work on hard, meaningful problems in cybersecurity. Be part of a fast, technical, remote-first team. Competitive salary and meaningful equity. Founding Engineer - London
Lead Site Reliability Engineer to bring innovation, leadership, and technical excellence to our growing team. What You'll Do: Design and implement scalable, efficient systems for maximum reliability. Lead incident response and implement monitoring solutions to maintain high system uptime. Optimize performance through in-depth analysis and continuous improvement. Develop preventive maintenance programs and carry out Root Cause
Reading, Berkshire, United Kingdom Hybrid / WFH Options
Bytes Group
equivalent), integrate static security scanning via Snyk, and manage issue tracking in JIRA. Observability: Instrument applications using the LGTM stack (e.g. logs, metrics, tracing) to ensure reliability and rapid incident response. Database Management: Design and optimize schemas in PostgreSQL and Microsoft SQL Server; write efficient queries, migrations, and manage connections securely. Collaboration & Mentorship: Work closely with product managers, designers
Leatherhead, Surrey, United Kingdom Hybrid / WFH Options
Bytes Group
equivalent), integrate static security scanning via Snyk, and manage issue tracking in JIRA. Observability: Instrument applications using the LGTM stack (e.g. logs, metrics, tracing) to ensure reliability and rapid incident response. Database Management: Design and optimize schemas in PostgreSQL and Microsoft SQL Server; write efficient queries, migrations, and manage connections securely. Collaboration & Mentorship: Work closely with product managers, designers
guidance to IT team and staff. Carry out regular access log reviews and organise improvements. Organise and provide security training to staff. Document the security process. Support security incident response. Communicate regularly with the customer line manager to update on task progress. Hold monthly 1:1 meetings with the line manager and bi-weekly meetings with the service management team … IT Security Coordinator Ideal Candidate: 2-3 years of experience in coordination. Experience in IT security administration such as documentation, audit and incident evidence collection, and Windows and anti-virus log review. General IT system knowledge in Windows AD. Experience with MS365 as a user. Task management and issue tracking. Business-level fluency in English. Flexible work ethic
driving automation and supporting the development teams with robust CI/CD infrastructure in a hands-on leadership role. KEY RESPONSIBILITIES - Overseeing day-to-day cloud operations, including monitoring, incident response and troubleshooting. - Leading and managing short- and long-term project planning. - Developing and implementing cloud governance, security and compliance. - Leading automation and IaC improvements. - Providing mentorship
Caldecotte, Milton Keynes, Buckinghamshire, England, United Kingdom
Connells Group HQ
a culture of observability across the engineering team. Helps teams across engineering use operational data to improve stability and performance of their applications. Awareness of application security considerations. Leads incident response across the engineering teams as needed. Identifies dependencies across the organization and works with individual teams to resolve them before they become an issue, and installs preventative
across the team. Help teams use operational data to improve the stability and performance of their applications. Maintain documentation and release notes. Have awareness of application security considerations. Lead incident response across the team as needed. Identify dependencies across the organization and work with teams to resolve them before they become an issue, and install preventative measures to
regular access log review and organise improvement where necessary. Organise and provide security training to staff. Correspond with staff on security matters. Document the security process. Support security incident response. Communicate regularly, at least weekly, with the customer line manager to update and prioritise task progress. If you're interested in this role, click 'apply now' to forward an