DevOps Engineer
DevOps Engineer - Reinforcement Learning Platforms
We are seeking an experienced DevOps Engineer to help build and scale a web-based platform for reinforcement learning (RL) training and RLOps. You will design, implement, and maintain the cloud infrastructure, CI/CD pipelines, and deployment systems that support large-scale RL workloads.
Responsibilities
* Design and manage scalable cloud infrastructure for high-performance RL training and distributed environments
* Build and optimise CI/CD pipelines for open-source and enterprise components
* Implement containerisation and orchestration using Docker and Kubernetes
* Develop Infrastructure as Code solutions (Terraform, CloudFormation, Pulumi)
* Implement monitoring, logging, and alerting for distributed ML systems
* Collaborate with ML teams on resource optimisation and cost efficiency
* Apply security best practices, manage access controls, and ensure compliance
* Automate operational tasks: backups, disaster recovery, maintenance
* Support GPU clusters and distributed compute resources for RL workloads
* Maintain availability and performance of production ML systems
Requirements
* Degree in Computer Science/Engineering or 3+ years of DevOps/infrastructure experience
* Strong background with AWS, GCP, or Azure, including ML/AI workloads
* Proficiency with Docker, Kubernetes, and ML-focused orchestration
* Experience with Terraform/CloudFormation/Pulumi and configuration management
* Solid understanding of CI/CD tools (GitHub Actions, GitLab CI, Jenkins)
* Knowledge of monitoring/observability tools (Prometheus, Grafana, OpenObserve)
* Experience with GPU infrastructure and distributed ML compute frameworks
* Familiarity with MLOps tools and model lifecycle management
* Strong scripting skills (Python, Bash)
* Understanding of cloud networking, security, and database fundamentals
* Experience with HPC environments or schedulers is a plus
* Strong problem-solving and communication skills
Compensation & Benefits
* Stock options
* 30 days' holiday plus bank holidays
* Flexible and remote working options
* Enhanced parental leave
* £500 annual learning and development budget
* Pension scheme
* Regular socials and quarterly gatherings
* Bike-to-Work scheme