Staff DevOps Engineer

Humanoid is the first AI and robotics company in the UK, creating the world’s most advanced, reliable, commercially scalable, and safe humanoid robots. Our first humanoid robot, HMND 01, is a next-gen labour automation unit that provides highly efficient services across various use cases, starting with industrial applications.

Our Mission

At Humanoid, we strive to create the world’s leading, commercially scalable, safe, and advanced humanoid robots that seamlessly integrate into daily life and amplify human capacity.

We are building large-scale compute infrastructure to train next-generation robotics models, including transformer-based systems such as vision-language-action (VLA) models.

As a Staff Engineer, you will lead the design and evolution of our multi-GPU, cross-cloud platforms, driving architecture, reliability, and performance at scale. This role sits at the intersection of DevOps, MLOps, and distributed systems, enabling cutting-edge AI in real-world environments.

What You’ll Do:

  • Lead the design and evolution of scalable multi-GPU infrastructure across cloud environments (AWS, GCP, etc.)
  • Own architecture and long-term technical direction of model training platforms
  • Drive reliability, performance, and cost-efficiency at scale
  • Define and implement best practices for infrastructure, DevOps, and MLOps across the organization
  • Build and evolve infrastructure-as-code and automation for provisioning, orchestration, and lifecycle management
  • Architect and improve CI/CD systems for both infrastructure and ML training workflows
  • Optimize distributed training workloads (scheduling, resource utilization, observability)
  • Partner with ML engineers and researchers to enable efficient experimentation and productionization
  • Lead troubleshooting and resolution of complex system issues across distributed, GPU-heavy environments
  • Mentor engineers and raise the bar for engineering quality and operational excellence
  • Document architecture, systems, and key technical decisions

We’re Looking For:

  • 7+ years of experience in DevOps, MLOps, or infrastructure engineering (Staff level)
  • Proven experience designing and operating multi-GPU / distributed compute infrastructure
  • Experience with GPU scheduling/orchestration (e.g., Kubernetes schedulers, Volcano, Ray)
  • Strong experience with Kubernetes and containerized workloads at scale
  • Deep expertise in Infrastructure-as-Code (Terraform, Helm, or similar)
  • Deep familiarity with at least one major cloud provider (AWS preferred)
  • Strong experience building and scaling CI/CD systems (e.g., GitHub Actions, GitLab CI, ArgoCD)
  • Proficiency in Python for automation and tooling
  • Strong understanding of distributed systems, networking, and system reliability
  • Demonstrated ability to lead large technical initiatives and influence system design
  • Experience supporting ML workloads or training pipelines (PyTorch, TensorFlow, etc.)

Nice to have:

  • Experience with multi-cloud or hybrid cloud environments
  • Background in performance optimization for large-scale training workloads
  • Experience in robotics, simulation, or embodied AI systems

What we offer:

  • Competitive salary plus participation in our Stock Option Plan
  • Paid vacation, adjusted by location to comply with local labour laws
  • Travel opportunities to our Vancouver and Boston offices
  • Office perks: free breakfasts, lunches, snacks, and regular team events
  • Freedom to influence the product and own key initiatives
  • Collaboration with top-tier engineers, researchers, and product experts in AI and robotics
  • Startup culture prioritising speed, transparency, and minimal bureaucracy

Job Details

Company: Humanoid
Location: City of London, London, United Kingdom