AI/ML Platform Engineer

Up to £80,000 plus benefits

Onsite - Kensington, London

Company and Role

This is an opportunity to join a global technology and AI solutions provider delivering some of the most advanced computing platforms in the world. You will work on a high-profile programme to design, deploy and support a next-generation AI and Machine Learning Operations platform within a world-class research and innovation environment.

As a Senior AI/ML Platform Engineer, you will play a pivotal role in implementing and maintaining a high-performance computing environment that supports large-scale artificial intelligence and machine learning workloads. This platform will enable cutting-edge research and AI development, and you will be directly involved from design and build through to long-term operational support.

It is a hands-on, technically challenging position that combines platform engineering, containerisation and GPU optimisation within a highly collaborative setting.

Why This Role Stands Out

• Work on a flagship AI and ML infrastructure project for one of the UK's most respected research institutions

• Shape a high-performance computing environment supporting next-generation AI and data science innovation

• Collaborate with global technology partners and vendors delivering the latest GPU-enabled platforms

• Onsite position in Kensington, London within a world-leading research setting

• Salary up to £80,000 with long-term career potential in AI platform engineering

What You'll Be Doing

• Deploying and configuring a complete AI and ML operations platform within a large-scale HPC environment

• Installing and optimising the Ubuntu operating system across compute and GPU nodes

• Implementing and tuning the Kubernetes container orchestration platform for high-performance workloads

• Installing and configuring the NVIDIA GPU Operator and NVIDIA Network Operator

• Deploying and managing the NVIDIA Run:ai orchestration platform

• Integrating Run:ai with Kubernetes clusters to deliver scalable AI and ML compute capacity

• Ensuring the platform meets performance and reliability standards for AI research

• Providing knowledge transfer and operational documentation to enable long-term platform support

• Taking ownership of the platform post-deployment to provide ongoing maintenance and optimisation

• Troubleshooting, patching and improving system performance while supporting researchers and developers using the platform

What You'll Bring

• Proven experience deploying and supporting HPC or large-scale compute environments for AI and ML workloads

• Strong knowledge of Ubuntu server administration and optimisation

• Hands-on experience with Kubernetes, including cluster management and scaling

• Practical experience with NVIDIA GPU technologies, particularly the GPU Operator

• Experience with AI and ML orchestration platforms such as NVIDIA Run:ai

• Understanding of networking principles in containerised or distributed environments

• Strong analytical and troubleshooting skills with a proactive approach to problem solving

• Excellent communication and documentation skills with experience transferring knowledge to technical teams

Desirable Experience

• Certifications in Kubernetes, NVIDIA or relevant cloud technologies

• Experience in academic, scientific or research computing environments

• Understanding of AI and ML workflows, pipelines and development tools

• Familiarity with Infrastructure as Code or automation within HPC or AI environments

If you are passionate about AI and platform engineering and want to work on one of the most advanced AI infrastructure projects in the UK, this is your opportunity to make a genuine impact on the evolution of machine learning and research computing.

Company: Cloud People
Location: South East, United Kingdom
Employment Type: Permanent
Salary: GBP Annual
Posted: 6 hours ago
