Solution Architect - NVIDIA Cluster (End-to-End Design & Validation)

Location: London (1 day per week onsite)
Travel: Occasional travel to datacenter sites outside the UK
Engagement: Contract Inside IR35
Department: Engineering/Advanced Compute

Role Overview

We are seeking a highly skilled Solution Architect with deep experience in designing, validating, and delivering end-to-end NVIDIA GPU clusters in enterprise and hyperscale environments. This individual will own the full architectural design life cycle, from requirements gathering through implementation oversight to performance validation. They will work closely with engineering, networking, DevOps, security, and datacenter operations teams to ensure high-performance, scalable, and resilient GPU infrastructure for AI, HPC, and ML workloads.

The role is hybrid, with one day per week onsite in London. Occasional international travel is required to support datacenter design reviews, deployment validation, or site acceptance testing.

Key Responsibilities

Architecture & Design

  • Lead the architecture of NVIDIA GPU clusters built on technologies such as H100/H200, NVLink, NVSwitch, DGX, HGX, or SuperPOD-class designs.

  • Produce high-level and low-level designs (HLD/LLD), including compute, network, storage, and power/cooling considerations.

  • Validate hardware and platform selections, ensuring architectural alignment with customer requirements and scalability goals.

  • Design fabric architectures, including InfiniBand (200/400 Gb/s) and RoCE, optimised for high-performance east-west traffic patterns.

  • Ensure designs adhere to NVIDIA reference architectures (NVIDIA AI Enterprise (NVAIE), Base Command, DGX SuperPOD specifications, etc.).

Cluster Integration & Validation

  • Define and execute validation test plans for GPU cluster performance, resilience, networking throughput, and workload behaviour.

  • Oversee integration of GPU nodes, networking, and storage systems into the existing datacenter environment.

  • Collaborate with DevOps/Platform teams to validate cluster orchestration (Kubernetes, Slurm, Bright Cluster Manager, or equivalents).

  • Validate firmware, drivers, NCCL, CUDA libraries, and container environments for production readiness.

Deployment & Delivery Oversight

  • Provide technical leadership across the full deployment life cycle.

  • Partner with datacenter operations to ensure correct rack layouts, cabling, airflow, and power design.

  • Support delivery teams during build-out phases, ensuring the design is executed correctly.

  • Participate in factory acceptance tests (FAT), site acceptance tests (SAT), and operational readiness reviews.

Stakeholder Collaboration

  • Work closely with internal and external teams including network engineering, platform engineering, procurement, and vendors such as NVIDIA, Mellanox, Supermicro, Dell, or HPE.

  • Provide technical guidance to customers, partners, and cross-functional engineering teams.

  • Communicate complex architectural concepts clearly to both technical and non-technical audiences.

Documentation & Governance

  • Produce detailed architecture documents, diagrams, acceptance criteria, and operational runbooks.

  • Ensure security, compliance, and governance standards are built into the design.

  • Provide knowledge transfer (KT) and training sessions to internal teams where required.

Required Skills & Experience

Technical Expertise

  • Proven experience architecting and delivering NVIDIA GPU clusters at scale (AI/ML/HPC environments).

  • Strong hands-on understanding of GPU interconnects (NVLink/NVSwitch) and DGX/HGX/SuperPOD architectures.

  • Deep knowledge of InfiniBand and high-performance networking architectures.

  • Experience with cluster orchestration: Kubernetes, Slurm, PBS, or similar.

  • Familiarity with AI/ML workload requirements, CUDA, Docker/OCI containers, and NVIDIA software stacks (NCCL, CUDA Toolkit).

  • Comfort with Linux systems engineering, hardware validation, and troubleshooting across compute/network layers.

Soft Skills

  • Strong communication skills, with the ability to bridge engineering and business discussions.

  • Comfortable owning architecture decisions and delivering executive-ready documentation.

  • Ability to work autonomously while coordinating with multi-disciplinary teams.

  • Problem-solver with strong critical-thinking abilities and a delivery-focused mindset.

Desirable Experience

  • Experience with hyperscaler-class deployments or multi-megawatt datacenter environments.

  • Experience working with NVIDIA Base Command Manager or similar cluster management tooling.

  • Exposure to data pipelines, storage systems (Lustre, GPUDirect Storage, Ceph), or AI workflow platforms.

  • Certifications such as NVIDIA Certified Associate/Expert, Kubernetes certifications (CKA/CKS), or related vendor accreditations.

What We Offer

  • Hybrid working: 1 day per week in London

  • Opportunity to design next-generation high-performance GPU infrastructure

  • Exposure to cutting-edge AI compute at scale

Company
WNTD
Location
London, United Kingdom
Hybrid/Remote Options
Employment Type
Contract
Salary
GBP Annual
Posted