Solution Architect - NVIDIA Cluster (End-to-End Design & Validation)
Location: London (1 day per week onsite)
Travel: Occasional travel to datacenter sites outside the UK
Engagement: Contract Inside IR35
Department: Engineering/Advanced Compute
Role Overview
We are seeking a highly skilled Solution Architect with deep experience in designing, validating, and delivering end-to-end NVIDIA GPU clusters in enterprise and hyperscale environments. This individual will own the full life cycle of architectural design, from requirements gathering through implementation oversight and performance validation. They will work closely with engineering, networking, DevOps, security, and datacenter operations teams to ensure high-performance, scalable, and resilient GPU infrastructure for AI, HPC, and ML workloads.
The role is based in London with one day per week onsite. Occasional international travel is required to support datacenter design reviews, deployment validation, or site acceptance testing.
Key Responsibilities
Architecture & Design
- Lead the architecture of NVIDIA GPU clusters using technologies such as H100/H200, NVLink, NVSwitch, DGX, HGX, or SuperPOD-class designs.
- Produce high-level and low-level designs (HLD/LLD) covering compute, network, storage, and power/cooling considerations.
- Validate hardware and platform selections, ensuring architectural alignment with customer requirements and scalability goals.
- Design fabric architectures including InfiniBand (200/400 Gb/s), RoCE, and high-performance east-west traffic patterns.
- Ensure designs adhere to NVIDIA reference architectures (NVAIE, Base Command, DGX SuperPOD specs, etc.).
Cluster Integration & Validation
- Define and execute validation test plans for GPU cluster performance, resilience, networking throughput, and workload behaviour.
- Oversee integration of GPU nodes, networking, and storage systems into the existing datacenter environment.
- Collaborate with DevOps/Platform teams to validate cluster orchestration (Kubernetes, Slurm, Bright Cluster Manager, or equivalents).
- Validate firmware, drivers, NCCL, CUDA libraries, and container environments for production readiness.
Deployment & Delivery Oversight
- Provide technical leadership across the full deployment life cycle.
- Partner with datacenter operations to ensure correct rack layouts, cabling, airflow and power design.
- Support delivery teams during build-out phases, ensuring the design is executed correctly.
- Participate in factory acceptance tests (FAT), site acceptance tests (SAT), and operational readiness reviews.
Stakeholder Collaboration
- Work closely with internal and external teams including network engineering, platform engineering, procurement, and vendors such as NVIDIA, Mellanox, Supermicro, Dell, or HPE.
- Provide technical guidance to customers, partners, and cross-functional engineering teams.
- Communicate complex architectural concepts clearly to both technical and non-technical audiences.
Documentation & Governance
- Produce detailed architecture documents, diagrams, acceptance criteria, and operational runbooks.
- Ensure security, compliance, and governance standards are built into the design.
- Provide knowledge transfer (KT) and training sessions to internal teams where required.
Required Skills & Experience
Technical Expertise
- Proven experience architecting and delivering NVIDIA GPU clusters at scale (AI/ML/HPC environments).
- Strong hands-on understanding of GPU interconnects (NVLink/NVSwitch) and DGX/HGX/SuperPOD architectures.
- Deep knowledge of InfiniBand and high-performance networking architectures.
- Experience with cluster orchestration: Kubernetes, Slurm, PBS, or similar.
- Familiarity with AI/ML workload requirements, CUDA, Docker/OCI containers, and NVIDIA software stacks (NCCL, CUDA Toolkit).
- Comfort with Linux systems engineering, hardware validation, and troubleshooting across compute/network layers.
Soft Skills
- Strong communication skills, with the ability to bridge engineering and business discussions.
- Comfortable owning architecture decisions and delivering executive-ready documentation.
- Ability to work autonomously while coordinating with multi-disciplinary teams.
- Problem-solver with strong critical-thinking abilities and a delivery-focused mindset.
Desirable Experience
- Experience with hyperscaler-class deployments or multi-megawatt datacenter environments.
- Hands-on work with NVIDIA Base Command Manager or similar cluster management tooling.
- Exposure to data pipelines, storage systems (Lustre, GPUDirect Storage, Ceph), or AI workflow platforms.
- Certifications such as NVIDIA Certified Associate/Expert, Kubernetes certifications (CKA/CKS), or related vendor accreditations.
What We Offer
- Hybrid working: 1 day per week in London
- Opportunity to design next-generation high-performance GPU infrastructure
- Exposure to cutting-edge AI compute at scale
Company: WNTD
Location: London, United Kingdom (Hybrid/Remote Options)
Employment Type: Contract
Salary: GBP Annual