Machine Learning Performance Engineer
Slough, England, United Kingdom
JR United Kingdom
CUDA, PTX/SASS, Tensor Cores, memory hierarchy, warp-level primitives
Familiarity with ML frameworks like PyTorch and their internals
Proficiency in profiling and debugging tools like Nsight, CUDA-GDB, nvprof, Nsight Compute
Deep knowledge of Triton, cuDNN, cuBLAS, CUTLASS, CUB, or similar libraries
Experience optimising across the stack: from kernel-level compute to cluster-wide networking and memory I/O