LLM Inference & Deployment Engineer (Air-Gapped Environments)
3-month+ contract
Inside IR35
Hybrid working (2-3 days per week in London)
You've deployed 70B-parameter models on GPU clusters with no internet access. You know the difference between a model that works in a notebook and one that runs reliably in production under compliance scrutiny. If that's your world, we want to talk.
This is a genuinely specialist role. The platform you'll be working on runs multiple large-scale LLMs concurrently: frontier models for text screening, code LLMs for analysis, and transformer encoders for classification, all in an air-gapped environment with a fixed compute budget and zero external API access.
You'll own the inference infrastructure end-to-end: GPU allocation strategy, quantisation decisions, batching, determinism controls, and offline deployment packaging. The system has to be fast, reliable, and auditable. That's a rare combination of skills, and this role is for someone who has genuinely done it before.
What we're looking for
- Production experience with vLLM, TensorRT-LLM, TGI, or equivalent at multi-GPU scale
- Model quantisation expertise: GPTQ, AWQ, GGUF, bitsandbytes
- Multi-node inference: tensor/pipeline/expert parallelism
- Air-gapped or classified environment deployment experience strongly preferred
- Offline dependency packaging: conda-pack, pip wheels, container images
If you are available and interested in this role, please send a current CV.