Research Engineer (Inference)
About
Serving a multimodal agent model in production is a different problem to serving a standard LLM. Context length, tool calls, and computer-use workloads create constraints that require co-designing the inference stack with the model team - not just bolting on a serving framework after the fact.
This is a VC-backed challenger lab building state-of-the-art computer-use agents. The inference team owns the full stack from engine layer (vLLM, SGLang) through to serving architecture (disaggregated inference, intelligent routing).
The team operates at the intersection of research and production - translating cutting-edge techniques directly into the systems behind live agent products.
What you'll do
- Build and operate the inference stack serving multimodal agentic models in production
- Improve latency, throughput, and cost across the serving stack
- Research and implement inference techniques tailored to agent workloads
- Co-design with the models team on training-time decisions that affect inference behaviour
- Evaluate inference frameworks and hardware platforms and feed findings back into roadmap decisions
- Stay current with advances in inference, model serving, and accelerator technology
What you'll need
- Strong software engineering fundamentals and a solid production track record
- Proficiency in Python and at least one systems language - Rust, C++, or Go
- Hands-on experience with PyTorch or JAX in an industry setting
- Experience with inference frameworks: vLLM, SGLang, TensorRT-LLM
- Solid distributed systems fundamentals and experience operating production ML infrastructure
- Working knowledge of modern ML including transformers and multimodal architectures
Optional Bonus
- Research engagement: advanced degree with research output, top-tier publications (NeurIPS, ICML, MLSys, OSDI), or open-source contributions
- GPU kernel work - CUDA, Triton, or similar
- Experience with quantisation, speculative decoding, disaggregated inference, or KV-cache compression
Shortlisted candidates will be contacted within 48 hours.