Machine Learning Engineer
Build Low-Latency Conversational AI Systems
We are building real-time conversational AI systems on top of large language models, speech AI, and agentic workflows. Our platform combines ASR, LLMs, and TTS into production-grade AI systems used globally in enterprise environments where latency, reliability, and scalability matter.
We are hiring a Machine Learning Engineer to build low-latency production systems for our LLM team. This role centres on writing scalable code that enables real-time conversational AI to perform reliably under heavy production workloads.
You’ll work closely with our LLM and speech teams to solve challenges around inference speed, concurrency, request handling, GPU performance, distributed systems, and real-time response streaming.
What you’ll do
- Build and optimise low-latency LLM systems for real-time conversational AI
- Write production-grade Python code focused on performance, scalability, and reliability
- Design systems capable of handling large volumes of concurrent real-time requests
- Solve engineering challenges around batching, request scheduling, queue management, streaming responses, and distributed workloads
- Improve inference speed, GPU memory usage, and overall system responsiveness
- Deploy and optimise open-source LLMs using tooling such as vLLM, TensorRT-LLM, Triton, SGLang, CUDA, or similar technologies
- Build scalable orchestration layers and ML pipelines around LLM systems, including RAG and agentic workflows
- Develop backend inference services and APIs for production AI systems
- Productionise new model capabilities and features for real-world customer use cases
What we’re looking for
- Strong experience writing production-grade software for machine learning systems
- Strong Python engineering skills
- Experience building low-latency or highly concurrent systems
- Strong problem-solving ability and enjoyment of building systems from the ground up
- Experience with distributed systems, parallel workloads, and performance optimisation
- Experience working with inference tooling such as vLLM, TensorRT, Triton, CUDA, ONNX, or similar technologies
- Experience building scalable backend services or ML systems used in production
- Understanding of real-time systems and performance-focused engineering
- Strong communication skills and ability to work closely with engineers and researchers
Why this role?
You’ll design and build low-latency conversational AI systems capable of serving large volumes of concurrent real-time requests. The role focuses on solving difficult engineering challenges around inference speed, reliability, concurrency, GPU performance, and scalable production AI systems.