AIOps / LLM Operations
Deploying AI is only the beginning. Organisations increasingly require operational leaders capable of ensuring AI systems remain secure, observable, reliable and effective once in production.
Role Overview
- Establish AI operational processes and standards
- Monitor model, agent and workflow performance
- Define observability, incident management and support frameworks
- Manage AI deployment, release and rollback processes
- Support security, compliance and operational resilience requirements
- Drive continuous improvement across AI systems
- Monitor and control the cost of running AI systems
Tools & Technologies Required
- Experiment tracking and model management tools (MLflow, Weights & Biases or similar)
- Monitoring and observability platforms
- CI/CD pipelines
- Cloud infrastructure
- Logging and performance management tools
- Security and governance frameworks
Nice to Have
- Site Reliability Engineering (SRE) background
- MLOps experience
- Cloud architecture knowledge
- Experience supporting regulated environments
About You
- Strong operational mindset
- Calm and methodical under pressure
- Focused on reliability and resilience
- Able to balance speed with control
- Comfortable working across engineering, security and governance teams
Benefits & Perks
- Amongst the best around