Platform/SRE Engineer
Role Title: Platform/SRE Engineer
Location: Sheffield (3 days a week onsite is mandatory)
Duration: Until 30/11/2026
Rate: 525 per day via umbrella
Role Description:
Own deployment, observability, reliability, cost control, and production operations for the AI helpdesk platform.
Key responsibilities
- Build and manage CI/CD pipelines, infrastructure, and runtime environments for AI services.
- Deploy and operate model-serving, orchestration, and application workloads.
- Implement monitoring, tracing, alerting, logging, and operational dashboards.
- Manage scaling, release processes, rollback mechanisms, and production support.
- Optimize inference cost, latency, uptime, and system reliability.
- Create runbooks, incident response processes, and operational standards.
Required skills
- Strong experience in DevOps and SRE.
- Experience with Docker, Kubernetes, cloud platforms, and infrastructure as code.
- Experience with monitoring and observability tools.
- Familiarity with CI/CD, release automation, secrets management, and production support.
- Understanding of LLM deployment patterns and API-based model integration.
- Experience with cloud platforms, particularly AWS.
- Experience with Jira, Confluence, and ServiceNow.
Preferred
- Experience supporting AI/ML workloads in production.
- Experience with GPU workloads, autoscaling, and cost optimization.