Platform/SRE Engineer
Location: Sheffield, UK (3 days per week onsite, mandatory)
Rate: £500/day (Inside IR35)
Contract Duration: 6 Months
About the Role
We are looking for an experienced Platform/Site Reliability Engineer (SRE) to support the build, deployment, and production operations of a large-scale AI-powered enterprise platform within a regulated industry environment.
This role focuses on ensuring the reliability, scalability, performance, and cost efficiency of production systems supporting AI-driven applications and services.
Key Responsibilities
- Own deployment, observability, reliability, and production operations for AI services
- Build and manage CI/CD pipelines, infrastructure, and runtime environments
- Deploy and operate model-serving, orchestration, and application workloads
- Implement monitoring, logging, tracing, alerting, and operational dashboards
- Manage scaling strategies, release processes, rollback mechanisms, and incident response
- Optimise inference cost, latency, uptime, and overall system reliability
- Develop runbooks, operational standards, and production support processes
Required Skills & Experience
- Strong experience in DevOps/Site Reliability Engineering (SRE) roles
- Hands-on experience with Docker, Kubernetes, and Infrastructure as Code
- Strong knowledge of cloud platforms (AWS preferred)
- Experience with monitoring and observability tools
- Experience with CI/CD pipelines, release automation, secrets management, and production support
- Understanding of LLM deployment patterns and API-based model integration
- Experience working with enterprise tools such as Jira, Confluence, and ServiceNow
Preferred Experience
- Experience supporting AI/ML workloads in production environments
- Experience with GPU workloads, autoscaling, and cost optimisation
- Experience working in large enterprise or regulated environments