Platform/SRE Engineer

Location: Sheffield, UK (3 days onsite per week mandatory)
Rate: £500/day (Inside IR35)
Contract Duration: 6 months

About the Role

We are looking for an experienced Platform/Site Reliability Engineer (SRE) to support the build, deployment, and production operations of a large-scale AI-powered enterprise platform within a regulated industry environment.

This role focuses on ensuring the reliability, scalability, performance, and cost efficiency of production systems supporting AI-driven applications and services.

Key Responsibilities

Own deployment, observability, reliability, and production operations for AI services
Build and manage CI/CD pipelines, infrastructure, and runtime environments
Deploy and operate model-serving, orchestration, and application workloads
Implement monitoring, logging, tracing, alerting, and operational dashboards
Manage scaling strategies, release processes, rollback mechanisms, and incident response
Optimise inference cost, latency, uptime, and overall system reliability
Develop runbooks, operational standards, and production support processes

Required Skills & Experience

Strong experience in DevOps/Site Reliability Engineering (SRE) roles
Hands-on experience with Docker, Kubernetes, and Infrastructure as Code
Strong knowledge of cloud platforms (AWS preferred)
Experience with monitoring and observability tools
Experience with CI/CD pipelines, release automation, secrets management, and production support
Understanding of LLM deployment patterns and API-based model integration
Experience working with enterprise tools such as Jira, Confluence, and ServiceNow

Preferred Experience

Supporting AI/ML workloads in production environments
Experience with GPU workloads, autoscaling, and cost optimisation
Experience working in large enterprise or regulated environments
