AI Platform Engineer

Role: AI Platform Engineer (6-Month Contract)

Location: Remote

Contract Type: 6-Month Contract

Overview

We are seeking an experienced AI Platform Engineer to provide Level 1 and Level 2 operational support for our enterprise AI platform. This customer-facing role involves technical troubleshooting, proactive platform management, and close collaboration with vendor engineering teams to ensure seamless and reliable AI platform operations.

Key Responsibilities

Operational Support

Provide L1 support for customer-reported issues and service requests.
Deliver L2 troubleshooting by diagnosing, replicating, and resolving issues across platform components and underlying infrastructure.
Coordinate L3 escalations with vendor product and engineering teams, tracking responses and resolutions.
Monitor system health, alerts, and customer usage patterns to proactively identify potential issues.
Maintain detailed documentation, knowledge base articles, and support procedures.
Automate recurring operational tasks and fixes to improve efficiency.
Support tooling integration and configuration to enhance monitoring, reporting, and performance.
Assist customers with onboarding, configuration, and platform best practices.
Collaborate with infrastructure, platform, and application teams to resolve integration and interoperability issues.
Ensure adherence to SLAs, uptime targets, and customer satisfaction goals.
Provide reporting on platform usage, workflows, and billing insights for stakeholders.

Technical Responsibilities

Cluster Infrastructure Management: Administer and support GPU cluster infrastructure.
High Availability & Resilience: Implement failover and redundancy strategies to ensure minimal downtime.
Resource Optimization: Manage GPU resource partitioning, workload scheduling, and capacity planning.
Performance Monitoring: Use HPE tools for real-time monitoring, diagnostics, and tuning.
Incident Response: Address node failures, driver issues, and networking incidents; escalate to vendors when needed.
Security & Access Control: Manage RBAC, user permissions, platform hardening, and data protection measures.

Required Skills & Experience

10 years of experience in technical support, systems engineering, or platform operations.
Strong knowledge of L1/L2 support processes, including ticketing, escalation, and troubleshooting workflows.
Familiarity with cloud-based platforms, APIs, and distributed systems.
Understanding of AI/ML workflows (model training, inference, data pipelines).
Experience with monitoring and logging tools (e.g., Grafana, Kibana, Splunk).
Excellent communication and customer engagement skills.
Working knowledge of ML engineering and data science toolchains to optimize user experience.

Core Technical Skills

System Administration: RHEL/CentOS, Ubuntu, Linux kernel tuning.
Automation & Orchestration: Ansible, Kubernetes, container management.
GPU & AI Tooling
Automation, Monitoring & Security: Experience delivering GPU-as-a-Service with appropriate observability and controls.

Company: NineTech
Location: United Kingdom, UK
Employment Type: Part-time
Posted: 11 hours ago
