AI Platform Engineer

Role: AI Platform Engineer (6-Month Contract)

Location: Remote

Contract Type: 6-Month Contract

Overview

We are seeking an experienced AI Platform Engineer to provide Level 1 and Level 2 operational support for our enterprise AI platform. This customer-facing role involves technical troubleshooting, proactive platform management, and close collaboration with vendor engineering teams to ensure seamless and reliable AI platform operations.

Key Responsibilities

Operational Support

  • Provide L1 support for customer-reported issues and service requests.
  • Deliver L2 troubleshooting by diagnosing, replicating, and resolving issues across platform components and underlying infrastructure.
  • Coordinate L3 escalations with vendor product and engineering teams, tracking responses and resolutions.
  • Monitor system health, alerts, and customer usage patterns to proactively identify potential issues.
  • Maintain detailed documentation, knowledge base articles, and support procedures.
  • Automate recurring operational tasks and fixes to improve efficiency.
  • Support tooling integration and configuration to enhance monitoring, reporting, and performance.
  • Assist customers with onboarding, configuration, and platform best practices.
  • Collaborate with infrastructure, platform, and application teams to resolve integration and interoperability issues.
  • Ensure adherence to SLAs, uptime targets, and customer satisfaction goals.
  • Provide reporting on platform usage, workflows, and billing insights for stakeholders.

Technical Responsibilities

  • Cluster Infrastructure Management: Administer and support GPU cluster infrastructure.
  • High Availability & Resilience: Implement failover and redundancy strategies to ensure minimal downtime.
  • Resource Optimization: Manage GPU resource partitioning, workload scheduling, and capacity planning.
  • Performance Monitoring: Utilize HPE tools for real-time monitoring, diagnostics, and tuning.
  • Incident Response: Address node failures, driver issues, and networking incidents; escalate to vendors when needed.
  • Security & Access Control: Manage RBAC, user permissions, platform hardening, and data protection measures.

Required Skills & Experience

  • 10 years of experience in technical support, systems engineering, or platform operations.
  • Strong knowledge of L1/L2 support processes, including ticketing, escalation, and troubleshooting workflows.
  • Familiarity with cloud-based platforms, APIs, and distributed systems.
  • Understanding of AI/ML workflows (model training, inference, data pipelines).
  • Experience with monitoring and logging tools (e.g., Grafana, Kibana, Splunk).
  • Excellent communication and customer engagement skills.
  • Working knowledge of ML engineering and data science toolchains to optimize user experience.

Core Technical Skills

  • System Administration: RHEL/CentOS, Ubuntu, Linux kernel tuning.
  • Automation & Orchestration: Ansible, Kubernetes, container management.
  • GPU & AI Tooling
  • Automation, Monitoring & Security: Experience delivering GPU-as-a-Service with appropriate observability and controls.

Company: NineTech

Location: United Kingdom, UK

Employment Type: Part-time