Senior DevOps Engineer [UAE Based]

Role: Senior DevOps Engineer

Location: Abu Dhabi

Company: AI71

About Us

AI71 is an applied research team dedicated to building responsible and impactful AI agents that empower knowledge workers. We work closely with our industry partners and leverage cutting-edge research from the Technology Innovation Institute (TII) to develop AI products that drive transformative change.

About the Role

As a Senior DevOps Engineer at AI71, you will own the pipelines, platforms, and processes that let our researchers ship AI from notebook to production at lightning speed and enterprise scale. You’ll design and automate cloud‑native infrastructure, champion CI/CD best practices, and ensure our GenAI services run reliably, securely, and cost‑effectively across staging, test, and high‑availability production environments.

This is an early‑stage, high‑growth environment—perfect for builders who like green‑field architecture, rapid iteration, and the chance to shape both culture and tech stack from day one.

Key Responsibilities

Design & Build Cloud Infrastructure

Architect scalable, secure, and cost‑optimized Kubernetes‑based environments (EKS/GKE/AKS or on‑prem k8s).
Codify infrastructure with Terraform, Pulumi, or similar IaC, implementing GitOps‑style workflows.

End‑to‑End CI/CD Automation

Create and maintain CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or Argo Workflows) for containerized microservices, ML model training, and inference workloads.
Integrate automated testing, security scans, and policy checks into the release process.

Observability & Reliability Engineering

Implement comprehensive monitoring, logging, and tracing stacks (Prometheus/Grafana, Loki, ELK, OpenTelemetry).
Define SLOs and build SLA dashboards; lead incident response, root‑cause analysis, and post‑mortems.

Security & Compliance

Embed DevSecOps practices—secrets management, container image hardening, zero‑trust networking, vulnerability management, and compliance automation (ISO 27001, SOC 2).

Collaborate with ML/AI Teams

Package and deploy large‑language‑model (LLM) training jobs on distributed GPU clusters (Slurm, Ray, Kubeflow, or AWS SageMaker).
Optimize model‑serving (Triton, vLLM, TorchServe) for low‑latency, high‑throughput inference.

Cost & Performance Optimization

Track cloud spend, right‑size resources, and introduce autoscaling strategies (Karpenter, Cluster‑Autoscaler, HPA/VPA).
Champion FinOps best practices and forecasting.

Culture & Process

Mentor engineers on DevOps fundamentals.
Establish runbooks, playbooks, and robust documentation to support rapid onboarding and knowledge sharing.

Required Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
3+ years of hands‑on DevOps/SRE experience building and operating production systems.
Proficiency with at least one major cloud provider (AWS, GCP, Azure), containers (Docker), and container orchestration (Kubernetes).
Strong skills in Infrastructure‑as‑Code (Terraform, CloudFormation, Pulumi, or CDK).
Experience implementing CI/CD for microservices or ML pipelines.
Solid understanding of networking, Linux, and security fundamentals.

Preferred Qualifications

Experience supporting GPU‑accelerated workloads or MLOps/LLMOps platforms.
Familiarity with service mesh (Istio, Linkerd) and event‑driven architectures (Kafka, Pub/Sub).
Knowledge of distributed storage systems (Ceph, MinIO, S3) and artifact registries.
Certifications: CKA/CKAD, AWS DevOps Engineer Professional, or equivalent.
Track record in early‑stage or high‑growth tech environments.
Excellent communication skills; ability to partner with researchers, backend engineers, and product stakeholders.

Company: AI71
Location: London, UK
Posted: 4 days ago
