Senior DevOps Engineer [UAE Based]
Role: Senior DevOps Engineer
Location: Abu Dhabi
Company: AI71
About Us
AI71 is an applied research team dedicated to building responsible and impactful AI agents that empower knowledge workers. We work closely with our industry partners and leverage cutting-edge research from the Technology Innovation Institute (TII) to develop AI products that drive transformative change.
About the Role
As a Senior DevOps Engineer at AI71, you will own the pipelines, platforms, and processes that let our researchers ship AI from notebook to production at lightning speed and enterprise scale. You’ll design and automate cloud‑native infrastructure, champion CI/CD best practices, and ensure our GenAI services run reliably, securely, and cost‑effectively across staging, test, and high‑availability production environments.
This is an early‑stage, high‑growth environment—perfect for builders who like green‑field architecture, rapid iteration, and the chance to shape both culture and tech stack from day one.
Key Responsibilities
- Design & Build Cloud Infrastructure
  - Architect scalable, secure, and cost‑optimized Kubernetes‑based environments (EKS/GKE/AKS or on‑prem k8s).
  - Codify infrastructure with Terraform, Pulumi, or similar IaC, implementing GitOps‑style workflows.
- End‑to‑End CI/CD Automation
  - Create and maintain CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or Argo Workflows) for containerized microservices, ML model training, and inference workloads.
  - Integrate automated testing, security scans, and policy checks into the release process.
- Observability & Reliability Engineering
  - Implement comprehensive monitoring, logging, and tracing stacks (Prometheus/Grafana, Loki, ELK, OpenTelemetry).
  - Define SLOs and SLA dashboards; lead incident response, root‑cause analysis, and post‑mortems.
- Security & Compliance
  - Embed DevSecOps practices: secrets management, container image hardening, zero‑trust networking, vulnerability management, and compliance automation (ISO 27001, SOC 2).
- Collaborate with ML/AI Teams
  - Package and deploy large‑language‑model (LLM) training jobs on distributed GPU clusters (Slurm, Ray, Kubeflow, or AWS SageMaker).
  - Optimize model serving (Triton, vLLM, TorchServe) for low‑latency, high‑throughput inference.
- Cost & Performance Optimization
  - Track cloud spend, right‑size resources, and introduce autoscaling strategies (Karpenter, Cluster Autoscaler, HPA/VPA).
  - Champion FinOps best practices and forecasting.
- Culture & Process
  - Mentor engineers on DevOps fundamentals.
  - Establish runbooks, playbooks, and robust documentation to support rapid onboarding and knowledge sharing.
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- 3+ years of hands‑on DevOps/SRE experience building and operating production systems.
- Proficiency with at least one major cloud provider (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker).
- Strong skills in Infrastructure‑as‑Code (Terraform, CloudFormation, Pulumi, or CDK).
- Experience implementing CI/CD for microservices or ML pipelines.
- Solid understanding of networking, Linux, and security fundamentals.
Preferred Qualifications
- Experience supporting GPU‑accelerated workloads or MLOps/LLMOps platforms.
- Familiarity with service mesh (Istio, Linkerd) and event‑driven architectures (Kafka, Pub/Sub).
- Knowledge of distributed storage systems (Ceph, MinIO, S3) and artifact registries.
- Certifications: CKA/CKAD, AWS DevOps Engineer Professional, or equivalent.
- Track record in early‑stage or high‑growth tech environments.
- Excellent communication skills; ability to partner with researchers, backend engineers, and product stakeholders.
Company: AI71
Location: London, UK