/CD and GitOps practices with GitHub Actions and ArgoCD, including automated testing, vulnerability scanning, and environment promotion workflows. Drive the definition and implementation of observability standards - Prometheus, Grafana, Loki/ELK, Jaeger, Sentry - enabling end-to-end visibility and SLA tracking. Define scalability and reliability patterns (KEDA, HPA, circuit breakers, bulkheads, caching tiers) and ensure resilience of critical
RabbitMQ. Understanding of in-memory data structure store/cache systems such as Redis. Hands-on knowledge of monitoring and analytical systems such as the Grafana/Prometheus/Loki stack or ELK. A strong understanding of security best practices. Good understanding of database technologies to mainly support the DBA team such as MySQL/MariaDB, ProxySQL, MySQL/
Cambridge, England, United Kingdom Hybrid / WFH Options
RegGenome
learn. Hands-on experience with Kubernetes and Terraform/Terragrunt/OpenTofu. Strong cloud infrastructure knowledge in either AWS or GCP. Nice to Have: Monitoring stack tools: Prometheus, Thanos, Loki, Alertmanager, Grafana. CI/CD experience with FluxCD (or ArgoCD). Database performance optimization and management experience. Qualities We Value: Solution-oriented mindset with a knack for solving tough
LL BRING: Proven experience in observability, SRE, or platform engineering roles within complex, distributed environments. Strong hands-on expertise with telemetry tools such as OpenTelemetry, Prometheus, Grafana, Splunk, Elastic, Loki, Jaeger, or similar. Proficiency in at least one programming language (e.g., Python, Go, Java) and infrastructure-as-code tools (e.g., Terraform, Helm). Deep understanding of cloud-native
ll be doing: Building and maintaining a Kubernetes-hosted AI platform (AKS) Deploying and managing LLMOps tools such as LiteLLM, Langflow, and Langfuse Implementing observability with Prometheus, Grafana, and Loki Managing infrastructure through Terraform, ArgoCD, and GitHub Actions Supporting internal AI applications including RAG, document processing, and internal AI assistants What you’ll need: 2–4 years in Platform
code modules Tech Environment AWS (EKS, Lambda, Step Functions, Batch, API Gateway) Terraform (core IaC tool) Kubernetes (EKS) and Helm charts Python (used for Lambdas and testing) Prometheus + Loki for monitoring and observability Serverless-first architecture approach What They’re Looking For Hands-on AWS experience (not just certifications) Strong Terraform and Kubernetes (EKS) skills Solid understanding of
GitOps practices Expertise in cloud platforms (AWS, GCP, Azure) and cloud architecture; certifications are a plus Experience with Kubernetes, Docker, and microservices, as well as monitoring tools (Prometheus, Grafana, Loki, Mimir) Strong experience in Infrastructure as Code (IaC) and configuration management (especially Terraform) Responsibilities: As a Senior DevOps Engineer (f/m/d), you will be responsible for
Deployments, StatefulSets, PVCs, NetworkPolicies, etc. Implement role-based access control (RBAC), service accounts, and admission policies Monitor cluster health and performance using tools like Prometheus, Grafana, and ELK/Loki - JupyterHub/Jupyter Notebook Expertise Deploy, configure, and scale JupyterHub for multi-tenant use on Kubernetes/OpenShift Integrate JupyterHub with enterprise authentication (OAuth, SAML, LDAP) Manage persistent storage
teams, helping to establish telemetry standards, efficient usage patterns, and scalable platform abstractions. Ability to make forward-looking technical decisions and lead others through ambiguity. Familiarity with ClickHouse, Grafana Loki, Athena, or equivalent systems for log and metrics querying. Contributions to open-source observability tools or communities. Experience building cost visibility or FinOps tooling for cloud compute and telemetry
City of London, London, United Kingdom Hybrid / WFH Options
Motive Group
leading research and enterprise teams. They’re looking for a Senior or Staff SRE with deep experience in observability at massive scale - someone who’s tuned Prometheus/Mimir, Loki, or Tempo clusters beyond 100M+ series or 10TB/day logs, and who thrives in highly technical, fast-moving environments. You’ll be working on: Designing and scaling observability
and maintaining Azure Kubernetes (AKS) environments Managing Infrastructure as Code with Terraform and improving GitOps workflows (ArgoCD/GitHub Actions) Building observability and monitoring stacks using Prometheus, Grafana, and Loki Supporting AI workloads (LLMs, RAG, and document processing applications) running on Kubernetes Automating platform operations with Python, Go, and shell scripting Implementing security guardrails, PII compliance tooling, and best … experience in DevOps or Platform Engineering Strong background in Azure and Kubernetes Hands-on experience with Terraform, CI/CD, and container orchestration Familiarity with observability tools (Prometheus, Grafana, Loki) Scripting or programming skills in Python or Go Interest in AI infrastructure, LLMOps, or large language model deployment