documentation Conduct architecture reviews, technical audits, and drive adoption of best practices Partner with infrastructure teams to ensure system reliability and operational efficiency Integrate monitoring and logging solutions (e.g., Prometheus, Grafana, ELK) Define strategies for disaster recovery, scaling, and infrastructure resilience Improve observability by enhancing visibility into performance and error metrics Skills and Experience Required 10+ years of backend development More ❯
Farnborough, England, United Kingdom Hybrid / WFH Options
Addition+
in Platform or Site Reliability Engineering (5+ years ideally) Proven background with Kubernetes, CI/CD tooling (e.g. GitLab, Jenkins), and IaC (Terraform, Ansible) Confident with monitoring tools (e.g. Prometheus, Grafana) Git proficiency and solid repository management knowledge Comfortable leading technical decisions and collaborating with engineering teams What’s in It for You: A genuinely collaborative, no-blame engineering culture More ❯
and container orchestration (Docker, ECS, or Kubernetes) Solid understanding of system/network security, IAM, VPC, and secure cloud configurations Familiarity with monitoring and logging tools (e.g., CloudWatch, Datadog, Prometheus, Sentry) Experience with Postgres, Redis, and scalable backend systems Bonus: Exposure to fintech or regulated environments, GDPR/data compliance, or SOC2 setup A little about us Our founders have More ❯
explain complex systems to mixed audiences, and build trust through technical credibility. Automation-first mindset: Skilled in infrastructure-as-code (Terraform or Pulumi), CI/CD workflows, observability stacks (Prometheus, Grafana, Loki), and scripting (Python, Bash). Bonus: Prior experience working with GPU capacity providers, hyperscaler partnerships, or AI infrastructure startups. Benefits: Competitive total compensation package. Retirement or pension plan More ❯
years of technical experience in Cloud DevOps, SaaS, or observability, with 5+ years in leadership roles. Strong hands-on experience with AWS, GCP, Azure, K8S, Terraform and observability tools: Prometheus, Grafana, OpenTelemetry, ELK, Splunk, Datadog, and similar. Proficiency with metrics, logs, traces and APM. Leadership & Global Operations Proven success leading multi-regional or global technical teams with direct management of More ❯
GitHub Actions, GitLab, Argo CD, Azure DevOps). Experience with DevOps processes and practices using different tools and methods to monitor systems in production, using such tools as ELK, Grafana, Prometheus, Datadog or AWS CloudWatch. Strong problem-solving skills and the ability to debug and optimize code. Clear, concise technical documentation: creating and maintaining runbooks and end user documentation. Comfortable working More ❯
routing). You will bring some of these skills, but more importantly you're interested in learning these things: • Hardware & physical infrastructure. • Data-driven monitoring and observability (Grafana, InfluxDB, Prometheus, Elastic). • Exposure to configuration management (Puppet, Ansible, Terraform). • Some exposure to scripting (Bash, Python). • Supporting CI/CD delivery pipelines (GitLab, GitHub). 25 days of holiday More ❯
RabbitMQ, Kafka). Strong grasp of telemetry, observability, and performance monitoring in distributed systems. Track record of technical leadership and setting engineering standards. Nice to Have: Experience with OpenTelemetry, Prometheus, Grafana, or similar observability tooling. Exposure to hybrid-cloud or cloud migration strategies. Familiarity with performance optimisation in low-latency data pipelines. Contributions to DevOps-related communities, blogs, open source More ❯
troubleshooting experience. Working knowledge of HPC container runtimes (e.g., Singularity, Apptainer). Exposure to provisioning and automation tools (e.g., Ansible, PXE, Terraform). Experience with monitoring tools such as Prometheus, Grafana, and DCGM. Understanding of GPU/accelerator toolchains like CUDA or ROCm. A proactive, customer-first mindset with strong communication skills. Ability to work effectively in both individual and More ❯
/Linux fundamentals. Curiosity and the confidence to ask questions in a fast-moving team. Nice-to-haves Exposure to Kubernetes, Docker or Terraform. Experience with observability stacks (Grafana, Prometheus, OpenTelemetry). Familiarity with Postgres. Interest in data-privacy, AdTech/MarTech or large-scale data processing. Familiarity with Kafka, gRPC or Apache Spark. As well as working as part More ❯
a technical setting (preferably SaaS). Customer support experience ideally in the monitoring, observability, or data pipeline space. Experience with Kubernetes, Terraform, and significant consideration if you also have Prometheus experience. Technical understanding and experience with: Coding/SDLC, Linux, Cloud providers (AWS, GCP, Azure), Networking, Shell Strong communication skills both written and verbal. Strong technical, analytic and problem solving More ❯
by several microservices, also written in Python, utilising frameworks and libraries such as Celery, Eventlet, SQLAlchemy, etc. Additionally, GOV.UK Notify utilises AWS RDS (Postgres), AWS SQS, AWS ElastiCache, OpenTelemetry, Prometheus, Grafana and other related services. Concourse CI and Terraform are used to run build pipelines and manage our infrastructure. For the frontend, we follow the GOV.UK Design System, making use of More ❯
end: Java, Python, Spring Boot Database: MongoDB, PL/SQL, NoSQL API Development: RESTful APIs Version Control: Git CI/CD: TeamCity Docker and Containerization Monitoring and Logging (e.g., Prometheus, Grafana, ELK Stack) Security and Compliance Code Quality Tools (e.g., SonarQube) Agile Methodologies (Scrum or Kanban) Soft Skills: Team Collaboration: Ability to work effectively with cross-functional teams, sharing knowledge More ❯
Actions) Solid AWS experience and proficiency in at least one programming language (we use Go) Comfortable designing, operating and troubleshooting production platforms at scale Strong command of observability tooling (Prometheus, Splunk or similar); eager to master Honeycomb Developer empathy & outstanding communication skills; thrive on coaching and cross team collaboration Track record of data driven decision making and continuous improvement Familiarity More ❯
load (JMeter/Gatling/wrk2 etc) and JVM profiling to identify and fix performance bottlenecks Hands-on experience with instrumentation and analysis of production metrics using tools like Prometheus, Grafana, InfluxDB, or the ELK stack to identify performance bottlenecks and ensure system health. As an industry pioneer, our work is constantly evolving and challenging us in new ways that More ❯
CD environment. Demonstrate experience with the games industry, TeamCity, and Perforce Helix Core. Proficiency in Kotlin, Ansible, Bash, Python, HCL, PowerShell. Monitor CI/CD effectiveness using tools like Prometheus and Grafana. Apply problem-solving, troubleshooting, and critical thinking skills. Understand VMware, Windows, Linux, and macOS server platforms. Experience with Unity build processes and Apple developer tools. Knowledge of cloud More ❯
web applications Familiarity with infrastructure-as-code tools such as Terraform Understanding of security best practices in web infrastructure and application delivery Exposure to observability tooling and techniques (e.g., Prometheus, Grafana, structured logging) Confident in debugging and resolving issues in complex distributed or web-based Systems A product mindset and collaborative approach to improving how teams build and run software More ❯
times a week. Experience with Agile and/or DevOps methodologies. Good understanding of Linux operating systems, particularly Ubuntu and Red Hat. Exposure to OSS monitoring systems (e.g., Nagios, Observium, Prometheus). Scripting and automation experience using tools such as NetBox, Ansible, Puppet, Bash, Python, Git. Benefits include 25 days of holiday, bonus, pension contribution, private medical, dental, and vision coverage More ❯
this role is for you. Ideally you have several years' experience using Go in production. You'll be comfortable with Docker, and familiar with modern observability tools such as Prometheus, Alertmanager, Grafana and X-Ray/Tempo/Jaeger. We're looking for 3+ years tackling hard backend problems Seasoned database experience - we use MySQL, DynamoDB, Elasticsearch and Redis More ❯
following a bonus: Java experience Python experience Ruby experience Big data technologies: Spark, Trino, Kafka Financial Markets experience SQL: Postgres, Oracle Cloud-native deployments: AWS, Docker, Kubernetes Observability: Splunk, Prometheus, Grafana For more information about DRW's processing activities and our use of job applicants' data, please view our Privacy Notice at . California residents, please review the California Privacy More ❯