Kubernetes layers. Orchestrate post-incident reviews: document findings, define mitigation plans, and drive tickets to resolution. Reliability Engineering & Automation Develop and maintain robust observability for front-end components: integrate Datadog for observability. Define SLIs/SLOs for page load times, Time to Interactive, and error rates; build alerting that balances sensitivity with noise reduction. Automate deployments via CI/CD More ❯
London, South East England, United Kingdom Hybrid / WFH Options
VIOOH
knowledge of streaming technologies, preferably Kafka, both as a user of the service and its inner workings. Experience managing AWS or GCP. Experience in building or integrating Monitoring Tools (Datadog/Kibana/Grafana/Prometheus). Write software using either Java/Scala/Python. The following are nice to have, but not required - Apache Spark jobs and pipelines. More ❯
stage environments preferred. Nice to Have: Experience scaling engineering orgs across multiple geographies or domains (e.g., front-end, back-end, infrastructure). Familiarity with tools like Linear, Asana, GitHub, Datadog, DORA metrics, or similar performance/observability platforms. Background in organisational change management or engineering program management. What you can expect from us Competitive salary with substantial incentive schemes Generous More ❯
fault-tolerant APIs. Experience building high-performance, distributed systems at scale. A strong understanding of modern dev practices like 12 Factor, CI/CD, and observability tools such as Datadog or Prometheus. Exposure to GraphQL APIs and WebSockets for real-time interactions. As part of our commitment to information security, all employees are expected to adhere to company security policies More ❯
London, South East England, United Kingdom Hybrid / WFH Options
Rocket Lab
fault-tolerant APIs. Experience building high-performance, distributed systems at scale. A strong understanding of modern dev practices like 12 Factor, CI/CD, and observability tools such as Datadog or Prometheus. Exposure to GraphQL APIs and WebSockets for real-time interactions. As part of our commitment to information security, all employees are expected to adhere to company security policies More ❯
Luton, Bedfordshire, United Kingdom Hybrid / WFH Options
OAG Aviation Worldwide Limited
ABOUT THE COMPANY: OAG is a leading data platform for the global travel industry offering an industry-first single source for supply, demand, and pricing data. We empower the global travel industry with high-quality, relevant datasets covering the More ❯
Kubernetes is a plus Knowledge of Redis and log queries is a plus Experience in automations/AI would be an advantage Experience administering multiple monitoring systems such as Datadog, NewRelic, Kubernetes, Grafana and Elastic Cloud Experience with Cloud Computing, AWS, Microservices Architecture, Unix and Linux Systems Life @ Empowered to think big. Try new opportunities while working with a talented More ❯
Birmingham, Staffordshire, United Kingdom Hybrid / WFH Options
CET Structures Limited
responsive, user-friendly interfaces and working with component libraries like Vuetify. Experience in writing unit and integration tests Experience working with the Azure stack is essential Experience working with DataDog or other observability platforms is desirable Interest in learning new technologies is desirable Additional Skills & Qualities Agile experience: Familiarity with Scrum, Kanban, or similar methodologies. A team player with strong More ❯
systems, such as: Puppet, Chef, Ansible, or related systems - Experience with performance testing and tuning - Experience in a 24x7 production environment - Significant experience of monitoring frameworks (such as CloudWatch, Datadog, Grafana, Elastic or similar) Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your More ❯
to have, but can be learned on the job: Experience with web and/or app Scraping TypeScript (just the ability to understand the logic, not necessarily write code) DataDog (just the ability to write queries) LaunchDarkly (just the ability to change feature flag rules manually or programmatically) Postman for testing API calls Most importantly, though, you will embody the More ❯
Cambridge, Cambridgeshire, United Kingdom Hybrid / WFH Options
Arm Limited
Job Overview: The Engineering IT group provides the high-performance compute environment that fuels product and solutions development for Arm's engineering community. Whether it's high-performance compute (HPC) on Arm's on-prem infrastructure and/or in the More ❯
Frontend. Tech & Data Science stack: Kubernetes & Docker on Google Cloud Python 3: Pandas, RabbitMQ, Celery, Flask, SciPy, NumPy, Dash, Plotly, Matplotlib Javascript, React, Redux PostgreSQL, Redis Prometheus, Alert Manager, DataDog If you joined the company in a Data Science role you would be working on sophisticated pricing algorithms which would enable companies in the entertainment industry to significantly increase profit More ❯
are ideally looking for someone with: Strong experience supporting technical products in a customer facing capacity Deep understanding of cloud native technologies and modern observability stacks such as Grafana, DataDog, Splunk or similar A hands on mindset and the ability to work comfortably across Kubernetes, microservices, and comparable environments Beyond technical skills, they value clear communicators who are curious, adaptable More ❯
up to browser extensions and web applications. Develop software to analyse and interpret cryptocurrency usage behaviours and trends on the clear and dark web Implement observability mechanisms (we use DataDog) to detect problems in your environment(s), and run the associated business processes to resolve Work with the existing engineers on your team to foster their growth and development, and More ❯
management skills, with the ability to lead through influence. Experience in scaling teams across different domains or geographies is a strong plus. Familiarity with tools such as GitHub, Asana, Datadog, Linear, and DORA metrics is desirable. A background in organizational change or transformation initiatives is an advantage. Competitive salary with substantial performance-based incentives. Generous Long-Term Incentive Plan (LTIP More ❯
experience at leading tech companies, startups, and the enterprise software sphere. Our backers include Y Combinator, Index Ventures, and stellar angels such as the founders of Looker, GitHub, Mulesoft, Datadog and UiPath. More ❯
of this type of work: Improving our in-house dbt CLI wrapper to make it more user friendly and optimise runtimes Monitor tooling interaction with tools like Sentry or Datadog to identify areas for improvement Developing our internal BI tooling powered by Streamlit to improve user experience and improve accessibility for less technical users Integrate 3rd party tooling via APIs More ❯
Troubleshooting: Diagnosing and fixing technical issues Monitoring the Claims Centre platform, including but not limited to: Identifying requirements for specific alerts Creating alerts for events and thresholds Accessing Datadog logs and dashboards for error analysis Monitoring DXC downtime and communicating to users You will play a pivotal role in updating the platform, including but not limited to: Performing a More ❯
management systems, such as Puppet, Chef, Ansible, DSC, or related systems - Experience with performance testing and optimisation in a 24x7 production environment - Experience using monitoring platforms, such as CloudWatch, Datadog, Grafana, Elastic or similar Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your More ❯
for sports. Strong understanding of video streaming protocols, encoding/transcoding workflows. Demonstrated ability to lead technical recovery during high-pressure incidents Familiarity with observability tools (e.g., Grafana, Prometheus, Datadog) and incident management platforms (e.g., PagerDuty, Opsgenie). Excellent communication and stakeholder management skills. Strong analytical and problem-solving abilities. What's in it For You? Hybrid Work Model: We More ❯