integration and deployment of ML models and related infrastructure Monitoring and Observability: Build and maintain comprehensive monitoring and alerting systems for our ML infrastructure and models, leveraging tools like Datadog to ensure system health and performance Collaboration and Mentorship: Collaborate effectively with data scientists, engineers, and other stakeholders. Provide guidance and support to junior team members Performance Optimization: Continuously optimize … and implement efficient CI/CD pipelines Containerization and Orchestration: Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes) Monitoring and Logging: Experience with monitoring and logging tools like Datadog, Prometheus, or Grafana Data Engineering Skills: Knowledge of event streaming platforms (e.g., Apache Kafka) and SQL database management Strong Communication and Collaboration: Excellent communication skills and the ability to work More ❯
AWS CDK, or CloudFormation to automate cloud resource provisioning, enabling consistent and repeatable infrastructure deployments. Monitoring & Observability: Implement monitoring, logging, and alerting solutions using tools like Prometheus, Grafana, Loki, Datadog, or CloudWatch to ensure system health and performance. Security & Compliance: Implement security best practices for cloud infrastructure, including IAM policies, security groups, and VPC configurations, to ensure compliance and data More ❯
London, England, United Kingdom Hybrid / WFH Options
Quaisr Limited
DevOps/Site Reliability Engineer, Junior/Mid/Senior (m/f/*) We are a London tech startup on the lookout for bright, motivated and self-driven individuals to join the team. Who you are You are a More ❯
Who we are We are a London tech startup on the lookout for bright, motivated and self-driven individuals to join the team. Who you are You are a DevOps/Site Reliability Engineer with experience managing complex infrastructure and More ❯
Jenkins, GitHub Actions) Define and enforce platform standards across environments (dev, staging, prod) Collaborate with developers and DevOps on deployment tooling and security Enable platform observability using tools like Datadog, Prometheus, and CloudWatch Maintain Helm charts and Terraform modules for shared infrastructure Contribute to onboarding documentation and platform adoption practices Participate in incident response and postmortem analysis, where applicable Essential … Docker and secure image management Scripting or programming experience in Bash, Python, or TypeScript Strong understanding of GitOps practices and infrastructure lifecycle management Desirable Skills Experience with observability tooling (Datadog, Prometheus, Fluent Bit) Knowledge of admission controllers, OPA/Gatekeeper (optional for governance) Familiarity with cloud cost optimisation and Kubernetes scaling strategies Exposure to security scanning tools (tfsec, Trivy, Snyk More ❯
London, England, United Kingdom Hybrid / WFH Options
Elwood Technologies Services Limited
closely with engineering teams to design and deploy scalable, fault-tolerant infrastructure solutions on AWS or GCP. Improve observability by utilizing monitoring, logging, and alerting systems (e.g., CloudWatch, Datadog). Lead post-incident reviews, contribute to the continuous improvement of system reliability and follow up on strategic fixes. Develop and update runbooks, incident response playbooks, and documentation. Work closely … love it if you have experience of some or all of the following: Experience with client-impact triage, working cross-functionally with account managers or product teams. Proficiency with Datadog or similar observability platforms. Knowledge of serverless architectures (e.g., AWS Lambda, GCP Cloud Functions). Familiarity with RDBMS and NoSQL databases, such as RDS, CloudSQL, DynamoDB. Prior experience in fintech More ❯
Southampton, Hampshire, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
Hampshire, England, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
Hedge End, England, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
London, England, United Kingdom Hybrid / WFH Options
ZipRecruiter
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
the equivalent with Azure and GCP Background knowledge and hands-on practice in Observability, specifically experience working with one or more of the following tools - Kibana, OpenSearch, Grafana, Datadog, Sumo Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFx, Instana, Splunk, Honeycomb, Jaeger Hands-on experience with Infrastructure as Code (Terraform/Ansible) Hands-on experience in technical integrations More ❯
London, England, United Kingdom Hybrid / WFH Options
Octopus Legacy
Python web frameworks such as Flask or FastAPI. Experience optimising applications for cloud performance, cost-efficiency, and scalability. Hands-on experience with monitoring and logging tools (e.g., AWS CloudWatch, Datadog, ELK stack). An understanding of lean software development principles and practices focused on delivering value quickly. A passion for mentoring and sharing knowledge, contributing to a culture of continuous More ❯
or Windows administration, with the ability to architect secure, performant, and highly available cloud solutions. Proficiency with monitoring and log analytics tools such as AWS CloudWatch, ELK Stack, Prometheus, Datadog, or New Relic, to maintain observability and ensure operational excellence. Demonstrated leadership skills in managing complex, high-pressure situations and guiding teams through incident resolution. Exceptional communication and presentation skills More ❯
compliance. Collaborate with development and operations teams to improve system performance and scalability. Maintain and improve logging, monitoring, and alerting systems using tools like Prometheus, Grafana, ELK Stack, or Datadog. Support and optimize infrastructure for both Linux and Windows-based environments. Participate in incident management, problem resolution, and root cause analysis. Ensure documentation of infrastructure, processes, and best practices is More ❯
Actions, CircleCI) and orchestration technologies (e.g., Kubernetes, Docker). Proficiency in scripting and programming languages (e.g., Python, Bash, Go). Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog). Solid understanding of security best practices, compliance standards, and DevSecOps. Proven ability to manage and deliver complex projects on time and within budget. Strong interpersonal, communication, and problem-solving More ❯
Manage cloud infrastructure (OCI, AWS, Azure, or GCP) using Infrastructure as Code tools like Terraform or Serverless Functions. Monitor system health and performance using tools like Prometheus, Grafana, Datadog, or New Relic. Collaborate closely with development teams to automate builds, performance tests, and deployments. Ensure system security, compliance, and best practices are followed in deployment pipelines. Ensure network security with More ❯
such as Azure, AWS or GCP Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar. Proven track record of maintaining highly available and performant production environments. Ability to identify and implement effective mitigation strategies and operational playbooks. Useful/Bonus Skills More ❯
the equivalent with Azure and GCP Background knowledge and hands-on practice in Observability, specifically experience working with one or more of the following tools - Kibana, OpenSearch, Grafana, Datadog, Sumo Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFx, Instana, Splunk, Honeycomb, Jaeger Hands-on experience with Infrastructure as Code (Terraform/Ansible) Hands-on experience in technical integrations (OpenTelemetry/ More ❯
the equivalent with Azure and GCP Background knowledge and hands-on practice in Observability, specifically experience working with one or more of the following tools - Kibana, OpenSearch, Grafana, Datadog, Sumo Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFx, Instana, Splunk, Honeycomb, Jaeger Hands-on experience with Infrastructure as Code (Terraform/Ansible) Hands-on experience in technical integrations More ❯
London, England, United Kingdom Hybrid / WFH Options
Magentus Group
CI, or similar). Experience with scripting or programming languages (Python, Go, Bash, etc.). Understanding of networking, security principles, and best practices. Knowledge of observability tools such as Datadog, Prometheus, Grafana, etc. Desired Attributes Strong problem-solving skills with a proactive approach to improving systems and processes. Excellent communication and collaboration skills, able to work effectively with cross-functional More ❯
Manchester, England, United Kingdom Hybrid / WFH Options
Magentus Group
CI, or similar). Experience with scripting or programming languages (Python, Go, Bash, etc.). Understanding of networking, security principles, and best practices. Knowledge of observability tools such as Datadog, Prometheus, Grafana, etc. Desired Attributes Strong problem-solving skills with a proactive approach to improving systems and processes. Excellent communication and collaboration skills, able to work effectively with cross-functional More ❯
Knowledge of networking concepts and security best practices. Familiarity with SRE activities and best practices. Familiarity with DevOps practices and tools. Experience with monitoring and logging tools (e.g., Datadog, Coralogix, AWS CloudWatch, Azure Monitor). Excellent problem-solving and stakeholder management skills. Strong written and oral communication skills. Experience collaborating across multiple topics in parallel. Responsibilities Collaborating with technical More ❯
the equivalent with Azure and GCP Background knowledge and hands-on practice in Observability, specifically experience working with one or more of the following tools - Kibana, OpenSearch, Grafana, Datadog, Sumo Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFx, Instana, Splunk, Honeycomb, Jaeger Hands-on experience with Infrastructure as Code (Terraform/Ansible) Hands-on experience in technical integrations (OpenTelemetry More ❯