Baltimore, Maryland, United States Hybrid / WFH Options
Archesys Inc
Ability to work independently and collaboratively in a fast-paced, dynamic environment. Nice to Have: AWS Certifications (e.g., Solutions Architect, DevOps Engineer). Experience with other observability tools (e.g., Datadog, New Relic, OpenTelemetry). Knowledge of distributed tracing concepts and tools (e.g., Jaeger, Tempo). Experience with machine learning for anomaly detection in time-series data. Contributions to open-source More ❯
London, England, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
Who we are We are a London tech startup on the lookout for bright, motivated and self-driven individuals to join the team. Who you are You are a DevOps/Site Reliability Engineer with experience managing complex infrastructure and More ❯
London, England, United Kingdom Hybrid / WFH Options
Quaisr Limited
DevOps/Site Reliability Engineer, Junior/Mid/Senior (m/f/*) We are a London tech startup on the lookout for bright, motivated and self-driven individuals to join the team. Who you are You are a More ❯
Chicago, Illinois, United States Hybrid / WFH Options
Ahold Delhaize
and track service level objectives (SLOs) and service level indicators (SLIs). Build and manage microservices-based platforms leveraging Spring Boot, Java, Tomcat, and Redis. Monitor production environments using Datadog and proactively address performance and reliability issues. Perform root cause analysis and lead post-incident reviews to drive continual improvement. Manage CI/CD pipelines and deployment automation using GitHub … or Go. Proven experience with Spring Boot, Tomcat, Redis, and microservices architecture. Hands-on experience in managing Linux environments, particularly Ubuntu. Proficiency with observability stacks and performance monitoring using Datadog, Prometheus, and ELK. Deep understanding of containerization and orchestration using Docker, Kubernetes, and ArgoCD. Experience managing event-driven systems using Kafka. Expertise in IaC and automation using Terraform and GitHub More ❯
Cambridge, Cambridgeshire, United Kingdom Hybrid / WFH Options
Arm Limited
/green & canary releases, and automated rollbacks. Proficiency with Docker, Kubernetes, and related cloud-native orchestration patterns. Proven track record building dashboards and visualizations across platforms such as Grafana, Datadog, and AWS. Experience with instrumentation tools like Prometheus and managing time-series stores such as Graphite and VictoriaMetrics. Solid understanding of networking, security, and compliance in cloud environments. Excellent written More ❯
Southampton, Hampshire, South East, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment Limited
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
Portsmouth, England, United Kingdom Hybrid / WFH Options
Spectrum IT Recruitment
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
London, England, United Kingdom Hybrid / WFH Options
ZipRecruiter
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck Experience using configuration management platforms like Ansible, Puppet, or Chef Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer More ❯
London, England, United Kingdom Hybrid / WFH Options
Elwood Technologies
closely with engineering teams to design and deploy scalable, fault-tolerant infrastructure solutions on AWS or GCP. Improve observability by utilizing monitoring, logging, and alerting systems (e.g., CloudWatch, Datadog). Lead post-incident reviews, contribute to the continuous improvement of system reliability and follow up on strategic fixes. Develop and update runbooks, incident response playbooks, and documentation. Work closely … love it if you have experience of some or all of the following: Experience with client-impact triage, working cross-functionally with account managers or product teams. Proficiency with Datadog or similar observability platforms. Knowledge of serverless architectures (e.g., AWS Lambda, GCP Cloud Functions). Familiarity with RDBMS and NoSQL databases, such as RDS, CloudSQL, DynamoDB. Prior experience in fintech More ❯
Boston, Massachusetts, United States Hybrid / WFH Options
Nexthink
topologies (VPC, Virtual Subnets, NACLs, NSG, ILB, ELB, etc.) and storage (S3, EBS, Azure Files, etc.). Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, Grafana. Take a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and recover (MTTR). Experience in coordinating teams and individuals to … programming or scripting skills (Python, Go, Bash). Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD). Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.). Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews. Strong system-level troubleshooting skills and a proactive mindset More ❯
Colorado Springs, Colorado, United States Hybrid / WFH Options
Nexthink
topologies (VPC, Virtual Subnets, NACLs, NSG, ILB, ELB, etc.) and storage (S3, EBS, Azure Files, etc.). Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, Grafana. Take a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and recover (MTTR). Experience in coordinating teams and individuals to … programming or scripting skills (Python, Go, Bash). Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD). Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.). Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews. Strong system-level troubleshooting skills and a proactive mindset More ❯
Atlanta, Georgia, United States Hybrid / WFH Options
Zencon Group
best practices. Preferred Qualifications: AWS certifications (e.g., AWS Certified DevOps Engineer, Solutions Architect ) Experience in hybrid cloud environments or enterprise-scale distributed systems Familiarity with other observability tools like Datadog, Prometheus, or Grafana Experience with incident management and SRE metrics (SLIs, SLOs, error budgets More ❯
London, England, United Kingdom Hybrid / WFH Options
Stratospherec Limited
such as Azure, AWS or GCP Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar. Proven track record of maintaining highly-available and performant production environments. Ability to identify and implement effective mitigation strategies and operational playbooks. Useful/Bonus Skills More ❯
San Diego, California, United States Hybrid / WFH Options
PlayStation Global
Control Nice to have Experience with hosting and CDN technologies like Akamai and Cloudflare Experience with Cyber Security, threat detection and mitigation with Akamai Monitoring and Alerting solutions including Datadog, Prometheus and Grafana Logging and log aggregation solutions like Splunk, Elasticsearch and AWS CloudWatch Logs Tracing & debugging at various levels, including container, network, storage, compute Certifications in Linux, AWS, Docker More ❯
London, England, United Kingdom Hybrid / WFH Options
Magentus Group
CI, or similar). Experience with scripting or programming languages (Python, Go, Bash, etc.). Understanding of networking, security principles, and best practices. Knowledge of observability tools such as Datadog, Prometheus, Grafana, etc. Desired Attributes Strong problem-solving skills with a proactive approach to improving systems and processes. Excellent communication and collaboration skills, able to work effectively with cross-functional More ❯
Manchester, England, United Kingdom Hybrid / WFH Options
Magentus Group
CI, or similar). Experience with scripting or programming languages (Python, Go, Bash, etc.). Understanding of networking, security principles, and best practices. Knowledge of observability tools such as Datadog, Prometheus, Grafana, etc. Desired Attributes Strong problem-solving skills with a proactive approach to improving systems and processes. Excellent communication and collaboration skills, able to work effectively with cross-functional More ❯
London, England, United Kingdom Hybrid / WFH Options
Octopus Legacy
Python web frameworks such as Flask or FastAPI. Experience optimising applications for cloud performance, cost-efficiency, and scalability. Hands-on experience with monitoring and logging tools (e.g., AWS CloudWatch, Datadog, ELK stack). An understanding of lean software development principles and practices focused on delivering value quickly. A passion for mentoring and sharing knowledge, contributing to a culture of continuous More ❯
Bessemer, Alabama, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more Cloud providers (AWS, Azure) Observability: Prior experience implementing one or more Commercial Observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb) Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana Implementing Site Reliability Engineering (SRE) principles SLO/SLI Experience troubleshooting and resolving issues with critical business More ❯
London, England, United Kingdom Hybrid / WFH Options
Keyrock
EKS, K3s, or self-managed). Proficiency in scripting with Python, Bash, or Go. Experience with Infrastructure as Code (Terraform, CloudFormation, Ansible). Familiarity with observability tools (Prometheus, Grafana, Datadog, ELK). Solid understanding of networking (VPC, Load Balancers, DNS, Firewalls). Experience with DevOps, CI/CD, and GitOps practices. Experience with high-performance, low-latency systems. Familiarity with More ❯
Birmingham, Alabama, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more Cloud providers (AWS, Azure) Observability: Prior experience implementing one or more Commercial Observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb) Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana Implementing Site Reliability Engineering (SRE) principles SLO/SLI Experience troubleshooting and resolving issues with critical business More ❯
Orlando, Florida, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more Cloud providers (AWS, Azure) Observability: Prior experience implementing one or more Commercial Observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb) Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana Implementing Site Reliability Engineering (SRE) principles SLO/SLI Experience troubleshooting and resolving issues with critical business More ❯