Baltimore, Maryland, United States Hybrid / WFH Options
Archesys Inc
Ability to work independently and collaboratively in a fast-paced, dynamic environment. Nice to Have: AWS Certifications (e.g., Solutions Architect, DevOps Engineer). Experience with other observability tools (e.g., Datadog, New Relic, OpenTelemetry). Knowledge of distributed tracing concepts and tools (e.g., Jaeger, Tempo). Experience with machine learning for anomaly detection in time-series data. Contributions to open-source …
Who we are: We are a London tech startup on the lookout for bright, motivated and self-driven individuals to join the team. Who you are: You are a DevOps/Site Reliability Engineer with experience managing complex infrastructure and …
Chicago, Illinois, United States Hybrid / WFH Options
Ahold Delhaize
and track service level objectives (SLOs) and service level indicators (SLIs). Build and manage microservices-based platforms leveraging Spring Boot, Java, Tomcat, and Redis. Monitor production environments using Datadog and proactively address performance and reliability issues. Perform root cause analysis and lead post-incident reviews to drive continual improvement. Manage CI/CD pipelines and deployment automation using GitHub … or Go. Proven experience with Spring Boot, Tomcat, Redis, and microservices architecture. Hands-on experience in managing Linux environments, particularly Ubuntu. Proficiency with observability stacks and performance monitoring using Datadog, Prometheus, and ELK. Deep understanding of containerization and orchestration using Docker, Kubernetes, and ArgoCD. Experience managing event-driven systems using Kafka. Expertise in IaC and automation using Terraform and GitHub …
Cambridge, Cambridgeshire, United Kingdom Hybrid / WFH Options
Arm Limited
/green & canary releases, and automated rollbacks. Proficiency with Docker, Kubernetes, and related cloud-native orchestration patterns. Proven track record building dashboards and visualizations across platforms such as Grafana, Datadog, and AWS. Experience with instrumentation tools like Prometheus and managing time-series stores such as Graphite and VictoriaMetrics. Solid understanding of networking, security, and compliance in cloud environments. Excellent written …
Southampton, Hampshire, South East, United Kingdom Hybrid / WFH Options
Spectrum It Recruitment Limited
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo. Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck. Experience using configuration management platforms like Ansible, Puppet, or Chef. Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer …
Understanding of Linux/Unix systems, networking, cloud platforms (AWS, Azure, GCP), containerization (Kubernetes, Docker), and infrastructure-as-code tools (Terraform, Ansible). Proficiency with monitoring tools (Prometheus, Grafana, Datadog, etc.), logging systems (ELK stack, Splunk), and tracing tools (Jaeger, Zipkin). Proven track record of automating complex tasks and processes to improve efficiency and reliability using Python, Go, Java …
Boston, Massachusetts, United States Hybrid / WFH Options
Nexthink
topologies (VPC, Virtual Subnets, NACLs, NSG, ILB, ELB, etc.) and storage (S3, EBS, Azure Files, etc.). Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, and Grafana. Play a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and mean time to recover (MTTR). Experience in coordinating teams and individuals to … programming or scripting skills (Python, Go, Bash). Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD). Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.). Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews. Strong system-level troubleshooting skills and a proactive mindset …
Colorado Springs, Colorado, United States Hybrid / WFH Options
Nexthink
topologies (VPC, Virtual Subnets, NACLs, NSG, ILB, ELB, etc.) and storage (S3, EBS, Azure Files, etc.). Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, and Grafana. Play a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and mean time to recover (MTTR). Experience in coordinating teams and individuals to … programming or scripting skills (Python, Go, Bash). Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD). Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.). Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews. Strong system-level troubleshooting skills and a proactive mindset …
Atlanta, Georgia, United States Hybrid / WFH Options
Zencon Group
best practices. Preferred Qualifications: AWS certifications (e.g., AWS Certified DevOps Engineer, Solutions Architect). Experience in hybrid cloud environments or enterprise-scale distributed systems. Familiarity with other observability tools like Datadog, Prometheus, or Grafana. Experience with incident management and SRE metrics (SLIs, SLOs, error budgets …
such as Azure, AWS or GCP. Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar. Proven track record of maintaining highly available and performant production environments. Ability to identify and implement effective mitigation strategies and operational playbooks. Useful/Bonus Skills …
London, England, United Kingdom Hybrid / WFH Options
Stratospherec Limited
such as Azure, AWS or GCP. Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar. Proven track record of maintaining highly available and performant production environments. Ability to identify and implement effective mitigation strategies and operational playbooks. Useful/Bonus Skills …
San Diego, California, United States Hybrid / WFH Options
Sony Interactive Entertainment
Control. Nice to have: Experience with hosting and CDN technologies like Akamai and Cloudflare. Experience with cyber security, threat detection and mitigation with Akamai. Monitoring and alerting solutions including Datadog, Prometheus and Grafana. Logging and log aggregation solutions like Splunk, Elasticsearch and AWS CloudWatch Logs. Tracing and debugging at various levels, including container, network, storage, and compute. Certifications in Linux, AWS, Docker …
Bessemer, Alabama, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Birmingham, Alabama, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Salt Lake City, Utah, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Nashville, Tennessee, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Houston, Texas, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Denver, Colorado, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Tampa, Florida, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Atlanta, Georgia, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Orlando, Florida, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
Charlotte, North Carolina, United States Hybrid / WFH Options
Regions Bank
technologies: Prior experience supporting hybrid environments with one or more cloud providers (AWS, Azure). Observability: Prior experience implementing one or more commercial observability/APM solutions (Dynatrace, New Relic, Datadog, AppDynamics, Honeycomb). Monitoring and Logging: Solid familiarity with Splunk, Elastic, OpenSearch, Prometheus, Grafana. Implementing Site Reliability Engineering (SRE) principles (SLO/SLI). Experience troubleshooting and resolving issues with critical business …
and other relevant tools. Security Best Practices: IAM, MFA, data encryption, firewall configurations. Programming/Scripting: Python, Terraform, or similar languages. Event-Driven Architectures: Kafka. Monitoring and Logging: Datadog, ELK Stack, Prometheus, etc. Experience in agile methodologies and DevOps practices. Location: Hybrid. Office located in London (Hayes area). Office presence required: Yes. Frequency: 2-3 times a week at …