related field. 5+ years of experience as a Site Reliability Engineer or equivalent in a similar role. Proficient in application and infrastructure observability; Splunk OpenTelemetry preferred. Experienced in production environments running in AWS. Comfortable with Infrastructure as Code; Terraform is preferred. Comfortable with CI/CD pipelines such as GitHub
Prometheus, Logz.io, SignalFX, Instana, Splunk, Honeycomb, Jaeger. Hands-on experience with Infrastructure as Code (Terraform/Ansible). Hands-on experience in technical integrations (OpenTelemetry/fluentd/fluentbit/filebeat/logstash). Hands-on experience with complex troubleshooting of Kubernetes and Docker containers. Good knowledge of Regex, Lucene, PromQL
Warwick, Warwickshire, United Kingdom Hybrid / WFH Options
ICEO
programming language (Python, GoLang, C++, or Java). Solid experience with Terraform for IaC. Hands-on skills with observability tools (Prometheus, Grafana, ELK stack, OpenTelemetry) and logging pipelines (Kibana, Elasticsearch). Expertise in Docker and container orchestration using Kubernetes (preferably on GCP) and Helm. Familiarity with CI/CD systems
Warwick, Warwickshire, United Kingdom Hybrid / WFH Options
ICEO
to implement redundancy and disaster recovery scenarios. Track record in scaling high-efficiency production systems. Proficiency with observability tools (e.g., Prometheus, Grafana, Grafana Mimir, OpenTelemetry). Strong written and spoken English (B2 level or higher). Nice to Have: Experience with Argo CD and Argo Rollouts. Familiarity with technologies such
in cloud-native environments at scale. Exposure to high-load, high-performance systems and large-scale microservices architectures. Experience with observability and monitoring frameworks (OpenTelemetry, Grafana, Prometheus). Knowledge of Graph Databases and AI integration in platform operations is a plus. Experience mentoring junior engineers and leading cross-functional initiatives.
Code (IaC): Proficiency with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation. Distributed Tracing: Experience with distributed tracing tools like Jaeger or OpenTelemetry for debugging microservices. Security: Strong knowledge of securing microservices, Kubernetes clusters, and cloud-based applications. Additional Information: We believe that coming together as a community
integration. Experience with Grafana, VictoriaMetrics, and PromQL. Experience with deployment and management of centralized logging solutions. Strong Infrastructure as Code (IaC) knowledge. Nice to Have: OpenTelemetry experience. Terraform, Ansible, or CI/CD knowledge. Background in datacentre and compute hardware services. AWS infrastructure configuration and deployment. Familiarity with Kubernetes and
related field. Proven experience as a Site Reliability Engineer or similar role. Proficient in Java, Spring Boot, distributed systems, and modern observability practices (e.g., OpenTelemetry, Prometheus), with strong cross-functional collaboration and knowledge-sharing skills. In-depth knowledge of system architecture, distributed systems, and networking. Experience with cloud platforms (e.g.
Proven experience in building and scaling cloud-native observability platforms. Deep understanding of observability pillars (metrics, logs, traces) and tools such as Prometheus, Grafana, OpenTelemetry, Jaeger, Kibana, Elastic Stack. Hands-on experience integrating ML/AI models for insights, anomaly detection, and predictive analysis. Strong expertise in designing scalable distributed
Virtualisation and Provisioning, Workload and job scheduling (e.g. Kubernetes, Ray) on high core-count machines and rack-scale installations, Management and Observability (e.g. Prometheus, OpenTelemetry, DataDog, Splunk, etc.). 10+ years of relevant experience related to quality assurance/testing teams. Experience with the Atlassian suite and CI/CD
SNS, SQS, EventBridge). Knowledge of GraphQL, WebSockets, or real-time data streaming. Exposure to DevOps and observability practices (e.g., Prometheus, Datadog, AWS CloudWatch, OpenTelemetry). Prior experience in leading distributed engineering teams.
discovery/registry frameworks. In-depth knowledge of CI/CD pipelines, automated testing, distributed tracing, and observability tools (e.g., Prometheus, Grafana, Jaeger/OpenTelemetry). Proven skills in event-driven architectures, messaging systems (e.g., RabbitMQ, Kafka), and data modeling across diverse database types. Previous experience with healthcare software, EMR
best practices. Experience implementing and managing logging solutions (such as ELK stack). Proficiency with monitoring platforms (such as Prometheus). Familiarity with tracing technologies (including OpenTelemetry or Jaeger). Background in performance optimization and resource allocation. Industry certifications (cloud platforms preferred). Knowledge of Agile development practices. Capability to diagnose and address critical
and Kubernetes. Manage CI/CD pipelines using GitHub Actions and ensure smooth delivery to production. Own monitoring, alerting, and observability, using tools like OpenTelemetry and Dynatrace. Security & Compliance: Ensure systems are compliant with PCI DSS, PSD2, and SCA. Champion secure coding practices and data protection across services. Collaboration & Mentoring
infrastructure level. Experience with monitoring and logging tools like DataDog or Grafana's observability stack (Prometheus, Tempo, Loki, Grafana). Familiarity with the open standard OpenTelemetry. Excellent written and verbal communication skills; we're a collaborative team! PLEASE NOTE: Our engineering teams work fully remotely across Europe but we are focusing
on experience with containerization (Docker, Kubernetes). Strong security mindset with experience in compliance frameworks (SOC, PCI, GDPR). Familiarity with monitoring tools like OpenTelemetry, Instana, or LogicMonitor. Scripting experience (Ruby, Python, Bash) for automation and infrastructure management.
skills with the ability to proactively engage with a wide range of stakeholders. In-depth experience with observability tools such as Grafana, Prometheus and OpenTelemetry. Strong knowledge of public cloud environments such as AWS and GCP, and Infrastructure as Code tools such as Terraform.
React, GoLang); Proficient in (Azure) cloud platforms and tooling ( , Terraform/OpenTofu, ArgoCD, GitLab); Experienced in using and extending observability tooling like Datadog, Grafana, OpenTelemetry and system/application performance monitoring; Ability to debug, optimize code, and automate routine operational tasks; Deep understanding in infrastructure and software development security best