such as Azure, AWS or GCP. Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar. Proven track record of maintaining highly available and performant production environments. Ability to identify and implement effective mitigation strategies and operational playbooks. Useful/Bonus Skills: …
Preferred Qualifications: Experience in hybrid cloud environments and integration with on-premise systems. Background in DevOps, SRE, or Infrastructure Engineering. Knowledge of monitoring/logging tools (e.g., CloudWatch, Datadog, Prometheus, ELK). Experience with enterprise security and compliance frameworks (e.g., ISO 27001, SOC 2, GDPR). Familiarity with cost modeling and optimization strategies in AWS.
and other relevant tools. Security Best Practices: IAM, MFA, data encryption, firewall configurations. Programming/Scripting: Python, Terraform, or similar languages. Event-Driven Architectures: Kafka. Monitoring and Logging: Datadog, ELK Stack, Prometheus, etc. Experience in agile methodologies and DevOps practices. Location: Hybrid. Office located in London (Hayes area). Office presence required: Yes. Frequency: 2-3 times a week at …
or Windows administration, with the ability to architect secure, performant, and highly available cloud solutions. Proficiency with monitoring and log analytics tools such as AWS CloudWatch, ELK Stack, Prometheus, Datadog, or New Relic, to maintain observability and ensure operational excellence. Demonstrated leadership skills in managing complex, high-pressure situations and guiding teams through incident resolution. Exceptional communication and presentation skills …
technical issues. Contribute to codebases as needed to drive projects forward. Requirements: Technical Expertise: Proven experience managing Kubernetes clusters and expertise in container orchestration. Experience with observability tools (e.g., Datadog, Prometheus, Grafana). Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation. Experience in database optimization and management (especially for multi-tenant architectures). Extensive knowledge of AWS services, including … exposed to the following stack: Infrastructure: Kubernetes (EKS) for container orchestration; AWS as our primary cloud provider; GitOps deployment model with ArgoCD; Infrastructure as Code using Terraform; Observability stack (Datadog, Sentry, Atatus). Python & Web Frameworks: Python frameworks including Django, FastAPI. Databases: PostgreSQL, MongoDB, Elasticsearch, and Redis. Other: RabbitMQ, Celery. Frontend: Working with Manatal This role is based at our …
Familiarity with Infrastructure as Code and DevOps practices. Knowledge of Hyper-V management. Understanding of networking, security, and system administration (Linux/Windows). Experience with monitoring tools (e.g., Datadog, CloudWatch, Azure Monitor). Strong communication and collaboration skills. Responsibilities: Deploying and managing Kubernetes clusters, including networking, storage, and security. Collaborating with development and platform teams to deliver scalable, secure …
gating into the SDLC. Ensure pipeline scalability and governance while maintaining developer velocity. Observability & Troubleshooting: Lead the implementation and usage of modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Datadog). Establish SLOs, SLIs, and error budgets with product and engineering teams. Drive root cause identification using distributed tracing, advanced log analysis, and anomaly detection. Security, Audit & Compliance: Partner with …
software applications and optimizing fleet utilization - Strong understanding of network fundamentals (DNS, DHCP, TCP/IP, routing, load balancing, load shedding) and experience with monitoring frameworks (such as CloudWatch, Datadog, Grafana, Elastic or similar) - Experience scripting operating system tasks in Bash, Python, etc. and with Infrastructure as Code (such as CDK, CloudFormation, Puppet, Chef, Ansible, or similar) - Experience operating services …
GitLab CI). Write clean, production-grade code in Python (Scala is a bonus). Build infrastructure using Terraform, AWS CloudFormation, or SAM. Drive observability across the platform using Datadog or CloudWatch. Actively mentor Data Engineers and Associates, and lead technical discussions and design sessions. Key requirements: Must-Have: Strong experience with AWS services: Glue, Lambda, S3, Athena, Step Functions … operate services in production. Good to Have: Experience with Scala for data applications. Familiarity with serverless/event-driven architectures. Experience designing scalable, low-latency data services. Exposure to Datadog or CloudWatch monitoring tools. Nice to Have: Experience with LLM-powered applications or OpenAI APIs. Professional experience in a similar environment or high-scale system. Key Roles and Responsibilities: …
needed. About You: 5+ years' experience in Site Reliability Engineer roles. Expert+ level Linux administration, scripting, and troubleshooting. Demonstrable knowledge of Observability tools (Prometheus/Grafana, New Relic, Splunk, Datadog). Comprehensive experience with AWS (Amazon Web Services) and its core capabilities (VPC, EC2, ECS, Route53, Fargate, ALB/NLB distributions, etc). Extensive experience with cloud automation and infrastructure-as-code …
containerization (Docker, Kubernetes), and CI/CD practices. Familiarity with Guidewire Cloud architecture models, deployment automation, and support practices. Experience integrating cloud infrastructure with DevOps, Monitoring (e.g., CloudWatch, Prometheus, Datadog), and Logging tools (ELK, Splunk). Solid understanding of cloud security, compliance (including regulatory needs in insurance), and networking. Knowledge of data migration, analytics integration, and insurance data models is …
Our stack: AWS as our cloud compute platform; Kubernetes (EKS) for container runtime and orchestration; RDS (PostgreSQL, MySQL), Kafka, Redis; Terraform for infrastructure as code; Lambda and Step Functions; Datadog for observability; GitHub Actions for CI/CD. Frontend is React. Backend services are developed in Node.js (TypeScript). As we are an international team, please submit your application and CV in English.
roles. 3+ years' experience with an object-oriented language (preferably Java, .NET or C++). Expert+ level Linux administration, scripting, and troubleshooting. Demonstrable knowledge of Observability tools (New Relic, Splunk, Datadog). Comprehensive experience with AWS (Amazon Web Services) and its core capabilities (VPC, EC2, ECS, Route53, Fargate, ALB/NLB distributions, etc). Extensive experience with cloud automation and infrastructure-as-code …
factor principles and fit into our microservices architecture. Cloud-related tools, services, and distributed system observability to support these applications, such as Docker, Kubernetes, Elasticsearch, log management systems, and Datadog APM, to name but a few. API specifications, conforming to the OpenAPI (Swagger) standard, provide a clean boundary both externally between our customers and our product, and internally between our …
GCP, or Azure). Expert in CI/CD processes, containerization (Docker, Kubernetes), and a deep understanding of networking, distributed systems, and databases. Expert with monitoring and troubleshooting utilities (Datadog, Prometheus, Grafana, ELK stack, Splunk, Humio, etc.). Exceptional problem-solving skills and a detail-oriented mindset, coupled with outstanding communication abilities. Desirable: Experience with Azure, a background in autonomous …
GCP, or Azure). Expert in CI/CD processes, containerization (Docker, Kubernetes), and a deep understanding of networking, distributed systems, and databases. Expert with monitoring and troubleshooting utilities (Datadog, Prometheus, Grafana, ELK stack, Splunk, Humio, etc.). Exceptional problem-solving skills and a detail-oriented mindset, coupled with outstanding communication abilities. Experience with Azure, a background in autonomous vehicles …
TypeScript for Frontend. Our backend services are written in TypeScript and Kotlin. Frameworks and Libraries: We use React/Redux and WebAssembly. Monitoring and Logging: We are currently using Datadog for monitoring and logging. Metrics are collected across our agents, taken from the logs using metric filters, and updated directly from Lambda functions or the application. Infrastructure-as-Code: Most …
building robust and efficient backend solutions. Strong hands-on experience with Terraform for infrastructure as code, enabling scalable and reliable systems. Experience with monitoring and observability tools, such as Datadog or Prometheus. Familiarity with event-driven systems, particularly Kafka and/or RabbitMQ. Deep understanding of messaging and queuing systems, including design patterns for reliability, retries, and scaling. Strong understanding …
Wandsworth, Greater London, UK Hybrid / WFH Options
PeopleCheck
present past case studies and guide stakeholders. Preferred Qualifications: Background in compliance or background-screening services. Experience with microservices design and orchestration (Kubernetes, ECS). Knowledge of advanced observability tools (Datadog, New Relic, ELK). Why Join Us? Impact: Help define, together with our tech lead, the technical roadmap of a mission-critical compliance platform. Ownership: Lead key initiatives end-to-end …
CloudFront and MSK extensively. Have an understanding of SLIs, SLOs & SLAs. Knowledge of platform and ops concepts such as networking and Linux administration. Experience with monitoring tools: we use Datadog, Grafana, ELK, Sentry and Opsgenie. £90,000 - £125,000 a year. Inclusive workforce: At Fresha, we are creating a culture where individuals of all backgrounds feel comfortable. We want all …