infrastructure environments, including coding, testing, and certifying technology platforms, software, and applications, as well as coaching and mentoring team members. THE IMPACT YOU WILL MAKE The Lead AWS Monitoring & Observability Engineer - AWS & APM Tools role will offer you the flexibility to make each day your own, while working alongside people who care so that you can deliver on the following … the team in addressing identified issues. Qualifications THE EXPERIENCE YOU BRING TO THE TEAM Minimum Required Experiences 6 years 4 years of hands-on experience managing the Monitoring and Observability platform using Splunk/Dynatrace/OpenTelemetry/AWS CloudWatch in large-scale Linux and Windows Server environments. Experience in generating and using complex database queries. Skilled in … Python Programming. Desired Experiences Bachelor's degree or equivalent 4+ years of hands-on experience managing the Monitoring and Observability platform using Splunk/Dynatrace/OpenTelemetry/AWS CloudWatch in large-scale Linux and Windows Server environments on-premises and AWS. Experience supporting mission-critical platforms in an on-call setting. AWS/Linux/Windows/Other
creation by building world-class audio infrastructure for our customers. As a Site Reliability Engineer, you'll play a key role in improving our platform's developer operations, including observability, monitoring, and overall reliability. You will be part of a cross-functional team dedicated to implementing robust DevOps practices and enhancing infrastructure and site reliability engineering (SRE). A customer … focused mindset is essential, as the team collaborates closely with stakeholders to ensure solutions meet business and user needs. In addition to a focus on observability, you will contribute hands-on by developing features, automating workflows, and supporting the deployment of advanced machine-learning models. Strong communication skills are vital for working effectively with engineers, product teams, and stakeholders across … about CI/CD to these engineers Identifying and resolving security issues Automating tests and supporting our engineers on building great software Minimum qualifications: Strong experience with monitoring/observability tools (Grafana, Prometheus, or similar) Proficiency in Python, Docker, Kubernetes, and CI/CD pipelines Hands-on cloud experience (AWS or similar) A passion for designing and implementing scalable observability …
area of the product component or the system in aggregate and at scale. Specific domains include Workload Management (Kubernetes, Ray, and so on); Cloud Development (Cloud Infrastructure Automation); Management & Observability (open source and commercial monitoring, observability and DCIM solutions) Skills and Experience Essential Strong relevant programming experience Python/Go/C infrastructure-as-code scripting or related to the … of the products under test: Containerisation (e.g. Docker), Virtualisation and Provisioning, Workload and job scheduling (e.g. Kubernetes, Ray) on high core-count machines and rack-scale installations, Management and Observability (e.g. Prometheus, OpenTelemetry, DataDog, Splunk, etc.). 10+ years of relevant experience related to quality assurance/testing teams. Experience with the Atlassian suite and CI/CD platforms such …
and contribute to the continuous evolution of cutting-edge betting products. Platform Engineer Responsibilities Build and maintain robust, secure and scalable cloud and on-prem infrastructure Drive improvements across observability, monitoring, and alerting Collaborate with development, architecture and SRE teams to support modern platform capabilities Contribute to infrastructure decision-making, balancing long-term goals with practical delivery Play an active … also need experience with infrastructure-as-code (ideally Terraform), as well as scripting skills using Python or Bash. Familiarity with containerisation, CI/CD pipelines, and modern monitoring and observability tools would also be beneficial. What's in it for me? Competitive salary with annual performance reviews Hybrid working with flexibility on location 25 days holiday + bank holidays Healthcare …
SRE team in the Platform Engineering organization and help us scale our products to millions of end-users. We are looking for individuals with a passion for automation and observability, ensuring the quality and availability of our services. Location - We are flexible on remote working from home, if you are based in the UK or Germany. On Call Requirements - Participate … our 24x7 on-call rotation, SCRUM, and deployment planning Perform Root Cause Analysis (RCA) and provide recommendations for application teams Improve availability and reduce customer impact using industry-best observability tools Ensure best-practice and security-minded architecture by influencing design decisions Create and maintain technical documentation and SOPs Develop software, scripts, or tooling to improve efficiency and reduce … experience in Site Reliability Engineer roles 3+ years' experience with an object-oriented language (preferably Java, .NET or C++) Expert+ level Linux administration, scripting, and troubleshooting Demonstrable knowledge of Observability tools (New Relic, Splunk, DataDog) Comprehensive experience with AWS (Amazon Web Services) and its core capabilities (VPC, EC2, ECS, Route53, Fargate, ALB/NLB distributions, etc.) Extensive experience with cloud …
automation. Ensure end-to-end network automation to improve operational efficiency, agility, and reliability. Drive zero-trust network security principles, ensuring compliance and proactive threat mitigation. Establish a global observability and telemetry framework for real-time network insights. Align network strategies with business growth, cloud-first initiatives, and digital transformation. Network Infrastructure & Cloud Networking: Oversee global network architecture, spanning data … response using AI-driven network analytics. Ensure high availability, network resilience, and 24x7 operational support. Develop a follow-the-sun support model, ensuring global network performance optimization. Implement network observability and predictive analytics to proactively prevent outages. Security, Compliance & Risk Management: Drive zero-trust security frameworks, ensuring secure and resilient network access. Ensure adherence to ISO 27001, NIST, SOC … role, managing large-scale global network environments. Deep expertise in cloud networking (AWS, Azure, GCP), SD-WAN, and network automation. Proven track record in end-to-end network automation, observability, and self-healing networks. Experience in AI-driven networking, predictive analytics, and network telemetry. Strong understanding of zero-trust networking, compliance frameworks, and security policies. Excellent leadership, communication, and stakeholder …
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
Suits Me Limited
across multiple squads to ensure our platform is scalable, secure, and designed for rapid deployment and operational excellence. You'll contribute to the development and automation of cloud infrastructure, observability systems, CI/CD pipelines, and event-based services that power key parts of our product ecosystem. About Suits Me Suits Me is a multi-award-winning, ethical fintech dedicated … pipelines (e.g. GitHub Actions) to enable rapid and reliable delivery of services Contributing to the design of scalable and secure platform components that enable developer productivity Building and improving observability tooling (e.g. CloudWatch, Grafana) to support rapid detection and resolution of issues Collaborating with developers and stakeholders across squads to understand infrastructure needs and ensure best practices are applied Writing …
Leeds, Yorkshire, United Kingdom Hybrid / WFH Options
William Hill PLC
AWS, Linux, Kubernetes (EKS), Terraform, Istio, ArgoCD and Crossplane to continually evolve and meet the demands of our fast-paced industry. What you will be doing: Championing reliability: Implement observability and security solutions, with robust testing and disaster recovery strategies Accelerating productivity: Automate deployments and maintain state-of-the-art CI/CD pipelines to deliver efficiently at scale Powering … in AWS, using Terraform or similar Infrastructure as Code tools for streamlined management Containerization: Skilled in Kubernetes administration and orchestration Developer Experience: Experienced in developing SDLC pipelines with GitOps Observability: Familiar with Prometheus, New Relic, Splunk, or similar monitoring tools Security First: Demonstrates an understanding of security best practices in every workflow. With an Agile mindset, you'll be an …
management for Windows workloads Create tooling and automation around the deployment of a customer-specific Windows-based SaaS product Ensure high availability, reliability, and scalability of Windows services. Integrate observability tooling (metrics, logs, traces) into IIS-hosted services Harden Windows infrastructure for security, compliance, and operational best practices Lead incident response for Windows-related systems Contribute to internal documentation and … Windows internals Proven ability to build infrastructure-as-code and CI/CD for Windows environments Comfort wrapping a Windows software product with the surrounding infrastructure, services, automation, and observability required to run it as a SaaS offering. Hands-on experience administering cloud infrastructure or building cloud-native applications (preferably on AWS) Comfortable using AWS EC2 Proficiency with command-line …
strategy across FIC Technology, aligning reliability goals with business priorities and regulatory expectations Lead the transformation of production support into a proactive, data-driven engineering discipline focused on automation, observability, and continuous improvement Stay close to the technology by reviewing architecture, contributing to tooling, and leading by example in incident response and root cause analysis Act as a trusted advisor to … proficiency in Linux/Unix systems, SQL, and programming languages such as C++, Java or Python. Strong understanding of distributed systems and low-latency architectures Hands-on experience with observability stacks (e.g., Prometheus, Grafana, Splunk, Geneos, OpenTelemetry) and infrastructure automation (e.g., Ansible, Terraform, CI/CD pipelines) Strong understanding of the trade lifecycle, market data, and fixed income products, FX …
resilience. We use a fully automated deployment pipeline, allowing you to ship code to production within minutes, multiple times per day. We practice Test-Driven Development (TDD) and emphasize observability and secure-by-design principles from the start. As a squad, we own our software end-to-end, including development, maintenance, and support. You'll play a key role in … well-defined technical solutions. Write clean, maintainable code using Python and AWS cloud tooling. Ensure quality through test-driven development and continuous integration. Define and monitor performance expectations using observability tooling. Build secure solutions by following data protection best practices and incorporating guardrails early. Ensure platform reliability and simplicity Build and maintain resilient, performant, scalable and secure services. Improve operational …
Out in Science, Technology, Engineering, and Mathematics
across engineering teams to build, refine, and enrich data-driven solutions that span diverse systems, data models, and cloud-native architectures. By championing best practices in engineering, including testing, observability, security, and robust documentation, you'll play a key role in ensuring Axon's platforms are reliable, maintainable, and prepared to scale. What You'll Do Design, develop, and maintain … an in-house L7 (reverse) proxy that allows users to securely access parts of the data platform directly Drive best practices around production data systems, including performance, testing, security, observability, and documentation. Troubleshoot and resolve issues in production environments to ensure data integrity and platform reliability What you bring: Bachelor's Degree in Computer Science, Engineering, or related field 3+ …
move? Get in touch and apply today! Responsibilities: Respond rapidly to critical AWS incidents, identify root causes, and deploy automated hotfixes. Lead the setup and integration of Prometheus-Grafana observability stack. Refactor and modernize deployment pipelines using GitHub Actions and Kubernetes. Maintain robust monitoring, alerting, and CI/CD systems. Skills/Must have: Strong hands-on experience with AWS …
Bracknell, Berkshire, South East, United Kingdom Hybrid / WFH Options
Halian Technology Limited
in the team Contribute to solution architecture and strategic technical direction Build, integrate, and maintain REST APIs and backend services Champion best practices in software quality, CI/CD, observability, and DevOps Collaborate with cross-functional teams including Product, QA, and DevOps Optionally take on people management responsibilities for engineers Stay updated with emerging backend and cloud technologies Key Skills …
in a demanding, customer-service-oriented environment Ability to communicate clearly with all levels within an organization Excellent analytical skills, organizational abilities and problem-solving skills Experience in instituting data observability solutions using tools such as Grafana, Splunk, AWS CloudWatch, Kibana, etc. Experience in container technologies such as Docker, Kubernetes, and Amazon EKS Qualifications: Ability to obtain an Active Secret clearance …
Burke, Virginia, United States Hybrid / WFH Options
ALTA IT Services
with cloud platforms such as AWS GovCloud or Azure Government. Preferred Qualifications: • Elastic Certified Engineer or Elastic Certified Analyst. • Experience with Elasticsearch Service (Elastic Cloud). • Familiarity with other observability tools (e.g., Grafana, Splunk, Prometheus). • Experience with NIST RMF, DoD 8570 compliance, or CDM initiatives. • Prior experience supporting DoD, IC, or civilian agencies.
stakeholders to define solutions Mentor junior developers and promote engineering best practices Drive improvements in development processes, CI/CD pipelines, and tooling Investigate and resolve production issues Ensure observability through logging, metrics, and diagnostics Contribute to event-driven architecture and distributed systems design What You Bring 5+ years of backend development experience Expertise in C#, .NET (preferably .NET 6+ …
Strong scripting skills - you're confident automating processes in Bash, Groovy, or similar Proficient in building and managing robust CI/CD pipelines You're passionate about platform security, observability, and continuous improvement Comfortable collaborating across product and engineering teams to improve workflows and reliability You have a keen eye for performance and cost efficiency in cloud architecture To speak …
legacy platform teams to eventually migrate workloads onto your pipelines • Collaborate with Java software developers, systems engineers, and other cross-functional teams • Deliver an MVP pipeline for the greenfield observability product by end of summer, with operational maturity by January • Provide technical leadership and independent initiative in a fast-moving Agile environment What You'll Learn • Full mission lifecycle of …
integrate with CI/CD pipelines. Infrastructure as Code (IaC): Hands-on experience using Terraform for provisioning and managing cloud infrastructure. Proficient in version control, particularly with GitHub. Monitoring & Observability: Proficient with monitoring and alerting tools (e.g., Prometheus, Grafana, CloudWatch) to track pipeline and infrastructure health. Strong troubleshooting skills for resolving CI/CD pipeline issues and optimizing pipeline performance.
recognize roadblocks and demonstrate interest in learning technology that facilitates innovation Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, Terraform Experience in at least one observability tool such as Dynatrace, Datadog, New Relic, CloudWatch, AppDynamics, Splunk, Geneos.
legacy platform teams to eventually migrate workloads onto your pipelines Collaborate with Java software developers, systems engineers, and other cross-functional teams Deliver an MVP pipeline for the greenfield observability product by end of summer, with operational maturity by January Provide technical leadership and independent initiative in a fast-moving Agile environment Qualifications We Prefer: Experience in DevSecOps toolchains including …