london (city of london), south east england, united kingdom Hybrid / WFH Options
Understanding Recruitment
/infrastructure engineering role Strong scripting skills in Python , Bash , or Ruby Familiarity with configuration management tools (Ansible, Puppet, or Chef) Interest or exposure to observability tools like Datadog , Prometheus , or Grafana A passion for learning and improving in high-performance environments This is a rare chance to learn from elite engineers and contribute directly to a platform supporting global More ❯
Manchester, North West, United Kingdom Hybrid / WFH Options
Hays
Strong understanding of networking, virtualisation, and cloud security principles. Operate, maintain, and enhance the Azure Virtual Desktop (AVD) environment. Experience with monitoring and logging tools (e.g., Azure Monitor, CloudWatch, Prometheus). Expert in setting up and managing host pools, session hosts, user access, application layers, and FSLogix profiles. Strong knowledge of cloud architecture, design, and implementation principles and practices. Proficiency More ❯
observability platforms. Hands-on Go experience; our observability ecosystem is Go-based, making it the most effective language for this role. Experience with, or strong interest in, observability tools (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, Clickhouse) and standards (OpenTelemetry, OpenTracing, OpenMetrics). Deep understanding of distributed systems and data models Hands-on experience with Kubernetes, and cloud platforms (AWS More ❯
Accreditation Council for Graduate Medical Education
highly technical, ambiguous domains. Strong knowledge of REST APIs, distributed system design, and performance optimization. Experience with both SQL and NoSQL data stores, caching layers, and observability tooling (e.g., Prometheus, Datadog). Nice To Have Experience deploying or integrating LLMs or NLP models in production systems. Comfortable balancing short-term execution with long-term architectural thinking. Passion for building highly More ❯
ensuring high availability, optimal performance, and reliability across production and non-production environments. This includes working on incident response, capacity planning, WAN optimization, and system observability using tools like Prometheus and Grafana . Key Responsibilities: Administer and maintain Solace PubSub+ appliances and software brokers across environments (on-prem and cloud). Provide production support for messaging-related incidents, including root … cause analysis and resolution. Monitor system performance and health using Prometheus and Grafana ; proactively identify and address anomalies. Configure and optimize Solace across WAN environments , ensuring low-latency, secure, and reliable messaging. Collaborate with development and application support teams to troubleshoot message flow issues and integration problems. Perform capacity planning , scaling, and tuning of Solace infrastructure to meet current and … background in production support , preferably in a 24x7 enterprise environment. Experience working with distributed systems over WAN , with an understanding of networking, latency, and failover strategies. Solid experience with Prometheus and Grafana for system monitoring and alerting. Proficiency in troubleshooting message delivery, persistence, and topic routing. Experience with capacity management , performance tuning, and system scaling. Familiarity with Linux/Unix More ❯
ensuring high availability, optimal performance, and reliability across production and non-production environments. This includes working on incident response, capacity planning, WAN optimization, and system observability using tools like Prometheus and Grafana . Key Responsibilities: Administer and maintain Solace PubSub+ appliances and software brokers across environments (on-prem and cloud). Provide production support for messaging-related incidents, including root … cause analysis and resolution. Monitor system performance and health using Prometheus and Grafana ; proactively identify and address anomalies. Configure and optimize Solace across WAN environments , ensuring low-latency, secure, and reliable messaging. Collaborate with development and application support teams to troubleshoot message flow issues and integration problems. Perform capacity planning , scaling, and tuning of Solace infrastructure to meet current and … background in production support , preferably in a 24x7 enterprise environment. Experience working with distributed systems over WAN , with an understanding of networking, latency, and failover strategies. Solid experience with Prometheus and Grafana for system monitoring and alerting. Proficiency in troubleshooting message delivery, persistence, and topic routing. Experience with capacity management , performance tuning, and system scaling. Familiarity with Linux/Unix More ❯
ensuring high availability, optimal performance, and reliability across production and non-production environments. This includes working on incident response, capacity planning, WAN optimization, and system observability using tools like Prometheus and Grafana . Key Responsibilities: Administer and maintain Solace PubSub+ appliances and software brokers across environments (on-prem and cloud). Provide production support for messaging-related incidents, including root … cause analysis and resolution. Monitor system performance and health using Prometheus and Grafana ; proactively identify and address anomalies. Configure and optimize Solace across WAN environments , ensuring low-latency, secure, and reliable messaging. Collaborate with development and application support teams to troubleshoot message flow issues and integration problems. Perform capacity planning , scaling, and tuning of Solace infrastructure to meet current and … background in production support , preferably in a 24x7 enterprise environment. Experience working with distributed systems over WAN , with an understanding of networking, latency, and failover strategies. Solid experience with Prometheus and Grafana for system monitoring and alerting. Proficiency in troubleshooting message delivery, persistence, and topic routing. Experience with capacity management , performance tuning, and system scaling. Familiarity with Linux/Unix More ❯
ensuring high availability, optimal performance, and reliability across production and non-production environments. This includes working on incident response, capacity planning, WAN optimization, and system observability using tools like Prometheus and Grafana . Key Responsibilities: Administer and maintain Solace PubSub+ appliances and software brokers across environments (on-prem and cloud). Provide production support for messaging-related incidents, including root … cause analysis and resolution. Monitor system performance and health using Prometheus and Grafana ; proactively identify and address anomalies. Configure and optimize Solace across WAN environments , ensuring low-latency, secure, and reliable messaging. Collaborate with development and application support teams to troubleshoot message flow issues and integration problems. Perform capacity planning , scaling, and tuning of Solace infrastructure to meet current and … background in production support , preferably in a 24x7 enterprise environment. Experience working with distributed systems over WAN , with an understanding of networking, latency, and failover strategies. Solid experience with Prometheus and Grafana for system monitoring and alerting. Proficiency in troubleshooting message delivery, persistence, and topic routing. Experience with capacity management , performance tuning, and system scaling. Familiarity with Linux/Unix More ❯
london (city of london), south east england, united kingdom
BGC Group
ensuring high availability, optimal performance, and reliability across production and non-production environments. This includes working on incident response, capacity planning, WAN optimization, and system observability using tools like Prometheus and Grafana . Key Responsibilities: Administer and maintain Solace PubSub+ appliances and software brokers across environments (on-prem and cloud). Provide production support for messaging-related incidents, including root … cause analysis and resolution. Monitor system performance and health using Prometheus and Grafana ; proactively identify and address anomalies. Configure and optimize Solace across WAN environments , ensuring low-latency, secure, and reliable messaging. Collaborate with development and application support teams to troubleshoot message flow issues and integration problems. Perform capacity planning , scaling, and tuning of Solace infrastructure to meet current and … background in production support , preferably in a 24x7 enterprise environment. Experience working with distributed systems over WAN , with an understanding of networking, latency, and failover strategies. Solid experience with Prometheus and Grafana for system monitoring and alerting. Proficiency in troubleshooting message delivery, persistence, and topic routing. Experience with capacity management , performance tuning, and system scaling. Familiarity with Linux/Unix More ❯
integration with Azure DevOps, Monday dot com, Teams etc. Integrating Azure SSO/RBAC into all the collaboration and development tools. Implement monitoring, logging, and alerting using tools like Prometheus, Grafana, or ELK Manage containerised services using Docker and Kubernetes (EKS) Candidate would also be required to add features and support existing system. Skills: Proven hands-on experience with Microsoft … Azure DevOps, GitHub Actions, etc.) Git-based workflow (PR, Merges, Jira status, CI and then CD). IaC (Terraform, ARM) Scripting (Python, PowerShell, Bash) Monitoring and alerting tools (e.g. Prometheus, Grafana, Azure Monitor). Collaboration tools (Jira and Monday dot com). Integration tools like Power Automate, Slack etc. Good to have Programming language (C#/.Net/Python More ❯
and platform engineering. Tech Stack Cloud: AWS (EC2, RDS, S3, IAM, CloudWatch, Lambda) Infrastructure as Code: Terraform Containerisation & Orchestration: Docker, Kubernetes (EKS), Helm Configuration Management: Ansible Monitoring & Observability: Grafana, Prometheus CI/CD: GitHub Actions Automation & Scripting: Python, Bash, Go or Java What We’re Looking For Proven experience running AWS cloud infrastructure in a production or regulated (financial) environment. … Hands-on experience managing Kubernetes clusters (preferably EKS). Strong understanding of Infrastructure as Code using Terraform. Familiarity with monitoring and observability stacks such as Prometheus and Grafana. Experience building and maintaining CI/CD pipelines (GitHub Actions or similar). Strong scripting or automation skills using Python, Bash, Go or Java . A collaborative mindset — comfortable working alongside developers More ❯
Build and maintain Infrastructure as Code (IaC) using Terraform and Ansible. Design highly reliable, scalable, and secure infrastructure supporting performance-critical workloads. Build proactive monitoring, observability, and alerting with Prometheus, Grafana, Azure Monitor, DataDog, and Dynatrace. Troubleshoot complex system issues spanning applications, networks, and infrastructure. Define platform SLAs, SLOs, and governance standards for self-service use. Collaborate closely with Salesforce … and Ansible, along with scripting in PowerShell, Python, or Bash Experience implementing GitOps workflows and managing platform SLAs, SLOs, and governance standards Familiarity with observability and monitoring tools including Prometheus, Grafana, Azure Monitor, DataDog, or Dynatrace Preferred experience supporting Salesforce DevOps pipelines and working with Java, .NET, or Node.js application environments Exposure to AI/ML platforms, real-time data More ❯
Build and maintain Infrastructure as Code (IaC) using Terraform and Ansible. Design highly reliable, scalable, and secure infrastructure supporting performance-critical workloads. Build proactive monitoring, observability, and alerting with Prometheus, Grafana, Azure Monitor, DataDog, and Dynatrace. Troubleshoot complex system issues spanning applications, networks, and infrastructure. Define platform SLAs, SLOs, and governance standards for self-service use. Collaborate closely with Salesforce … and Ansible, along with scripting in PowerShell, Python, or Bash Experience implementing GitOps workflows and managing platform SLAs, SLOs, and governance standards Familiarity with observability and monitoring tools including Prometheus, Grafana, Azure Monitor, DataDog, or Dynatrace Preferred experience supporting Salesforce DevOps pipelines and working with Java, .NET, or Node.js application environments Exposure to AI/ML platforms, real-time data More ❯
Build and maintain Infrastructure as Code (IaC) using Terraform and Ansible. Design highly reliable, scalable, and secure infrastructure supporting performance-critical workloads. Build proactive monitoring, observability, and alerting with Prometheus, Grafana, Azure Monitor, DataDog, and Dynatrace. Troubleshoot complex system issues spanning applications, networks, and infrastructure. Define platform SLAs, SLOs, and governance standards for self-service use. Collaborate closely with Salesforce … and Ansible, along with scripting in PowerShell, Python, or Bash Experience implementing GitOps workflows and managing platform SLAs, SLOs, and governance standards Familiarity with observability and monitoring tools including Prometheus, Grafana, Azure Monitor, DataDog, or Dynatrace Preferred experience supporting Salesforce DevOps pipelines and working with Java, .NET, or Node.js application environments Exposure to AI/ML platforms, real-time data More ❯
/EKS knowledge to help the team overcome technical barriers. What They’re Looking For - 5–10 years’ hands-on Kubernetes (EKS on AWS) experience. - Strong skills with Terraform, Prometheus, and scaling infra. - Collaborative and adaptable in a fast-paced environment where priorities shift quickly. - Ability to solve technical challenges and mentor others through example. If you're interested and More ❯
/EKS knowledge to help the team overcome technical barriers. What They’re Looking For - 5–10 years’ hands-on Kubernetes (EKS on AWS) experience. - Strong skills with Terraform, Prometheus, and scaling infra. - Collaborative and adaptable in a fast-paced environment where priorities shift quickly. - Ability to solve technical challenges and mentor others through example. If you're interested and More ❯
/EKS knowledge to help the team overcome technical barriers. What They’re Looking For - 5–10 years’ hands-on Kubernetes (EKS on AWS) experience. - Strong skills with Terraform, Prometheus, and scaling infra. - Collaborative and adaptable in a fast-paced environment where priorities shift quickly. - Ability to solve technical challenges and mentor others through example. If you're interested and More ❯
london (city of london), south east england, united kingdom
Propel
/EKS knowledge to help the team overcome technical barriers. What They’re Looking For - 5–10 years’ hands-on Kubernetes (EKS on AWS) experience. - Strong skills with Terraform, Prometheus, and scaling infra. - Collaborative and adaptable in a fast-paced environment where priorities shift quickly. - Ability to solve technical challenges and mentor others through example. If you're interested and More ❯
/EKS knowledge to help the team overcome technical barriers. What They’re Looking For - 5–10 years’ hands-on Kubernetes (EKS on AWS) experience. - Strong skills with Terraform, Prometheus, and scaling infra. - Collaborative and adaptable in a fast-paced environment where priorities shift quickly. - Ability to solve technical challenges and mentor others through example. If you're interested and More ❯
Manchester, Lancashire, England, United Kingdom Hybrid / WFH Options
Eutopia Solutions ltd
working with or supporting Microsoft technology components such as Windows Server OS, IIS, AD, DNS, SQL Server, and Exchange Experience with containerization technologies such as Docker, Kubernetes Experience with Prometheus may be beneficial Apply now for immediate consideration More ❯
London, South East, England, United Kingdom Hybrid / WFH Options
Venn Group
of technologies including RHEL, CentOS, Ubuntu, VMware, and F5 load balancers Manage web services, LAMP stack applications, Samba servers, and authentication proxies Utilise tools such as Ansible, Katello, Nagios, Prometheus, and Grafana for configuration and monitoring Automate routine tasks using scripts and infrastructure-as-code practices Maintain clear and up-to-date technical documentation Support knowledge sharing and training for More ❯
stack: Languages: TypeScript, Javascript Libraries and frameworks: gRPC, Redux, React Native, React, Next.js Datastores: Vitess, MySQL, CockroachDB, BigQuery, Redis Infrastructure: Google Cloud Platform, Kubernetes, Docker, PubSub, Terraform Monitoring: Grafana, Prometheus, Sentry, Metabase About you: You are a frontend developer with at least 2 years' experience You are fast and love to deliver incredible code You can reduce complex problems to More ❯
models (MongoDB, PostgreSQL) Implement asset storage, retrieval, and management systems (AWS S3) Build job queue management for async ML workflows (SNS, SQS) Setup application monitoring and logging (CloudWatch, Grafana, Prometheus) Implement CI/CD for application deployment (Bitbucket Pipelines) Create API documentation and developer tools What we are looking for 5+ years backend development experience with production applications Track record More ❯
City of London, London, United Kingdom Hybrid / WFH Options
M-XR
models (MongoDB, PostgreSQL) Implement asset storage, retrieval, and management systems (AWS S3) Build job queue management for async ML workflows (SNS, SQS) Setup application monitoring and logging (CloudWatch, Grafana, Prometheus) Implement CI/CD for application deployment (Bitbucket Pipelines) Create API documentation and developer tools What we are looking for 5+ years backend development experience with production applications Track record More ❯
london, south east england, united kingdom Hybrid / WFH Options
M-XR
models (MongoDB, PostgreSQL) Implement asset storage, retrieval, and management systems (AWS S3) Build job queue management for async ML workflows (SNS, SQS) Setup application monitoring and logging (CloudWatch, Grafana, Prometheus) Implement CI/CD for application deployment (Bitbucket Pipelines) Create API documentation and developer tools What we are looking for 5+ years backend development experience with production applications Track record More ❯