with Python or Ansible is considered advantageous.
* Comprehensive understanding of virtualisation platforms and container orchestration tools enables you to propose scalable solutions confidently.
* Familiarity with monitoring stacks such as Prometheus or Grafana allows you to provide valuable insights into system performance for clients.
* Exceptional interpersonal skills empower you to build rapport with stakeholders at all levels while communicating complex ideas …
Build and maintain Infrastructure as Code (IaC) using Terraform and Ansible. Design highly reliable, scalable, and secure infrastructure supporting performance-critical workloads. Build proactive monitoring, observability, and alerting with Prometheus, Grafana, Azure Monitor, DataDog, and Dynatrace. Troubleshoot complex system issues spanning applications, networks, and infrastructure. Define platform SLAs, SLOs, and governance standards for self-service use. Collaborate closely with Salesforce … and Ansible, along with scripting in PowerShell, Python, or Bash. Experience implementing GitOps workflows and managing platform SLAs, SLOs, and governance standards. Familiarity with observability and monitoring tools including Prometheus, Grafana, Azure Monitor, DataDog, or Dynatrace. Preferred experience supporting Salesforce DevOps pipelines and working with Java, .NET, or Node.js application environments. Exposure to AI/ML platforms, real-time data …
EKS, EC2, RDS/Aurora, S3). Develop and maintain Infrastructure as Code using Terraform and configuration management with Ansible. Enhance monitoring, logging, and alerting using the Grafana stack (Prometheus, Loki, Tempo). Participate in incident management, root cause analysis, and post-incident reviews. Implement automation to reduce manual operational tasks and improve recovery time. Contribute to the definition and tracking … services relevant to production workloads (EKS, EC2, RDS/Aurora, S3, IAM). Infrastructure as Code with Terraform and configuration management with Ansible. Strong experience with observability tools (Grafana, Prometheus, Loki, Tempo). Understanding of SRE concepts (SLIs, SLOs, error budgets, capacity planning). Comfortable working in incident and problem management processes. Strong GitOps mindset for managing platform and configuration …
a job posted by our partner Jooble. Skills Required: Prometheus, Grafana, DataDog. Experience with C++, Python, or Golang (optional). About the Company: The company itself provides a suite of products and services to help people improve Staff-Level Full-Stack …
Site Reliability Engineer, you will be responsible for designing, developing, and maintaining systems and applications using Golang. You will monitor and optimise system performance with tools such as Grafana, Prometheus, New Relic, and Splunk. Your role will involve identifying and resolving reliability issues, automating processes, and ensuring the seamless operation of the platform. If you have a passion for technology …
Create the future of travel with us. Whether it's to visit the people closest to us, starting an exciting adventure, or a career-defining business trip, travel is an essential part of our lives. Yet we've all experienced …
new functionality. Maintaining and evolving our cloud infrastructure (GCP, Kubernetes) to ensure high availability, security, and performance. Managing service observability and reliability, including logging, metrics, and alerting (we use Prometheus and Grafana). Handling database and service upgrades (e.g. MySQL, Kubernetes), secrets management, and security best practices. Taking ownership of platform-level concerns such as deployment pipelines, configuration management, and cost … best practices across infrastructure and applications, including secrets management and credential rotation. Familiarity with infrastructure-as-code or automation tools is a plus. Experience with observability tools (such as Prometheus and Grafana), service monitoring, and debugging in production environments. A demonstrated interest in staying up to date with new technology, new frameworks, new languages, and other developments like AI. A …
London, South East, England, United Kingdom Hybrid / WFH Options
Salt Search
overcome technical barriers. Contributing ideas and helping raise capability across the team. Taking part in an out-of-hours escalation rota. Tech Environment: Core: Kubernetes (EKS on AWS), Karpenter, Prometheus, Terraform. Preferred: Service mesh (Cilium or similar), Flux/Argo, Ansible. Bonus: High-performance compute/GPUs in Kubernetes. What They're Looking For: 5-10 years' hands-on Kubernetes … EKS on AWS) experience - ABSOLUTE MUST. Strong skills with Terraform, Prometheus, and scaling infra. Collaborative and adaptable in a fast-paced environment where priorities shift quickly. Ability to solve technical challenges and mentor others through example. Culture: The environment is fast-moving, with priorities that change quickly in line with business needs. It's collaborative, technical, and high-output - you …
distributed systems, developing, profiling, and maintaining multi-threaded, asynchronous applications. JVM monitoring, profiling, performance tuning, and debugging. Experience with analysis tools such as JConsole, JVisualVM, Elasticsearch/Logstash, Prometheus, and OpenTracing. Extensive experience of test-driven development. Knowledge of CI/CD on large, complex systems. Experience working in the risk or pricing domain in investment banking, either … a good understanding of risk sensitivities and f2b processes. Hands-on experience with dynamic scalability: cloud deployment (EKS/Nomad), container/Docker deployment, gRPC services, and cloud-based services (Prometheus, Elasticsearch, databases, Redis). Experience with Kafka or event processing through a message bus. Experience with Workflow/Scheduling/State management …