and best practices Ensuring scalable, automated and secure systems Design & architect software Developing infrastructure as code to ensure synergy across platforms Design and implement monitoring, logging, and tracing, ensure observability and alerting are in place Setting up security best practices to maintain platform and data The candidate: GCP - as much as possible (GKE, Cloud Build, Cloud SQL, Pub/Sub More ❯
Liverpool, Merseyside, North West, United Kingdom Hybrid / WFH Options
Broster Buchanan Ltd
scalability and resilience in applications handling large volumes of traffic and burst events. Work collaboratively with cross-functional teams, including DevOps, Infrastructure, and Product, to deliver robust systems. Leverage observability tools to monitor, alert, and troubleshoot application and integration health. Stay current on AI-driven software development practices (e.g., GPT-assisted development, Agentic AI workflows) and suggest practical implementations. Participate More ❯
Kubernetes) at scale. Experience working with a cloud provider (AWS, Azure, or GCE), or sysadmin/SRE experience in data centers. Expertise in designing, building, and operating high-scale observability or infrastructure systems. Working knowledge of networking fundamentals; experience with CNIs or cloud networking infrastructure is preferred. What We Require 4+ years of professional software development experience on core infrastructure More ❯
mindset, from commit to production Collaborate directly with end-users and internal teams to understand needs and deliver value Operate across multi-cloud environments (AWS, GCP, Azure) Drive system observability and reliability with tools like Datadog Help shape our engineering culture by mentoring, sharing knowledge, and encouraging best practices Push boundaries, challenge assumptions, and ensure delivery of meaningful solutions Tech More ❯
Stoke-On-Trent, Staffordshire, West Midlands, United Kingdom
Evolution Funding Limited
as code (Terraform, AWS CDK, Serverless Framework, CloudFormation). Knowledge of microservices and event-driven architectures. Exposure to container technologies (Docker, ECS, EKS, Kubernetes). Experience with monitoring and observability tools (CloudWatch, Datadog, OpenTelemetry). More ❯
strong written and verbal communication skills. In addition to cloud development, supporting on premise Kubernetes clusters is required. Responsibilities include: Implement best practices in cloud infrastructure, emphasizing Site Reliability, Observability, and Scalability. Foster strong collaboration with various teams, working closely with Product, DBAs, Developers, DevOps, SRE, and Data Engineers to implement AWS standard methodologies, Infrastructure as Code (IaC), and cost More ❯
London (City of London), South East England, United Kingdom
HCLTech
Apache Airflow for orchestrating complex data workflows and ensuring reliable execution. Understanding of cloud security and governance practices including IAM, KMS, and data access policies. Experience with monitoring and observability tools such as CloudWatch. Experience working in Agile/Scrum environments, participating in sprint planning, retrospectives, and backlog grooming. Good to Have : Exposure to Azure data services such as Azure More ❯
execute solutions. Work with AWS cloud-native services (Lambda, Step Functions, DynamoDB) to develop efficient cloud-based applications. Ensure CI/CD best practices, contributing to GitLab pipelines, automation, and observability improvements. Integrate AI-powered tools (e.g., GitHub Copilot) to enhance development workflows. Drive continuous improvement in performance, security, and maintainability. Support cross-squad collaboration, ensuring architectural consistency and code reusability. Requirements More ❯
practices (encryption with KMS, Secrets Manager, IAM least-privilege, firewall rules, GuardDuty, CloudTrail) to maintain compliance and security posture. Troubleshoot and resolve complex cloud integration issues using monitoring and observability tools (CloudWatch, X-Ray, Datadog). Collaborate with cross-functional teams to deliver enterprise-grade integrations across SaaS applications, packaged apps, APIs, and legacy systems. III. Qualifications A. Required Qualifications More ❯
Dunn Loring, Virginia, United States Hybrid / WFH Options
ALTA IT Services
on designing and implementing scalable, reliable, and efficient systems. • Extensive experience in building a Knowledge Base for AI, cloud migration and engineering cloud centric environments. • Experience supporting AI Operations, observability and incident management. • Extensive experience with Agile practices including Scrum, Azure DevOps, Git and CI/CD. • Expert knowledge of project lifecycles and management methodologies. • Expert knowledge of engineering principles More ❯
cloud infrastructure and automation skills using AWS, Terraform, Python, and Lambda functions. This role involves designing, implementing, and maintaining service monitoring solutions while leveraging cloud-native technologies for scalable observability platforms. Key Responsibilities Design KPIs, service definitions, dashboards, and glass tables Configure correlation searches, events, and predictive analytics Build dependency mapping and topology visualization Deploy and manage AWS infrastructure with More ❯
scalability and reduce manual intervention. Operational Security, SRE & Assurance: Ensure security platforms are resilient, continuously monitored, and designed for 24x7 support and incident response readiness. Embed security telemetry and observability to enable proactive threat detection and automated response. Apply SRE principles to improve reliability, performance, and maintainability of security services. Lead platform health, patching automation, and vulnerability remediation workflows. Define More ❯
databases and retrieval strategies. Knowledge of recommender systems and ranking models. Familiarity with LLM evaluation tools (e.g., RAGAS, TruLens, LangSmith, Arize). Exposure to feature stores, data lineage, and observability stacks. Experience in e-commerce or retail environments. Demonstrable ability to weigh up build/buy/configure decisions in the LLM space. Charlotte Tilbury is a fast-paced and More ❯
Experience using LLM based workflows and integrating AI capabilities into applications Experience with cloud deployment services (AWS preferred, Azure, GCP) Knowledge of containerization and orchestration (Docker, Kubernetes) Experience with observability tools (Prometheus, Grafana, New Relic, Datadog More ❯
Newcastle Upon Tyne, Tyne and Wear, England, United Kingdom Hybrid / WFH Options
Lorien
cloud infrastructure on Azure or AWS. Driving Infrastructure as Code (IaC) practices using Terraform. Building and optimising CI/CD pipelines to accelerate delivery. Implementing and maintaining monitoring and observability with Prometheus and Grafana. Enabling team collaboration and incident response through Slack and other ChatOps tools. Leading, mentoring, and supporting engineers (or preparing to step into people management if you More ❯
Leighton Buzzard, Bedfordshire, United Kingdom Hybrid / WFH Options
Big Red Recruitment Midlands Limited
voice will be critical in its evolution. KEY RESPONSIBILITIES - Oversee day-to-day platform operations, including monitoring, incident response and troubleshooting. - Moving across orchestration, automation, pipelines, cloud services, observability and security domains. - Leading and managing short- and long-term project planning. - Developing and implementing cloud governance, security and compliance. - Leading automation and IaC improvements. - Providing mentorship and professional development More ❯
test solutions and automation frameworks using Python, Terraform, and modern cloud-native practices. Contribute to the platform’s CI/CD pipeline by integrating automated testing, resilience checks, and observability hooks at every stage. Lead initiatives that drive testability, platform resilience, and validation as code across all layers of the ML platform stack. Collaborate with engineering, MLOps, and infrastructure teams More ❯
etc.) Comfort with basic computer administration including software installation, system configuration, and networking. Comfort with git and automated build pipelines (Jenkins, GitLab CI/CD, etc.) Preferred Passion for observability (Elastic, APM, Grafana, etc.) Experience integrating software with a Large Language Model (LLM) Experience with retrieval-augmented generation (RAG) Production-grade software development experience with Python Service containerization and deployment More ❯
in piloting new products. Qualifications: Strong experience with Confluent Kafka and AWS cloud, including experience in building and operating solutions for high-scale distributed systems. Prior experience with enabling "Observability" using tools for Distributed tracing, Event logging, APM Synthetic monitoring. Understanding of SRE Practices. Experience in Automation. Experience in building self-service platforms. Prior experience with web services and messaging More ❯
teams. Desirable Microsoft Azure certifications. GraphQL (e.g. HotChocolate). Exposure to Kafka or other event-driven platforms. Knowledge of DevOps/IaC (Docker, Azure DevOps). Familiarity with Azure observability, identity, and security tools. Gitflow knowledge. Personal Qualities Customer-focused and improvement-driven. Positive, proactive, and collaborative. Strong problem-solving and influencing skills. Committed to personal and team development. Think More ❯