London, South East, England, United Kingdom Hybrid / WFH Options
Salt Search
production. Deploy, maintain, and optimise machine learning services within a cloud environment (AWS). Recommend and implement prompt management tools and provide expertise in prompt engineering. Introduce and manage observability, monitoring, and evaluation frameworks for ML and AI services. Enable auto-evaluation of prompts and models against domain-specific requirements. Build Python-based microservices, data pipelines, and serverless functions. Collaborate
achievable, ensuring delivery on time to an impeccable standard. Tech-wise, it is a microservices environment running on Kubernetes hosted in Azure. Distributed systems and cloud-native development. IaC, observability, and big bonus points if you have a grasp of an object-oriented programming language. Continuous improvement is key across technology, so if there's a better tool, it will be
APIs. Experience of writing performance-critical code. Experience of using Git or similar to track changes. Experience of both the full .NET Framework and .NET Core. Experience of using observability systems such as Elastic APM or Datadog to track and diagnose issues in production. A solid understanding of security principles and secure coding, including the OWASP Top 10. Nice to haves
teams to align on data architecture and ensure our ML systems meet overarching business objectives. Evolve our MLOps infrastructure, driving the strategy for model versioning, automated deployments, monitoring, and observability using modern tools like Prefect. Mentor and guide other members of the team, fostering a culture of technical excellence and continuous improvement through code reviews, design discussions, and knowledge sharing.
including Salesforce-specific pipelines. Build and maintain Infrastructure as Code (IaC) using Terraform and Ansible. Design highly reliable, scalable, and secure infrastructure supporting performance-critical workloads. Build proactive monitoring, observability, and alerting with Prometheus, Grafana, Azure Monitor, Datadog, and Dynatrace. Troubleshoot complex system issues spanning applications, networks, and infrastructure. Define platform SLAs, SLOs, and governance standards for self-service use. … Infrastructure as Code with Terraform and Ansible, along with scripting in PowerShell, Python, or Bash. Experience implementing GitOps workflows and managing platform SLAs, SLOs, and governance standards. Familiarity with observability and monitoring tools including Prometheus, Grafana, Azure Monitor, Datadog, or Dynatrace. Preferred experience supporting Salesforce DevOps pipelines and working with Java, .NET, or Node.js application environments. Exposure to AI/
secure handling of sensitive operational data and compliance with relevant standards. Developed and maintained robust APIs for system integration. Drove operational excellence and continuous improvement. Implemented and managed monitoring, observability, and troubleshooting tools for deployed systems. Designed and managed containerised applications (e.g., Docker, Kubernetes). Qualifications: Bachelor's degree in Computer Science, Engineering, or a related technical field. Relevant experience as
City of London, London, United Kingdom Hybrid / WFH Options
Rise Technical Recruitment Limited
data is delivered on time and without failure. The ideal candidate will have strong experience working with streaming and batch data systems, a solid understanding of monitoring and observability, and hands-on experience working with AWS, Apache Flink, Kafka, and Python. This is a fantastic opportunity to step into an SRE role focused on data reliability in a modern
and Engineering background. Proficient in writing infrastructure as code for public cloud. Experience with Python coding/testing or any cloud-based technology (AWS preferred). Good understanding of data observability. Good understanding of hosting platforms: Linux/Unix (EKS and container experience is a plus). Good understanding of databases, data lakes, and query engines; SQL/DDL experience is preferred. We
colleagues and clients across the Snowflake ecosystem. Experience in designing and delivering business solutions on other modern data platforms (e.g. Databricks, Azure, AWS or GCP native stacks). Experience with platform observability and CI/CD for data platforms. Hands-on experience with modern data engineering tools such as dbt, Fivetran, Matillion or Airflow. History of supporting pre-sales activities in a product or
a high-performing engineering team, splitting time between coding and people management. Drive delivery of new crypto product features end-to-end, from design to production. Ensure code quality, observability, scalability, and security are embedded in every release. Foster a collaborative, growth-focused team culture with clear goals and high accountability. Coordinate closely with Product, Design, and cross-functional teams
as Data Engineering and Product, to build a more effective and cohesive ML ecosystem. Deep expertise in data science and engineering best practices (version control, CI/CD, testing, observability) and a history of applying them to build robust, scalable machine learning systems. Exceptional analytical and problem-solving skills, with a demonstrated ability to define and solve highly ambiguous, complex
AI SRE assistant. Kubernetes promises agility, elasticity, reliability and high availability, but it also introduces complexity, high operational overhead, and cost overruns due to over-provisioning of workloads. Traditional observability only surfaces the "what"; Komodor goes further by delivering the "why", "where" and "how", providing a full platform to detect, investigate and remediate issues while optimizing workloads. By combining our
and help shape how platform engineering is done as the team continues to scale. Tech stack: AWS (core services: EC2, RDS, S3, IAM, etc.); configuration management with Ansible; monitoring and observability with Grafana and Prometheus; Kubernetes (building and managing production clusters); Terraform (IaC provisioning); Python or Java (scripting, automation); GitHub Actions (CI/CD pipelines). What They’re Looking For: Experience in AWS … cloud infrastructure (ideally in a regulated or high-traffic environment). Previous experience working with monitoring and observability tools. Hands-on Kubernetes know-how, specifically with EKS. Solid IaC experience with Terraform. Experience with containerisation (Docker, Helm) and CI/CD (GitHub Actions or similar). Solid scripting/automation experience with Python or Java. A good communicator who enjoys working collaboratively
the future team through recruitment and onboarding. Required Skills: We're primarily using AWS, utilising Lambda, ECS, SQS and API Gateway, among others. Our database engine is MongoDB and our observability platform is Datadog. Our application is written in TypeScript/Node.js and our infrastructure is defined in Terraform. Experience working with JavaScript/TypeScript, but also open to other languages
Azure SQL/T-SQL), including cutover and risk. Define conceptual/logical/physical data models and interfaces. Oversee ETL/ELT in Azure Data Factory, ensuring reliability, observability and cost optimisation. Guide PL/SQL to T-SQL conversion, recommending tooling (e.g. SSMA). Champion DevOps practices (Git/GitHub, CI/CD) and produce clear design/runbook
the use of Large Language Models. Take ownership of the design, deployment, and maintenance of machine learning models. Recommend, implement, and use tooling to improve the development, operations, and observability of machine learning models, large language models, and AI-related services. Essential skills: Previous experience working on online chat or chatbots, particularly voice, is a must-have. Strong recent hands
Selenium, Puppeteer). Orchestrate pipelines using Airflow, and manage data quality workflows. Model and transform data in SQL and Snowflake to create clean, analytics-ready datasets. Ensure data quality, observability, and governance across workflows. Collaborate closely with product managers, analysts, and engineers to deliver high-quality data products for dashboards and reporting. Technical Skills & Experience We're looking for candidates
London, South East, England, United Kingdom Hybrid / WFH Options
Lorien
growing to meet our business needs. What you'll lead: Shape and evolve the backend technical architecture to support product scale and complexity. Identify and drive improvements in performance, observability, and infrastructure. Lead the design of domain models aligned with evolving business needs. Be a go-to person for backend excellence and improve code quality. Engineering-centric requirement definition (user
F# are welcome). Proven track record of building and scaling distributed backend systems. Solid understanding of infrastructure-as-code and cloud orchestration (AWS, Terraform, Docker). Familiarity with queue management, observability tooling, and shipping in fast-paced environments. Awareness of GenAI and prompt engineering, or a keen interest in developing expertise in this area. A self-starter attitude, with a strong
London, South East, England, United Kingdom Hybrid / WFH Options
Method Resourcing
teams to operationalize models and ship ML-powered features into production. Continuously assess and iterate on production models, balancing long-term ML strategy with tactical improvements. Champion code quality, observability, and resilience within their ML systems through reviews and hands-on contributions. Help shape their internal ML standards and practices, ensuring they stay ahead of industry advancements. Offer technical mentorship
across AWS, Flink, Kafka, and Python layers. Lead post-incident reviews: identify root causes, document findings, and drive corrective actions to closure. Reliability & Monitoring: Design, implement, and maintain robust observability for data pipelines: dashboards, alerts, distributed tracing. Define SLOs/SLIs for data freshness, throughput, and error rates; continuously monitor and optimize. Automate capacity planning, scaling policies, and disaster-recovery
Streaming Data Strategy with a comprehensive approach to data control, compliance, and security; unconstrained by their infrastructure providers. Our platform mitigates data security risks while enhancing communication, automation, and observability across data flows, enabling teams to collaborate effortlessly across the organisation. With hubs in London and New York, we're looking for people who are passionate about our mission and
the architecture of our platform: modular, secure, scalable, and maintainable from day one. Define integration patterns across internal services and third-party providers. Own key infrastructure choices (messaging systems, observability, deployment strategies, etc.). Collaborate closely with Product Managers, Designers, and Mobile Engineers to shape end-to-end journeys. Be hands-on in code when needed, but primarily act as a
Write production-quality software with strong engineering rigor: designing clean APIs, building reliable systems, and collaborating closely with product engineers. Build high-reliability ML infrastructure: training pipelines, model registries, observability, and CI/CD for ML. Ensure ML solutions meet enterprise standards for security, compliance, data privacy (e.g., SOC2, GDPR), explainability, and auditability. Develop evaluation and monitoring frameworks that measure