such as Azure, AWS or GCP. Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar. Proven track record of maintaining highly available and performant production environments. Ability to identify and implement effective mitigation strategies and operational playbooks. Useful/Bonus Skills
OWASP Top 10, and threat modeling. Proficiency in cloud platforms (AWS, Azure, GCP) and associated reliability tools. Hands-on experience with monitoring and logging tools such as Prometheus, Grafana, Datadog, Splunk, or ELK stack. Familiarity with containerization and orchestration tools (Docker, Kubernetes). Strong understanding of distributed systems, fault tolerant design, and high availability architectures. Experience in root cause analysis
as Kubernetes or Amazon ECS to streamline application deployment, scaling, and management. Monitoring and Logging: Implement monitoring and logging solutions using tools such as Prometheus, Grafana, ELK Stack, or Datadog to monitor system performance, detect issues, and troubleshoot problems proactively. Security and Compliance: Implement security best practices and compliance standards within DevOps processes and infrastructure, ensuring the security and integrity
Edinburgh, Midlothian, United Kingdom Hybrid / WFH Options
Aberdeen
tech talks to share knowledge and promote adoption of tools and practices. About the Candidate The ideal candidate will possess the following: Experience with observability tools (e.g., Grafana, Prometheus, Datadog). Background in DevOps, SRE, or platform engineering with a security-first mindset. Strong programming skills in languages such as .NET, JavaScript, Python or similar. Experience with CI/CD
as ECS, Kubernetes, and Docker. Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others. Preferred qualifications, capabilities, and skills: Knowledge of GenAI tools such as Copilot or Windsurf and how to use them as code assistants. Ability to expand and
configuration management tools (e.g., Ansible, Puppet, Chef). Knowledge of infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation). Experience with monitoring and logging tools (e.g., Prometheus, ELK Stack, Datadog). Passion for continuous learning and professional development. ABOUT BUSINESS UNIT IBM Consulting is IBM's consulting and global professional services business, with market leading capabilities in business and technology
distributed systems, microservices architecture, and RESTful API design. Hands-on experience with Kubernetes and container orchestration. Familiarity with monitoring, alerting, and logging tools (e.g., Prometheus, Grafana, ELK stack, or Datadog). Experience with Elastic will be highly helpful for this position. Hands-on experience with incident response, including designing and improving incident management processes. Expertise in Observability practices, including metrics
as needed. Experience with relational and non-relational databases. Experience delivering high levels of observability and proficiency in improving early warning systems, for example, having worked with Grafana/Datadog/Prometheus. Collaborating with internal/external teams/engineers and fostering an inclusive environment, where all points of view are welcomed and encouraged. Own and lead multiple domains of
London, South East, England, United Kingdom Hybrid / WFH Options
Harnham - Data & Analytics Recruitment
optimisation Nice to Have Experience with ML tooling (MLflow, Kubeflow) Knowledge of FastAPI, Databricks, or Snowflake Exposure to SRE practices or cloud security certifications Familiarity with Prometheus, Grafana, or Datadog Interested? If you want to be part of a world-class AI team at an early stage, where your infrastructure decisions will directly shape the company's success, apply today
engineering (SRE), or a similar role. Proficiency in cloud platforms (AWS, Azure, GCP) and associated reliability tools. Hands-on experience with monitoring and logging tools such as Prometheus, Grafana, Datadog, Splunk, or ELK stack. Proficiency in scripting languages like Python, Bash, or Go for automation. Familiarity with containerization and orchestration tools (Docker, Kubernetes). Strong understanding of distributed systems, fault
frontend architecture (e.g., Module Federation or Single-SPA). Experience with cloud-native DevOps tooling: Docker, Kubernetes, AWS/GCP deployments. Proficiency in analytics and observability tools like Sentry, Datadog, or LogRocket. Soft Skills Strategic thinker with strong problem-solving and decision-making skills. Ability to work in fast-paced, agile environments with cross-functional teams. Clear communication and documentation
as the operating system for car parking. Contribute to the improvement of our CI pipelines for both backend and IoT deployments. Improve our monitoring system for our services with Datadog. Assist in scaling up our systems for managing thousands of parking lots. Shape our engineering culture by employing modern software engineering practices, focusing on writing clean, well-tested, and efficient
translate requirements into scalable technical solutions. Very good knowledge of SQL, Python, ETL design, dbt, Airflow. Experience with Infrastructure as Code (Terraform), CI/CD (GitHub), and monitoring (e.g., Datadog, Grafana, Prometheus). Understanding of data governance, Unity Catalog, cost monitoring, and metadata management is an advantage. Initial experience with Jira and Confluence is desirable. Strong communication skills, team orientation
tuning. Lead technical triage and root cause analysis for infrastructure-related issues. Develop and deploy applications using Docker and AWS Fargate. Use CloudWatch, CloudTrail, and third-party tools like Datadog for performance and cost efficiency. Configure AWS networking (VPCs, TGWs), enforce governance via AWS Config and tagging policies. Maintain architecture diagrams, SOPs, and collaborate across engineering and product teams. Should
Experience of using Git or similar to track changes Experience of both the full .NET Framework and .NET Core Experience of using observability systems such as Elastic APM or DataDog to track and diagnose issues in production A solid understanding of security principles and secure coding including OWASP Top 10 Nice to haves: o Experience in VOIP, (SIP and RTP
re looking for someone with deep expertise in: o Infrastructure as Code: Terraform, CloudFormation o Security best practices: IAM, KMS, encryption in transit/at rest, DevSecOps o Monitoring & observability: Datadog, Prometheus, Grafana, ELK, or similar What You Bring o 6+ years in DevOps or platform engineering, with experience in a technical lead role. o Proven experience designing and operating cloud
Out in Science, Technology, Engineering, and Mathematics
technical, ambiguous domains. Strong knowledge of REST APIs, distributed system design, and performance optimization. Experience with both SQL and NoSQL data stores, caching layers, and observability tooling (e.g., Prometheus, Datadog). Nice to have: Experience deploying or integrating LLMs or NLP models in production systems. Comfortable balancing short-term execution with long-term architectural thinking. Passion for building highly
Richfield, Ohio, United States Hybrid / WFH Options
Charles Schwab
and support. 6-8 years of experience in writing automation scripts, building application dashboards for proactive monitoring, and setting up alerts for early detection of issues in Splunk, Grafana, Datadog, etc. 6-8 years of experience practicing SDLC (Software Development Lifecycle) practice, process improvements. Hands-on enterprise systems administration, monitoring, and deployment activities. Experience with Windows 2016, 2019, 2022 hosted