Southampton, Hampshire, United Kingdom Hybrid / WFH Options
NICE
experience of Grafana Observability Suite (Loki, Mimir, Tempo). Administration and/or development experience of standard monitoring and automation tools such as Splunk, Datadog, PagerDuty, Rundeck. Familiarity with configuration management tools like Ansible, Puppet, or Chef. Certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or…
IAT Certification; higher levels preferred. • Experience with serverless computing (AWS Lambda, Azure Functions, etc.). • Familiarity with logging and monitoring platforms like CloudWatch, Prometheus, Datadog, or Splunk. • Experience with CI/CD tools like Jenkins, GitHub Actions, or GitLab CI. • An adjudicated Counterintelligence Polygraph. Soft Skills: • Self-driven • Strong communication…
modeling. Proficiency in cloud platforms (AWS, Azure, GCP) and associated reliability tools. Hands-on experience with monitoring and logging tools such as Prometheus, Grafana, Datadog, Splunk, or ELK stack. Familiarity with containerization and orchestration tools (Docker, Kubernetes). Strong understanding of distributed systems, fault-tolerant design, and high-availability architectures.
AWS, Azure, or GCP. Manage infrastructure as code using tools like Terraform. Monitor and maintain production systems using tools such as Prometheus, Grafana, or Datadog. Collaborate with development and QA teams to improve deployment processes and system reliability. Contribute to incident response, troubleshooting, and root cause analysis. Requirements: Approximately…
and practice maintaining uniformity and cleanliness in large codebases and infrastructure projects. Desirable Skills & Experience: Hands-on experience monitoring large production infrastructure using Datadog and CloudWatch. Previously owned end-to-end responsibility for a service, including development and production support. Experience using configuration management tools such as Chef, Ansible…
rate limiting, IP reputation controls, and downstream system protections (e.g., ECIS IronPort). Proactively monitor email traffic and build custom dashboards in Splunk and Datadog to detect misuse, anomalies, or policy violations. Implement remediation actions via automated and manual rule sets. Collaborate with application and DevOps teams to onboard and…
optimization. Configure and maintain cloud-based services and resources. Monitoring and Logging: Implement and maintain monitoring and logging systems (e.g., Prometheus, Grafana, ELK stack, Datadog). Set up alerts and notifications for critical system events. Analyze logs and metrics to identify and resolve performance issues. Automation and Scripting: Develop and…
GCP. Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation. Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar. Proven track record of maintaining highly available and performant production environments. Ability to identify and implement effective mitigation strategies and operational…
GCP. Background knowledge and hands-on practice in Observability, specifically experience working with one or more of the following tools: Kibana, OpenSearch, Grafana, Datadog, Sumo Logic, New Relic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFx, Instana, Splunk, Honeycomb, Jaeger. Hands-on experience with Infrastructure as Code (Terraform/Ansible). Hands-on experience…
e.g. JIRA, Confluence. Monitoring, Logging, and Performance Tuning - Skills in monitoring systems' performance and logs to ensure uptime and identify performance bottlenecks - e.g. Grafana, Datadog. Networking Concepts - Knowledge in TCP/IP, DNS, VPN, load balancing, and firewalls. Security Best Practices - Implementing security in DevOps (e.g., IAM policies, network security…
cloud environments. Familiarity with cloud security principles and best practices (e.g., IAM, encryption, threat monitoring). Experience with monitoring and alerting tools (e.g., CloudWatch, Datadog, Prometheus). Strong problem-solving, troubleshooting, and communication skills. Preferred Skills: Cloud certifications (e.g., AWS Certified SysOps Administrator, Microsoft Certified: Azure Administrator, Google Cloud Professional…
/Sub, RabbitMQ, Kafka). Excellent communication skills; thrive in a fully remote, high-autonomy environment. NICE TO HAVE: SRE & Kubernetes expertise (GKE) and Datadog/Prometheus/Grafana observability stacks. Experience with large-scale 3D asset pipelines, real-time rendering or streaming media. Contributions to open-source Unreal or…
and Active Directory. Experience with disaster recovery and redundancy strategies in both cloud and on-premises environments. Proficiency with leading monitoring tools, such as Datadog, Splunk, Prometheus, Grafana, ELK Stack, and New Relic. Programming expertise, especially in systems programming languages (e.g., Java, Kotlin, Scala) and databases (e.g., SQL Server, PostgreSQL…
Knowledge of scripting or programming languages, such as Python, PowerShell, or Bash. Familiarity with log management and monitoring tools (e.g., Splunk, Datadog, or ELK stack). Experience with SIEM and/or SOAR tools and capabilities. Travel: Less than 10% travel is expected for this position. Travel may include…
CloudFormation, and manage resources for optimal performance. Monitor, troubleshoot, and resolve incidents, optimizing systems to ensure reliability and minimize downtime. Implement monitoring (Prometheus, Grafana, Datadog) and set up alerting systems to proactively address issues and ensure scalability. Work with DevOps, engineering, and security teams to improve application deployment, infrastructure management…
Proven ability to troubleshoot complex infrastructure issues, perform root cause analysis, and implement system improvements. Experience with monitoring and alerting systems like Prometheus, Grafana, Datadog, or equivalent. Excellent communication and collaboration skills, with the ability to work cross-functionally and explain technical concepts to non-technical stakeholders.
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
Smart DCC
Develop automated test suites for data pipelines, ensuring data quality and transformation integrity. Monitoring & Performance Optimization: Monitor data pipelines with tools like Prometheus and Datadog to ensure optimal performance and health. Proactively implement anomaly detection and optimize system performance and resource allocation. Collaborate with cross-functional teams to align DataOps…
Actions, GitLab, Jenkins, TeamCity. Scripting languages such as PowerShell, Bash. L1 to L3 networking. Logging and monitoring systems, and visualisation tools, such as Splunk, Datadog, Log Analytics, CloudWatch, ELK, Grafana, Power BI, Prometheus, Application Insights. IaC tools such as Terraform, CloudFormation, Chef, Ansible, Puppet, Pulumi, Bicep. Database systems such as MSSQL…
of networking, containers (Docker, Kubernetes), and cloud infrastructure (AWS/GCP/Azure). - Strong skills in monitoring, observability, and alerting systems (Prometheus, Grafana, Datadog, etc.). - Proficiency with infrastructure-as-code tools like Terraform or Pulumi. - Experience with CI/CD pipelines and GitOps practices. - Excellent communication and incident…
in the process of containerization for applications and their subsequent orchestration within Kubernetes environments. Experience working on at least one monitoring/observability stack (Datadog, ELK, Splunk, Loki, Grafana). Strong knowledge of Unix or Linux. Strong communication skills to collaborate with various stakeholders. Able to work independently in a…
culture and cloud platforms such as Azure, AWS, or GCP. Knowledge of Kubernetes is desirable. Experience monitoring application performance with tools like Grafana, Prometheus, Datadog, or Sentry. Strong advocate for performant applications and simplicity in problem-solving. Excellent communication skills for collaborating across audiences. Experience with high-traffic service development…
Unix Shell. Deep understanding of software applications and technical processes, with emerging expertise in specific disciplines. Experience with observability tools like Grafana, Dynatrace, Prometheus, Datadog, Splunk, including monitoring, SLO alerting, and telemetry collection. Knowledge of CI/CD tools such as Jenkins, GitLab, Terraform. Experience with containers and orchestration tools…
of cloud platforms (AWS, GCP, Azure) and modern infrastructure technologies (Kubernetes, Docker, Terraform). Expertise in monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk). Proficiency in at least one programming or scripting language (e.g., Python, Go, Bash). Deep understanding of networking, databases, and distributed systems. Strong…