incident detection, notifications, triage, and resolution. Key Responsibilities: Pipeline Approach: Adopt a pipeline approach to enable observability of services deployed across multiple environments, balancing monitoring, logging, and tracing based on service classification. Intelligent Alerts: Design and build intelligent alerts using pipelines, onboarding automated runbooks triggered with clear audit/… logs in service management tools like Jira Service Management. Dashboards: Create and maintain dashboards for proactive monitoring of services to help teams resolve incidents quickly. Monitoring Capability: Continuously improve monitoring capabilities to identify key alerts and thresholds for early warnings before services fail. Automation: Enable intelligent … and commercial observability tools (e.g., Prometheus, Grafana, New Relic). Expertise in cloud environments (e.g., AWS, Azure) and infrastructure as code (IaC) tools like Terraform. Monitoring and Observability: Experience in creating and maintaining dashboards for proactive monitoring of services. Ability to design and build intelligent alerts using pipelines …
Bradford, Yorkshire, United Kingdom Hybrid / WFH Options
Freemans Grattan Holdings (fgh)
The E-commerce DevOps Engineer role is responsible for managing and optimising software deployment processes for E-Commerce B2C websites and shopping apps, and for proactively monitoring and reporting on E-Commerce application and infrastructure performance. The role involves: Working collaboratively with software architects, software engineers and network, infrastructure and operations teams … to ensure smooth deployment, scalability and security of E-Commerce B2C websites and shopping apps using CI/CD pipelines and performance monitoring tools. Monitoring E-Commerce system performance, optimising caching, ensuring uptime and responding to incidents. WHAT YOU'LL BE DOING: Further developing and managing CI/… CD pipelines to automate deployment and reduce release cycle times. Ensuring website availability, performance and security through proactive monitoring and incident response, and implementing website performance monitoring and optimisation strategies to improve page load times, identify, diagnose and resolve issues, and enhance customer experience. Enhancing system observability …
Chorley, Lancashire, North West, United Kingdom Hybrid / WFH Options
Nextech Group Limited
Follow ITIL-aligned processes for escalation and management of incidents. Participate in an On-Call Rota for out-of-hours incident response. System Maintenance & Monitoring: Perform regular system health checks on client infrastructure, including servers, networks, and backups. Implement preventive maintenance plans and updates to minimise downtime. Proactively monitor …
Watford, Hertfordshire, East Anglia, United Kingdom
Cognizant
Asset Management (stock level checks, tracking, receiving, preparing and shipping assets). Provide VIP support as required, including expedited end-user device troubleshooting, proactive support, proactive monitoring and health checks, targeted training on new tools for executives, custom onboarding processes for executives, etc. Stay updated with …
company’s IT infrastructure, ensuring the seamless operation of hardware, servers, storage, and network systems, both on-premises and in the cloud. This includes proactive monitoring, troubleshooting, and support across various IT functions such as hardware, security, and application management. The role also oversees IT service desk operations … service delivery, and continuous improvement of IT processes to meet IT needs and enhance user satisfaction. Duties and responsibilities: Responsible for installing, configuring and monitoring IT hardware. Responsible for supporting and monitoring IT servers. Monitor and support storage appliances and IaaS solutions for performance, availability and security. Ensure …
What you’ll be doing: Public Cloud Infrastructure Management, which involves provisioning, configuring and maintaining various Cloud resources to ensure scalability, reliability and security. Monitoring and Performance Optimisation by implementing monitoring solutions to track performance and identify areas for optimisation to enhance user experience and automate improvements where … possible. System Availability and Reliability by ensuring high availability and data integrity through proactive monitoring, alerting, backups and DR planning and testing. Continuous Improvement by staying updated with the latest Cloud & Infrastructure technologies and continuously evaluating and proposing enhancements to existing systems, services and processes whilst also ensuring …
Level Agreements for fault resolutions and service request completions. Provide customer service to internal and external customers to ensure a consistent experience. Adopt a proactive approach towards all client activities. Day-to-day incident management and proactive monitoring of IT Security Systems and associated platforms and components … networking hardware and software products. Support end-user workstation hardware, software, networked peripheral devices, cabling, and networking hardware and software products by testing, maintaining, monitoring, and troubleshooting to determine the source of computer problems (hardware, software, user access, etc.). Conduct updates of technical documents and knowledge base to … where any additional hardware or software is included within the network component inventory. Prepare, maintain, and adhere to procedures for logging, reporting, and statistically monitoring network data as directed. Adhere to business continuity and disaster recovery plans, and maintain current knowledge of plan executables. Respond to emergency network outages …
Basingstoke, Hampshire, United Kingdom Hybrid / WFH Options
Nomios UK&I Limited
play a pivotal role in our customer support processes, ensuring the seamless operation of UK-based ISP and Enterprise networks. Your responsibilities will include proactive monitoring, maintenance, and troubleshooting to deliver optimal performance and reliability. You will be integral to providing 24/7 coverage and support to … quo. A love for learning and obtaining certifications, coupled with an entrepreneurial mindset, will set you apart. We're looking for someone who is proactive, eager to grow, and ready to mentor others while contributing to a dynamic and supportive team. Responsibilities: Key responsibilities of the role include: Network … Monitoring & Incident Management: Monitor Nomios customer network infrastructures (routers, switches, firewalls, servers) using various NOC tools. Identify, troubleshoot, and resolve network issues affecting service availability, performance, and reliability. Respond to alerts and notifications to ensure incidents are resolved promptly within defined SLAs. Escalate unresolved issues to the appropriate technical …
critical role in place, we anticipate enhanced integration of Generative AI deployments, consistent AI performance, and the unlocking of transformative AI-driven initiatives. This proactive approach will empower Mars to scale digital experiences with the trust and agility that our stakeholders expect, positioning us at the forefront of innovation … frameworks such as TensorFlow, PyTorch, LangChain, or similar technologies. Demonstrated ability to lead cross-functional teams and operate within complex enterprise ecosystems. Familiarity with monitoring, observability, and platform telemetry tools (e.g., Prometheus, Grafana, Azure Monitor). Exceptional communication and stakeholder engagement skills to partner with business, technical, and governance teams … to enhance platform reliability and resilience, proactively addressing potential challenges. LLMOps Implementation: Develop and operationalize Large Language Model Operations (LLMOps) practices, encompassing model deployment, monitoring, versioning, rollback, and performance tuning at scale. Ensure efficient management of AI models to maximize their effectiveness and business impact. Service Management & Support: Establish …
might arise. Liaise with third party partners, suppliers and other parties when required. Maintain the security, integrity and performance of our systems through regular, proactive monitoring and housekeeping. Keep colleagues informed regarding any issues which arise, take remedial action where necessary, using available tools where applicable. SKILLS, KNOWLEDGE …
Leeds, West Yorkshire, Yorkshire and the Humber, United Kingdom
Virgin Money
drive business value through enhanced productivity and collaboration by understanding colleague needs and the wider collaboration technology landscape. Ensure product reliability, performance, and security through proactive monitoring and incident management, leveraging observability tooling to increase issue prevention and proactive support. Use data-led insights and continuous improvement to …