reliability engineer develops and implements solutions to prevent them, ultimately enhancing the reliability of systems, equipment, and processes. Responsibilities: Analyzing equipment failure data to detect patterns and trends. Conducting root cause analysis to identify the underlying causes of issues. Creating and implementing new maintenance procedures. Designing and establishing new protocols for monitoring and testing equipment. Exploring new … is incorporated into all areas of the organization. System Reliability: Design and implement strategies to improve the availability, reliability, and performance of critical systems and applications. Incident Management: Lead root cause analysis for major incidents, identify systemic issues, and implement long-term solutions to prevent recurrences. Monitoring and Alerting: Develop and maintain robust monitoring systems to detect … issues proactively and optimize alerting mechanisms to ensure timely response. Capacity Planning: Analyze system usage patterns to predict future growth, optimize capacity, and ensure scalability. Failure Analysis: Conduct thorough failure analysis and implement fault-tolerant systems to minimize the impact of potential failures. Collaboration: Work closely with software engineering, DevOps, and infrastructure teams to design reliable architecture and
is for Team B Day Shift, the hours are 7 AM-7 PM Thursday - Saturday and every other Sunday. Responsibilities: Monitor network traffic for security events and perform triage analysis to identify security incidents. Respond to computer security incidents by collecting, analyzing, and preserving digital evidence, and ensure that incidents are recorded and tracked in accordance with SOC requirements. … Document actions taken and create technical reports detailing investigation efforts and case outcome to SOC Management and the client. Utilize technologies to conduct host forensics, Endpoint Detection & Response, log analysis, and network forensics (full packet capture solution). Provide cybersecurity root-cause analysis and investigative alerts to examine endpoint activity and network-based data. Conduct malware … analysis, host and network forensics, log analysis, and triage in support of incident response. Recognize attacker and APT activity, tactics, and procedures as indicators of compromise (IOCs) that can be used to improve monitoring, analysis, and incident response. Develop and build security content, scripts, tools, or methods to enhance the incident investigation processes. Isolate and remove malware.
hands-on role supporting high-availability systems, rapid deployments, and production incident response. Key Responsibilities - Manage and monitor AWS infrastructure for performance and security - Respond to production incidents, perform root cause analysis, and implement fixes - Maintain observability tools (Prometheus, Grafana, Splunk) and write PromQL queries - Improve and operate CI/CD pipelines using GitHub Actions and Kubernetes … Prometheus, Grafana, Splunk, and PromQL - Proficient in scripting (Python, Go, Bash, SQL) - Skilled in GitHub, CI/CD, and Kubernetes operations Desirable: - Experience with Terraform or CloudFormation - Advanced log analysis with Splunk - Strong problem-solving and analytical thinking
to-end tests on code commits and pull-requests. • Monitor pipeline health and test results; collaborate with DevOps to optimize build times, parallelize tests, and reduce pipeline flakiness. Result Analysis & Root Cause • Analyze test outputs, system logs, and metrics (e.g., via ELK Stack or Prometheus/Grafana) to pinpoint failures and performance regressions. • Lead root-cause … testing activity efficiently. An ISTQB Foundation Certification is a strong asset and shows your commitment to professional testing standards. A key part of this role involves problem investigation and root cause analysis, so strong analytical and communication skills are a must. You'll enjoy working as part of a collaborative team, contributing your insights to improve outcomes
cloud environments, including compute and storage scalability Containerisation & Virtualisation: Familiarity with virtual and physical server provisioning, especially in strategic data centres Platform Resilience & Observability: Designing for uptime, performance, and root cause analysis. Web Services & APIs: Used for Integration with 24+ LBGI systems Batch Processing: Understanding of batch suite performance and scheduling constraints RPA & Automation (Batching): Familiarity with robotic … process automation Log Aggregation & Analysis: Tooling for log interrogation and root cause analysis (e.g., Splunk, Dynatrace). Dashboarding: Real-time analytics dashboards for infrastructure and application health Support & Troubleshooting: Remote operations, incident response, and environment health checks. About working for us Our ambition is to be the leading UK business for diversity, equity and inclusion supporting
foundational understanding of cybersecurity operations, with specific exposure to threat detection and incident response. This role is critical to our Security Operations Center (SOC), providing 24/7 monitoring, analysis, and response to security events and threats across our enterprise. Key Responsibilities: Monitor computer networks in real-time for security issues and suspicious activity. Investigate and respond to security … breaches, cyber incidents, and anomalous behavior. Document security breaches and assess the scope and impact of each incident. Perform initial triage and analysis of alerts generated by security tools (e.g., SIEM platforms). Conduct forensic analysis of digital artifacts including disk images and log data. Assist with penetration testing and vulnerability assessments. Apply remediation measures to detected vulnerabilities … and provide security hardening recommendations. Support the deployment and monitoring of firewalls, encryption tools, and other security technologies. Generate incident reports and provide input for root cause analysis and lessons learned. Participate in deployable Incident Response Team (IRT) support tasks. Perform dynamic analysis and develop timelines and file signature comparisons during investigations. Required Qualifications: Hands-on
implement scalable, resilient, and secure infrastructure solutions aligned to organisational strategy Lead BAU operations across networks, firewalls, hosting platforms, and server endpoints Proactively monitor systems, troubleshoot issues, and conduct root cause analysis Own disaster recovery and business continuity planning, testing, and documentation Act as a subject matter expert on infrastructure and cybersecurity best practice Mentor junior engineers … Certifications such as ITIL, CCNA, Microsoft, VMware, or Citrix preferred Familiarity with automation tools (Ansible, Terraform) is a bonus Leadership and mentoring capabilities Data-driven decision-making and performance analysis Vendor and stakeholder management Strong problem-solving and risk mitigation skills Customer-focused with an eye for service delivery improvements Excellent communication and strategic thinking abilities If you are
teams to support feature development from concept through to deployment, focusing on delivering high-quality releases. Collaborate closely with developers and stakeholders to identify requirements and acceptance criteria. Perform root cause analysis of issues, identifying areas for improvement in both applications and processes. Review and update testing strategies, ensuring alignment with best practices in Microsoft and Linux
Dynamics 365 (D365) Finance and Operations, Business Central (F&O), or comparative ERP systems. (Certification in Dynamics 365 or a related ERP system is desirable.) Experience with data analysis, process mapping, root cause analysis and problem-solving in an ERP environment. Excellent communication and collaboration skills with internal and external stakeholders, with the ability to
in tuning and improving alerting thresholds in SIEM tools Create and maintain standard operating procedures (SOPs) Participate in cybersecurity drills and incident response exercises Collaborate with intelligence and threat analysis teams to enhance detection capabilities Document incidents and contribute to after-action reviews and reports Provide input on improving cybersecurity architecture and tooling SKILLS Cybersecurity & Threat Intelligence Real-time … Enterprise Security) Knowledge of threat actors, tactics, techniques, and procedures (TTPs) Familiarity with threat intelligence feeds and correlation Security Operations & Incident Response Incident triage and escalation procedures Conducting forensic analysis and evidence collection Threat hunting and anomaly detection Root cause analysis and post-incident reporting Technical & Analytical Proficiency Network traffic analysis and packet capture interpretation … Endpoint detection and response (EDR) tools and techniques Log analysis (system, application, network, firewall) Knowledge of intrusion detection/prevention systems (IDS/IPS) Scripting or automation with Python, PowerShell, or Bash (bonus) Security Frameworks & Compliance - Familiarity with frameworks such as NIST, MITRE ATT&CK, or ISO 27001 Understanding of government cybersecurity regulations and FISMA compliance Communication & Collaboration Clear
disciplinary teams, ensuring alignment with product and business goals. Provide mentorship and technical guidance to less experienced engineers. Promote collaboration across international and distributed teams. Engage in system architecture, root cause analysis, and continuous integration processes What We're Looking For: Degree in Computer Science, Software Engineering, or a related field. Professional level expertise in C++ development … Fitnesse, Cucumber), and hardware debuggers (e.g., Lauterbach) is beneficial. Familiarity with configuration management, including version control, automated build systems, release management, and technical documentation. Strong analytical skills in requirements analysis, user story development, backlog management, and estimation. Excellent communication, leadership, and interpersonal skills, with the ability to collaborate across teams and influence stakeholders. Experience in industrial printing or related
Monitor production systems and infrastructure, ensuring uptime and performance metrics are met Troubleshoot, diagnose, and resolve production issues in real time, minimizing service impact Manage incident response, including escalation, root cause analysis, and post-mortem reporting Collaborate with engineering teams to develop and implement monitoring tools, alert systems, and automated recovery processes Analyze system logs, metrics, and
entity. Serve as a senior incident responder, addressing emerging threats across the environment. Collaborate with infrastructure, network, and cross-functional teams to contain, investigate, and remediate security incidents. Conduct root cause analysis and participate in forensic investigations as needed. Enhance system visibility by expanding logging coverage and implementing additional monitoring capabilities. Maintain, update, and regularly test incident … Ability to manage time and prioritize work to maximize productivity Excellent communication skills (both written and verbal) Exceptional attention to detail and quality Excellent problem-solving techniques and trouble analysis skills Endpoint security concepts, controls, and best practices for Servers (e.g. Windows and Linux) General IT networking concepts, protocols, standards and network security concepts, controls, and best practices Cryptography
Role Come join a flexible and responsive IT environment to quickly meet evolving and changing mission priorities for an IC customer. Mission operations supported include strategic and tactical intelligence analysis and integration; screening, vetting, and watchlisting; situational awareness; and strategic planning. Operate and maintain the IT environment, integrate new systems into the environment from separate developers, design, develop, and … of systems, processes, and troubleshooting procedures. Provide Tier 3 support for IT infrastructure, including servers, storage, and networking equipment. Troubleshoot and resolve complex hardware, software, and network issues. Perform root cause analysis and implement long-term solutions to recurring problems. Manage and maintain network infrastructure, including switches, routers, firewalls, and VPNs. Monitor network performance and troubleshoot connectivity
new and updated system changes Developing, executing, and improving documentation for installation, configuration, hardening, and operations and maintenance tasks Ensuring compliance with IT infrastructure standards, policies, and procedures Conducting root cause analysis and resolving system and application faults and errors Ensuring operating systems and applications comply with Department of Defense (DoD) guidelines, including DISA Security Technical Implementation
operations and maintenance tasks Document activities, status, and issues worked on Provide input to and follow Configuration Management processes Ensure adherence to IT infrastructure standards, policies, and procedures Perform root cause analysis and resolve system and application faults and errors Maintain working knowledge of Microsoft Active Directory, Group Policy Objects (GPOs), DHCP, DNS, and PowerShell General understanding
of infrastructure components. 2. Monitoring and Incident Management: - Develop and maintain monitoring solutions to proactively identify performance bottlenecks, system outages, and other potential issues. - Participate in incident response and root cause analysis efforts to drive continuous improvement and prevent future incidents. 3. Reliability and Performance Optimization: - Optimise system performance, reliability, and cost efficiency through continuous monitoring, performance
Shrivenham, Oxfordshire, United Kingdom Hybrid / WFH Options
Gold Group
Collaborate with engineering teams to support unified access devices (UADs), endpoint management, and virtualized environments. * Provide hands-on support for automation scripts, workflows, and infrastructure monitoring tools. * Contribute to root cause analysis efforts for recurring platform incidents. * Support capacity planning and performance optimization by analysing system usage and trends. * Offer feedback on tools and processes, identifying improvements
Work closely with the tools team to evaluate, recommend, and optimize IT service management tools that align with ITIL standards. Assist with major incident management and problem resolution, ensuring root cause analysis and prevention of recurrence. Collaborate with cross-functional teams, including IT operations, service desk, and project management, to ensure seamless delivery of IT services. Define
and test network engineering/administration activities. • Create and maintain Standard Operating Procedures (SOPs) and technical documentation. • Provide follow-up reports (technical findings, feedback, and resolution steps taken) for Root Cause Analysis and process improvement initiatives. Required Qualifications: • Minimum of a Bachelor's degree in Science, with 12-15 years' experience or Master's degree with
and test network engineering/administration activities. Create and maintain Standard Operating Procedures (SOPs) and technical documentation. Provide follow-up reports (technical findings, feedback, and resolution steps taken) for Root Cause Analysis and process improvement initiatives. Required Qualifications Top Secret Clearance Minimum of a Bachelor's degree in Science, with 12-15 years' experience or Master's
activities with the NMCI Operations Manager, NOC Lead, Release Management team and other key stakeholders. •Tier III escalation support and vendor engagement supporting Incident Management activities. •Active participation in Root Cause Analysis for Problem Management activities. You'll Bring These Qualifications: •Requires B.S. Degree and 8-12 years of prior relevant experience or Masters with
analyze overall health of Splunk infrastructure to include daily indexing volume, search volume and performance, data source reporting, user activity reporting, and custom apps/dashboards/visualizations. Perform root cause analysis on any issues with recommendations. Implement tactical and strategic solutions to problems. Develop, manage, and maintain documents supporting Splunk architecture and operational processes. Data onboarding
and build environments using Infrastructure as Code with Terraform and configuration management tools like Ansible Automating repetitive tasks to eliminate toil and drive consistency + repeatability Incident response and root-cause analysis, supporting a blameless post-mortem culture As a Cloud Systems Engineer, you will have a direct role in providing infrastructure services to our development environment.
and test network engineering/administration activities. Create and maintain Standard Operating Procedures (SOPs) and technical documentation. Provide follow-up reports (technical findings, feedback, and resolution steps taken) for Root Cause Analysis and process improvement initiatives. Basic Qualifications Minimum of a Bachelor's degree in Science, with 8+ years' experience or Master's degree with 6+ years