Network Site Reliability Engineer, Associate
About This Role

Platform Engineering

Network and Data Center Services (NDS) is part of the Platform Engineering pillar, the backbone for both the client and the investment lifecycles. The Network SRE domain is a vertical within NDS responsible for automating and improving the handling of repetitive engineering and operational issues across a hybrid platform spanning on-premises and public cloud environments. The team collaborates closely with the engineering teams to convert manually handled incidents and issues into orchestrated and automated solutions.

About the Role:

We are seeking a talented and driven Network SRE to join our growing team. This role is ideal for a highly skilled network engineer with SRE expertise who is passionate about integrating and automating network infrastructure within an SRE framework. You will play a critical role in the implementation and support of our network and monitoring infrastructure, focusing on the deployment and configuration process to ensure the stability and scalability of our systems. You will be part of a collaborative, innovative team where you can make an impact, and you will work on exciting, large-scale projects that challenge and expand your skills.

Key Responsibilities:

As an SRE engineer, you will:
- Lead the SRE function, applying agile and SRE principles to deliver automation of on-prem and cloud infrastructure.
- Collaborate closely with the Engineering teams to ensure smooth integration of network services into broader infrastructure pipelines.
- Be hands-on, improve operational efficiency, and develop a vision that leads to our DevOps team's long-term success.
- Design and manage a network configuration code library that deploys secure network infrastructure via CI/CD pipelines.
- Define, plan and execute strategic roadmaps for self-service, highly scalable, cost-efficient, observable, auditable, and reliable infrastructure services as standard practice, including DevOps and automation.
- Develop Python scripts to streamline and automate network tasks and augment our monitoring tools.
- Implement and manage CI/CD pipelines, with a strong focus on automation, using Azure DevOps.
- Develop, test, and manage infrastructure as code (IaC) and maintain accurate configuration management using best practices.
- Provide support for network-related incidents, identify root causes, and implement preventative measures with automation.
- Incorporate SRE-centric principles to ensure the reliability, performance, and scalability of large-scale infrastructure.
- Develop an SRE program geared towards reducing incident count and achieving 100% network uptime.
- Automate disaster recovery plans.
- Keep systems up and reliable, mitigate broken systems, and prevent future disruptions.
- Maintain production stability and respond to on-call incidents.
- Partner with Engineering teams to develop post-change and post-build validation and checkout processes.