production issues. Drive post-incident reviews, root cause analysis, and the implementation of preventive measures to continuously improve service reliability and reduce recurrence. Lead service improvement programs with a focus on proactive monitoring, early detection, and elimination of systemic issues. Champion best practices aligned with SRE principles to elevate …

… designing, developing, and deploying tools and scripts to reduce manual toil, improve operational efficiency, and enhance productivity. Identify opportunities for process automation and lead their implementation.

Plan, manage, monitor, and optimize production infrastructure including Linux hosts, distributed computing environments, database systems, and network components. Ensure robust capacity planning and resiliency ...