AI-Enabled Snowflake/Postgres EMEA Lead
We’re seeking someone to join our team as a Postgres and Snowflake Engineer in the Enterprise Computing Data Services Organization at Morgan Stanley. The Snowflake/Postgres Customer Engagement Team (CET) sits within this organization. It is part of the Data & Analytics Technology (DAT) fleet, responsible for managing mission-critical distributed database platforms such as Snowflake, Postgres, and Greenplum on public cloud and on-prem.

In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities.

This is a Principal Infrastructure Production Management & Reliability Engineering position at VP level, which is part of the job family responsible for maintaining the stability and reliability of the organization's infrastructure systems, ensuring optimal performance and availability to support business operations.

Since 1935, Morgan Stanley has been known as a global leader in financial services, continuously evolving and innovating to better serve our clients and our communities in more than 40 countries around the world.

What You’ll Do In The Role
- This position is for a senior platform operations manager of the Snowflake/Postgres CET, based in the Glasgow office, with Site Reliability Engineering (SRE) oversight and responsibility for managing and improving the global database infrastructure services.
- Take part in the on-call rotation and act as escalation manager.
- The successful candidate will also be the designated incident and escalation manager for the global production Data and Analytics infrastructure during EMEA time-zone hours.
- The person will also lead run-the-bank projects such as data center migrations, plant-wide version upgrades, release management, plant automation, database design and architecture, and performance monitoring and optimization.
- In addition, the person will participate in at least one squad as an SRE, following Agile practices and contributing to infrastructure modernization and automation.
- Develop and implement strategies to optimize the performance, availability, and reliability of infrastructure systems.
- Lead a team of engineers to ensure the smooth operation of production environments and adherence to SLAs.
- Collaborate with other technology teams to drive automation initiatives and improve operational efficiency.
- Define and enforce best practices for incident management, problem resolution, and change management processes.
- Evaluate emerging technologies and tools to enhance infrastructure monitoring and management capabilities.
- Provide guidance and mentorship to junior engineers to foster their professional growth and development.
- Create and maintain documentation for systems, processes, and procedures to ensure knowledge sharing and continuity.
- Act as a key stakeholder in planning and executing disaster recovery and business continuity exercises to mitigate risks and ensure resilience of infrastructure systems.
- Represent the department in senior leadership meetings and strategic planning sessions.
- Bachelor's degree or equivalent experience in Computer Science, Information Technology, or related field.
- Visionary leadership and strategic planning skills.
- Seasoned executive with significant experience in leading infrastructure functions within a global organization.
- Visionary leader with a track record of shaping infrastructure strategy and driving transformational initiatives.
- Strategic thinker with the ability to align technology with business objectives.
- Strong communication skills with the ability to influence at the executive level.
- Experience in managing large-scale technology programs and projects.
- Expertise in governance, risk management, and compliance.
- Proven ability to drive organizational change and foster a culture of innovation.
- 10+ years of overall enterprise-level IT experience.
- Strong domain expertise in distributed database platforms, both on-prem and in the cloud, such as Snowflake, Postgres, or Greenplum.
- Strong shell scripting and Python programming skills for SRE-related work.
- Advanced Linux / Unix skills
- Experience using Splunk or the Grafana/Prometheus/Loki stack
- General understanding of project management, database design and architecture, data integrity and security, and disaster recovery and backup.
- Knowledge of Agile methodologies
- Effective oral and written communication skills, along with the interpersonal skills to work well in a team environment, are required.
- Strong organizational and coordination skills, with the ability to manage multiple tasks and high-pressure situations during outage handling, management, and resolution.
- Strong incident management skills with a sound understanding of ITIL procedures.
- Be available for weekend work.
- Deploy, optimize, and manage enterprise-scale distributed database platforms such as Greenplum, Snowflake, and Postgres.
- Respond to incidents, troubleshoot issues, and conduct root cause analysis.
- Design, implement, and maintain disaster recovery and high-availability solutions.
- Automate plant-wide operational tasks related to provisioning, monitoring, backups, scaling, and recovery.
- Monitor system health, identify performance bottlenecks, and implement optimizations.
- Collaborate with development teams to support schema design, query optimization, and database best practices.
- Ensure data security, compliance, and access controls are enforced.
- Participate in on-call rotations and incident response.
- Experience with database deployment, upgrades, backup/restore, and schema management in production environments
- Proficiency in database monitoring, performance tuning, and troubleshooting
- Familiarity with distributed, OLTP, and OLAP database environments deployed on-prem or in the cloud, such as Greenplum, Postgres, and Snowflake.
- Familiarity with cloud platforms (AWS, Azure) and cloud-native databases
- Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible, CloudFormation)
- Automation and configuration management using scripting languages (Python, Bash, etc.)
- Setting up and using monitoring, logging, and alerting tools (Prometheus, Grafana, ELK/EFK, Datadog, etc.)
- Understanding of Service-Level Indicators/Objectives/Agreements (SLI/SLO/SLA)
- Designing and implementing HA/DR solutions (failover, automated recovery, geo-replication)
- Running and reviewing disaster recovery drills
- Incident response and on-call support for database outages or performance issues
- Root cause analysis and post-mortem writing
- Capacity planning and scaling distributed systems
- Change management and production rollout best practices
- Experience with container orchestration (Kubernetes, Docker) for database workloads
- Familiarity with CI/CD pipelines and database migration automation
- Knowledge of regulatory compliance (GDPR, HIPAA) as it pertains to data storage and handling
- Strategic thinking and problem-solving.
- Familiarity with modern data architectures and cloud services.
- Strong organizational and documentation skills