Site Reliability Engineer

Company Description

WALT Labs, a leading managed service provider, is dedicated to empowering businesses by harnessing the power of cloud technology. Our team specializes in delivering customized solutions tailored to meet the unique needs of our clients, driving growth and operational efficiency across industries. From supporting small businesses with seamless data migration to enabling large corporations to manage complex infrastructure projects, we provide exceptional service while staying at the forefront of cloud technology advancements.

Role Description

This is a full-time role based in Kings Cross, London, with a minimum of 3 days a week on-site. We are seeking a skilled Site Reliability Engineer with a strong focus on Google Cloud Platform (GCP) to join our dynamic team. In this role, you’ll be responsible for maintaining cloud infrastructure, managing incidents, and ensuring seamless operations for our clients. You’ll use tools like incident.io and JIRA to manage and resolve support requests efficiently.

Responsibilities

  • Serve as L2 on-call escalation point for complex technical issues requiring advanced troubleshooting
  • Lead response to critical incidents, coordinating multiple teams and ensuring effective communication
  • Provide expert-level support for GCP services including advanced networking, security, and architecture
  • Perform advanced Google Workspace administration including domain management, security policies, and integration
  • Use incident.io to manage escalated and major incidents and to coordinate war room activities
  • Optimize support workflows in JIRA, creating automation rules and improving ticket routing
  • Monitor and tune infrastructure performance using advanced Grafana queries and custom metrics
  • Lead technical projects including migrations, upgrades, and new service implementations
  • Create comprehensive documentation including architectural diagrams, runbooks, and best practices guides
  • Achieve a minimum of 50% billable hours through work for complex Cloud Assist/Managed Cloud customers and consulting engagements
  • Mentor Cloud Support Engineers and juniors through formal and informal training sessions
  • Identify and implement process improvements to increase efficiency and reduce resolution time
  • Conduct thorough root cause analysis for recurring issues and implement permanent fixes
  • Present technical solutions and recommendations to customer stakeholders and management
  • Design and implement monitoring strategies for complex multi-cloud environments
  • Develop automation scripts and tools to improve team efficiency and reduce manual work
  • Participate in pre-sales activities providing technical expertise for solution design
  • Review and approve changes to production environments following change management procedures
  • Lead knowledge sharing sessions and technical deep-dives for the team
  • Coordinate with vendor support for complex issues requiring manufacturer assistance
  • Maintain expertise in multiple GCP services and stay current with new feature releases
  • Participate in the business-hours escalation rotation

Qualifications

  • 3-5 years of experience with Google Cloud Platform
  • A minimum of 2 Google Cloud Professional certifications
  • Advanced Kubernetes knowledge and troubleshooting
  • Proficient in Infrastructure as Code (Terraform)
  • Strong scripting abilities (Python, Go, Bash)
  • Expertise with monitoring tools (Grafana, Datadog)
  • Experience leading incident response
  • Excellent communication and mentoring skills
  • Proven track record of process improvement
  • Ability to manage multiple priorities effectively
  • Strong customer service orientation

Benefits

  • 20 days of holiday plus bank holidays (earn an additional 1.5 days every 3 years)
  • Private health insurance

Job Details

Company: WALT Labs
Location: City of London, London, United Kingdom