Site Reliability Engineer

Job Title: GCP SRE

Location: Manchester, Leeds or Halifax (Hybrid 3 days a week)

Duration: 6 months, extendable

We're looking for a Google Product Site Reliability Engineer to join our Public Cloud Platform. You’ll have a unique opportunity to join an ambitious team strengthening observability, reliability and operational excellence across our GCP platform, driving our tech modernization agenda and enabling us to become the biggest Fintech in the UK.

The ideal candidate will have demonstrable experience in Cloud engineering and Observability platforms, and a passion for technology. Commitment to delivering high-quality, scalable solutions is a must. You'll partner closely with the product engineering teams to ensure systems are observable, reliable and operable at scale.

What you’ll do

• Define and evolve observability standards across metrics, logs, traces and events

• Partner with teams to ensure services are observable by design

• Use Dynatrace as the primary observability tool to ensure effective instrumentation and coverage, meaningful dashboards, and SLO-based alerting aligned to user impact

• Be a hands-on engineer, maintaining our Infrastructure as Code and CI/CD pipeline-based products and services by responding to change, implementing enhancements and improving reliability and customer experience

• Observe, investigate and fix service issues with an engineering mindset, resolving them via code changes and implementing improvements to prevent repeat issues

• Implement further automation and reduce toil by using existing Cloud tooling or introducing new technologies

What you’ll need

• Certifications and experience working with Google products

• Strong DevOps understanding, including experience of Infrastructure as Code and CI/CD pipelines, such as Terraform and Jenkins, or alternatives such as Azure DevOps

• Hands-on with Observability Tooling (Observability as Code and SLO-based Dynatrace Monitoring)

• Ability to quickly understand, update and write code in languages such as Python, Groovy, Bash and PowerShell

• Experience of developing and administering Kubernetes clusters in a production environment

• Strong experience in automating to remove toil

• Strong knowledge of incident management and issue resolution

• Strong knowledge of Infrastructure as Code and creating modular, easy-to-maintain code

• A strong understanding of Cloud security, networking and APIs

• Experience in problem-solving, with demonstrable logical thinking and excellent troubleshooting skills

• Strong understanding and demonstrable use of source control practice and collaborative working as part of an engineering team

• A demonstrable passion for continuing to learn and develop your engineering skills
