AWS Site Reliability Engineer

We’re seeking an AWS Site Reliability Engineer (SRE) with strong incident operations experience to support and improve the reliability of cloud and data platform services across AWS and Snowflake.

This role is hands-on and operationally focused: proactive monitoring, rapid incident response, service restoration, root cause analysis, and automation to improve resilience and reduce MTTR.

What you’ll do

Lead incident triage, coordination, and resolution for AWS and Snowflake services in production
Monitor and respond to alerts, dashboards, and service health indicators
Perform root cause analysis (RCA) and drive post-incident remediation and continuous improvement
Create and maintain runbooks and operational procedures, and improve on-call readiness
Participate in and strengthen on-call rotations (including operational handovers)
Automate repetitive operational tasks to cut toil, improve reliability, and shorten MTTR

What you’ll bring (required)

Strong knowledge of AWS, including EC2, S3, IAM, VPC, Lambda, and CloudWatch
Experience with Snowflake administration and troubleshooting
Familiarity with observability tooling such as CloudWatch, Datadog, Grafana, and/or Splunk
Solid understanding of SRE principles: SLIs, SLOs, error budgets, and incident management
Scripting/automation skills in Python, Bash, and/or Terraform
