Site Reliability Engineer

Generative Engineering is bringing AI design into the real world by enabling generative engineering design for physical products. Our focus is creating millions more engineers globally and giving them the data and knowledge they need to make efficient decisions quickly, which is one of the main challenges facing the physical engineering industry today.

Our team has a background in scaling software to millions of users and successfully disrupting industries, creating unicorns and decacorns along the way. We combine the advantages of an early-stage start-up with the ability to focus on creating high-quality, high-impact systems, without the distraction of fundraising.

We’re looking for a Site Reliability Engineer to keep our platform fast, available, and trustworthy as we scale. You’ll own the AWS and Terraform footprint behind our services, build the CI/CD and observability that let us ship without fear, and be the person who can drop into a misbehaving container and actually figure out what’s going on.

Must Haves

Any depth of SRE, DevOps, or Platform Engineering experience; we don’t care how many years you’ve been working. We’re looking for solid infrastructure skills and sharp judgement.
Strong AWS production experience (EC2, ECS/Fargate, Lambda, S3, IAM, VPC, RDS, networking) — ideally including incidents you owned end-to-end.
Terraform in anger — modular, reviewed, version-controlled.
Comfortable debugging Python services (FastAPI or similar) in production — from container, to ALB, to DNS, to security group.
Docker fluency: building lean images, debugging missing tools (yes, curl), and reasoning about healthchecks, lifecycle hooks, and rollback loops.
CI/CD experience — ideally GitLab CI, but GitHub Actions / Argo / Buildkite count too. Fast, safe, observable pipelines.
Networking depth — you can reach for curl, dig, tcpdump, or a flow log without panicking; you understand IPv4/IPv6 dual-stack, egress, and why a healthcheck can pass externally and fail internally.
Calm under pressure: you’ve been on call, handled real incidents, and written post-mortems that actually changed how the system runs.
A clear point of view on AI tooling — when to use it, when to ignore it, and how to keep it from making your infra worse.

Nice to Have

A Computer Science or related degree.
Experience in a fast-paced startup environment.
Observability stack experience — Prometheus, Grafana, OpenTelemetry, Datadog, Loki, or equivalents. You know what a good SLO looks like.
Container orchestration beyond ECS — Kubernetes in production, including debugging it under load.
Database operations — PostgreSQL migrations, RDS tuning, backups you’ve actually restored.
Security and compliance: SOC 2, IAM hardening, secrets management, supply-chain hygiene, least-privilege as a default reflex.
Scaling and cost work — Fargate vs Lambda trade-offs, autoscaling, spot fleets, capacity planning.
HPC / batch compute experience (AWS Batch, ParallelCluster, Slurm, Karpenter) for heavy simulation or ML workloads.
GPU infrastructure: CUDA-aware scheduling, GPU operator, driver pain.
Nix experience (including Nix flakes) for reproducible builds and dev environments.
Open-source contributions, especially in the SRE / infra ecosystem.
Chaos engineering, game days, or anything else that proves you trust your runbooks.
State the word ‘Salmon’ anywhere in your application to prove you can read a job advert.

We aim to improve all our colleagues’ abilities and careers by exposing them to the bare bones of a tech start-up whilst giving them the opportunity to support the company in any way. If our people continuously improve, so does our product.
