Senior AI Platform Engineer
About the Role
Our client, a Series B company automating decision-making for field workers, is looking for a Senior AI Platform Engineer to own the infrastructure layer that lets autonomous agents operate reliably at production scale.
You'll design and build the systems that allow their applied AI team to ship production LLM applications without breaking things. The focus is infrastructure engineering for AI systems — not traditional MLOps or feature engineering.
You'll be responsible for:
- Multi-modal data pipelines: Ingesting video, audio, and structured data at scale; designing processing infrastructure that handles thousands of concurrent users
- Agentic orchestration on serverless AWS: Building Lambda + Step Functions infrastructure for autonomous workflows; managing state, cost, and reliability
- Observability and guardrails: Implementing monitoring that catches when autonomous agents fail; tracking decision quality, tool-use patterns, and failure modes
- Platform reliability: Designing systems that meet sub-100ms latency requirements and keep meeting them as user volume grows
The platform team is small (3 people). You'll have significant autonomy and direct impact on product direction.
What We're Looking For
Must have:
- 4+ years building production systems (not just experiments or prototypes)
- Strong Python skills; experience designing libraries or shared services that other engineers depend on
- Deep AWS infrastructure knowledge — particularly serverless (Lambda, Step Functions, SQS, EventBridge)
- Experience designing reliable, maintainable systems that whole teams build on (not just shipping individual features)
- Hands-on experience with agentic AI systems, LLM orchestration frameworks (LangGraph, CrewAI, etc.), or similar multi-step autonomous workflows
- Platform engineer mindset — you think about schema design, API stability, backward/forward compatibility, and developer experience
Nice to have:
- Experience with multi-modal systems (video, audio, or image processing)
- AWS Bedrock, SageMaker, or similar managed AI services
- Observability frameworks for LLM systems (Langfuse, Arize, LangSmith)
- RAG pipelines, vector databases, or retrieval-augmented systems
- CI/CD automation; infrastructure-as-code (Terraform, CloudFormation)
Why This Matters
You're not building features. You're building the infrastructure that enables autonomous decision-making at scale. Every design choice affects reliability, cost, and whether the system can scale.
This requires thinking like a platform engineer:
- How does data flow through the system without corruption?
- How do you evolve APIs without breaking consuming services?
- How do you catch failures before customers do?
- How do you keep costs predictable as usage grows?
About You
You've shipped production systems. You understand the difference between a feature and a platform. You care about schema design, API stability, and developer experience — not just "does it work today."
You've thought about backward compatibility, breaking changes, and how to evolve systems without disrupting the engineers who depend on them. You're comfortable with serverless, Python, and the AWS stack. You care about observability and knowing when things break.
You're not intimidated by agentic AI systems — you understand they're just orchestration patterns, and infrastructure principles remain the same.