AI Solutions Engineer

About Chaseit

Chaseit builds AI voice agents for loan servicing and collections. Our agents make tens of thousands of human-like calls every day for lenders across Europe and beyond — automating everything from payment reminders to payment-plan negotiation, in 10+ languages, while staying compliant and genuinely empathetic.

We are a VC-backed startup with live customers, real revenue, and a lean team where every person has outsized impact.

What you will do

This is not a prompt-tweaking role, and it is not plain AI programming. As an AI Solutions Engineer, you own one question: are our agents measurably getting better at the outcomes our customers actually care about?

You'll work side by side with our Deployment Strategists. They own the customer relationship and define what success looks like in production — payment rates, promise-to-pay, resolution, containment, escalations. You own turning those targets into a proactive, systematic improvement engine: forming hypotheses about what's holding a metric back, running experiments to test them, and shipping the changes that move the number.

You will:

  • Own target metrics for live deployments alongside the Deployment Strategist, and treat moving them as the job — not a side effect
  • Form hypotheses about what limits agent performance (conversation design, prompts, model choice, tooling, latency, handoff logic) and prioritise them by expected impact
  • Design and run improvement experiments and A/B tests on real call traffic — define treatment and control, success metrics, guardrails, and sample sizes; then read the results honestly, including the ones that don't work
  • Build and own automated improvement flows: eval pipelines that score every prompt, flow, and model change before it ships; regression suites that catch quality drops; online evaluation and monitoring that surface failures and metric regressions automatically
  • Build evaluators and evaluation datasets (LLM-as-judge plus deterministic checks) that capture what "good" actually sounds like on a real collections call
  • Mine production call data to find failure clusters and high-impact opportunities, then turn them into eval cases and experiments
  • Close the loop: ship the winning changes, quantify the impact, document what you learned, and feed it back into how every agent is built
  • Build the tooling and infrastructure that lets the whole team improve agents faster, and with confidence

Some days are deep experiment design and data analysis. Some days are firefighting a metric that dropped overnight. You should enjoy both.

Why this role exists

The obvious way to improve an AI agent is reactively: wait for issues to surface — a bug report, a flagged call, a piece of customer feedback — and fix them one by one. That work is real, and we take it seriously. It keeps quality from slipping and it earns customer trust.

But responding to what's already visible isn't the same as systematically moving the numbers that decide whether a deployment succeeds. The biggest gains usually sit in patterns you only see when you go looking — across tens of thousands of calls, not one at a time — and they have to be proven, not assumed.

That's why this role exists. We need someone who starts from the metric: who can find where payment rates or resolution are stuck, form a clear hypothesis about why, prove the fix with a clean experiment, and roll it out with confidence. Hypothesis, experiment, real-world impact — that's the discipline this role is built around.

Who you are

  • 3+ years building with LLMs or production AI/ML systems, with a track record of improving a system's performance through systematic evaluation and iteration — not vibes
  • Strong Python skills (TypeScript a plus) for building evals, experiments, data pipelines, and analysis
  • Fluent in evaluation and experiment design: offline and online evals, LLM-as-judge and code-based graders, A/B testing, and enough statistics to know when a result is real and when it's noise
  • Comfortable in data: SQL, digging through logs and transcripts, building dashboards, and reasoning clearly about metrics
  • Product sense and extreme ownership: you can define what success means for an ambiguous problem and drive it from hypothesis to production without being told what matters
  • Comfort with ambiguity and intensity — priorities shift, metrics move, and some days are firefighting days. You stay calm and effective.
  • Outstanding written English; you can explain a result or a trade-off clearly to engineers and non-technical stakeholders alike

Nice to have

  • Experience with conversational or voice AI, call-center operations, or QA of voice/chat systems
  • Background in lending, collections, payments, or another regulated fintech domain
  • Experience with experimentation platforms (e.g. Statsig) and AI observability / eval tooling (e.g. Arize Phoenix, LangSmith)
  • Familiarity with agent orchestration frameworks and prompt engineering
  • Experience with Linear and Notion
  • Additional European languages
  • Experience in high-growth or early-stage environments

What you'll get

  • Salary: €40,000 – €70,000 gross per year, depending on experience
  • Equity: early team members get stock options, so you share in what we build
  • Direct work with the founding team — your experiments shape the product and the roadmap
  • A front-row seat to deploying AI agents at enterprise scale, in production, every day

Location: Remote or Vilnius, Lithuania (hybrid)

Job Details

Company: Chaseit
Location: London Area, United Kingdom