Teach your team to evaluate agents properly.
A guided curriculum we run with platform teams shipping LLM agents. It covers everything from scoring fundamentals to scenario design, and runs as a workshop or a self-paced cohort. Built by the team that ships LangWatch.
- M01 · Why evals, and when they break
- M02 · Scoring fundamentals
- M03 · Designing scenarios that survive
- M04 · LLM-as-judge, done well
- M05 · Online vs offline, and CI
- M06 · From evaluation to optimization
From first eval to a production quality habit.
The training distills two years of running quality programs with teams in healthcare, fintech, customer support, and voice platforms. Practical and opinionated, by design.
Why classic test pyramids do not survive contact with a non-deterministic agent. The vocabulary teams need to talk about quality.
Built-in evaluators, LLM-as-judge, bring-your-own metrics. When to use which, and how to stop chasing scores.
Crafting scenarios that uncover real failure modes. Voice, multi-tool, multi-turn. How few scenarios you actually need.
CI gates that actually fail builds. Production monitors. Connecting business metrics to agent quality.
Teams leave with a quality habit, not a slide deck.
Bring evals training to your team.
Open the curriculum or talk to us about running the workshop for your team, remote or on-site.