At-scale automated testing for voice and chat agents.
The open-source platform for voice agent QA, from pre-launch simulation to production monitoring.
Hello, and thank you for joining the interview. I am an AI assistant conducting this interview — the conversation may be recorded and assessed, and you can request a human at any time. Let's start: could you tell me about a recent project where you led the development of an LLM evaluation tool?
i haven't led an llm evaluation tool project, but i have led backend development for a large-scale data processing pipeline with real-time analytics and scaling challenges. would you like me to describe that?
That's great, thank you for sharing. Let's dig into that. Could you walk me through some of the key design decisions you made in that backend development — specifically around scaling and real-time analytics?
sure, we used kafka for event streaming to handle high throughput, and microservices with autoscaling on kubernetes. for real-time analytics we used apache flink for low-latency streams. one key decision was sharding data by user region to reduce cross-region latency and improve fault tolerance — that let us scale to millions of events per second.
Common voice agent testing challenges
Audio, latency, interruptions, and tool calls all have to work together in real time.
Shipping multiple agents a week? Manual QA can't keep up.
Validate behaviour across edge cases before going live.
Accents, noise, interruptions, and API failures rarely make the test plan.
Every prompt change risks a regression.
From prompt to production in four steps
Connect, describe the caller and verdict, run in CI, monitor in production. One scenario.run() API.
pip install, then point a thin adapter at the agent you already shipped.
A UserSimulatorAgent plays the caller, a JudgeAgent decides what passing looks like.
scenario.run() drives interruptions, DTMF, and audio effects on every commit.
The visualizer, OpenTelemetry traces, and alerts close the loop after launch.
Write a real voice test in fifteen lines
No new DSL, no dashboard clicking. One code-first API powers text and voice. Drop it into your test runner and gate every commit in CI.
# test_voice_agent.py (illustrative)
import scenario
result = scenario.run(
name="caller reschedules an appointment over the phone",
description="A caller with background noise asks to move a booking.",
agents=[
my_voice_agent, # your agent, behind a thin adapter
scenario.UserSimulatorAgent(
persona="hurried caller, interrupts, mild accent",
),
scenario.JudgeAgent(criteria=[
"The agent confirms the new time in the caller's timezone",
"The agent verifies identity before changing the booking",
]),
],
audio=scenario.AudioConditions(noise="cafe", interruptions=True),
)
assert result.successSnippet is illustrative. See the docs for the current API.
Catch regressions, benchmark performance, iterate faster
One framework for everything that has to go right before a voice agent can answer the phone.
Capture a production failure once, then replay it forever as a regression test.
Noise, codec degradation, and interruptions, so you test the call your users actually make.
Time to first token, turn latency, and judge verdicts land in one inspectable trace.
Probe for jailbreaks, prompt leaks, and off-policy answers before an attacker does.
We mimic your customers: the User Simulator generates natural, multi-turn conversations over your real protocol.
Gate every release in your pipeline, or drive runs programmatically through the REST API.
Cross-industry voice agent testing
For high-stakes domains where a mistake means churn, compliance issues, or lost revenue.
Timezones, identity verification, and double-booking conflicts, handled on the call.
Escalation paths, compliance language, and emotional conversations that stay on script.
Care instructions, prescriptions, and clean emergency escalation when it matters.
Menu rules, background noise, and rush-hour pace without dropped items.
On-message pitches, objection handling, and a clean lead handoff every time.
Field capture, confidentiality, and accurate routing to the right team.
Works with your voice stack
Test the agent you actually shipped, over the exact protocol it runs on. No re-platforming, no wrappers.
Built on OpenTelemetry, works with 10+ frameworks.
From first test to production monitoring
Simulation is the start. LangWatch closes the loop with live observability, alerts, and ROI metrics.
Tie agent quality to the numbers your business actually tracks.
Watch live calls, catch drift, and get alerted when quality slips.
Replay any run as a full conversation you can inspect and share.
Turn real conversations into datasets, then close the loop on fixes.
Let product and QA author scenarios without touching the codebase.
Open source and self-hostable, so your call data stays yours.
LangWatch reminded me of how we used to evaluate models in classic ML. I knew immediately it was what we needed to ship conversational AI with confidence.
Ship voice agents with confidence, not crossed fingers.
Get up and running with LangWatch in as little as ten minutes.
$ pip install langwatch-scenario