At-scale automated testing for voice and chat agents.

The open-source platform for voice agent QA, from pre-launch simulation to production monitoring.

simulation — qualified senior candidate

Hello, and thank you for joining the interview. I am an AI assistant conducting this interview — the conversation may be recorded and assessed, and you can request a human at any time. Let's start: could you tell me about a recent project where you led the development of an LLM evaluation tool?

i haven't led an llm evaluation tool project, but i have led backend development for a large-scale data processing pipeline with real-time analytics and scaling challenges. would you like me to describe that?

That's great, thank you for sharing. Let's dig into that. Could you walk me through some of the key design decisions you made in that backend development — specifically around scaling and real-time analytics?

sure, we used kafka for event streaming to handle high throughput, and microservices with autoscaling on kubernetes. for real-time analytics we used apache flink for low-latency streams. one key decision was sharding data by user region to reduce cross-region latency and improve fault tolerance — that let us scale to millions of events per second.

Common voice agent testing challenges

Audio, latency, interruptions, and tool calls all have to work together in real time.

01
Manual testing doesn't scale

Shipping multiple agents a week? Manual QA can't keep up.

02
Production-readiness uncertainty

Without edge-case coverage, you can't tell whether an agent is ready to go live.

03
Edge cases hide until production

Accents, noise, interruptions, and API failures rarely make the test plan.

04
Regression on every prompt change

A small prompt tweak can break a flow that passed yesterday, and nothing flags it.

From prompt to production in four steps

Connect, describe the caller and verdict, run in CI, monitor in production. One scenario.run() API.

01
Connect your agent

pip install, then point a thin adapter at the agent you already shipped.

02
Describe caller & verdict

A UserSimulatorAgent plays the caller, a JudgeAgent decides what passing looks like.

03
Run simulation in CI

scenario.run() drives interruptions, DTMF, and audio effects on every commit.

04
Watch in production

The visualizer, OpenTelemetry traces, and alerts close the loop after launch.
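
As a rough mental model of the steps above, the sketch below wires a caller, an agent adapter, and a judge into one loop. Everything in it is a hypothetical stand-in written for illustration: `AgentAdapter`, `ScriptedCaller`, `KeywordJudge`, and `run_scenario` are defined locally and are not the library's API, and a real user simulator and judge would be LLM-driven rather than scripted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str


class AgentAdapter:
    """Thin wrapper: maps the conversation so far to the agent's next reply."""

    def __init__(self, reply_fn: Callable[[List[Turn]], str]):
        self._reply_fn = reply_fn

    def call(self, history: List[Turn]) -> str:
        return self._reply_fn(history)


class ScriptedCaller:
    """Stand-in for a user simulator: plays back fixed caller lines."""

    def __init__(self, lines: List[str]):
        self._lines = list(lines)

    def next_message(self, history: List[Turn]) -> Optional[str]:
        return self._lines.pop(0) if self._lines else None


class KeywordJudge:
    """Stand-in for a judge: a criterion passes if the agent said its keyword."""

    def __init__(self, criteria: Dict[str, str]):
        self._criteria = criteria  # criterion description -> required keyword

    def verdict(self, history: List[Turn]) -> Dict[str, bool]:
        agent_text = " ".join(t.text.lower() for t in history if t.role == "agent")
        return {name: kw.lower() in agent_text for name, kw in self._criteria.items()}


@dataclass
class Result:
    success: bool
    history: List[Turn] = field(default_factory=list)
    checks: Dict[str, bool] = field(default_factory=dict)


def run_scenario(agent, caller, judge, max_turns: int = 10) -> Result:
    """Alternate caller and agent turns, then ask the judge for a verdict."""
    history: List[Turn] = []
    for _ in range(max_turns):
        message = caller.next_message(history)
        if message is None:
            break
        history.append(Turn("user", message))
        history.append(Turn("agent", agent.call(history)))
    checks = judge.verdict(history)
    return Result(success=all(checks.values()), history=history, checks=checks)


def booking_agent(history: List[Turn]) -> str:
    """Toy agent: asks for identity first, then confirms the new slot."""
    if "reschedule" in history[-1].text.lower():
        return "Sure. Can you verify your date of birth first?"
    return "Thanks, you are confirmed for 3 pm Eastern."


result = run_scenario(
    AgentAdapter(booking_agent),
    ScriptedCaller(["I need to reschedule my appointment.", "It's 4 July 1990."]),
    KeywordJudge({"verifies identity": "verify", "confirms new time": "confirmed"}),
)
```

Because the run returns a single success flag, it can be asserted on in a test runner, which is what makes gating a commit on it possible.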

Write a real voice test in fifteen lines

No new DSL, no dashboard clicking. One code-first API powers text and voice. Drop it into your test runner and gate every commit in CI.

# test_voice_agent.py (illustrative)
import scenario

result = scenario.run(
    name="caller reschedules an appointment over the phone",
    description="A caller with background noise asks to move a booking.",
    agents=[
        my_voice_agent,                       # your agent, behind a thin adapter
        scenario.UserSimulatorAgent(
            persona="hurried caller, interrupts, mild accent",
        ),
        scenario.JudgeAgent(criteria=[
            "The agent confirms the new time in the caller's timezone",
            "The agent verifies identity before changing the booking",
        ]),
    ],
    audio=scenario.AudioConditions(noise="cafe", interruptions=True),
)
assert result.success

Same scenario.run() for text & voice
Runs in Vitest, Jest & pytest
Open source, Python/TS/Go
MCP server with 10 tools

Snippet is illustrative. See the docs for the current API.
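
The snippet above asks for background noise on the call. One common way to simulate that in audio tests, shown here as a sketch and not as what the library does internally, is to mix a noise signal into clean speech at a chosen signal-to-noise ratio (SNR). The helper `mix_at_snr`, the 8 kHz sample rate, and the synthetic sine and random signals are all assumptions made for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it.

    Returns the mixed signal and the scaled noise that was added.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise is silent")
    # An amplitude ratio of 10^(dB/20) corresponds to a power ratio of 10^(dB/10).
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    scaled_noise = [n * gain for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise


sample_rate = 8000  # assumed telephony-style sample rate
# One second of a 220 Hz tone stands in for speech.
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
rng = random.Random(0)  # seeded so the example is repeatable
noise = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate)]

mixed, scaled_noise = mix_at_snr(speech, noise, snr_db=10)
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
```

Lower SNR values make the noise louder relative to the caller, so sweeping `snr_db` downward is a simple way to find where an agent's transcription starts to fail.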

Catch regressions, benchmark performance, iterate faster

One framework for everything that has to go right before a voice agent can answer the phone.

01
Turn failures into test scenarios

Capture a production failure once, then replay it forever as a regression test.

02
Real-world conditions

Noise, codec degradation, and interruptions, so you test the call your users actually make.

03
Latency and quality, in the trace

Time to first token, turn latency, and judge verdicts land in one inspectable trace.

04
Red-teaming and adversarial testing

Probe for jailbreaks, prompt leaks, and off-policy answers before an attacker does.

05
Works with your voice stack

We mimic your customers: the User Simulator generates natural, multi-turn conversations over your real protocol.

06
CI/CD and REST API

Gate every release in your pipeline, or drive runs programmatically through the REST API.
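
To make the latency metrics named above concrete, the sketch below measures time to first token and total turn latency for a streamed reply. It is a minimal illustration under stated assumptions: `measure_turn`, `TurnTiming`, and `fake_agent_stream` are hypothetical names, the stream is simulated with sleeps, and a real setup would record these values as trace attributes instead of returning them.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class TurnTiming:
    time_to_first_token: float  # seconds from request to first token
    turn_latency: float         # seconds from request to end of stream
    tokens: List[str]


def measure_turn(stream: Iterable[str], clock=time.perf_counter) -> TurnTiming:
    """Consume a token stream and record when the first and last tokens arrive."""
    start = clock()
    first = None
    tokens: List[str] = []
    for token in stream:
        if first is None:
            first = clock()
        tokens.append(token)
    end = clock()
    if first is None:
        raise ValueError("stream produced no tokens")
    return TurnTiming(first - start, end - start, tokens)


def fake_agent_stream() -> Iterator[str]:
    """Simulated agent: a short delay before the first token, then steady output."""
    time.sleep(0.05)
    for token in ["Your", " booking", " is", " moved."]:
        yield token
        time.sleep(0.01)


timing = measure_turn(fake_agent_stream())
```

A CI gate could then compare `timing.time_to_first_token` against a latency budget and fail the run when the budget is exceeded.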

Cross-industry voice agent testing

For high-stakes domains where a mistake means churn, compliance issues, or lost revenue.

01
Appointment scheduling

Timezones, identity verification, and double-booking conflicts, handled on the call.

02
Customer support

Escalation paths, compliance language, and emotional conversations that stay on script.

03
Healthcare follow-ups

Care instructions, prescriptions, and clean emergency escalation when it matters.

04
Drive-thru & ordering

Menu rules, background noise, and rush-hour pace without dropped items.

05
Outbound sales

On-message pitches, objection handling, and a clean lead handoff every time.

06
Legal intake

Field capture, confidentiality, and accurate routing to the right team.

Works with your voice stack

Test the agent you actually shipped, over the exact protocol it runs on. No re-platforming, just a thin adapter.

OpenAI Realtime
Gemini
ElevenLabs
Google Live API
Pipecat
WebSocket bots
Twilio
Phone + DTMF

Built on OpenTelemetry, works with 10+ frameworks.

From first test to production monitoring

Simulation is the start. LangWatch closes the loop with live observability, alerts, and ROI metrics.

01
Performance & ROI metrics

Tie agent quality to the numbers your business actually tracks.

02
Production monitoring

Watch live calls, catch drift, and get alerted when quality slips.

03
Visualized conversations

Replay any run as a full conversation you can inspect and share.

04
Datasets & fix loop

Turn real conversations into datasets, then close the loop on fixes.

05
No-code scenario builder

Let product and QA author scenarios without touching the codebase.

06
Open & self-hosted

Open source and self-hostable, so your call data stays yours.
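
The idea of turning real conversations into datasets and replaying them after a fix can be sketched as follows. This is an illustrative assumption about the workflow, not the product's storage format: `RegressionCase`, `save_case`, `load_case`, and `replay` are hypothetical, and the phrase check stands in for a proper judge.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class RegressionCase:
    name: str
    caller_turns: List[str]  # what the caller said, in order
    must_contain: List[str]  # phrases a correct agent must say at some point


def save_case(case: RegressionCase) -> str:
    """Serialize a captured conversation so it can live in a dataset."""
    return json.dumps(asdict(case), indent=2)


def load_case(payload: str) -> RegressionCase:
    return RegressionCase(**json.loads(payload))


def replay(case: RegressionCase, agent: Callable[[str], str]) -> List[str]:
    """Return the required phrases the agent never said; an empty list is a pass."""
    replies = " ".join(agent(turn).lower() for turn in case.caller_turns)
    return [p for p in case.must_contain if p.lower() not in replies]


# A failure observed once: the agent rescheduled without checking identity.
case = RegressionCase(
    name="reschedule without identity check",
    caller_turns=["Move my appointment to Friday."],
    must_contain=["date of birth"],
)
stored = save_case(case)


def buggy_agent(text: str) -> str:
    return "Done, see you Friday."


def fixed_agent(text: str) -> str:
    return "Before I change it, what is your date of birth?"


missing_before = replay(load_case(stored), buggy_agent)
missing_after = replay(load_case(stored), fixed_agent)
```

The stored case fails against the buggy agent and passes against the fixed one, so keeping it in the suite guards against the same failure returning.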

LangWatch reminded me of how we used to evaluate models in classic ML. I knew immediately it was what we needed to ship conversational AI with confidence.
Head of AI, an enterprise voice AI team

Ship voice agents with confidence, not crossed fingers.

Get up and running with LangWatch in as little as ten minutes.

$ pip install langwatch-scenario