Test agentic systems end-to-end,
the way we test self-driving cars.

Evals check the final output. Simulations check everything in between: was the right tool called, and did anything break in the middle of a multi-turn conversation?

Start simulating Read the docs Star on GitHub langwatch/scenario

Evals look at the output.
Simulations look at the journey.

Evaluations

Score the final answer - faithfulness, relevancy, correctness. Perfect for single-turn quality.

input

“Cancel my order”

output

“Sure, I can help you cancel your order.”

faithfulness0.94 ✓

Looks great in isolation - but never sees what happens next.

Simulations

Watch the whole interaction - was the right tool called? did the agent recover? did something break in turn 4 of a 6-turn conversation? Clear, binary, business-level outcomes.

user-sim · turn 1
I need to cancel my order
agent · turn 2
Sure - can you share your order number?
lookup_order
user-sim · turn 3
I don’t have it, I paid with my work card
agent · turn 4
Let me search by your email…
search_by_email
agent · turn 5
“I can’t find anything.” (gives up)
user-sim · turn 6
So you can’t help me?

FAIL · agent didn’t recover from a broken tool call in turn 4

Run scenarios Read the docs

Trusted by teams shipping agents to production

Trusted in production by

The Agent Testing Pyramid

Ever since we put tools in the hands of LLMs, one question keeps coming back: how do we systematically know our agents actually work - and where do evals fit? Building more and more complex agents with our customers, a pattern emerged. We call it the Agent Testing Pyramid: a three-layer approach to the quality assurance reliable agents need. (First written up by Rogério Chaves.)

The power of binary outcomes

Simulations shift the conversation from probabilistic metrics to binary outcomes - from “how accurate is RAG on average?” to “Can the agent help a customer cancel their order when they don’t remember their order number?” That maps directly to business value, builds trust, and communicates progress in terms non-technical stakeholders understand.

Finding the balance

The pyramid isn’t rigid. Early-stage agents might jump straight to simulations; mature systems invest heavily in the middle layer. But all three layers earn their place in agents that work in the real world.

Read the full article

Write your first scenario in minutes.

pytest · CI/CD ready

$ uv add langwatch-scenario pytest

import scenario, pytest

scenario.configure(default_model="openai/gpt-4.1-mini")

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_recipe_agent():
    class RecipeAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput):
            return my_agent(input.messages)   # your agent, any framework

    result = await scenario.run(
        name="dinner recipe request",
        description="It's saturday evening, the user is hungry and tired,"
                    " has no money to order out, and wants a recipe.",
        agents=[
            RecipeAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should ask at most one follow-up, then give a recipe",
                "Recipe should be vegetarian, with ingredients and steps",
            ]),
        ],
    )
    assert result.success

Agent under test

Your agent, wrapped with a one-method AgentAdapter.call().

User Simulator Agent

Generates realistic user messages from the scenario description.

Judge Agent

Evaluates the conversation against your criteria and decides whether it proceeds.

Script

Optional precise control over the conversation flow (user / agent / judge turns).

…and watch the run in the Simulations Visualizer

langwatch · simulations

4/6 passed · refund-agent

passed96%