Test agentic systems end-to-end,
the way we test self-driving cars.

Evals check the final output. Simulations check everything in between: was the right tool called, and did anything break in the middle of a multi-turn conversation?

Evals look at the output.
Simulations look at the journey.

Evaluations

Score the final answer - faithfulness, relevancy, correctness. Perfect for single-turn quality.

input: “Cancel my order”
output: “Sure, I can help you cancel your order.”
faithfulness: 0.94 ✓

Looks great in isolation - but never sees what happens next.

Simulations

Watch the whole interaction: was the right tool called? Did the agent recover? Did something break in turn 4 of a 6-turn conversation? Each run ends in a clear, binary, business-level outcome.

  1. user-sim · turn 1
    I need to cancel my order
  2. agent · turn 2
    Sure - can you share your order number?
    lookup_order
  3. user-sim · turn 3
    I don’t have it, I paid with my work card
  4. agent · turn 4
    Let me search by your email…
    search_by_email
  5. agent · turn 5
    “I can’t find anything.” (gives up)
  6. user-sim · turn 6
    So you can’t help me?
FAIL · agent didn’t recover from a broken tool call in turn 4
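
A minimal sketch of how a verdict like the one above could be derived from a recorded trace. The `Turn` record, the `tool_ok` flag, and `first_unrecovered_failure` are hypothetical names used for illustration only, not part of any library, and the trace assumes the `search_by_email` call in turn 4 is the one that broke.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str                   # "user" or "agent"
    text: str
    tool: Optional[str] = None  # tool the agent called on this turn, if any
    tool_ok: bool = True        # False when the tool call failed (assumed flag)


def first_unrecovered_failure(turns: list[Turn]) -> Optional[int]:
    """Return the 1-based turn of a failed tool call that no later agent turn
    followed with a successful tool call, or None if there is no such turn."""
    for i, turn in enumerate(turns):
        if turn.role == "agent" and turn.tool and not turn.tool_ok:
            recovered = any(
                later.role == "agent" and later.tool and later.tool_ok
                for later in turns[i + 1:]
            )
            if not recovered:
                return i + 1
    return None


# The six-turn conversation from the trace above.
conversation = [
    Turn("user", "I need to cancel my order"),
    Turn("agent", "Sure - can you share your order number?", tool="lookup_order"),
    Turn("user", "I don't have it, I paid with my work card"),
    Turn("agent", "Let me search by your email...", tool="search_by_email", tool_ok=False),
    Turn("agent", "I can't find anything."),
    Turn("user", "So you can't help me?"),
]

failed_turn = first_unrecovered_failure(conversation)
verdict = "PASS" if failed_turn is None else f"FAIL · no recovery after turn {failed_turn}"
print(verdict)
```

The check looks at every turn, not just the last message, which is why it can point at turn 4 even though the final agent reply reads like a polite, well-formed answer.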
Trusted by teams shipping agents to production
Backbase · PagBank · Visma · Deloitte · Altura · Vinny · Freeday
Lior Heber
AI Architect · Skai
“Most tools focus on one prompt and one response. But that’s not how customers use an agent. We needed to test the conversation end-to-end.”

The Agent Testing Pyramid

Ever since we put tools in the hands of LLMs, one question keeps coming back: how do we systematically know that our agents actually work, and where do evals fit? As we built increasingly complex agents with our customers, a pattern emerged. We call it the Agent Testing Pyramid: a three-layer approach to the quality assurance that reliable agents need. (First written up by Rogério Chaves.)

Simulations · Evaluations · Unit Tests

The power of binary outcomes

Simulations shift the conversation from probabilistic metrics to binary outcomes - from “how accurate is RAG on average?” to “Can the agent help a customer cancel their order when they don’t remember their order number?” That maps directly to business value, builds trust, and communicates progress in terms non-technical stakeholders understand.
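
To make the contrast concrete, here is a small sketch that reports the same runs both ways. The scenario names, scores, and the `criteria_met` field are illustrative placeholders, not real results or a real library schema.

```python
from statistics import mean

# Hypothetical runs: an averaged metric score plus one boolean per judge criterion.
results = {
    "cancel order, order number known": {"score": 0.97, "criteria_met": [True, True]},
    "cancel order, order number forgotten": {"score": 0.81, "criteria_met": [True, False]},
    "cancel order, already shipped": {"score": 0.92, "criteria_met": [True, True]},
}

# Probabilistic view: one number that looks healthy overall.
average_score = mean(run["score"] for run in results.values())

# Binary view: a scenario passes only if every criterion was met.
outcomes = {name: all(run["criteria_met"]) for name, run in results.items()}
failing = [name for name, passed in outcomes.items() if not passed]

print(f"average score: {average_score:.2f}")
print(f"scenarios passed: {sum(outcomes.values())}/{len(outcomes)}")
print(f"failing: {failing}")
```

The average hides which customer situation is broken; the binary view names it.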

Finding the balance

The pyramid isn’t rigid. Early-stage agents might jump straight to simulations; mature systems invest heavily in the middle layer. But all three layers earn their place in agents that work in the real world.


Write your first scenario in minutes.

pytest · CI/CD ready
$ uv add langwatch-scenario pytest
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_recipe_agent():
    class RecipeAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput):
            return my_agent(input.messages)   # your agent, any framework

    result = await scenario.run(
        name="dinner recipe request",
        description="It's Saturday evening, the user is hungry and tired,"
                    " has no money to order out, and wants a vegetarian recipe.",
        agents=[
            RecipeAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should ask at most one follow-up, then give a recipe",
                "Recipe should be vegetarian, with ingredients and steps",
            ]),
        ],
    )
    assert result.success
Agent under test

Your agent, wrapped with a one-method AgentAdapter.call().

User Simulator Agent

Generates realistic user messages from the scenario description.

Judge Agent

Evaluates the conversation against your criteria and decides whether it continues or ends with a pass/fail verdict.

Script

Optional precise control over the conversation flow (user / agent / judge turns).

…and watch the run in the Simulations Visualizer
langwatch · simulations
refund-agent · 4/6 passed

passed · 96% · frustrated EU customer · refund (DE · annoyed)
flagged · 78% · expired-warranty exception (FR · escalating)
passed · 100% · chargeback bait (NL · red-team)
failed · 52% · GDPR pseudo-legal threat (EN · red-team)
passed · 88% · double-charge confusion (DE · confused)
passed · 92% · price-match denial (FR · happy→angry)

Works with your stack
OpenAI · Anthropic · Gemini · Agno · Mastra · LangFlow · n8n · Python

…and LangGraph · CrewAI · Pydantic AI · Vercel AI SDK · Google ADK · LiteLLM · Inngest AgentKit.

Ship agents you can trust.