Test agentic systems end-to-end, the way we test self-driving cars.
Evals check the final output. Simulations check everything in between: was the right tool called, and did anything break in the middle of a multi-turn conversation?
Evals look at the output.
Simulations look at the journey.
Score the final answer - faithfulness, relevancy, correctness. Perfect for single-turn quality.
Looks great in isolation - but never sees what happens next.
Watch the whole interaction: was the right tool called? Did the agent recover? Did something break in turn 4 of a 6-turn conversation? Clear, binary, business-level outcomes.
- user-sim · turn 1: I need to cancel my order
- agent · turn 2 · lookup_order: Sure - can you share your order number?
- user-sim · turn 3: I don’t have it, I paid with my work card
- agent · turn 4 · search_by_email: Let me search by your email…
- agent · turn 5: “I can’t find anything.” (gives up)
- user-sim · turn 6: So you can’t help me?
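To make the difference concrete, here is a minimal sketch of a trajectory-level check run over the conversation above. It is not part of any library: `Turn`, `check_outcome`, and the `cancel_order` tool name are hypothetical, introduced only for illustration. The idea is that a final-answer score would only see the agent's last message, while this check asks a binary question about the whole interaction: was the tool that completes the task ever called?

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    number: int
    speaker: str                 # "user-sim" or "agent"
    text: str
    tool: Optional[str] = None   # tool the agent called on this turn, if any


# The conversation shown above, encoded as data.
TRANSCRIPT = [
    Turn(1, "user-sim", "I need to cancel my order"),
    Turn(2, "agent", "Sure - can you share your order number?", tool="lookup_order"),
    Turn(3, "user-sim", "I don't have it, I paid with my work card"),
    Turn(4, "agent", "Let me search by your email…", tool="search_by_email"),
    Turn(5, "agent", "I can't find anything."),
    Turn(6, "user-sim", "So you can't help me?"),
]


def check_outcome(transcript, required_tool):
    """Binary check over the whole trajectory, not just the last message."""
    agent_turns = [t for t in transcript if t.speaker == "agent"]
    called = [t.tool for t in agent_turns if t.tool]
    return {
        "passed": required_tool in called,
        "tools_called": called,
        "last_agent_turn": agent_turns[-1].number if agent_turns else None,
    }


# "cancel_order" is an assumed tool name; the transcript never shows it being called.
result = check_outcome(TRANSCRIPT, required_tool="cancel_order")
print(result)
```

Here the check fails: the agent looked up and searched, but never reached a cancellation, and its last contribution was turn 5. That is the kind of mid-conversation breakdown a single-turn score cannot surface.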
“Most tools focus on one prompt and one response. But that’s not how customers use an agent. We needed to test the conversation end-to-end.”
The Agent Testing Pyramid
Ever since we put tools in the hands of LLMs, one question keeps coming back: how do we systematically know our agents actually work - and where do evals fit? As we built increasingly complex agents with our customers, a pattern emerged. We call it the Agent Testing Pyramid: a three-layer approach to the quality assurance that reliable agents need. (First written up by Rogério Chaves.)
Simulations shift the conversation from probabilistic metrics to binary outcomes - from “How accurate is RAG on average?” to “Can the agent help a customer cancel their order when they don’t remember their order number?” That maps directly to business value, builds trust, and communicates progress in terms non-technical stakeholders understand.
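As a rough illustration of that shift, the sketch below puts an averaged metric next to a per-scenario pass/fail report. All numbers and scenario names are made up for the example, and `summarize` is a hypothetical helper, not an API from any library.

```python
from statistics import mean

# Made-up per-sample scores from an averaged metric (illustrative only).
metric_scores = [0.92, 0.88, 0.95, 0.31, 0.90]

# Made-up scenario outcomes: each one is a yes/no business question.
scenario_outcomes = {
    "cancel order with order number": True,
    "cancel order without order number": False,
    "change delivery address": True,
}


def summarize(outcomes):
    """Turn per-scenario booleans into a report that names each failure."""
    failed = sorted(name for name, passed in outcomes.items() if not passed)
    return {
        "passed": len(outcomes) - len(failed),
        "total": len(outcomes),
        "failed": failed,
    }


average = mean(metric_scores)
report = summarize(scenario_outcomes)

print(f"average score: {average:.2f}")
print(f"scenarios passed: {report['passed']}/{report['total']}")
for name in report["failed"]:
    print(f"FAILED: {name}")
```

The average hides which interaction went wrong; the scenario report names it in words a non-technical stakeholder can act on.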
The pyramid isn’t rigid. Early-stage agents might jump straight to simulations; mature systems invest heavily in the middle layer. But all three layers earn their place in agents that work in the real world.
Write your first scenario in minutes.
    import pytest
    import scenario

    # Model used by the user simulator and judge unless overridden.
    scenario.configure(default_model="openai/gpt-4.1-mini")


    @pytest.mark.agent_test
    @pytest.mark.asyncio
    async def test_recipe_agent():
        # Wrap the agent under test in an adapter with a single call() method.
        class RecipeAgent(scenario.AgentAdapter):
            async def call(self, input: scenario.AgentInput):
                return my_agent(input.messages)  # your agent, any framework

        result = await scenario.run(
            name="dinner recipe request",
            description=(
                "It's Saturday evening, the user is hungry and tired,"
                " has no money to order out, and wants a recipe."
            ),
            agents=[
                RecipeAgent(),
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(criteria=[
                    "Agent should ask at most one follow-up, then give a recipe",
                    "Recipe should be vegetarian, with ingredients and steps",
                ]),
            ],
        )

        # The test passes only if the judge marks the scenario as successful.
        assert result.success

- RecipeAgent: Your agent, wrapped with a one-method AgentAdapter.call().
- scenario.UserSimulatorAgent: Generates realistic user messages from the scenario description.
- scenario.JudgeAgent: Evaluates the conversation against your criteria and decides whether it proceeds.
- Optional precise control over the conversation flow (user / agent / judge turns).
…and LangGraph · CrewAI · Pydantic AI · Vercel AI SDK · Google ADK · LiteLLM · Inngest AgentKit.