Four values for the teams building agents that actually ship.
Systematic approach to quality over vibe checking. Few well-thought scenarios over thousands of auto-generated ones. Business metrics over technical ones. Incremental agent improvements over premature AGI.
Better Agents is a CLI tool and a set of standards for agent development that reflect what we have come to value. On this page you can read the manifesto that guides it.
Systematic approach to quality over vibe checking
The current state of agent development is still very much based on manual testing: sending messages each time a change is made, eyeballing for a subjective feeling of quality, and deciding whether the new version is better than the last.
Agents are highly non-deterministic systems, and traditional software testing practices do not fit them well. That does not mean quality cannot be controlled and improved consistently. We believe a combination of simulation tests, evaluations, monitoring, and defined iteration processes is essential to iterate reliably and create better agents.
There is still value in the items on the right. Vibe checking captures human expertise and intuition on what is off. It becomes part of the systematic approach when used to feed back into simulations and evaluations with new insights.
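As one illustration of replacing a single eyeballed reply with a repeatable check, the sketch below runs the same prompt several times and reports a pass rate, which suits a non-deterministic system better than a single pass/fail. Every name here (`pass_rate`, `make_flaky_agent`, the check function) is hypothetical and invented for this example; the stub agent stands in for a real LLM-backed agent, and none of this is the Better Agents API.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], runs: int = 20) -> float:
    """Call the agent `runs` times with the same prompt and return the fraction of replies that pass."""
    passed = sum(1 for _ in range(runs) if check(agent(prompt)))
    return passed / runs


def make_flaky_agent(seed: int, failure_probability: float) -> Callable[[str], str]:
    """Hypothetical stand-in for a non-deterministic agent (assumption: no real LLM is called)."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < failure_probability:
            return "Sorry, I cannot help with that."
        return f"Your order status for '{prompt}' is: shipped."

    return agent


def mentions_status(reply: str) -> bool:
    """A deliberately simple check; real evaluations are usually richer than a substring match."""
    return "shipped" in reply


if __name__ == "__main__":
    agent = make_flaky_agent(seed=7, failure_probability=0.2)
    rate = pass_rate(agent, "order 1234", mentions_status, runs=50)
    print(f"pass rate: {rate:.0%}")
```

A team might track this number across versions and treat a drop as a regression, while still using manual conversations to discover new checks worth adding.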
Few well-thought scenarios over thousands of auto-generated ones
With AI, where everything is easy to generate, it is tempting to have it automatically generate thousands of simulations for the agent and assess the agent's quality entirely by itself. We believe, however, that defining what matters for your business, the scenarios, and the metrics is exactly where the value lies.
A small number of well-thought-out, well-cared-for scenarios enables oversight by developers, domain experts, and the business. They can read through a humanly manageable number of key simulated conversations that test the critical edges of the agent and create real trust in it.
Millions of auto-generated simulated conversations will never be read. Having another AI tell you what good quality means for your agent, and then evaluating itself on it, does not actually create trust.
There is still value in the items on the right. Automatic exploration is extremely useful for finding new cases for the test suite, uncovering edge cases at scale, and surfacing problems you would not have written by hand.
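To make the idea of a small, readable scenario suite concrete, here is a minimal sketch in which each scenario is a few lines that a domain expert could review. The `Scenario` class, the `must_mention` criterion, and the stub agent are all assumptions made up for this illustration; a real suite would call the actual agent and would likely use richer criteria than phrase matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    """One hand-written scenario: short enough to be read and challenged by non-developers."""
    name: str
    user_message: str
    must_mention: List[str]  # phrases the reply is expected to contain (hypothetical criterion)


def stub_agent(message: str) -> str:
    """Hypothetical stand-in for the real agent under test."""
    if "refund" in message.lower():
        return "I can help with that refund. Let me bring in a human agent to confirm."
    return "Could you tell me more about your issue?"


def run_scenarios(agent: Callable[[str], str], scenarios: List[Scenario]) -> Dict[str, bool]:
    """Run each scenario once and record whether every expected phrase appears in the reply."""
    results = {}
    for scenario in scenarios:
        reply = agent(scenario.user_message).lower()
        results[scenario.name] = all(p.lower() in reply for p in scenario.must_mention)
    return results


SCENARIOS = [
    Scenario("refund request reaches a human", "I want a refund for my order", ["refund", "human"]),
    Scenario("vague complaint gets a clarifying question", "It doesn't work", ["tell me more"]),
]

if __name__ == "__main__":
    for name, ok in run_scenarios(stub_agent, SCENARIOS).items():
        print(("PASS" if ok else "FAIL"), name)
```

Cases surfaced by automatic exploration can be promoted into a list like `SCENARIOS` once a person has decided they matter, which keeps the suite small and owned by the team.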
Business metrics over technical ones
The value of agents over workflows is precisely the flexibility to handle never-seen-before situations. Simulations and evaluations help prevent regressions, but the value the agent brings in real life should be the main guiding principle for agent quality.
Value and ROI can be hard to define for agent projects due to the cognitive nature of the work they automate. We found that teams that focus on it anyway deliver better agents. Even proxy metrics like escalation rate or the rate at which users accept suggestions can bring enormous value.
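The two proxy metrics mentioned above can be computed from very little data. The sketch below assumes a hypothetical `Conversation` record with an escalation flag and suggestion counts; the field names and sample values are invented for illustration and do not come from any real logging schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conversation:
    """Hypothetical per-conversation log record (field names are assumptions)."""
    escalated: bool            # was the conversation handed off to a human?
    suggestions_shown: int     # suggestions the agent presented to the user
    suggestions_accepted: int  # suggestions the user actually accepted


def escalation_rate(conversations: List[Conversation]) -> float:
    """Fraction of conversations that were escalated to a human."""
    if not conversations:
        return 0.0
    return sum(1 for c in conversations if c.escalated) / len(conversations)


def acceptance_rate(conversations: List[Conversation]) -> float:
    """Fraction of shown suggestions that users accepted, pooled across conversations."""
    shown = sum(c.suggestions_shown for c in conversations)
    if shown == 0:
        return 0.0
    return sum(c.suggestions_accepted for c in conversations) / shown


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    logs = [
        Conversation(escalated=False, suggestions_shown=4, suggestions_accepted=3),
        Conversation(escalated=True, suggestions_shown=2, suggestions_accepted=0),
        Conversation(escalated=False, suggestions_shown=4, suggestions_accepted=2),
        Conversation(escalated=False, suggestions_shown=0, suggestions_accepted=0),
    ]
    print(f"escalation rate: {escalation_rate(logs):.0%}")
    print(f"acceptance rate: {acceptance_rate(logs):.0%}")
```

Pooling suggestions across conversations weights busy conversations more heavily; averaging per conversation is an equally defensible choice, depending on what the business wants to track.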
Technical metrics still matter as the base of correct functioning. Sometimes they contribute to business performance directly, like the latency of a customer support voice agent.
Incremental agent improvements over premature AGI
We found that teams who start simple and expand with incremental changes to their agent have much higher success rates than teams who start with an already very capable multi-agent system and try to refine it afterward.
Starting with a single LLM plus one tool, launching, proving it does one small job very well, and expanding from there is actually faster than starting with an orchestrator, fifteen subagents, forty-two tools, voice, and UI generation, and then trying to whip it into shape.
We call that pattern Premature AGI.
There is still value in the items on the right. Bigger steps can pragmatically lead to better results and uncover new directions. Occasional experiments with hyped new approaches should be part of the systematic approach to quality.
Build agents the way teams build software that works in production: with tests you can read, metrics that the business cares about, and iteration small enough that the next change is obvious.
Ship agents with confidence, not crossed fingers.