Evals that live in your codebase and sync within the platform
Define quality metrics, run them offline in CI or live on production traffic, and catch regressions before they ship. Bring your own datasets, tools, and pandas - add a few lines and start tracking.
import langwatch
evaluation = langwatch.evaluation.init("rag-quality-experiment")
for index, row in evaluation.loop(df.iterrows(), threads=4):
response = execute_rag_pipeline(row["question"])
evaluation.run("ragas/faithfulness", index=index, data={...})
evaluation.log("confidence", index=index, score=response.confidence)Add evaluation tracking to your existing workflow.
Keep using pandas and your favourite tools. Initialise an experiment, loop over your dataset, run built-in evaluators, and log your own metrics alongside - in parallel.
import langwatch
evaluation = langwatch.evaluation.init("rag-quality-experiment")
for index, row in evaluation.loop(df.iterrows(), threads=4):
def evaluate(index, row):
response, contexts = execute_rag_pipeline(row["question"])
# built-in RAGAS faithfulness evaluator
evaluation.run(
"ragas/faithfulness",
index=index,
data={"input": row["question"], "output": response, "contexts": contexts},
settings={"model": "openai/gpt-5", "max_tokens": 2048},
)
# log your own metric alongside
evaluation.log("confidence", index=index, score=response.confidence)
evaluation.submit(evaluate, index, row) # runs in parallelA full suite of evaluators, out of the box.
Browse all evaluatorsLLM-as-a-judge scores answers against your own natural-language criteria. Need something bespoke? Bring your own evaluator and it shows up right next to the built-ins.
Custom evaluators, connected to your traces.
Have an in-house metric? Run it in your own code and attach the result to the current trace or span so it shows up next to the built-in evaluators.
name is required, and at least one of passed / score / label must be set.
import langwatch
@langwatch.span(type="evaluation")
def evaluation_step():
# ... your custom evaluation logic ...
langwatch.get_current_span().add_evaluation(
name="category_match", # required
passed=True,
score=0.5,
label="category_detected",
details="explanation of the result",
)Offline, online, and in your CI/CD.
Self-host or run in your own VPC - keep everything local when you need to.
Run experiments on datasets, compare prompts and models side-by-side, and validate model upgrades before they ship.
Run evals continuously on production traffic and alert when quality drops.
Run your eval suite on every PR via the Python & TypeScript SDKs, and gate merges on the results.
Developer-first, but not developer-only.
Developers define evals in code; product owners, QA, and domain experts define the quality framework and run evaluations with the zero-code wizard. One shared source of truth.