Evals that live in your codebase and sync with the platform

Define quality metrics, run them offline in CI or live on production traffic, and catch regressions before they ship. Bring your own datasets, tools, and pandas - add a few lines and start tracking.

eval.py
import langwatch

evaluation = langwatch.evaluation.init("rag-quality-experiment")

for index, row in evaluation.loop(df.iterrows(), threads=4):
    response = execute_rag_pipeline(row["question"])
    evaluation.run("ragas/faithfulness", index=index, data={...})
    evaluation.log("confidence", index=index, score=response.confidence)

input                 score   status   ms    cost
How do I cancel?      0.94    pass     420   $0.002
Refund for expired…   0.61    fail     680   $0.003
Where is my order?    0.97    pass     390   $0.002

Add evaluation tracking to your existing workflow.

Keep using pandas and your favourite tools. Initialise an experiment, loop over your dataset, run built-in evaluators, and log your own metrics alongside - in parallel.

parallel · threaded
import langwatch

evaluation = langwatch.evaluation.init("rag-quality-experiment")

for index, row in evaluation.loop(df.iterrows(), threads=4):
    def evaluate(index, row):
        response, contexts = execute_rag_pipeline(row["question"])

        # built-in RAGAS faithfulness evaluator
        evaluation.run(
            "ragas/faithfulness",
            index=index,
            data={"input": row["question"], "output": response, "contexts": contexts},
            settings={"model": "openai/gpt-5", "max_tokens": 2048},
        )

        # log your own metric alongside
        evaluation.log("confidence", index=index, score=response.confidence)

    evaluation.submit(evaluate, index, row)   # runs in parallel

modern-solid-koala · Run #33
Summary panels: Total Cost, Avg Latency, Pass Rate

product            brand      tier
Apple/pear juice   Flevosap   B-brand
Smoothies          Innocent   A-brand
Vitamin shots      G’nger     A-brand

A full suite of evaluators, out of the box.

RAG quality: RAGAS faithfulness, Answer relevancy, Context precision, Context recall
Safety: Hallucination detection, Toxicity, PII detection, Jailbreak / prompt-injection
Quality: Answer correctness, LLM-as-a-judge (your criteria), BLEU, Embedding distance
Ops: Latency (automatic), Cost (automatic)

LLM-as-a-judge scores answers against your own natural-language criteria. Need something bespoke? Bring your own evaluator and it shows up right next to the built-ins.

Custom evaluators, connected to your traces.

Have an in-house metric? Run it in your own code and attach the result to the current trace or span so it shows up next to the built-in evaluators.

name is required, and at least one of passed / score / label must be set.

custom_evaluator.py
import langwatch

@langwatch.span(type="evaluation")
def evaluation_step():
    # ... your custom evaluation logic ...
    langwatch.get_current_span().add_evaluation(
        name="category_match",        # required
        passed=True,
        score=0.5,
        label="category_detected",
        details="explanation of the result",
    )
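
The rule above (name is required, and at least one of passed / score / label must be set) can be checked in your own code before the result is attached. The sketch below is only an illustration: check_evaluation_fields is a hypothetical helper, not part of the langwatch SDK, and it assumes "set" means "not None".

```python
from typing import Optional


def check_evaluation_fields(
    name: str,
    passed: Optional[bool] = None,
    score: Optional[float] = None,
    label: Optional[str] = None,
) -> None:
    """Hypothetical pre-check: raise ValueError if the stated rule is broken."""
    if not name:
        raise ValueError("name is required")
    # Compare against None so that passed=False and score=0.0 still count as set.
    if passed is None and score is None and label is None:
        raise ValueError("set at least one of passed, score, or label")


check_evaluation_fields("category_match", score=0.5)     # name plus a score
check_evaluation_fields("category_match", passed=False)  # a failing result is still a result
```

Comparing against None rather than relying on truthiness matters here: a failed check (passed=False) or a zero score is exactly the kind of result you want recorded.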

Offline, online, and in your CI/CD.

Self-host or run in your own VPC - keep everything local when you need to.

Offline / batch

Run experiments on datasets, compare prompts and models side-by-side, and validate model upgrades before they ship.

Model A vs Model B
gpt-5           0.92
claude-sonnet   0.88

Real-time / online

Run evals continuously on production traffic and alert when quality drops.
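
One way to picture "alert when quality drops" is a rolling mean over the most recent scores. This is a minimal sketch under stated assumptions, not how the platform implements alerting: QualityMonitor, the window size, and the threshold are all hypothetical, and the last two scores in the example are made up for illustration.

```python
from collections import deque


class QualityMonitor:
    """Hypothetical helper: tracks recent scores and flags a low rolling mean."""

    def __init__(self, window: int = 50, threshold: float = 0.8):
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Add a score; return True when a full window averages below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        rolling_mean = sum(self.scores) / len(self.scores)
        return window_full and rolling_mean < self.threshold


monitor = QualityMonitor(window=3, threshold=0.8)
alerts = [monitor.record(s) for s in [0.94, 0.61, 0.97, 0.55, 0.60]]
# alerts == [False, False, False, True, True]
```

Waiting for a full window before alerting avoids firing on the first low score after a restart; a single outlier such as 0.61 does not trigger anything on its own.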

CI/CD

Run your eval suite on every PR via the Python & TypeScript SDKs, and gate merges on the results.
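
Gating a merge usually comes down to turning eval results into a process exit code. The sketch below assumes results arrive as a list of dicts with a boolean "passed" field; that shape, the gate and pass_rate helpers, and the 90% minimum are assumptions for illustration, not part of the SDK.

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of rows whose 'passed' flag is true; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


def gate(results: list[dict], min_pass_rate: float = 0.9) -> int:
    """Return an exit code: 0 lets the merge proceed, 1 blocks it."""
    rate = pass_rate(results)
    print(f"pass rate: {rate:.0%} (minimum {min_pass_rate:.0%})")
    return 0 if rate >= min_pass_rate else 1


# Illustrative rows only; a real job would load the results of the eval run.
sample = [
    {"input": "How do I cancel?", "passed": True},
    {"input": "Refund for expired…", "passed": False},
    {"input": "Where is my order?", "passed": True},
]
exit_code = gate(sample)  # 2 of 3 rows pass, below the 90% minimum, so this is 1
# In a CI job, finish with: sys.exit(exit_code)
```

Treating an empty run as a 0% pass rate means a job that silently evaluated nothing fails the gate instead of passing it.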

Developer-first, but not developer-only.

Developers define evals in code; product owners, QA, and domain experts define the quality framework and run evaluations with the zero-code wizard. One shared source of truth.

langwatch · experiments workbench · no code
proud-solid-lynx Run
Evaluator: LLM Answer Match · demo-prompt: Score 33% · infield-agent v1: Score 0%

input: How do I update my billing information?
expected_output: Update it in Settings → Billing, then Edit Payment Method.
demo-prompt: To update your billing information, follow these steps: 1. Log in to your account…
infield-agent v1: I can only help with Davis Instruments weather stations and field-monitoring tasks.

input: I’m having trouble logging into my account
expected_output: Try the “Forgot Password” link on the login page.
demo-prompt: I’m sorry to hear you’re having trouble logging in. Could you share a few details?
infield-agent v1: Searching the knowledge base for “login”, “password reset” for Davis Instruments…

input: What are your business hours?
expected_output: We’re available Mon-Fri, 9 AM to 6 PM in your local timezone.
demo-prompt: Our business hours are Monday through Friday, 9:00 AM to 6:00 PM.
infield-agent v1: I don’t have business hours - I only help with Davis weather stations.

No code required - product owners, QA and domain experts run and compare evals right in the UI, or ask Langy to build one.

Start evaluating.