The LangWatch Blog

Engineering deep-dives, product updates, and field lessons on evaluating, testing, and observing AI agents in production.

Article

Customer Story: How Roojoom automates AI Agent Quality Control with LangWatch Scenario

Using LangWatch Scenario, the Roojoom product team built a daily automated way to ship new AI features with confidence.

Manouk Draisma · June 25, 2026

Article

Introducing: Testing voice agents like you test your chat agents

Test voice agents the way you test text agents - simulated callers, traces, playback, and judge-based evaluation -…

Manouk · June 2, 2026

Article

What happens when two engineering teams just... talk

Manouk Draisma · May 19, 2026

Developers

Eat Sleep Append Repeat…

At LangWatch, we process a not-insignificant number of LLM traces, agentic simulations, evaluations, and experiment…

Alex Forbes-Reed · April 20, 2026

Developers

Four Refactors and a Funeral: Migrating a Live System to Event Sourcing

LangWatch is open source. Every commit hash in this post is real and clickable. You can see exactly how we got from…

Alex Forbes-Reed · April 20, 2026

Developers

Internal Product vs Internalised Trauma: Supporting Event Sourced Systems

Before we built anything custom we added metric reporting to the group queue, sent to Grafana. Group queue overview,…

Alex Forbes-Reed · April 20, 2026

Getting to value with LangWatch, faster than ever - how to migrate from Langfuse to LangWatch with Skills.

March 25, 2026

Article

A Note on the LiteLLM Vulnerability

March 25, 2026

Product Features

Product Managers and leaders are running agent simulations now, and it's changing how AI ships

March 24, 2026

Article

Making your AI Agent reliable: Adding Evaluations to your multi-modal agent with LangWatch Skills

March 23, 2026

Product Features

LangWatch Skills: Your coding agent already knows how to test your agent

March 12, 2026

Product Features

Introducing LangWatch MCP: Test and evaluate AI Agents without leaving your workflow

March 6, 2026

Developers

The Agent Development Lifecycle: Why shipping is the easy part

February 20, 2026

Article

New Pricing: AI growth shouldn’t increase your bill

February 10, 2026

Article

What is LLM monitoring? (Quality, cost, latency, and drift in production)

February 10, 2026

Article

What is Prompt Management? And how to version, control & deploy prompts in production

February 3, 2026

Developers

How OpenClaw / ClawBot works behind the scenes - and why agent observability matters

February 3, 2026

Article

Instrumenting Your OpenClaw Agent with LangWatch via OpenTelemetry

February 3, 2026

Article

How to Use Clawdbot + LangWatch to Monitor Your Agents in Production

February 2, 2026

Article

LLM Evaluations Explained: Experiments, Online Evaluations, Guardrails, and when to use each in 2026

January 30, 2026

Article

4 best tools for monitoring LLM & agent applications in 2026

January 30, 2026

Article

Arize AI alternatives: Top 5 Arize competitors compared (2026)

January 30, 2026

Article

Top 8 LLM Observability Tools: Complete Guide for 2025

January 30, 2026

Article

Top 5 AI evaluation tools for AI agents & products in production (2026)

January 29, 2026

Article

How to test AI Agents with LangWatch & Mastra / Google ADK and ship them reliably

December 30, 2025

Article

Top Tools for Evaluating Voice Agents in 2025

December 29, 2025

Article

What are the AI Agent Events in 2026: The must-attend conferences for Agentic AI Builders

December 24, 2025

Product Features

Closing the year Strong: December Product Updates

December 23, 2025

Article

How to do Tracing, Evaluation, and Observability for Google ADK

December 23, 2025

Article

Top 5 AI Prompt Management Tools of 2025

December 23, 2025

Article

Writing Effective AI Evaluations, that hold up in production

December 12, 2025

Article

Why Agentic AI needs a new layer of testing

November 26, 2025

Product Features

Launch Week Day 5: Better Agents CLI: The reliability layer for the next wave of agent development

November 25, 2025

Article

Scenario MCP: Automatic Agent Test Generation inside your editor

November 24, 2025

Article

Testing Voice Agents with LangWatch Scenario in Real Time

November 20, 2025

Article

A Systematic Way of Testing AI Agents

November 20, 2025

Product Features

Introducing: LangWatch newest Prompt Playground

October 27, 2025

Article

How LangWatch helps enterprises test, evaluate, and trust their AI before release

October 17, 2025

Article

Build vs Buy - Should you build your own LLMOps stack or leverage a purpose-built platform designed for enterprise scale?

October 17, 2025

Article

The 4 Best LLM Evaluation Platforms in 2025: Why LangWatch redefines the category with Agent Testing (with Simulations)

October 15, 2025

Article

Need-based Context Engineering: Let tests tell you what your AI agent actually needs

October 6, 2025

Article

The Ultimate RAG Blueprint: Everything you need to know about RAG in 2025/2026

September 26, 2025

Article

From Scenario to Finished: How to Test AI Agents with Domain-Driven TDD

September 25, 2025

Article

Building Reliable AI Applications: Why Evals (and Scenarios) Are the backbone of trustworthy AI

September 7, 2025

Article

Are evals dead?

September 3, 2025

Article

Essential LLM evaluation metrics for AI quality control: From error analysis to binary checks

August 22, 2025

Article

Trace IDs in AI: LLM Observability and Distributed Tracing

August 19, 2025

Article

The 6 context engineering challenges stopping AI from scaling in production

August 18, 2025

Article

LLMOps is the new DevOps, here’s what every developer must know

August 14, 2025

Article

LLM observability: What is it and why it matters

August 8, 2025

Product Features

GPT-5 Release: From Benchmarks to production reality

August 7, 2025

Article

LLM-as-a-Judge: Using the Panel of Judges Approach to Approximate Human Preference

August 1, 2025

Product Features

Observability Framework Design for LLM Apps - The Complete LangWatch Guide

July 18, 2025

Developers

Top 4 Humanloop Alternatives in 2025

July 7, 2025

Article

Why Agent Simulations are the new Unit Tests for AI

June 27, 2025

Article

Real-time simulation visualization and debug mode

June 26, 2025

Article

Scripted simulations, evaluations, and guardrails

June 25, 2025

Article

Test agents on Mastra, Agno, and 10+ other frameworks

June 24, 2025

Article

Introducing simulation-based agent testing

June 24, 2025

Article

Why LangWatch Scenarios represents the future of AI agent testing

June 21, 2025

Article

Best AI Agent Frameworks in 2025: Comparing LangGraph, DSPy, CrewAI, Agno, and More

June 20, 2025

Article

Multilingual AI Agent Testing: Using Scenario to Simulate, Break, and Improve LLMs

June 18, 2025

Article

LangSmith Alternatives: What to use if you need more security and control

June 13, 2025

Article

Intro to Scenario (Testing AI agents)

June 12, 2025

Article

Simulations from First Principles (How to test your agents)

June 11, 2025

Article

Agent Evaluation: Framework for Testing AI Agents

June 6, 2025

Article

Simulation Based Eval Framework

May 30, 2025

Article

Introduction: The Real Issue isn’t RL

May 28, 2025

Article

Simulations to Test My Agent

May 15, 2025

Developers

New Python SDK Brings Native OpenTelemetry to GenAI Observability

May 5, 2025

Product Features

April Product Recap: Selene Integration, Eval Wizard Upgrades, Prompt Studio & More

May 5, 2025

Product Features

LLM Monitoring & Evaluation for Real-World Production Use

April 24, 2025

Article

Systematically Improving RAG Agents

April 22, 2025

Product Features

Introducing the Evaluations Wizard: How to evaluate your LLM: Building an LLM evaluation framework that actually works

April 18, 2025

Article

Function Calling vs. MCP: Why You Need Both - and How LangWatch Makes It Click

April 18, 2025

Article

Why LLM Observability is Now Table Stakes

April 17, 2025

Article

LangWatch vs. LangSmith vs. Braintrust vs. Langfuse: Choosing the Best LLM Evaluation & Monitoring Tool in 2025

April 8, 2025

Article

Introducing Scenario: Use an Agent to Test Your Agent

April 4, 2025

Article

Tackling LLM Hallucinations with LangWatch: Why Monitoring and Evaluation Matter

April 3, 2025

Article

LLM evaluations at Swis for Dutch government projects by LangWatch

April 2, 2025

Article

Why Your AI Team Needs an AI PM (Quality) Lead

March 27, 2025

Article

LangWatch and adesso join forces: Accelerating Secure LLM Adoption for Enterprises

March 25, 2025

Article

LLMOps Is Still About People: How to Build AI Teams That Don’t Implode

March 20, 2025

Article

Practical LLM Evaluation Framework for AI Development Teams

March 16, 2025

Article

What is Model Context Protocol (MCP)? And how's LangWatch involved?

March 14, 2025

Article

How PHWL.ai uses LLM Observability and Optimization to Improve AI Coaching with LangWatch

February 25, 2025

Article

LangWatch.ai - Announcing - €1M funding round to bring the power of Evaluations and Auto-Optimizations to AI teams.

February 20, 2025

Article

OpenAI, Anthropic, Deepseek and other LLM Providers keep dropping prices: Should you host your own model?

January 1, 2025

Article

7 Predictions for AI in 2025: A CTO's, Rogerio Chaves Perspective

December 20, 2024

Article

Customer Stories: HolidayHero AI start-up <> LangWatch

December 10, 2024

Article

LangWatch Optimization Studio - Built for AI Engineers, by AI Engineers

November 10, 2024

Article

The power of MIPROv2 (DSPy) in a Low-Code environment with LangWatch’s Optimization Studio

November 7, 2024

Article

What is Prompt Optimization? An Introduction to DSPy and Optimization Studio

July 27, 2024

Article

Deploying an OpenAI RAG Application to AWS ElasticBeanstalk

July 3, 2024

Article

The complete guide for TDD with LLMs

June 27, 2024

Article

Data Flywheel: Using your production data to build better LLM products

June 11, 2024

Article

How Algomo reduced AI hallucinations with LangWatch

June 10, 2024

Article

The AI Team: Integrating User and Domain Expert Feedback to Enhance LLM-Powered Applications

June 10, 2024

Article

Unit Testing Your LLM: The Power of Datasets

June 3, 2024

Product Features

Introducing DSPy Visualizer

May 20, 2024

Article

New Dutch Startup, LangWatch, brings much-needed quality control to GenAI

May 14, 2024

Article

How to build a RAG application from scratch with the least possible AI Hallucinations

May 13, 2024

Article

LLM Reliability with Retrieval-Augmented Generation

May 13, 2024

Article

Safeguarding Your First LLM-Powered Innovation: Essential Practices for Security

May 10, 2024

Article

What is User Analytics for LLMs, The Difference With Traditional Analytics, And Why is it Important?

May 8, 2024

Article

Unlocking the Potential of Large Language Models: The LLM's Beyond the Hype

May 6, 2024

Article

The 8 Types of LLM Hallucinations

May 1, 2024

Article

5 Things You Must Consider Before Putting Your Chatbot Live in Production

May 1, 2024

Article

Navigating the Complexities of AI-Powered Products

April 29, 2024

Article

Understanding Hallucinations: What are they?

April 18, 2024

Article

Mastering the GenAI Wave: Strategies for Success in AI Adoption

April 18, 2024

Article

Successfully building an AI Startup in the current booming industry

April 17, 2024

Article

How Struck.build improved AI Performance with LangWatch

April 8, 2024

Article

Journey Through Innovation: The LLM Adventure

Article

Every way your AI agent can be broken (and how attackers actually do it)

Why AI Red teaming is broken (and how we fixed it)

How we test Agent Skills with Scenario simulations