Evaluation and Observability

Regression Testing

definition

Regression testing for agent systems verifies that changes to prompts, tools, models, or configurations do not break previously working behavior. It catches the "fixed one thing, broke three others" pattern that is common in non-deterministic systems. Unlike traditional software regression tests, which have binary pass/fail outcomes, agent regression tests must account for acceptable variation: an output may differ in wording from a previous run and still be semantically correct.

Effective approaches include snapshot testing (comparing current outputs to golden examples), statistical evaluation (measuring quality metrics across a test suite and flagging significant degradation), and canary deployments (rolling out changes to a subset of traffic and monitoring for regressions).

Regression testing is critical for agentic systems because prompts are fragile: a change that improves one task category often degrades another, and without regression tracking, teams optimize in circles.

This concept connects to eval-driven development (the broader evaluation practice), eval frameworks (the tooling that runs regression suites), quality metrics (which define what "regression" means quantitatively), and prompt iteration (the improvement process that regression tests protect).
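As a rough illustration of the snapshot and statistical approaches described above, the following Python sketch compares agent outputs against golden examples with a tolerance and flags a suite-level drop in quality. Everything named here is hypothetical and chosen for illustration: `GoldenCase`, `similarity`, `snapshot_failures`, `suite_regressed`, and the thresholds `min_similarity` and `max_drop`. The similarity function is a purely lexical stand-in built on the standard library; a real suite would more likely use an embedding-based or model-graded scorer to judge semantic correctness.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from statistics import mean


@dataclass
class GoldenCase:
    """One golden example: an input prompt and a previously approved output."""
    prompt: str
    golden_output: str


def similarity(a: str, b: str) -> float:
    """Word-level lexical similarity in [0, 1].

    Assumption: this is only a placeholder for a semantic scorer.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def snapshot_failures(cases, run_agent, min_similarity=0.6):
    """Snapshot test: list the prompts whose current output drifted
    further from the golden output than the tolerance allows."""
    failures = []
    for case in cases:
        score = similarity(run_agent(case.prompt), case.golden_output)
        if score < min_similarity:
            failures.append((case.prompt, round(score, 2)))
    return failures


def suite_regressed(baseline_scores, candidate_scores, max_drop=0.05):
    """Statistical check: report a regression when the mean quality score
    falls by more than max_drop between baseline and candidate runs."""
    return mean(baseline_scores) - mean(candidate_scores) > max_drop
```

Here `run_agent` is any callable that maps a prompt to the agent's current output, so the same checks can run against a stubbed agent in unit tests or a live one in a nightly suite. Comparing mean scores is the simplest possible degradation test; with small or noisy suites, a proper significance test over repeated runs would be a more defensible choice.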

related concepts

Test-Driven Agentic Development