The Agentic Workflow
Eval-Driven Development
Eval-driven development treats evaluations as first-class development artifacts: agent behavior is measured against defined criteria before, during, and after every change, analogous to test-driven development but designed for non-deterministic AI systems. Instead of manually checking "does this seem right?", eval-driven teams build evaluation datasets that encode expected behavior and run them automatically whenever prompts, tools, or models change. Without systematic evaluation, a prompt change that improves one use case can silently degrade others, creating a whack-a-mole dynamic that prevents meaningful improvement; research consistently shows that subjective assessment underperforms even simple automated evals at detecting regressions.
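The workflow above can be sketched as a minimal eval harness: a dataset of cases encoding expected behavior, run automatically against the agent to produce a pass rate that can gate changes. This is an illustrative sketch, not any particular framework's API; `EvalCase`, `run_evals`, the exact-match scorer, and the stand-in `fake_agent` are all hypothetical names, and a real harness would call the model under test and likely use fuzzier scoring (e.g. an LLM judge).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One evaluation case: an input and the expected behavior."""
    prompt: str
    expected: str


def run_evals(agent_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the agent and return the pass rate (0.0-1.0)."""
    passed = 0
    for case in cases:
        output = agent_fn(case.prompt)
        # Exact-match scoring for simplicity; real evals often need
        # semantic or rubric-based scoring for non-deterministic outputs.
        if output.strip().lower() == case.expected.strip().lower():
            passed += 1
    return passed / len(cases)


# Stand-in agent for demonstration; in practice this wraps the model
# and tool configuration under test.
def fake_agent(prompt: str) -> str:
    return "4" if prompt == "What is 2 + 2?" else "unknown"


cases = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is the capital of France?", "Paris"),
]

if __name__ == "__main__":
    score = run_evals(fake_agent, cases)
    print(f"pass rate: {score:.0%}")
```

Running this same dataset on every prompt, tool, or model change (and failing CI when the pass rate drops below a chosen threshold) is what turns "does this seem right?" into a regression signal.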
resources
- Anthropic: Evaluation Guide (docs.anthropic.com). Anthropic's guide to building systematic evaluations for Claude-based systems.
- Promptfoo (promptfoo.dev). Open-source eval framework for comparing prompts, models, and configurations.
- Braintrust (braintrust.dev). Platform for building, running, and analyzing LLM evaluations at scale.
- Hamel Husain: Your AI Product Needs Evals (hamel.dev). Influential essay on why evals are the most important thing you can build.