The Agentic Workflow

Eval-Driven Development

Eval-driven development treats evaluations as first-class development artifacts: agent behavior is measured against defined criteria before, during, and after every change, analogous to test-driven development but built for non-deterministic AI systems. Instead of manually asking "does this seem right?", eval-driven teams build evaluation datasets that encode expected behavior and run them automatically whenever prompts, tools, or models change. Without systematic evaluation, a prompt change that improves one use case silently degrades others, creating a whack-a-mole dynamic that blocks meaningful improvement; in practice, subjective spot-checking misses regressions that even simple automated evals catch.
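The core idea, an evaluation dataset that encodes expected behavior plus a runner that reports a pass rate, can be sketched in a few lines. This is a minimal, hypothetical harness: `run_agent` is a stand-in stub for the agent under test, and the cases and grading lambdas are illustrative assumptions, not a real dataset.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    grade: Callable[[str], bool]  # encodes the expected behavior for this case

def run_agent(prompt: str) -> str:
    # Hypothetical stub agent; a real harness would call the model or agent here.
    if "refund" in prompt.lower():
        return "Refunds are available within 30 days of purchase."
    return "I'm not sure."

# A tiny illustrative eval dataset: one happy-path check, one regression guard.
CASES = [
    EvalCase("refund_window", "What is the refund window?",
             lambda out: "30 days" in out),
    EvalCase("no_invented_guarantee", "Can I get a refund after a year?",
             lambda out: "guaranteed" not in out.lower()),
]

def run_evals(agent: Callable[[str], str]) -> dict:
    # Run every case against the agent and compute an aggregate pass rate.
    results = {case.name: case.grade(agent(case.prompt)) for case in CASES}
    return {"results": results, "pass_rate": sum(results.values()) / len(CASES)}

if __name__ == "__main__":
    print(run_evals(run_agent))
```

Wired into CI, a runner like this turns every prompt, tool, or model change into a measurable diff in pass rate rather than a subjective judgment call.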