Evaluation frameworks provide standardized tooling for defining test cases, running them against agent systems, and comparing results across different configurations of prompts, models, and tools. Key options include Promptfoo (an open-source CLI tool for comparing prompt variations), Braintrust (an end-to-end eval platform with trace analysis), and LangSmith (eval and observability integrated into the LangChain ecosystem). Each handles the infrastructure that makes eval-driven development practical: test case management, parallel execution, regression detection, and human review for ambiguous outputs.
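To make this concrete, here is a minimal sketch of a Promptfoo-style config. The prompts, variables, and assertion values are illustrative, not taken from any real project; consult the Promptfoo docs for the full assertion catalog.

```yaml
# promptfooconfig.yaml — illustrative example
# Two prompt variants are compared against the same provider and test cases.
prompts:
  - "Summarize the following support ticket in one sentence: {{ticket}}"
  - "You are a support triage bot. Briefly summarize: {{ticket}}"

providers:
  - openai:gpt-4o-mini  # swap for whichever model(s) you are comparing

tests:
  - vars:
      ticket: "Customer cannot reset their password; reset email never arrives."
    assert:
      - type: contains        # deterministic check
        value: "password"
      - type: llm-rubric      # model-graded check for ambiguous outputs
        value: "Summary mentions the email delivery problem"
```

Running `promptfoo eval` then executes every prompt/test combination and reports results in a comparison matrix, which is where regression detection against a previous run comes in.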