Foundations

Agent Benchmarks

Agent benchmarks are standardized evaluation suites that measure how well models and agent systems perform on specific task categories such as coding, web navigation, tool use, and multi-step reasoning. Widely used examples include SWE-bench (real-world GitHub issue resolution), HumanEval (code generation), and Chatbot Arena (human preference rankings). Benchmarks give the field a shared vocabulary for comparing models and architectures, but benchmark scores often overstate production utility: vendors optimize for known test sets, and that optimization rarely generalizes to the specific problems you actually need to solve. The practical lesson is to treat public benchmarks as a first filter for model selection, then validate candidates against domain-specific evaluations you build yourself before committing to any model in production.
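A domain-specific evaluation can start very small. The sketch below is a minimal, hypothetical harness, not a reference to any particular eval framework: each case pairs an input with a pass/fail grader, and the harness reports a pass rate. The `EvalCase` and `run_eval` names, the task format, and the stand-in "model" are all illustrative assumptions.

```python
# Minimal sketch of a domain-specific evaluation harness.
# All names here (EvalCase, run_eval) are hypothetical, and the
# "model" is a stand-in callable; a real eval would call your
# actual model or agent and use domain-specific graders.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                     # input given to the model/agent
    check: Callable[[str], bool]    # domain-specific pass/fail grader

def run_eval(cases: list[EvalCase], model: Callable[[str], str]) -> float:
    """Run every case through the model and return the fraction that pass."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)

# Toy usage: a fake "model" that upper-cases its input.
cases = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("world", lambda out: out == "WORLD"),
    EvalCase("agent", lambda out: out == "AGENT!"),  # deliberately fails
]
print(run_eval(cases, str.upper))  # 2 of 3 cases pass
```

Even a few dozen cases drawn from your real workload will surface failure modes that public leaderboards cannot, because the graders encode what "correct" means for your problem rather than for a generic test set.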