Agent Architecture Patterns

Error Recovery

definition

Error recovery patterns determine how agents detect, respond to, and recover from failures during execution, ranging from tool call failures and malformed outputs to reasoning dead-ends and infinite loops. Effective recovery strategies include retry with backoff (for transient failures), fallback models (switching to a different LLM when one fails), context truncation (reducing context size when hitting limits), and graceful degradation (completing partial work rather than failing entirely).
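As a rough illustration of the first two strategies, the sketch below retries a call with exponential backoff and then falls back to the next provider in a list. It is a minimal example under stated assumptions, not a reference implementation: `TransientError`, `retry_with_backoff`, and `call_with_fallback` are hypothetical names, and each provider is assumed to be a zero-argument callable wrapping an LLM or tool call.

```python
import random
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on TransientError, doubling the base delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller decide what to do
            # Exponential backoff plus jitter so parallel agents do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")


def call_with_fallback(providers: Sequence[Callable[[], T]], **retry_kwargs) -> T:
    """Try each provider in order, retrying transient failures before moving on."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return retry_with_backoff(provider, **retry_kwargs)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Only errors marked as transient are retried; anything else moves straight to the next provider, which keeps a permanently broken call from consuming the retry budget.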

The most important architectural decision is whether to let the agent self-recover, by including the error in its context and asking it to reason about alternatives, or to handle errors programmatically in the host application. In production agent systems, error recovery is not a nice-to-have: it can mean the difference between 60% and 99% task completion rates, because agents that recover from mid-task failures avoid expensive full restarts.

This concept connects to error handling tools (tool-level error design), human-in-the-loop (escalating to humans when recovery fails), and supervision (monitoring and intervening in agent failures).
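The two options in that architectural decision can be combined, and the sketch below shows one possible arrangement. Tool errors are appended to the agent's history so it can reason about an alternative (self-recovery), while the host application caps consecutive errors and total steps and returns the partial history when either limit is hit. All names here (`Action`, `RunResult`, `run_agent`, `decide`) are hypothetical, and `decide` stands in for whatever model call picks the next action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    tool: Optional[str]  # None means the agent considers the task finished
    argument: str = ""


@dataclass
class RunResult:
    status: str  # "done" or "escalated"
    history: List[str] = field(default_factory=list)


def run_agent(
    decide: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
    max_consecutive_errors: int = 2,
) -> RunResult:
    history: List[str] = []
    consecutive_errors = 0
    for _ in range(max_steps):
        action = decide(history)
        if action.tool is None:
            return RunResult("done", history)
        try:
            # An unknown tool name raises KeyError and is handled like any tool failure.
            observation = tools[action.tool](action.argument)
            consecutive_errors = 0
            history.append(f"{action.tool}({action.argument!r}) -> {observation}")
        except Exception as exc:
            # Self-recovery: put the error in context so the agent can pick an alternative.
            consecutive_errors += 1
            history.append(f"ERROR from {action.tool}({action.argument!r}): {exc!r}")
            if consecutive_errors >= max_consecutive_errors:
                # Host-side handling: stop and hand over the partial work.
                return RunResult("escalated", history)
    # Step budget exhausted; this guards against infinite loops.
    return RunResult("escalated", history)
```

An "escalated" result keeps the history collected so far, so a human reviewer or a supervising process can resume from the partial work instead of restarting the task.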