Latency Optimization
definition
Latency optimization reduces the end-to-end time for agent task completion through techniques like streaming responses, parallel tool calls, model selection for speed, prompt compression, and caching strategies. In multi-step agent loops, latency accumulates across iterations — a 2-second inference call in a 10-step task means 20 seconds of model time alone — making per-step optimization critical for user-facing applications.

Key strategies include using smaller, faster models for routine steps (model routing), parallelizing independent tool calls, pre-computing potentially needed data, and streaming intermediate results to maintain user engagement during long-running operations. The architectural insight is that perceived latency often matters more than actual latency: showing the user what the agent is doing (streaming tokens, displaying tool call progress) dramatically improves the experience even when total time is unchanged.

This concept connects to model selection for speed-quality trade-offs, context caching for reducing redundant processing, cost tracking because faster models are often cheaper, and token economics for understanding the relationship between context size and inference speed.
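The payoff from parallelizing independent tool calls can be sketched with `asyncio.gather`. The tool functions below (`search_docs`, `fetch_user_profile`) are hypothetical stand-ins, and the 1-second `asyncio.sleep` simulates network latency; the point is that independent calls cost roughly the *maximum* of their latencies when run concurrently, rather than the *sum* when run sequentially:

```python
import asyncio
import time

# Hypothetical tool functions standing in for real agent tools;
# asyncio.sleep simulates a ~1-second network round trip each.
async def search_docs(query: str) -> str:
    await asyncio.sleep(1.0)
    return f"docs for {query!r}"

async def fetch_user_profile(user_id: int) -> str:
    await asyncio.sleep(1.0)
    return f"profile for user {user_id}"

async def sequential() -> float:
    """Run the two tool calls one after another: total ~= sum."""
    start = time.perf_counter()
    await search_docs("latency")
    await fetch_user_profile(42)
    return time.perf_counter() - start

async def parallel() -> float:
    """Run the two independent tool calls concurrently: total ~= max."""
    start = time.perf_counter()
    await asyncio.gather(search_docs("latency"), fetch_user_profile(42))
    return time.perf_counter() - start

seq = asyncio.run(sequential())
par = asyncio.run(parallel())
print(f"sequential: {seq:.1f}s, parallel: {par:.1f}s")
```

This only applies when the calls have no data dependency on each other; a tool call whose arguments come from a previous call's result must still wait for it.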