Glossary
98 agentic coding concepts, defined.
A/B Testing Agents
A/B testing agents involves running different agent configurations (prompts, models, tools, parameters) in parallel on real traffic to measure which performs better on production metrics. Unlike offline evals that test against curated datasets, A/B testing captures real-world performance including edge cases, user patterns, and environmental factors that test suites can't predict.
A2A Protocol
The Agent-to-Agent (A2A) protocol, introduced by Google, defines a standard for how independent AI agents communicate and collaborate with each other to complete complex tasks. While MCP standardizes how agents connect to tools and data, A2A standardizes how agents connect to other agents — enabling scenarios where a coding agent delegates a research task to a specialized research agent, or a planning agent coordinates multiple execution agents.
Agent Benchmarks
Agent benchmarks are standardized evaluation suites that measure how well models and agent systems perform on specific task categories like coding, web navigation, tool use, and multi-step reasoning. Popular benchmarks include SWE-bench (real-world GitHub issue resolution), HumanEval (code generation), GAIA (general AI assistants), and Chatbot Arena (human preference rankings).
Agent Config Files
Configuration files like CLAUDE.md, .cursorrules, and Copilot instruction files give agents persistent project context across sessions.
Agentic Git Workflow
Managing version control alongside agentic coding tools requires adapted git workflows that account for the volume and nature of AI-generated changes. Best practices include creating branches for each agentic task (so changes can be reviewed atomically), committing frequently (so you can bisect and revert if the agent introduces regressions), and using descriptive commit messages that indicate AI involvement.
Agentic vs Chat
Chatbots respond to individual messages in a stateless, turn-based fashion, while agentic systems take autonomous actions in loops, using tools and making decisions across multiple steps without waiting for human input at each turn. The key distinction is agency: an agent decides what to do next, executes actions, observes results, and iterates until a goal is achieved — or until it determines it needs human guidance.
API Basics
Making HTTP API calls to LLM providers is the foundational skill for building any agent system, as every agentic tool — from IDE agents to custom pipelines — ultimately sends requests to a model API and processes the structured response. Understanding request anatomy (authentication, endpoints, model parameters, streaming), response handling (token usage, finish reasons, tool call outputs), and provider differences (OpenAI's chat completions vs Anthropic's messages API vs Google's Gemini API) is essential for any developer moving beyond pre-built tools.
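A minimal sketch of request anatomy, assuming an OpenAI-style chat completions endpoint; the URL, model name, and header format here are illustrative, and a real call would POST the body over HTTPS:

```python
import json

# Build (but don't send) a chat-completions-style request. The endpoint,
# model name, and bearer-token header are illustrative assumptions.
def build_request(api_key: str, user_message: str) -> dict:
    return {
        "url": "https://api.openai.com/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",   # authentication
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": "gpt-4o-mini",                 # model parameter
            "messages": [
                {"role": "system", "content": "You are a coding assistant."},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": 512,                      # cap on output length
            "stream": False,                        # set True for streaming
        }),
    }

req = build_request("sk-...", "Explain idempotency in one sentence.")
print(json.loads(req["body"])["model"])
```

The response side mirrors this structure: parse the JSON body, then inspect token usage, the finish reason, and any tool call outputs before deciding the next step.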
Audit Logging
Audit logging creates a tamper-resistant record of every action an agent takes — tool calls, file modifications, API requests, data access, and decision points — providing the forensic trail needed for security investigations, compliance audits, and post-incident analysis. Unlike standard application logging, agent audit logs must capture the full reasoning context: not just what the agent did, but what it was thinking, what information it had access to, and what triggered each decision.
Autonomous CI/CD
Autonomous CI/CD extends traditional CI/CD agents by giving them the ability to not just run pipelines but autonomously respond to pipeline events — automatically fixing failing builds, resolving merge conflicts, updating dependencies, and even triaging and fixing bug reports without human intervention. The vision is a software delivery pipeline where agents handle the routine maintenance that currently consumes significant developer time: a test failure triggers an agent that diagnoses the cause, writes a fix, runs the test suite, and opens a PR — all before a human developer even notices the failure.
Blast Radius Containment
Blast radius containment is the practice of designing agent systems so that any single failure, error, or security compromise affects the smallest possible scope — preventing a problem in one agent or tool from cascading into a system-wide incident. Key containment strategies include filesystem scoping (restricting agents to specific directories), network isolation (limiting which endpoints agents can reach), transaction boundaries (making destructive operations reversible), and resource limits (capping tokens, time, and compute per task).
Chain of Thought
Chain-of-thought prompting instructs models to show their reasoning steps before arriving at an answer, significantly improving accuracy on complex tasks like math, logic, and multi-step planning. For agent systems, visible reasoning makes decisions transparent and debuggable, helping developers understand why an agent chose a particular tool or action — without chain-of-thought, agent failures become opaque black boxes.
Choosing Your Stack
Selecting the right combination of agentic coding tools requires evaluating language and framework support, model flexibility, privacy requirements, and how well tools integrate with your existing workflow. The best stack depends on your context: a solo developer might prefer an all-in-one IDE agent like Cursor, while a team might combine a CLI agent for automation with a code review agent for quality assurance and a dedicated PR reviewer for governance.
CI/CD Agents
CI/CD agents integrate agentic capabilities into continuous integration and deployment pipelines, automatically fixing failing tests, resolving dependency conflicts, managing infrastructure changes, and responding to alerts without human intervention. These agents can be triggered by CI events (build failure, test regression, security scan alerts) and autonomously branch, fix, commit, and open PRs — extending the agentic coding paradigm from development-time assistance to fully automated pipeline operations.
CLI Agents
Command-line agentic tools operate directly in the terminal, reading and writing files, running commands, and iterating on code without a graphical IDE. Tools like Claude Code, Aider, and OpenAI Codex are powerful for automation, scripting, and CI/CD integration because they work in any environment with a shell — including headless servers and Docker containers.
Code Review Agents
Code review agents automatically analyze pull requests to catch bugs, flag style inconsistencies, identify security vulnerabilities, and suggest improvements before human reviewers get involved. These agents integrate directly with GitHub, GitLab, and other platforms to post inline comments on diffs, effectively functioning as an always-available first reviewer.
Code Review Workflow
The process of reviewing, validating, and approving AI-generated code before it merges into the main codebase, arguably the most important quality gate in any agentic coding workflow. Reviewing agent-generated code requires a different mental model than reviewing human code: agents produce syntactically correct code that may be architecturally wrong, subtly misaligned with project conventions, or implementing the wrong abstraction entirely.
Compliance
Compliance in agentic systems addresses the regulatory, legal, and organizational requirements that govern how AI agents handle data, make decisions, and affect systems in production environments. Key compliance frameworks include GDPR (data privacy in Europe), SOC 2 (security and availability for SaaS), HIPAA (healthcare data protection), and emerging AI-specific regulations like the EU AI Act, each imposing constraints on how agents can process information and what oversight is required.
Context Assembly Pipelines
Context assembly pipelines are the programmatic systems that gather, filter, prioritize, and format information from multiple sources before injecting it into the model's context window for each inference call. Unlike simple prompt templates with static content, these pipelines dynamically assemble context based on the current task — pulling relevant files from a codebase index, recent conversation history, retrieved documentation, active errors, and tool outputs into a structured prompt that maximizes the model's effectiveness.
Context Caching
Context caching lets you reuse previously processed prompt prefixes across multiple API calls, reducing both cost and latency for repeated context like system prompts, documentation, or few-shot examples. Anthropic's prompt caching can reduce input token costs by up to 90% and latency by 85% for cached content, while OpenAI and Google offer similar automatic caching mechanisms.
Context Density
Context density measures how much useful, task-relevant information is packed into each token of context, and optimizing for density is the primary lever for getting better agent performance within fixed context window constraints. Low-density context wastes tokens on boilerplate, irrelevant examples, verbose formatting, and redundant information; high-density context strips everything to its essential signal — tight type signatures instead of full source files, relevant test failures instead of entire test suites, summarized conversation history instead of raw transcripts.
Context Engineering vs Prompting
Context engineering is the broader discipline of designing everything a model sees, not just the user prompt. It encompasses system prompts, conversation history, retrieved documents, tool definitions, few-shot examples, and tool results that together shape model behavior — while prompt engineering focuses narrowly on crafting individual instructions.
Context Window Budget
Every model has a finite context window measured in tokens, and managing it as a budget is essential for effective agent design. You must allocate space for system instructions, conversation history, retrieved context, tool definitions, and leave room for the model's reasoning and output — exceeding the context window causes silent truncation or errors, while wasting tokens on irrelevant information degrades model performance even within the limit.
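A sketch of the budgeting idea: reserve space for the system prompt and the model's output, then fill what remains with history, newest first. The 4-characters-per-token estimate is a rough stand-in for a real tokenizer:

```python
# Approximate tokens as len(text) // 4; production code would use the
# model's actual tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_history(system: str, history: list[str],
                window: int, reserve_output: int) -> list[str]:
    budget = window - estimate_tokens(system) - reserve_output
    kept: list[str] = []
    for msg in reversed(history):           # newest messages first
        cost = estimate_tokens(msg)
        if cost > budget:
            break                           # stop before overflowing the window
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))             # restore chronological order

history = ["plan discussion " * 20, "latest error: ImportError in utils.py"]
print(fit_history("You are a coding agent.", history,
                  window=64, reserve_output=32))  # keeps only the newest message
```

Dropping oldest-first is the simplest eviction policy; summarizing evicted history instead preserves more signal per token.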
Cost Tracking
Cost tracking monitors and attributes the token and API expenditure of agent systems, providing financial visibility into how much individual tasks, workflows, and users cost to serve. In agentic systems, costs are particularly unpredictable because the number of inference calls per task varies based on the model's reasoning path — a simple task might take 2 tool calls while a complex one generates 50, and without tracking you can't distinguish normal variation from runaway loops burning money.
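The accounting itself is simple; the value comes from attributing every call to a task. A sketch with made-up per-million-token prices (not any provider's actual rates):

```python
# Illustrative placeholder prices: (input, output) USD per million tokens.
PRICES = {"big-model": (3.00, 15.00), "small-model": (0.25, 1.25)}

class CostTracker:
    def __init__(self) -> None:
        self.by_task: dict[str, float] = {}

    def record(self, task_id: str, model: str,
               in_tokens: int, out_tokens: int) -> float:
        in_price, out_price = PRICES[model]
        cost = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
        # Accumulate per task so runaway loops show up as outlier task costs.
        self.by_task[task_id] = self.by_task.get(task_id, 0.0) + cost
        return cost

tracker = CostTracker()
tracker.record("issue-7", "big-model", 12_000, 800)
tracker.record("issue-7", "small-model", 4_000, 200)
print(f"${tracker.by_task['issue-7']:.4f}")
```

Alerting on per-task totals (rather than aggregate spend) is what catches a single loop burning money among thousands of normal tasks.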
Data Exfiltration
Data exfiltration in agentic systems occurs when an agent is tricked or exploited into sending sensitive information — API keys, source code, customer data, environment variables — to unauthorized external destinations. This can happen through prompt injection (where a malicious instruction tells the agent to include secrets in an HTTP tool call), through tool misuse (where the agent inadvertently includes sensitive data in a response), or through context leakage (where conversation history containing secrets gets sent to an unintended party).
Debugging Agents
Debugging agentic systems requires fundamentally different approaches from debugging traditional software because agent behavior is non-deterministic, multi-step, and depends on both the model's reasoning and the tool responses it receives. The primary debugging workflow involves inspecting traces (the full sequence of thoughts, actions, and observations), comparing expected versus actual tool call parameters, and identifying where the agent's reasoning diverged from the intended path.
Deterministic vs Probabilistic Evals
Deterministic evaluations use fixed, rule-based criteria with binary pass/fail outcomes (does the output match the expected regex? does the code compile?), while probabilistic evaluations score outputs that have no single correct answer, using LLM judges, similarity metrics, or human ratings to grade qualities like helpfulness, tone, and reasoning quality.
Documentation Agents
Documentation agents generate and maintain docs by analyzing source code, type definitions, and existing documentation patterns across your codebase. They are most effective at producing API references, inline code comments, and README files that stay synchronized with the code, reducing the documentation maintenance burden that causes most project docs to become stale within weeks of creation.
Embedding Models
Embedding models convert text, code, and other content into dense vector representations that capture semantic meaning, enabling similarity-based search and retrieval across agent memory systems. These vectors power RAG pipelines, semantic code search, and long-term memory retrieval by allowing agents to find "conceptually similar" content rather than relying on exact keyword matching.
Ephemeral Execution Environments
Ephemeral execution environments are short-lived, isolated sandboxes that are created fresh for each agent task and destroyed after completion, ensuring that no state, credentials, or side-effects persist between executions. This pattern provides the strongest form of isolation for agentic systems because even if an agent is compromised through prompt injection or makes a destructive mistake, the damage is contained within a disposable environment that gets wiped.
Episodic vs Semantic Memory
Episodic memory stores specific past experiences and interactions (what happened during a particular coding session, how a specific bug was resolved), while semantic memory stores general knowledge and facts (project conventions, API patterns, architectural decisions) — and the distinction determines how agents learn from experience versus encode persistent knowledge. Episodic memories are time-stamped, contextual, and decay naturally — they're most useful for recalling "that one time the agent tried the wrong approach and had to backtrack," enabling the agent to avoid repeating mistakes.
Error Handling for Tools
Tools must return informative, structured error messages that help the agent understand what went wrong and decide what to do next, rather than throwing opaque exceptions that crash the agent loop. Well-designed error handling includes error codes, human-readable descriptions, and suggested retry strategies that the model can reason about — the difference between "Error 500" and "Rate limited: retry after 30 seconds" determines whether an agent recovers gracefully or loops forever.
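A sketch of the contrast, with a simulated tool whose field names (`code`, `message`, `retry_after_s`) are illustrative conventions rather than a standard:

```python
# A tool that returns a structured error the model can reason about,
# instead of raising an opaque exception that crashes the agent loop.
def fetch_url(url: str, rate_limited: bool = False) -> dict:
    if rate_limited:
        return {
            "ok": False,
            "error": {
                "code": "rate_limited",
                "message": "Rate limited: retry after 30 seconds",
                "retry_after_s": 30,        # actionable hint for the agent
            },
        }
    return {"ok": True, "body": f"<contents of {url}>"}

result = fetch_url("https://example.com", rate_limited=True)
if not result["ok"]:
    # The agent (or the host loop) branches on the error code.
    print(result["error"]["code"], result["error"]["retry_after_s"])
```

Because the error text is fed back into the model's context, writing it for a model reader ("retry after 30 seconds") rather than a log parser is what enables recovery.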
Error Recovery
Error recovery patterns determine how agents detect, respond to, and recover from failures during execution — from tool call failures and malformed outputs to reasoning dead-ends and infinite loops. Effective recovery strategies include retry with backoff (for transient failures), fallback models (switching to a different LLM when one fails), context truncation (reducing context size when hitting limits), and graceful degradation (completing partial work rather than failing entirely).
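A sketch of retry with exponential backoff for transient failures; the error type and the tiny delays are illustrative, and production code would also cap total elapsed time and distinguish retryable from fatal errors:

```python
import time

class TransientError(Exception):
    pass

def with_retries(fn, max_attempts: int = 3, base_delay: float = 0.01):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                       # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary outage")
    return "ok"

print(with_retries(flaky))  # succeeds on the third attempt
```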
Eval Frameworks
Evaluation frameworks provide standardized tooling for defining test cases, running them against agent systems, and comparing results across different configurations of prompts, models, and tools. Key frameworks include Promptfoo (open-source CLI tool for comparing prompt variations), Braintrust (end-to-end eval platform with trace analysis), and LangSmith (eval and observability integrated into the LangChain ecosystem).
Eval-Driven Development
Eval-driven development treats evaluations as first-class development artifacts, systematically measuring agent behavior against defined criteria before, during, and after changes — analogous to test-driven development but for non-deterministic AI systems. Instead of manually checking "does this seem right?", you define measurable success criteria up front and run the eval suite on every prompt, model, or tool change.
Few-Shot Examples
Few-shot learning provides the model with examples of desired input-output pairs directly in the prompt, guiding its behavior through demonstration rather than instruction alone. This technique is one of the most reliable ways to improve output quality, especially for formatting, tone, and domain-specific conventions that are difficult to describe in words but obvious when shown.
Function Calling
Function calling is the mechanism through which language models invoke external tools by outputting structured requests that match predefined function signatures. Rather than generating free-text descriptions of what they want to do, models produce structured JSON payloads specifying the function name and arguments — which the host application then routes to the actual implementation.
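A sketch of the host-side dispatch step: the model emits a structured request, and the application routes it to a real implementation. The tool name and exact JSON shape here are illustrative, not any provider's wire format:

```python
import json

def read_file(path: str) -> str:
    return f"(contents of {path})"

TOOLS = {"read_file": read_file}

# What the model emits instead of free text: a name plus typed arguments.
model_output = '{"name": "read_file", "arguments": {"path": "src/main.py"}}'

call = json.loads(model_output)                     # parse the structured request
result = TOOLS[call["name"]](**call["arguments"])   # route to the implementation
print(result)
```

The result is then serialized back into the conversation as a tool message, which is what closes the loop between model reasoning and real side effects.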
Graph RAG vs Vector RAG
Vector RAG retrieves information based on semantic similarity (finding chunks that "sound like" the query), while Graph RAG traverses structured relationships between entities (finding information that is "connected to" the query) — and understanding when each approach excels determines the quality of your agent's knowledge retrieval. Vector RAG works well for open-ended questions where the answer lives in a specific passage ("what does this error mean?"), while Graph RAG excels at multi-hop questions that can only be answered by following relationships across entities.
Human in the Loop
Human-in-the-loop (HITL) patterns insert human checkpoints into agent workflows at critical decision points, requiring explicit approval before the agent takes high-stakes or irreversible actions. This is the primary safety mechanism for production agent systems — even the most capable models make mistakes, and HITL ensures those mistakes are caught before they reach production databases, customer-facing systems, or financial transactions.
IDE Agents
AI-powered coding assistants embedded in editors like VS Code and JetBrains IDEs are the primary way most developers interact with agentic coding today. Tools like GitHub Copilot, Cursor, and Windsurf provide inline completions, chat-driven editing, and multi-file refactoring directly in the editor, understanding your project context through open files, workspace structure, and agent configuration files.
Idempotent Tools
Idempotent tools produce the same result regardless of how many times they are called with the same arguments, making them safe to retry on failure without causing unintended side effects. This property is critical for agent systems because agents frequently retry failed tool calls or re-execute steps when their reasoning loop encounters an error — and without idempotency, a retry could create duplicate records, send duplicate messages, or double-charge a payment.
JSON Schema for Tools
JSON Schema is the formal specification language used to define the structure, types, and constraints of tool parameters that agents pass to functions. It serves as the contract between the model's output and the tool's input — specifying required fields, types (string, number, array), allowed values (enums), and validation rules that the model must satisfy when generating tool call arguments.
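A sketch of an illustrative tool schema plus a deliberately minimal validator covering only required fields, primitive types, and enums; a real system would use a full JSON Schema library:

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "max_lines": {"type": "number"},
        "mode": {"type": "string", "enum": ["full", "summary"]},
    },
    "required": ["path"],
}

TYPES = {"string": str, "number": (int, float)}

def validate(args: dict, schema: dict) -> list[str]:
    errors = [f"missing required field: {f}"
              for f in schema["required"] if f not in args]
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: not in {spec['enum']}")
    return errors

print(validate({"path": "a.py", "mode": "full"}, SCHEMA))   # []
print(validate({"mode": "fast"}, SCHEMA))                   # two errors
```

Validating before execution turns a malformed tool call into a structured error the model can correct, rather than a runtime crash inside the tool.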
Knowledge Graphs
Knowledge graphs represent information as a network of entities and their relationships, providing structured, queryable knowledge that complements the unstructured retrieval of vector-based RAG systems. In agent systems, knowledge graphs enable precise relationship queries ("what tools does this agent depend on?") that similarity search alone cannot answer reliably.
Latency Optimization
Latency optimization reduces the end-to-end time for agent task completion through techniques like streaming responses, parallel tool calls, model selection for speed, prompt compression, and caching strategies. In multi-step agent loops, latency compounds across iterations — a 2-second inference call in a 10-step task means 20 seconds of model time alone — making per-step optimization critical for user-facing applications.
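A sketch of the parallel-tool-calls technique: when calls are independent, fanning them out cuts wall-clock time to roughly the slowest call. The 50 ms "tools" are stand-ins for real I/O-bound calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_tool(name: str) -> str:
    time.sleep(0.05)                        # simulated network latency
    return f"{name}: done"

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    # Three independent calls run concurrently instead of back-to-back.
    results = list(pool.map(slow_tool, ["search", "lint", "tests"]))
elapsed = time.perf_counter() - start

print(results)
print(f"~{elapsed:.2f}s instead of ~0.15s sequential")
```

This only applies when the model emits multiple tool calls in one turn with no data dependency between them; dependent calls must still run in sequence.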
Least Privilege
The principle of least privilege dictates that agents should be granted only the minimum permissions and access needed to complete their assigned task — no more. This is the foundational security principle for agentic systems because agents are inherently unpredictable: even a well-designed agent can be manipulated through prompt injection, make reasoning errors, or encounter unexpected edge cases that lead to unintended actions.
LLM Fundamentals
Large language models generate text through next-token prediction, using transformer architectures with self-attention mechanisms to process and produce sequences. Understanding how scale, training data, and architecture choices affect model capabilities is essential for building effective agents — these fundamentals explain why models can follow instructions, use tools, and reason through complex problems.
Long-Term Memory
Long-term memory enables agents to persist and retrieve information across sessions, maintaining knowledge about user preferences, past interactions, learned facts, and project-specific context that survives beyond a single conversation. Implementation approaches include vector database storage (embedding memories for semantic retrieval), structured databases (storing explicit facts and relationships), and file-based persistence (like CLAUDE.md files the agent rereads at the start of each session).
MCP Client Architecture
MCP client architecture defines how host applications (IDE agents, CLI tools, custom agents) implement the client side of the Model Context Protocol — discovering servers, negotiating capabilities, managing connections, routing tool calls, and handling the lifecycle of MCP sessions. Understanding client architecture is important because the client is the trust boundary: it decides which tool call requests from the model to execute, which servers to connect to, and what permissions to grant — making client design the primary security control point in the MCP ecosystem.
MCP Client Roots
Roots are a mechanism through which MCP clients inform servers about which parts of the filesystem or data they have access to, helping servers scope their operations to relevant directories or resources. When a client connects to an MCP server, it can declare roots — typically filesystem paths like project directories — that tell the server "here's where my code lives; scope your operations to it."
MCP Overview
The Model Context Protocol (MCP) is an open standard that provides a universal interface for connecting AI models to external tools, data sources, and services. Instead of building custom integrations for every tool and model combination, MCP defines a common protocol — analogous to USB-C for hardware — that any client can use to discover and invoke any server's capabilities.
MCP Security
MCP introduces unique security considerations because it creates a standardized channel through which language models can invoke external actions — making it both a powerful integration layer and a potential attack surface. Key security concerns include transport security (ensuring messages between clients and servers are encrypted and authenticated), input validation (preventing prompt injection attacks that could trick the model into invoking dangerous tools), and capability scoping (ensuring servers only expose the minimum capabilities needed for each use case).
MCP Server Primitives: Prompts
Prompts are one of MCP's three server primitives, representing reusable prompt templates that a server can expose for agents and users to invoke by name. Unlike tools (which execute actions) and resources (which expose data), prompts provide pre-built interaction patterns — like "summarize this document" or "review this code" — that encapsulate domain expertise into discoverable, parameterized templates.
MCP Server Primitives: Resources
Resources are one of MCP's three server primitives, representing data that a server exposes for an agent to read — like files, database records, API responses, or live state from connected systems. Unlike tools (which perform actions), resources are passive data sources identified by URIs that the agent or client can retrieve to populate its context.
MCP Server Primitives: Tools
Tools are one of MCP's three server primitives, representing executable actions that an agent can discover and invoke through the protocol. Unlike resources (which expose data for reading) and prompts (which provide templates for interaction), tools are the active capability — they let agents take actions like querying a database, creating a file, sending a message, or triggering a deployment.
MCP Transport
MCP defines two transport mechanisms for communication between clients and servers: stdio (standard input/output) for local processes and SSE (Server-Sent Events) over HTTP for remote servers. The transport layer determines how MCP messages are serialized and delivered, using JSON-RPC 2.0 as the message format in both cases.
Memory Management
Memory management governs when agents store new memories, how they retrieve relevant ones, and when they evict or summarize old information to stay within operational limits. Effective strategies include recency-weighted retrieval (favoring recent information), importance scoring (prioritizing high-signal memories), summarization (compressing old conversations into distilled knowledge), and explicit forgetting policies (removing outdated or contradictory information).
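A sketch of recency-weighted retrieval combined with a relevance score: each memory is ranked by relevance times an exponential decay that halves roughly every `half_life` hours. Both the naive keyword-overlap relevance and the decay constant are illustrative choices:

```python
import math

def score(memory: dict, query: str, now_h: float,
          half_life: float = 24.0) -> float:
    words = set(query.lower().split())
    # Naive relevance: fraction of query words present in the memory text.
    overlap = len(words & set(memory["text"].lower().split())) / max(1, len(words))
    age_h = now_h - memory["t"]
    decay = math.exp(-math.log(2) * age_h / half_life)   # halves per half_life
    return overlap * decay

memories = [
    {"text": "user prefers the pytest test framework", "t": 48.0},
    {"text": "user prefers tabs for indentation", "t": 70.0},
]
query = "which test framework does the user prefer"
best = max(memories, key=lambda m: score(m, query, now_h=72.0))
print(best["text"])                         # relevance outweighs recency here
```

Tuning the half-life is the policy lever: a short half-life behaves like short-term memory, a long one like a persistent knowledge base.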
Memory Types
Agent memory systems are categorized into distinct types that mirror concepts from cognitive science, each solving a different persistence and retrieval challenge. Short-term memory (conversation history within a session) provides immediate context; working memory (scratchpads, state variables) tracks in-progress task state; long-term memory (vector databases, knowledge bases) persists across sessions; and episodic memory (past interaction logs) enables agents to learn from previous experiences.
Micro-Tools vs God-Tools
The micro-tools vs god-tools spectrum defines the fundamental granularity decision in tool design: should you give an agent many small, focused tools (read_file, write_file, search_code) or fewer large, multi-capability tools (manage_codebase with subcommands)? Micro-tools are easier for agents to learn and compose, produce more interpretable traces, and fail in contained ways, but they require more tool calls and more context to describe all available options.
Model Selection
Choosing the right model for your agent involves balancing task complexity, latency requirements, cost constraints, and tool-use capabilities. Frontier models like Claude Sonnet and GPT-4o excel at complex reasoning and multi-step planning, while smaller models like Claude Haiku and GPT-4o-mini are better suited for high-volume, lower-complexity tasks like classification or extraction.
Multi-Agent Architectures
Multi-agent architectures coordinate multiple specialized agents to complete complex tasks that exceed what a single agent can handle, distributing work across agents that each have focused capabilities and smaller, more manageable context windows. Common patterns include supervisor architectures (one agent manages others), swarm patterns (agents dynamically hand off between each other), and parallel execution (multiple agents work simultaneously on subtasks).
Observability Platforms
Observability platforms capture, store, and visualize the full execution telemetry of agent systems — including traces, token usage, latency, cost, tool calls, and reasoning chains — providing the production monitoring infrastructure that makes agents debuggable and improvable at scale. Unlike traditional APM tools designed for deterministic software, LLM observability platforms are purpose-built for non-deterministic systems where the same input can produce different outputs and where "errors" may be subtle reasoning failures rather than exceptions.
Orchestrator Pattern
The orchestrator pattern uses a central "boss" agent that breaks complex tasks into subtasks, delegates them to specialized worker agents, and then synthesizes the results into a final output. This is the most common multi-agent architecture in production because it provides clear control flow — the orchestrator decides what to do, who does it, and when the task is complete — while avoiding the complexity of fully autonomous agent swarms.
Orchestrator-Worker Pattern
The orchestrator-worker pattern is the most common production multi-agent architecture, where a central orchestrator agent manages a pool of specialized worker agents, dynamically assigning subtasks based on their capabilities and the requirements of the current step. Unlike the simpler orchestrator pattern where the boss agent does all the reasoning and merely delegates execution, the orchestrator-worker pattern gives workers genuine autonomy — each worker has its own agent loop, tools, and system prompt, making independent decisions within the scope of its assigned subtask.
OWASP Top 10 for LLMs
The OWASP Top 10 for LLM Applications catalogs the most critical security risks specific to LLM-based systems, providing a standardized framework for identifying and mitigating vulnerabilities in agent systems. The list includes prompt injection (LLM01), insecure output handling (LLM02), training data poisoning (LLM03), denial of service (LLM04), supply chain vulnerabilities (LLM05), sensitive information disclosure (LLM06), insecure plugin/tool design (LLM07), excessive agency (LLM08), overreliance (LLM09), and model theft (LLM10).
Pair Programming With Agents
Working alongside an agent in real-time as a collaborative coding partner is the most common daily workflow for developers using agentic coding tools. Effective pair programming with an AI means staying engaged to provide direction, catch mistakes early, and guide the agent through ambiguous decisions — the developer drives strategy, architecture, and intent while the agent handles implementation velocity and mechanical consistency.
Permission Models
Permission models define the authorization framework that governs what actions an agent can take, which resources it can access, and under what conditions it needs human approval. Common patterns include allow-list models (only explicitly permitted actions are available), deny-list models (everything is allowed except listed actions), tiered permissions (routine actions auto-approved, destructive actions require approval), and capability-based security (agents receive unforgeable tokens granting specific permissions).
Pipeline Pattern
The pipeline pattern chains agent steps in a fixed sequence where each step's output becomes the next step's input, creating a predictable, linear workflow for tasks with clear stages. Unlike the flexible orchestrator pattern where the boss agent decides the order dynamically, pipelines enforce a predetermined sequence — like lint → test → fix → verify — making them more deterministic and easier to monitor in production.
Planning Patterns
Planning patterns address how agents decompose complex goals into sequences of concrete steps before executing them, rather than reacting one step at a time. Approaches range from simple plan-then-execute (generate a full plan upfront, then follow it) to iterative replanning (adjust the plan after each step based on results), to hierarchical planning (decompose into subgoals, then subgoals into tasks).
Prompt Injection
Prompt injection is the primary attack vector against LLM-based systems, where malicious input manipulates the model into ignoring its system instructions and executing unintended actions. Direct prompt injection embeds malicious instructions in user input ("Ignore all previous instructions and..."), while indirect prompt injection hides them in external content the agent processes, such as web pages, documents, or tool outputs.
Prompt Iteration
The process of systematically improving prompts through testing, measurement, and refinement rather than ad-hoc trial and error. Effective prompt iteration treats prompts as code — using version control, evaluation suites, and A/B testing to converge on optimal instructions rather than relying on intuition or anecdotal feedback.
Quality Metrics
Quality metrics define what "good agent behavior" looks like in quantitative terms, providing the measurement foundation for eval-driven development and production monitoring. Key metric categories include task completion rate (did the agent finish the task?), output correctness, efficiency (steps and tokens consumed), and safety (absence of harmful or policy-violating actions).
RAG Patterns
Retrieval-Augmented Generation (RAG) patterns address how agents dynamically retrieve relevant information from external knowledge sources and inject it into the model's context before generating a response. The core pattern involves chunking documents into segments, embedding them into vectors, storing them in a vector database, and at query time, retrieving the most semantically similar chunks to include in the prompt.
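A sketch of the retrieve step using toy bag-of-words "embeddings" and cosine similarity; a real pipeline would use an embedding model and a vector database, but the ranking logic is the same:

```python
import math

def embed(text: str) -> dict[str, int]:
    vec: dict[str, int] = {}
    for word in text.lower().split():       # toy embedding: word counts
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

chunks = [
    "retry with exponential backoff on transient failures",
    "vector databases store embeddings for semantic search",
]
index = [(c, embed(c)) for c in chunks]     # the chunk-and-embed step

query = embed("how do embeddings enable semantic search")
best = max(index, key=lambda item: cosine(query, item[1]))[0]
print(best)                                 # the most similar chunk wins
```

The retrieved chunk is then prepended to the prompt, grounding the model's answer in the stored knowledge rather than its training data alone.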
Rate Limiting
Rate limiting constrains how many actions, API calls, or tokens an agent can consume within a given time period, preventing runaway loops, denial-of-service conditions, and unexpected cost spikes. Without rate limits, a single malfunctioning agent loop — such as one caught in an infinite retry cycle or an overly ambitious planning step — can burn through an entire API budget in minutes.
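A common implementation is a token bucket: each call consumes a token, and tokens refill at a steady rate, so short bursts are allowed but sustained overuse is throttled. A minimal sketch:

```python
# Token-bucket rate limiter sketch: a burst up to `capacity` is allowed,
# after which calls are refused until tokens refill over time.
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(5)]  # a burst of 5 agent calls
```

An agent loop would check `allow()` before each API call and back off (or escalate to a human) when it returns `False`, rather than retrying unboundedly.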
ReAct Pattern
The ReAct (Reasoning + Acting) pattern is the foundational architecture where agents alternate between thinking steps and tool calls in a loop, combining chain-of-thought reasoning with grounded actions. Each cycle produces a thought (the model's reasoning about what to do), an action (a tool call or output), and an observation (the result of the action) that feeds the next iteration.
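One such cycle can be shown concretely. Here the model's output is scripted in advance (the `script` list stands in for real model turns), so the loop structure is visible without an API call:

```python
# One ReAct-style loop with a scripted "model": each turn yields a
# thought, an action (a tool call or a final answer), and an observation
# that would feed the next model turn. The script stands in for real
# model output.
TOOLS = {"add": lambda a, b: a + b}

script = [
    ("I need to compute 2 + 3.", ("add", (2, 3))),
    ("The result is 5, so I can answer.", ("final", "5")),
]

trace = []
for thought, (action, arg) in script:
    if action == "final":
        answer = arg
        trace.append((thought, "final", arg))
        break
    observation = TOOLS[action](*arg)       # act, then observe
    trace.append((thought, action, observation))
```

In a real agent the `trace` entries are appended to the prompt so each new thought is conditioned on all prior observations.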
Reasoning Models
Reasoning models like OpenAI's o1/o3 and Anthropic's Claude with extended thinking represent a paradigm shift where models spend more computational effort at inference time to solve harder problems, trading latency and cost for significantly improved accuracy on complex tasks. Unlike standard LLMs that generate output in a single forward pass, reasoning models use chain-of-thought internally — sometimes generating thousands of "thinking tokens" before producing a response — which makes them dramatically better at multi-step logic, math, and code generation.
Refactoring Agents
Refactoring agents can restructure code across multiple files while preserving behavior, making them valuable for large-scale migrations, API updates, and codebase modernization tasks that would take humans days or weeks. They excel at repetitive structural changes like updating import paths, migrating between frameworks, or applying new patterns consistently across hundreds of files — tasks where mechanical consistency matters more than creative judgment.
Regression Testing
Regression testing for agent systems verifies that changes to prompts, tools, models, or configurations don't break previously working behavior, catching the "fixed one thing, broke three others" pattern that is endemic to non-deterministic systems. Unlike traditional software regression tests with binary pass/fail outcomes, agent regression tests must account for acceptable variation — the output is allowed to differ in wording while still being semantically correct.
Resilient Tool Contracts
Resilient tool contracts define the stable interface between an agent and its tools such that changes to either side don't break the other, providing the same kind of backward compatibility guarantees that API versioning provides for human-facing services. A well-designed tool contract includes clear input schemas with sensible defaults, explicit error types that tell the agent what went wrong and what to try instead, and output formats that remain consistent even when the underlying implementation changes.
Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) is the foundational pattern for giving language models access to external knowledge by retrieving relevant documents and including them in the prompt before the model generates its response. Rather than relying solely on what the model learned during training (which has a knowledge cutoff and can hallucinate), RAG grounds the model's output in specific, verifiable source material — dramatically improving accuracy for domain-specific questions.
Short-Term Memory
Session-scoped memory that persists within a conversation but resets between sessions, implemented through conversation history, scratchpad patterns, and working state that agents maintain during multi-step task execution. Short-term memory is the simplest form of agent memory and is built into every chat-based agent by default — it's literally the messages array passed to each API call.
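Since short-term memory is just that messages array, the pattern is almost trivially small; the role/content shape below follows the convention common to chat-completion APIs:

```python
# Short-term memory as the literal messages list passed on every turn.
# Appending to the list is "remembering"; starting a new session with a
# fresh list is "forgetting".
history = [{"role": "system", "content": "You are a coding assistant."}]

def remember(role: str, content: str) -> None:
    history.append({"role": role, "content": content})

remember("user", "Rename foo to bar.")
remember("assistant", "Done: renamed in 3 files.")
# Each API call would send the full `history`; a new session starts empty.
```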
Single Agent Patterns
The simplest architecture where one agent handles the entire task from start to finish, and the pattern most problems should start with before reaching for multi-agent complexity. A single well-configured agent with the right tools can handle remarkably complex tasks while keeping coordination overhead at zero and making debugging straightforward — you only have one reasoning chain to inspect.
Spec-Driven Development
Writing detailed specifications before letting agents write code transforms the spec into the agent's roadmap for implementation, dramatically improving output quality and reducing wasted iterations. A good spec gives the agent clear success criteria, constraints, edge cases, and architectural context upfront — the difference between "build me a login page" and a structured spec that defines auth flow, error states, styling conventions, and API contracts is the difference between throwaway code and production-ready output.
State Machines vs Pure ReAct
State machine agents enforce explicit transitions between well-defined states (gather_requirements → plan → implement → test → review), while pure ReAct agents let the model freely choose what to do next at each step based on its reasoning — representing a fundamental trade-off between predictability and flexibility. State machines excel when the workflow has a known structure and compliance requirements demand auditability of the execution path, because every transition is explicit and observable.
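The state-machine side of the trade-off can be made concrete: declaring the transitions as data means only the listed paths are possible, and the traversal itself is the audit log.

```python
# Explicit agent state machine: only declared transitions are legal,
# so the execution path is fully auditable. A pure ReAct agent would
# instead let the model pick the next step freely.
TRANSITIONS = {
    "gather_requirements": "plan",
    "plan": "implement",
    "implement": "test",
    "test": "review",
    "review": None,   # terminal state
}

def run(start: str) -> list[str]:
    path, state = [start], start
    while TRANSITIONS[state] is not None:
        state = TRANSITIONS[state]   # the model acts *within* each state;
        path.append(state)           # it cannot invent new transitions
    return path

path = run("gather_requirements")
```

Real systems usually allow branching (e.g. `test` looping back to `implement` on failure), but the principle is the same: the set of legal paths is fixed in data, not decided by the model.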
Structured Output
Structured output constrains model responses to follow specific formats like JSON, XML, or schemas defined by JSON Schema, ensuring that agents produce machine-parseable output rather than free-form text. This is critical for agent systems where tool calls require precisely formatted parameters and downstream code needs to parse model output reliably — a single missing field or wrong type can crash an entire agent pipeline.
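A typical defensive pattern is to validate the model's JSON against the expected shape before any downstream code touches it. Production systems would use JSON Schema; this stdlib-only sketch hand-rolls the check, and the field names are illustrative:

```python
# Validate structured model output before use: parse the JSON, then
# confirm every expected field exists with the right type. The EXPECTED
# shape and field names are invented for illustration.
import json

EXPECTED = {"file": str, "line": int, "fix": str}

def parse_tool_call(raw: str) -> dict:
    data = json.loads(raw)
    for field, typ in EXPECTED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    return data

ok = parse_tool_call('{"file": "app.py", "line": 12, "fix": "add import"}')

try:
    parse_tool_call('{"file": "app.py"}')   # missing fields -> rejected
    rejected = False
except ValueError:
    rejected = True
```

Failing fast here, with a specific error the agent can see, also gives the model a chance to retry with corrected output instead of silently corrupting the pipeline.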
Supervision
Supervision patterns govern how agent behavior is monitored and controlled in production through a combination of human-in-the-loop checkpoints, automated guardrails, escalation policies, and real-time anomaly detection. Supervisors can approve high-risk actions before execution, catch errors before they propagate to downstream systems, and enforce policy constraints on agent behavior — functioning as a safety layer between the agent's intentions and the real world.
System Prompts
The system prompt is the most important single piece of context you control in an agent system. It sets the agent's role, behavior boundaries, output format, and operational constraints before any user interaction begins — effectively serving as the agent's "constitution" that governs all subsequent decisions.
Test Generation
Agents can generate unit tests, integration tests, and test fixtures by analyzing your code's behavior, edge cases, and type signatures, dramatically increasing test coverage especially for legacy codebases that lack tests. However, generated tests require careful review because agents may encode current behavior rather than intended behavior, potentially locking in bugs as passing tests — a subtle but dangerous form of false confidence.
Test-Driven Agentic Development
Test-Driven Agentic Development (TDAD) adapts the traditional TDD cycle (write test → fail → implement → pass → refactor) for agent-assisted coding, where you write tests first and then let the agent implement until all tests pass — effectively using tests as executable specifications that constrain agent behavior. This workflow inverts the typical agent interaction: instead of describing what you want in natural language and hoping the agent interprets correctly, you encode your requirements as tests that provide unambiguous success criteria.
The Agent Loop
The agent loop is the core observe-think-act cycle that drives all agentic behavior. The model receives context (observations), reasons about what to do next (thinking), selects and executes a tool or produces output (acting), then feeds the result back into the next iteration.
The Auto-Fix Loop
The auto-fix loop is the pattern where an agent writes code, runs tests or linters, observes failures, and autonomously iterates until all checks pass — representing one of the most powerful and practical applications of agentic coding. This tight feedback loop (write → run → fail → fix → re-run) is what distinguishes a coding agent from a code generator: the agent doesn't just produce output, it validates and repairs its own work through interaction with the real development environment.
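The loop's skeleton is simple: run the check, hand any failure back to a fix step, and repeat with a hard cap on attempts. In this sketch the checker and fixer are stubs standing in for a real test runner and an agent call:

```python
# Auto-fix loop sketch: write -> run -> fail -> fix -> re-run, bounded
# by max_attempts so a stuck agent cannot loop forever. The check and
# fix functions are toy stand-ins for a test runner and an agent call.
def check(code):
    """Return an error message, or None if the code passes."""
    return None if code.endswith("\n") else "missing trailing newline"

def fix(code, error):
    # A real agent would edit the code based on the error message.
    return code + "\n"

def auto_fix(code: str, max_attempts: int = 3):
    for attempt in range(max_attempts):
        error = check(code)
        if error is None:
            return code, attempt   # attempts used before passing
        code = fix(code, error)
    raise RuntimeError("checks still failing after max attempts")

fixed, attempts = auto_fix("print('hi')")
```

The attempt cap matters: without it, an agent that misdiagnoses the failure will iterate indefinitely, which is exactly the runaway-loop scenario rate limiting guards against.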
The Autonomy Spectrum
Agent autonomy exists on a spectrum from fully human-controlled (copilot mode) to fully autonomous (unattended operation), and understanding where to position your system on this spectrum is one of the most important design decisions in agentic architecture. At the low-autonomy end, agents suggest but don't act — like autocomplete or inline code suggestions — while at the high-autonomy end, agents independently execute multi-step workflows, create branches, merge code, and deploy changes without human intervention.
Token Economics
LLM pricing is based on the number of input and output tokens processed, with output tokens typically costing 3-5x more than input tokens. Understanding tokenization, context window costs, and the price differences between models is essential for building agent systems that are economically viable at scale — an unoptimized agent loop can burn through hundreds of dollars per day in production.
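A back-of-envelope cost model makes the stakes concrete. The per-million-token prices below are illustrative placeholders, not any vendor's real rates:

```python
# Back-of-envelope token cost model. Prices are illustrative
# placeholders (USD per million tokens), not real vendor rates; note
# the typical asymmetry between input and output pricing.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# One agent turn: a large context in, a modest completion out.
cost = call_cost(input_tokens=50_000, output_tokens=2_000)   # $0.18/turn
daily = cost * 1_000                                         # 1,000 turns/day
```

Even at these modest assumed rates, a loop that re-sends a 50k-token context a thousand times a day costs $180/day, which is why context pruning and caching are first-order economic levers.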
Tool Composition
Composable tools allow agents to build sophisticated workflows by combining simple, focused primitives rather than relying on monolithic tools that handle complex operations. The core principle mirrors Unix philosophy: design small tools that do one thing well and let the agent chain them together based on the task at hand.
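The Unix analogy translates directly to code: each tool is a small function over shared state, and the agent composes them per task. The in-memory `repo` dict and tool names below are invented for illustration:

```python
# Composable tool primitives over a toy in-memory "repo": each does one
# thing, and the agent chains them per task instead of calling one
# monolithic "analyze_codebase" tool.
def list_files(files: dict[str, str]) -> list[str]:
    return sorted(files)

def grep(files: dict[str, str], needle: str) -> list[str]:
    return [name for name, text in files.items() if needle in text]

def count_lines(files: dict[str, str], name: str) -> int:
    return files[name].count("\n") + 1

repo = {
    "auth.py": "def login(user):\n    ...",
    "billing.py": "def charge(card):\n    ...",
}

# Composition the agent might choose: find files mentioning "login",
# then measure each hit.
hits = grep(repo, "login")
sizes = {name: count_lines(repo, name) for name in hits}
```

The same three primitives support many other workflows (inventory the repo, size every file, find-then-edit) without any new tool being written, which is the payoff of composition.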
Tool Definition Patterns
Well-designed tool definitions are the interface contract between an agent and the external world, and their quality directly determines how reliably an agent selects and invokes the right tool for a given task. Effective tools have clear, descriptive names that signal purpose, precise descriptions that explain when and how to use them, and well-typed parameters with constraints that prevent misuse.
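In practice such a definition usually takes the JSON-Schema-flavored shape that most tool-calling APIs accept. The tool name, description, and fields below are invented for illustration:

```python
# Sketch of a tool definition in the JSON-Schema style common to
# tool-calling APIs: a purpose-signaling name, a "when to use this"
# description, and typed, constrained parameters. All names invented.
search_tool = {
    "name": "search_codebase",
    "description": (
        "Search the repository for symbols or text. Use this before "
        "editing to find all call sites of a function."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Text or symbol to search for",
            },
            "max_results": {
                "type": "integer",
                "minimum": 1,
                "maximum": 100,
                "default": 20,
            },
        },
        "required": ["query"],
    },
}
```

Note how the description encodes *when* to use the tool, not just what it does, and how the `minimum`/`maximum` constraints prevent a whole class of misuse before the call ever executes.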
Tool Sandboxing
Running agent tool calls in isolated environments limits the blast radius when things go wrong, preventing agents from accidentally modifying production data, executing dangerous commands, or accessing resources outside their intended scope. Common approaches include Docker containers, virtual machines, Firecracker microVMs, restricted file system access, and permission-based execution where certain actions require explicit human approval.
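One cheap layer of this, sketched below under the assumption of a POSIX-like host, is running agent-generated code in a subprocess with a timeout and a stripped environment. This bounds only time and environment leakage; real sandboxes (containers, microVMs) isolate filesystem and network as well:

```python
# Minimal sandboxing layer: execute agent-generated code in a separate
# process with a wall-clock timeout and an empty environment, so the
# child inherits no secrets and cannot run forever. This is one thin
# layer, not a substitute for container/VM isolation.
import subprocess
import sys

def run_sandboxed(code: str, timeout_sec: float = 2.0) -> str:
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_sec,   # raises TimeoutExpired on a runaway loop
        env={},                # no inherited API keys or credentials
    )
    return proc.stdout

out = run_sandboxed("print(2 + 2)")
```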
Tool Synergy and Specialization
Different agentic coding tools excel at different stages of development, and understanding their complementary strengths enables developers to build workflows where tools amplify each other rather than overlap. IDE agents like Cursor excel at interactive, high-context work within a single file or feature; CLI agents like Claude Code shine at multi-file refactoring and codebase-wide tasks; code review agents operate best as async quality gates in CI pipelines.
Trace Analysis
Examining the full sequence of agent decisions, tool calls, inputs, outputs, and intermediate results to understand and debug agent behavior. Traces are the primary debugging tool for agent systems, revealing exactly where reasoning went wrong, why a particular tool was selected, or why a tool call returned unexpected data — without them, agent failures are opaque black boxes.
Vector Databases
Databases optimized for storing and searching over high-dimensional vector embeddings using similarity metrics like cosine distance, forming the backbone of most retrieval-augmented generation (RAG) systems. Vector databases enable semantic search that finds relevant documents based on meaning rather than exact keyword matches — allowing agents to query a codebase with "functions that handle authentication" rather than grep-style exact string searches.
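Stripped to its essentials, a vector store is a map from IDs to vectors plus a similarity-ranked query. The tiny in-memory version below makes the mechanics visible; the document IDs and vectors are invented, and real systems replace the linear scan with approximate-nearest-neighbor indexes such as HNSW:

```python
# Tiny in-memory stand-in for a vector database: store (id, vector)
# pairs and rank by cosine similarity at query time. Real systems use
# ANN indexes (e.g. HNSW) to avoid this linear scan.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

store = {
    "auth_doc": [1.0, 0.0, 0.2],     # vectors would come from an
    "billing_doc": [0.0, 1.0, 0.1],  # embedding model in practice
}

def query(vec: list[float], k: int = 1) -> list[str]:
    return sorted(store, key=lambda i: cosine(vec, store[i]), reverse=True)[:k]

nearest = query([0.9, 0.1, 0.2])
```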
When Not to Use Agents
Not every problem benefits from an agentic approach. Deterministic tasks, simple CRUD operations, and well-defined algorithms are better served by traditional software patterns that are faster, cheaper, and more predictable.