Something went wrong. A user reports unexpected behavior. An automated monitor fires an alert. A customer complains. You need to figure out what happened, why, and how to prevent it from happening again. Agent debugging differs from traditional software debugging because agent behavior emerges from reasoning rather than deterministic code paths. The techniques that work for traditional debugging need adaptation for the probabilistic, context-dependent nature of agent systems.
The Debugging Mindset for Agents
Traditional debugging assumes reproducibility. Given the same inputs, the program produces the same outputs. You can set breakpoints, step through code, and inspect state at any point. The bug is a defect in the code that executes the same way every time.
Agent debugging requires a different mindset. The same inputs might produce different outputs depending on model sampling, conversation history, and subtle context differences. You cannot step through code because the behavior emerges from model responses, not predetermined logic. The "bug" might be a prompt issue, a tool problem, a context misunderstanding, or model limitations.
This does not mean agents are impossible to debug. It means you need different techniques - examining traces rather than code, understanding reasoning rather than execution paths, and looking for patterns rather than single points of failure.
Starting with the Symptom
Every debugging session starts with understanding what went wrong from the user's perspective.
Collect specific details. "The agent sent the wrong email" is a start. Better: "The agent sent an email to [email protected] instead of [email protected] when I asked it to email Jane about the project update." The specific details guide where to look.
Identify the conversation. You need to find the exact agent run where the problem occurred. Conversation IDs, timestamps, and user identifiers help locate it. Good systems make this lookup easy.
Understand what should have happened. Before diagnosing why something went wrong, be clear about what should have happened. Sometimes what the user expected was not what the agent was designed to do.
Reading the Timeline
With the problematic conversation identified, examine the complete timeline of what happened. A good observability system shows every step: user messages, agent reasoning, tool calls and results, and state changes.
Read through the timeline from the beginning, not starting at the error. Context builds as the conversation progresses. Understanding what the agent knew at each point explains why it made particular decisions.
Identify the divergence point. Find where the agent's behavior diverged from what should have happened. This is usually not at the end - the bad output is the result of an earlier decision or misunderstanding.
Examine the context at that point. What information did the agent have when it made the problematic decision? What messages had been exchanged? What tool results had it received?
Look at the reasoning. If the system captures agent reasoning, see what the agent was thinking. Did it misunderstand the request? Did it have incorrect information? Did it make a logical error?
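If the timeline is available programmatically, a minimal sketch like the one below helps with reading a run top to bottom and marking where behavior diverged. It assumes each event is stored as a dict with `type`, `timestamp`, and `content` fields; those names are illustrative, not any particular platform's schema.

```python
from typing import Iterable

def print_timeline(events: Iterable[dict]) -> None:
    """Print an agent run in chronological order.

    Each event is assumed to look like:
    {"type": "user_message" | "reasoning" | "tool_call" | "tool_result",
     "timestamp": "2024-05-01T12:00:00Z", "content": "..."}
    Field names are illustrative, not a specific vendor's schema.
    """
    for i, event in enumerate(events):
        # Reading from the start preserves the context the agent had at each
        # step, which is what explains the decision it made there.
        print(f"[{i:03d}] {event['timestamp']} {event['type']:>12}  {event['content']}")
```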
Common Issue Categories
Agent problems cluster into recognizable categories. Identifying which category applies guides your investigation.
Misunderstood Requests
The agent interpreted the user's request differently than intended. Signs include tool calls that seem unrelated to what was asked, or responses that answer a different question.
To investigate: Compare the user's message with the agent's apparent interpretation. Look at any reasoning that shows how the agent understood the request. Check if the request was ambiguous or if the agent missed key details.
To fix: Improve the system prompt to handle this type of request better. Add clarification prompts when requests are ambiguous. Update tool descriptions to guide better tool selection.
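As an illustration only (the exact wording should match your domain), the ambiguity guidance can be a short block appended to the system prompt:

```python
# Appended to the existing system prompt; the wording is illustrative.
CLARIFICATION_GUIDANCE = """\
If a request could refer to more than one person, file, or account,
do not guess. Ask one short clarifying question first, for example:
"You have two contacts named Jane. Which one should receive the
project update?"
"""
```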
Wrong Tool Selection
The agent called a tool that was not appropriate for the task, or failed to call a tool that would have helped.
To investigate: Look at what tools were available and their descriptions. See if the chosen tool's description made it seem appropriate. Check if a better tool existed but was not chosen.
To fix: Improve tool descriptions to be more specific about when each tool should be used. Add negative guidance about when not to use certain tools. Consider whether the tool set is appropriate for the tasks users expect.
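A sketch of a more specific description with explicit negative guidance, written as a generic JSON-style tool definition rather than any particular framework's schema:

```python
# Generic tool definition; the schema shape is illustrative.
SEARCH_CONTACTS = {
    "name": "search_contacts",
    "description": (
        "Look up a person's email address by name in the user's address book. "
        "Use this before sending an email when the recipient is given by name. "
        "Do NOT use this to search documents or past messages; "
        "use search_documents for that."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Full or partial contact name"}
        },
        "required": ["name"],
    },
}
```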
Tool Returned Bad Data
The agent called the correct tool with reasonable parameters, but the tool returned incorrect or unhelpful results.
To investigate: Examine the exact tool input and output. Verify whether the tool behaved correctly given its input. Check if the tool's external dependencies (APIs, databases) were functioning properly.
To fix: This is often not an agent problem but a tool problem. Fix the tool, add input validation, or add error handling. If the tool is unreliable, consider fallbacks.
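A sketch of wrapping an unreliable tool with input validation, error handling, and a fallback; `primary` and `fallback` are stand-ins for a live API call and a cached or secondary source:

```python
from typing import Callable

def safe_lookup(query: str,
                primary: Callable[[str], dict],
                fallback: Callable[[str], dict]) -> dict:
    """Call a flaky tool defensively so the agent gets usable results."""
    # Input validation: reject inputs the downstream API is known to mishandle.
    if not query.strip() or len(query) > 500:
        return {"error": "query must be 1-500 non-empty characters", "results": []}

    try:
        return primary(query)
    except (TimeoutError, ConnectionError):
        # Error handling plus fallback: degrade to a cached or secondary source
        # instead of handing the agent an empty or cryptic failure.
        result = fallback(query)
        result["note"] = "served from fallback; live lookup failed"
        return result
```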
Agent Looping
The agent got stuck repeating similar actions without making progress. Signs include repeated tool calls with the same or similar parameters, or circular reasoning patterns.
To investigate: Look for repeated patterns in the timeline. Identify why the agent kept trying the same approach - did it fail to recognize that the approach was not working?
To fix: Add loop detection in the system prompt. Provide better guidance for what to do when an approach is not working. Set hard limits on repeated tool calls.
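A sketch of a hard limit on repeated tool calls, assuming tool calls are recorded as name-plus-arguments dicts (the shape and threshold are illustrative):

```python
import json
from collections import Counter

MAX_IDENTICAL_CALLS = 3  # illustrative threshold

def is_looping(tool_calls: list[dict]) -> bool:
    """Return True if the same tool has been called with the same arguments
    too many times in one run. The call shape ({"name": ..., "arguments": {...}})
    is illustrative."""
    counts = Counter(
        (call["name"], json.dumps(call["arguments"], sort_keys=True))
        for call in tool_calls
    )
    return any(n >= MAX_IDENTICAL_CALLS for n in counts.values())

# In the agent loop: when is_looping(...) fires, stop retrying and either ask
# the user for guidance or return a partial result.
```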
Context Overflow
Long conversations can exceed context limits, causing the agent to lose important information from earlier in the conversation.
To investigate: Check the conversation length when the problem occurred. Look for whether key information from early messages was lost. See if the agent made decisions without information it should have had.
To fix: Implement summarization strategies for long conversations. Use memory to store important facts that should persist regardless of context limits. Break long interactions into separate conversations where appropriate.
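A sketch of one summarization strategy: once the history grows past a threshold, fold the older messages into a summary produced by a cheap model call. Here `summarize` is a hypothetical callable, and real code would count tokens rather than messages:

```python
from typing import Callable

def compact_history(messages: list[dict],
                    summarize: Callable[[list[dict]], str],
                    max_messages: int = 40,
                    keep_recent: int = 20) -> list[dict]:
    """Keep recent turns verbatim and fold older ones into a summary message."""
    if len(messages) <= max_messages:
        return messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "system",
        "content": "Summary of the earlier conversation: " + summarize(older),
    }
    return [summary] + recent
```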
Investigation Patterns
Several patterns help investigate agent issues effectively.
Trace Back from Output
Start with the problematic output and work backward. What immediate input produced this output? What earlier steps provided that input? Keep tracing back until you find the root cause.
Compare Good and Bad Cases
Find a similar conversation where the agent behaved correctly. Compare the timelines side by side. What was different? Sometimes the difference is subtle - a word in the user's request, a slightly different tool result, a different conversation history.
Isolate Components
If you suspect a specific tool or sub-agent, test it in isolation. Send the same input it received in the problematic case and see if the output is correct. This separates tool issues from agent issues.
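A sketch of replaying a recorded tool call in isolation; `tool` is the callable under suspicion and `recorded_call` is the input/output pair captured in the trace (field names are illustrative):

```python
def replay_tool_call(tool, recorded_call: dict) -> None:
    """Re-run a tool with the exact input captured in a problematic trace."""
    fresh = tool(**recorded_call["arguments"])
    print("recorded output:", recorded_call["output"])
    print("fresh output:   ", fresh)
    if fresh == recorded_call["output"]:
        print("Same result: if it is wrong, the issue is in the tool or its data.")
    else:
        print("Different result: a dependency changed or the failure was transient.")
```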
Reproduce with Variations
Try to reproduce the problem with slight variations. Does it happen every time with this input? Does it happen with similar inputs? Understanding the conditions under which the problem occurs helps identify the cause.
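A sketch of probing for reproducibility; `run_agent` is a hypothetical entry point that takes a user message and returns the agent's final answer:

```python
from typing import Callable

def probe(run_agent: Callable[[str], str],
          request: str,
          variations: list[str],
          repeats: int = 5) -> dict[str, list[str]]:
    """Re-run the original request several times, then each paraphrase.

    A failure on every repeat points at something systematic (prompt, tool,
    data); an occasional failure points at sampling variability or
    context-dependent behavior.
    """
    return {
        prompt: [run_agent(prompt) for _ in range(repeats)]
        for prompt in [request, *variations]
    }
```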
Prevention and Monitoring
Debugging is reactive. Prevention and monitoring catch issues before they reach users.
Review samples regularly. Periodically examine random conversation samples, even when nothing is reported wrong. Problems often exist before users report them.
Set up alerts. Monitor for unusual patterns: high error rates, unexpected tool failures, unusually long conversations, excessive tool calls.
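A sketch of simple threshold rules over aggregated run metrics; the field names and thresholds are illustrative and should be tuned to your own traffic:

```python
# Illustrative thresholds; tune them to your traffic and error budget.
ALERT_RULES = {
    "error_rate": lambda m: m["errors"] / max(m["runs"], 1) > 0.05,
    "tool_failure_rate": lambda m: m["tool_failures"] / max(m["tool_calls"], 1) > 0.10,
    "long_conversations": lambda m: m["p95_turns"] > 40,
    "excessive_tool_calls": lambda m: m["p95_tool_calls_per_run"] > 25,
}

def check_alerts(metrics: dict) -> list[str]:
    """Return the names of alert rules that fire for one aggregation window."""
    return [name for name, rule in ALERT_RULES.items() if rule(metrics)]
```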
Test edge cases. Before deploying changes, test scenarios that have caused problems in the past. Build a test suite of tricky cases that verify important behaviors.
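A sketch of such a suite in pytest style; `run_agent` and the attributes on its result are a hypothetical test harness, and each case is drawn from a past incident:

```python
import pytest

# Each case pairs a request that once went wrong with a simple check the
# fixed behavior must satisfy. Cases and checks are illustrative.
REGRESSION_CASES = [
    ("Email Jane about the project update",
     lambda run: "jane" in run.recipient.lower()),
    ("Summarize the Q3 report",
     lambda run: run.tool_call_count("search_documents") <= 3),
]

@pytest.mark.parametrize("request_text, check", REGRESSION_CASES)
def test_tricky_cases(request_text, check):
    run = run_agent(request_text)  # hypothetical harness entry point
    assert check(run), f"regression on: {request_text}"
```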
Monitor cost and latency. Anomalies in cost or latency often indicate behavioral issues. A sudden increase in tokens used might mean agents are looping or producing overly verbose responses.
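A sketch of flagging token-usage anomalies against a recent baseline; the history window and z-score threshold are illustrative:

```python
from statistics import mean, stdev

def tokens_anomalous(daily_totals: list[int], today: int,
                     z_threshold: float = 3.0) -> bool:
    """Flag today's token usage if it sits far above the recent baseline.

    daily_totals is a recent history of per-day token counts; a sudden spike
    often means looping agents or overly verbose responses.
    """
    if len(daily_totals) < 7:
        return False  # not enough history for a meaningful baseline
    mu, sigma = mean(daily_totals), stdev(daily_totals)
    return sigma > 0 and (today - mu) / sigma > z_threshold
```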
Documentation and Knowledge
Document issues you investigate, even ones you do not fully resolve. Build knowledge of common problems and their solutions. When similar issues arise, the documentation accelerates investigation.
Maintain a collection of problematic conversation examples. These serve as regression tests and training examples for improving prompts and tools.
Share debugging knowledge across the team. Patterns one person discovers help everyone debug faster.
For teams building and operating agents, inference.sh provides built-in observability that captures the complete timeline of every agent run. When issues arise, the data you need for debugging is already there - conversation history, reasoning traces, tool calls, and state changes in one view.
Debugging agents requires different techniques than debugging traditional software, but it is not mysterious. With good observability, systematic investigation, and accumulated knowledge, agent issues become understandable and fixable.
FAQ
How do I debug issues that cannot be reproduced?
Non-reproducible issues are common with agents due to model sampling variability. Focus on the specific instance where the issue occurred rather than trying to reproduce it. Examine the complete timeline from that instance - what reasoning led to the problematic behavior, what information the agent had, what decisions it made. Even without reproduction, understanding the single instance often reveals the cause. If patterns emerge across multiple non-reproducible instances, look for common factors like conversation length, request type, or time of day. Some issues are genuinely random - model sampling occasionally produces poor results even from good prompts.
Should I log everything or sample agent interactions?
Log everything if you can afford the storage and query performance at full volume. Sampling creates the risk that the specific problematic interaction was not captured. When debugging user-reported issues, you need the exact conversation, not a statistical sample. If volume makes complete logging impractical, use stratified sampling that captures all errors and unusual patterns while sampling normal interactions. Never sample away errors or edge cases. The interactions you most need to debug are often the unusual ones.
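A sketch of that stratified decision, which always keeps errors and unusual runs and samples the rest (field names and thresholds are illustrative):

```python
import random

def keep_full_trace(run: dict, normal_rate: float = 0.10) -> bool:
    """Decide whether to store the complete trace for a run."""
    if run["error"] or run["tool_failures"] > 0:
        return True                       # never sample away failures
    if run["turns"] > 30 or run["tool_calls"] > 20:
        return True                       # keep unusually long or busy runs
    return random.random() < normal_rate  # sample ordinary interactions
```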
How do I debug issues in multi-agent systems?
Multi-agent debugging requires tracing across agent boundaries. Start by identifying which agent produced the problematic output - the final response might come from an orchestrator, but the underlying issue might be in a specialist. Examine the delegation: what task did the orchestrator assign? What did the specialist return? Was the specialist's response appropriate for the assignment? Multi-agent issues often involve either poor delegation (unclear or inappropriate assignments) or poor integration (good specialist results poorly combined by the orchestrator). The same techniques apply at each level - trace back from output, examine reasoning, compare with successful cases.