Everyone is talking about AI agents, but most of the content out there is either theoretical ("what agents could do") or demo-level ("look at this toy example"). I have spent the last few months building agents that run in production, handling real tasks with real consequences. Here is what I have learned.
What an Agent Actually Is
Strip away the hype and an agent is a loop: observe the current state, decide what to do, take an action, observe the result, repeat. I explored the conceptual foundation in my post on agentic coding explained. The LLM is the decision-maker in the middle. It receives observations as context and outputs actions in a structured format that your code can execute.
```python
while not done:
    observation = get_current_state()
    action = llm.decide(observation, available_tools)
    result = execute(action)
    done = is_task_complete(result)
```
The simplicity of this loop is deceptive. The difficulty is in every detail: what observations to include, how to describe the tools, how to handle errors, when to stop, and how to prevent the agent from going off the rails.
Start With Tools, Not the Agent
The most common mistake I see is building the agent loop first and then figuring out the tools. Do it the other way around. Define the tools your agent needs, make them robust and well-documented, and test them thoroughly in isolation. The agent is only as capable as its tools.
Each tool should have:
- A clear, specific name (not "do_stuff" but "query_database" or "send_email")
- A precise description of what it does, when to use it, and what it returns
- Strong input validation that returns helpful error messages
- Predictable output format
- Graceful failure handling - an agent that crashes on a tool error is useless
I spend more time writing tool descriptions than I spend on the system prompt. The tool descriptions are the agent's instruction manual. If they are vague, the agent will use tools incorrectly.
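As a minimal sketch of what such a tool might look like, here is a hypothetical `query_database` tool: the name, description, schema, and the `run_query` helper are all illustrative, not from any particular framework.

```python
# A hypothetical tool definition: precise name, documented behavior,
# input validation, and a predictable return shape.
QUERY_DATABASE_TOOL = {
    "name": "query_database",
    "description": (
        "Run a read-only SQL query against the orders database. "
        "Use this when the user asks about order history or totals. "
        "Returns at most 50 rows as a list of JSON objects."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "A single SELECT statement."}
        },
        "required": ["sql"],
    },
}

def query_database(sql: str):
    """Validate input before touching the database, and fail with a
    message the agent can act on rather than a raw stack trace."""
    if not sql.strip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed.")
    # run_query is a stand-in for your actual database client.
    return run_query(sql, limit=50)
```

Note how the description says when to use the tool and what comes back, not just what it does.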
Error Handling Is Everything
In a traditional program, an unhandled error crashes the process. In an agent, an unhandled error can cause the agent to spin in a loop, hallucinate recovery steps, or silently produce wrong results. Every tool call needs to be wrapped in error handling that feeds useful information back to the agent.
```python
def execute_tool(tool_name, args):
    try:
        result = tools[tool_name](**args)
        return {"status": "success", "data": result}
    except ValidationError as e:
        return {"status": "error", "message": f"Invalid input: {e}"}
    except TimeoutError:
        return {"status": "error", "message": "Tool timed out. Try a simpler query."}
    except Exception as e:
        return {"status": "error", "message": f"Unexpected error: {e}"}
```
When an agent receives a clear error message, it can often self-correct - retry with different parameters, try an alternative approach, or ask for clarification. When it receives a stack trace or nothing at all, it flails.
Limit the Loop
Always set a maximum number of iterations. An agent that can loop infinitely will loop infinitely, running up your API bill while accomplishing nothing. I typically set a limit of 10-15 steps for most tasks. If the agent cannot complete the task in that many steps, something is wrong - either the task is too complex and needs to be broken down, or the tools are insufficient.
I also implement cost tracking per agent run. If a single task exceeds a cost threshold, the agent stops and escalates to a human. This has saved me from several runaway loops during development.
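Both guardrails can live in the loop itself. The sketch below is illustrative: `decide`, `execute`, and `is_done` are injected stand-ins for your LLM call, tool executor, and completion check, and the limits are the ones mentioned above, not magic numbers.

```python
def run_agent(decide, execute, is_done, max_steps=15, cost_limit_usd=0.50):
    """Agent loop with a step cap and a cost circuit breaker.

    decide() returns (action, cost_of_this_step); when either limit is
    hit, the run stops and is handed to a human instead of spinning.
    """
    cost = 0.0
    for _ in range(max_steps):
        action, step_cost = decide()
        cost += step_cost
        if cost > cost_limit_usd:
            return {"status": "escalated", "reason": "cost limit exceeded"}
        result = execute(action)
        if is_done(result):
            return {"status": "done", "result": result}
    return {"status": "escalated", "reason": f"no result after {max_steps} steps"}
```

The key design choice is that hitting a limit is a normal, handled outcome with a reason attached, not an exception.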
Evaluation Is the Hard Part
Building an agent that works on your three test cases is easy. Building one that works reliably across hundreds of real-world inputs is hard. You need evaluation.
My evaluation approach:
- Collect real examples of the task the agent needs to perform.
- Define what "correct" looks like for each example - ideally an automated check, but human review works for small sets.
- Run the agent against the full set and measure success rate.
- Analyze failures to identify patterns - is it a tool issue, a prompt issue, or a model capability issue?
- Fix the root cause and re-run.
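The harness for this can be very small. A minimal sketch, assuming each example pairs an input with a callable that checks correctness:

```python
def evaluate(agent, examples):
    """Run the agent over labeled examples; return the success rate and
    the failing cases for pattern analysis.

    Each example is (input, check), where check(output) returns True
    when the output counts as correct.
    """
    failures = []
    for input_, check in examples:
        output = agent(input_)
        if not check(output):
            failures.append((input_, output))
    success_rate = 1 - len(failures) / len(examples)
    return success_rate, failures
```

Keeping the failures, not just the rate, is what makes the failure-pattern analysis in the steps above possible.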
I target a 90%+ success rate before putting an agent in production with human oversight, and 98%+ before running it fully autonomously. Getting from 80% to 95% takes as much effort as getting from 0% to 80%.
Keep the System Prompt Focused
Long, rambling system prompts produce worse results than short, focused ones. The system prompt should tell the agent: who it is, what task it is performing, what its constraints are, and how to handle ambiguity. That is it.
Do not put instructions about individual tools in the system prompt - put them in the tool descriptions. Do not put examples of every possible scenario - the model already knows how to reason. Focus the system prompt on the judgment calls specific to your use case: when to stop, when to ask for help, what to prioritize.
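To make this concrete, here is an illustrative system prompt for a made-up refund-processing agent. The task, thresholds, and policies are invented for the example; the point is the shape: identity, task, constraints, and ambiguity handling, with no tool documentation.

```python
# Illustrative only: a short, focused system prompt. Tool usage details
# belong in the tool descriptions, not here.
SYSTEM_PROMPT = """\
You are a support agent that processes refund requests.

Constraints:
- Only approve refunds under $100; escalate anything larger to a human.
- If the order cannot be found, ask the customer for their order number.

When you are unsure whether a request qualifies, escalate to a human
rather than guessing.
"""
```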
Observability
Log everything. Every prompt sent, every response received, every tool call made, every tool result returned. When an agent produces a wrong result in production, you need to trace through its reasoning to understand what went wrong. Without logs, you are debugging blind.
I use structured logging with a trace ID per agent run, so I can pull the complete chain of reasoning for any given task. This has been invaluable for debugging and for improving the agent over time.
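A minimal sketch of that pattern, using only the standard library (the event names and the in-memory event list are illustrative; in production you would ship the JSON lines to your log pipeline):

```python
import json
import time
import uuid

def make_logger(trace_id=None):
    """Create a structured logger that stamps every event with a
    per-run trace ID, so a complete agent run can be reconstructed
    by filtering the log stream on that ID."""
    trace_id = trace_id or str(uuid.uuid4())
    events = []

    def log(event_type, **payload):
        record = {"trace_id": trace_id, "ts": time.time(),
                  "type": event_type, **payload}
        events.append(record)
        print(json.dumps(record))  # replace with your log shipper
        return record

    return log, events

# One logger per agent run: every prompt, response, and tool call
# logged through it shares the run's trace ID.
log, events = make_logger()
log("tool_call", tool="query_database", args={"sql": "SELECT 1"})
log("tool_result", status="success")
```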
The Honest Truth
Agents are powerful but fragile. They work well for structured tasks with clear success criteria and reliable tools. They struggle with ambiguous tasks, multi-step reasoning chains longer than 10-15 steps, and situations that require common sense the model lacks. I collected more hard-won lessons in my post on AI agents in production. The Anthropic documentation is also worth reading for best practices on tool use and agent design. Build for the strengths, design guardrails for the weaknesses, and always have a human fallback for edge cases.