Over the past three months, I ran a series of experiments building AI agent systems for various development and operational tasks. Some worked remarkably well. Others failed in instructive ways. Here are the results, along with the lessons that apply broadly to anyone building with agents in 2026.

Experiment 1: Automated Dependency Updates

Goal: An agent that monitors dependencies across 8 repositories, identifies available updates, evaluates changelogs for breaking changes, creates branches with the updates, runs test suites, and opens PRs with summary notes.

Result: Success. This agent now runs weekly and has handled 47 dependency updates without human intervention. Three times it correctly identified breaking changes and included migration steps in the PR description. Twice it flagged updates it was unsure about and requested human review instead of proceeding.

Why it worked: The task is highly structured. The inputs are well-defined (dependency manifests, changelogs, test results), the tools are reliable (package managers, git, CI), and success is easy to measure (tests pass or they do not). This is the ideal agent use case.
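The decision logic at the heart of this agent fits in a few lines. The sketch below is illustrative, not the actual implementation; the Update fields and the outcome strings are assumptions standing in for the real changelog analysis and PR tooling.

```python
# Sketch of the weekly dependency-update decision loop described above.
# Field names and outcome strings are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Update:
    package: str
    old_version: str
    new_version: str
    breaking: bool    # inferred from the changelog
    confident: bool   # did the changelog analysis reach a clear verdict?

def handle(update: Update, tests_pass: bool) -> str:
    """Decide what to do with one candidate update."""
    if not update.confident:
        return "request-human-review"  # unsure -> escalate, do not guess
    if not tests_pass:
        return "request-human-review"  # failing tests are a hard stop
    if update.breaking:
        return "open-pr-with-migration-notes"
    return "open-pr"
```

Note the ordering: uncertainty and test failures are checked before anything is opened automatically, which is what produced the two human-review escalations mentioned above.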

Experiment 2: Bug Triage Agent

Goal: An agent that reads incoming bug reports, reproduces the issue in a staging environment, identifies the likely root cause by analyzing the codebase, and assigns a severity level.

Result: Partial success. The agent correctly triaged about 65% of bugs. It was good at identifying well-known error patterns and tracing them to the responsible code. It struggled with bugs that required understanding user intent (the user says "it is broken" without specifying what they expected) and with bugs that only manifest under specific timing or data conditions.

Lesson: Agents work best when the problem space is constrained. Bug triage involves too much ambiguity - vague reports, environment-specific issues, and problems that require running the application as a real user would. I scaled this back to a "first-pass" triage that categorizes bugs and enriches reports with relevant code context, then hands off to a human for final assessment.
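The scaled-back version can be sketched as a function that classifies and always hands off. The category keywords are illustrative assumptions; the real agent also attaches code context, which is elided here.

```python
# Minimal sketch of the "first-pass" triage: categorize the report,
# then always hand off to a human for the final severity call.
# The keyword lists are illustrative, not the real classifier.
def first_pass_triage(report: str) -> dict:
    text = report.lower()
    if "crash" in text or "exception" in text:
        category = "crash"
    elif "slow" in text or "timeout" in text:
        category = "performance"
    else:
        category = "unclassified"  # ambiguous reports stay unclassified
    return {
        "category": category,
        "needs_human": True,  # final assessment is always a human decision
    }
```

The key design point is the unconditional needs_human flag: the agent's job is to enrich, not to decide.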

Experiment 3: Code Migration Agent

Goal: Migrate a medium-sized Express.js application from JavaScript to TypeScript, file by file, while maintaining a working application throughout the process.

Result: Success with caveats. The agent migrated 43 files over two days. The resulting TypeScript code was correct and the application worked. However, the type definitions it generated were overly permissive in about 30% of cases - using any where a proper type should have been defined, or using union types that were technically correct but too broad to be useful.

Lesson: AI agents optimize for "correct and passing" rather than "idiomatic and optimal." For migrations, this means you get working code that may need a second pass for quality. I now run migrations in two phases: the agent handles the mechanical conversion, and then I (or a second AI pass with stricter instructions) tighten up the types and patterns.
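The two-phase flow is simple to orchestrate. In this sketch, run_agent is a hypothetical stand-in for whatever model call you use, and the prompt wording is illustrative.

```python
# Sketch of the two-phase migration: a mechanical first pass, then a
# stricter second pass that tightens types. `run_agent(prompt, text)`
# is a hypothetical stand-in for the actual model call.
def migrate_file(source: str, run_agent) -> str:
    draft = run_agent(
        "Convert this JavaScript file to TypeScript. "
        "Preserve behavior exactly; the test suite must still pass.",
        source,
    )
    return run_agent(
        "Tighten the types in this TypeScript file. "
        "Replace any and overly broad unions with precise types.",
        draft,
    )
```

Splitting the instructions across two calls matters: asking for conversion and strict typing in one prompt tends to get you the permissive types described above.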

Experiment 4: Multi-Agent Development Pipeline

Goal: A pipeline of specialized agents: one that takes a feature specification and produces a design document, one that generates code from the design, one that writes tests, and one that reviews the output of the others.

Result: The most instructive failure. Each individual agent worked reasonably well in isolation. But the pipeline as a whole produced worse results than a single agent handling the entire task. The problem was context loss at each handoff. The design agent made assumptions the coding agent did not understand. The coding agent made implementation choices the testing agent did not account for. The review agent flagged issues that were intentional design decisions.

Lesson: Multi-agent systems sound elegant in theory but introduce coordination overhead that often exceeds the benefits. Unless you have a very clear interface between agents and each agent's task is truly independent, a single agent with the full context produces better results. This matches what I have seen in human teams too - a small team with shared context outperforms a large team with handoffs.

Experiment 5: Self-Healing Monitoring

Goal: An agent that monitors application logs and metrics, detects anomalies, diagnoses the cause, and applies known fixes automatically (restart a service, clear a cache, scale up resources).

Result: Success for known issues, dangerous for unknown ones. For the 12 failure modes I trained it on, the agent responded correctly every time - faster than I could have manually. But when it encountered an issue outside its training set, it applied the closest matching fix, which twice made things worse. A memory leak was "fixed" by restarting the service, which just delayed the crash and lost diagnostic data.

Lesson: Autonomous agents in operations need very strict boundaries. The agent should have a whitelist of actions it can take and default to alerting a human for anything outside that list. "When in doubt, do nothing and tell someone" is the right default for operational agents.
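That policy reduces to a small dispatch table. The diagnosis names and fix names below are illustrative assumptions; the point is the shape, not the contents.

```python
# Sketch of the whitelist policy: only pre-approved actions run
# automatically; any unrecognized diagnosis becomes a human alert.
# Diagnosis keys and fix names are illustrative.
KNOWN_FIXES = {
    "service-hung": "restart_service",
    "cache-stale": "clear_cache",
    "load-spike": "scale_up",
}

def respond(diagnosis: str) -> tuple[str, str]:
    fix = KNOWN_FIXES.get(diagnosis)
    if fix is None:
        # "When in doubt, do nothing and tell someone."
        return ("alert-human", "no-action")
    return ("auto-remediate", fix)
```

The failure mode described above - restarting a service to "fix" a memory leak - is exactly what this structure prevents: an unknown diagnosis never falls through to the closest-matching fix.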

Meta-Lessons

Across all five experiments, several patterns emerged:

Structured tasks with clear success criteria are where agents shine. The more ambiguity in the task, the less reliable the agent. If you cannot write an automated check for "did the agent succeed," the agent will struggle too.
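A useful litmus test is whether the success check fits in a few lines of code. This generic harness is a sketch of that idea; the eval cases you feed it would be task-specific.

```python
# If you can write a check like this, the task is a good agent fit.
# Exact-match scoring over a small eval set; cases are illustrative.
def evaluate(agent, cases: list[tuple[str, str]]) -> float:
    """Fraction of eval cases where the agent's output passes the check."""
    passed = sum(1 for task, expected in cases if agent(task) == expected)
    return passed / len(cases)
```

For the dependency agent, the check was "tests pass"; for bug triage, no such check existed, which predicted the 65% result.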

Single-agent beats multi-agent for most tasks. The coordination cost of multiple agents is higher than the benefit of specialization, unless the tasks are genuinely independent with well-defined interfaces.

Error handling determines production-readiness. An agent that works 95% of the time and fails gracefully the other 5% is production-ready. An agent that works 99% of the time and fails catastrophically the other 1% is not.
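What "fails gracefully" means operationally: the failure path is an explicit, safe outcome rather than an exception that leaves partial state behind. This wrapper is a generic illustration, not code from any of the experiments.

```python
# Generic illustration of graceful failure: every agent step runs
# inside a wrapper whose failure path is an explicit, safe fallback.
def run_safely(agent_step, task, on_failure):
    try:
        return agent_step(task)
    except Exception as exc:
        # Surface the failure instead of pressing on with bad state.
        return on_failure(task, exc)
```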

Evaluation before deployment is non-negotiable. Every experiment that succeeded in production was one where I invested heavily in evaluation before deployment. Every one that surprised me was one where I skipped proper evaluation because "it worked in testing."

We are still in the early days of building with agents. The tools and models will improve. But the engineering principles - clear interfaces, robust error handling, thorough evaluation, and graceful degradation - are the same principles that make any software reliable. The AI part is new. The engineering part is not.