Testing is the area where AI has surprised me the most. Not because the generated tests are perfect, but because AI is genuinely better than most developers at thinking about edge cases. Here's my workflow for getting AI to write tests that actually catch bugs, not just inflate coverage numbers.
Why AI is good at this
Testing requires a specific mindset: think about how things can break, not how they should work. Most developers write tests from the perspective of "verify it does what I built it to do." Good tests come from the perspective of "try to break it." AI models, having seen millions of bugs in their training data, are naturally inclined toward the second mindset. They've seen every category of failure and will suggest test cases for scenarios you didn't consider.
The wrong way to use AI for testing
The temptation is to paste a function and say "write tests for this." You'll get tests. They'll even pass. But they're usually shallow. The happy path with two or three obvious inputs. Maybe one null check. This approach gives you coverage metrics without actual confidence in your code.
I've seen teams use AI to go from 40% to 90% test coverage in a day. Their code wasn't more reliable. They just had more tests that verified the code does what the code does, which is circular logic that catches nothing.
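To make the failure mode concrete, here is a sketch of the kind of shallow test that "write tests for this" tends to produce, using a hypothetical `apply_discount` function (the function and test names are mine, for illustration):

```python
# A hypothetical function and the shallow test a bare "write tests for this"
# prompt tends to produce: happy path only, no boundaries, no failure modes.

def apply_discount(price, percent):
    """Return price reduced by percent (e.g. 100, 10 -> 90.0)."""
    return price * (1 - percent / 100)

def test_apply_discount_happy_path():
    # Verifies the code does what the code does -- and nothing else.
    assert apply_discount(100, 10) == 90.0

# Left untested: percent > 100 (negative price), negative percent (a price
# increase), percent = 0, non-numeric inputs, float rounding on odd prices.
```

The test passes and bumps the coverage number, but none of the inputs that actually break in production appear anywhere in it.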
My three-phase workflow
Phase 1: Edge case brainstorming. Before generating any test code, I ask the AI to analyze the function and list every possible edge case. No code yet, just a list.
Look at this function and list every edge case, boundary
condition, and failure mode you can think of. Include
concurrency issues, type coercion problems, and any
implicit assumptions the code makes. Don't write test
code yet, just list the scenarios.
[paste function]
This consistently produces 15-25 scenarios, many of which I wouldn't have thought of. Things like "what if the input array is sorted in reverse order" or "what happens if this timestamp is exactly midnight UTC" or "this regex doesn't handle Unicode characters." I review the list, mark which ones are relevant, and move to phase 2.
Phase 2: Targeted test generation. I take the relevant scenarios from phase 1 and ask for specific tests.
Write tests for these specific scenarios using [framework].
Each test should be independent and test exactly one behavior.
Name format: should_[behavior]_when_[condition]
Scenarios:
[paste selected scenarios from phase 1]
Function under test:
[paste function]
Providing specific scenarios instead of asking for "comprehensive tests" keeps the output focused and meaningful. Each test has a clear purpose documented in its name.
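As a sketch of what phase 2 output looks like, here are tests in the should_[behavior]_when_[condition] format against a hypothetical parse_port function (written with plain asserts rather than pytest.raises so the example is self-contained):

```python
# Hypothetical function under test for the phase 2 example.
def parse_port(value):
    """Parse a string into a TCP port number (1-65535)."""
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port

# One scenario per test; behavior and condition documented in the name.
def test_should_return_int_when_value_is_in_range():
    assert parse_port("8080") == 8080

def test_should_raise_value_error_when_port_is_zero():
    try:
        parse_port("0")
        assert False, "expected ValueError"
    except ValueError:
        pass

def test_should_raise_value_error_when_port_exceeds_maximum():
    try:
        parse_port("65536")
        assert False, "expected ValueError"
    except ValueError:
        pass
```

Each test maps one-to-one to a scenario from phase 1, so a failing test tells you exactly which behavior regressed.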
Phase 3: Mutation testing review. I ask the AI to review the tests from the perspective of a mutation testing tool.
Look at these tests and the function they test.
If I changed one line of the function (introduced a bug),
which mutations would these tests NOT catch?
Suggest additional tests to close those gaps.
This phase catches the subtle gaps. Maybe none of the tests verify the return value precisely, only that it's truthy. Maybe the error handling path is tested for the right error type but not the right error message. These are the kinds of gaps that let real bugs slip through.
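The truthy-assertion gap looks like this in practice; the function and records here are hypothetical, chosen to show a mutation that a weak assertion survives:

```python
# The kind of gap phase 3 surfaces: a truthy assertion that survives
# mutations a precise one would catch.

def find_active_ids(records):
    """Return the ids of active records."""
    return [r["id"] for r in records if r.get("active")]

records = [{"id": 1, "active": True}, {"id": 2, "active": False}]

# Weak: still passes if the filter is mutated to `if True`, because the
# wrong result [1, 2] is also truthy.
assert find_active_ids(records)

# Precise: pins the exact value, so the same mutation now fails the test.
assert find_active_ids(records) == [1]
```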
Specific techniques that work well
Property-based test suggestions. Ask the AI "what properties should always be true about this function's output regardless of input?" This generates invariants that make excellent property-based tests. For a sort function: the output has the same length as the input, it contains exactly the same elements as the input (it's a permutation, not just a subset), and each element is less than or equal to the next.
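Those sort invariants can be expressed as executable properties with the standard library alone; a framework like hypothesis would generate the inputs for you, but a random-input loop is enough to sketch the idea:

```python
# The sort invariants as executable properties, stdlib only.
import random

def check_sort_properties(sort_fn, xs):
    out = sort_fn(xs)
    assert len(out) == len(xs)                        # length preserved
    assert sorted(out) == sorted(xs)                  # same elements (permutation)
    assert all(a <= b for a, b in zip(out, out[1:]))  # ordered

# Hammer the properties with random inputs instead of hand-picked cases.
for _ in range(100):
    xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    check_sort_properties(sorted, xs)
```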
Failure injection. Ask "if the database/network/filesystem call inside this function fails, what should happen?" AI is good at identifying external dependency failures that need handling.
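A minimal failure-injection sketch, assuming a hypothetical fetch_profile that wraps a network call taken as a parameter so the failure can be injected directly:

```python
# Inject the failure instead of hoping the network misbehaves on cue.

class NetworkError(Exception):
    pass

def fetch_profile(user_id, http_get):
    """Return the profile dict, or None if the backend is unreachable."""
    try:
        return http_get(f"/users/{user_id}")
    except NetworkError:
        return None  # degrade gracefully rather than crash the caller

def failing_get(url):
    raise NetworkError("connection refused")

# The injected failure pins down what "should happen": None, not a crash.
assert fetch_profile(42, failing_get) is None
```

Passing the call as a parameter (or injecting the client) is what makes this testable without touching a real network.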
Boundary value analysis. Ask "what are the boundary values for each parameter?" AI will identify minimums, maximums, zeros, empty collections, and transition points that deserve explicit test cases.
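Boundary values translate naturally into table-driven tests; here is a sketch with a hypothetical validate_quantity function, testing each transition point and the values just outside it:

```python
# Boundary values as explicit, table-driven cases.

def validate_quantity(n, max_allowed=100):
    """Accept integer quantities from 1 to max_allowed inclusive."""
    return isinstance(n, int) and 1 <= n <= max_allowed

# Minimum, maximum, and the values immediately outside each.
boundary_cases = [(0, False), (1, True), (100, True), (101, False)]

for value, expected in boundary_cases:
    assert validate_quantity(value) is expected
```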
What to watch out for
AI-generated tests sometimes test implementation details instead of behavior. If the test breaks when you refactor the internals without changing behavior, it's a bad test. Review each generated test and ask "would this test still be valid if I rewrote the function differently but with the same inputs and outputs?"
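The distinction is easier to see side by side; with a hypothetical dedupe function, only the behavioral assertion survives a rewrite:

```python
# Behavior test vs. implementation-detail test.

def dedupe(items):
    """Remove duplicates, preserving first-seen order."""
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

# Behavior: valid for any implementation with the same inputs and outputs,
# including a rewrite using dict.fromkeys or a nested loop.
assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]

# Implementation detail (bad): asserting that dedupe uses a set internally,
# e.g. by patching `set` and counting calls, breaks the moment you refactor
# even though the observable behavior is unchanged.
```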
AI also tends to over-mock. It'll mock three dependencies when passing real objects would give you a simpler test. I explicitly tell it "no mocking unless the dependency has side effects or is slow," which produces more useful, less brittle tests.
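For example, a dependency that is cheap to construct and has no side effects can simply be passed in as-is; this sketch uses a hypothetical TaxTable and total_price:

```python
# A real object instead of a mock: TaxTable has no side effects and is
# cheap to construct, so there is nothing to gain by mocking it.

class TaxTable:
    def __init__(self, rates):
        self.rates = rates

    def rate_for(self, region):
        return self.rates.get(region, 0.0)

def total_price(net, region, tax_table):
    return round(net * (1 + tax_table.rate_for(region)), 2)

# Passing the real object keeps the test short and refactor-proof.
table = TaxTable({"EU": 0.20})
assert total_price(10.0, "EU", table) == 12.0
assert total_price(10.0, "XX", table) == 10.0  # unknown region: no tax
```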
Don't blindly run AI-generated tests and consider yourself covered. Read every assertion. Understand what each test verifies. Delete tests that don't add confidence. A test suite of 20 meaningful tests beats 100 superficial ones every time.
The results
Since adopting this workflow, my test suites catch more bugs during development, before they hit code review or production. The three-phase approach takes about 20 minutes per function versus 45 minutes writing tests manually. The quality is comparable or better because the AI-driven brainstorming phase surfaces scenarios I would have missed. This is one of the clearest AI productivity wins I've experienced.