I wanted to test the limits of autonomous AI coding. Not a landing page or a todo app. A real application with a database, authentication, an API, and a frontend. I wrote a detailed spec, gave it to Claude Code, and walked away. Here's what happened.

The Spec

I wrote a 2-page spec for a bookmark manager. Features: user registration and login, save bookmarks with tags, full-text search, import/export, and a clean UI. Tech stack: Next.js 14 with App Router, Prisma with PostgreSQL, NextAuth for authentication, and Tailwind for styling.

The spec included database schemas, API endpoint definitions, page layouts, and acceptance criteria for each feature. I spent about 90 minutes writing it. This is more detailed than most product specs I've seen in real companies.

The Process

I created an empty directory, initialized a Next.js project, and started a Claude Code session with the spec in the CLAUDE.md file. I ran it in headless mode (claude -p) with the instruction: "Build the full application described in CLAUDE.md. Set up the database, create all models, implement all API endpoints, build all pages, and write tests. Don't ask me questions, make reasonable decisions."

Then I closed the terminal and went to lunch.

What Came Back: The 80% That Worked

Database setup: Perfect. Prisma schema matched my spec exactly. Migrations were clean. Indexes on the right columns. Foreign keys and cascading deletes configured correctly.

Authentication: Perfect. NextAuth with credentials provider, proper session handling, protected routes, middleware for auth checking. It even added rate limiting on the login endpoint, which I didn't specify but appreciated.

API endpoints: Almost perfect. All CRUD operations for bookmarks worked. Tag management worked. Search worked with proper full-text Postgres queries. The API responses had consistent formatting and proper error codes.

Frontend structure: Good. Pages for dashboard, bookmark list, search, settings. Components were well-organized. The layout was responsive. Navigation worked.

Tests: Decent. Unit tests for the API endpoints covered the main paths. Integration tests for auth flow worked. About 65% coverage overall.

The 20% That Failed

This is the interesting part. The failures reveal where AI coding actually breaks down.

Import/export was broken. The export generated valid JSON, but the import expected a different format: it read a url field where the export wrote bookmark_url. This is a classic integration bug: each piece works in isolation, but the two sides don't match. An agent working on one file at a time doesn't catch this without end-to-end tests.
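The fix is the standard one for this class of bug: make both sides share a single definition of the format. Here's a minimal sketch of what I mean, with the importer tolerating both field names during the transition. The type and function names are mine, not from the generated code.

```typescript
// One shared shape used by both the exporter and the importer, so the
// two sides can't silently drift apart.
interface Bookmark {
  url: string;
  title: string;
  tags: string[];
}

// Normalize a raw imported object to the shared shape, accepting either
// the current `url` key or the legacy `bookmark_url` key.
function parseImportedBookmark(raw: Record<string, unknown>): Bookmark {
  const url = raw.url ?? raw.bookmark_url; // accept both field names
  if (typeof url !== "string" || url.length === 0) {
    throw new Error("bookmark entry is missing a url");
  }
  return {
    url,
    title: typeof raw.title === "string" ? raw.title : "",
    tags: Array.isArray(raw.tags) ? raw.tags.map(String) : [],
  };
}
```

An end-to-end test that exports and then re-imports the same data would have caught the mismatch automatically, which is exactly the kind of test the agent didn't write.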

The UI had subtle interaction bugs. The tag selector dropdown didn't close when you clicked outside it. The bookmark edit form didn't pre-populate existing tags. The search results page lost the query when you paginated. These are the kinds of bugs that only show up when you actually use the application as a user, clicking around and testing workflows.
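The lost-query bug, at least, reduces to a small, testable function: pagination links have to carry the current search state instead of just the page number. A hedged sketch, assuming the query lives in a q URL parameter (my assumption, not confirmed from the generated code):

```typescript
// Build a pagination link that preserves the current search query,
// instead of dropping it the way the generated results page did.
function pageUrl(basePath: string, query: string, page: number): string {
  const params = new URLSearchParams();
  if (query) params.set("q", query); // keep the search term across pages
  params.set("page", String(page));
  return `${basePath}?${params.toString()}`;
}
```

The dropdown and pre-population bugs don't reduce as cleanly; those genuinely need someone clicking through the UI.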

Error states were incomplete. The happy paths all worked. But try to search with an empty query, import a malformed file, or visit a bookmark that was deleted by another session. The error handling was either missing or showed a generic "Something went wrong" message. AI wrote great error handling at the API level but neglected it at the UI level.
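The pattern I ended up adding is unglamorous: enumerate the failure cases and map each to a specific message, with the generic fallback reserved for the truly unexpected. A sketch, where the error names and copy are illustrative rather than taken from the app:

```typescript
// Enumerate the known UI failure cases instead of collapsing them all
// into one generic "Something went wrong".
type UiError = "empty-query" | "deleted-bookmark" | "bad-import-file";

const errorMessages: Record<UiError, string> = {
  "empty-query": "Type something to search your bookmarks.",
  "deleted-bookmark": "That bookmark was deleted, possibly in another session.",
  "bad-import-file": "Couldn't read that file. Expected the JSON produced by Export.",
};

// Classify a search input; null means the input is fine to submit.
function classifySearchInput(query: string): UiError | null {
  return query.trim().length === 0 ? "empty-query" : null;
}
```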

Performance wasn't considered. The bookmark list loaded all bookmarks at once, then paginated client-side. With 10 bookmarks, this is fine. With 10,000, it would be unusable. The AI chose the simplest implementation and didn't think about scale, which is exactly what I'd expect from a junior developer.
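The fix was to move pagination to the server: compute an offset and limit from the page parameters and pass them to the database query (Prisma's findMany takes skip and take arguments for exactly this). A sketch of the offset math, where the function name and the 50-item cap are my choices:

```typescript
// Compute skip/take for a server-side paginated query, so the server
// returns one page instead of the entire bookmark table.
function paginationArgs(page: number, pageSize: number): { skip: number; take: number } {
  const size = Math.min(Math.max(1, Math.floor(pageSize)), 50); // cap page size
  const p = Math.max(1, Math.floor(page)); // pages are 1-indexed
  return { skip: (p - 1) * size, take: size };
}
```

For very large collections, cursor-based pagination scales better than offsets, but for a personal bookmark manager this is plenty.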

The Fix Time

It took me about 3 hours to fix the 20% that was broken. The import/export mismatch was a 15-minute fix. The UI interaction bugs took about an hour. The error states took another hour. The pagination fix was 30 minutes.

Total time from spec to working app: 90 minutes (spec) + ~45 minutes (AI building) + 3 hours (human fixes) = about 5 hours. My estimate for building this from scratch would have been 3-4 days. So we're looking at roughly a 5x speedup, which aligns with my earlier estimates.

What This Tells Us

AI is excellent at implementing known patterns. Auth, CRUD, database schemas, API endpoints. These are solved problems and AI executes them near-perfectly.

AI is weak at integration. The points where features connect to each other, where data flows between systems, where user workflows span multiple pages. These require a holistic understanding that current AI doesn't have.

AI doesn't use the app. The UI bugs were all discoverable by spending 5 minutes clicking around the application. But AI doesn't "use" what it builds. It writes code that should work based on its understanding of the spec. The feedback loop of using your own product is still uniquely human.

The spec is everything. The quality of the AI output was directly proportional to the quality of my spec. The database section was detailed and the database was perfect. The UI section was vaguer and the UI had the most issues.

My Takeaway

Autonomous AI coding works for about 80% of a real application. The remaining 20% needs a human who can use the app, spot the integration failures, and handle the edge cases. The future of development isn't "AI builds everything." It's "AI builds the structure, human adds the polish and catches the gaps." Five hours instead of four days is still transformative.