I ran both Claude Opus 4 and GPT-5 through the same set of coding tasks over two weeks. Not synthetic benchmarks. Real tasks from my actual projects. Here's what I found.

The Test Suite

I designed 15 tasks across five categories: bug fixes, feature additions, refactoring, test writing, and code review. Both models received identical instructions for each task. I scored on correctness (does it work?), quality (is it clean?), and instruction following (did it do what I asked, nothing more, nothing less?).

Context Handling

This was the category with the biggest gap. Claude Opus 4 consistently maintained coherence across long conversations and large codebases. When I loaded a 3,000-line module and asked for a change that depended on understanding relationships between functions 200 lines apart, Claude got it right every time.

GPT-5 improved significantly over GPT-4o in this area, but still showed drift on longer contexts. Around the 80K token mark, it started missing references to things I'd established earlier in the conversation. I had to re-state constraints more often.

Winner: Claude Opus 4. Not close.

Code Quality

Both models produce professional-grade code now. The days of obvious AI-generated code with weird patterns are mostly over.

Claude's code tends to be more conservative. It matches existing patterns, uses standard library functions, and avoids clever one-liners. GPT-5 is slightly more creative, sometimes using newer language features or more concise patterns. Whether that's better depends on your preference.

One consistent difference: Claude produces better error handling. Not just try-catch blocks, but actual error types, meaningful messages, and proper error propagation. GPT-5 tends to add error handling that covers the happy path but misses edge cases unless you specifically ask.
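To make that difference concrete, here's a minimal sketch of the style of error handling I mean. This is an illustrative example, not output from either model; `ConfigError` and `load_config` are hypothetical names, not from my test suite.

```python
import json

class ConfigError(Exception):
    """Raised when a config file is missing or malformed."""

def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError as e:
        # A distinct error type, a message that names the file,
        # and chained propagation instead of a bare re-raise.
        raise ConfigError(f"config file not found: {path}") from e
    except json.JSONDecodeError as e:
        raise ConfigError(f"invalid JSON in {path}: line {e.lineno}") from e
    if "name" not in data:
        raise ConfigError(f"{path} is missing required key 'name'")
    return data
```

The happy-path-only version would be the `try` block alone with a generic `except Exception: pass`. The difference shows up the first time a caller has to debug a bad config in production.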

Winner: Slight edge to Claude for production code. GPT-5 for prototyping.

Instruction Following

This is where Claude really pulls ahead. When I say "only modify the handler function, don't touch the middleware," Claude modifies only the handler. GPT-5 would occasionally "helpfully" improve the middleware too, which is exactly the kind of unexpected change that breaks things in a team environment.

Claude also handles negative constraints better. "Don't use any external libraries" actually means no external libraries. GPT-5 treated this as a soft suggestion a couple of times, pulling in a utility library because "it makes the solution cleaner."

Winner: Claude Opus 4. Decisively.

Refactoring

I gave both models a 400-line function and asked them to break it into smaller, well-named functions without changing behavior. This is where I expected GPT-5 to shine since it tends to be more opinionated about code structure.

Both did well. GPT-5 produced a slightly more modular result with cleaner separation of concerns. Claude produced a safer refactoring with more conservative boundaries between the extracted functions. Claude's version was easier to verify for correctness because each extraction was smaller and more obvious.
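For readers who want a picture of what "conservative boundaries" means, here's a toy sketch of that style of extraction. The function and data are hypothetical (my actual task used a 400-line function I can't publish); the point is that each extracted piece is small enough to verify by inspection and the observable behavior is unchanged.

```python
# Before: one block doing validation, summing, and formatting inline.
def summarize(orders):
    total = 0.0
    for o in orders:
        if o.get("qty", 0) > 0 and "price" in o:
            total += o["qty"] * o["price"]
    return f"{len(orders)} orders, total ${total:.2f}"

# After: small, behavior-preserving extractions with obvious seams.
def _is_billable(order):
    return order.get("qty", 0) > 0 and "price" in order

def _billable_total(orders):
    return sum(o["qty"] * o["price"] for o in orders if _is_billable(o))

def summarize_refactored(orders):
    return f"{len(orders)} orders, total ${_billable_total(orders):.2f}"
```

A more aggressive refactor might also restructure the formatting or the data model; the conservative version is easier to diff against the original and sign off on.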

Winner: Tie. Different tradeoffs.

Test Writing

I gave both models the same function and asked for comprehensive tests. Claude wrote 12 tests covering the happy path, edge cases, error cases, and boundary values. GPT-5 wrote 9 tests that covered the happy path and some edge cases but missed a few error conditions.

More importantly, Claude's test names were descriptive enough that you could understand the test suite without reading the implementations. GPT-5's names were more generic, like "test_edge_case_1" rather than "test_returns_empty_when_input_is_null."
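Side by side, the two naming styles look like this. `trim_to_words` is a hypothetical function under test, not one from my tasks; the contrast in the names is the point.

```python
def trim_to_words(text, n):
    """Return the first n whitespace-separated words of text."""
    if text is None:
        return ""
    return " ".join(text.split()[:n])

# Generic naming: you have to read the body to learn what's checked.
def test_edge_case_1():
    assert trim_to_words(None, 3) == ""

# Descriptive naming: the suite reads as a specification.
def test_returns_empty_string_when_input_is_none():
    assert trim_to_words(None, 3) == ""

def test_keeps_all_words_when_limit_exceeds_length():
    assert trim_to_words("a b", 5) == "a b"
```

Six months later, a failing `test_returns_empty_string_when_input_is_none` tells you what broke before you open the file; a failing `test_edge_case_1` tells you nothing.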

Winner: Claude Opus 4.

Speed and Cost

GPT-5 is faster for most responses. Claude Opus 4 thinks longer, especially on complex tasks. The difference is noticeable but not a dealbreaker. Claude's thinking time usually correlates with better output, so I've learned to be patient.

Cost-wise, they're comparable for API usage. Claude Code's subscription model makes it more predictable for heavy use.

My Verdict

For day-to-day coding work, Claude Opus 4 wins. Better context handling, better instruction following, better tests, better error handling. The gap in instruction following alone is enough to make my recommendation.

GPT-5 is still excellent and beats Claude for certain creative tasks, rapid prototyping, and conversations where you want the model to be more opinionated. If I'm brainstorming architecture, I'll sometimes use GPT-5 because it's more willing to push back on my ideas.

But for writing production code that I need to ship? Claude, every time.