I ran both Claude Opus 4 and GPT-5 through the same set of coding tasks over two weeks. Not synthetic benchmarks. Real tasks from my actual projects. Here's what I found.
The Test Suite
I designed 15 tasks across five categories: bug fixes, feature additions, refactoring, test writing, and code review. Each model received identical instructions for each task. I scored on correctness (does it work?), quality (is it clean?), and instruction following (did it do what I asked, nothing more, nothing less?).
Context Handling
This was the category with the biggest gap. Claude Opus 4 consistently maintained coherence across long conversations and large codebases. When I loaded a 3,000-line module and asked for a change that depended on understanding relationships between functions 200 lines apart, Claude got it right every time.
GPT-5 improved significantly over GPT-4o in this area, but still showed drift on longer contexts. Around the 80K token mark, it started missing references to things I'd established earlier in the conversation. I had to re-state constraints more often.
Winner: Claude Opus 4. Not close.
Code Quality
Both models produce professional-grade code now. The days of obvious AI-generated code with weird patterns are mostly over.
Claude's code tends to be more conservative. It matches existing patterns, uses standard library functions, and avoids clever one-liners. GPT-5 is slightly more creative, sometimes using newer language features or more concise patterns. Whether that's better depends on your preference.
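To make that style difference concrete, here's an illustrative sketch of my own (not actual model output) of the two approaches to the same small task, counting word frequencies:

```python
from collections import Counter

# Conservative style: explicit loop, plain dict, easy to step through.
def word_counts_explicit(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Concise style: same behavior, delegated to a stdlib one-liner.
def word_counts_concise(words):
    return dict(Counter(words))
```

Both are correct; the explicit version is easier to review line by line, the concise one is less code to maintain. That's the whole tradeoff in miniature.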
One consistent difference: Claude produces better error handling. Not just try-catch blocks, but actual error types, meaningful messages, and proper error propagation. GPT-5 tends to add error handling that covers the happy path but misses edge cases unless you specifically ask.
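Here's a sketch of that difference, with hypothetical function names; neither snippet is actual model output. The first covers only the happy path and swallows the cause of failure; the second uses a typed error, meaningful messages, and proper propagation via exception chaining:

```python
import json

class ConfigError(Exception):
    """Raised when a config file is missing or malformed."""

# Happy-path-only handling: the caller can't tell *why* it failed.
def load_config_loose(path):
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return None

# Typed errors with meaningful messages and chained causes.
def load_config_strict(path):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError as e:
        raise ConfigError(f"config file not found: {path}") from e
    except json.JSONDecodeError as e:
        raise ConfigError(f"invalid JSON in {path} at line {e.lineno}") from e
```

The strict version is what I mean by "better error handling": a caller can catch `ConfigError`, log a useful message, and still inspect the original cause.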
Winner: Slight edge to Claude for production code. GPT-5 for prototyping.
Instruction Following
This is where Claude really pulls ahead. When I say "only modify the handler function, don't touch the middleware," Claude modifies only the handler. GPT-5 would occasionally "helpfully" improve the middleware too, which is exactly the kind of unexpected change that breaks things in a team environment.
Claude also handles negative constraints better. "Don't use any external libraries" actually means no external libraries. GPT-5 treated this as a soft suggestion a couple of times, pulling in a utility library because "it makes the solution cleaner."
Winner: Claude Opus 4. Decisively.
Refactoring
I gave both models a 400-line function and asked them to break it into smaller, well-named functions without changing behavior. This is where I expected GPT-5 to shine since it tends to be more opinionated about code structure.
Both did well. GPT-5 produced a slightly more modular result with cleaner separation of concerns. Claude produced a safer refactoring with more conservative boundaries between the extracted functions. Claude's version was easier to verify for correctness because each extraction was smaller and more obvious.
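To show what I mean by conservative boundaries, here's a toy sketch (mine, not model output): a small function doing three jobs, then a safe extraction where each helper is small enough to verify at a glance:

```python
# Before: parsing, validation, and formatting tangled in one function.
def summarize_order(raw):
    parts = raw.split(",")
    item, qty, price = parts[0].strip(), int(parts[1]), float(parts[2])
    if qty <= 0:
        raise ValueError("quantity must be positive")
    total = qty * price
    return f"{item}: {qty} x {price:.2f} = {total:.2f}"

# After: each extraction does one obvious thing, so behavior is easy to check.
def parse_order(raw):
    item, qty, price = raw.split(",")
    return item.strip(), int(qty), float(price)

def validate_quantity(qty):
    if qty <= 0:
        raise ValueError("quantity must be positive")

def format_summary(item, qty, price):
    return f"{item}: {qty} x {price:.2f} = {qty * price:.2f}"

def summarize_order_refactored(raw):
    item, qty, price = parse_order(raw)
    validate_quantity(qty)
    return format_summary(item, qty, price)
```

A more aggressive refactoring might also introduce a dataclass for the parsed order; that's the kind of extra structure GPT-5 reached for and Claude avoided.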
Winner: Tie. Different tradeoffs.
Test Writing
I gave both models the same function and asked for comprehensive tests. Claude wrote 12 tests covering the happy path, edge cases, error cases, and boundary values. GPT-5 wrote 9 tests that covered the happy path and some edge cases but missed a few error conditions.
More importantly, Claude's test names were descriptive enough that you could understand the test suite without reading the implementations. GPT-5's names were more generic, like "test_edge_case_1" rather than "test_returns_empty_when_input_is_null."
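For illustration, here's the naming difference on a hypothetical function of my own (not actual output from either model):

```python
def first_word(text):
    """Return the first whitespace-separated word, or '' for empty input."""
    if text is None:
        return ""
    parts = text.split()
    return parts[0] if parts else ""

# Generic naming: you have to read the body to know what's being tested.
def test_edge_case_1():
    assert first_word(None) == ""

# Descriptive naming: the suite documents the behavior by itself.
def test_returns_empty_when_input_is_null():
    assert first_word(None) == ""

def test_returns_first_word_for_multiword_input():
    assert first_word("hello world") == "hello"
```

When a test fails in CI six months later, the descriptive name tells you what broke before you open the file.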
Winner: Claude Opus 4.
Speed and Cost
GPT-5 is faster for most responses. Claude Opus 4 thinks longer, especially on complex tasks. The difference is noticeable but not a dealbreaker. Claude's thinking time usually correlates with better output, so I've learned to be patient.
Cost-wise, they're comparable for API usage. Claude Code's subscription model makes it more predictable for heavy use.
My Verdict
For day-to-day coding work, Claude Opus 4 wins. Better context handling, better instruction following, better tests, better error handling. The gap in instruction following alone is enough to settle my recommendation.
GPT-5 is still excellent and beats Claude for certain creative tasks, rapid prototyping, and conversations where you want the model to be more opinionated. If I'm brainstorming architecture, I'll sometimes use GPT-5 because it's more willing to push back on my ideas.
But for writing production code that I need to ship? Claude, every time.