I was checking BenchGecko during lunch when the GPT-5.5 numbers showed up. Terminal-Bench at 82.7%, CyberGym at 81.8%. I closed my laptop, told my team I had a dentist appointment, and went home to test it.
OpenAI dropped GPT-5.5 this morning. They're calling it "a new class of intelligence for real work," which is the kind of sentence that makes you instinctively distrust everything that follows. But the benchmarks are hard to ignore. Terminal-Bench jumped from 75.1% to 82.7%. FrontierMath T4 went from 27.1% to 35.4%. These aren't incremental bumps. On paper, GPT-5.5 just took a sledgehammer to Opus 4.7 on almost every metric OpenAI chose to publish.
I don't trust paper. I trust my Go file, my broken Next.js project, and the four hours I just spent yelling at my terminal. Here's what actually happened.
The Numbers First
Before I get into my tests, the raw data from OpenAI's benchmark table. I'm including Opus 4.7 and Gemini 3.1 Pro for context because those are the two models I'd actually consider switching from.
| Benchmark | GPT-5.5 Thinking | GPT-5.4 Thinking | GPT-5.5 Pro | GPT-5.4 Pro | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | - | - | 69.4% | 68.5% |
| GDPval | 84.9% | 83.0% | 82.3% | 82.0% | 80.3% | 67.3% |
| OSWorld-Verified | 78.7% | 75.0% | - | - | 78.0% | - |
| Toolathlon | 55.6% | 54.6% | - | - | - | 48.8% |
| BrowseComp | 84.4% | 82.7% | 90.1% | 89.3% | 79.3% | 85.9% |
| FrontierMath T1-3 | 51.7% | 47.6% | 52.4% | 50.0% | 43.8% | 36.9% |
| FrontierMath T4 | 35.4% | 27.1% | 39.6% | 38.0% | 22.9% | 16.7% |
| CyberGym | 81.8% | 79.0% | - | - | 73.1% | - |
The Terminal-Bench gap between GPT-5.5 (82.7%) and Opus 4.7 (69.4%) is 13.3 points. That's not close. That's a different weight class. CyberGym at 81.8% means GPT-5.5 is closing in on Mythos's 83.1%. FrontierMath T4 at 35.4% is more than half again Opus 4.7's 22.9%, which is wild for advanced math reasoning.
But I don't trust benchmarks. Never have. Every lab picks the evaluations that make them look best. OpenAI didn't include SWE-bench Verified in this table, where Opus 4.7 still leads at 87.6%. Funny how that works. So I closed the tab and started testing.
Test 1: The Refactoring Baseline
I have a 400-line Go file I throw at every new model. It's called pipeline.go and it's disgusting on purpose. Three interfaces that overlap in confusing ways (Processor, Handler, Transformer - the last two do basically the same thing), nested error handling six levels deep in one function, and a goroutine that spawns inside a for loop without a WaitGroup or any cancellation. The goroutine leak is the hardest thing to catch because the code doesn't crash - it just slowly eats memory until you notice your container restarting at 3 AM.
GPT-5.4 used to find the overlapping interfaces and suggest merging Handler and Transformer. It never caught the goroutine leak. Opus 4.7 caught the leak about half the time, depending on how I phrased the prompt (which tells you something about how fragile these models are on real tasks - the prompt wording shouldn't matter this much but it does).
GPT-5.5 caught it on the first try. Not only did it identify the missing WaitGroup, it actually explained why the leak was hard to detect: "The spawned goroutines hold references to the channel but the channel is never closed when the parent context is cancelled, so they block on send indefinitely." That's exactly right. It then suggested a fix using errgroup from golang.org/x/sync that I wouldn't have thought of myself - I would have just slapped a WaitGroup on it.
The error handling refactoring was also better than 5.4. It pulled the six-level nesting into an early-return pattern and introduced a custom error type that wraps context about which pipeline stage failed. I've seen Opus 4.7 do similar things, but 5.5's version was cleaner. The variable names were better. Small thing, but it matters when you're reading the diff at 11 PM.
One miss though: it didn't flag that Processor has a method called Process that shadows the interface name. Opus 4.7 catches that. Every time. Not a huge deal, but if we're keeping score, 5.5 isn't perfect here.
Test 2: Multi-File Debugging
I have a Next.js project (14.3, App Router) that's been broken for two weeks. The bug: a server component fetches user preferences from an API route, passes them as props to a client component, and the client component renders with the wrong data on first load. Specifically, the theme preference shows "light" even when the database says "dark." Hard refresh fixes it. It only happens when the page is accessed from a link within the app, not on direct navigation.
I know the root cause. I've known for a week. It's a stale closure in a useEffect that reads from a client-side cache before the server component's props have hydrated. The cache was populated by a previous page visit and the effect fires before React reconciles the server-rendered HTML with the client state. Classic hydration race condition. I just haven't had time to fix it properly because every "proper" fix I've tried breaks the caching layer.
I gave GPT-5.5 three files: the server component (page.tsx), the client component (ThemeProvider.tsx), and the API route (route.ts). Total around 280 lines. Same prompt I used on Opus 4.7 last week: "There's a hydration bug causing stale data on client-side navigation. Find it."
GPT-5.5 went down the wrong path first. It spent about 400 tokens talking about a potential issue with revalidatePath that doesn't exist in my code (I'm not using ISR on this route). That was annoying. But then it corrected itself and zeroed in on the useEffect dependency array missing the server-provided prop. It identified the stale closure correctly and, here's what I didn't expect, suggested using useSyncExternalStore to bridge the server and client state instead of just fixing the dependency array.
I hadn't considered useSyncExternalStore for this. That's actually a better fix than anything I came up with because it avoids the flash of wrong content entirely - the server snapshot is used until the client store syncs. I need to test it more, but early results look solid.
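For anyone who hasn't used useSyncExternalStore for this, here's a framework-free sketch of the store shape it expects (subscribe, getSnapshot, getServerSnapshot). The component wiring is omitted and names like createThemeStore are mine, not from my actual code:

```typescript
type Theme = "light" | "dark";

function createThemeStore(serverTheme: Theme) {
  let clientTheme: Theme | null = null; // populated once the client cache syncs
  const listeners = new Set<() => void>();

  return {
    subscribe(cb: () => void) {
      listeners.add(cb);
      return () => listeners.delete(cb);
    },
    // Client snapshot: fall back to the server value until the cache syncs,
    // so there's no flash of stale content from a previous page visit.
    getSnapshot: (): Theme => clientTheme ?? serverTheme,
    getServerSnapshot: (): Theme => serverTheme,
    syncFromCache(cached: Theme) {
      clientTheme = cached;
      listeners.forEach((cb) => cb());
    },
  };
}

const store = createThemeStore("dark");
console.log(store.getSnapshot()); // server value wins until the cache syncs
```

Inside ThemeProvider.tsx this would be consumed as `useSyncExternalStore(store.subscribe, store.getSnapshot, store.getServerSnapshot)`, which is what removes the stale-closure problem: React reads through the store instead of a captured prop.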
For reference, Opus 4.7 found the stale closure immediately (no wrong path first) but suggested fixing the dependency array, which is the obvious fix that doesn't fully solve the problem. GPT-5.5 took a detour but arrived at a better destination. I'm not sure which behavior I prefer. Getting the right answer slowly with a false start, or getting an okay answer instantly? Depends on the day, I guess. And on how much AI-generated code you're willing to blindly ship.
Test 3: Build Something From Scratch
"Write me a CLI tool in Python that monitors a directory for new files and uploads them to S3. Use watchdog for filesystem events, boto3 for S3, and include retry logic for failed uploads."
I give this prompt to every model. It's a good test because it involves real dependencies, error handling, configuration, and the kind of boring plumbing code that AI models should be perfect at.
GPT-5.4 always produced working code but missed edge cases. It wouldn't handle the case where a file is created and then immediately deleted before the upload starts. It wouldn't debounce rapid file creation events. And it would put AWS credentials in a config file instead of using environment variables or IAM roles.
GPT-5.5 produced something surprisingly good. First attempt. It used watchdog.observers.Observer with a custom FileSystemEventHandler, which is standard. But it also added a debounce mechanism using a dictionary of pending uploads with timestamps - if a file is modified within 2 seconds of creation, it resets the timer instead of uploading twice. It used exponential backoff on S3 upload failures with a max of 3 retries. It pulled credentials from environment variables with a fallback to ~/.aws/credentials. It even added a --dry-run flag.
The one thing it didn't do that I would have liked: it didn't handle the "file deleted before upload" case. I pointed this out and it immediately added an os.path.exists() check before the upload call with proper logging. Not a big deal, but it's the kind of edge case that separates vibe-coded prototypes from production code.
Opus 4.7 on this same prompt asks clarifying questions first ("What S3 bucket? What file types? Should I filter by extension?"), which I appreciate in a coding assistant but hate in a benchmark test. GPT-5.5 just builds the thing and lets you customize after. Different philosophies. I think for CLI tools and scripts, just building it is the right call. For larger applications, asking first is better. Neither model adapts its approach to the task size, which is something I wish someone would fix.
Test 4: Code Review Head-to-Head
Same PR. Same prompt. GPT-5.5 vs Opus 4.7. The PR is a real one from a side project - about 340 lines adding a rate limiter to a Go HTTP middleware. I wrote it quickly last weekend and I know it has problems because I wrote it at 1 AM after too much coffee.
Prompt: "Review this PR for bugs, security issues, performance problems, and code style. Be specific."
What GPT-5.5 caught:
- The rate limiter uses `time.Now()` inside a goroutine without considering clock skew in distributed deployments. Fair point, though I'm running this on a single instance.
- The token bucket refill rate is calculated using integer division, which silently drops fractional tokens. At low rates (like 1 request per 10 seconds), this means the bucket never refills. Legitimate bug. I would have found this in production at 3 AM.
- The mutex protecting the bucket map is held during the entire HTTP handler execution, not just during bucket access. This serializes all requests through the rate limiter. Performance killer I completely missed.
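My middleware isn't shown here, but the integer-division refill bug reduces to a few lines (illustrative names). With integer math, a rate of 1 token per 10 seconds computes a per-second refill of 1/10 = 0, so the bucket never refills; accumulating fractional tokens in a float fixes it:

```go
package main

import "fmt"

// refillInt reproduces the bug: the fractional part of the per-second rate is
// silently dropped, so low rates refill zero tokens forever.
func refillInt(tokens, capacity, ratePerInterval, intervalSec, elapsedSec int) int {
	perSec := ratePerInterval / intervalSec // 1/10 == 0 in integer math
	tokens += perSec * elapsedSec
	if tokens > capacity {
		tokens = capacity
	}
	return tokens
}

// refillFloat keeps the fraction, so tokens accumulate as expected.
func refillFloat(tokens, capacity, ratePerInterval, intervalSec, elapsedSec float64) float64 {
	tokens += ratePerInterval / intervalSec * elapsedSec
	if tokens > capacity {
		tokens = capacity
	}
	return tokens
}

func main() {
	fmt.Println(refillInt(0, 5, 1, 10, 30))   // bucket stays empty after 30s
	fmt.Println(refillFloat(0, 5, 1, 10, 30)) // three tokens after 30s
}
```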
What Opus 4.7 caught:
- The rate limiter key is derived from `r.RemoteAddr` without stripping the port, so the same client gets a new bucket on every connection if their source port changes. Security issue - the rate limiter doesn't actually limit.
- No `X-RateLimit-Remaining` or `Retry-After` headers in the 429 response. Standards compliance.
- The bucket cleanup goroutine runs every 60 seconds but the bucket expiry is 30 seconds, so there's a window where expired buckets still consume memory. Minor, but correct.
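The RemoteAddr fix is one stdlib call; here's a sketch (the function name is mine, not from the PR). `r.RemoteAddr` is "ip:port", so keying buckets on it hands every connection a fresh bucket; net.SplitHostPort keys on the client IP instead:

```go
package main

import (
	"fmt"
	"net"
)

// bucketKey derives a rate-limiter key from an http.Request's RemoteAddr by
// stripping the ephemeral source port. Falls back to the raw string when no
// port is present.
func bucketKey(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

func main() {
	fmt.Println(bucketKey("203.0.113.7:54321")) // same key...
	fmt.Println(bucketKey("203.0.113.7:54322")) // ...regardless of source port
}
```

Behind a load balancer you'd key on a trusted `X-Forwarded-For` hop instead, since RemoteAddr would be the proxy's address — a separate problem neither model raised.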
Here's the thing. GPT-5.5 found the worse bugs. The integer division issue and the mutex scope problem are both real, would-bite-me-in-production findings. Opus 4.7 found the RemoteAddr port issue, which is arguably more important from a security perspective because it means the rate limiter is functionally broken. But Opus missed the performance and math bugs, and 5.5 missed the security issue.
Neither model caught everything. If you're relying on a single AI model for code review, you're going to miss things regardless of which one you pick. I've been saying this for months in my workflow guide - run both if you can afford it. The overlap in findings is maybe 40%. The union is where the value is.
The Pricing Math
GPT-5.5's API price is 2x GPT-5.4. That sounds terrible. But OpenAI claims it uses fewer tokens per task because it's more efficient - solves problems in fewer passes, generates less filler. I tracked the numbers across my four test sessions to see if that's true.
Test 1 (Go refactoring): GPT-5.4 used ~3,200 output tokens across two passes (it didn't catch the goroutine leak, so I had to prompt again). GPT-5.5 used ~2,100 tokens in a single pass and caught everything. At 2x the per-token rate, 5.5 cost about 31% more than 5.4 for a better result. Worth it.
Test 2 (Next.js debugging): GPT-5.5 used ~2,800 tokens including the false start about revalidatePath. GPT-5.4 on the same task (which I ran last month) used ~1,900 tokens but gave a wrong answer. Hard to compare costs when one model is right and the other isn't.
Test 3 (CLI tool): GPT-5.5 generated the complete tool in ~1,600 tokens. GPT-5.4 generates a comparable (but worse) version in ~1,400 tokens. At 2x pricing, 5.5 costs about 2.3x for a meaningfully better output. The efficiency claim doesn't hold here - this was a straightforward generation task where both models get it done in one pass.
Test 4 (Code review): Nearly identical token counts. ~1,100 for 5.5 and ~1,050 for 5.4. So 5.5 costs almost exactly 2x for a better review.
Bottom line: the "fewer tokens" claim is true for tasks where GPT-5.4 would need multiple attempts and 5.5 gets it in one shot. For one-shot tasks where both models succeed, you're paying close to 2x more. If you use AI tools at any serious volume, that matters. My rough estimate is that my monthly API spend goes up 40-60% switching from 5.4 to 5.5, depending on task mix. Same latency per token though, so at least you're not waiting longer.
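The arithmetic behind those per-test comparisons is just (new tokens / old tokens) × price multiplier; a two-line helper makes it easy to rerun with your own token counts:

```python
def cost_ratio(new_tokens: int, old_tokens: int, price_mult: float = 2.0) -> float:
    """Effective cost of the new model relative to the old one."""
    return new_tokens / old_tokens * price_mult

print(round(cost_ratio(2100, 3200), 2))  # Test 1: single pass beats two passes
print(round(cost_ratio(1600, 1400), 2))  # Test 3: no efficiency win, ~2.3x cost
```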
Opus 4.7's One-Week Problem
Anthropic released Opus 4.7 literally seven days ago. One week. And here's GPT-5.5 beating it on every single benchmark in this table. The Terminal-Bench gap is 13.3 points. CyberGym is 8.7 points. FrontierMath T4 is 12.5 points. BrowseComp is 5.1 points. Even on OSWorld, where Opus 4.7 was competitive at 78.0%, GPT-5.5 edges it at 78.7%.
That has to sting. You spend months building a new model, ship it to applause, and seven days later someone drops numbers that make yours look like a mid-tier also-ran. The April 2026 rankings just reshuffled completely.
The saving grace for Anthropic is Mythos. CyberGym 83.1% vs GPT-5.5's 81.8%. Terminal-Bench 82.0% vs 82.7% - nearly tied. If Mythos were publicly available, this would be a real fight. But Mythos is locked behind Project Glasswing and nobody I know has access. You can't win a market with a model that 99.97% of developers can't use.
The other saving grace is SWE-bench. OpenAI conspicuously didn't include SWE-bench Verified or SWE-bench Pro in this release's benchmark table. Opus 4.7 scores 87.6% on Verified and 64.3% on Pro. If GPT-5.5 had better numbers, they'd be in the table. The absence tells you something.
This is the reality of the AI model race in April 2026: no single model wins everything. Every lab cherry-picks benchmarks. The models that actually matter to developers are the ones that work best on your specific tasks, not the ones with the best press release. Which brings me to the question I've been dodging all evening.
Am I Switching?
Short answer: partially.
I'm keeping Claude as my primary for agentic work. Claude Code is too well-integrated into my workflow - the hooks, the project memory, the way it handles multi-file edits. GPT-5.5 is a better raw model on these benchmarks, but the model is only half the story. The tooling around it matters just as much, and OpenAI's Codex improvements still aren't at Claude Code's level for the kind of iterative, multi-step coding I do daily.
For one-shot tasks, though? I'm switching to GPT-5.5. Code generation from a spec. Refactoring a single file. Writing a CLI tool. Code reviews where I just paste a diff and want feedback. GPT-5.5 is flat-out better at these. The pipeline.go test convinced me - catching that goroutine leak on the first pass is something no other model has done consistently.
For the Claude vs GPT debate more broadly: the answer as of today is "both." I know that sounds like a cop-out. It isn't. I literally have both open right now, using them for different things. GPT-5.5 for generation, Claude for iteration. If I had to pick one and only one, gun to my head, I'd probably pick GPT-5.5 right now. The benchmark gaps are too wide to ignore. But I'd miss Claude Code within a day, and I'd probably buy back in within a week.
The real question isn't which model is better. It's whether the 2x API price is worth it over GPT-5.4 for people who aren't obsessing over marginal quality gains. For most developers, 5.4 is still plenty good. GPT-5.5 is for people who already pushed 5.4 to its limits and kept hitting walls. I'm one of those people. Most developers aren't.
I'll update my AI workflow guide and the April model comparison this week once I've had more time with it. Right now I'm running on four hours of testing and a fake dentist appointment's worth of guilt. But first impressions are strong. OpenAI wasn't bluffing on this one.
Now I need to figure out if I can expense the API bill from tonight's testing. 340,000 tokens across four models isn't cheap, and my manager is going to ask why I was at the "dentist" for six hours.