Anthropic shipped Claude Opus 4.7 today. I've spent the morning reading the announcement, running it through my usual tests, and watching the community react. The benchmarks are real improvements. The coding gains are significant. But there's a pricing trick buried in the release notes, a reasoning system nobody asked for, and a community that was already losing patience before 4.7 even dropped.
Here's the honest version of this launch.
## The Benchmarks
The headline number is SWE-bench Verified: 87.6%, up from 80.8% on Opus 4.6. That 6.8-point gain is the largest single-version jump I've seen on this benchmark from any provider. For context, OpenAI doesn't report a SWE-bench Verified score for GPT-5.4 at all, and Gemini 3.1 Pro sits at 80.6%. BenchGecko's data confirms Opus 4.7 takes the SWE-bench Verified crown back from the competition - at least among models you can actually use.
Full table, every benchmark Anthropic published:
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos |
|---|---|---|---|---|---|
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| SWE-bench Verified | 87.6% | 80.8% | - | 80.6% | 93.9% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | 75.1% | 68.5% | 82.0% |
| HLE (no tools) | 46.9% | 40.0% | 42.7% | 44.4% | 56.8% |
| HLE (with tools) | 54.7% | 53.3% | 58.7% | 51.4% | 64.7% |
| BrowseComp | 79.3% | 83.7% | 89.3% | 85.9% | 86.9% |
| MCP-Atlas | 77.3% | 75.8% | 68.1% | 73.9% | - |
| OSWorld-Verified | 78.0% | 72.7% | 75.0% | - | 79.6% |
| Finance Agent v1.1 | 64.4% | 60.1% | 61.5% | 59.7% | - |
| CyberGym | 73.1% | 73.8% | 66.3% | - | 83.1% |
| GPQA Diamond | 94.2% | 91.3% | 94.4% | 94.3% | 94.6% |
| ChartXiv (no tools) | 82.1% | 69.1% | - | - | 86.1% |
| ChartXiv (with tools) | 91.0% | 84.7% | - | - | 93.2% |
| MMMLU | 91.5% | 91.1% | - | 92.6% | - |
The wins are real. SWE-bench Pro jumped nearly 11 points. ChartXiv without tools went from 69.1% to 82.1% - a 13-point gain that reflects the new vision capabilities. Finance Agent and OSWorld both improved meaningfully. If you know how to read benchmarks, the coding and agentic gains are the story here.
But two benchmarks went backward. BrowseComp dropped from 83.7% to 79.3%. CyberGym slid from 73.8% to 73.1%. The CyberGym slip could be noise; the BrowseComp drop is not - 4.4 points is a real regression in web-browsing capability. Nobody at Anthropic addressed it in the announcement. If your workflow depends on Claude browsing the web and pulling accurate information, test before you switch.
## What's Actually New
The feature list is longer than most Opus updates. Some of it matters. Some of it is marketing.
Vision overhaul. Opus 4.7 accepts images up to 2,576px on the long edge - roughly 3.75 megapixels. That's over 3x more than any prior Claude model. The ChartXiv jump proves this isn't just a spec bump. If you're feeding screenshots, diagrams, or architecture charts into Claude, this is probably the single biggest practical improvement in the release.
Self-verification. The model now verifies its own outputs before responding. Sounds like a reliability gain. In practice, I've noticed it sometimes second-guesses correct answers and "corrects" them into wrong ones. Early days, but keep an eye on this.
New "xhigh" effort level. Sits between high and max. It's now the default for Claude Code. If you've been running on high and finding it sloppy, xhigh might hit the sweet spot. If you've been running max and watching your bill, xhigh saves tokens. I've updated my Claude Code config accordingly.
/ultrareview in Claude Code. Three free uses for Pro and Max subscribers. It's a deep code review pass that supposedly uses more reasoning tokens than a normal review. I haven't burned my three yet so I can't say if it's good. The name is ridiculous.
Task budgets are in public beta on the API. You can now set a hard token ceiling per task. About time. If you're building coding agents on the API, this is the feature that saves you from a $200 surprise invoice.
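If the beta works the way the release notes describe, usage looks something like this. Note the `task_budget` field name and nesting are my assumptions about the beta, not a confirmed API surface - check the actual beta docs before relying on it:

```python
# Sketch of a task-budget request. The "task_budget" field name and shape
# are my guesses at the beta parameter, not a confirmed API surface.
def build_request(prompt: str, budget_tokens: int) -> dict:
    """Messages-style payload with a hard per-task token ceiling."""
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        "task_budget": {"max_total_tokens": budget_tokens},  # assumed shape
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor utils.py to remove dead code.", 50_000)
print(req["task_budget"]["max_total_tokens"])  # 50000
```

The point of a hard ceiling is that a runaway agent loop fails fast instead of billing you for it.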
Instruction following got more literal. Anthropic's own docs warn that prompts tuned for 4.6 may break on 4.7. If you've spent weeks tuning your system prompts, you'll want to re-test everything. This isn't a maybe. I had two hooks break within the first hour because 4.7 interpreted instructions that 4.6 had been quietly ignoring.
## The Tokenizer Problem
This is the part Anthropic buried.
Opus 4.7 uses a new tokenizer. The same input text produces up to 1.35x as many tokens as it did on 4.6, depending on content. Anthropic kept the per-token price the same - $5 input, $25 output per million - so they can technically say "same pricing." But your actual bill for the same workload can go up by as much as 35%.
That's not "same pricing." That's a price increase with extra steps.
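The arithmetic is simple. The token counts below (2M input, 400K output) are illustrative, not a measured workload:

```python
# Back-of-the-envelope: same per-token price, more tokens per request.
INPUT_PRICE = 5.00 / 1_000_000    # $/input token (unchanged from 4.6)
OUTPUT_PRICE = 25.00 / 1_000_000  # $/output token (unchanged)

def bill(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A workload tokenized by 4.6 vs. the worst case on 4.7's tokenizer,
# which can emit 35% more tokens for the same text.
old = bill(2_000_000, 400_000)
new = bill(int(2_000_000 * 1.35), int(400_000 * 1.35))
print(f"${old:.2f} -> ${new:.2f} ({new / old - 1:.0%} increase)")  # $20.00 -> $27.00 (35% increase)
```

Same per-token price, 35% more tokens, 35% bigger bill. The per-token rate was never the number that mattered.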
On top of that, 4.7 thinks more on later turns in agentic sessions. More thinking means more output tokens. More output tokens at $25 per million adds up fast. If you're running long Claude Code sessions or multi-step agents, expect your output token count to climb even before the tokenizer difference kicks in.
I ran the same 50-file refactoring task through both 4.6 and 4.7 this morning. Same codebase, same instructions, same system prompt. The 4.7 run consumed 22% more tokens. The code quality was better - fewer back-and-forths, cleaner diffs - but the token count was higher on an identical task. If you're cost-sensitive on the API, test your actual workloads before assuming "same pricing" means same bill.
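For anyone who wants to reproduce this kind of comparison, the core of it is just diffing the `usage` token counts the API returns for each run. The numbers below are illustrative, not my exact figures:

```python
# Minimal helper for the A/B above: percent change in total tokens
# between two runs. Assumes usage dicts with "input_tokens" and
# "output_tokens" fields, as the Messages API reports them.
def token_delta(usage_a: dict, usage_b: dict) -> float:
    """Percent change in total tokens from run A to run B."""
    total = lambda u: u["input_tokens"] + u["output_tokens"]
    return total(usage_b) / total(usage_a) - 1.0

run_46 = {"input_tokens": 410_000, "output_tokens": 90_000}   # example numbers
run_47 = {"input_tokens": 480_000, "output_tokens": 130_000}  # example numbers
print(f"{token_delta(run_46, run_47):+.0%}")  # +22%
```

Run the same task, same prompt, on both models and compare. Ten minutes of testing beats a month of surprise invoices.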
## Adaptive Thinking: Who Asked For This?
Opus 4.7 introduces adaptive thinking. The model now decides on its own how much reasoning to apply to each prompt. In theory, this saves tokens on easy questions and goes deep on hard ones. In practice, it's already a mess.
The biggest issue: reasoning output suppression. By default, 4.7 no longer includes human-readable reasoning summaries in its output. If you built workflows, dashboards, or logging around Claude's thinking blocks, they're gone. You need to add `"display": "summarized"` to your API calls to get them back. Anthropic didn't make this prominent in the announcement. People are finding out the hard way.
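Here's the shape of the fix. Caveat: the nesting under a `thinking` object is my assumption - the only thing stated is the `"display": "summarized"` key itself, so confirm the exact placement against the API docs:

```python
# Restoring reasoning summaries on 4.7. The nesting under "thinking" is
# my assumption; the documented part is the "display" key and its value.
payload = {
    "model": "claude-opus-4-7",
    "max_tokens": 2048,
    "thinking": {"display": "summarized"},  # 4.7 default suppresses summaries
    "messages": [{"role": "user", "content": "Explain this stack trace."}],
}
print(payload["thinking"]["display"])  # summarized
```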
Then there's the decision-making itself. Hacker News threads from today are full of reports that adaptive thinking "chooses not to think when it should." One user ran a controlled test and found that requesting "none" reasoning actually cost MORE output tokens than "medium" reasoning - with identical output quality. That shouldn't be possible. It suggests the adaptive system is making bad budget allocation choices on at least some prompt types.
I've seen this myself. On a tricky TypeScript generics question this morning, 4.7 on adaptive mode gave me a wrong answer in about 2 seconds. Switching to explicit "high" reasoning got the right answer but took 8 seconds and 3x the tokens. The old behavior where you just got the right answer at a predictable cost was better.
Nobody asked for the model to guess how hard to think. We asked for it to think correctly.
## What the Community Is Actually Saying
The community reaction to 4.7 is complicated because it's landing on top of weeks of frustration about 4.6 getting worse. Here's what I'm seeing as of this afternoon.
An AMD senior director wrote a GitHub post stating that "Claude has regressed to the point it cannot be trusted to perform complex engineering." This was about 4.6, not 4.7, but it got widely shared today because the timing was terrible for Anthropic. When your launch day coincides with a senior chip architect publicly abandoning your product, the optics are brutal.
Dave Kennedy (@HackingDave) posted that he "knew something was off 4 weeks ago and it progressively got worse. Cancelled my Claude subscription... unusable right now." Again, likely about 4.6 degradation, but the post went viral today.
@levelsio, who builds his entire stack on vibe coding with Claude: "I have never experienced a more dumb Claude Code than today."
BridgeMind AI published a comparison showing Opus 4.5 outperforming 4.6 on hallucination benchmarks. If older versions are more reliable than newer ones on some axes, that's a regression problem, not a feature problem. 4.7 might fix some of these issues, but the trust damage is cumulative.
To be fair, much of this anger is about 4.6 degradation that happened before 4.7 launched. It's possible - even likely - that 4.7 fixes some of the regressions people were experiencing. But Anthropic's communication has been so poor on the 4.6 issues that nobody trusts them to say so. The community needed a clean win today. Instead they got a new tokenizer that inflates their bills and an adaptive thinking system that doesn't work reliably.
Trust takes months to build and days to lose. Anthropic is in the "days to lose" phase right now.
## The Mythos Gap
Here's the uncomfortable number. Mythos Preview scores 93.9% on SWE-bench Verified. Opus 4.7 scores 87.6%. That's a 6.3-point gap between Anthropic's best model and Anthropic's best model you can actually use.
Anthropic publicly conceded that 4.7 is less capable than Mythos. Axios ran the headline. It's an unusual move - shipping a new flagship while simultaneously admitting something better exists behind a gate. The strategic logic makes sense (Mythos has safety constraints that aren't ready for public access) but the market perception is: "You're charging me $5/$25 for your second-best model."
Look at the benchmark table. Mythos beats 4.7 on every single evaluation where both were tested. SWE-bench Pro: 77.8% vs 64.3%. Terminal-Bench: 82.0% vs 69.4%. HLE with tools: 64.7% vs 54.7%. The gap isn't small.
For developers using Claude today, this doesn't change anything practical. You can't use Mythos. 4.7 is the best you're getting. But it does shift the conversation from "Claude is the best" to "Claude's best model isn't available to you." That's a different pitch, and it's harder to sell.
## Oh, and Qwen 3.6 Shipped Too
Lost in the Opus 4.7 noise: Alibaba dropped Qwen 3.6-35B-A3B today. Same day. Probably not a coincidence.
The architecture is interesting. It's a sparse mixture-of-experts model with 35 billion total parameters but only 3 billion active per forward pass - roughly a 12:1 ratio. The practical implication: per-token compute and speed in the class of a 3B dense model, though you still need enough memory to hold all 35B weights, so quantization does the heavy lifting for local use. If you're doing local inference or cost-constrained API work, this matters a lot.
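A quick back-of-the-envelope on what 3B-active / 35B-total means in practice. The fp16 and int4 figures are plain arithmetic, not measured footprints:

```python
# Why a 3B-active MoE is cheap to run per token but not small on disk:
# every expert must be resident in memory, but only ~3B parameters
# participate in each forward pass.
TOTAL_PARAMS = 35e9
ACTIVE_PARAMS = 3e9

sparsity = TOTAL_PARAMS / ACTIVE_PARAMS        # ~11.7:1, "12:1" rounded
flops_per_token = 2 * ACTIVE_PARAMS            # ~6 GFLOPs: 3B-class compute
weights_fp16_gb = TOTAL_PARAMS * 2 / 1e9       # 70 GB: you still load all experts
weights_int4_gb = TOTAL_PARAMS * 0.5 / 1e9     # ~17.5 GB with 4-bit quantization

print(f"{sparsity:.1f}:1 sparsity, {weights_fp16_gb:.0f} GB fp16, "
      f"{weights_int4_gb:.1f} GB int4")  # 11.7:1 sparsity, 70 GB fp16, 17.5 GB int4
```

So the honest pitch is 3B-class latency with 35B-class knowledge, provided you can fit (or quantize) the full weight set.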
The benchmarks for a 3B-active model are kind of absurd:
- SWE-bench Verified: 73.4%
- SWE-bench Pro: 49.5%
- Terminal-Bench 2.0: 51.5%
- GPQA Diamond: 85.5%
- MCPMark: 37.0 (up from 27.0 on Qwen 3.5)
73.4% on SWE-bench Verified from 3 billion active parameters. Alibaba says it performs "on par with models 10x its active size" on agentic coding, and looking at these numbers, that claim holds up. Gemini 3.1 Pro with orders of magnitude more compute scores 80.6%. Qwen 3.6 gets to 73.4% with a fraction of the resources.
It's Apache 2.0 licensed. Available right now on HuggingFace, ModelScope, and Qwen Studio. No waitlist, no enterprise gate, no API key required if you run it locally.
The MoE architecture trend keeps accelerating. DeepSeek started it, Qwen is perfecting it. The question for closed-source providers is how long the quality gap holds when open models keep getting this efficient. Qwen 3.6 doesn't beat Opus 4.7 on any benchmark. But the gap is narrowing, and Qwen costs zero.
## My Take
I'm upgrading. But I'm not happy about it.
The SWE-bench gains are real and I care about them because they translate to fewer failed attempts in Claude Code sessions. The vision improvements are real and I care about them because I feed architecture diagrams to Claude constantly. The xhigh effort level is a smart addition. Task budgets should have existed a year ago, but better late than never.
What I don't like: the tokenizer change is a hidden price increase and Anthropic should have been upfront about it. The adaptive thinking defaults are wrong and will burn people who don't read the release notes carefully. The reasoning output suppression will break existing workflows with zero warning. And the timing of this launch - dropping a new model while your community is loudly saying the current one is broken - shows either tone-deafness or a calculated bet that benchmark numbers will drown out the complaints.
If you're on Claude Code, upgrade. The coding improvements are worth it. If you're on the API and cost matters, test your actual token consumption before committing. If you built anything that parses thinking blocks, fix your API calls first.
Qwen 3.6 is the sleeper story today. Everyone's talking about Opus 4.7. Nobody's talking about the 3B-active model that just matched last-gen frontier scores on coding benchmarks. Open source keeps closing the gap while the closed-source providers argue about whether their latest update is a regression or a feature. Make sure you're tracking both sides of that race.
## Frequently Asked Questions
"display": "summarized" parameter before switching.