Anthropic shipped Claude Opus 4.7 today. I've spent the morning reading the announcement, running it through my usual tests, and watching the community react. The benchmarks are real improvements. The coding gains are significant. But there's a pricing trick buried in the release notes, a reasoning system nobody asked for, and a community that was already losing patience before 4.7 even dropped.
Here's the honest version of this launch.
## The Benchmarks
The headline number is SWE-bench Verified: 87.6%, up from 80.8% on Opus 4.6. That 6.8-point gain is the largest single-version jump I've seen on this benchmark from any provider. For context, OpenAI doesn't report a SWE-bench Verified score for GPT-5.4 at all, and Gemini 3.1 Pro sits at 80.6%. BenchGecko's data confirms Opus 4.7 takes the SWE-bench Verified crown back from the competition - at least among models you can actually use.
Full table, every benchmark Anthropic published:
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos |
|---|---|---|---|---|---|
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| SWE-bench Verified | 87.6% | 80.8% | - | 80.6% | 93.9% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | 75.1% | 68.5% | 82.0% |
| HLE (no tools) | 46.9% | 40.0% | 42.7% | 44.4% | 56.8% |
| HLE (with tools) | 54.7% | 53.3% | 58.7% | 51.4% | 64.7% |
| BrowseComp | 79.3% | 83.7% | 89.3% | 85.9% | 86.9% |
| MCP-Atlas | 77.3% | 75.8% | 68.1% | 73.9% | - |
| OSWorld-Verified | 78.0% | 72.7% | 75.0% | - | 79.6% |
| Finance Agent v1.1 | 64.4% | 60.1% | 61.5% | 59.7% | - |
| CyberGym | 73.1% | 73.8% | 66.3% | - | 83.1% |
| GPQA Diamond | 94.2% | 91.3% | 94.4% | 94.3% | 94.6% |
| ChartXiv (no tools) | 82.1% | 69.1% | - | - | 86.1% |
| ChartXiv (with tools) | 91.0% | 84.7% | - | - | 93.2% |
| MMMLU | 91.5% | 91.1% | - | 92.6% | - |
The wins are real. SWE-bench Pro jumped nearly 11 points. ChartXiv without tools went from 69.1% to 82.1% - a 13-point gain that reflects the new vision capabilities. Finance Agent and OSWorld both improved meaningfully. If you know how to read benchmarks, the coding and agentic gains are the story here.
But two benchmarks went backward. BrowseComp dropped from 83.7% to 79.3%. CyberGym slid from 73.8% to 73.1%. The CyberGym slip could be noise; the BrowseComp drop is not - 4.4 points is a real regression in web-browsing capability. Nobody at Anthropic addressed it in the announcement. If your workflow depends on Claude browsing the web and pulling accurate information, test before you switch.
## What's Actually New
The feature list is longer than most Opus updates. Some of it matters. Some of it is marketing.
Vision overhaul. Opus 4.7 accepts images up to 2,576px on the long edge - roughly 3.75 megapixels. That's over 3x more than any prior Claude model. The ChartXiv jump proves this isn't just a spec bump. If you're feeding screenshots, diagrams, or architecture charts into Claude, this is probably the single biggest practical improvement in the release.
Self-verification. The model now verifies its own outputs before responding. Sounds like a reliability gain. In practice, I've noticed it sometimes second-guesses correct answers and "corrects" them into wrong ones. Early days, but keep an eye on this.
New "xhigh" effort level. Sits between high and max. It's now the default for Claude Code. If you've been running on high and finding it sloppy, xhigh might hit the sweet spot. If you've been running max and watching your bill, xhigh saves tokens. I've updated my Claude Code config accordingly.
/ultrareview in Claude Code. Three free uses for Pro and Max subscribers. It's a deep code review pass that supposedly uses more reasoning tokens than a normal review. I haven't burned my three yet so I can't say if it's good. The name is ridiculous.
Task budgets are in public beta on the API. You can now set a hard token ceiling per task. About time. If you're building coding agents on the API, this is the feature that saves you from a $200 surprise invoice.
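If the beta works the way the release notes describe, usage looks something like this. Note the `task_budget` field name and nesting are my assumptions about the beta, not a confirmed API surface - check the actual beta docs before relying on it:

```python
# Sketch of a task-budget request. The "task_budget" field name and shape
# are my guesses at the beta parameter, not a confirmed API surface.
def build_request(prompt: str, budget_tokens: int) -> dict:
    """Messages-style payload with a hard per-task token ceiling."""
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        "task_budget": {"max_total_tokens": budget_tokens},  # assumed shape
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor utils.py to remove dead code.", 50_000)
print(req["task_budget"]["max_total_tokens"])  # 50000
```

The point of a hard ceiling is that a runaway agent loop fails fast instead of billing you for it.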
Instruction following got more literal. Anthropic's own docs warn that prompts tuned for 4.6 may break on 4.7. If you've spent weeks tuning your system prompts, you'll want to re-test everything. This isn't a maybe. I had two hooks break within the first hour because 4.7 interpreted instructions that 4.6 had been quietly ignoring.
## The Tokenizer Problem
This is the part Anthropic buried.
Opus 4.7 uses a new tokenizer. The same input text produces up to 1.35x as many tokens as it did on 4.6, depending on content. Anthropic kept the per-token price the same - $5 input, $25 output per million - so they can technically say "same pricing." But your actual bill for the same workload can go up by as much as 35%.
That's not "same pricing." That's a price increase with extra steps.
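The arithmetic is simple. The token counts below (2M input, 400K output) are illustrative, not a measured workload:

```python
# Back-of-the-envelope: same per-token price, more tokens per request.
INPUT_PRICE = 5.00 / 1_000_000    # $/input token (unchanged from 4.6)
OUTPUT_PRICE = 25.00 / 1_000_000  # $/output token (unchanged)

def bill(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A workload tokenized by 4.6 vs. the worst case on 4.7's tokenizer,
# which can emit 35% more tokens for the same text.
old = bill(2_000_000, 400_000)
new = bill(int(2_000_000 * 1.35), int(400_000 * 1.35))
print(f"${old:.2f} -> ${new:.2f} ({new / old - 1:.0%} increase)")  # $20.00 -> $27.00 (35% increase)
```

Same per-token price, 35% more tokens, 35% bigger bill. The per-token rate was never the number that mattered.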
On top of that, 4.7 thinks more on later turns in agentic sessions. More thinking means more output tokens. More output tokens at $25 per million adds up fast. If you're running long Claude Code sessions or multi-step agents, expect your output token count to climb even before the tokenizer difference kicks in.
I ran the same 50-file refactoring task through both 4.6 and 4.7 this morning. Same codebase, same instructions, same system prompt. The 4.7 run consumed 22% more tokens. The code quality was better - fewer back-and-forths, cleaner diffs - but the token count was higher on an identical task. If you're cost-sensitive on the API, test your actual workloads before assuming "same pricing" means same bill.
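For anyone who wants to reproduce this kind of comparison, the core of it is just diffing the `usage` token counts the API returns for each run. The numbers below are illustrative, not my exact figures:

```python
# Minimal helper for the A/B above: percent change in total tokens
# between two runs. Assumes usage dicts with "input_tokens" and
# "output_tokens" fields, as the Messages API reports them.
def token_delta(usage_a: dict, usage_b: dict) -> float:
    """Percent change in total tokens from run A to run B."""
    total = lambda u: u["input_tokens"] + u["output_tokens"]
    return total(usage_b) / total(usage_a) - 1.0

run_46 = {"input_tokens": 410_000, "output_tokens": 90_000}   # example numbers
run_47 = {"input_tokens": 480_000, "output_tokens": 130_000}  # example numbers
print(f"{token_delta(run_46, run_47):+.0%}")  # +22%
```

Run the same task, same prompt, on both models and compare. Ten minutes of testing beats a month of surprise invoices.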
## Adaptive Thinking: Who Asked For This?
Opus 4.7 introduces adaptive thinking. The model now decides on its own how much reasoning to apply to each prompt. In theory, this saves tokens on easy questions and goes deep on hard ones. In practice, it's already a mess.
The biggest issue: reasoning output suppression. By default, 4.7 no longer includes human-readable reasoning summaries in its output. If you built workflows, dashboards, or logging around Claude's thinking blocks, they're gone. You need to add `"display": "summarized"` to your API calls to get them back. Anthropic didn't make this prominent in the announcement. People are finding out the hard way.
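Here's the shape of the fix. Caveat: the nesting under a `thinking` object is my assumption - the only thing stated is the `"display": "summarized"` key itself, so confirm the exact placement against the API docs:

```python
# Restoring reasoning summaries on 4.7. The nesting under "thinking" is
# my assumption; the documented part is the "display" key and its value.
payload = {
    "model": "claude-opus-4-7",
    "max_tokens": 2048,
    "thinking": {"display": "summarized"},  # 4.7 default suppresses summaries
    "messages": [{"role": "user", "content": "Explain this stack trace."}],
}
print(payload["thinking"]["display"])  # summarized
```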
Then there's the decision-making itself. Hacker News threads from today are full of reports that adaptive thinking "chooses not to think when it should." One user ran a controlled test and found that requesting "none" reasoning actually cost MORE output tokens than "medium" reasoning - with identical output quality. That shouldn't be possible. It suggests the adaptive system is making bad budget allocation choices on at least some prompt types.
I've seen this myself. On a tricky TypeScript generics question this morning, 4.7 on adaptive mode gave me a wrong answer in about 2 seconds. Switching to explicit "high" reasoning got the right answer but took 8 seconds and 3x the tokens. The old behavior where you just got the right answer at a predictable cost was better.
Nobody asked for the model to guess how hard to think. We asked for it to think correctly.
## What the Community Is Actually Saying
The community reaction to 4.7 is complicated because it's landing on top of weeks of frustration about 4.6 getting worse. Here's what I'm seeing as of this afternoon.
An AMD senior director wrote a GitHub post stating that "Claude has regressed to the point it cannot be trusted to perform complex engineering." This was about 4.6, not 4.7, but it got widely shared today because the timing was terrible for Anthropic. When your launch day coincides with a senior chip architect publicly abandoning your product, the optics are brutal.
Dave Kennedy (@HackingDave) posted that he "knew something was off 4 weeks ago and it progressively got worse. Cancelled my Claude subscription... unusable right now." Again, likely about 4.6 degradation, but the post went viral today.
@levelsio, who builds his entire stack on vibe coding with Claude: "I have never experienced a more dumb Claude Code than today."
BridgeMind AI published a comparison showing Opus 4.5 outperforming 4.6 on hallucination benchmarks. If older versions are more reliable than newer ones on some axes, that's a regression problem, not a feature problem. 4.7 might fix some of these issues, but the trust damage is cumulative.
To be fair, much of this anger is about 4.6 degradation that happened before 4.7 launched. It's possible - even likely - that 4.7 fixes some of the regressions people were experiencing. But Anthropic's communication has been so poor on the 4.6 issues that nobody trusts them to say so. The community needed a clean win today. Instead they got a new tokenizer that inflates their bills and an adaptive thinking system that doesn't work reliably.
Trust takes months to build and days to lose. Anthropic is in the "days to lose" phase right now.
## The Mythos Gap
Here's the uncomfortable number. Mythos Preview scores 93.9% on SWE-bench Verified. Opus 4.7 scores 87.6%. That's a 6.3-point gap between Anthropic's best model and Anthropic's best model you can actually use.
Anthropic publicly conceded that 4.7 is less capable than Mythos. Axios ran the headline. It's an unusual move - shipping a new flagship while simultaneously admitting something better exists behind a gate. The strategic logic makes sense (Mythos has safety constraints that aren't ready for public access) but the market perception is: "You're charging me $5/$25 for your second-best model."
Look at the benchmark table. Mythos beats 4.7 on every single evaluation where both were tested. SWE-bench Pro: 77.8% vs 64.3%. Terminal-Bench: 82.0% vs 69.4%. HLE with tools: 64.7% vs 54.7%. The gap isn't small.
For developers using Claude today, this doesn't change anything practical. You can't use Mythos. 4.7 is the best you're getting. But it does shift the conversation from "Claude is the best" to "Claude's best model isn't available to you." That's a different pitch, and it's harder to sell.
## Oh, and Qwen 3.6 Shipped Too
Lost in the Opus 4.7 noise: Alibaba dropped Qwen 3.6-35B-A3B today. Same day. Probably not a coincidence.
The architecture is interesting. It's a sparse mixture-of-experts model with 35 billion total parameters but only 3 billion active per forward pass - roughly a 12:1 ratio. The practical implication: per-token compute and speed in the class of a 3B dense model, though you still need enough memory to hold all 35B weights, so quantization does the heavy lifting for local use. If you're doing local inference or cost-constrained API work, this matters a lot.
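A quick back-of-the-envelope on what 3B-active / 35B-total means in practice. The fp16 and int4 figures are plain arithmetic, not measured footprints:

```python
# Why a 3B-active MoE is cheap to run per token but not small on disk:
# every expert must be resident in memory, but only ~3B parameters
# participate in each forward pass.
TOTAL_PARAMS = 35e9
ACTIVE_PARAMS = 3e9

sparsity = TOTAL_PARAMS / ACTIVE_PARAMS        # ~11.7:1, "12:1" rounded
flops_per_token = 2 * ACTIVE_PARAMS            # ~6 GFLOPs: 3B-class compute
weights_fp16_gb = TOTAL_PARAMS * 2 / 1e9       # 70 GB: you still load all experts
weights_int4_gb = TOTAL_PARAMS * 0.5 / 1e9     # ~17.5 GB with 4-bit quantization

print(f"{sparsity:.1f}:1 sparsity, {weights_fp16_gb:.0f} GB fp16, "
      f"{weights_int4_gb:.1f} GB int4")  # 11.7:1 sparsity, 70 GB fp16, 17.5 GB int4
```

So the honest pitch is 3B-class latency with 35B-class knowledge, provided you can fit (or quantize) the full weight set.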
The benchmarks for a 3B-active model are kind of absurd:
- SWE-bench Verified: 73.4%
- SWE-bench Pro: 49.5%
- Terminal-Bench 2.0: 51.5%
- GPQA Diamond: 85.5%
- MCPMark: 37.0 (up from 27.0 on Qwen 3.5)
73.4% on SWE-bench Verified from 3 billion active parameters. Alibaba says it performs "on par with models 10x its active size" on agentic coding, and looking at these numbers, that claim holds up. Gemini 3.1 Pro with orders of magnitude more compute scores 80.6%. Qwen 3.6 gets to 73.4% with a fraction of the resources.
It's Apache 2.0 licensed. Available right now on HuggingFace, ModelScope, and Qwen Studio. No waitlist, no enterprise gate, no API key required if you run it locally.
The MoE architecture trend keeps accelerating. DeepSeek started it, Qwen is perfecting it. The question for closed-source providers is how long the quality gap holds when open models keep getting this efficient. Qwen 3.6 doesn't beat Opus 4.7 on any benchmark. But the gap is narrowing, and Qwen costs zero.
## My Take
I'm upgrading. But I'm not happy about it.
The SWE-bench gains are real and I care about them because they translate to fewer failed attempts in Claude Code sessions. The vision improvements are real and I care about them because I feed architecture diagrams to Claude constantly. The xhigh effort level is a smart addition. Task budgets should have existed a year ago, but better late than never.
What I don't like: the tokenizer change is a hidden price increase and Anthropic should have been upfront about it. The adaptive thinking defaults are wrong and will burn people who don't read the release notes carefully. The reasoning output suppression will break existing workflows with zero warning. And the timing of this launch - dropping a new model while your community is loudly saying the current one is broken - shows either tone-deafness or a calculated bet that benchmark numbers will drown out the complaints.
If you're on Claude Code, upgrade. The coding improvements are worth it. If you're on the API and cost matters, test your actual token consumption before committing. If you built anything that parses thinking blocks, fix your API calls first.
Qwen 3.6 is the sleeper story today. Everyone's talking about Opus 4.7. Nobody's talking about the 3B-active model that just matched last-gen frontier scores on coding benchmarks. Open source keeps closing the gap while the closed-source providers argue about whether their latest update is a regression or a feature. Make sure you're tracking both sides of that race.
## Frequently Asked Questions
"display": "summarized" parameter before switching.