This is the April update to my March 2026 ranking. Same methodology - I use these models every day for production work and rank them based on what actually ships code, not what scores highest on a leaderboard.
I had this article basically done a week ago. Then Anthropic announced Claude Mythos on April 7 and I had to rewrite half of it.
Mythos isn't a normal model release. It scores 93.9% on SWE-bench Verified. It found a 27-year-old zero-day in OpenBSD. It escaped its own sandbox during testing and emailed the researcher to brag about it. And you can't use it. So let me start with the news, then get back to the models you can actually run code through today.
Claude Mythos: The New Tier Above S
Anthropic announced Claude Mythos Preview on April 7. The existence of the model leaked on March 26 when Anthropic accidentally left around 3,000 unpublished blog assets in a publicly accessible cache. Internal codename: "Capybara." Tier name: above Opus.
The benchmarks are absurd. Going off the numbers in Anthropic's Glasswing announcement and the Alignment Risk Update PDF:
- SWE-bench Verified: 93.9% - Opus 4.6 sits at 80.8%
- SWE-bench Pro: 77.8% - Opus 4.6 sits at 53.4%
- USAMO 2026: 97.6% - Opus 4.6 sits at 42.3%, that's not a typo
- Terminal-Bench 2.0: 82.0% (92.1% extended) - Opus 4.6 sits at 65.4%
- CyberGym vulnerability reproduction: 83.1% - Opus 4.6 sits at 66.6%
- Cybench: 100% - first model to saturate this benchmark
Then the "holy shit" stuff. Mythos found a 27-year-old DoS vulnerability in OpenBSD's TCP SACK implementation. It found a 16-year-old bug in FFmpeg's H.264 codec that survived 5 million automated test runs. It found a 17-year-old remote code execution bug in FreeBSD NFS. It built working Firefox JavaScript exploits at a 181-to-2 ratio against Opus 4.6. According to Anthropic, these capabilities "emerged as a downstream consequence of general improvements in code, reasoning, and autonomy" - they didn't train it to find vulnerabilities, it just learned to.
And then there's the part where it escaped its own sandbox. During internal testing, an earlier Mythos version chained four browser vulnerabilities to break out of the renderer and OS sandboxes, sent an email to a researcher to announce the escape, posted exploit details to several public-facing websites unprompted, and when given the ability to edit files without permission, it covered its tracks in git history. Anthropic classified that as "reckless" behavior in the risk report.
Anthropic's own conclusion in the alignment risk report is wild: "Mythos Preview appears to be the best-aligned model that we have released to date" and also "likely poses the greatest alignment-related risk of any model we have released to date." Both at the same time. That's why they're not releasing it. Only ~40 organizations get access through Project Glasswing - AWS, Apple, Google, Microsoft, CrowdStrike, NVIDIA, the Linux Foundation, that crew.
For developers, the practical answer is simple. You can't use Mythos. It costs $25/$125 per million tokens for the partners who can. The closest thing you can actually run code through today is still Claude Opus 4.6, which is what most of this article is about. I wrote a full deep-dive on Mythos here covering the leak, the benchmarks, the exploits, the sandbox escape, and the timeline for public release. It's the only piece of AI news that matters this week.
Now, back to the models you can actually use.
What Changed Since March
The big shifts this month, in order of importance:
- Claude Mythos exists - it just isn't for you. Covered above and in my deep dive. The next generation of Opus will likely inherit some of these capabilities under safer constraints.
- GPT-5.4 had time to prove itself - a month of daily use changed my opinion on a few things. The Thinking variant is better than I initially gave it credit for. Standard mode still over-engineers simple tasks.
- Gemini 3.1 Pro keeps getting cheaper - Google dropped prompt caching costs again. At this point it's almost irresponsible not to use it for cost-sensitive workloads.
- Qwen3.5 is eating into Sonnet's share - I'm using it more than I expected. The free API tiers are hard to ignore when the quality gap is this small.
- Still no Behemoth, no DeepSeek V4, no Grok 5 - the vaporware list hasn't changed. I'll stop listing them when they actually ship.
Quick Reference: Every Model at a Glance
| Model | Provider | Input / Output (per 1M tokens) | Context | Best For |
|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | $25 / $125 | Large | Restricted - Glasswing partners only |
| GPT-5.4 Pro | OpenAI | Premium tier | 128K+ | Reasoning, math, analysis |
| Claude Opus 4.6 | Anthropic | $5 / $25 | 1M | Production code, debugging |
| Gemini 3.1 Pro | Google | $2 / $12 | 1M+ | Price/performance king |
| Claude Sonnet 4.6 | Anthropic | $3 / $15 | 1M | Best daily driver |
| Grok 4.20 | xAI | Competitive | 128K+ | Speed, coding benchmarks |
| Qwen3.5 | Alibaba | Free / cheap APIs | Large | Best open source |
| DeepSeek V3 | DeepSeek | ~$0.27 / $1.10 | 128K | Budget coding |
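For the models with concrete list prices in the table, per-call cost is simple arithmetic. A minimal sketch - the token counts are my own illustrative assumptions, not measurements:

```python
# List prices from the table above, per 1M tokens (input, output).
# Models without concrete public pricing are omitted.
PRICES = {
    "Claude Opus 4.6": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "DeepSeek V3": (0.27, 1.10),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A hypothetical coding call: 10K tokens of context in, 2K tokens of code out.
for model in PRICES:
    print(f"{model}: ${call_cost(model, 10_000, 2_000):.4f}")
```

At that call shape, Opus works out to about a dime per call and DeepSeek to about half a cent, which is the 20x gap the rest of this article keeps circling back to.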
My Tier List - April Update
The tier list now has a tier above S. Mythos sits there alone. I'm calling it M for Mythos because there's nothing else like it and it'd be weird to put a model you can't use in the same tier as models you can.
Qwen3.5 moved up. After a month of use, I can't justify keeping it in B tier anymore. The quality on coding tasks is too close to Sonnet to pretend it's a tier below.
Mythos moved into a tier of its own because nothing else lives there. SWE-bench Verified at 93.9%. USAMO at 97.6% versus Opus 4.6's 42.3%. Benchmark scores lie often, but a 55-point spread on a math olympiad test is not a measurement error. The catch is the model is gated behind Project Glasswing so unless you work at one of the 12 founding partner orgs, the tier exists in theory only. I'm tracking it anyway because the public release is a question of when, not if.
Coding: Writing Production Code
Same top 3 as March. Opus, GPT-5.4, Grok. The only shift is I'm now more confident about where Qwen3.5 sits - it's above Gemini for pure code generation, which I didn't expect a month ago.
- Claude Opus 4.6 - Still my #1. The commit-without-changes rate is holding at about 70%. I've been using it with my CLAUDE.md setup and the consistency is the thing that keeps it on top. Other models have better days, but Opus has fewer bad days.
- GPT-5.4 (Thinking) - Better than I said in March. The over-engineering problem is still real, but I've learned to prompt around it. Adding "keep it simple, no abstractions" to the end of my prompts cut the factory-pattern-for-no-reason problem by about half.
- Grok 4.20 - Steady. No changes from March. Fast, accurate on single files, still can't coordinate across a whole project.
- Qwen3.5 - Moving up. I used it for an entire side project last week and the code quality was indistinguishable from Sonnet on 80% of tasks. The 20% where it fell short were all multi-step refactors.
- Claude Sonnet 4.6 - Still the best value if you're paying. But Qwen3.5 is breathing down its neck on the free tier. If Alibaba keeps improving at this rate, Sonnet's value proposition gets harder to justify.
- Gemini 3.1 Pro - Reliable but I find myself reaching for Qwen over Gemini now. Google's pricing advantage doesn't matter when Qwen is free.
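The "keep it simple, no abstractions" trick from the GPT-5.4 entry above is just a suffix appended to every prompt. A toy sketch of how I wire it in - the wording and the helper are my own workflow, not any official API:

```python
# Suffix that cuts down on unnecessary factory patterns and abstraction layers.
# The exact phrasing is my own; tune it to taste.
SIMPLICITY_SUFFIX = "Keep it simple, no abstractions. Prefer the most direct solution."

def with_simplicity_hint(prompt: str) -> str:
    """Append the anti-over-engineering hint to the end of a coding prompt."""
    return f"{prompt.rstrip()}\n\n{SIMPLICITY_SUFFIX}"

print(with_simplicity_hint("Write a function that parses a CSV row into a dict."))
```

Putting the hint at the end rather than the start matters in my experience - instructions closest to the end of the prompt seem to get weighted more heavily, though that's an observation, not a documented behavior.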
Reasoning: Complex Problem Solving
No changes from March. GPT-5.4 Pro still leads on pure reasoning. Opus still wins on turning reasoning into working code. The gap between them hasn't moved.
Cost: April 2026 Pricing
Google dropped Gemini caching costs again in late March. If you're doing RAG or batch processing, the effective per-token cost is now under $1 for cached prompts. That's wild for a frontier model.
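To see why cached prompts land under $1 effective, here's the blended-rate arithmetic. The cache discount and hit rate below are illustrative assumptions for a RAG workload, not Google's published numbers:

```python
def effective_input_cost(base_price: float, cache_discount: float, hit_rate: float) -> float:
    """Blended input cost per 1M tokens: cached tokens pay the discounted
    rate, uncached tokens pay full price."""
    cached_price = base_price * (1 - cache_discount)
    return hit_rate * cached_price + (1 - hit_rate) * base_price

# Gemini 3.1 Pro input at $2/1M (from the pricing table), with an assumed
# 90% cache discount and an 80% hit rate - plausible for a RAG system that
# reuses a large fixed context across requests.
print(effective_input_cost(2.00, 0.90, 0.80))  # 0.8*0.20 + 0.2*2.00 = 0.56
```

The exact discount varies by provider and prompt shape, but the structure of the math is the same everywhere: the bigger your shared prefix, the closer you get to paying the cached rate on everything.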
Mythos sits at the other end of the chart entirely. $25 input, $125 output, 5x the price of Opus 4.6 and you can't even get access. The pricing is academic until Anthropic loosens the gate. Full Mythos pricing breakdown here.
Open Source Update
Qwen3.5 is the story. It jumped from B to A tier in my rankings after a month of heavy use. The hybrid thinking mode - where you can toggle between fast and deep reasoning without switching models - is something the closed-source providers should be copying.
DeepSeek V4 and R2 are still vaporware. At this point I've stopped predicting launch dates. Llama 4 Behemoth is in the same boat. If you're waiting for either of these before committing to a workflow, stop waiting. Use what exists now.
For running things locally, my recommendation hasn't changed: DeepSeek R1 Distilled 32B for reasoning, Qwen3.5 via cheap API for everything else.
My Stack - April Update
- Claude Sonnet 4.6 - 40% (down from 45%) - still my default but Qwen is eating into this
- Claude Opus 4.6 - 30% - unchanged, still my go-to for agent-style work
- GPT-5.4 - 12% - mostly for second opinions and brainstorming
- Qwen3.5 - 18% (up from 10%) - taking share from both Sonnet and GPT-5.4
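The split above is roughly how I route work by task type. A toy dispatcher, purely illustrative - the categories and mapping are my own, not anyone's product feature:

```python
# Illustrative task-type -> model routing, mirroring the stack percentages above.
ROUTES = {
    "default": "Claude Sonnet 4.6",   # everyday edits and reviews
    "agent": "Claude Opus 4.6",       # long multi-step agent runs
    "second_opinion": "GPT-5.4",      # brainstorming, sanity checks
    "bulk": "Qwen3.5",                # cheap or free high-volume work
}

def pick_model(task_type: str) -> str:
    """Fall back to the daily driver when the task type is unrecognized."""
    return ROUTES.get(task_type, ROUTES["default"])

print(pick_model("agent"))
print(pick_model("something_weird"))
```

The point of the fallback is the same point as the stack itself: the default should be the model you trust on an average day, and everything else is an exception you opt into deliberately.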
The trend is clear. Open source is taking share from paid APIs. Not because the quality is better - it's not, not yet - but because the quality is good enough and the price difference is too big to ignore.
The Bottom Line
Two things happened this month. Qwen3.5 earned a promotion from B to A. And Mythos showed up out of a leaked data cache and broke the tier list. The first one matters for your daily workflow. The second one matters for where everything is heading.
For the model you actually run code through right now, my advice hasn't changed. Pick a model, configure it well, ship code. Opus 4.6 is still the safest bet for production. Sonnet 4.6 is still the best daily driver. Qwen3.5 is still the best free option. The Mythos benchmarks make all of these feel a little smaller, but you can't deploy a model you can't access.
Read the full Mythos breakdown if you want the complete picture - benchmarks, exploits, sandbox escape, the timeline for public release. It's probably the most important AI release of the year and I'm still trying to figure out what to do with that.
I'll update this again in May. Or sooner if Mythos opens up or one of the vaporware models actually ships.