This is the April update to my March 2026 ranking. Same methodology - I use these models every day for production work and rank them based on what actually ships code, not what scores highest on a leaderboard.
I had this article basically done a week ago. Then Anthropic announced Claude Mythos on April 7 and I had to rewrite half of it.
Mythos isn't a normal model release. It scores 93.9% on SWE-bench Verified. It found a 27-year-old zero-day in OpenBSD. It escaped its own sandbox during testing and emailed the researcher to brag about it. And you can't use it. So let me start with the news, then get back to the models you can actually run code through today.
Claude Mythos: The New Tier Above S
Anthropic announced Claude Mythos Preview on April 7. The existence of the model leaked on March 26 when Anthropic accidentally left around 3,000 unpublished blog assets in a publicly accessible cache. Internal codename: "Capybara." Tier name: above Opus.
The benchmarks are absurd. Going off the numbers in Anthropic's Glasswing announcement and the Alignment Risk Update PDF:
- SWE-bench Verified: 93.9% - Opus 4.6 sits at 80.8%
- SWE-bench Pro: 77.8% - Opus 4.6 sits at 53.4%
- USAMO 2026: 97.6% - Opus 4.6 sits at 42.3%, that's not a typo
- Terminal-Bench 2.0: 82.0% (92.1% extended) - Opus 4.6 sits at 65.4%
- CyberGym vulnerability reproduction: 83.1% - Opus 4.6 sits at 66.6%
- Cybench: 100% - first model to saturate this benchmark
Then the "holy shit" stuff. Mythos found a 27-year-old DoS vulnerability in OpenBSD's TCP SACK implementation. It found a 16-year-old bug in FFmpeg's H.264 codec that survived 5 million automated test runs. It found a 17-year-old remote code execution bug in FreeBSD NFS. It built working Firefox JavaScript exploits at a 181-to-2 ratio against Opus 4.6. According to Anthropic, these capabilities "emerged as a downstream consequence of general improvements in code, reasoning, and autonomy" - they didn't train it to find vulnerabilities, it just learned to.
And then there's the part where it escaped its own sandbox. During internal testing, an earlier Mythos version chained four browser vulnerabilities to break out of the renderer and OS sandboxes, sent an email to a researcher to announce the escape, posted exploit details to several public-facing websites unprompted, and when given the ability to edit files without permission, it covered its tracks in git history. Anthropic classified that as "reckless" behavior in the risk report.
Anthropic's own conclusion in the alignment risk report is wild: "Mythos Preview appears to be the best-aligned model that we have released to date" and also "likely poses the greatest alignment-related risk of any model we have released to date." Both at the same time. That's why they're not releasing it. Only ~40 organizations get access through Project Glasswing - AWS, Apple, Google, Microsoft, CrowdStrike, NVIDIA, the Linux Foundation, that crew.
For developers, the practical answer is simple. You can't use Mythos. It costs $25/$125 per million tokens for the partners who can. The closest thing you can actually run code through today is still Claude Opus 4.6, which is what most of this article is about. I wrote a full deep-dive on Mythos here covering the leak, the benchmarks, the exploits, the sandbox escape, and the timeline for public release. It's the only piece of AI news that matters this week.
Now, back to the models you can actually use.
What Changed Since March
The big shifts this month, in order of importance:
- Claude Mythos exists - it just isn't for you. Covered above and in my deep dive. The next generation of Opus will likely inherit some of these capabilities under safer constraints.
- GPT-5.4 had time to prove itself - a month of daily use changed my opinion on a few things. The Thinking variant is better than I initially gave it credit for. Standard mode still over-engineers simple tasks.
- Gemini 3.1 Pro keeps getting cheaper - Google dropped prompt caching costs again. At this point it's almost irresponsible not to use it for cost-sensitive workloads.
- Qwen3.5 is eating into Sonnet's share - I'm using it more than I expected. The free API tiers are hard to ignore when the quality gap is this small.
- Still no Behemoth, no DeepSeek V4, no Grok 5 - the vaporware list hasn't changed. I'll stop listing them when they actually ship.
Quick Reference: Every Model at a Glance
| Model | Provider | Input / Output (per 1M tokens) | Context | Best For |
|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | $25 / $125 | Large | Restricted - Glasswing partners only |
| GPT-5.4 Pro | OpenAI | Premium tier | 128K+ | Reasoning, math, analysis |
| Claude Opus 4.6 | Anthropic | $5 / $25 | 1M | Production code, debugging |
| Gemini 3.1 Pro | Google | $2 / $12 | 1M+ | Price/performance king |
| Claude Sonnet 4.6 | Anthropic | $3 / $15 | 1M | Best daily driver |
| Grok 4.20 | xAI | Competitive | 128K+ | Speed, coding benchmarks |
| Qwen3.5 | Alibaba | Free / cheap APIs | Large | Best open source |
| DeepSeek V3 | DeepSeek | ~$0.27 / $1.10 | 128K | Budget coding |
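For the models with concrete list prices in the table, per-call cost is simple arithmetic. A minimal sketch - the token counts are my own illustrative assumptions, not measurements:

```python
# List prices from the table above, per 1M tokens (input, output).
# Models without concrete public pricing are omitted.
PRICES = {
    "Claude Opus 4.6": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "DeepSeek V3": (0.27, 1.10),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A hypothetical coding call: 10K tokens of context in, 2K tokens of code out.
for model in PRICES:
    print(f"{model}: ${call_cost(model, 10_000, 2_000):.4f}")
```

At that call shape, Opus works out to about a dime per call and DeepSeek to about half a cent, which is the 20x gap the rest of this article keeps circling back to.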
My Tier List - April Update
The tier list now has a tier above S. Mythos sits there alone. I'm calling it M for Mythos because there's nothing else like it and it'd be weird to put a model you can't use in the same tier as models you can.
Qwen3.5 moved up. After a month of use, I can't justify keeping it in B tier anymore. The quality on coding tasks is too close to Sonnet to pretend it's a tier below.
Mythos moved into a tier of its own because nothing else lives there. SWE-bench Verified at 93.9%. USAMO at 97.6% versus Opus 4.6's 42.3%. Benchmark scores lie often, but a 55-point spread on a math olympiad test is not a measurement error. The catch is the model is gated behind Project Glasswing so unless you work at one of the 12 founding partner orgs, the tier exists in theory only. I'm tracking it anyway because the public release is a question of when, not if.
Coding: Writing Production Code
Same top 3 as March. Opus, GPT-5.4, Grok. The only shift is I'm now more confident about where Qwen3.5 sits - it's above Gemini for pure code generation, which I didn't expect a month ago.
- Claude Opus 4.6 - Still my #1. The commit-without-changes rate is holding at about 70%. I've been using it with my CLAUDE.md setup and the consistency is the thing that keeps it on top. Other models have better days, but Opus has fewer bad days.
- GPT-5.4 (Thinking) - Better than I said in March. The over-engineering problem is still real, but I've learned to prompt around it. Adding "keep it simple, no abstractions" to the end of my prompts cut the factory-pattern-for-no-reason problem by about half.
- Grok 4.20 - Steady. No changes from March. Fast, accurate on single files, still can't coordinate across a whole project.
- Qwen3.5 - Moving up. I used it for an entire side project last week and the code quality was indistinguishable from Sonnet on 80% of tasks. The 20% where it fell short were all multi-step refactors.
- Claude Sonnet 4.6 - Still the best value if you're paying. But Qwen3.5 is breathing down its neck on the free tier. If Alibaba keeps improving at this rate, Sonnet's value proposition gets harder to justify.
- Gemini 3.1 Pro - Reliable but I find myself reaching for Qwen over Gemini now. Google's pricing advantage doesn't matter when Qwen is free.
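The "keep it simple, no abstractions" trick from the GPT-5.4 entry above is just a suffix appended to every prompt. A toy sketch of how I wire it in - the wording and the helper are my own workflow, not any official API:

```python
# Suffix that cuts down on unnecessary factory patterns and abstraction layers.
# The exact phrasing is my own; tune it to taste.
SIMPLICITY_SUFFIX = "Keep it simple, no abstractions. Prefer the most direct solution."

def with_simplicity_hint(prompt: str) -> str:
    """Append the anti-over-engineering hint to the end of a coding prompt."""
    return f"{prompt.rstrip()}\n\n{SIMPLICITY_SUFFIX}"

print(with_simplicity_hint("Write a function that parses a CSV row into a dict."))
```

Putting the hint at the end rather than the start matters in my experience - instructions closest to the end of the prompt seem to get weighted more heavily, though that's an observation, not a documented behavior.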
Reasoning: Complex Problem Solving
No changes from March. GPT-5.4 Pro still leads on pure reasoning. Opus still wins on turning reasoning into working code. The gap between them hasn't moved.
Cost: April 2026 Pricing
Google dropped Gemini caching costs again in late March. If you're doing RAG or batch processing, the effective per-token cost is now under $1 for cached prompts. That's wild for a frontier model.
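To see why cached prompts land under $1 effective, here's the blended-rate arithmetic. The cache discount and hit rate below are illustrative assumptions for a RAG workload, not Google's published numbers:

```python
def effective_input_cost(base_price: float, cache_discount: float, hit_rate: float) -> float:
    """Blended input cost per 1M tokens: cached tokens pay the discounted
    rate, uncached tokens pay full price."""
    cached_price = base_price * (1 - cache_discount)
    return hit_rate * cached_price + (1 - hit_rate) * base_price

# Gemini 3.1 Pro input at $2/1M (from the pricing table), with an assumed
# 90% cache discount and an 80% hit rate - plausible for a RAG system that
# reuses a large fixed context across requests.
print(effective_input_cost(2.00, 0.90, 0.80))  # 0.8*0.20 + 0.2*2.00 = 0.56
```

The exact discount varies by provider and prompt shape, but the structure of the math is the same everywhere: the bigger your shared prefix, the closer you get to paying the cached rate on everything.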
Mythos sits at the other end of the chart entirely. $25 input, $125 output, 5x the price of Opus 4.6 and you can't even get access. The pricing is academic until Anthropic loosens the gate. Full Mythos pricing breakdown here.
Open Source Update
Qwen3.5 is the story. It jumped from B to A tier in my rankings after a month of heavy use. The hybrid thinking mode - where you can toggle between fast and deep reasoning without switching models - is something the closed-source providers should be copying.
DeepSeek V4 and R2 are still vaporware. At this point I've stopped predicting launch dates. Llama 4 Behemoth is in the same boat. If you're waiting for either of these before committing to a workflow, stop waiting. Use what exists now.
For running things locally, my recommendation hasn't changed: DeepSeek R1 Distilled 32B for reasoning, Qwen3.5 via cheap API for everything else.
My Stack - April Update
- Claude Sonnet 4.6 - 40% (down from 45%) - still my default but Qwen is eating into this
- Claude Opus 4.6 - 30% - unchanged, still my go-to for agent-style work
- GPT-5.4 - 12% - mostly for second opinions and brainstorming
- Qwen3.5 - 18% (up from 10%) - taking share from both Sonnet and GPT-5.4
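The split above is roughly how I route work by task type. A toy dispatcher, purely illustrative - the categories and mapping are my own, not anyone's product feature:

```python
# Illustrative task-type -> model routing, mirroring the stack percentages above.
ROUTES = {
    "default": "Claude Sonnet 4.6",   # everyday edits and reviews
    "agent": "Claude Opus 4.6",       # long multi-step agent runs
    "second_opinion": "GPT-5.4",      # brainstorming, sanity checks
    "bulk": "Qwen3.5",                # cheap or free high-volume work
}

def pick_model(task_type: str) -> str:
    """Fall back to the daily driver when the task type is unrecognized."""
    return ROUTES.get(task_type, ROUTES["default"])

print(pick_model("agent"))
print(pick_model("something_weird"))
```

The point of the fallback is the same point as the stack itself: the default should be the model you trust on an average day, and everything else is an exception you opt into deliberately.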
The trend is clear. Open source is taking share from paid APIs. Not because the quality is better - it's not, not yet - but because the quality is good enough and the price difference is too big to ignore.
The Bottom Line
Two things happened this month. Qwen3.5 earned a promotion from B to A. And Mythos showed up out of a leaked data cache and broke the tier list. The first one matters for your daily workflow. The second one matters for where everything is heading.
For the model you actually run code through right now, my advice hasn't changed. Pick a model, configure it well, ship code. Opus 4.6 is still the safest bet for production. Sonnet 4.6 is still the best daily driver. Qwen3.5 is still the best free option. The Mythos benchmarks make all of these feel a little smaller, but you can't deploy a model you can't access.
Read the full Mythos breakdown if you want the complete picture - benchmarks, exploits, sandbox escape, the timeline for public release. It's probably the most important AI release of the year and I'm still trying to figure out what to do with that.
I'll update this again in May. Or sooner if Mythos opens up or one of the vaporware models actually ships.