Update: The April 2026 ranking is now live with the latest model updates.
I rewrote this ranking three times in the past month. That's how fast things moved. If you want the full picture of what shipped when, I built an interactive AI release timeline that covers every major launch since 2024.
In February alone, Anthropic dropped Claude Opus 4.6 and Sonnet 4.6. Google released Gemini 3.1 Pro. xAI pushed Grok 4.20 into beta. Alibaba shipped Qwen3.5 with support for 201 languages. Then on March 5, OpenAI released GPT-5.4 and I had to throw out half my notes and start over.
So here's the deal. I use these models eight hours a day for actual work - building web apps, debugging production issues, reviewing pull requests. Not running benchmarks in a sandbox. I don't care what a model scores on MMLU. I care if it writes code I can commit without spending twenty minutes fixing it.
This is where everything landed after three weeks of daily use.
Quick Reference: Every Model at a Glance
| Model | Provider | Input / Output (per 1M tokens) | Context | Best For |
|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | Premium tier | 128K+ | Reasoning, math, analysis |
| Claude Opus 4.6 | Anthropic | $5 / $25 | 1M | Production code, debugging |
| Gemini 3.1 Pro | Google | $2 / $12 | 1M+ | Price/performance king |
| Claude Sonnet 4.6 | Anthropic | $3 / $15 | 1M | Best daily driver |
| Grok 4.20 | xAI | Competitive | 128K+ | Speed, coding benchmarks |
| GPT-5.4 | OpenAI | Standard tier | 128K+ | General purpose |
| Qwen3.5 | Alibaba | Free / cheap APIs | Large | Best open source |
| DeepSeek V3 | DeepSeek | ~$0.27 / $1.10 | 128K | Budget coding |
| Llama 4 Maverick | Meta | ~$0.40 / varies | 1M | Self-hosting |
| Llama 4 Scout | Meta | ~$0.15 / varies | 10M | Massive context |
| DeepSeek R1 | DeepSeek | ~$0.55 / $2.19 | 128K | Reasoning (open source) |
My Tier List
Based on three weeks of daily use across coding, reasoning, and general tasks. This is subjective. This is mine.
C tier doesn't mean bad. It means I wouldn't reach for them as a primary model. They all have specific use cases where they shine, but none of them are what I open first thing in the morning.
Coding: Writing Production Code
The only category that actually matters to me. Everything else is academic.
- Claude Opus 4.6 - I commit its output without changes about 70% of the time. That number was maybe 50% with Opus 4. The 4.6 update fixed the thing that annoyed me most - multi-file refactoring used to break halfway through. Now it holds. It reads my project conventions from context and writes error handling that matches the existing patterns instead of inventing its own. When I run it as an agent, it can go 15-20 minutes without going off the rails.
- GPT-5.4 - Close. Really close. The Thinking variant is better than Opus for complex algorithms, and sometimes writes more elegant solutions. My problem with it: I'll ask for a utility function and get back an entire abstraction layer with three interfaces and a factory pattern. The 33% error reduction over 5.2 is noticeable though - I stopped double-checking its import statements.
- Grok 4.20 - OK, I'll admit I was wrong about Grok. My 2024 review was harsh, but Grok 4.20 earned a spot here. ~75% on SWE-bench is legit. Fast, accurate, great for single-file tasks. But give it a task that touches four files? It'll edit three of them correctly and completely butcher the fourth.
- Gemini 3.1 Pro - Consistent. Never amazing, never terrible. I use Low thinking for boilerplate, High for anything that requires actual thought. At $2/$12 per million tokens it's hard to complain about much.
- Claude Sonnet 4.6 - Gets me 90% of the way to Opus for a third of the cost. This is what I use for the boring stuff - quick refactors, one-off scripts, simple features. The instruction following got way better since Sonnet 4.
- Qwen3.5 - A free model shouldn't be this good. The 397B MoE architecture rivals Sonnet on straightforward tasks. I've been using it through an API provider and I keep forgetting I'm not paying for it.
The SWE-bench numbers are tight at the top. The difference between #1 and #3 is one percentage point. What the benchmark doesn't show is how the models handle your specific codebase, your conventions, your tech stack. That's where the real separation happens, and it's why I rank Opus above Grok for coding despite the lower SWE-bench score - Opus is better at following project-level instructions.
Reasoning: Complex Problem Solving
Finding race conditions, tracing business logic through six files, figuring out why the cache is returning stale data at 3am. That kind of reasoning.
- GPT-5.4 Pro (Thinking) - 50% on FrontierMath Tiers 1-3. I don't usually care about benchmarks but that one matters - it means the model can actually do math I can't do in my head. Last week it found a deadlock in my connection pool that I'd been staring at for two hours. The Thinking mode is slow but it catches things I miss.
- Claude Opus 4.6 - Where Opus beats GPT-5.4 is the follow-through. It doesn't just find the bug, it gives me a fix that actually works with my existing code. GPT-5.4 will sometimes identify the issue perfectly and then propose a solution that doesn't compile.
- Gemini 3.1 Pro - 77.1% on ARC-AGI-2, more than double the previous version's score. Google's Low/Medium/High thinking tiers are useful - I can throw easy questions at Low and save money, then crank it up for the hard stuff.
- Grok 4.20 - Good at well-defined STEM problems. Bad at the messy "should we use a queue or a webhook here" type decisions. It picks one answer and commits fully, even when the right answer is "it depends."
- DeepSeek R1 - The thinking-out-loud model. Its chain-of-thought traces are worth reading even when the final answer is wrong, because they show you angles you hadn't considered. The distilled 32B runs on my Mac.
Speed: Time to Useful Response
Not much to say here. Some days I just want the answer now.
- Grok 4.20 - Fastest. Not close.
- Claude Sonnet 4.6 - Fast enough that I never notice the wait. Good enough that I rarely retry.
- Gemini 3.1 Pro (Low thinking) - Quick on Low, sluggish on High. Pick your tradeoff.
- GPT-5.4 - Fine in standard mode. Thinking mode? Go make coffee.
- Claude Opus 4.6 - Slowest of the lot. But here's the thing - it gets the answer right the first time more often, so I spend less total time. I'll take one slow correct response over three fast wrong ones.
Cost: What's Actually Worth Paying For
Pricing moves around a lot, so I'll keep this updated. Numbers are as of late March 2026.
| Model | Input | Output | Monthly Est. (heavy use) | Verdict |
|---|---|---|---|---|
| Gemini 3.1 Pro | $2 | $12 | $40-80 | Best value frontier |
| Claude Sonnet 4.6 | $3 | $15 | $60-120 | Best daily driver |
| Claude Opus 4.6 | $5 | $25 | $100-250 | Worth it for hard problems |
| DeepSeek V3 | $0.27 | $1.10 | $5-15 | Absurdly cheap |
| Qwen3.5 | Free / cheap | Free / cheap | $0-20 | Best free option |
| GPT-5.4 Pro | Premium | Premium | $150-400+ | Only for the hard stuff |
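The per-token math above is easy to sanity-check yourself. Here's a minimal sketch using the table's prices; the monthly token volumes are illustrative assumptions, not measurements:

```python
# Rough monthly cost estimator based on the pricing table above.
# Prices are (input $/1M tokens, output $/1M tokens).
PRICES = {
    "gemini-3.1-pro": (2.00, 12.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
    "deepseek-v3": (0.27, 1.10),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month, given millions of tokens in and out."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

# Example: a heavy month of ~20M input / 4M output tokens on Sonnet 4.6
print(round(monthly_cost("claude-sonnet-4.6", 20, 4), 2))  # 120.0
```

Note how fast the output side dominates: output tokens cost 5x input on most of these models, so verbose responses are what actually drive the bill.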
My take: start with Sonnet 4.6 for everything. Escalate to Opus 4.6 when you hit something Sonnet can't handle. Use Gemini 3.1 Pro if cost is a primary concern. Use DeepSeek V3 or Qwen3.5 for batch processing or when you're burning through millions of tokens.
If you use Claude Code like I do, the Max plan subscription makes the per-token math less relevant. But for API-heavy workflows and building with AI tools, these numbers add up fast.
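The escalation strategy above boils down to a simple routing rule: default cheap, escalate when a task proves hard, drop to a budget model for bulk work. A sketch, where the difficulty scale and thresholds are my own illustrative assumptions rather than anything official:

```python
# Hedged sketch of the start-with-Sonnet, escalate-to-Opus routing rule.
def pick_model(difficulty: int, batch: bool = False) -> str:
    """Pick a model name: cheap by default, escalate for hard tasks."""
    if batch:                # millions of tokens, peak quality less critical
        return "deepseek-v3"
    if difficulty >= 8:      # complex debugging, multi-file refactors
        return "claude-opus-4.6"
    return "claude-sonnet-4.6"   # the everyday default

print(pick_model(3))                # claude-sonnet-4.6
print(pick_model(9))                # claude-opus-4.6
print(pick_model(2, batch=True))    # deepseek-v3
```

In practice the "difficulty" signal is usually just "Sonnet failed twice," but making the rule explicit helps when you're wiring models into scripts or agents.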
Open Source: What You Can Actually Run
Open source caught up. I'm not being diplomatic - some of these are just straight up good, and I'd use them even if I had unlimited API budget.
Qwen3.5 (Alibaba) - 397B parameter MoE with 17B active. Apache 2.0 license, so you can do whatever you want with it. Supports 201 languages and dialects. The hybrid thinking mode (toggle between fast and deep reasoning) is something I haven't seen in other open models. Available on most API providers with free tiers.
Llama 4 Maverick (Meta) - 400B total, 17B active, 1M context. Natively multimodal. Strong on chat quality and coding. The Llama License isn't pure open source (companies over 700M MAU need a separate agreement), but for most developers it's effectively free. Harder to self-host than Scout due to the 128-expert MoE architecture.
DeepSeek V3 - 671B total, 37B active, 128K context. MIT license. The model that made Nvidia's stock price nervous. Training cost was reportedly $5.5M - that number broke a lot of people's brains about what AI training needs to cost. Their API is so cheap it's almost suspicious.
Llama 4 Scout (Meta) - 109B total, 17B active, 10 million token context. That context window is not a typo. For tasks that require processing massive codebases or documentation sets, Scout is unmatched. Runs on a single H100 with INT4 quantization. Decent quality for its active parameter count.
DeepSeek R1 Distilled 32B - If you want reasoning capability on local hardware, this is the one. Based on Qwen 2.5 32B, fine-tuned with R1's reasoning outputs. Runs on a Mac with 64GB RAM. I use it for RAG pipelines where I don't want data leaving my machine. If you're new to running models locally, I wrote a guide to Ollama that covers the basics.
What I'd actually do: Run Qwen3.5 through a cheap API provider. If you need to keep everything local - maybe you're handling client data or you just don't trust cloud APIs - DeepSeek R1 Distilled 32B or Llama 4 Scout are your best bets without needing a server rack.
The Vaporware Report
Three models that dominate Twitter discourse but don't actually exist yet.
Llama 4 Behemoth - Announced alongside Scout and Maverick back in April 2025. 288B active parameters, 16 experts. Was supposed to ship shortly after launch. It's now March 2026 and Zuckerberg is reportedly dissatisfied with the results. Internal benchmarks showed it beating GPT-4.5 and Claude Sonnet 3.7 on STEM tasks, but apparently that wasn't good enough. No ship date in sight.
DeepSeek V4 / R2 - Expected since late 2025. V4 is rumored at 1 trillion parameters with native multimodal and 1M context. R2 would be the next-gen reasoning model. Multiple predicted launch dates - mid-February, late February, early March - have all come and gone. The DeepSeek CEO is reportedly not happy with the quality either. Pattern match: both Meta and DeepSeek are discovering that "bigger" doesn't automatically mean "better."
Grok 5 - 6 trillion parameters. Training on Colossus 2, a 1-gigawatt supercluster. Musk says there's a "10% probability of achieving AGI." Prediction markets give it a 1% chance of shipping by March 31 and about 33% by June 30. I'll believe it when I can use it.
My Weekly Stack
Here's how I actually split my usage across a typical working week. This is from the past two weeks of real development work, not a theoretical recommendation.
- Claude Sonnet 4.6 - 45% - Quick code generation, simple refactoring, utility functions, prompt iteration, basic questions. The default until a task proves it needs something stronger.
- Claude Opus 4.6 - 30% - Complex debugging, code review, multi-file refactoring, architecture discussions. When I'm working through something that requires understanding the whole codebase, Opus is worth every token. I explained how this workflow makes me 10x more productive.
- GPT-5.4 - 15% - Second opinions, creative problem solving, brainstorming approaches. When Opus and I agree but I'm still not confident, I'll run it through GPT-5.4 as a sanity check. Also my go-to for explaining technical concepts to non-technical people.
- Qwen3.5 via API - 10% - Cost-sensitive batch processing, anything involving data I'd rather not send to US-based APIs, and quick transformations where speed matters more than peak quality.
I don't use Grok regularly. It's fast and fun but I can't fully trust it for production work yet. Gemini 3.1 Pro deserves more of my time than I give it - the price-to-quality ratio is objectively better than Sonnet - but I'm locked into the Claude Code workflow and switching has a cost. I compared Cursor and Windsurf for the IDE side of things if you're shopping for tools.
Which Model Should You Pick?
Short version: Sonnet 4.6 as the default daily driver. Opus 4.6 when the problem is genuinely hard. Gemini 3.1 Pro if cost is the priority. Qwen3.5 or DeepSeek V3 when you want free or near-free.
The Bottom Line
A year ago there was an obvious hierarchy. Now? Five or six models can all do serious work and the differences are mostly about workflow fit, not raw capability.
Honestly, the model matters less than how you use it. My Claude Code setup with a good CLAUDE.md file will produce better results than someone throwing prompts at GPT-5.4 Pro with no context. The IDE matters. The agent framework matters. The system prompt matters. The model is maybe 40% of the equation now.
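If you haven't used Claude Code, CLAUDE.md is the project-level instruction file it reads at startup. Here's a minimal sketch of the kind of thing I mean; the stack and rules below are hypothetical, not from any real project:

```markdown
# CLAUDE.md — project conventions (illustrative example)

## Stack
- TypeScript (strict mode), Node 20, pnpm

## Conventions
- Use the existing `Result<T, E>` error type; never throw in library code
- Tests live next to source files as `*.test.ts`; run with `pnpm test`
- No new dependencies without asking first

## Commands
- Build: `pnpm build`
- Lint: `pnpm lint --fix`
```

Even a file this short changes the output noticeably: the model stops inventing its own error handling and starts matching yours, which is most of what "following project conventions" means in practice.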
Pick one. Get good at it. Ship code. Stop agonizing over benchmarks.