📅 Last updated: February 2026

No-nonsense reference for developers who just want to know which AI model to pick. Bookmark this and stop Googling.

📌 Quick pick - just tell me what to use

Features / bugs / tests: Claude Sonnet 4.5 · Gemini 3 Pro · GPT-5.2
Agentic / CLI / scaffolding: Claude Haiku 4.5 · Gemini 3 Flash · GPT-4o
Architecture / refactors: Claude Opus 4.5 · Gemini 3 Pro · GPT-5.2 high
Code review: Claude Sonnet 4.5 · Gemini 3 Pro · GPT-5.2
Documentation: Claude Sonnet 4.5 · Gemini 3 Pro · GPT-4o
Design / planning: Claude Opus 4.5 · Gemini 3 Pro · GPT-5.2 high
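
If you drive models from scripts instead of an IDE picker, the list above collapses into a lookup. A sketch in Python - the model ID strings are illustrative placeholders, not verified API identifiers; substitute whatever your provider currently ships for each tier:

```python
# Task -> preferred model per provider, straight from the quick-pick list.
# IDs are placeholders - check each provider's docs for the current ones.
PICKS: dict[str, dict[str, str]] = {
    "feature":  {"anthropic": "claude-sonnet-4-5", "google": "gemini-3-pro",   "openai": "gpt-5.2"},
    "agentic":  {"anthropic": "claude-haiku-4-5",  "google": "gemini-3-flash", "openai": "gpt-4o"},
    "refactor": {"anthropic": "claude-opus-4-5",   "google": "gemini-3-pro",   "openai": "gpt-5.2-high"},
    "review":   {"anthropic": "claude-sonnet-4-5", "google": "gemini-3-pro",   "openai": "gpt-5.2"},
    "docs":     {"anthropic": "claude-sonnet-4-5", "google": "gemini-3-pro",   "openai": "gpt-4o"},
    "design":   {"anthropic": "claude-opus-4-5",   "google": "gemini-3-pro",   "openai": "gpt-5.2-high"},
}

def pick(task: str, provider: str = "anthropic") -> str:
    """Return the cheat-sheet pick for a task, e.g. pick("review", "google")."""
    return PICKS[task][provider]
```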

A note on versions

You’ll see version numbers everywhere: Sonnet 3.5, Sonnet 4, Sonnet 4.5. Gemini 2.5, Gemini 3. GPT-4o, GPT-5.

Don’t overthink it. The tier matters more than the version. “Sonnet” is the mid-tier Claude. “Opus” is the heavyweight Claude. “Flash” is the fast/cheap Gemini. Your IDE usually offers the latest version of each tier - just pick the tier that fits your task.

When this page says “Sonnet”, it means whatever the current Sonnet is. Same for the others.

The big three model families

Speed key: ⚡⚡⚡ Fast · ⚡⚡ Medium · ⚡ Slow

Anthropic (Claude)

Model What it’s for Speed Cost
Haiku Fast tasks, scaffolding, CLI ⚡⚡⚡ 💰
Sonnet Everyday coding ⚡⚡ 💰💰
Opus Complex reasoning, design ⚡ 💰💰💰
Start with Sonnet. It's the workhorse. Only reach for Opus when Sonnet genuinely can't handle the complexity.
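
"Start with Sonnet, escalate to Opus" is easy to encode if you script against the API. A sketch of the pattern - complete() is a stub for your real provider call, and the model IDs are placeholders:

```python
# "Start with Sonnet, escalate only when it can't cope" as a loop.
MODELS = ["claude-sonnet-4-5", "claude-opus-4-5"]  # cheap tier first

def complete(model: str, prompt: str) -> str:
    """Stand-in for your provider SDK call."""
    raise NotImplementedError

def solve(prompt: str, looks_done) -> str:
    """Try the workhorse first; only pay for Opus if the check fails."""
    answer = ""
    for model in MODELS:
        answer = complete(model, prompt)
        if looks_done(answer):  # e.g. tests pass, diff applies cleanly
            return answer
    return answer  # best effort from the heaviest tier
```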

OpenAI (GPT)

Model What it’s for Speed Cost
GPT-4o-mini Fast tasks, high volume ⚡⚡⚡ 💰
GPT-4o Everyday coding ⚡⚡ 💰💰
GPT-4.1 Complex coding ⚡⚡ 💰💰
GPT-5 / 5.2 Heavy lifting ⚡ 💰💰💰
o1 Deep reasoning (expensive) ⚡ 💰💰💰💰
o3 / o4-mini Reasoning (cheaper) ⚡ 💰💰
Skip the "o-series" reasoning models (o1, o3, o4-mini) for everyday coding. They think longer and cost more - o1 is particularly expensive at ~$1.35/task. Save them for:
  • Implementing algorithms (graph traversal, dynamic programming)
  • Debugging race conditions or complex state machines
  • Mathematical proofs or formal verification

Google (Gemini)

Model What it’s for Speed Cost
Gemini 2.0 Flash Ultra-cheap, simple tasks ⚡⚡⚡ 💰 (cheapest)
Gemini 2.5 Flash Fast tasks, high volume ⚡⚡⚡ 💰
Gemini 2.5 Pro Complex reasoning ⚡ 💰💰💰
Gemini 3 Flash Everyday coding ⚡⚡ 💰💰
Gemini 3 Pro Heavy lifting ⚡ 💰💰💰

Benchmarks

Want numbers? The TL;DR:

  • Claude Opus 4.5 & Gemini 3 Pro tie at 74% on SWE-bench - but Gemini costs less than half as much ($0.22 vs $0.50/task)
  • GPT-5 with high reasoning tops the Aider coding benchmark at 88% - followed by Gemini 2.5 Pro thinking (83%) and Sonnet 4.5 (82%)
  • Budget champions: GLM-4.7 ($0.05/task, 1440 Arena Elo) and Minimax M2 ($0.03/task, 1408 Elo) punch way above their weight
  • The most expensive model is not automatically the best at coding
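
To feel the per-task price differences, run the arithmetic at volume. A worked sketch using the per-task figures quoted above (the monthly task count is an assumption):

```python
# Monthly spend at an assumed 500 agent tasks/month,
# using the per-task prices quoted in the bullets above.
TASKS_PER_MONTH = 500
per_task = {"Claude Opus 4.5": 0.50, "Gemini 3 Pro": 0.22, "GLM-4.7": 0.05}

for model, price in per_task.items():
    print(f"{model}: ${price * TASKS_PER_MONTH:,.2f}/month")
# Claude Opus 4.5: $250.00/month
# Gemini 3 Pro: $110.00/month
# GLM-4.7: $25.00/month
```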

Benchmarks are useful for gut-checking, but the real test is running a model on your own work.

Marketing BS decoder

They say It means
“Most intelligent” Bigger, slower, pricier
“Balanced” Mid-tier - usually right
“Fast” / “efficient” Smaller, cheaper, simpler
“Reasoning” / “thinking” Extra thinking time - see below
“Preview” / “experimental” Unstable - skip it
“200K context” Can see lots of code - but should it?
Opus is NOT a "thinking" model. It's just big and slow. "Thinking" models (o1, o3, Opus-thinking, Sonnet-thinking) explicitly reason step-by-step before responding - you'll see them labeled with "thinking" or "reasoning" in the model name. Regular Opus/Sonnet/GPT-5 are slower because they're larger, not because they're doing extra reasoning passes.
💡 "Thinking..." in the UI ≠ reasoning model. When your IDE shows "Thinking..." or a spinner, that's just the model processing your request - every model does this. True reasoning models show you their actual chain-of-thought (sometimes in a collapsible section), and are explicitly labeled "thinking" or "reasoning" in the model picker. Don't confuse a slow response with deep reasoning.

When do “reasoning” models actually help?

Reasoning models (o1, o3, “thinking” variants) work through problems step-by-step before responding.
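
In the APIs, "thinking" is an explicit switch, not just a bigger model. A minimal sketch using the Anthropic Python SDK's extended-thinking option - the model ID is a placeholder for the current Sonnet, and the token budgets are arbitrary:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder for the current Sonnet ID
    max_tokens=16_000,
    # Extended thinking: the model reasons in a separate token budget
    # before answering. Omit this and you get the normal, faster model.
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "Find the deadlock in this code: ..."}],
)

# The reply contains explicit "thinking" blocks before the final answer.
for block in response.content:
    print(block.type)  # "thinking", then "text"
```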

Worth it for:

  • Implementing complex algorithms (A*, red-black trees, constraint solvers)
  • Debugging concurrency issues, race conditions, deadlocks
  • Untangling deeply nested dependency chains
  • Mathematical proofs or formal logic

Overkill for:

  • Adding a new API endpoint
  • Fixing a null pointer exception
  • Writing unit tests
  • Refactoring for readability
  • Most day-to-day feature work

A standard model with a good prompt is faster and cheaper for 90% of coding tasks.

What about context window size?

Context window = how much code the model can “see” at once. Bigger sounds better, but:

  • More context = more noise. The model gets distracted.
  • More context = slower and pricier. You pay per token.
  • You rarely need it. Most tasks involve a few files, not hundreds.

Big windows help for: exploring unfamiliar codebases, analysing logs, multi-file refactors. For everyday coding, focused context beats massive context.
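
“Focused context” just means choosing files before you prompt. A plain-Python sketch of the idea - the file names and the character budget are illustrative:

```python
from pathlib import Path

def build_context(repo: Path, relevant: list[str], budget_chars: int = 40_000) -> str:
    """Concatenate only hand-picked files into a prompt, under a size budget.

    `relevant` is the short list of files you already suspect matter;
    the point is to curate, not to dump the whole repo into the window.
    """
    parts: list[str] = []
    used = 0
    for name in relevant:
        text = (repo / name).read_text()
        if used + len(text) > budget_chars:
            break  # stop before the context turns into noise
        parts.append(f"--- {name} ---\n{text}")
        used += len(text)
    return "\n\n".join(parts)

# Usage: the three files the bug actually touches, not all 400 in the repo.
# prompt = build_context(Path("."), ["api/routes.py", "api/auth.py", "tests/test_auth.py"])
```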

Sources