📅 Snapshot: December 2025

This page collates benchmark data from independent sources to help you compare models. These aren’t my benchmarks - I’m just pulling highlights so you don’t have to tab between sites.

For the latest data, always check the original sources.


SWE-bench Verified

Source: swebench.com · Tests whether models can fix real GitHub issues
Model Score $/task Copilot
Claude Opus 4.5 74.4% $0.72 3×
Gemini 3 Pro 74.2% $0.46
GPT-5.2 (high reasoning) 71.8% $0.52
Claude Sonnet 4.5 70.6% $0.56 1×
GPT-5.2 69.0% $0.27
Claude Opus 4.1 67.6% $1.13 10×
GPT-5 (medium reasoning) 65.0% $0.28 1×
o3 58.4% $0.33 -
Gemini 2.5 Pro 53.6% $0.29
o4-mini 45.0% $0.21 -
GPT-4.1 39.6% $0.15 0×
Gemini 2.5 Flash 28.7% $0.13 -

$/task = cost to solve one benchmark task via direct API (based on token usage × provider pricing). Useful for API users and for comparing relative costs. Copilot = GitHub Copilot premium request multiplier (0× = free on paid plans).
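
As a rough illustration of where a $/task figure comes from, here's a minimal sketch of the token-usage × pricing arithmetic. The token counts and per-million-token prices are placeholders, not the numbers behind the table above.

```python
# Sketch of the $/task arithmetic: tokens used on one task x provider pricing.
# All numbers here are illustrative placeholders, not actual benchmark or pricing data.

def cost_per_task(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Dollar cost of one benchmark task from token usage and per-million-token prices."""
    return (input_tokens / 1_000_000) * price_in_per_mtok \
         + (output_tokens / 1_000_000) * price_out_per_mtok

# e.g. ~100k tokens of repo context in, ~8k tokens of patch/reasoning out,
# at a hypothetical $3 / $15 per million input/output tokens:
print(f"${cost_per_task(100_000, 8_000, 3.0, 15.0):.2f} per task")  # $0.42 per task
```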

Takeaway: Sonnet 4.5 hits 70.6% at $0.56/task and 1× Copilot cost. Opus 4.5 scores about four points higher but costs 3× in Copilot. The older Opus 4.1 costs 10× - always use Opus 4.5 if you need Opus.

Aider Polyglot

Source: aider.chat/docs/leaderboards · Tests code editing across C++, Go, Java, JavaScript, Python, Rust
Why so few Copilot models? Aider benchmarks older model versions (Claude 3.5, not 4.5). GitHub Copilot has moved on to newer models, so there's limited overlap. GPT-4o is the main one you can directly compare.
Model % Correct Copilot
o1 84.2% -
Claude 3.5 Sonnet 84.2% -
Gemini exp 80.5% -
Claude 3.5 Haiku 75.2% -
GPT-4o 72.9% 0×
DeepSeek Coder V2 72.9% -
Claude 3 Opus 68.4% -
GPT-4o-mini 55.6% -
Claude 3 Haiku 47.4% -

Takeaway: On this benchmark, GPT-4o (0× in Copilot - free on paid plans) scores 72.9%. That's your baseline. If Aider updates with Claude 4.5/GPT-5 results, expect higher scores.

Chatbot Arena WebDev

Source: lmarena.ai/leaderboard/webdev · Human preference voting on web development tasks
Rank Model Score $/task Copilot Notes
1 Claude Opus 4.5 thinking 1520 $0.72 Thinking variant
2 GPT-5.2 high 1484 $0.52 High reasoning mode
3 Claude Opus 4.5 1480 $0.72 3× Standard (non-thinking)
4 Gemini 3 Pro 1478 $0.46  
5 Gemini 3 Flash 1465 $0.13 0.33×  
8 Claude Sonnet 4.5 thinking 1393 $0.56 Thinking variant
11 Claude Sonnet 4.5 1387 $0.56 1× Standard (non-thinking)
21 Claude Haiku 4.5 1290 $0.21 0.33×  
"Thinking" variants are labeled explicitly. Claude Opus 4.5 (rank 3) is NOT a thinking model - it's just big. The thinking variant (rank 1) is a separate model that does explicit reasoning passes. Same for Sonnet: standard (rank 11) vs thinking (rank 8). The thinking versions score higher but are slower and cost more.

What benchmarks don’t tell you

  • Latency - high-scoring models can feel sluggish
  • Consistency - benchmark runs are controlled; your prompts aren’t
  • Your stack - generic benchmarks miss framework-specific quirks
  • Cost at scale - 5% better might not justify 3× the price (see the sketch below)
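
To make the cost-at-scale point concrete, here's a back-of-the-envelope comparison using the Opus 4.5 and Sonnet 4.5 rows from the SWE-bench table and a hypothetical volume of 2,000 tasks a month.

```python
# Back-of-the-envelope: is a higher score worth a higher per-task price?
# Scores and $/task come from the SWE-bench table above; the monthly volume is hypothetical.

tasks_per_month = 2_000

models = {
    "Claude Opus 4.5":   {"score": 0.744, "cost_per_task": 0.72},
    "Claude Sonnet 4.5": {"score": 0.706, "cost_per_task": 0.56},
}

for name, m in models.items():
    monthly = m["cost_per_task"] * tasks_per_month
    print(f"{name}: {m['score']:.1%} solved, ${monthly:,.0f}/month")

# Claude Opus 4.5: 74.4% solved, $1,440/month
# Claude Sonnet 4.5: 70.6% solved, $1,120/month
# ~4 points of benchmark score for ~$320/month extra - only worth it if those
# extra solved tasks matter more to you than the price difference.
```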

The best benchmark is running a model on your own work for a day.


Other benchmarks

Benchmark What it tests Notes
HumanEval Python function completion Classic but dated
MBPP Basic Python problems Also dated
CodeContests Competitive programming Harder, less realistic
LiveCodeBench Fresh problems livecodebench.github.io - avoids training contamination

For day-to-day coding, SWE-bench and Aider are most relevant.


Appendix: GitHub Copilot cost column

The Copilot column shows the premium request multiplier for GitHub Copilot (VS Code, etc.):

Multiplier What it means
0× Included free (GPT-4.1, GPT-4o, GPT-5 mini) - no premium requests used
0.33× Cheap (Haiku 4.5, Gemini 3 Flash) - 3 requests for the price of 1
1× Standard (Sonnet, GPT-5, Gemini Pro) - 1 premium request per use
3× Expensive (Opus 4.5) - costs 3 premium requests per use
10× Very expensive (Opus 4.1) - costs 10 premium requests per use
- Not available in GitHub Copilot

Paid plans get 300-1500 premium requests/month depending on tier. Extra requests cost $0.04 each.
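
To see how the multipliers play out against a monthly allowance, here's a small sketch. The multipliers come from the table above; the usage mix and the 300-request allowance are just one plausible scenario.

```python
# How a month of mixed-model usage consumes a Copilot premium-request allowance.
# Multipliers are from the table above; the usage mix is hypothetical.

MONTHLY_ALLOWANCE = 300   # entry paid tier (plans range roughly 300-1500)
OVERAGE_PRICE = 0.04      # dollars per premium request beyond the allowance

usage = {                 # requests made per multiplier class this month
    0.0:  500,            # included models (GPT-4.1, GPT-4o, GPT-5 mini)
    0.33: 150,            # Haiku 4.5, Gemini 3 Flash
    1.0:  200,            # Sonnet 4.5, GPT-5, Gemini Pro
    3.0:   20,            # Opus 4.5
}

premium_used = sum(mult * count for mult, count in usage.items())
overage = max(0.0, premium_used - MONTHLY_ALLOWANCE)

print(f"Premium requests consumed: {premium_used:.1f}")   # 309.5
print(f"Overage cost: ${overage * OVERAGE_PRICE:.2f}")    # $0.38
```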
