AI coding benchmarks
📅 Snapshot: February 2026
This page collates benchmark data from independent sources to help you compare models. These aren’t my benchmarks - I’m just pulling highlights so you don’t have to tab between sites.
For the latest data, always check the original sources. Data current as of: SWE-bench (December 2025), Aider (October 2025), Arena Code (February 2026).
SWE-bench Verified
| Model | Score | $/task | Copilot |
|---|---|---|---|
| Claude Opus 4.5 | 74.4% | $0.50 | 3× |
| Gemini 3 Pro | 74.2% | $0.22 | 1× |
| GPT-5.2 (high reasoning) | 71.8% | $0.53 | 1× |
| Claude Sonnet 4.5 | 70.6% | $0.30 | 1× |
| GPT-5.2 | 69.0% | $0.23 | 1× |
| Claude Opus 4.1 | 67.6% | $1.50 | 10× |
| GPT-5 (medium reasoning) | 65.0% | $0.16 | 1× |
| Gemini 3 Flash | 63.8% | $0.08 | 0.33× |
| Kimi K2 Thinking Turbo | 63.4% | $0.06 | - |
| Minimax M2 | 61.0% | $0.03 | - |
| DeepSeek V3.2 Reasoner | 60.0% | $0.02 | - |
| o3 | 58.4% | $0.18 | - |
| GLM-4.6 | 55.4% | $0.05 | - |
| Devstral 2 | 53.8% | - | - |
| Gemini 2.5 Pro | 53.6% | $0.16 | 1× |
| GPT-4o | 48.9% | $0.23 | 0× |
| Claude Haiku 4.5 | 48.4% | $0.10 | 0.33× |
| o4-mini | 45.0% | $0.10 | - |
| GPT-4.1 | 39.6% | $0.18 | 0× |
| DeepSeek V3.2 Chat | 39.0% | $0.02 | - |
| Gemini 2.5 Flash | 28.7% | $0.04 | - |
| Gemini 2.0 Flash | 22.0% | $0.01 | - |
| GPT-4o-mini | 18.6% | $0.01 | - |
| GPT-5 mini | 14.2% | $0.03 | 0× |
$/task = estimated cost to solve one benchmark task via the direct API (token usage × provider pricing). Useful for API users and for comparing relative costs. Copilot = GitHub Copilot premium request multiplier (0× = included free on paid plans; see the appendix at the bottom).
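If you want to sanity-check a $/task figure, the arithmetic is just token usage times provider pricing. Here's a minimal sketch - the cost_per_task helper, the token counts, and the per-million-token prices are all made up for illustration, not taken from any row above:

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Estimate the direct-API cost of one benchmark task.

    Prices are in dollars per million tokens, the way providers usually quote them.
    """
    return (input_tokens / 1_000_000) * price_in_per_mtok + \
           (output_tokens / 1_000_000) * price_out_per_mtok

# Illustrative only: 60k input tokens and 8k output tokens at $3 / $15 per Mtok
print(f"${cost_per_task(60_000, 8_000, 3.00, 15.00):.2f} per task")  # $0.30 per task
```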
Aider Polyglot
| Model | % Correct | Copilot |
|---|---|---|
| GPT-5 (high reasoning) | 88.0% | 1× |
| o3-pro (high) | 84.9% | - |
| Gemini 2.5 Pro 32k (think) | 83.1% | 1× |
| Claude Sonnet 4.5 | 82.4% | 1× |
| Claude Opus 4.1 | 82.1% | 10× |
| o3 (high) | 81.3% | - |
| o4-mini | 75.4% | - |
| DeepSeek V3.2 Exp Reasoner | 74.2% | - |
| Claude Haiku 4.5 | 73.5% | 0.33× |
| GPT-4o | 72.9% | 0× |
| Claude Opus 4.5 | 70.7% | 3× |
| DeepSeek V3.2 Chat | 70.2% | - |
| Gemini 2.5 Flash | 68.0% | - |
| GPT-4.1 | 66.4% | 0× |
| Kimi K2 | 59.1% | - |
| Gemini 2.0 Flash | 58.0% | - |
| GPT-4o-mini | 55.6% | - |
| GPT-5 mini | 50.2% | 0× |
LiveBench
What it is: A contamination-free benchmark with 23 diverse tasks spanning Coding, Agentic Coding, Data Analysis, Language, Instruction Following, Math, and Reasoning. Questions refresh every 6 months and are delay-released to minimize training contamination. Scores use objective ground-truth answers, not LLM judges.
Why it matters: Most benchmarks face contamination (models train on test data). LiveBench addresses this with regular question rotation and delayed public release. The Global Average provides a single score across multiple capabilities, avoiding narrow specialization.
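As far as I can tell from the rows below, the Global Average is just the plain mean of the seven category scores - that's an observation from the table, not an official formula, and rounding in the category scores means some rows land 0.1 off. A quick check:

```python
# Category scores for Claude Opus 4.5 Thinking High, copied from the table below
scores = {"Coding": 80.1, "Agentic": 79.7, "Data": 63.3, "Language": 90.4,
          "IF": 74.4, "Math": 81.3, "Reasoning": 62.6}

global_avg = sum(scores.values()) / len(scores)
print(round(global_avg, 1))  # 76.0, matching the Global Avg column
```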
| Model | Global Avg | Coding | Agentic | Data | Language | IF | Math | Reasoning |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 Thinking High | 76.0 | 80.1 | 79.7 | 63.3 | 90.4 | 74.4 | 81.3 | 62.6 |
| GPT-5.2 high reasoning | 74.8 | 83.2 | 76.1 | 51.7 | 93.2 | 78.2 | 79.8 | 61.8 |
| Gemini 3 Pro | 73.4 | 77.4 | 74.6 | 55.0 | 81.8 | 74.4 | 84.6 | 65.9 |
| Gemini 3 Flash | 72.4 | 76.3 | 71.8 | 56.7 | 86.6 | 75.6 | 81.2 | 58.5 |
| GPT-5 | 70.5 | 77.5 | 68.9 | 45.0 | 86.4 | 75.1 | 77.2 | 63.4 |
| DeepSeek V3.2 Reasoner | 62.2 | 65.3 | 58.4 | 41.7 | 78.0 | 65.9 | 75.4 | 51.1 |
| Kimi K2 Thinking Turbo | 61.6 | 66.1 | 64.9 | 40.0 | 73.6 | 63.0 | 66.3 | 56.8 |
| Grok 4.1 Fast | 60.0 | 58.4 | 63.6 | 40.0 | 78.4 | 61.4 | 71.2 | 47.3 |
| Claude Opus 4.5 | 59.1 | 67.1 | 64.8 | 40.0 | 67.8 | 56.5 | 63.0 | 54.2 |
| Gemini 2.5 Pro | 58.3 | 57.1 | 55.9 | 46.7 | 70.2 | 56.9 | 69.6 | 51.7 |
| GLM-4.7 | 58.1 | 60.1 | 57.2 | 36.7 | 69.6 | 57.5 | 68.8 | 56.8 |
| GLM-4.6 | 55.2 | 57.9 | 49.8 | 41.7 | 67.0 | 56.1 | 61.3 | 52.5 |
| Claude Opus 4.1 | 54.5 | 59.3 | 56.8 | 30.0 | 62.9 | 52.0 | 58.7 | 61.8 |
| Claude Sonnet 4.5 | 53.7 | 58.9 | 56.5 | 38.3 | 61.3 | 52.8 | 59.6 | 48.5 |
| DeepSeek V3.2 Chat | 51.8 | 60.7 | 47.5 | 40.0 | 60.1 | 51.6 | 58.9 | 43.7 |
| Gemini 2.5 Flash | 47.7 | 51.1 | 41.4 | 31.7 | 57.6 | 47.2 | 56.5 | 48.3 |
| Claude Haiku 4.5 | 45.3 | 52.2 | 43.5 | 26.7 | 54.1 | 42.5 | 51.4 | 47.0 |
| Devstral 2 | 41.2 | 49.4 | 39.1 | 26.7 | 48.3 | 38.1 | 45.5 | 41.4 |
- Thinking models dominate: Claude Opus 4.5 Thinking (76.0) outperforms the base model (59.1) by roughly 17 points
- Second place overall: GPT-5.2 high reasoning (74.8) leads the Coding (83.2) and Language (93.2) columns
- Flash punches above its weight: Gemini 3 Flash (72.4) scores within 1 point of Gemini 3 Pro (73.4), remarkable for a lightweight model
- Category variance reveals specialization: models can swing 20-40 points across categories (e.g., Claude Opus 4.5 Thinking: 90.4 Language vs 62.6 Reasoning)
- Budget champion: GLM-4.7 (58.1) delivers about 75% of the top score at $0.05/task, roughly 10× cheaper than Claude Opus 4.5 ($0.50/task)
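A quick sanity check of the arithmetic behind these bullets, using the Global Avg column above:

```python
# Global Avg figures from the LiveBench table above
opus_thinking, opus_base = 76.0, 59.1
gemini_pro, gemini_flash = 73.4, 72.4
glm_47 = 58.1

print(f"Thinking gap:  {opus_thinking - opus_base:.1f} points")          # 16.9
print(f"Pro vs Flash:  {gemini_pro - gemini_flash:.1f} points")          # 1.0
print(f"GLM-4.7:       {glm_47 / opus_thinking:.0%} of the top score")   # 76%
```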
Chatbot Arena Code
| Rank | Model | Elo Score | $/task | Copilot | Notes |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 thinking-32k | 1497 | $0.50 | - | Thinking variant |
| 2 | GPT-5.2 high reasoning | 1470 | $0.53 | 1× | High reasoning mode |
| 3 | Claude Opus 4.5 | 1468 | $0.50 | 3× | Standard (non-thinking) |
| 4 | Gemini 3 Pro | 1454 | $0.22 | 1× | |
| 5 | Gemini 3 Flash | 1443 | $0.08 | 0.33× | |
| 6 | GLM-4.7 | 1440 | $0.05 | - | |
| 7 | GPT-5.2 | 1432 | $0.23 | 1× | |
| 8 | Claude Opus 4.1 | 1431 | $1.50 | 10× | |
| 9 | o3 | 1417 | $0.18 | - | |
| 10 | Minimax M2.1 preview | 1408 | $0.03 | - | |
| 11 | GPT-5 | 1407 | $0.16 | 1× | |
| 12 | Grok 4.1 Fast | 1393 | - | 0.25× | |
| 13 | Claude Sonnet 4.5 | 1383 | $0.30 | 1× | |
| 14 | GPT-4o | 1372 | $0.23 | 0× | |
| 15 | Gemini 2.5 Pro | 1372 | $0.16 | 1× | |
| 16 | Devstral 2 | 1363 | - | - | |
| 17 | Kimi K2 Thinking Turbo | 1356 | $0.06 | - | |
| 18 | DeepSeek V3.2 Reasoner | 1350 | $0.02 | - | |
| 19 | o4-mini | 1310 | $0.10 | - | |
| 20 | GPT-4.1 | 1305 | $0.18 | 0× | |
| 21 | Claude Haiku 4.5 | 1290 | $0.10 | 0.33× | |
| 22 | DeepSeek V3.2 Chat | 1287 | $0.02 | - |
What benchmarks don’t tell you
- Latency - high-scoring models can feel sluggish
- Consistency - benchmark runs are controlled; your prompts aren’t
- Your stack - generic benchmarks miss framework-specific quirks
- Cost at scale - 5% better might not justify 3× the price (see the sketch below)
The best benchmark is running a model on your own work for a day.
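One way to make the cost-at-scale point concrete is to divide $/task by the solve rate, giving an effective cost per solved task. A rough sketch using two SWE-bench rows from above (the cost_per_solved_task helper is mine; this ignores Copilot plans, retries, and whatever your real workload looks like):

```python
def cost_per_solved_task(dollars_per_task: float, solve_rate: float) -> float:
    """Average spend per successfully solved task when running many tasks."""
    return dollars_per_task / solve_rate

# SWE-bench Verified numbers from the table above
print(f"${cost_per_solved_task(0.50, 0.744):.2f}")  # Claude Opus 4.5 -> ~$0.67 per solve
print(f"${cost_per_solved_task(0.22, 0.742):.2f}")  # Gemini 3 Pro    -> ~$0.30 per solve
```

By that measure Gemini 3 Pro solves roughly the same share of tasks for less than half the spend, which is exactly the kind of trade-off the raw scores hide.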
Other benchmarks
| Benchmark | What it tests | Notes |
|---|---|---|
| HumanEval | Python function completion | Classic but dated |
| MBPP | Basic Python problems | Also dated |
| CodeContests | Competitive programming | Harder, less realistic |
| LiveCodeBench | Fresh problems | livecodebench.github.io - avoids training contamination |
For day-to-day coding, SWE-bench and Aider are most relevant.
Appendix: GitHub Copilot cost column
The Copilot column shows the premium request multiplier for GitHub Copilot (VS Code, etc.):
| Multiplier | What it means |
|---|---|
| 0× | Included free (GPT-4.1, GPT-4o, GPT-5 mini) - no premium requests used |
| 0.25× | Cheap (Grok 4.1 Fast) - 4 requests for the price of 1 |
| 0.33× | Cheap (Haiku 4.5, Gemini 3 Flash) - 3 requests for the price of 1 |
| 1× | Standard (Sonnet, GPT-5, Gemini Pro) - 1 premium request per use |
| 3× | Expensive (Opus 4.5) - costs 3 premium requests per use |
| 10× | Very expensive (Opus 4.1) - costs 10 premium requests per use |
| - | Not available in GitHub Copilot |
Paid plans get 300-1500 premium requests/month depending on tier. Extra requests cost $0.04 each.
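If you're on a paid plan, the multiplier table turns into a simple budget calculation: each use burns multiplier × 1 premium requests, and anything over your monthly allowance is billed at $0.04 per request. A rough sketch (copilot_monthly_cost is a made-up helper, the usage mix is invented, and 300 is the lowest allowance mentioned above):

```python
def copilot_monthly_cost(uses_by_multiplier: dict[float, int],
                         allowance: int = 300,
                         overage_price: float = 0.04) -> float:
    """Estimate the overage bill for a month of Copilot premium requests."""
    used = sum(mult * count for mult, count in uses_by_multiplier.items())
    overage = max(0.0, used - allowance)
    return overage * overage_price

# Invented month: 200 uses at 1x (e.g. Sonnet 4.5), 50 at 3x (Opus 4.5),
# 300 at 0.33x (Gemini 3 Flash) = 200 + 150 + 99 = 449 premium requests
usage = {1.0: 200, 3.0: 50, 0.33: 300}
print(f"${copilot_monthly_cost(usage):.2f}")  # 149 requests over the allowance -> $5.96
```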