AI coding benchmarks
📅 Snapshot: June 2026
This page collates benchmark data from independent sources to help you compare models. These aren’t my benchmarks - I’m just pulling highlights so you don’t have to tab between sites.
For the latest data, always check the original sources. Data current as of: SWE-bench (February 2026), Aider (June 2025), Arena Code (February 2026).
SWE-bench Verified
| Model | Score | $/task | Copilot |
|---|---|---|---|
| Claude Opus 4.5 | 76.8% | $0.50 | ✓ |
| Minimax M2.5 | 75.8% | $0.07 | - |
| Gemini 3 Flash | 75.8% | $0.06 | ✓ |
| Claude Opus 4.6 | 75.6% | $0.50 | ✓ |
| GPT-5.2 (high reasoning) | 72.8% | $0.23 | ✓ |
| GLM-5 | 72.8% | $0.05 | - |
| GPT-5.2 | 72.8% | $0.23 | ✓ |
| Claude Sonnet 4.5 | 71.4% | $0.30 | ✓ |
| Kimi K2.5 | 70.8% | $0.15 | - |
| DeepSeek V4 Flash | 70.0% | $0.01 | - |
| Gemini 3.1 Pro | 69.6% | $0.22 | ✓ |
| Claude Opus 4.1 | 67.6% | $1.50 | ✓ |
| Claude Haiku 4.5 | 66.6% | $0.10 | ✓ |
| GPT-5 | 65.0% | $0.16 | ✓ |
| Kimi K2 Thinking Turbo | 63.4% | $0.06 | - |
| GPT-5 mini | 56.2% | $0.03 | ✓ |
| Gemini 2.5 Pro | 53.6% | $0.16 | ✓ |
$/task = cost to solve one benchmark task via direct API (token usage × provider pricing). Copilot = availability in GitHub Copilot (✓ = available, - = not available; Copilot has used token-based AI credit billing since June 2026).
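As a rough sketch of how a $/task figure falls out of "token usage × provider pricing", the snippet below multiplies token counts by per-million-token prices. The token counts and prices are hypothetical placeholders, not values from any provider's price list or from the benchmark itself.

```python
# Hypothetical illustration: the token counts and prices used here are made up.

def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Return the USD cost of one task from token usage and per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * usd_per_m_input
    output_cost = output_tokens / 1_000_000 * usd_per_m_output
    return input_cost + output_cost


# Assumed usage for a single task: 200k input tokens, 10k output tokens.
example = cost_per_task(
    input_tokens=200_000,
    output_tokens=10_000,
    usd_per_m_input=1.00,   # assumed price, USD per million input tokens
    usd_per_m_output=5.00,  # assumed price, USD per million output tokens
)
print(f"${example:.2f}")
```

Real harnesses may also account for cached tokens or retries, which this sketch ignores.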
Aider Polyglot
| Model | % Correct | Copilot |
|---|---|---|
| GPT-5 (high reasoning) | 88.0% | ✓ |
| o3-pro (high) | 84.9% | - |
| Gemini 2.5 Pro 06-05 (32k think) | 83.1% | ✓ |
| Claude Sonnet 4.5 | 82.4% | ✓ |
| Claude Opus 4.1 | 82.1% | ✓ |
| o3 (high) | 81.3% | - |
| Grok 4 (high) | 79.6% | - |
| DeepSeek V4 Flash (Reasoner) | 74.2% | - |
| Claude Haiku 4.5 | 73.5% | ✓ |
| o4-mini | 72.0% | - |
| Claude Opus 4.5 | 70.7% | ✓ |
| DeepSeek V4 Flash (Chat) | 70.2% | - |
| Kimi K2 | 59.1% | - |
| Claude Sonnet 4 | 56.4% | ✓ |
| Gemini 2.5 Flash (thinking) | 55.1% | - |
| DeepSeek V3 (0324) | 55.1% | - |
| Grok 3 Beta | 53.3% | - |
| GPT-4.1 | 52.4% | - |
| GPT-5 mini | 50.2% | ✓ |
| Grok 3 Mini Beta (high) | 49.3% | - |
LiveBench
What it is: A contamination-free benchmark with 23 diverse tasks spanning Coding, Agentic Coding, Data Analysis, Language, Instruction Following, Math, and Reasoning. Questions refresh every 6 months and are delay-released to minimize training contamination. Scores use objective ground-truth answers, not LLM judges.
Why it matters: Most benchmarks face contamination (models train on test data). LiveBench addresses this with regular question rotation and delayed public release. The Global Average provides a single score across multiple capabilities, avoiding narrow specialization.
| Model | Global Avg | Coding | Agentic | Data | Language | IF | Math | Reasoning |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 Thinking xHigh | 80.7 | 87.7 | 82.5 | 56.7 | 96.3 | 81.1 | 87.7 | 73.0 |
| GPT-5.4 Thinking xHigh | 80.3 | 88.1 | 77.5 | 70.0 | 94.2 | 79.3 | 82.6 | 70.2 |
| Gemini 3.1 Pro | 79.9 | 84.0 | 76.5 | 65.0 | 91.0 | 78.5 | 85.4 | 79.1 |
| Claude Fable 5 Thinking xHigh | 78.3 | 87.3 | 78.6 | 60.0 | 93.9 | 80.0 | 88.5 | 60.0 |
| Claude 4.8 Opus Thinking xHigh | 77.2 | 89.7 | 79.3 | 60.0 | 84.3 | 78.3 | 81.4 | 67.5 |
| Claude 4.7 Opus Thinking xHigh | 76.9 | 87.7 | 82.1 | 60.0 | 93.1 | 78.3 | 77.9 | 59.3 |
| Claude 4.6 Opus Thinking | 76.3 | 88.7 | 78.2 | 61.7 | 89.3 | 69.9 | 83.3 | 63.3 |
| Claude 4.5 Opus Thinking High | 76.0 | 80.1 | 79.7 | 63.3 | 90.4 | 74.4 | 81.3 | 62.6 |
| Claude 4.6 Sonnet Thinking | 75.5 | 84.8 | 79.3 | 60.0 | 87.0 | 78.0 | 76.1 | 63.2 |
| Gemini 3.5 Flash High | 75.0 | 82.0 | 78.2 | 51.7 | 88.2 | 64.9 | 84.6 | 75.6 |
| GPT-5.2 high reasoning | 74.8 | 83.2 | 76.1 | 51.7 | 93.2 | 78.2 | 79.8 | 61.8 |
| Qwen 3.7 Max | 74.3 | 83.3 | 74.2 | 51.7 | 85.3 | 71.8 | 79.7 | 74.0 |
| GPT-5.1 Codex Max | 74.0 | 83.7 | 80.7 | 53.3 | 83.2 | 70.1 | 76.5 | 70.4 |
| DeepSeek V4 Pro | 73.6 | 82.7 | 70.0 | 56.7 | 90.7 | 74.5 | 78.1 | 62.4 |
| GPT-5.3 Codex High | 72.8 | 80.2 | 78.2 | 55.0 | 87.8 | 62.7 | 80.1 | 65.4 |
| Gemini 3 Flash | 72.4 | 76.3 | 71.8 | 56.7 | 86.6 | 75.6 | 81.2 | 58.5 |
| Kimi K2.6 Thinking | 72.2 | 79.4 | 78.6 | 58.3 | 84.3 | 65.1 | 75.1 | 64.4 |
| GPT-5.1 | 72.0 | 78.8 | 72.5 | 53.3 | 86.9 | 69.6 | 79.3 | 63.9 |
| GLM-5 | 68.9 | 69.1 | 73.6 | 55.0 | 83.5 | 67.9 | 77.5 | 55.3 |
| GPT-5 | 70.5 | 77.5 | 68.9 | 45.0 | 86.4 | 75.1 | 77.2 | 63.4 |
| Qwen 3.6 Plus | 70.9 | 75.8 | 78.2 | 55.0 | 83.7 | 69.9 | 75.0 | 58.3 |
| GPT-5.4 nano | 70.1 | 81.1 | 72.1 | 49.1 | 91.3 | 67.6 | 62.5 | 67.2 |
| Minimax M3 | 70.0 | 74.5 | 68.2 | 60.0 | 77.0 | 76.2 | 76.8 | 57.5 |
| Kimi K2.5 Thinking | 69.1 | 76.0 | 77.9 | 48.3 | 84.9 | 61.4 | 77.7 | 57.4 |
| GPT-5.4 mini | 67.5 | 72.5 | 71.6 | 47.5 | 78.6 | 71.0 | 71.5 | 60.3 |
| DeepSeek V4 Flash | 67.3 | 70.6 | 69.2 | 50.0 | 79.7 | 68.0 | 70.1 | 63.1 |
| Grok 4.3 | 66.7 | 70.8 | 69.9 | 50.0 | 84.3 | 55.8 | 73.6 | 62.8 |
| Grok 4.20 Beta | 68.0 | 75.3 | 66.1 | 43.3 | 87.1 | 62.9 | 77.7 | 63.4 |
| Grok 4.1 Fast | 60.0 | 58.4 | 63.6 | 40.0 | 78.4 | 61.4 | 71.2 | 47.3 |
| Grok 4 | 62.0 | 79.1 | 73.1 | 30.0 | 83.0 | 63.4 | 76.4 | 29.1 |
| Minimax M2.7 | 63.5 | 74.8 | 54.9 | 50.0 | 80.5 | 56.3 | 66.8 | 61.1 |
| Kimi K2 Thinking Turbo | 61.6 | 66.1 | 64.9 | 40.0 | 73.6 | 63.0 | 66.3 | 56.8 |
| Gemini 3.1 Flash-Lite | 61.7 | 59.7 | 68.5 | 33.3 | 73.6 | 54.9 | 73.2 | 68.6 |
| Minimax M2.5 | 60.1 | 59.3 | 70.7 | 51.7 | 77.4 | 49.6 | 55.1 | 57.2 |
| DeepSeek V3.2 Thinking | 62.2 | 65.3 | 58.4 | 41.7 | 78.0 | 65.9 | 75.4 | 51.1 |
| Gemini 2.5 Pro | 58.3 | 57.1 | 55.9 | 46.7 | 70.2 | 56.9 | 69.6 | 51.7 |
| GLM-4.7 | 58.1 | 60.1 | 57.2 | 36.7 | 69.6 | 57.5 | 68.8 | 56.8 |
| Claude Opus 4.5 | 59.1 | 67.1 | 64.8 | 40.0 | 67.8 | 56.5 | 63.0 | 54.2 |
| Claude Opus 4.1 | 54.5 | 59.3 | 56.8 | 30.0 | 62.9 | 52.0 | 58.7 | 61.8 |
| Claude Sonnet 4.5 | 53.7 | 58.9 | 56.5 | 38.3 | 61.3 | 52.8 | 59.6 | 48.5 |
| Gemini 2.5 Flash | 47.7 | 51.1 | 41.4 | 31.7 | 57.6 | 47.2 | 56.5 | 48.3 |
| Claude Haiku 4.5 | 45.3 | 52.2 | 43.5 | 26.7 | 54.1 | 42.5 | 51.4 | 47.0 |
• New #1: GPT-5.5 (80.7) edges out GPT-5.4 (80.3) and Gemini 3.1 Pro (79.9) at the top
• New Anthropic models: Claude Fable 5 (78.3) and Claude 4.8 Opus (77.2). Fable 5 is now Anthropic's top-tier model at $1.00/task
• Gemini 3.5 Flash: New Google model (75.0) slots just below Claude 4.6 Sonnet Thinking (75.5), available in Copilot at $0.17/task
• DeepSeek V4 Pro (73.6) competes with GPT-5.3 Codex at just $0.03/task
• Copilot billing changed June 1, 2026: Moved to token-based AI credits. Multipliers are gone — see appendix.
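The Global Avg column in the LiveBench table above appears to be the unweighted mean of the seven category scores. That is an inference from the numbers on this page, not a statement of LiveBench's official method; the sketch below checks it against three rows copied from the table.

```python
from statistics import mean

# (reported Global Avg, [Coding, Agentic, Data, Language, IF, Math, Reasoning])
# Values copied from the LiveBench table on this page.
rows = {
    "GPT-5.5 Thinking xHigh": (80.7, [87.7, 82.5, 56.7, 96.3, 81.1, 87.7, 73.0]),
    "GPT-5.4 Thinking xHigh": (80.3, [88.1, 77.5, 70.0, 94.2, 79.3, 82.6, 70.2]),
    "Gemini 3.1 Pro": (79.9, [84.0, 76.5, 65.0, 91.0, 78.5, 85.4, 79.1]),
}


def global_average(category_scores: list[float]) -> float:
    """Unweighted mean of the category scores, rounded to one decimal place."""
    return round(mean(category_scores), 1)


for name, (reported, scores) in rows.items():
    computed = global_average(scores)
    print(f"{name}: reported {reported}, computed {computed}")
```

Because every category counts equally under this assumption, a weak Data score drags the average down as much as a weak Coding score would. If you only care about coding, sort by the Coding and Agentic columns instead.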
Chatbot Arena Code
| Rank | Model | Elo Score | $/task | Copilot | Notes |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 thinking-32k | 1497 | $0.50 | ✓ | Thinking variant |
| 2 | GPT-5.2 high reasoning | 1470 | $0.23 | ✓ | High reasoning mode |
| 3 | Claude Opus 4.5 | 1468 | $0.50 | ✓ | Standard (non-thinking) |
| 4 | GLM-4.7 | 1440 | $0.05 | - | |
| 5 | Gemini 3 Flash | 1443 | $0.06 | ✓ | |
| 6 | GPT-5.2 | 1432 | $0.23 | ✓ | |
| 7 | Claude Opus 4.1 | 1431 | $1.50 | ✓ | |
| 8 | o3 | 1417 | $0.18 | - | |
| 9 | Minimax M2.1 preview | 1408 | $0.03 | - | |
| 10 | GPT-5 | 1407 | $0.16 | ✓ | |
| 11 | Grok 4.1 Fast | 1393 | - | - | |
| 12 | Claude Sonnet 4.5 | 1383 | $0.30 | ✓ | |
| 13 | GPT-4o | 1372 | $0.23 | - | |
| 14 | Gemini 2.5 Pro | 1372 | $0.16 | ✓ | |
| 15 | Kimi K2 Thinking Turbo | 1356 | $0.06 | - | |
| 16 | DeepSeek V4 Flash | 1350 | $0.01 | - | |
| 17 | Claude Haiku 4.5 | 1290 | $0.10 | ✓ | |
| 18 | GPT-4.1 | 1305 | $0.18 | - | |
Note: Arena Code data not refreshed this update (access issues). Data as of February 2026. DeepSeek V3.2 Reasoner renamed to DeepSeek V4 Flash.
What benchmarks don’t tell you
- Latency - high-scoring models can feel sluggish
- Consistency - benchmark runs are controlled; your prompts aren’t
- Your stack - generic benchmarks miss framework-specific quirks
- Cost at scale - 5% better might not justify 3x the price
The best benchmark is running a model on your own work for a day.
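To put the cost-at-scale point in numbers, one option is to divide $/task by the pass rate to get an expected spend per solved task. The sketch below uses three rows from the SWE-bench Verified table on this page and assumes $/task is the average cost per attempted task, which the source does not spell out.

```python
# (pass rate, $/task) copied from the SWE-bench Verified table on this page.
models = {
    "Claude Opus 4.5": (0.768, 0.50),
    "Gemini 3 Flash": (0.758, 0.06),
    "DeepSeek V4 Flash": (0.700, 0.01),
}


def cost_per_solved_task(pass_rate: float, usd_per_attempt: float) -> float:
    """Expected USD spent per solved task, assuming $/task is a per-attempt cost."""
    if pass_rate <= 0:
        raise ValueError("pass_rate must be positive")
    return usd_per_attempt / pass_rate


for name, (pass_rate, usd) in models.items():
    print(f"{name}: ${cost_per_solved_task(pass_rate, usd):.3f} per solved task")
```

This ignores the cost of your own time reviewing failed attempts, which for many teams outweighs the API bill.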
Other benchmarks
| Benchmark | What it tests | Notes |
|---|---|---|
| HumanEval | Python function completion | Classic but dated |
| MBPP | Basic Python problems | Also dated |
| CodeContests | Competitive programming | Harder, less realistic |
| LiveCodeBench | Fresh problems | livecodebench.github.io - avoids training contamination |
For day-to-day coding, SWE-bench and Aider are most relevant.
Appendix: GitHub Copilot — billing changed June 1, 2026
GitHub moved Copilot to usage-based AI credit billing on June 1, 2026. The old “premium request multiplier” system is now legacy-only: it applies only to Copilot Pro/Pro+ users still on annual plans that predate the change. For everyone else:
- 1 AI credit = $0.01 USD
- Models are priced per token (same rates as direct API access)
- The Copilot column in these tables now simply shows ✓ (available) or - (not available)
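With 1 AI credit = $0.01 USD, a dollar cost converts to credits by simple division. The sketch below assumes a task's token cost maps directly to credits with no rounding rules or minimum charges; GitHub's actual metering may differ. The function name is illustrative only.

```python
from decimal import Decimal

USD_PER_CREDIT = Decimal("0.01")  # from the list above: 1 AI credit = $0.01 USD


def usd_to_credits(usd: str) -> Decimal:
    """Convert a USD amount (given as a string to avoid float error) to AI credits."""
    return Decimal(usd) / USD_PER_CREDIT


# $/task values taken from the SWE-bench Verified table on this page.
for model, usd in [("Claude Opus 4.5", "0.50"), ("Gemini 3 Flash", "0.06")]:
    print(f"{model}: {usd_to_credits(usd)} credits per task")
```

Using `Decimal` rather than `float` keeps small amounts such as $0.07 from picking up binary rounding noise.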
Models available in Copilot as of June 2026:
| Provider | Models |
|---|---|
| OpenAI | GPT-5.5, GPT-5.4, GPT-5.4 mini, GPT-5.4 nano, GPT-5.3-Codex, GPT-5 mini |
| Anthropic | Claude Fable 5, Claude Opus 4.5–4.8, Claude Sonnet 4–4.6, Claude Haiku 4.5 |
| Google | Gemini 3.5 Flash, Gemini 3.1 Pro, Gemini 3 Flash, Gemini 2.5 Pro |
| Other | Raptor mini (GitHub), MAI-Code-1-Flash (Microsoft) |
Note: GPT-4o and GPT-4.1 are no longer listed in Copilot’s published model pricing as of June 2026.