📅 Snapshot: June 2026

This page collates benchmark data from independent sources to help you compare models. These aren’t my benchmarks - I’m just pulling highlights so you don’t have to tab between sites.

For the latest data, always check the original sources. Data current as of: SWE-bench (February 2026), Aider (June 2025), LiveBench (June 2026), Arena Code (February 2026).


SWE-bench Verified

Source: swebench.com (February 2026) · Tests whether models can fix real GitHub issues · Standardized harness: mini-SWE-agent v2.0.0, high reasoning mode where available

| Model | Score | $/task | Copilot |
| --- | --- | --- | --- |
| Claude Opus 4.5 | 76.8% | $0.50 | ✓ |
| Minimax M2.5 | 75.8% | $0.07 | - |
| Gemini 3 Flash | 75.8% | $0.06 | ✓ |
| Claude Opus 4.6 | 75.6% | $0.50 | ✓ |
| GPT-5.2 (high reasoning) | 72.8% | $0.23 | ✓ |
| GLM-5 | 72.8% | $0.05 | - |
| GPT-5.2 | 72.8% | $0.23 | ✓ |
| Claude Sonnet 4.5 | 71.4% | $0.30 | ✓ |
| Kimi K2.5 | 70.8% | $0.15 | - |
| DeepSeek V4 Flash | 70.0% | $0.01 | - |
| Gemini 3.1 Pro | 69.6% | $0.22 | ✓ |
| Claude Opus 4.1 | 67.6% | $1.50 | ✓ |
| Claude Haiku 4.5 | 66.6% | $0.10 | ✓ |
| GPT-5 | 65.0% | $0.16 | ✓ |
| Kimi K2 Thinking Turbo | 63.4% | $0.06 | - |
| GPT-5 mini | 56.2% | $0.03 | ✓ |
| Gemini 2.5 Pro | 53.6% | $0.16 | ✓ |

$/task = cost to solve one benchmark task via direct API (based on token usage × provider pricing). Copilot = available in GitHub Copilot (✓ = yes, token-based AI credit billing since Jun 2026).
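
The $/task definition above boils down to token counts multiplied by per-token prices. Here is a minimal sketch, assuming separate input and output prices quoted per million tokens. The function name, token counts, and prices are made-up illustration values, not figures from swebench.com.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one task: tokens used times the per-million-token price."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000

# Hypothetical run: 400k input tokens at $1.00/M, 20k output tokens at $5.00/M.
print(f"${cost_per_task(400_000, 20_000, 1.00, 5.00):.2f}")
```

Agentic runs tend to be dominated by input tokens, because the conversation history is resent on every step, so the input price often matters more than the output price.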

Takeaway: Scores across the board are higher with the standardized harness. Claude Opus 4.5 leads at 76.8%, but Minimax M2.5 (75.8%, $0.07) and Gemini 3 Flash (75.8%, $0.06) are right behind — at a fraction of the cost. DeepSeek V4 Flash (70.0%, $0.01) is the extreme budget option. Note: Gemini 2.0 Flash has been shut down (June 1, 2026). DeepSeek V3.2 Reasoner was renamed DeepSeek V4 Flash — same API, new name.
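
One way to make "a fraction of the cost" concrete is to divide $/task by the solve rate, giving a rough cost per fixed issue. This sketch assumes $/task is an average over all attempted tasks, failed ones included, which the definition above does not spell out. The scores and prices are copied from the SWE-bench Verified table.

```python
# (score %, $/task) copied from the SWE-bench Verified table above.
models = {
    "Claude Opus 4.5": (76.8, 0.50),
    "Minimax M2.5": (75.8, 0.07),
    "Gemini 3 Flash": (75.8, 0.06),
    "DeepSeek V4 Flash": (70.0, 0.01),
}

def cost_per_solved(score_pct: float, usd_per_task: float) -> float:
    """Expected spend per solved task, assuming failed attempts cost the same as successes."""
    return usd_per_task / (score_pct / 100)

# Cheapest per solved task first.
for name, (score, cost) in sorted(models.items(), key=lambda kv: cost_per_solved(*kv[1])):
    print(f"{name:<18} ${cost_per_solved(score, cost):.3f} per solved task")
```

Under that assumption, Claude Opus 4.5 works out to roughly $0.65 per solved task against about $0.08 for Gemini 3 Flash, so the one-point score lead costs around eight times as much per fix.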

Aider Polyglot

Source: aider.chat/docs/leaderboards (June 2025) · Tests code editing across C++, Go, Java, JavaScript, Python, Rust
Note: Aider's latest entries run up to June 2025. Includes GPT-5, Claude 4.x, Gemini 2.5 Pro, and Grok 4 variants.

| Model | % Correct | Copilot |
| --- | --- | --- |
| GPT-5 (high reasoning) | 88.0% | ✓ |
| o3-pro (high) | 84.9% | - |
| Gemini 2.5 Pro 06-05 (32k think) | 83.1% | ✓ |
| Claude Sonnet 4.5 | 82.4% | ✓ |
| Claude Opus 4.1 | 82.1% | ✓ |
| o3 (high) | 81.3% | - |
| Grok 4 (high) | 79.6% | - |
| DeepSeek V4 Flash (Reasoner) | 74.2% | - |
| Claude Haiku 4.5 | 73.5% | ✓ |
| o4-mini | 72.0% | - |
| Claude Opus 4.5 | 70.7% | ✓ |
| DeepSeek V4 Flash (Chat) | 70.2% | - |
| Kimi K2 | 59.1% | - |
| Claude Sonnet 4 | 56.4% | ✓ |
| Gemini 2.5 Flash (thinking) | 55.1% | - |
| DeepSeek V3 (0324) | 55.1% | - |
| Grok 3 Beta | 53.3% | - |
| GPT-4.1 | 52.4% | - |
| Grok 3 Mini Beta (high) | 49.3% | - |
| GPT-5 mini | 50.2% | ✓ |

Takeaway: GPT-5 (high reasoning) still leads at 88.0%, followed by o3-pro (84.9%) and Gemini 2.5 Pro 06-05 with thinking (83.1%). Claude Sonnet 4.5 (82.4%) remains the practical choice. DeepSeek V4 Flash is V3.2 rebranded, with the same strong scores (74.2% Reasoner, 70.2% Chat). Claude Sonnet 4 without thinking (56.4%) lands well below the reasoning-mode entries, which suggests thinking tokens really do matter for Aider tasks.

LiveBench

Source: livebench.ai (June 2026) · Contamination-free benchmark with 23 diverse tasks

What it is: 23 tasks spanning Coding, Agentic Coding, Data Analysis, Language, Instruction Following (IF), Math, and Reasoning. Questions refresh every 6 months and are released publicly only after a delay, which limits training contamination. Scores use objective ground-truth answers, not LLM judges.

Why it matters: Most benchmarks risk contamination (models may have trained on the test data). LiveBench addresses this with regular question rotation and delayed public release. The Global Average gives a single score across all seven categories, so a model that is strong in only one area does not rank artificially high.
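
The Global Avg column in the table below appears to be a plain unweighted mean of the seven category scores. That is an inference from the published numbers, not something livebench.ai is quoted as saying here. A quick check on two rows:

```python
from statistics import mean

# Listed Global Avg, then category scores in table order:
# Coding, Agentic, Data, Language, IF, Math, Reasoning.
rows = {
    "GPT-5.5 Thinking xHigh": (80.7, [87.7, 82.5, 56.7, 96.3, 81.1, 87.7, 73.0]),
    "Gemini 3.1 Pro": (79.9, [84.0, 76.5, 65.0, 91.0, 78.5, 85.4, 79.1]),
}

for name, (listed, categories) in rows.items():
    print(f"{name}: listed {listed}, recomputed {mean(categories):.1f}")
```

A few rows (GLM-5 and Qwen 3.6 Plus, for example) come out about 0.1 off when recomputed this way, presumably because the category scores shown are already rounded.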

| Model | Global Avg | Coding | Agentic | Data | Language | IF | Math | Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 Thinking xHigh | 80.7 | 87.7 | 82.5 | 56.7 | 96.3 | 81.1 | 87.7 | 73.0 |
| GPT-5.4 Thinking xHigh | 80.3 | 88.1 | 77.5 | 70.0 | 94.2 | 79.3 | 82.6 | 70.2 |
| Gemini 3.1 Pro | 79.9 | 84.0 | 76.5 | 65.0 | 91.0 | 78.5 | 85.4 | 79.1 |
| Claude Fable 5 Thinking xHigh | 78.3 | 87.3 | 78.6 | 60.0 | 93.9 | 80.0 | 88.5 | 60.0 |
| Claude 4.8 Opus Thinking xHigh | 77.2 | 89.7 | 79.3 | 60.0 | 84.3 | 78.3 | 81.4 | 67.5 |
| Claude 4.7 Opus Thinking xHigh | 76.9 | 87.7 | 82.1 | 60.0 | 93.1 | 78.3 | 77.9 | 59.3 |
| Claude 4.6 Opus Thinking | 76.3 | 88.7 | 78.2 | 61.7 | 89.3 | 69.9 | 83.3 | 63.3 |
| Claude 4.5 Opus Thinking High | 76.0 | 80.1 | 79.7 | 63.3 | 90.4 | 74.4 | 81.3 | 62.6 |
| Claude 4.6 Sonnet Thinking | 75.5 | 84.8 | 79.3 | 60.0 | 87.0 | 78.0 | 76.1 | 63.2 |
| Gemini 3.5 Flash High | 75.0 | 82.0 | 78.2 | 51.7 | 88.2 | 64.9 | 84.6 | 75.6 |
| GPT-5.2 high reasoning | 74.8 | 83.2 | 76.1 | 51.7 | 93.2 | 78.2 | 79.8 | 61.8 |
| Qwen 3.7 Max | 74.3 | 83.3 | 74.2 | 51.7 | 85.3 | 71.8 | 79.7 | 74.0 |
| GPT-5.1 Codex Max | 74.0 | 83.7 | 80.7 | 53.3 | 83.2 | 70.1 | 76.5 | 70.4 |
| DeepSeek V4 Pro | 73.6 | 82.7 | 70.0 | 56.7 | 90.7 | 74.5 | 78.1 | 62.4 |
| GPT-5.3 Codex High | 72.8 | 80.2 | 78.2 | 55.0 | 87.8 | 62.7 | 80.1 | 65.4 |
| Gemini 3 Flash | 72.4 | 76.3 | 71.8 | 56.7 | 86.6 | 75.6 | 81.2 | 58.5 |
| Kimi K2.6 Thinking | 72.2 | 79.4 | 78.6 | 58.3 | 84.3 | 65.1 | 75.1 | 64.4 |
| GPT-5.1 | 72.0 | 78.8 | 72.5 | 53.3 | 86.9 | 69.6 | 79.3 | 63.9 |
| GLM-5 | 68.9 | 69.1 | 73.6 | 55.0 | 83.5 | 67.9 | 77.5 | 55.3 |
| GPT-5 | 70.5 | 77.5 | 68.9 | 45.0 | 86.4 | 75.1 | 77.2 | 63.4 |
| Qwen 3.6 Plus | 70.9 | 75.8 | 78.2 | 55.0 | 83.7 | 69.9 | 75.0 | 58.3 |
| GPT-5.4 nano | 70.1 | 81.1 | 72.1 | 49.1 | 91.3 | 67.6 | 62.5 | 67.2 |
| Minimax M3 | 70.0 | 74.5 | 68.2 | 60.0 | 77.0 | 76.2 | 76.8 | 57.5 |
| Kimi K2.5 Thinking | 69.1 | 76.0 | 77.9 | 48.3 | 84.9 | 61.4 | 77.7 | 57.4 |
| GPT-5.4 mini | 67.5 | 72.5 | 71.6 | 47.5 | 78.6 | 71.0 | 71.5 | 60.3 |
| DeepSeek V4 Flash | 67.3 | 70.6 | 69.2 | 50.0 | 79.7 | 68.0 | 70.1 | 63.1 |
| Grok 4.3 | 66.7 | 70.8 | 69.9 | 50.0 | 84.3 | 55.8 | 73.6 | 62.8 |
| Grok 4.20 Beta | 68.0 | 75.3 | 66.1 | 43.3 | 87.1 | 62.9 | 77.7 | 63.4 |
| Grok 4.1 Fast | 60.0 | 58.4 | 63.6 | 40.0 | 78.4 | 61.4 | 71.2 | 47.3 |
| Grok 4 | 62.0 | 79.1 | 73.1 | 30.0 | 83.0 | 63.4 | 76.4 | 29.1 |
| Minimax M2.7 | 63.5 | 74.8 | 54.9 | 50.0 | 80.5 | 56.3 | 66.8 | 61.1 |
| Kimi K2 Thinking Turbo | 61.6 | 66.1 | 64.9 | 40.0 | 73.6 | 63.0 | 66.3 | 56.8 |
| Gemini 3.1 Flash-Lite | 61.7 | 59.7 | 68.5 | 33.3 | 73.6 | 54.9 | 73.2 | 68.6 |
| Minimax M2.5 | 60.1 | 59.3 | 70.7 | 51.7 | 77.4 | 49.6 | 55.1 | 57.2 |
| DeepSeek V3.2 Thinking | 62.2 | 65.3 | 58.4 | 41.7 | 78.0 | 65.9 | 75.4 | 51.1 |
| Gemini 2.5 Pro | 58.3 | 57.1 | 55.9 | 46.7 | 70.2 | 56.9 | 69.6 | 51.7 |
| GLM-4.7 | 58.1 | 60.1 | 57.2 | 36.7 | 69.6 | 57.5 | 68.8 | 56.8 |
| Claude Opus 4.5 | 59.1 | 67.1 | 64.8 | 40.0 | 67.8 | 56.5 | 63.0 | 54.2 |
| Claude Opus 4.1 | 54.5 | 59.3 | 56.8 | 30.0 | 62.9 | 52.0 | 58.7 | 61.8 |
| Claude Sonnet 4.5 | 53.7 | 58.9 | 56.5 | 38.3 | 61.3 | 52.8 | 59.6 | 48.5 |
| Gemini 2.5 Flash | 47.7 | 51.1 | 41.4 | 31.7 | 57.6 | 47.2 | 56.5 | 48.3 |
| Claude Haiku 4.5 | 45.3 | 52.2 | 43.5 | 26.7 | 54.1 | 42.5 | 51.4 | 47.0 |

⚡ Key takeaways:

  • New #1: GPT-5.5 (80.7) edges out GPT-5.4 (80.3) and Gemini 3.1 Pro (79.9) at the top
  • New Anthropic models: Claude Fable 5 (78.3) and Claude 4.8 Opus (77.2). Fable 5 is now Anthropic's top-tier model at $1.00/task
  • Gemini 3.5 Flash: New Google model (75.0) slots in just below Claude 4.6 Sonnet Thinking (75.5), available in Copilot at $0.17/task
  • DeepSeek V4 Pro (73.6) competes with GPT-5.3 Codex High (72.8) at just $0.03/task
  • Copilot billing changed June 1, 2026: Moved to token-based AI credits. Multipliers are gone (see the appendix).

Chatbot Arena Code

Source: lmarena.ai Code category (February 2026) · Human preference voting on coding tasks
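
| Rank | Model | Elo Score | $/task | Copilot | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.5 thinking-32k | 1497 | $0.50 | ✓ | Thinking variant |
| 2 | GPT-5.2 high reasoning | 1470 | $0.23 | ✓ | High reasoning mode |
| 3 | Claude Opus 4.5 | 1468 | $0.50 | ✓ | Standard (non-thinking) |
| 4 | GLM-4.7 | 1440 | $0.05 | - | |
| 5 | Gemini 3 Flash | 1443 | $0.06 | ✓ | |
| 6 | GPT-5.2 | 1432 | $0.23 | ✓ | |
| 7 | Claude Opus 4.1 | 1431 | $1.50 | ✓ | |
| 8 | o3 | 1417 | $0.18 | - | |
| 9 | Minimax M2.1 preview | 1408 | $0.03 | - | |
| 10 | GPT-5 | 1407 | $0.16 | ✓ | |
| 11 | Grok 4.1 Fast | 1393 | - | - | |
| 12 | Claude Sonnet 4.5 | 1383 | $0.30 | ✓ | |
| 13 | GPT-4o | 1372 | $0.23 | - | |
| 14 | Gemini 2.5 Pro | 1372 | $0.16 | ✓ | |
| 15 | Kimi K2 Thinking Turbo | 1356 | $0.06 | - | |
| 16 | DeepSeek V4 Flash | 1350 | $0.01 | - | |
| 17 | Claude Haiku 4.5 | 1290 | $0.10 | ✓ | |
| 18 | GPT-4.1 | 1305 | $0.18 | - | |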

Note: Arena Code data not refreshed this update (access issues). Data as of February 2026. DeepSeek V3.2 Reasoner renamed to DeepSeek V4 Flash.

"Thinking" variants are labeled explicitly. Claude Opus 4.5 thinking-32k (rank 1, 1497 Elo) does explicit reasoning passes. The standard Opus 4.5 (rank 3, 1468 Elo) is still excellent but slightly lower. Both cost $0.50/task but thinking models are slower and burn more tokens on complex tasks.
Takeaway: The top tier is tightly packed (1468-1497 Elo). For budget: GLM-4.7 (1440 Elo) at $0.05/task or Minimax M2.1 preview (1408) at $0.03/task punch way above their weight. Because this data is from February 2026, newer models (Claude 4.6, GPT-5.4, Gemini 3.1 Pro) are not represented here.
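
To put those Elo gaps in perspective, the standard Elo formula converts a rating difference into an expected head-to-head win rate. This sketch assumes the conventional 400-point logistic scale. Whether lmarena.ai's scores map onto exactly this formula is an assumption, so treat the output as a rough guide.

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Expected share of votes for A over B under the standard Elo model (400-point scale)."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# Ratings taken from the Arena Code table above.
print(f"{expected_win_rate(1497, 1468):.1%}")  # Opus 4.5 thinking-32k vs Opus 4.5
print(f"{expected_win_rate(1497, 1290):.1%}")  # Opus 4.5 thinking-32k vs Claude Haiku 4.5
```

Under this model, the 29-point gap between rank 1 and rank 3 means the leader is preferred only about 54% of the time, close to a coin flip. The 207-point gap to Claude Haiku 4.5 corresponds to roughly 77%.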

What benchmarks don’t tell you

  • Latency - high-scoring models can feel sluggish
  • Consistency - benchmark runs are controlled; your prompts aren’t
  • Your stack - generic benchmarks miss framework-specific quirks
  • Cost at scale - 5% better might not justify 3x the price

The best benchmark is running a model on your own work for a day.
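
A minimal sketch of such a trial, covering the latency and consistency points above: run each of your own tasks several times, then report the pass rate and median latency. `evaluate`, `fake_model`, and the sample tasks are hypothetical stand-ins. Swap in a real API call and checks drawn from your own codebase.

```python
import statistics
import time
from typing import Callable

def evaluate(run_model: Callable[[str], str],
             tasks: list[tuple[str, Callable[[str], bool]]],
             repeats: int = 3) -> dict[str, float]:
    """Run every (prompt, check) pair `repeats` times; report pass rate and median latency."""
    latencies: list[float] = []
    passes = 0
    for prompt, check in tasks:
        for _ in range(repeats):
            start = time.perf_counter()
            answer = run_model(prompt)
            latencies.append(time.perf_counter() - start)
            passes += bool(check(answer))
    return {
        "pass_rate": passes / len(latencies),
        "median_latency_s": statistics.median(latencies),
    }

# Hypothetical stand-in for a real model call.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unsure"

tasks = [
    ("What is 2 + 2?", lambda a: a.strip() == "4"),
    ("Name the capital of France.", lambda a: "Paris" in a),
]
print(evaluate(fake_model, tasks))
```

Repeating each task matters because real models are not deterministic: a prompt that passes once and fails twice is a consistency problem that a single run would hide.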


Other benchmarks

| Benchmark | What it tests | Notes |
| --- | --- | --- |
| HumanEval | Python function completion | Classic but dated |
| MBPP | Basic Python problems | Also dated |
| CodeContests | Competitive programming | Harder, less realistic |
| LiveCodeBench | Fresh problems | livecodebench.github.io - avoids training contamination |

For day-to-day coding, SWE-bench and Aider are most relevant.


Appendix: GitHub Copilot — billing changed June 1, 2026

GitHub moved Copilot to usage-based AI credit billing on June 1, 2026. The old “premium request multiplier” system is now legacy-only (affects only Copilot Pro/Pro+ users who were on existing annual plans). For everyone else:

  • 1 AI credit = $0.01 USD
  • Models are priced per token (same rates as direct API access)
  • The Copilot column in these tables now simply shows ✓ (available) or - (not available)
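
With the rates listed above, converting a dollar cost into AI credits is a division by $0.01. This is a small sketch using $/task figures from the SWE-bench Verified table as example inputs. It assumes Copilot charges exactly the direct-API cost with no rounding or minimum charge, which the list above does not confirm.

```python
USD_PER_CREDIT = 0.01  # 1 AI credit = $0.01 USD, per the list above

def usd_to_credits(usd: float) -> float:
    """Convert a dollar amount into Copilot AI credits."""
    return usd / USD_PER_CREDIT

# $/task values from the SWE-bench Verified table.
for model, usd in [("Claude Opus 4.5", 0.50), ("Gemini 3 Flash", 0.06)]:
    print(f"{model}: about {usd_to_credits(usd):.0f} credits per task")
```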

Models available in Copilot as of June 2026:

| Provider | Models |
| --- | --- |
| OpenAI | GPT-5.5, GPT-5.4, GPT-5.4 mini, GPT-5.4 nano, GPT-5.3-Codex, GPT-5 mini |
| Anthropic | Claude Fable 5, Claude Opus 4.5–4.8, Claude Sonnet 4–4.6, Claude Haiku 4.5 |
| Google | Gemini 3.5 Flash, Gemini 3.1 Pro, Gemini 3 Flash, Gemini 2.5 Pro |
| Other | Raptor mini (GitHub), MAI-Code-1-Flash (Microsoft) |

Note: GPT-4o and GPT-4.1 are no longer listed in Copilot’s published model pricing as of June 2026.
