AI coding model comparison
📅 Data snapshot: February 2026
| Model | Family | Copilot | $/task | SWE-bench | Aider | Arena | LiveBench |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 thinking-32k | Anthropic | - | $0.50 | 74.4% | 70.7% | 1497 | 76.0 |
| Claude Opus 4.5 | Anthropic | 3× | $0.50 | 74.4% | 70.7% | 1468 | 59.1 |
| Gemini 3 Pro | Google | 1× | $0.22 | 74.2% | - | 1454 | 73.4 |
| GPT-5.2 high reasoning | OpenAI | 1× | $0.53 | 71.8% | 88.0% | 1470 | 74.8 |
| Claude Sonnet 4.5 | Anthropic | 1× | $0.30 | 70.6% | 82.4% | 1383 | 53.7 |
| GPT-5.2 | OpenAI | 1× | $0.23 | 69.0% | 88.0% | 1432 | 48.9 |
| Claude Opus 4.1 | Anthropic | 10× | $1.50 | 67.6% | 82.1% | 1431 | 54.5 |
| GPT-5 | OpenAI | 1× | $0.16 | 65.0% | 88.0% | 1407 | 70.5 |
| Gemini 3 Flash | Google | 0.33× | $0.08 | 63.8% | - | 1443 | 72.4 |
| Kimi K2 Thinking Turbo | Moonshot | - | $0.06 | 63.4% | 59.1% | 1356 | 61.6 |
| Minimax M2 | Minimax | - | $0.03 | 61.0% | - | 1408 | - |
| DeepSeek V3.2 Reasoner | DeepSeek | - | $0.02 | 60.0% | 74.2% | 1350 | 62.2 |
| o3 | OpenAI | - | $0.18 | 58.4% | 84.9% | 1417 | - |
| GLM-4.6 | Zhipu | - | $0.05 | 55.4% | - | - | 55.2 |
| Devstral 2 | Mistral | - | - | 53.8% | - | 1363 | 41.2 |
| Gemini 2.5 Pro | Google | 1× | $0.16 | 53.6% | 83.1% | 1372 | 58.3 |
| Grok 4.1 Fast | xAI | 0.25× | - | - | - | 1393 | 60.0 |
| GPT-4o | OpenAI | 0× | $0.23 | 48.9% | 72.9% | 1372 | - |
| GLM-4.7 | Zhipu | - | $0.05 | - | - | 1440 | 58.1 |
| Minimax M2.1 preview | Minimax | - | $0.03 | - | - | 1408 | - |
| Claude Haiku 4.5 | Anthropic | 0.33× | $0.10 | 48.4% | 73.5% | 1290 | 45.3 |
| o4-mini | OpenAI | - | $0.10 | 45.0% | 75.4% | 1310 | - |
| GPT-4.1 | OpenAI | 0× | $0.18 | 39.6% | 66.4% | 1305 | - |
| DeepSeek V3.2 Chat | DeepSeek | - | $0.02 | 39.0% | 70.2% | 1287 | 51.8 |
| Gemini 2.5 Flash | Google | - | $0.04 | 28.7% | 68.0% | 1233 | 47.7 |
| Gemini 2.0 Flash | Google | - | $0.01 | 22.0% | 58.0% | 1214 | - |
| GPT-4o-mini | OpenAI | - | $0.01 | 18.6% | 55.6% | 1176 | - |
| GPT-5 mini | OpenAI | 0× | $0.03 | 14.2% | 50.2% | 1145 | - |
Column guide
| Column | What it means |
|---|---|
| Copilot | GitHub Copilot premium request multiplier (0× = free, 1× = standard, 3× = expensive, - = not available) |
| $/task | Estimated cost per task if using APIs directly (50K in + 10K out tokens). Useful for comparing relative model costs - Copilot users pay via the multiplier instead. |
| SWE-bench | % of real GitHub issues the model can fix autonomously (December 2025 data) |
| Aider | % correct on multi-language code editing (October 2025 data) |
| Arena | Elo rating from human preference voting on the Code category (February 2026 data) |
| LiveBench | Global average score across 23 diverse, contamination-free tasks (January 2026 data) |
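The $/task figures follow directly from each provider's per-million-token rates under the fixed 50K-input / 10K-output assumption above. A minimal sketch of that arithmetic (the example rates are illustrative, not quoted from any pricing page):

```python
def cost_per_task(in_price_per_m: float, out_price_per_m: float,
                  in_tokens: int = 50_000, out_tokens: int = 10_000) -> float:
    """Estimated API cost for one task, using the table's 50K-in / 10K-out assumption.

    Prices are in USD per million tokens.
    """
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# Illustrative rates: $1.25/M input, $10.00/M output (check the provider's pricing page).
print(f"${cost_per_task(1.25, 10.00):.2f}")  # -> $0.16
```

Because input tokens dominate (5× the output volume), a model's input rate drives most of the $/task number even when its output rate is several times higher.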
Data sources: SWE-bench (Dec 2025) · Aider (Oct 2025) · Arena Code (Feb 2026) · LiveBench (Jan 2026) · GitHub Copilot
API pricing: Anthropic · OpenAI · Google · DeepSeek · Zhipu (GLM)