# AI coding model comparison
📅 Data snapshot: December 2025
| Model | Vendor | Copilot | $/task | SWE-bench | Aider | Arena |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Anthropic | 3× | $0.50 | 74.4% | - | 1480 |
| Claude Opus 4.5 thinking | Anthropic | - | $0.50 | - | - | 1520 |
| Claude Opus 4.1 | Anthropic | 10× | $1.50 | 67.6% | - | - |
| Claude Sonnet 4.5 | Anthropic | 1× | $0.30 | 70.6% | - | 1387 |
| Claude Sonnet 4.5 thinking | Anthropic | - | $0.30 | - | - | 1393 |
| Claude 3.5 Sonnet | Anthropic | - | $0.30 | - | 84.2% | - |
| Claude Haiku 4.5 | Anthropic | 0.33× | $0.10 | - | - | 1290 |
| Claude 3.5 Haiku | Anthropic | - | $0.08 | - | 75.2% | - |
| Claude 3 Opus | Anthropic | - | $1.50 | - | 68.4% | - |
| Claude 3 Haiku | Anthropic | - | $0.025 | - | 47.4% | - |
| GPT-5.2 high | OpenAI | 1× | $0.23 | 71.8% | - | 1484 |
| GPT-5.2 | OpenAI | 1× | $0.23 | 69.0% | - | - |
| GPT-5 | OpenAI | 1× | $0.16 | 65.0% | - | - |
| GPT-4.1 | OpenAI | 0× | $0.18 | 39.6% | - | - |
| GPT-4o | OpenAI | 0× | $0.23 | - | 72.9% | - |
| GPT-4o-mini | OpenAI | - | $0.01 | - | 55.6% | - |
| GPT-5 mini | OpenAI | 0× | $0.03 | - | - | - |
| o1 | OpenAI | - | $1.35 | - | 84.2% | - |
| o3 | OpenAI | - | $0.18 | 58.4% | - | 1417 |
| o4-mini | OpenAI | - | $0.10 | 45.0% | - | - |
| Gemini 3 Pro | Google | 1× | $0.22 | 74.2% | - | 1478 |
| Gemini 3 Flash | Google | 0.33× | $0.06 | - | - | 1465 |
| Gemini 2.5 Pro | Google | 1× | $0.16 | 53.6% | - | - |
| Gemini 2.5 Flash | Google | - | $0.04 | 28.7% | - | - |
| Gemini 2.0 Flash | Google | - | $0.01 | - | - | - |
| DeepSeek Coder V2 | DeepSeek | - | - | - | 72.9% | - |
💡 Tip: the best value models combine a strong SWE-bench score with a low Copilot multiplier - free (0×) models cost nothing extra on a Copilot plan, while high SWE-bench scores mark the top performers. A sketch of that triage in code follows.
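Since a markdown table can't be sorted in place, here is a minimal Python sketch of the same triage (entries abridged; multipliers and SWE-bench scores copied from the table above):

```python
# (Copilot multiplier, SWE-bench %) per model, copied from the table above.
models = {
    "Claude Opus 4.5":   (3.00, 74.4),
    "Claude Sonnet 4.5": (1.00, 70.6),
    "GPT-5.2 high":      (1.00, 71.8),
    "GPT-4.1":           (0.00, 39.6),
    "Gemini 3 Pro":      (1.00, 74.2),
    "Gemini 3 Flash":    (0.33, None),  # no SWE-bench score listed
}

# Cheapest Copilot tier first; strongest SWE-bench score within each tier.
ranked = sorted(models.items(), key=lambda kv: (kv[1][0], -(kv[1][1] or 0)))
for name, (mult, swe) in ranked:
    print(f"{mult:>4}x  {swe if swe is not None else '-':>5}  {name}")
```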
## Column guide
| Column | What it means |
|---|---|
| Copilot | GitHub Copilot premium request multiplier (0× = free, 1× = standard, 3× = expensive, - = not available) |
| $/task | Estimated cost per task when calling the API directly, assuming 50K input + 10K output tokens (worked example below this table). Useful for comparing relative model costs; Copilot users pay via the multiplier instead. |
| SWE-bench | % of real GitHub issues the model can fix autonomously |
| Aider | % correct on multi-language code editing |
| Arena | Elo rating from human preference voting on web dev tasks |
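The $/task figures are simple arithmetic over per-million-token API prices. A minimal sketch of the calculation (the $3/$15 rates below are Anthropic's published Claude Sonnet 4.5 pricing; substitute each vendor's current rates):

```python
def cost_per_task(in_price_per_m: float, out_price_per_m: float,
                  in_tokens: int = 50_000, out_tokens: int = 10_000) -> float:
    """Estimated API cost of one task at the table's assumed token volumes."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# Claude Sonnet 4.5 at $3/M input and $15/M output:
print(f"${cost_per_task(3.00, 15.00):.2f}")  # $0.30, matching the table
```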
Data sources: SWE-bench · Aider · Chatbot Arena WebDev · GitHub Copilot
API pricing: Anthropic · OpenAI · Google
## Best value picks
Based on the data:
| Use case | Best value model | Why |
|---|---|---|
| Daily coding (Copilot) | Claude Sonnet 4.5 | 70% SWE-bench at 1× cost |
| Free in Copilot | GPT-4o | 73% Aider, costs nothing extra |
| Cheap in Copilot | Gemini 3 Flash | Arena #5 at 0.33× and only $0.06/task |
| When you need the best | Gemini 3 Pro | 74.2% SWE-bench at 1× - within 0.2 pts of Opus 4.5 at a third of the Copilot cost |
| Cheapest API | GPT-4o-mini | $0.01/task - over 20× cheaper than GPT-4o |
| Best $/performance | Gemini 2.5 Flash | $0.04/task for 28.7% SWE-bench - most SWE-bench points per dollar (see sketch below) |
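To sanity-check the last row: "best $/performance" here just means SWE-bench points per API dollar. A quick sketch using the table's own numbers:

```python
# (API $/task, SWE-bench %) from the table, for models with both values.
candidates = {
    "Gemini 2.5 Flash":  (0.04, 28.7),
    "Gemini 3 Pro":      (0.22, 74.2),
    "Claude Sonnet 4.5": (0.30, 70.6),
    "Claude Opus 4.5":   (0.50, 74.4),
}

for name, (cost, swe) in sorted(candidates.items(),
                                key=lambda kv: kv[1][1] / kv[1][0],
                                reverse=True):
    print(f"{name}: {swe / cost:.0f} SWE-bench points per dollar")
# Gemini 2.5 Flash leads at ~718 points/$, vs ~337 for Gemini 3 Pro
# and ~149 for Claude Opus 4.5.
```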