# AI coding model comparison
📅 Data snapshot: December 2025
| Model | Vendor | Copilot | $/task | SWE-bench | Aider | Arena |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Anthropic | 3× | $0.50 | 74.4% | - | 1480 |
| Claude Opus 4.5 thinking | Anthropic | - | $0.50 | - | - | 1520 |
| Claude Opus 4.1 | Anthropic | 10× | $1.50 | 67.6% | - | - |
| Claude Sonnet 4.5 | Anthropic | 1× | $0.30 | 70.6% | - | 1387 |
| Claude Sonnet 4.5 thinking | Anthropic | - | $0.30 | - | - | 1393 |
| Claude 3.5 Sonnet | Anthropic | - | $0.30 | - | 84.2% | - |
| Claude Haiku 4.5 | Anthropic | 0.33× | $0.10 | - | - | 1290 |
| Claude 3.5 Haiku | Anthropic | - | $0.08 | - | 75.2% | - |
| Claude 3 Opus | Anthropic | - | $1.50 | - | 68.4% | - |
| Claude 3 Haiku | Anthropic | - | $0.025 | - | 47.4% | - |
| GPT-5.2 high | OpenAI | 1× | $0.23 | 71.8% | - | 1484 |
| GPT-5.2 | OpenAI | 1× | $0.23 | 69.0% | - | - |
| GPT-5 | OpenAI | 1× | $0.16 | 65.0% | - | - |
| GPT-4.1 | OpenAI | 0× | $0.18 | 39.6% | - | - |
| GPT-4o | OpenAI | 0× | $0.23 | - | 72.9% | - |
| GPT-4o-mini | OpenAI | - | $0.01 | - | 55.6% | - |
| GPT-5 mini | OpenAI | 0× | $0.03 | - | - | - |
| o1 | OpenAI | - | $1.35 | - | 84.2% | - |
| o3 | OpenAI | - | $0.18 | 58.4% | - | 1417 |
| o4-mini | OpenAI | - | $0.10 | 45.0% | - | - |
| Gemini 3 Pro | Google | 1× | $0.22 | 74.2% | - | 1478 |
| Gemini 3 Flash | Google | 0.33× | $0.06 | - | - | 1465 |
| Gemini 2.5 Pro | Google | 1× | $0.16 | 53.6% | - | - |
| Gemini 2.5 Flash | Google | - | $0.04 | 28.7% | - | - |
| Gemini 2.0 Flash | Google | - | $0.01 | - | - | - |
| DeepSeek Coder V2 | DeepSeek | - | - | - | 72.9% | - |
💡 Tip: the best value models combine a strong SWE-bench score with a low Copilot multiplier - free (0×) models cost nothing extra on a Copilot plan, while high SWE-bench scores mark the top performers. A sketch of that triage in code follows.
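Since a markdown table can't be sorted in place, here is a minimal Python sketch of the same triage (entries abridged; multipliers and SWE-bench scores copied from the table above):

```python
# (Copilot multiplier, SWE-bench %) per model, copied from the table above.
models = {
    "Claude Opus 4.5":   (3.00, 74.4),
    "Claude Sonnet 4.5": (1.00, 70.6),
    "GPT-5.2 high":      (1.00, 71.8),
    "GPT-4.1":           (0.00, 39.6),
    "Gemini 3 Pro":      (1.00, 74.2),
    "Gemini 3 Flash":    (0.33, None),  # no SWE-bench score listed
}

# Cheapest Copilot tier first; strongest SWE-bench score within each tier.
ranked = sorted(models.items(), key=lambda kv: (kv[1][0], -(kv[1][1] or 0)))
for name, (mult, swe) in ranked:
    print(f"{mult:>4}x  {swe if swe is not None else '-':>5}  {name}")
```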
## Column guide
| Column | What it means |
|---|---|
| Copilot | GitHub Copilot premium request multiplier (0× = free, 1× = standard, 3× = expensive, - = not available) |
| $/task | Estimated cost per task when calling the API directly, assuming 50K input + 10K output tokens (worked example below this table). Useful for comparing relative model costs; Copilot users pay via the multiplier instead. |
| SWE-bench | % of real GitHub issues the model can fix autonomously |
| Aider | % correct on multi-language code editing |
| Arena | Elo rating from human preference voting on web dev tasks |
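The $/task figures are simple arithmetic over per-million-token API prices. A minimal sketch of the calculation (the $3/$15 rates below are Anthropic's published Claude Sonnet 4.5 pricing; substitute each vendor's current rates):

```python
def cost_per_task(in_price_per_m: float, out_price_per_m: float,
                  in_tokens: int = 50_000, out_tokens: int = 10_000) -> float:
    """Estimated API cost of one task at the table's assumed token volumes."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# Claude Sonnet 4.5 at $3/M input and $15/M output:
print(f"${cost_per_task(3.00, 15.00):.2f}")  # $0.30, matching the table
```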
Data sources: SWE-bench · Aider · Chatbot Arena WebDev · GitHub Copilot
API pricing: Anthropic · OpenAI · Google
## Best value picks
Based on the data:
| Use case | Best value model | Why |
|---|---|---|
| Daily coding (Copilot) | Claude Sonnet 4.5 | 70% SWE-bench at 1× cost |
| Free in Copilot | GPT-4o | 73% Aider, costs nothing extra |
| Cheap in Copilot | Gemini 3 Flash | Arena #5 at 0.33× and only $0.06/task |
| When you need the best | Gemini 3 Pro | 74.2% SWE-bench at 1× - within 0.2 pts of Opus 4.5 at a third of the Copilot cost |
| Cheapest API | GPT-4o-mini | $0.01/task - over 20× cheaper than GPT-4o |
| Best $/performance | Gemini 2.5 Flash | $0.04/task for 28.7% SWE-bench - most SWE-bench points per dollar (see sketch below) |
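To sanity-check the last row: "best $/performance" here just means SWE-bench points per API dollar. A quick sketch using the table's own numbers:

```python
# (API $/task, SWE-bench %) from the table, for models with both values.
candidates = {
    "Gemini 2.5 Flash":  (0.04, 28.7),
    "Gemini 3 Pro":      (0.22, 74.2),
    "Claude Sonnet 4.5": (0.30, 70.6),
    "Claude Opus 4.5":   (0.50, 74.4),
}

for name, (cost, swe) in sorted(candidates.items(),
                                key=lambda kv: kv[1][1] / kv[1][0],
                                reverse=True):
    print(f"{name}: {swe / cost:.0f} SWE-bench points per dollar")
# Gemini 2.5 Flash leads at ~718 points/$, vs ~337 for Gemini 3 Pro
# and ~149 for Claude Opus 4.5.
```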