Benchmark Rankings

All 20 models ranked across 9 benchmarks.

Model and provider	MMLU (general knowledge)	GPQA Diamond (hard science Q&A)	MATH (math problem solving)	AIME 2025 (advanced math)	GSM8K (grade-school math)	HumanEval (code generation)	LiveCodeBench (live coding)	SWE-bench (real-world software)	HellaSwag (commonsense inference)
Claude Haiku 4.5 Anthropic	80.1%	—	—	—	—	—	—	—	—
Claude Opus 4.7 Anthropic	89.8%	94.2%	94.1%	100%	98.4%	—	—	87.6%	—
Claude Sonnet 4.6 Anthropic	89.5%	68.3%	86.4%	—	95.6%	90.8%	—	55.3%	—
DeepSeek R1 DeepSeek	90.8%	71.5%	83.5%	—	97.3%	85.7%	—	49.2%	—
DeepSeek V3 DeepSeek	88.5%	59.1%	75.9%	—	92.8%	82.6%	—	—	—
DeepSeek V4 DeepSeek	87.5%	90.1%	88%	—	—	—	93.5%	80.6%	—
Gemini 2.5 Flash Google	85.8%	—	74.1%	—	93.8%	83.2%	—	—	—
Gemini 3.1 Pro Google	90.3%	94.3%	88.7%	—	96.4%	88.4%	91.7%	80.6%	—
GLM-5.1 Z.ai	—	86.2%	—	95.3%	—	—	—	58.4%	—
GPT-4o OpenAI	88.7%	53.6%	76.6%	—	95.3%	87.1%	—	38.4%	—
GPT-4o mini OpenAI	82%	40.9%	70.2%	—	93.2%	87%	—	—	—
Grok 3 xAI	85.7%	56.7%	76%	—	—	—	—	—	—
Llama 3.3 70B Meta	86%	50.5%	73.5%	—	93.7%	78.9%	—	—	—
Llama 4 Maverick Meta	88.2%	58.3%	79.1%	—	—	—	—	32.1%	—
MiMo-V2.5-Pro Xiaomi	86.7%	54%	75%	41%	95.5%	—	—	57.2%	—
MiniMax M2.7 MiniMax	89.4%	—	—	—	—	93.2%	—	56.2%	—
Mistral Large 2 Mistral AI	84%	—	71.2%	—	92%	82%	—	—	—
o3 OpenAI	91.4%	87.7%	96.7%	96.7%	97.2%	92.8%	—	69.1%	—
o4-mini OpenAI	89.5%	79.7%	93.4%	—	96.5%	90.1%	—	—	—
Qwen3.6 Plus Alibaba	—	—	—	—	—	—	—	78.8%	—

Sources: provider technical reports and independent evaluations (AI Bytes, GPT0X Tracker, TokenCalculator, Precision AI Academy, BenchLM, Artificial Analysis). A dash (—) means the score is not yet published for this model. Results may vary across benchmark versions and evaluation setups.
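Because many cells are empty, ranking by a single benchmark means dropping the models with no published score before sorting. The following is a minimal Python sketch of that, using five rows and three columns copied from the table. The helper names (`parse_score`, `parse_rows`, `rank_by`) are illustrative assumptions, not part of any tool behind this page.

```python
# Five rows copied from the table: model, MMLU, GPQA Diamond, SWE-bench.
# Cells are tab-separated; a dash marks a score that is not published.
ROWS = """\
Claude Opus 4.7\t89.8%\t94.2%\t87.6%
DeepSeek R1\t90.8%\t71.5%\t49.2%
Gemini 3.1 Pro\t90.3%\t94.3%\t80.6%
o3\t91.4%\t87.7%\t69.1%
Qwen3.6 Plus\t—\t—\t78.8%
"""
COLUMNS = ["MMLU", "GPQA Diamond", "SWE-bench"]


def parse_score(cell):
    """Convert a cell such as '89.8%' to a float, or None if it is missing."""
    cell = cell.strip()
    if cell in ("—", "N/A", ""):
        return None
    return float(cell.rstrip("%"))


def parse_rows(text):
    """Map each model name to a {benchmark: score or None} dictionary."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *cells = line.split("\t")
        table[name] = dict(zip(COLUMNS, (parse_score(c) for c in cells)))
    return table


def rank_by(table, benchmark):
    """Return (model, score) pairs, best first, skipping missing scores."""
    scored = [
        (name, scores[benchmark])
        for name, scores in table.items()
        if scores[benchmark] is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


table = parse_rows(ROWS)
for position, (name, score) in enumerate(rank_by(table, "SWE-bench"), start=1):
    print(f"{position}. {name}: {score}%")
```

Skipping missing scores, rather than treating them as zero, keeps a model without a published result from appearing to rank last. The resulting order is still only as comparable as the underlying evaluation setups.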