Benchmark Rankings

All 20 models ranked across 9 benchmarks.

Columns: Model, followed by the nine benchmarks:
MMLU: General knowledge
GPQA Diamond: Hard science Q&A
MATH: Math problem solving
AIME 2025: Advanced math
GSM8K: Grade-school math
HumanEval: Code generation
LiveCodeBench: Live coding
SWE-bench: Real-world software
HellaSwag: Commonsense inference

Scores appear in column order; benchmarks without a published result are omitted.

Claude Haiku 4.5 (Anthropic): 80.1%
Claude Opus 4.7 (Anthropic): 89.8%, 94.2%, 94.1%, 100%, 98.4%, 87.6%
Claude Sonnet 4.6 (Anthropic): 89.5%, 68.3%, 86.4%, 95.6%, 90.8%, 55.3%
DeepSeek R1 (DeepSeek): 90.8%, 71.5%, 83.5%, 97.3%, 85.7%, 49.2%
DeepSeek V3 (DeepSeek): 88.5%, 59.1%, 75.9%, 92.8%, 82.6%
DeepSeek V4 (DeepSeek): 87.5%, 90.1%, 88%, 93.5%, 80.6%
Gemini 2.5 Flash (Google): 85.8%, 74.1%, 93.8%, 83.2%
Gemini 3.1 Pro (Google): 90.3%, 94.3%, 88.7%, 96.4%, 88.4%, 91.7%, 80.6%
GLM-5.1 (Z.ai, new): 86.2%, 95.3%, 58.4%
GPT-4o (OpenAI): 88.7%, 53.6%, 76.6%, 95.3%, 87.1%, 38.4%
GPT-4o mini (OpenAI): 82%, 40.9%, 70.2%, 93.2%, 87%
Grok 3 (xAI): 85.7%, 56.7%, 76%
Llama 3.3 70B (Meta): 86%, 50.5%, 73.5%, 93.7%, 78.9%
Llama 4 Maverick (Meta): 88.2%, 58.3%, 79.1%, 32.1%
MiMo-V2.5-Pro (Xiaomi, new): 86.7%, 54%, 75%, 41%, 95.5%, 57.2%
MiniMax M2.7 (MiniMax): 89.4%, 93.2%, 56.2%
Mistral Large 2 (Mistral AI): 84%, 71.2%, 92%, 82%
o3 (OpenAI): 91.4%, 87.7%, 96.7%, 96.7%, 97.2%, 92.8%, 69.1%
o4-mini (OpenAI): 89.5%, 79.7%, 93.4%, 96.5%, 90.1%
Qwen3.6 Plus (Alibaba, new): 78.8%

Sources: provider technical reports & independent evaluations (AI Bytes, GPT0X Tracker, TokenCalculator, Precision AI Academy, BenchLM, Artificial Analysis). N/A = not yet published for this model. Results may vary across benchmark versions and evaluation setups.