Benchmark Rankings
All 20 models compared across 9 benchmarks, listed alphabetically by model name.
| Model | Provider | MMLU (general knowledge) | GPQA Diamond (hard science Q&A) | MATH (math problem solving) | AIME 2025 (advanced math) | GSM8K (grade-school math) | HumanEval (code generation) | LiveCodeBench (live coding) | SWE-bench (real-world software) | HellaSwag (commonsense inference) |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | 80.1% | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Claude Opus 4.7 | Anthropic | 89.8% | 94.2% | 94.1% | 100% | 98.4% | N/A | N/A | 87.6% | N/A |
| Claude Sonnet 4.6 | Anthropic | 89.5% | 68.3% | 86.4% | N/A | 95.6% | 90.8% | N/A | 55.3% | N/A |
| DeepSeek R1 | DeepSeek | 90.8% | 71.5% | 83.5% | N/A | 97.3% | 85.7% | N/A | 49.2% | N/A |
| DeepSeek V3 | DeepSeek | 88.5% | 59.1% | 75.9% | N/A | 92.8% | 82.6% | N/A | N/A | N/A |
| DeepSeek V4 | DeepSeek | 87.5% | 90.1% | 88% | N/A | N/A | N/A | 93.5% | 80.6% | N/A |
| Gemini 2.5 Flash | Google | 85.8% | N/A | 74.1% | N/A | 93.8% | 83.2% | N/A | N/A | N/A |
| Gemini 3.1 Pro | Google | 90.3% | 94.3% | 88.7% | N/A | 96.4% | 88.4% | 91.7% | 80.6% | N/A |
| GLM-5.1 (NEW) | Z.ai | N/A | 86.2% | N/A | 95.3% | N/A | N/A | N/A | 58.4% | N/A |
| GPT-4o | OpenAI | 88.7% | 53.6% | 76.6% | N/A | 95.3% | 87.1% | N/A | 38.4% | N/A |
| GPT-4o mini | OpenAI | 82% | 40.9% | 70.2% | N/A | 93.2% | 87% | N/A | N/A | N/A |
| Grok 3 | xAI | 85.7% | 56.7% | 76% | N/A | N/A | N/A | N/A | N/A | N/A |
| Llama 3.3 70B | Meta | 86% | 50.5% | 73.5% | N/A | 93.7% | 78.9% | N/A | N/A | N/A |
| Llama 4 Maverick | Meta | 88.2% | 58.3% | 79.1% | N/A | N/A | N/A | N/A | 32.1% | N/A |
| MiMo-V2.5-Pro (NEW) | Xiaomi | 86.7% | 54% | 75% | 41% | 95.5% | N/A | N/A | 57.2% | N/A |
| MiniMax M2.7 | MiniMax | 89.4% | N/A | N/A | N/A | N/A | 93.2% | N/A | 56.2% | N/A |
| Mistral Large 2 | Mistral AI | 84% | N/A | 71.2% | N/A | 92% | 82% | N/A | N/A | N/A |
| o3 | OpenAI | 91.4% | 87.7% | 96.7% | 96.7% | 97.2% | 92.8% | N/A | 69.1% | N/A |
| o4-mini | OpenAI | 89.5% | 79.7% | 93.4% | N/A | 96.5% | 90.1% | N/A | N/A | N/A |
| Qwen3.6 Plus (NEW) | Alibaba | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 78.8% | N/A |
Sources: provider technical reports & independent evaluations (AI Bytes, GPT0X Tracker, TokenCalculator, Precision AI Academy, BenchLM, Artificial Analysis). N/A = not yet published for this model. Results may vary across benchmark versions and evaluation setups.
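
Because the table is listed alphabetically rather than by score, a ranking for any single benchmark has to be derived by sorting that column and leaving out models with no published result. The following is a minimal Python sketch of that step, using a handful of the SWE-bench values from the table. The `rank_by` helper, the use of `None` for unpublished scores, and the alphabetical tie-break are illustrative assumptions, not part of the original page.

```python
from typing import Optional

# SWE-bench scores (percent) copied from the table above.
# None stands in for a result that has not been published (an assumption of this sketch).
swe_bench: dict[str, Optional[float]] = {
    "Claude Opus 4.7": 87.6,
    "DeepSeek V4": 80.6,
    "Gemini 3.1 Pro": 80.6,
    "Qwen3.6 Plus": 78.8,
    "o3": 69.1,
    "GPT-4o": 38.4,
    "Grok 3": None,
}


def rank_by(scores: dict[str, Optional[float]]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score.

    Models without a published score are dropped rather than treated as zero.
    Ties are broken alphabetically by model name (hypothetical convention).
    """
    published = [(model, score) for model, score in scores.items() if score is not None]
    return sorted(published, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for position, (model, score) in enumerate(rank_by(swe_bench), start=1):
        print(f"{position}. {model}: {score}%")
```

Dropping unpublished results instead of scoring them as zero keeps a model with missing data from appearing to rank last. Since results may vary across benchmark versions and evaluation setups, small gaps between adjacent entries should be read with caution.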