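Evaluation Leaderboard
Dedicated to exploring the most advanced large models and providing industry and the research community with a comprehensive, objective, and neutral evaluation reference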
Rank | Model | Overall Score | WTQ | PersonRelQA | ReportFixer | MedQA | AffairQA | BioTextQA | MatTextQA | PharmKGQA | ChineseLawFact | VersiCode |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Grok 3 | 55.82% | 76.50% | 4.70% | 77.80% | 49.00% | 45.50% | 80.00% | 64.29% | 42.11% | 54.25% | 64.00% |
2 | QWQ-32B | 50.65% | 70.50% | 3.00% | 32.30% | 78.30% | 45.00% | 76.67% | 62.38% | 45.67% | 69.00% | 23.70% |
3 | Hunyuan-turbo | 50.10% | 55.10% | 1.40% | 2.20% | 84.50% | 43.00% | 85.71% | 60.95% | 32.52% | 83.87% | 51.70% |
4 | Qwen 2.5-72B | 50.02% | 65.50% | 2.50% | 38.90% | 59.50% | 45.00% | 81.43% | 62.86% | 38.09% | 70.50% | 35.90% |
5 | GPT-4o | 48.49% | 69.40% | 3.20% | 44.70% | 59.00% | 41.00% | 43.81% | 61.43% | 39.23% | 56.63% | 66.50% |
6 | DeepSeek-R1-671B | 47.36% | 74.30% | 6.80% | 59.70% | 48.00% | 45.50% | 33.81% | 50.48% | 31.37% | 58.00% | 65.60% |
7 | DeepSeek V3 | 45.83% | 69.90% | 2.60% | 57.90% | 59.50% | 42.50% | 55.71% | 39.90% | 39.04% | 53.87% | 37.40% |
8 | Llama3.1-70B | 44.57% | 47.70% | 2.20% | 24.20% | 27.00% | 40.00% | 88.57% | 71.43% | 34.33% | 59.38% | 50.90% |
9 | Doubao-pro | 44.54% | 46.00% | 0.00% | 25.30% | 53.00% | 40.00% | 83.33% | 50.00% | 27.14% | 57.50% | 63.10% |
10 | GLM4-9B | 41.24% | 39.10% | 0.20% | 6.60% | 46.50% | 38.50% | 80.95% | 58.10% | 17.70% | 66.25% | 58.50% |
11 | Claude 3.7 Sonnet | 38.48% | 28.30% | 0.50% | 42.30% | 46.00% | 22.10% | 78.10% | 48.80% | 40.10% | 60.38% | 18.20% |
12 | Qwen2.5-7B | 32.93% | 30.60% | 0.50% | 17.00% | 34.50% | 46.00% | 50.95% | 37.50% | 31.55% | 62.88% | 17.80% |
13 | Llama3.1-8B | 30.11% | 35.70% | 0.20% | 2.50% | 17.00% | 42.00% | 55.23% | 55.98% | 23.53% | 57.13% | 11.80% |
14 | Baichuan2-7B | 24.80% | 4.80% | 0.00% | 12.00% | 20.00% | 43.50% | 51.43% | 50.95% | 21.43% | 43.87% | 0.00% |
15 | Baichuan2-13B | 24.74% | 14.80% | 0.00% | 13.60% | 26.50% | 37.00% | 57.14% | 22.86% | 14.76% | 56.63% | 4.10% |
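
The page does not state how the Overall Score column is derived. For the rows checked here, it is consistent with an unweighted mean of the ten per-task scores, rounded half-up to two decimals. The sketch below reproduces that arithmetic for two rows; the averaging rule, the rounding mode, and the helper name `overall_score` are assumptions inferred from the numbers, not a documented method.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task scores (percent) copied from the table above, in column order:
# WTQ, PersonRelQA, ReportFixer, MedQA, AffairQA, BioTextQA, MatTextQA,
# PharmKGQA, ChineseLawFact, VersiCode.
GROK_3 = ["76.50", "4.70", "77.80", "49.00", "45.50",
          "80.00", "64.29", "42.11", "54.25", "64.00"]
QWQ_32B = ["70.50", "3.00", "32.30", "78.30", "45.00",
           "76.67", "62.38", "45.67", "69.00", "23.70"]


def overall_score(cells):
    """Unweighted mean of percentage strings, rounded half-up to 2 decimals.

    Hypothetical helper: the equal weighting is an assumption inferred
    from the published numbers.
    """
    values = [Decimal(c) for c in cells]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(f"Grok 3:  {overall_score(GROK_3)}%")
print(f"QWQ-32B: {overall_score(QWQ_32B)}%")
```

`Decimal` is used instead of `float` so that a mean ending in exactly 5 at the third decimal place is rounded deterministically.
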
Rank | Model | Overall Score |
---|---|---|
1 | Grok 3 | 26.57% |
2 | OpenAI o1 | 26.17% |
3 | Hunyuan-turbo | 23.64% |
4 | QWQ-32B | 22.07% |
5 | DeepSeek-R1-671B | 19.70% |
6 | Qwen2.5-72B | 18.89% |
7 | GPT-4o | 17.79% |
8 | Doubao-pro | 16.46% |
9 | Llama3.1-70B | 16.45% |
10 | Claude 3.7 Sonnet | 15.62% |
11 | GLM4-9B | 15.42% |
12 | DeepSeek-V3 | 13.68% |
13 | Llama3.1-8B | 10.83% |
14 | Baichuan2-13B | 10.58% |
15 | Qwen2.5-7B | 9.59% |
16 | Baichuan2-7B | 9.45% |
Rank | Model | Text Reasoning | MedicalQA | BioQA | MaterialQA | ChineseLawFact |
---|---|---|---|---|---|---|
1 | Hunyuan-turbo | 78.76% | 84.50% | 85.71% | 60.95% | 83.87% |
2 | QWQ-32B | 71.59% | 78.30% | 76.67% | 62.38% | 69.00% |
3 | Qwen 2.5-72B | 68.57% | 59.50% | 81.43% | 62.86% | 70.50% |
4 | GLM4-9B | 62.95% | 46.50% | 80.95% | 58.10% | 66.25% |
5 | Grok 3 | 61.89% | 49.00% | 80.00% | 64.29% | 54.25% |
6 | Llama3.1-70B | 61.60% | 27.00% | 88.57% | 71.43% | 59.38% |
7 | Doubao-pro | 60.96% | 53.00% | 83.33% | 50.00% | 57.50% |
8 | Claude 3.7 Sonnet | 58.32% | 46.00% | 78.10% | 48.80% | 60.38% |
9 | GPT-4o | 55.22% | 59.00% | 43.81% | 61.43% | 56.63% |
10 | DeepSeek V3 | 52.25% | 59.50% | 55.71% | 39.90% | 53.87% |
11 | DeepSeek-R1-671B | 47.57% | 48.00% | 33.81% | 50.48% | 58.00% |
12 | Qwen2.5-7B | 46.46% | 34.50% | 50.95% | 37.50% | 62.88% |
13 | Llama3.1-8B | 46.34% | 17.00% | 55.23% | 55.98% | 57.13% |
14 | Baichuan2-7B | 41.56% | 20.00% | 51.43% | 50.95% | 43.87% |
15 | Baichuan2-13B | 40.78% | 26.50% | 57.14% | 22.86% | 56.63% |
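
The Text Reasoning column in the table above likewise appears to be the unweighted mean of its four sub-task columns. As a rough check, the sketch below recomputes that category score for three rows and sorts them, which reproduces the top three ranks shown. The averaging rule and the helper name `category_score` are assumptions inferred from the numbers, not a documented procedure.

```python
from decimal import Decimal, ROUND_HALF_UP

# Sub-task scores (percent) copied from the table above, in column order:
# MedicalQA, BioQA, MaterialQA, ChineseLawFact. Deliberately listed out of
# rank order so the sort below does real work.
rows = {
    "Qwen 2.5-72B": ["59.50", "81.43", "62.86", "70.50"],
    "Hunyuan-turbo": ["84.50", "85.71", "60.95", "83.87"],
    "QWQ-32B": ["78.30", "76.67", "62.38", "69.00"],
}


def category_score(cells):
    """Unweighted mean of the sub-task scores, rounded half-up (assumed rule)."""
    values = [Decimal(c) for c in cells]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Highest category score first, matching the leaderboard ordering.
ranked = sorted(
    ((category_score(cells), model) for model, cells in rows.items()),
    reverse=True,
)
for rank, (score, model) in enumerate(ranked, start=1):
    print(f"{rank} | {model} | {score}% |")
```
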
Rank | Model | Knowledge Graph Reasoning | PersonQA | Report | PoliticalQA | PharmKGQA |
---|---|---|---|---|---|---|
1 | Grok 3 | 42.53% | 4.70% | 77.80% | 45.50% | 42.11% |
2 | DeepSeek-R1-671B | 35.84% | 6.80% | 59.70% | 45.50% | 31.37% |
3 | DeepSeek V3 | 35.51% | 2.60% | 57.90% | 42.50% | 39.04% |
4 | GPT-4o | 32.03% | 3.20% | 44.70% | 41.00% | 39.23% |
5 | QWQ-32B | 31.49% | 3.00% | 32.30% | 45.00% | 45.67% |
6 | Qwen 2.5-72B | 31.12% | 2.50% | 38.90% | 45.00% | 38.09% |
7 | Claude 3.7 Sonnet | 26.25% | 0.50% | 42.30% | 22.10% | 40.10% |
8 | Llama3.1-70B | 25.18% | 2.20% | 24.20% | 40.00% | 34.33% |
9 | Qwen2.5-7B | 23.76% | 0.50% | 17.00% | 46.00% | 31.55% |
10 | Doubao-pro | 23.11% | 0.00% | 25.30% | 40.00% | 27.14% |
11 | Hunyuan-turbo | 19.78% | 1.40% | 2.20% | 43.00% | 32.52% |
12 | Baichuan2-7B | 19.23% | 0.00% | 12.00% | 43.50% | 21.43% |
13 | Llama3.1-8B | 17.06% | 0.20% | 2.50% | 42.00% | 23.53% |
14 | Baichuan2-13B | 16.34% | 0.00% | 13.60% | 37.00% | 14.76% |
15 | GLM4-9B | 15.75% | 0.20% | 6.60% | 38.50% | 17.70% |
Rank | Model | Table Reasoning |
---|---|---|
1 | Grok 3 | 76.50% |
2 | DeepSeek-R1-671B | 74.30% |
3 | QWQ-32B | 70.50% |
4 | DeepSeek V3 | 69.90% |
5 | GPT-4o | 69.40% |
6 | Qwen 2.5-72B | 65.50% |
7 | Hunyuan-turbo | 55.10% |
8 | Llama3.1-70B | 47.70% |
9 | Doubao-pro | 46.00% |
10 | GLM4-9B | 39.10% |
11 | Llama3.1-8B | 35.70% |
12 | Qwen2.5-7B | 30.60% |
13 | Claude 3.7 Sonnet | 28.30% |
14 | Baichuan2-13B | 14.80% |
15 | Baichuan2-7B | 4.80% |
Rank | Model | Code Reasoning |
---|---|---|
1 | GPT-4o | 66.50% |
2 | DeepSeek-R1-671B | 65.60% |
3 | Grok 3 | 64.00% |
4 | Doubao-pro | 63.10% |
5 | GLM4-9B | 58.50% |
6 | Hunyuan-turbo | 51.70% |
7 | Llama3.1-70B | 50.90% |
8 | DeepSeek V3 | 37.40% |
9 | Qwen 2.5-72B | 35.90% |
10 | QWQ-32B | 23.70% |
11 | Claude 3.7 Sonnet | 18.20% |
12 | Qwen2.5-7B | 17.80% |
13 | Llama3.1-8B | 11.80% |
14 | Baichuan2-13B | 4.10% |
15 | Baichuan2-7B | 0.00% |