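Dynamic Leaderboard
Dedicated to exploring the most advanced large models and providing the industry and research community with a comprehensive, objective, and neutral evaluation reference.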

| Rank | Model | Text-R-stress EM | Text-R-stress F1 | Text-K-stress EM | Text-K-stress F1 | Table-R-Stress EM | Table-R-Stress F1 | KG-K-Stress EM | KG-R-Stress EM | StressEval(EM) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | GPT-5.5 | 55.00% | 59.56% | 45.00% | 56.83% | 86.67% | 89.33% | 42.00% | 62.00% | 57.65% |
| 2 | Claude-opus-4.6-thinking | 65.00% | 72.56% | 35.00% | 43.44% | 83.33% | 85.33% | 40.00% | 60.00% | 55.88% |
| 3 | Gemini-3.1-pro | 60.00% | 64.56% | 30.00% | 43.17% | 86.67% | 88.67% | 34.00% | 64.00% | 54.71% |
| 4 | Gemini-3-pro | 60.00% | 64.56% | 40.00% | 49.83% | 83.33% | 85.33% | 30.00% | 60.00% | 52.94% |
| 5 | Qwen3.6-plus | 65.00% | 69.74% | 20.00% | 31.50% | 83.33% | 85.33% | 38.00% | 56.00% | 52.35% |
| 6 | Qwen3.5-plus | 55.00% | 61.56% | 30.00% | 39.83% | 90.00% | 90.00% | 20.00% | 66.00% | 51.18% |
| 7 | Glm-5 | 65.00% | 69.56% | 30.00% | 40.67% | 83.33% | 85.24% | 20.00% | 62.00% | 50.00% |
| 8 | GPT-5.4 | 65.00% | 71.56% | 10.00% | 20.67% | 76.67% | 79.17% | 36.00% | 56.00% | 49.41% |
| 9 | Claude-sonnet-4.5 | 65.00% | 69.56% | 15.00% | 20.67% | 63.33% | 65.33% | 22.00% | 56.00% | 43.53% |
| 10 | Hunyuan-2.0 | 50.00% | 59.42% | 15.00% | 17.50% | 80.00% | 81.67% | 20.00% | 52.00% | 42.94% |
| 11 | Deepseek-V3.2 | 50.00% | 54.56% | 10.00% | 15.67% | 66.67% | 71.17% | 16.00% | 60.00% | 41.18% |
| 12 | GPT-5.2 | 65.00% | 71.56% | 5.00% | 14.52% | 66.67% | 68.57% | 14.00% | 58.00% | 41.18% |
| 13 | Deepseek-V4-pro | 55.00% | 59.56% | 25.00% | 37.92% | 66.67% | 68.33% | 8.00% | 60.00% | 41.18% |
| 14 | Doubao-seed-1.6 | 45.00% | 59.28% | 15.00% | 20.00% | 83.33% | 85.56% | 2.00% | 62.00% | 40.59% |
| 15 | Llama-3.1-70b | 55.00% | 64.42% | 10.00% | 18.17% | 63.33% | 66.44% | 24.00% | 50.00% | 40.59% |
| 16 | Qwen2.5-72b | 50.00% | 58.56% | 10.00% | 10.00% | 60.00% | 63.85% | 34.00% | 42.00% | 40.00% |
| 17 | Qwen3-235b | 55.00% | 63.56% | 15.00% | 20.00% | 63.33% | 64.07% | 20.00% | 50.00% | 40.00% |
| 18 | Llama-3.1-8b | 45.00% | 53.74% | 10.00% | 17.33% | 26.67% | 30.51% | 12.00% | 40.00% | 26.47% |
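
The StressEval(EM) column appears to be a micro-average of the five per-task EM scores. The page does not state the formula or the test-set sizes, so the weights below are an assumption inferred from the score granularity: the two Text tasks move in 5% steps, Table-R-Stress in 3.33% steps, and the two KG tasks in 2% steps, which is consistent with 20, 20, 30, 50, and 50 items. The names `ASSUMED_ITEM_COUNTS` and `overall_em` are hypothetical and used only for illustration. Under those assumptions, a minimal Python sketch reproduces the published overall score for the top row:

```python
# ASSUMPTION: relative item counts per task, inferred from the step size of
# the published EM percentages. The leaderboard itself does not state them.
ASSUMED_ITEM_COUNTS = {
    "Text-R-stress": 20,
    "Text-K-stress": 20,
    "Table-R-Stress": 30,
    "KG-K-Stress": 50,
    "KG-R-Stress": 50,
}


def overall_em(per_task_em: dict) -> float:
    """Micro-averaged EM (in percent) from per-task EM percentages."""
    total_items = sum(ASSUMED_ITEM_COUNTS.values())
    # Convert each rounded percentage back to a whole number of correct items.
    correct_items = sum(
        round(per_task_em[task] / 100 * count)
        for task, count in ASSUMED_ITEM_COUNTS.items()
    )
    return 100 * correct_items / total_items


# Per-task EM values copied from the GPT-5.5 row of the table.
gpt_5_5 = {
    "Text-R-stress": 55.00,
    "Text-K-stress": 45.00,
    "Table-R-Stress": 86.67,
    "KG-K-Stress": 42.00,
    "KG-R-Stress": 62.00,
}

print(f"{overall_em(gpt_5_5):.2f}%")  # → 57.65%
```

The same calculation yields 55.88% for Claude-opus-4.6-thinking and 26.47% for Llama-3.1-8b, matching the table. If the real test sets are larger but keep the same 20:20:30:50:50 proportions, the weighted result is unchanged.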
