Dynamic Leaderboard

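Dedicated to exploring state-of-the-art large models and to giving industry and research teams a comprehensive, objective, and neutral evaluation reference.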

Rank  Task                      Text-R-Stress     Text-K-Stress     Table-R-Stress    KG-K-Stress  KG-R-Stress  StressEval(EM)
      Model                     EM       F1       EM       F1       EM       F1       EM           EM
1     GPT-5.5                   55.00%   59.56%   45.00%   56.83%   86.67%   89.33%   42.00%       62.00%       57.65%
2     Claude-opus-4.6-thinking  65.00%   72.56%   35.00%   43.44%   83.33%   85.33%   40.00%       60.00%       55.88%
3     Gemini-3.1-pro            60.00%   64.56%   30.00%   43.17%   86.67%   88.67%   34.00%       64.00%       54.71%
4     Gemini-3-pro              60.00%   64.56%   40.00%   49.83%   83.33%   85.33%   30.00%       60.00%       52.94%
5     Qwen3.6-plus              65.00%   69.74%   20.00%   31.50%   83.33%   85.33%   38.00%       56.00%       52.35%
6     Qwen3.5-plus              55.00%   61.56%   30.00%   39.83%   90.00%   90.00%   20.00%       66.00%       51.18%
7     Glm-5                     65.00%   69.56%   30.00%   40.67%   83.33%   85.24%   20.00%       62.00%       50.00%
8     GPT-5.4                   65.00%   71.56%   10.00%   20.67%   76.67%   79.17%   36.00%       56.00%       49.41%
9     Claude-sonnet-4.5         65.00%   69.56%   15.00%   20.67%   63.33%   65.33%   22.00%       56.00%       43.53%
10    Hunyuan-2.0               50.00%   59.42%   15.00%   17.50%   80.00%   81.67%   20.00%       52.00%       42.94%
11    Deepseek-V3.2             50.00%   54.56%   10.00%   15.67%   66.67%   71.17%   16.00%       60.00%       41.18%
12    GPT-5.2                   65.00%   71.56%   5.00%    14.52%   66.67%   68.57%   14.00%       58.00%       41.18%
13    Deepseek-V4-pro           55.00%   59.56%   25.00%   37.92%   66.67%   68.33%   8.00%        60.00%       41.18%
14    Doubao-seed-1.6           45.00%   59.28%   15.00%   20.00%   83.33%   85.56%   2.00%        62.00%       40.59%
15    Llama-3.1-70b             55.00%   64.42%   10.00%   18.17%   63.33%   66.44%   24.00%       50.00%       40.59%
16    Qwen2.5-72b               50.00%   58.56%   10.00%   10.00%   60.00%   63.85%   34.00%       42.00%       40.00%
17    Qwen3-235b                55.00%   63.56%   15.00%   20.00%   63.33%   64.07%   20.00%       50.00%       40.00%
18    Llama-3.1-8b              45.00%   53.74%   10.00%   17.33%   26.67%   30.51%   12.00%       40.00%       26.47%
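
The page does not say how the StressEval(EM) column is derived from the per-task EM scores. The published figures look consistent with an item-weighted (micro) average rather than a plain mean of the five task scores. The sketch below reproduces that arithmetic for one row. The per-task item counts (20, 20, 30, 50, 50) are an assumption inferred from the granularity of the reported percentages (5%, 5%, 3.33%, 2%, 2%); only their ratio matters, and the benchmark itself may use different sizes. The names `ASSUMED_TASK_SIZES` and `micro_average_em` are hypothetical and exist only for this illustration.

```python
# ASSUMPTION: per-task item counts are not given in the source. They are
# inferred from the step size of the reported EM percentages.
ASSUMED_TASK_SIZES = {
    "Text-R-Stress": 20,
    "Text-K-Stress": 20,
    "Table-R-Stress": 30,
    "KG-K-Stress": 50,
    "KG-R-Stress": 50,
}


def micro_average_em(em_percent):
    """Return the item-weighted EM (in percent) across all tasks."""
    correct = 0
    total = 0
    for task, size in ASSUMED_TASK_SIZES.items():
        # Recover the exact-match count from the rounded percentage.
        correct += round(em_percent[task] / 100 * size)
        total += size
    return 100 * correct / total


# Per-task EM values copied from the GPT-5.5 row of the table.
gpt_5_5 = {
    "Text-R-Stress": 55.00,
    "Text-K-Stress": 45.00,
    "Table-R-Stress": 86.67,
    "KG-K-Stress": 42.00,
    "KG-R-Stress": 62.00,
}

print(f"{micro_average_em(gpt_5_5):.2f}%")  # 57.65%
```

Under these assumed sizes the result matches the 57.65% shown for GPT-5.5, whereas an unweighted mean of the same five EM values would give about 58.13%. This suggests, but does not prove, that the overall score weights each task by its number of items.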