This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Citations: 0
Authors: 35
Year: 2026
Abstract
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 668 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,143 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate eight state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7 to 10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30 to 40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://huggingface.co/datasets/skylenage/HLE-Verified
Authors
- Weiqi Zhai
- Zhihai Wang
- Jinghang Wang
- Boyu Yang
- Xiaogang Li
- Xiang Xu
- Bohan Wang
- Peng Wang
- Xingzhe Wu
- Anfeng Li
- Qiyuan Feng
- Yuhao Zhou
- Shoulin Han
- Wenjie Luo
- Yiyuan Li
- Yaxuan Wang
- Ruixian Luo
- Guojie Lin
- Peiyao Xiao
- Chengliang Xu
- Ben Wang
- Zeyu Wang
- Zichao Chen
- Jianan Ye
- Yijie Hu
- Jialong Chen
- Zongwen Shen
- Yuliang Xu
- An Yang
- Bowen Yu
- Dayiheng Liu
- Junyang Lin
- Hu Wei
- Que Shen
- Bing Zhao