This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
With a Hop, Skip, and a Prefill: How Benchmark Volatility Distorts the Accuracy of Long-Context Benchmarks and How To Combat It
Citations: 0
Authors: 2
Year: 2026
Abstract
Large language models now support context windows of up to millions of tokens, a capability that enables higher accuracy, new tasks, and longer conversational history. Researchers rely on long-context inference benchmarks to evaluate specific model behaviours, but practitioners still find it difficult to translate benchmark results into AI system design decisions, such as model selection and configuration for target workloads. In this work, we analyse 16 long-context benchmarks to characterise their composition in terms of tasks, prompt token sizes, and variation between prompts. We find substantial differences in context-prompt length both across and within benchmarks: the coefficient of variation reaches 313%, and the ratio of the 95th to the 5th percentile prompt length reaches 65x within the same task. Our follow-up analysis shows that this volatility can distort benchmark accuracy, so results may reflect a model's ability to handle extreme prompt lengths rather than the underlying task itself. We further show that token-size-controlled variants of SCBench tasks reveal performance overestimates of up to 40% in the original benchmark, and that modest modifications using a knapsack-based document selection strategy can produce more representative and stable results.
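To make the two volatility statistics and the selection idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the coefficient of variation uses the population standard deviation, that percentiles use linear interpolation, and that the knapsack treats each document's token count as both its weight and its value. The abstract does not specify any of these details. The names `length_volatility` and `select_documents` are hypothetical.

```python
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def length_volatility(lengths):
    """Return (CV in percent, p95/p5 ratio) for positive prompt lengths.

    Assumption: population standard deviation; the paper may use another
    estimator.
    """
    vals = sorted(lengths)
    mean = statistics.fmean(vals)
    cv_percent = 100.0 * statistics.pstdev(vals) / mean
    ratio = percentile(vals, 95) / percentile(vals, 5)
    return cv_percent, ratio


def select_documents(doc_tokens, budget):
    """0/1 knapsack with weight == value == token count.

    Picks a subset of documents whose total token count is as close to
    `budget` as possible without exceeding it. Returns (indices, total).
    Assumption: this objective is illustrative, not the paper's.
    """
    # best maps each reachable token total to one subset that achieves it.
    best = {0: []}
    for i, n in enumerate(doc_tokens):
        # Iterate over a snapshot so each document is used at most once.
        for total, chosen in list(best.items()):
            new_total = total + n
            if new_total <= budget and new_total not in best:
                best[new_total] = chosen + [i]
    top = max(best)
    return best[top], top


if __name__ == "__main__":
    cv, ratio = length_volatility([1_000, 2_000, 4_000, 64_000, 128_000])
    print(f"CV = {cv:.1f}%, p95/p5 = {ratio:.1f}x")
    picked, total = select_documents([500, 300, 800, 200], budget=1_000)
    print(f"picked documents {picked} totalling {total} tokens")
```

Filling every prompt to roughly the same token budget in this way is one way to hold prompt length fixed, so that accuracy differences reflect the task and not the length distribution. The dynamic programme runs in time proportional to the number of documents times the budget, so large budgets would need coarser token buckets.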