This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
KS-Probe: Benchmarking Context Fidelity Dynamics in Frontier Language Models Across Length, Position, and Format
Citations: 0
Authors: 2
Year: 2026
Abstract
Large language models (LLMs) are increasingly deployed with extended context windows, yet their ability to reliably utilize information across long contexts remains poorly characterized. In particular, it is unclear how recall fidelity varies with context length, token position, and conversational depth. We introduce KS-Probe (Kangaroo Shift Probe), a benchmarking framework designed to systematically evaluate information retention and recall in long-context LLMs. KS-Probe operates by injecting synthetic probe facts, defined as discrete and verifiable information units, into controlled filler contexts. Models are subsequently queried to assess recall accuracy. Performance is quantified using the Probe Recall Accuracy (PRA) metric, defined as the proportion of correctly retrieved probe facts under varying experimental conditions.

KS-Probe evaluates recall behavior across five dimensions:

- Context Fidelity: baseline recall as a function of context length
- Positional Recall Bias: dependence of recall on token position within the context
- Multi-Turn Degradation: decay in recall across sequential interaction turns
- Silent Truncation: failure modes where context is dropped without explicit indication
- Tokenizer Divergence: variation in recall induced by tokenization differences across model families

We benchmark frontier models including Claude Sonnet 4.6, GPT-5.2, Grok 4.1, and DeepSeek V3.2, providing a comparative analysis of long-context reliability across architectures.
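The probe-injection setup and the PRA metric described in the abstract can be sketched as follows. This is a minimal illustrative harness, not the authors' implementation: the function names, the relative-position injection scheme, and the model-query interface are all assumptions made for illustration.

```python
from typing import Callable

def build_context(filler: str, probe_fact: str, position: float) -> str:
    """Inject a probe fact into filler text at a relative position in [0, 1].

    position=0.0 places the fact at the start of the context,
    position=1.0 at the end (hypothetical scheme for this sketch).
    """
    cut = int(len(filler) * position)
    return filler[:cut] + " " + probe_fact + " " + filler[cut:]

def probe_recall_accuracy(results: list[bool]) -> float:
    """PRA: proportion of correctly retrieved probe facts across trials."""
    return sum(results) / len(results) if results else 0.0

def run_probe_trials(
    model: Callable[[str], str],  # assumed interface: prompt -> answer
    filler: str,
    probes: list[tuple[str, str, str]],  # (fact, question, expected answer)
    position: float,
) -> float:
    """Query the model once per probe and return the PRA for this condition."""
    results = []
    for fact, question, expected in probes:
        context = build_context(filler, fact, position)
        answer = model(context + "\n\n" + question)
        results.append(expected in answer)
    return probe_recall_accuracy(results)
```

Sweeping `position` and the length of `filler` then yields the positional-bias and context-length curves the abstract describes; multi-turn degradation would additionally require carrying the context across several model calls.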