This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Confirmation Bias: A Challenge for Scalable Oversight
Citations: 0
Authors: 7
Year: 2026
Abstract
Scalable oversight protocols aim to empower evaluators to verify outputs of AI models more capable than themselves. However, human evaluators' biases can lead to systematic errors. We reanalyse prior work which seemed to show benefits from a simple protocol, and suggest that a strategy of "answer the question myself if I know the answer, defer to the language model otherwise" likely contributed to its positive results. This strategy fails to provide meaningful oversight when model capability increases. We also present two experiments examining simple protocols, finding no overall advantage for either. In our main experiment, participants in control and intervention groups became more confident in the system's answers after conducting online research, even when those answers were incorrect. Our null results are restricted to the simple protocols and settings tested, and say little regarding the promise of scalable oversight more broadly. Nevertheless, they underscore the importance of testing the degree to which protocols are robust to confirmation bias, whether they outperform a strategy of simple deference to the model being evaluated, and whether performance scales with increasing problem difficulty and model capability.
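To make the deference strategy concrete, the following is a minimal sketch in Python; it is not from the paper, and all names (`defer_strategy`, `own_answer`, `model_answer`) are hypothetical illustrations of the heuristic the abstract describes:

```python
from typing import Callable, Optional

def defer_strategy(
    own_answer: Callable[[str], Optional[str]],   # evaluator's own knowledge; returns None if unknown
    model_answer: Callable[[str], str],           # the model being overseen
    question: str,
) -> str:
    """Answer the question yourself if you know the answer;
    otherwise defer to the language model."""
    answer = own_answer(question)
    if answer is not None:
        return answer              # evaluator knew the answer
    return model_answer(question)  # otherwise: unconditional deference

# Hypothetical usage: an evaluator who knows only one fact.
known = {"capital of France": "Paris"}
print(defer_strategy(known.get, lambda q: "model's answer", "capital of France"))  # Paris
print(defer_strategy(known.get, lambda q: "model's answer", "an open question"))   # model's answer
```

On questions outside the evaluator's knowledge this reduces to pure deference, which is why, as the abstract notes, the strategy stops providing meaningful oversight as model capability increases.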
Similar Works
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
2017 · 20,796 citations
Generative Adversarial Nets
2023 · 19,896 citations
Visualizing and Understanding Convolutional Networks
2014 · 15,334 citations
"Why Should I Trust You?"
2016 · 14,607 citations
Generative adversarial networks
2020 · 13,214 citations