OpenAlex · Updated hourly · Last updated: 2026-05-01, 13:27

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Confirmation Bias: A Challenge for Scalable Oversight

2026 · 0 citations · Proceedings of the AAAI Conference on Artificial Intelligence · Open Access

0 citations · 7 authors · Year: 2026

Abstract

Scalable oversight protocols aim to empower evaluators to verify the outputs of AI models more capable than themselves. However, human evaluators' biases can lead to systematic errors. We reanalyse prior work that seemed to show benefits from a simple protocol, and suggest that a strategy of "answer the question myself if I know the answer, defer to the language model otherwise" likely contributed to its positive results. This strategy fails to provide meaningful oversight as model capability increases. We also present two experiments examining simple protocols, finding no overall advantage for either. In our main experiment, participants in both the control and intervention groups became more confident in the system's answers after conducting online research, even when those answers were incorrect. Our null results are restricted to the simple protocols and settings tested, and say little about the promise of scalable oversight more broadly. Nevertheless, they underscore the importance of testing the degree to which protocols are robust to confirmation bias, whether they outperform a strategy of simple deference to the model being evaluated, and whether performance scales with increasing problem difficulty and model capability.
