This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Artificial Intelligence Ethics: The RLHF Critic Bias and Automated Intellectual Gatekeeping
Citations: 0
Authors: 1
Year: 2026
Abstract
Large language models trained with Reinforcement Learning from Human Feedback (RLHF) exhibit a structural bias toward the negative evaluation of submitted work, independent of actual scientific merit. This paper identifies and formally characterizes the RLHF Critic Bias: a systemic failure mode in which the asymmetric penalty landscape of human preference optimization produces models that default to skepticism, caution, and critique on evaluative tasks. We argue that this alignment failure leads to "AI-laundered" intellectual gatekeeping, where the thermodynamic cost of generating authoritative-sounding rejection drops to zero. Through analysis of training dynamics and prompt dependence, we document five recurring output patterns: the content-independent "Interesting But" template, the credentialing prior, unfalsifiable objections, hallucinated flaws, and the double standard between endorsement and critique. We show how this automated gatekeeping disproportionately threatens interdisciplinary researchers, independent scholars, and the global epistemic commons. To restore epistemic sovereignty, we propose structural mitigations, including symmetric reward training and anti-laundering watermarks, to prevent large language models from acting as an automated stagnation engine in scientific discourse.
Similar works
The global landscape of AI ethics guidelines
2019 · 4,811 citations
The Limitations of Deep Learning in Adversarial Settings
2016 · 3,896 citations
Trust in Automation: Designing for Appropriate Reliance
2004 · 3,557 citations
Fairness through awareness
2012 · 3,318 citations
AI4People—An Ethical Framework for a Good AI Society: Opportunities, Risks, Principles, and Recommendations
2018 · 3,293 citations