This is an overview page with metadata for this scientific article. The full article is available from the publisher.
ChatGPT does not replicate human moral judgments: the importance of examining metrics beyond correlation to assess agreement
Citations: 2
Authors: 7
Year: 2025
Abstract
The rise of generative artificial intelligence has prompted claims that large language models (LLMs) can substitute for human participants, particularly in moral judgment tasks where correlations between ChatGPT and humans approach r = 1.00. In response, we conducted a pre-registered study where two LLMs (text-davinci-003 and GPT-4o) predicted human moral judgments of 60 scenarios prior to a large human sample (N = 940) rating them. Despite strong correlations, difference scores revealed substantial, systematic errors: compared to humans, LLMs provided more extreme morality ratings of moral and neutral scenarios and more extreme immorality ratings of immoral ones. Moreover, ChatGPT differed significantly and with moderate to large effect sizes from human averages on ~87% of scenarios. Further, LLM ratings clustered around a restricted number of values, failing to reflect human variability. Re-examination of earlier published data also reflected this clumping. We conclude that broader evaluation criteria are needed for comparing LLM predictions and human responses in moral reasoning tasks.
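The abstract's core methodological point, that a near-perfect correlation can coexist with large systematic disagreement on every item, is easy to illustrate. The sketch below uses made-up ratings (not the paper's data): the two raters agree perfectly on ordering, so Pearson's r is 1.0, yet the second rater is systematically more extreme, which only a difference-based metric reveals.

```python
# Hypothetical ratings for five scenarios (illustrative only, not the study's data).
from statistics import mean

human = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical human mean ratings
llm   = [2.0, 4.0, 6.0, 8.0, 10.0]  # same ordering, but systematically more extreme

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson_r(human, llm)                        # 1.0: "perfect" agreement by correlation
mad = mean(abs(b - a) for a, b in zip(human, llm))  # 3.0: large mean absolute difference

print(f"r = {r:.2f}, mean absolute difference = {mad:.2f}")
# → r = 1.00, mean absolute difference = 3.00
```

Correlation only rewards shared ordering; examining difference scores, as the study does, exposes the systematic offset that correlation hides.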
Related works
The emotional dog and its rational tail: A social intuitionist approach to moral judgment.
2001 · 7,778 citations
Social Psychology of Intergroup Relations
1982 · 7,741 citations
Implicit social cognition: Attitudes, self-esteem, and stereotypes.
1995 · 6,283 citations
A study of normative and informational social influences upon individual judgment.
1955 · 4,684 citations
The global landscape of AI ethics guidelines
2019 · 4,563 citations