Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Assessing ChatGPT’s Mastery of Bloom’s Taxonomy using psychosomatic medicine exam questions
5
Zitationen
8
Autoren
2023
Jahr
Abstract
Abstract Introduction Large language models (LLMs) such as GPT-4 are increasingly used in medicine and medical education. However, these models are prone to “hallucinations” – outputs that sound convincing while being factually incorrect. It is currently unknown how these errors by LLMs relate to the different cognitive levels defined in Bloom’s Taxonomy. Methods We used a large dataset of psychosomatic medicine multiple-choice questions (MCQ) (N = 307) with real-world results derived from medical school exams. GPT-4 answered the MCQs using two distinct prompt versions – detailed and short. The answers were analysed using a quantitative and qualitative approach. We focussed on incorrectly answered questions, categorizing reasoning errors according to Bloom’s Taxonomy. Results GPT-4’s performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty compared to questions that GPT-4 answered incorrectly (p=0.002 for the detailed prompt and p<0.001 for the short prompt). Independent of the prompt, GPT-4’s lowest exam performance was 78.9%, always surpassing the pass threshold. Our qualitative analysis of incorrect answers, based on Bloom’s Taxonomy, showed errors mainly in the “remember” (29/68) and “understand” (23/68) cognitive levels. Specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. Discussion GPT-4 displayed a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated against Bloom’s hierarchical framework, our data revealed that GPT-4 occasionally ignored specific facts (“remember”), provided illogical reasoning (“understand”), or failed to apply concepts to a new situation (“apply”). These errors, though confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood. Conclusion While GPT-4 mostly excels at medical exam questions, discerning its occasional cognitive errors is crucial.
Ähnliche Arbeiten
The Strengths and Difficulties Questionnaire: A Research Note
1997 · 14.578 Zit.
Making sense of Cronbach's alpha
2011 · 13.759 Zit.
QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies
2011 · 13.596 Zit.
A method for estimating the probability of adverse drug reactions
1981 · 11.467 Zit.
Evidence-Based Medicine
1992 · 4.141 Zit.