This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparison of AI-generated and clinician-designed multiple-choice questions in emergency medicine exam: a psychometric analysis
Citations: 4
Authors: 5
Year: 2025
Abstract
BACKGROUND: Artificial intelligence (AI) has shown promise in generating multiple-choice questions (MCQs) for medical education, yet the psychometric quality of such items remains underexplored. This study aimed to compare the psychometric properties of MCQs created by ChatGPT-4o and those written by emergency medicine clinicians.
METHODS: Eighteen emergency medicine residents completed a 100-item examination comprising 50 AI-generated and 50 clinician-authored questions across core emergency medicine topics. Each item was analyzed for difficulty (P_index), discrimination (D_index), and point-biserial correlation (PBCC). Items were also categorized based on standardized index classifications.
RESULTS: ChatGPT-4o-generated questions exhibited a higher mean difficulty index (P_index: 0.76 ± 0.23) compared to those created by clinicians (0.65 ± 0.24; p = 0.02), indicating that the AI-generated items were generally easier. Participants achieved significantly higher scores on AI-generated items (76.8 ± 8.18) than on clinician-authored questions (67.3 ± 9.65; p = 0.003). The mean discrimination index did not differ significantly between AI-generated (0.172 ± 0.23) and clinician-generated items (0.196 ± 0.26; p = 0.634). Likewise, the mean point-biserial correlation coefficient (PBCC) was nearly identical between the two groups (AI: 0.23 ± 0.28; clinicians: 0.23 ± 0.25; p = 0.99), suggesting similar internal consistency. Categorical analysis revealed that 56% of AI-generated items were classified as "easy," compared to 36% of clinician-designed items. Furthermore, based on PBCC values, 36% of AI-generated items and 24% of clinician items were identified as "problematic" (p = 0.015), indicating a higher rate of psychometric concerns among AI-generated questions.
CONCLUSION: The findings suggest that AI-generated questions, while generally easier and associated with higher participant scores, may pose psychometric limitations, as evidenced by a greater proportion of items classified as problematic. Although the overall internal consistency and discrimination indices were comparable to clinician-authored items, careful quality control and validation are essential when integrating AI-generated content into assessment frameworks.
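The three item-level indices named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas or group cutoffs the authors used, so everything below is an assumption based on common textbook definitions: P_index as the proportion of correct answers, D_index as the difference in that proportion between the top and bottom 27% of examinees ranked by total score, and PBCC as the uncorrected point-biserial correlation between an item score and the total score. The function name `item_statistics`, the `group_fraction` parameter, and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def item_statistics(responses, group_fraction=0.27):
    """Compute per-item difficulty, discrimination, and point-biserial values.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    group_fraction: share of examinees in the upper and lower groups
                    (0.27 is a common convention, assumed here).
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]

    # Rank examinees by total score to form the upper and lower groups.
    order = sorted(range(n_examinees), key=lambda i: totals[i], reverse=True)
    k = max(1, round(group_fraction * n_examinees))
    upper, lower = order[:k], order[-k:]

    sd_total = pstdev(totals)  # population SD of total scores
    results = []
    for j in range(n_items):
        scores = [row[j] for row in responses]

        # Difficulty index: proportion correct (higher means easier).
        p = mean(scores)

        # Discrimination index: upper-group minus lower-group proportion correct.
        d = mean([responses[i][j] for i in upper]) - mean(
            [responses[i][j] for i in lower]
        )

        # Point-biserial correlation; undefined if everyone (or no one)
        # answered correctly, or if total scores do not vary.
        if 0 < p < 1 and sd_total > 0:
            m1 = mean([t for t, s in zip(totals, scores) if s == 1])
            m0 = mean([t for t, s in zip(totals, scores) if s == 0])
            pbcc = (m1 - m0) / sd_total * (p * (1 - p)) ** 0.5
        else:
            pbcc = None

        results.append({"p_index": p, "d_index": d, "pbcc": pbcc})
    return results


# Hypothetical data: 4 examinees, 3 items (not taken from the study).
sample = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
for index, stats in enumerate(item_statistics(sample), start=1):
    print(index, stats)
```

With a cohort of 18 examinees, as in the study, a 27% cutoff would place about five examinees in each group, so the discrimination index is sensitive to individual responses. This sketch also includes each item in its own total score; a corrected variant would exclude it.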