OpenAlex · Updated hourly · Last updated: 26.05.2026, 06:43

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Comparison of AI-generated and clinician-designed multiple-choice questions in emergency medicine exam: a psychometric analysis

2025 · 4 citations · BMC Medical Education · Open Access

Citations: 4 · Authors: 5 · Year: 2025

Abstract

BACKGROUND: Artificial intelligence (AI) has shown promise in generating multiple-choice questions (MCQs) for medical education, yet the psychometric quality of such items remains underexplored. This study aimed to compare the psychometric properties of MCQs created by ChatGPT-4o and those written by emergency medicine clinicians.

METHODS: Eighteen emergency medicine residents completed a 100-item examination comprising 50 AI-generated and 50 clinician-authored questions across core emergency medicine topics. Each item was analyzed for difficulty (P_index), discrimination (D_index), and point-biserial correlation (PBCC). Items were also categorized based on standardized index classifications.

RESULTS: ChatGPT-4o-generated questions exhibited a higher mean difficulty index (P_index: 0.76 ± 0.23) compared to those created by clinicians (0.65 ± 0.24; p = 0.02), indicating that the AI-generated items were generally easier. Participants achieved significantly higher scores on AI-generated items (76.8 ± 8.18) than on clinician-authored questions (67.3 ± 9.65; p = 0.003). The mean discrimination index did not differ significantly between AI-generated (0.172 ± 0.23) and clinician-generated items (0.196 ± 0.26; p = 0.634). Likewise, the mean point-biserial correlation coefficient (PBCC) was nearly identical between the two groups (AI: 0.23 ± 0.28; clinicians: 0.23 ± 0.25; p = 0.99), suggesting similar internal consistency. Categorical analysis revealed that 56% of AI-generated items were classified as "easy," compared to 36% of clinician-designed items. Furthermore, based on PBCC values, 36% of AI-generated items and 24% of clinician items were identified as "problematic" (p = 0.015), indicating a higher rate of psychometric concerns among AI-generated questions.

CONCLUSION: The findings suggest that AI-generated questions, while generally easier and associated with higher participant scores, may pose psychometric limitations, as evidenced by a greater proportion of items classified as problematic. Although the overall internal consistency and discrimination indices were comparable to clinician-authored items, careful quality control and validation are essential when integrating AI-generated content into assessment frameworks.
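
The three item statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names, the 27% upper/lower group fraction, and the use of an uncorrected total score (one that still includes the item itself) are assumptions chosen for illustration, since the abstract does not state how the indices were computed.

```python
from math import sqrt
from statistics import mean, pstdev


def difficulty_index(item_scores):
    """P_index: proportion of examinees who answered the item correctly."""
    return mean(item_scores)


def discrimination_index(item_scores, total_scores, fraction=0.27):
    """D_index: proportion correct in the top group minus the bottom group.

    The 27% group fraction is a common convention and an assumption here.
    Ties in total score are broken by input order.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = mean(item_scores[i] for i in upper)
    p_lower = mean(item_scores[i] for i in lower)
    return p_upper - p_lower


def point_biserial(item_scores, total_scores):
    """PBCC: Pearson correlation between a 0/1 item score and the total score.

    Returns NaN when everyone (or no one) answered correctly, or when all
    total scores are identical, because the correlation is undefined then.
    """
    correct = [t for x, t in zip(item_scores, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_scores, total_scores) if x == 0]
    sd = pstdev(total_scores)
    if not correct or not wrong or sd == 0:
        return float("nan")
    p = len(correct) / len(item_scores)
    return (mean(correct) - mean(wrong)) / sd * sqrt(p * (1 - p))


if __name__ == "__main__":
    # Hypothetical responses: rows are examinees, columns are items (1 = correct).
    responses = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 1],
        [1, 0, 0],
        [0, 0, 0],
    ]
    totals = [sum(row) for row in responses]
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        print(j, difficulty_index(item),
              discrimination_index(item, totals),
              point_biserial(item, totals))
```

A higher P_index means more examinees answered correctly, which is why the abstract reads the higher AI mean (0.76 versus 0.65) as "easier". With a cohort as small as the 18 residents described, the upper and lower groups contain only a handful of examinees, so individual D_index values are coarse.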

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Health Education and Validation