This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Evaluation of three artificial intelligence chatbots for generating clinical hematology multiple choice questions for medical students
Citations: 0
Authors: 9
Year: 2026
Abstract
The integration of artificial intelligence (AI) into medical education has shown promise in streamlining content creation, yet the reliability and validity of AI-generated assessments remain critical concerns. This study evaluates three AI models (ChatGPT, Perplexity, and DeepSeek) in generating hematology multiple-choice questions (MCQs), focusing on their alignment with clinical guidelines, cognitive complexity, and expert acceptability, to determine their practical utility in medical education. The objective was to quantitatively evaluate and compare the performance of the three models, with a focus on content validity, cognitive level alignment, and expert acceptance.

Each AI model was prompted to generate 50 MCQs across five key hematology topics, following standardized instructions emphasizing guideline alignment and cognitive diversity. Three hematology experts, blinded to question source, independently rated all 150 MCQs on criteria including accuracy, clinical relevance, clarity, distractor plausibility, and overall quality, using a structured rubric. Scores were averaged per model, and questions were categorized by Bloom's taxonomy level. Acceptance was defined as a total score ≥ 15 out of 25.

DeepSeek achieved the highest scores for accuracy (4.7 ± 0.4), clinical relevance (4.8 ± 0.3), and distractor plausibility (4.7 ± 0.4), with a perfect acceptance rate (100%) and no need for revision. Perplexity and ChatGPT also produced clinically relevant questions but required minor revisions (acceptance rates: 96% and 90%, respectively). All models favored higher-order cognitive questions, while knowledge and comprehension questions were limited across all models.

AI models, particularly DeepSeek, can efficiently generate high-quality, clinically relevant hematology MCQs suitable for medical education and assessment. While DeepSeek demonstrated superior reliability and required minimal expert revision, all models underrepresented foundational knowledge questions and lacked autonomous image-based item generation. Hybrid human-AI workflows and targeted prompt engineering are recommended to optimize cognitive coverage and ensure educational rigor.
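The scoring scheme described in the abstract (five criteria rated by experts, acceptance when the total reaches ≥ 15 of 25) can be sketched as a small calculation. This is a minimal illustrative sketch, not the authors' actual analysis code; the function names and the example ratings are hypothetical.

```python
def total_score(ratings):
    """Sum the five criterion ratings (accuracy, clinical relevance,
    clarity, distractor plausibility, overall quality), each 1-5,
    giving a total out of 25."""
    return sum(ratings)

def acceptance_rate(question_ratings, threshold=15):
    """Fraction of questions whose total score meets the threshold,
    per the abstract's acceptance definition (total >= 15 of 25)."""
    accepted = [r for r in question_ratings if total_score(r) >= threshold]
    return len(accepted) / len(question_ratings)

# Hypothetical example: three questions rated on the five criteria
ratings = [
    [5, 5, 4, 5, 5],  # total 24 -> accepted
    [3, 3, 3, 3, 2],  # total 14 -> not accepted
    [4, 4, 3, 3, 3],  # total 17 -> accepted
]
print(acceptance_rate(ratings))
```

Applied to each model's 50 questions, this rule yields the per-model acceptance rates reported above (100%, 96%, and 90%).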
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,357 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,221 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,640 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,482 citations