This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating cognitive depth of AI-generated multiple-choice questions with Bloom’s Taxonomy
Citations: 2
Authors: 5
Year: 2026
Abstract
INTRODUCTION: Large language models (LLMs) are used to generate medical and dental multiple-choice questions (MCQs), but how well these questions align with Bloom's Taxonomy remains unexplored.
MATERIALS AND METHODS: Five widely used LLMs were evaluated: ChatGPT-4o (OpenAI), Copilot Pro (Microsoft), Claude Sonnet 4 (Anthropic), Grok 3 (xAI), and DeepSeek R1 (DeepSeek). Each model generated 60 MCQs (300 in total) based on content from an oral and maxillofacial anatomy textbook across the five cognitive levels of Bloom's Taxonomy. Two independent investigators assessed each item using a 5-point Likert scale for remembering, understanding, applying, analyzing, and evaluating/creating. Inter-rater reliability was measured using weighted Cohen's kappa. Model performance and inter-model differences were analyzed using the Kruskal-Wallis test.
RESULTS: Inter-rater reliability was moderate to strong (kappa = 0.74-0.86). Median scores for remembering, understanding, applying, and evaluating/creating were above 4 across all LLMs, while the analyzing level scored a median of 3.5 for ChatGPT-4o and DeepSeek R1. No significant difference was found between models at the remembering and understanding levels (p > 0.05). Claude Sonnet 4 outperformed the other models at the applying, analyzing, and evaluating/creating levels (p = 0.01, 0.003, and 0.005, respectively). Within-model analysis showed that only Copilot Pro and Claude Sonnet 4 aligned consistently with Bloom's cognitive levels across all categories. In contrast, ChatGPT-4o, DeepSeek R1, and Grok 3 performed significantly better at the lower cognitive levels (p = 0.00, 0.00, and 0.001, respectively).
CONCLUSIONS: All LLMs performed well at the lower cognitive levels, while Claude Sonnet 4 achieved the highest alignment at the higher-order levels.
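The two statistics named in the methods can be sketched in plain Python. This is a minimal illustration, not the authors' analysis: the function names `weighted_kappa` and `kruskal_wallis_h` are hypothetical, the abstract does not say whether linear or quadratic kappa weights were used (linear is assumed by default here), and the ratings in the usage section are made-up values, not study data. The second function returns only the tie-corrected H statistic; a p-value would additionally need a chi-square distribution with (number of groups - 1) degrees of freedom.

```python
def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5), quadratic=False):
    """Weighted Cohen's kappa for two raters scoring the same items.

    Assumption: linear disagreement weights unless quadratic=True.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed proportion of items in each (rater A, rater B) cell.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n

    # Marginal proportions give the agreement expected by chance.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_obs = weighted_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at maximum distance
            if quadratic:
                w *= w
            weighted_obs += w * observed[i][j]
            weighted_exp += w * row[i] * col[j]

    if weighted_exp == 0:
        return float("nan")  # both raters used one identical category
    return 1 - weighted_obs / weighted_exp


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of score lists."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j

    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Illustrative (made-up) Likert ratings, not data from the study.
rater_a = [5, 4, 4, 3, 5, 2, 4, 5]
rater_b = [5, 4, 3, 3, 4, 2, 4, 5]
print("weighted kappa:", weighted_kappa(rater_a, rater_b))

scores_by_model = [[4, 5, 4, 3], [3, 3, 4, 2], [5, 5, 4, 5]]
print("Kruskal-Wallis H:", kruskal_wallis_h(scores_by_model))
```

Likert ratings are ordinal, which is why a weighted kappa (partial credit for near-misses) and a rank-based test such as Kruskal-Wallis fit this kind of data better than unweighted agreement or a comparison of means.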
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,700 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,605 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,133 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,873 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations