This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Rise of the Machines: Comparing Performance of Artificial Intelligence Large Language Models on Pharmacy Specialty Certification Examination Practice Questions.
Citations: 0
Authors: 3
Year: 2026
Abstract
BACKGROUND: Large language models (LLMs) are increasingly used for clinical information retrieval and decision support, yet comparative performance on pharmacy board examination-style content across specialties remains incompletely characterized.

METHODS: We evaluated 15 LLMs using 145 publicly available Board of Pharmacy Specialties (BPS) certification practice questions spanning 14 specialty domains. Questions were entered using a standardized prompt without additional prompt engineering. Model responses were scored against BPS-posted answer keys. Overall and specialty-level accuracy were summarized descriptively. Differences among LLMs were tested using Cochran's Q with Bonferroni-adjusted McNemar pairwise comparisons when appropriate, and LLMs were assessed using their default user-facing settings.

RESULTS: Across all LLMs, mean accuracy was 86.2% (standard deviation [SD], 3.5%), corresponding to an average of 125/145 items answered correctly. Accuracy ranged from 79.3% (95% confidence interval [CI], 72.6%-86.0%) for Perplexity AI to 91.7% (95% CI, 87.2%-96.3%) for Microsoft Copilot (GPT-5). Overall performance differed significantly across LLMs (Cochran's Q = 46.262; df = 14; p < 0.001). After Bonferroni adjustment, Microsoft Copilot (GPT-5), Google Gemini 2.5 Flash, and OpenAI o3 (Reasoning) outperformed Perplexity AI (p < 0.001). Microsoft Copilot (GPT-5) also outperformed an earlier version of Microsoft Copilot (GPT-4.1) (p < 0.001). Specialty-level heterogeneity was generally limited, with significant model differences observed in Solid Organ Transplantation Pharmacy and Nuclear Pharmacy.

CONCLUSIONS: LLMs demonstrated high accuracy on BPS certification practice questions, with limited variability across LLMs and select specialty domains. These findings support continued evaluation of LLMs for potential use in pharmacy practice and clinical decision support, emphasizing the need for domain-specific validation and ongoing monitoring as LLMs evolve.
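The statistical approach described in METHODS (Cochran's Q across all models, followed by Bonferroni-adjusted exact McNemar pairwise comparisons) can be sketched in pure Python. This is a minimal illustration under assumed data: the function names and the toy score matrix are not from the study, and real analyses typically use a statistics package rather than hand-rolled tests.

```python
from math import comb

def cochrans_q(results):
    """Cochran's Q statistic for k related binary samples.

    results: list of k per-model score lists, each of length n
             (1 = item answered correctly, 0 = incorrect).
    Under H0 (equal accuracy), Q ~ chi-square with k-1 df.
    """
    k = len(results)
    n = len(results[0])
    model_totals = [sum(m) for m in results]                      # treatment totals
    item_totals = [sum(m[i] for m in results) for i in range(n)]  # block totals
    grand = sum(model_totals)
    num = (k - 1) * (k * sum(t * t for t in model_totals) - grand * grand)
    den = k * grand - sum(b * b for b in item_totals)
    return num / den

def mcnemar_exact(a, b):
    """Exact two-sided McNemar p-value for two paired binary score lists,
    based on the binomial distribution of the discordant pairs."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n = n01 + n10
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(n01, n10) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy data: 3 "models" scored on 4 items (illustrative only).
scores = [[1, 1, 1, 0],
          [1, 0, 1, 0],
          [0, 0, 1, 0]]
q = cochrans_q(scores)  # compare to chi-square critical value, df = k-1

# Bonferroni adjustment: divide alpha by the number of pairwise tests.
k = len(scores)
alpha = 0.05 / comb(k, 2)
p_pair = mcnemar_exact(scores[0], scores[2])
```

A pairwise comparison is declared significant only if its exact McNemar p-value falls below the Bonferroni-adjusted threshold, which controls the family-wise error rate across all model pairs.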