This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of GPT-5 Frontier Models in Ophthalmology Question Answering
1
Citations
14
Authors
2025
Year
Abstract
Purpose: Novel large language models (LLMs) such as Generative Pretrained Transformer-5 (GPT-5) integrate advanced reasoning capabilities that may enhance performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. Our objective was to evaluate the performance and cost-accuracy trade-offs of OpenAI's GPT-5 compared with previous-generation LLMs on ophthalmic question answering.
Design: Evaluation of diagnostic test or technology.
Participants: Generative Pretrained Transformer-5 is a publicly available LLM. The study did not include human participants.
Methods: In August 2025, 12 configurations of OpenAI's GPT-5 series (3 model tiers across 4 reasoning effort settings) were evaluated alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course data set.
Main Outcome Measures: The primary outcome was accuracy on the 260-item ophthalmology multiple-choice question set for each model configuration. The secondary outcomes included head-to-head ranking of configurations using a Bradley-Terry model applied to paired win/loss comparisons of answer accuracy, and evaluation of generated natural language rationales using a reference-anchored, pairwise LLM-as-a-judge framework. Additional analyses assessed the accuracy-cost trade-off by calculating mean per-question cost from token usage and identifying Pareto-efficient configurations.
Results: [...] < 0.001), but not o3-high (0.958; 95% CI, 0.931-0.981). The configuration GPT-5-high ranked first in accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high), as judged by a reference-anchored LLM-as-a-judge autograder. Cost-accuracy analysis identified multiple GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low providing the optimal low-cost, high-performance configuration.
Conclusions: This study benchmarks the GPT-5 series on a high-quality ophthalmology question-answering data set, demonstrating that GPT-5 with high reasoning effort achieved near-perfect accuracy and outperformed prior reasoning LLMs. This study also introduces an autograder framework for scalable, automated evaluation of LLM-generated answers against reference standards in ophthalmology.
Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
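The abstract names three computations without showing them: a Bradley-Terry ranking from paired win/loss comparisons, a mean per-question cost derived from token usage, and the selection of Pareto-efficient configurations. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation. The function names, the data layout, and the per-million-token pricing scheme are assumptions made for illustration. The Bradley-Terry fit uses the standard minorization-maximization (MM) update, which the abstract does not specify.

```python
def bradley_terry(wins, n_iter=500):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[a][b] = number of questions that configuration a answered correctly
    while configuration b answered incorrectly (ties are not counted).
    The data layout is an assumption for this sketch.
    """
    names = sorted(set(wins) | {b for row in wins.values() for b in row})
    strength = {name: 1.0 for name in names}
    for _ in range(n_iter):
        updated = {}
        for a in names:
            total_wins = sum(wins.get(a, {}).values())
            denom = 0.0
            for b in names:
                if b == a:
                    continue
                # Number of decisive comparisons between a and b.
                n_ab = wins.get(a, {}).get(b, 0) + wins.get(b, {}).get(a, 0)
                if n_ab:
                    denom += n_ab / (strength[a] + strength[b])
            updated[a] = total_wins / denom if denom else strength[a]
        # Rescale so the strengths average to 1 (only ratios are meaningful).
        scale = len(names) / sum(updated.values())
        strength = {name: value * scale for name, value in updated.items()}
    return strength


def mean_cost_per_question(usages, price_in, price_out):
    """Mean cost per question from (input_tokens, output_tokens) pairs.

    price_in and price_out are hypothetical prices per one million tokens.
    """
    total = sum(i * price_in + o * price_out for i, o in usages)
    return total / (1_000_000 * len(usages))


def pareto_frontier(configs):
    """Return configurations not dominated on (cost, accuracy).

    configs maps a name to (mean_cost, accuracy). A configuration is dominated
    if another one is at least as cheap and at least as accurate, and strictly
    better on one of the two.
    """
    frontier = []
    for name, (cost, acc) in configs.items():
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for other, (c2, a2) in configs.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: configs[n][0])
```

Under this reading, a statement such as "1.66x stronger" would correspond to the ratio of two fitted strengths, for example `strength["x"] / strength["y"]` for two hypothetical configurations `x` and `y`. Whether the authors report exactly this ratio is an assumption.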
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,719 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,628 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,176 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,880 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Authors
Institutions
- Cleveland Clinic (US)
- Centre Hospitalier de l’Université de Montréal (CA)
- Cleveland Eye Clinic (US)
- Université de Montréal (CA)
- University of Toronto (CA)
- Hôpital Maisonneuve-Rosemont (CA)
- Cleveland Clinic Lerner College of Medicine (US)
- Case Western Reserve University (US)
- Moorfields Eye Hospital (GB)
- University College London (GB)
- Yale University (US)
- National University of Singapore (SG)
- Singapore National Eye Center (SG)
- Singapore Eye Research Institute (SG)
- Duke-NUS Medical School (SG)