This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

2025 · 1 citation · Ophthalmology Science · Open Access

Abstract

Purpose: Novel large language models (LLMs) such as Generative Pretrained Transformer-5 (GPT-5) integrate advanced reasoning capabilities that may enhance performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. Our objective was to evaluate the performance and cost-accuracy trade-offs of OpenAI's GPT-5 compared with previous-generation LLMs on ophthalmic question answering.

Design: Evaluation of diagnostic test or technology.

Participants: Generative Pretrained Transformer-5 is a publicly available LLM. The study did not include human participants.

Methods: In August 2025, 12 configurations of OpenAI's GPT-5 series (3 model tiers across 4 reasoning effort settings) were evaluated alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course data set.

Main Outcome Measures: The primary outcome was accuracy on the 260-item ophthalmology multiple-choice question set for each model configuration. The secondary outcomes included head-to-head ranking of configurations using a Bradley-Terry model applied to paired win/loss comparisons of answer accuracy, and evaluation of generated natural language rationales using a reference-anchored, pairwise LLM-as-a-judge framework. Additional analyses assessed the accuracy-cost trade-off by calculating mean per-question cost from token usage and identifying Pareto-efficient configurations.

Results: < 0.001), but not o3-high (0.958; 95% CI, 0.931-0.981). The configuration GPT-5-high ranked first in accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high), as judged by a reference-anchored LLM-as-a-judge autograder. Cost-accuracy analysis identified multiple GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low providing the best low-cost, high-performance configuration.

Conclusions: This study benchmarks the GPT-5 series on a high-quality ophthalmology question-answering data set, demonstrating that GPT-5 with high reasoning effort achieved near-perfect accuracy and outperformed prior reasoning LLMs. This study also introduces an autograder framework for scalable, automated evaluation of LLM-generated answers against reference standards in ophthalmology.

Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
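
For readers unfamiliar with the two analyses named under Main Outcome Measures, the following is a minimal Python sketch of how Pareto-efficient configurations and Bradley-Terry strengths can be computed in general. It is not the authors' code. The configuration names, costs, accuracies, and win counts are hypothetical placeholders. The sketch assumes a "win" means one configuration answered a question correctly while the other did not, and the paper may define or fit this differently. It uses the classic iterative (MM) update for Bradley-Terry, which may not be the fitting method used in the study.

```python
from typing import Dict, List, Tuple


def pareto_frontier(configs: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of configurations that no other configuration dominates.

    Each value is (mean_cost_per_question, accuracy). A configuration is
    dominated if another is at least as cheap AND at least as accurate,
    and strictly better on at least one of the two.
    """
    frontier = []
    for name, (cost, acc) in configs.items():
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, (c, a) in configs.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order the frontier from cheapest to most expensive.
    return sorted(frontier, key=lambda n: configs[n][0])


def bradley_terry(wins: Dict[Tuple[str, str], int], iters: int = 200) -> Dict[str, float]:
    """Fit Bradley-Terry strengths with the iterative MM update.

    wins[(i, j)] is the number of paired comparisons in which i beat j.
    Strengths are normalised to sum to 1; only their ratios are meaningful.
    Assumes every configuration has at least one win (otherwise its
    strength collapses to zero).
    """
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        updated = {}
        for i in players:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


# Hypothetical (cost in USD per question, accuracy) values, for illustration only.
toy_configs = {
    "config_a": (0.01, 0.90),
    "config_b": (0.02, 0.95),
    "config_c": (0.03, 0.93),  # dominated by config_b: costlier and less accurate
    "config_d": (0.05, 0.97),
}
frontier = pareto_frontier(toy_configs)

# Hypothetical win counts: config_a beat config_b on 3 questions and lost on 1.
toy_strengths = bradley_terry({("config_a", "config_b"): 3, ("config_b", "config_a"): 1})
strength_ratio = toy_strengths["config_a"] / toy_strengths["config_b"]
```

Under this reading, a statement such as "1.66x stronger" corresponds to a ratio of two fitted strengths, like `strength_ratio` in the toy example, and a configuration lies on the Pareto frontier when no alternative is both cheaper and more accurate.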
