
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating the performance of large language models versus human researchers on real world complex medical queries

2025 · 0 citations · 9 authors · Scientific Reports · Open Access


Abstract

Whether large language models (LLMs) can resolve real-world dilemmas faced by clinicians remains unclear, and physician assessment is often used as a measure of LLM output quality. We compared reports (defined as answers to clinical queries, whether generated by LLMs or written by human researchers) produced by GPT-4o, Gemini 2.0, and Claude Sonnet 3.5 in response to such dilemmas (n = 20) with reports written by trained human researchers, and examined whether physician satisfaction correlates with objective report quality. Twenty human reports and fifty-six LLM reports were analyzed. Human reports met physicians' expectations more frequently (p = 0.044) and were considered more reliable (p = 0.032), more professionally written (p = 0.003), and more time-saving (p = 0.003). Human reports cited more sources (p < 0.001), and although these came from lower-ranking journals (median IF: 7 [3, 11] vs. 14 [10, 27], p = 0.003), they were considered more relevant (p < 0.001). Unlike LLM reports, human reports contained no hallucinated (p < 0.001) or unfaithful (p < 0.001) citations. However, no meaningful correlation was identified between physician satisfaction and objective measures of report quality. A meaningful gap remains between LLM and human capacity to respond reliably and relevantly to real-life clinical dilemmas. Of greater concern is that physician satisfaction with generated content is not a good measure of quality.
