
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Automating expert-level medical reasoning evaluation of large language models

2025 · 3 citations · npj Digital Medicine · Open Access

Citations: 3

Authors: 19

Year: 2025

Abstract

As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring trustworthy reasoning is paramount. However, current strategies for evaluating LLMs' medical reasoning suffer from either unsatisfactory assessment quality or poor scalability, and a rigorous benchmark remains absent. To address this, we present MedThink-Bench, a benchmark designed for rigorous and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 high-complexity questions spanning ten medical domains, each accompanied by expert-authored, step-by-step rationales that elucidate the intermediate reasoning process. Furthermore, we introduce LLM-w-Rationale, an evaluation framework that combines fine-grained rationale assessment with the LLM-as-a-Judge paradigm, achieving expert-level fidelity in evaluating reasoning quality while preserving scalability. Results show that LLM-w-Rationale correlates strongly with expert evaluation (Pearson coefficient up to 0.87) while requiring only 1.4% of the evaluation time. Overall, MedThink-Bench establishes a rigorous and scalable standard for evaluating medical reasoning in LLMs, advancing their safe and responsible deployment in clinical practice.
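The abstract describes LLM-w-Rationale only at a high level. As a rough illustration of the general idea, a minimal Python sketch is shown below; it assumes, as a simplification, that a judge marks each expert-authored reasoning step as covered or not, and that the resulting automated scores are then compared with expert scores via Pearson correlation. The function names, scoring rule, and toy data here are hypothetical and are not taken from the paper.

```python
from statistics import mean
from math import sqrt

def judge_step_covered(model_answer: str, expert_step: str) -> bool:
    # Placeholder for an LLM-as-a-Judge call that decides whether the model's
    # reasoning contains this expert-authored step. A trivial substring check
    # stands in for the judge here, purely for illustration.
    return expert_step.lower() in model_answer.lower()

def rationale_score(model_answer: str, expert_rationale: list[str]) -> float:
    # Assumed scoring rule: fraction of expert-authored steps the model recovers.
    covered = [judge_step_covered(model_answer, step) for step in expert_rationale]
    return sum(covered) / len(covered)

def pearson(x: list[float], y: list[float]) -> float:
    # Pearson correlation between automated scores and expert scores.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Toy example: two questions with expert rationales and expert-assigned scores.
answers = ["Fever and rash suggest measles; check vaccination history.",
           "Start broad-spectrum antibiotics immediately."]
rationales = [["fever and rash", "vaccination history"],
              ["obtain blood cultures", "broad-spectrum antibiotics"]]
expert_scores = [1.0, 0.5]

auto_scores = [rationale_score(a, r) for a, r in zip(answers, rationales)]
print(auto_scores, pearson(auto_scores, expert_scores))
```

In this toy setup the automated scores track the expert scores exactly (Pearson coefficient of 1.0); the paper reports coefficients of up to 0.87 against expert evaluation while using only a small fraction of the evaluation time.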
