Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Automating expert-level medical reasoning evaluation of large language models
3
Zitationen
19
Autoren
2025
Jahr
Abstract
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring trustworthy reasoning is paramount. However, current evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains absent. To address this, we present MedThink-Bench, a benchmark designed for rigorous and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 high-complexity questions spanning ten medical domains, accompanied by expert-authored, step-by-step rationales that elucidate intermediate reasoning processes. Further, we introduce LLM-w-Rationale, an evaluation framework that combines fine-grained rationale assessment with an LLM-as-a-Judge paradigm, enabling expert-level fidelity in evaluating reasoning quality while preserving scalability. Results show that LLM-w-Rationale correlates strongly with expert evaluation (Pearson coefficient up to 0.87) while requiring only 1.4% of the evaluation time. Overall, MedThink-Bench establishes a rigorous and scalable standard for evaluating medical reasoning in LLMs, advancing their safe and responsible deployment in clinical practice.
Ähnliche Arbeiten
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8.380 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.243 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.671 Zit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5.776 Zit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5.496 Zit.