This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Expert-level validation of AI-generated medical text with scalable language models
Citations: 0
Authors: 27
Year: 2025
Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges, including a multilingual task reviewed by bilingual physicians. Each output is reviewed following a physician-defined taxonomy of risk levels and error categories, enabling evaluation of LMs in making safety decisions for deployment. Across 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source 1) the codebase, 2) MedVAL-Bench, and 3) MedVAL-4B, the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
Authors
- Asad Aali
- Vasiliki Bikia
- Maya Varma
- Nicole Chiou
- Sophie Ostmeier
- Arnav Singhvi
- Magdalini Paschali
- Ashwin Kumar
- Andrew Johnston
- Karimar Amador-Martinez
- Eduardo Guerrero
- Patricia Rivera
- Sergios Gatidis
- Christian Blüthgen
- Eduardo Pontes Reis
- Eddy D. Zandee van Rilland
- Poonam Hosamani
- Kevin Keet
- Minjoung Go
- Evelyn Ling
- David B. Larson
- Curtis P. Langlotz
- Roxana Daneshjou
- Jason Hom
- Oluwasanmi Koyejo
- Emily Alsentzer
- Akshay Chaudhari