This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks
Citations: 27
Authors: 10
Year: 2025
Abstract
OBJECTIVES: Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks.
MATERIALS AND METHODS: We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities.
RESULTS: Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate.
DISCUSSION: Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation.
CONCLUSION: Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.
Author institutions
- Harvard University (US)
- Humboldt-Universität zu Berlin (DE)
- Massachusetts General Hospital (US)
- Athinoula A. Martinos Center for Biomedical Imaging (US)
- Freie Universität Berlin (DE)
- Charité - Universitätsmedizin Berlin (DE)
- TUM Klinikum (DE)
- Universitätsklinikum Aachen (DE)
- German Cancer Research Center (DE)
- TU Dortmund University (DE)
- Essen University Hospital (DE)
- Cancer Research Center (US)
- Deutschen Konsortium für Translationale Krebsforschung (DE)
- University of California, San Francisco (US)
- Deutsches Herzzentrum München (DE)