This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmark evaluation of large language models in acute myeloid leukemia prognosis
Citations: 0
Authors: 17
Year: 2025
Abstract
Background: Acute myeloid leukemia (AML) is a highly heterogeneous malignancy requiring accurate prediction of treatment response and relapse risk. Despite advances in genomics, complex mutation patterns limit prognostic precision. Large language models (LLMs) show promise in healthcare, but their utility in AML outcome prediction remains unexplored.
Methods: We curated multimodal data from 684 newly diagnosed de novo AML patients (mean age 54±16 years) at Zhejiang University Hospital from 2019 to 2022, encompassing demographics, blood counts, genomics (43 fusions and 138 mutations), cytogenetics, treatments, and outcomes. Five state-of-the-art LLMs (Kimi, Qwen, SparkDesk, ChatGPT, DeepSeek) were evaluated using structured prompts on three tasks: (1) treatment response prediction (remission: CRc vs. non-CRc); (2) relapse risk prediction; (3) prognostic feature ranking. Performance was assessed via accuracy, precision, recall, F1-score, and cosine similarity against expert judgments.
Results: In treatment response prediction, ChatGPT (with O1) achieved the highest accuracy (72.22%) and F1-score (82.01%), while Kimi performed worst (57.89% accuracy). In relapse prediction, SparkDesk had the highest accuracy (58.77%), but all models showed low precision (26.36–30.74%) and high false-positive rates (F1-score: 33.49% for SparkDesk). In feature ranking, LLMs aligned closely with experts (cosine similarity >0.85). Top-ranked features (e.g., TP53 mutation, CBFB::MYH11 fusion, chromosome 7/17 abnormalities) differed significantly between the CRc and non-CRc groups (p<0.001). However, LLMs overvalued non-discriminative features (e.g., WBC, FCM) compared with experts.
Conclusions: Current LLMs are not reliable enough for independent AML outcome prediction (relapse accuracy ≤58.77%). However, their robust capability in identifying clinically relevant prognostic features supports their potential as adjunctive tools to augment decision-making in hematologic malignancies. Future integration of longitudinal data and domain-specific fine-tuning may enhance clinical utility.
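The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy labels, and the importance scores below are hypothetical, and the abstract does not say how the feature rankings were encoded as vectors before cosine similarity was computed, so the vector form here is an assumption. The toy labels are chosen only to show how a classifier that over-predicts the positive class (relapse) can keep moderate accuracy while precision and F1-score stay low.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy data: 10 patients, 2 true relapses, model flags 4 as relapse.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)

# Hypothetical importance scores for the same four features (expert vs. model).
expert_scores = [0.9, 0.8, 0.3, 0.1]
model_scores = [0.8, 0.7, 0.5, 0.4]
similarity = cosine_similarity(expert_scores, model_scores)

print(metrics)
print(round(similarity, 3))
```

In the toy data, three of the four positive predictions are false positives, so precision is 0.25 even though accuracy is 0.6. That is the same qualitative pattern the abstract reports for relapse prediction.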
Similar works
The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia
2016 · 10,166 citations
Human acute myeloid leukemia is organized as a hierarchy that originates from a primitive hematopoietic cell
1997 · 6,929 citations
Diagnosis and management of AML in adults: 2017 ELN recommendations from an international expert panel
2016 · 5,846 citations
Proposals for the Classification of the Acute Leukaemias French-American-British (FAB) Co-operative Group
1976 · 5,596 citations
Genomic and Epigenomic Landscapes of Adult De Novo Acute Myeloid Leukemia
2013 · 5,137 citations
Institutions
- First Affiliated Hospital Zhejiang University (CN)
- Hangzhou Dianzi University (CN)
- Tongde Hospital of Zhejiang Province (CN)
- Hangzhou Hospital of Traditional Chinese Medicine (CN)
- Institute of Computing Technology (CN)
- Zhejiang Lab (CN)
- Soochow University (CN)
- First Affiliated Hospital of Soochow University (CN)
- Nanjing Medical University (CN)