This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmark evaluation of large language models in acute myeloid leukemia prognosis
Citations: 0
Authors: 17
Year: 2025
Abstract
Background: Acute myeloid leukemia (AML) is a highly heterogeneous malignancy that requires accurate prediction of treatment response and relapse risk. Despite advances in genomics, complex mutation patterns limit prognostic precision. Large language models (LLMs) show promise in healthcare, but their utility in AML outcome prediction remains unexplored.
Methods: We curated multimodal data from 684 patients with newly diagnosed de novo AML (mean age 54±16 years) at Zhejiang University Hospital from 2019 to 2022, encompassing demographics, blood counts, genomics (43 fusions and 138 mutations), cytogenetics, treatments, and outcomes. Five state-of-the-art LLMs (Kimi, Qwen, SparkDesk, ChatGPT, DeepSeek) were evaluated using structured prompts on three tasks: (1) treatment response prediction (remission: CRc vs. non-CRc); (2) relapse risk prediction; and (3) prognostic feature ranking. Performance was assessed via accuracy, precision, recall, F1-score, and cosine similarity against expert judgments.
Results: In treatment response prediction, ChatGPT (with O1) achieved the highest accuracy (72.22%) and F1-score (82.01%), while Kimi performed worst (57.89% accuracy). In relapse prediction, SparkDesk had the highest accuracy (58.77%), but all models showed low precision (26.36–30.74%) and high false-positive rates (F1-score: 33.49% for SparkDesk). In feature ranking, LLMs aligned closely with experts (cosine similarity >0.85). Top-ranked features (e.g., TP53 mutation, CBFB::MYH11 fusion, chromosome 7/17 abnormalities) differed significantly between the CRc and non-CRc groups (p<0.001). However, LLMs overvalued non-discriminative features (e.g., WBC, FCM) compared with experts.
Conclusions: Current LLMs are not reliable enough for independent AML outcome prediction (relapse accuracy ≤58.77%). However, their robust ability to identify clinically relevant prognostic features supports their potential as adjunctive tools to augment decision-making in hematologic malignancies. Future integration of longitudinal data and domain-specific fine-tuning may enhance clinical utility.
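The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy labels, and the importance scores are hypothetical. The abstract also does not say how feature rankings were encoded as vectors for cosine similarity, so the per-feature importance scores used here are an assumption.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data, NOT taken from the study.
# 1 = CRc (remission), 0 = non-CRc.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)

# Assumed encoding: one importance score per feature, in the same feature order.
expert_scores = [5, 4, 3, 2, 1]
model_scores = [5, 3, 4, 1, 2]
similarity = cosine_similarity(expert_scores, model_scores)

print(metrics)
print(round(similarity, 4))
```

Because F1 is the harmonic mean of precision and recall, a model with low precision cannot reach a high F1 even when its recall is high, which is consistent with the relapse-prediction figures reported in the abstract.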
Similar works
The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia
2016 · 10,083 citations
Human acute myeloid leukemia is organized as a hierarchy that originates from a primitive hematopoietic cell
1997 · 6,905 citations
Diagnosis and management of AML in adults: 2017 ELN recommendations from an international expert panel
2016 · 5,796 citations
Proposals for the Classification of the Acute Leukaemias French‐American‐British (FAB) Co‐operative Group
1976 · 5,587 citations
Genomic and Epigenomic Landscapes of Adult De Novo Acute Myeloid Leukemia
2013 · 5,090 citations
Authors
Institutions
- First Affiliated Hospital Zhejiang University (CN)
- Hangzhou Dianzi University (CN)
- Hangzhou Hospital of Traditional Chinese Medicine (CN)
- Tongde Hospital of Zhejiang Province (CN)
- Zhejiang Lab (CN)
- Institute of Computing Technology (CN)
- First Affiliated Hospital of Soochow University (CN)
- Soochow University (CN)
- Nanjing Medical University (CN)