This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Benchmark evaluation of large language models in acute myeloid leukemia prognosis

2025 · 0 citations · Blood

Abstract

Background: Acute myeloid leukemia (AML) is a highly heterogeneous malignancy requiring accurate prediction of treatment response and relapse risk. Despite advances in genomics, complex mutation patterns limit prognostic precision. Large language models (LLMs) show promise in healthcare, but their utility in AML outcome prediction remains unexplored.

Methods: We curated multimodal data from 684 newly diagnosed de novo AML patients (mean age 54±16 years) at Zhejiang University Hospital from 2019 to 2022, encompassing demographics, blood counts, genomics (43 fusions and 138 mutations), cytogenetics, treatments, and outcomes. Five state-of-the-art LLMs (Kimi, Qwen, SparkDesk, ChatGPT, DeepSeek) were evaluated using structured prompts on three tasks: (1) treatment response prediction (remission: CRc vs non-CRc); (2) relapse risk prediction; (3) prognostic feature ranking. Performance was assessed via accuracy, precision, recall, F1-score, and cosine similarity against expert judgments.

Results: In treatment response prediction, ChatGPT (with O1) achieved the highest accuracy (72.22%) and F1-score (82.01%), while Kimi performed worst (57.89% accuracy). In relapse prediction, SparkDesk had the highest accuracy (58.77%), but all models showed low precision (26.36–30.74%) and high false-positive rates; SparkDesk's F1-score was 33.49%. Notably, in feature ranking, LLMs aligned closely with experts (cosine similarity >0.85). Top-ranked features (e.g., TP53 mutation, CBFB::MYH11 fusion, chromosome 7/17 abnormalities) showed significant differences between CRc and non-CRc groups (p<0.001). However, LLMs overvalued non-discriminative features (e.g., WBC, FCM) compared to experts.

Conclusions: Current LLMs demonstrate insufficient reliability for independent AML outcome prediction (relapse accuracy ≤58.77%). However, their robust capability in identifying clinically relevant prognostic features supports their potential as adjunctive tools to augment decision-making in hematologic malignancies. Future integration of longitudinal data and domain-specific fine-tuning may enhance clinical utility.
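For readers unfamiliar with the metrics named in the Methods, the following is a minimal sketch of how accuracy, precision, recall, F1-score, and cosine similarity are conventionally computed. It is not the authors' evaluation code. The function names, the toy labels, and the idea of comparing two lists of per-feature importance scores are all assumptions made for illustration; the abstract does not say how rankings were turned into vectors.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy data, NOT taken from the study.
# 1 = positive class (e.g. relapse), 0 = negative class.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 1]
metrics = binary_metrics(y_true, y_pred)

# Hypothetical importance scores for five features, expert vs. model.
expert_scores = [5, 4, 3, 2, 1]
model_scores = [4, 5, 3, 1, 2]
similarity = cosine_similarity(expert_scores, model_scores)

print(metrics)
print(round(similarity, 4))
```

Because false positives lower precision without necessarily lowering accuracy by much, a model can post moderate accuracy alongside low precision, which is the pattern the relapse results describe.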

Topics

Acute Myeloid Leukemia Research · Artificial Intelligence in Healthcare and Education · Digital Imaging for Blood Diseases
