This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparative analysis of the performance of the large language models ChatGPT-3.5, ChatGPT-4 and Open AI-o1 in the field of Programmed Cell Death in myeloma
Citations: 1
Authors: 9
Year: 2025
Abstract
OBJECTIVE: This study aimed to compare the performance of three large language models (LLMs), ChatGPT-3.5, ChatGPT-4, and Open AI-o1, in addressing clinical questions related to Programmed Cell Death in multiple myeloma. By evaluating each model's accuracy, comprehensiveness, and self-correcting capabilities, the investigation sought to determine the most effective tool for supporting clinical decision-making in this specialized oncological context.

METHODS: A set of forty clinical questions was curated from recent high-impact oncology journals, International Myeloma Working Group (IMWG) guidelines, and reputable medical databases, covering various aspects of Programmed Cell Death in multiple myeloma. These questions were refined and validated by a panel of four hematologist-oncologists with expertise in the field. Each question was posed individually to ChatGPT-3.5, ChatGPT-4, and Open AI-o1 in controlled sessions. Responses were anonymized and evaluated by the same panel on a five-point Likert scale assessing accuracy, depth, and completeness, then categorized as "excellent," "satisfactory," or "insufficient" based on cumulative scores. The models' self-correcting abilities were assessed by providing feedback on initially insufficient responses and re-evaluating the revised answers. Interrater reliability was measured using Cohen's Kappa coefficients.

RESULTS: Open AI-o1 consistently generated the most extensive and detailed responses, achieving significantly higher total scores across all domains than ChatGPT-3.5 and ChatGPT-4. It had the lowest proportion of "insufficient" responses and the highest percentage of "excellent" answers, and performed particularly well on guideline-based questions. Open AI-o1 also showed superior self-correcting capacity, improving its responses after receiving feedback. Its responses also yielded the highest Cohen's Kappa coefficient among the three models, indicating greater consistency in the clinical experts' evaluations. In user satisfaction surveys, 85% of hematologist-oncologists rated Open AI-o1 as "highly satisfactory," compared with 60% for ChatGPT-4 and 45% for ChatGPT-3.5.

CONCLUSION: Open AI-o1 outperforms ChatGPT-3.5 and ChatGPT-4 in accuracy, depth, and reliability when addressing complex clinical questions related to Programmed Cell Death in multiple myeloma. Its advanced "thinking" ability facilitates comprehensive and evidence-based responses, making it a more dependable tool for clinical decision support. These findings suggest that Open AI-o1 holds significant potential for enhancing clinical practice in specialized oncological fields, though ongoing validation and integration with human expertise remain essential.
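The abstract reports interrater reliability as Cohen's Kappa but does not say how the coefficient was computed across the four raters. As a rough illustration only, the sketch below computes the standard two-rater Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa` and the example ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Two-rater Cohen's Kappa for categorical labels (illustrative sketch)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over categories of the product of marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six responses by two raters (made up for illustration).
ratings_a = ["excellent", "excellent", "satisfactory",
             "insufficient", "satisfactory", "excellent"]
ratings_b = ["excellent", "satisfactory", "satisfactory",
             "insufficient", "satisfactory", "excellent"]

kappa = cohen_kappa(ratings_a, ratings_b)
print(f"Cohen's Kappa: {kappa:.3f}")
```

With more than two raters, one common option is to average the pairwise coefficients; whether the study did this is not stated in the abstract.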
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,700 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,605 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,133 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,873 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations