Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Performance of Large Language Models in Interventional Cardiology: The ILLUMINATE Blinded Model-Comparison Study

2025·0 Zitationen·The Journal of invasive cardiology/The journal of invasive cardiology

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

OBJECTIVES: Large language models (LLMs) have the potential to assist in complex decision making for interventional cardiology (IC). However, their comparative performance in providing clinical recommendations remains uncertain. In this blinded model‑comparison study, the authors evaluated and compared the quality of recommendations produced by 6 LLMs for complex IC cases. METHODS: Twenty detailed and complex clinical cases focusing on coronary artery disease (n=10) and structural heart disease (n=10) were developed. Six LLMs were tested: default ChatGPT (ChatGPTd), ChatGPT with European Society of Cardiology guidelines (ChatGPT-gl), ChatGPT with internet search enabled (ChatGPTi), Gemini (Google), Mistral 7B (Mistral AI), and Perplexity AI (Perplexity AI, Inc.). Only the ordering of anonymized outputs was randomized to ensure blinding. Five expert ICs independently assessed the anonymized and randomized responses using a 0 to 10 scale for appropriateness, accuracy, relevance, clarity, and clinical utility, generating a composite score. Statistical analysis was performed using a mixed linear model. RESULTS: Six hundred blinded evaluations (20 cases x 6 models x 5 raters) were analyzed, yielding an overall composite score of 7.1 (95% CI, 7.0-7.2). Performance significantly varied across LLMs (P less than .001), with ChatGPTi (7.8 [7.5-8.0]) and ChatGPT-gl (7.7 [7.4-7.9]) outperforming others. ChatGPTd (6.9 [6.6-7.3]), Mistral 7B (7.0 [6.7-7.3]), and Perplexity AI (7.0 [6.7-7.3]) performed moderately, while Gemini had the lowest score (6.3 [6.0-6.7]). These differences were consistent across all scoring dimensions (P less than .001). Case type did not affect LLM performance (P = .900). CONCLUSIONS: LLMs show promise in IC decision making, but their performance remains suboptimal. Maximizing their potential requires systematic integration of web search capabilities and guideline-based knowledge retrieval.

Autoren

Institutionen

Themen

Artificial Intelligence in Healthcare and EducationMachine Learning in HealthcareBiomedical Text Mining and Ontologies

Volltext beim Verlag öffnen

Performance of Large Language Models in Interventional Cardiology: The ILLUMINATE Blinded Model-Comparison Study

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen