OpenAlex · Updated hourly · Last updated: 28.03.2026, 18:47

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Promise, Pitfalls and the Path Ahead for LLMs as Diagnostic Assistants for Focal Liver Lesions

2025 · Liver International · Open Access

0 citations · 2 authors

Abstract

A recent study by Sheng et al. highlighted the growing role of artificial intelligence (AI) in analysing CT and MRI reports of patients with histopathologically confirmed focal liver lesions (FLLs) [1]. This reflects a broader trend in healthcare, where AI is increasingly being adopted to address complex clinical challenges [2]. Among these AI tools, large language models (LLMs), a type of transformer-based neural network, have gained significant attention. Trained on vast amounts of text from diverse sources, LLMs with billions of parameters have shown promise as powerful tools, particularly in the management of complex diseases in hepatology [3]; they have the potential to improve healthcare due to their capability to analyse complex concepts and generate context-based responses [4, 5]. FLLs present a diagnostic challenge due to their diverse aetiologies and overlapping imaging features. According to the latest ACG clinical guideline, accurate diagnosis requires careful integration of imaging findings, clinical history and laboratory results, with recommendations for advanced imaging or biopsy in indeterminate cases [6, 7]. In this setting, LLMs have the potential to aid future clinicians by integrating imaging reports and clinical data to inform diagnoses, demonstrating significant strengths alongside recognised limitations (Figure 1).

In this context, Sheng et al. investigated the performance of two LLMs, ChatGPT-4o and Gemini, for the diagnosis of FLLs [1]. The models were prompted with both clinical information and the ‘findings’ section of radiology reports to generate differential diagnoses. The study directly compared the two LLMs against junior and middle-level radiologists, who were assessed both independently and with LLM assistance, and additionally evaluated single-step versus two-step prompting strategies for ChatGPT-4o. This study provides early evidence supporting the potential role of LLMs in clinical decision support.
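To make the single-step versus two-step distinction concrete, a minimal sketch is shown below. The `ask_llm` stub and the exact prompt wording are illustrative assumptions, not the protocol used by Sheng et al.; the point is only the structure of the two strategies.

```python
# Illustrative sketch: single-step vs two-step prompting for FLL differential
# diagnosis. The prompts and the ask_llm stub are hypothetical.

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g., ChatGPT-4o via an API)."""
    return f"[model response to {len(prompt)} chars of prompt]"

def single_step(clinical_info: str, findings: str) -> str:
    # One prompt: clinical context + report findings -> ranked differentials.
    prompt = (
        "You are assisting with focal liver lesion diagnosis.\n"
        f"Clinical information: {clinical_info}\n"
        f"Radiology report findings: {findings}\n"
        "List the three most likely diagnoses, most likely first."
    )
    return ask_llm(prompt)

def two_step(clinical_info: str, findings: str) -> str:
    # Step 1: summarise the imaging features in structured form.
    features = ask_llm(
        f"Summarise the key imaging features in this report: {findings}"
    )
    # Step 2: diagnose from the structured summary plus clinical context.
    prompt = (
        f"Clinical information: {clinical_info}\n"
        f"Key imaging features: {features}\n"
        "List the three most likely diagnoses, most likely first."
    )
    return ask_llm(prompt)
```

The two-step variant forces the model to commit to an intermediate feature summary before diagnosing, which is one plausible explanation for the performance difference the study observed.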
Remarkably, even without domain-specific fine-tuning, the performance of two-step ChatGPT-4o in particular approached that of real-world radiology reports and junior radiologists in the differential diagnosis of FLLs. This highlights the inherent capability of general-purpose LLMs to analyse clinical data and produce clinically meaningful outputs, positioning them as valuable diagnostic aids or second-opinion generators to enhance workflow efficiency. Recent work by Clusmann et al. further supports the notion that LLMs can extract relevant clinical concepts and reasoning patterns from complex medical texts, reinforcing their potential role in clinical decision support [8]. FLL diagnosis represents an ideal test case for such applications, given its reliance on the integration of imaging reports and clinical information, frequent diagnostic ambiguity and significant impact on patient management. As such, this study serves as an example of how emerging technologies can be pragmatically evaluated and iteratively refined. The use of histopathology as the diagnostic gold standard ensures a highly reliable reference, lending high credibility to the accuracy assessments. However, while matching junior radiologists is an achievement, the study also indicates that the LLM did not surpass human performance and, importantly, did not offer incremental diagnostic value when used as an assistive tool by radiologists. This suggests that current LLMs, when applied to textual report data, may replicate existing diagnostic capabilities rather than significantly augmenting them, especially for more experienced clinicians. The absence of senior experts in liver imaging means the study cannot definitively conclude how LLMs would fare against the highest level of human expertise [9]. It is plausible that the performance gap would be even wider, or conversely, that LLMs might offer some benefit for experts, which this study does not explore. 
Another limitation, as acknowledged by the authors, is that the LLMs in this study processed textual radiology reports rather than interpreting the CT/MRI images directly. The true transformative potential of AI in radiology is likely to emerge from multimodal models capable of integrating imaging data with clinical and textual information [10]. For example, previous research by Fervers et al. demonstrated that LLMs like ChatGPT yielded limited accuracy when tasked with determining standardised imaging classification systems such as LI-RADS from free-text and structured radiology reports [11]. This highlights a gap: while text-based analysis provides an initial layer of assistance, it fails to fully capitalise on the strengths of AI in image-intensive disciplines like radiology. For instance, Wei et al. demonstrated the potential of deep learning models for detecting FLLs by analysing CT images, highlighting that significant diagnostic information resides within the images themselves and can be used to improve diagnostic accuracy [12]. This emphasises the importance of incorporating imaging data directly into AI models, as it holds critical information that cannot be fully captured through text-based analysis alone. Recent work by Ying et al. exemplifies this direction, proposing an AI system for the detection and diagnosis of FLLs that combines imaging data with clinical information through a multimodal framework [13]. Their approach demonstrated improved diagnostic performance, offering a valuable example for FLL detection and diagnosis. However, their system did not rely on LLMs but on task-specific deep learning architectures. At present, truly multimodal LLMs capable of directly processing complex medical images like CT and MRI alongside text remain underdeveloped [14].
Further, while general-purpose LLMs can demonstrate impressive baseline performance, their diagnostic accuracy could potentially be enhanced through fine-tuning on domain-specific corpora or by integrating external, curated medical resources via retrieval-augmented generation (RAG) techniques to ground their outputs in relevant clinical knowledge [10]. The absence of such adaptations means the models were not optimised for the nuances of medical report interpretation and liver lesion diagnosis, likely limiting their full potential in this context [8]. Future studies should explore how targeted fine-tuning or retrieval-based augmentation can improve the clinical reliability and applicability of LLMs in specialised diagnostic workflows.

The integration of LLMs into diagnostic pathways for conditions like FLLs also raises ethical questions. Local deployment of LLMs may mitigate data privacy and security risks by keeping patient information within institutional boundaries; however, this approach entails considerable hardware, financial and operational resources, posing potential barriers to widespread adoption, particularly in smaller or resource-limited healthcare systems. Furthermore, accountability in the event of diagnostic errors made with LLM assistance needs clear delineation: is it the clinician, the institution or the LLM developer who bears responsibility? Transparency in how LLMs arrive at their conclusions (explainability) is also crucial for building trust and enabling clinicians to critically evaluate AI-generated advice [15]. This ‘black box’ nature also poses a significant challenge for regulatory oversight, which typically requires transparency and predictability. Additionally, determining whether an LLM used for diagnostic support constitutes a medical device is a key question facing bodies like the FDA and EMA [16].
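As a concrete illustration of the retrieval-augmented generation idea raised above, the sketch below retrieves the most relevant snippets from a curated corpus and prepends them to the prompt so the model's answer is grounded in vetted material. The toy keyword-overlap retriever and the corpus entries are illustrative assumptions; a real system would use dense embeddings over an authoritative knowledge base such as clinical guidelines.

```python
# Minimal RAG sketch (hypothetical corpus, toy keyword-overlap retriever).

def tokenize(text: str) -> set[str]:
    return {w.strip(".,").lower() for w in text.split()}

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank corpus snippets by word overlap with the query; keep the top k.
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

def grounded_prompt(query: str, corpus: list[str]) -> str:
    # Prepend retrieved snippets so the model answers from curated knowledge.
    context = "\n".join(f"- {snippet}" for snippet in retrieve(query, corpus))
    return (
        "Use only the reference material below when answering.\n"
        f"References:\n{context}\n"
        f"Question: {query}"
    )

corpus = [
    "Hepatic haemangiomas show peripheral nodular enhancement on CT.",
    "Focal nodular hyperplasia often has a central scar on MRI.",
    "LI-RADS categorises liver lesions by HCC probability.",
]
print(grounded_prompt("Lesion with peripheral nodular enhancement on CT", corpus))
```

The design choice is that grounding happens before generation: the model never has to rely solely on its parametric memory, which is exactly the failure mode RAG is meant to mitigate in specialised diagnostic workflows.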
Establishing clear guidelines for the validation, approval and post-market surveillance of these rapidly evolving technologies is essential [17]. Although the integration of LLMs into clinical decision support is still in its formative stages, emerging studies such as this one offer promising insights into their potential utility. Considerable obstacles remain, including the necessity for domain-specific model adaptation, effective incorporation of multimodal data, transparency in decision processes and careful navigation of ethical and regulatory frameworks. Nonetheless, the advantages LLMs can offer are increasingly evident. Their capacity to efficiently synthesise complex clinical narratives and imaging findings, provide decision support for less experienced clinicians and promote consistency in diagnostic workflows presents an exciting direction for future development.

The authors declare no conflicts of interest. No new data were generated or analysed in support of this article.

Topics

Radiomics and Machine Learning in Medical Imaging · Artificial Intelligence in Healthcare and Education · Hepatocellular Carcinoma Treatment and Prognosis