
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Right Diagnoses with the Wrong Justification: Limitations of Current Large-Language Models for Screening of Rheumatoid Arthritis

2025 · 0 citations · Journal of Clinical Rheumatology and Immunology · Open Access

Citations: 0 · Authors: 7 · Year: 2025

Abstract

Background: Rheumatoid arthritis (RA) is often diagnosed late due to lack of awareness. Artificial intelligence (AI)-driven large language models (LLMs), accessible via mobile devices, offer a potential screening solution. We explored various agentic AI frameworks to identify the most effective configuration, SARA (Screening Agent for RA), and evaluated each framework's chain of reasoning.

Methods: We developed PreRAID (Pre-Screening Rheumatoid Arthritis Information Database) from consenting patients with joint pain, classified as RA or not RA according to physician diagnosis. Of 350 cases, 280 formed the knowledge base (KB) and 70 were used for testing. Six LLMs (four closed-source and two open-source) were tested under three configurations: (1) single-agent without KB; (2) single-agent with retrieval-augmented generation (RAG); (3) dual-agent setup with a diagnosis-validation flow. A Neo4j vector database enabled embedding-based retrieval. [Figure 1] The accuracy of each framework was evaluated on 50 cases. Diagnostic reasoning was independently rated by two physicians using a four-point Likert scale.

Results: Using the PreRAID dataset (84% RA and 16% controls), DeepSeek-R1 showed the highest accuracy (82%) in the single-agent with KB setting, followed by o1 and o3-mini (80% each). Accuracy dropped in the dual-agent setup, most notably for Gemini 2.5 Pro (37%) and Gemini 2.0 Flash (40%) [Figure 2]. The reasons given for diagnosing RA were suboptimal across all models. Gemini 2.0 Flash (36/50) and DeepSeek-R1 (28/50) had the highest proportions of correct justifications, while QwQ and Gemini 2.5 Pro scored the lowest (6 and 10, respectively). Dual-agent configurations had poorer accuracy without any gain in reasoning logic.

Conclusion: DeepSeek-R1 demonstrated the highest diagnostic accuracy (82%) in the single-agent with KB setup. However, reasoning quality was suboptimal across models. The dual-agent configuration did not enhance reasoning as intended and instead reduced diagnostic performance. While SARA is promising for screening, lack of explainability limits clinical use. Future directions should prioritize developing explainable AI.
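
The abstract mentions embedding-based retrieval over a knowledge base of prior cases but gives no implementation details. The following is a minimal sketch of that general idea only, assuming cosine similarity over in-memory vectors; it does not use Neo4j, and every name in it (`TOY_KB`, `retrieve_top_k`, the three-dimensional embeddings, the case identifiers) is hypothetical and not taken from the study.

```python
import math

# Toy knowledge base: hypothetical placeholder entries, NOT study data.
# Real embeddings would come from an embedding model and have many dimensions.
TOY_KB = [
    {"id": "case-1", "label": "RA", "embedding": [1.0, 0.0, 0.0]},
    {"id": "case-2", "label": "not RA", "embedding": [0.0, 1.0, 0.0]},
    {"id": "case-3", "label": "RA", "embedding": [0.7, 0.7, 0.0]},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, knowledge_base, k=2):
    """Return the k knowledge-base cases most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, case["embedding"]), case)
        for case in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [case for _, case in scored[:k]]


# Hypothetical query embedding for a new case.
neighbours = retrieve_top_k([0.9, 0.1, 0.0], TOY_KB, k=2)
print([case["id"] for case in neighbours])
```

In a retrieval-augmented setup, the retrieved cases would typically be added to the model's prompt as context before it is asked for a diagnosis; how the study assembled that context is not stated in the abstract.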

Similar works