This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Right Diagnoses with the Wrong Justification: Limitations of Current Large-Language Models for Screening of Rheumatoid Arthritis
0 citations · 7 authors · 2025
Abstract
Background: Rheumatoid arthritis (RA) is often diagnosed late due to a lack of awareness. Artificial intelligence (AI)-driven large language models (LLMs), accessible via mobile devices, offer a potential screening solution. We explored various agentic AI frameworks to identify the most effective configuration, SARA (Screening Agent for RA), and evaluated their chains of reasoning.

Methods: We developed the PreRAID (Pre-Screening Rheumatoid Arthritis Information Database) from consenting patients with joint pain, classified as RA or not RA according to the physician's diagnosis. Of 350 cases, 280 formed the knowledge base (KB) and 70 were reserved for testing. Six LLMs (four closed-source and two open-source) were tested under three configurations: (1) a single agent without the KB; (2) a single agent with retrieval-augmented generation (RAG); and (3) a dual-agent setup with a diagnosis-validation flow. A Neo4j vector database enabled embedding-based retrieval [Figure 1]. The accuracy of each framework was evaluated on 50 cases, and the diagnostic reasoning was independently rated by two physicians on a four-point Likert scale.

Results: On the PreRAID dataset (84% RA, 16% controls), DeepSeek-R1 showed the highest accuracy (82%) in the single-agent-with-KB setting, followed by o1 and o3-mini (80% each). Accuracy dropped in the dual-agent setup, most notably for Gemini 2.5 Pro (37%) and Gemini 2.0 Flash (40%) [Figure 2]. The justifications given for diagnosing RA were suboptimal across all models: Gemini 2.0 Flash (36/50) and DeepSeek-R1 (28/50) produced the most correct justifications, while QwQ and Gemini 2.5 Pro scored lowest (6/50 and 10/50, respectively). Dual-agent configurations had poorer accuracy without any gain in reasoning quality.

Conclusion: DeepSeek-R1 demonstrated the highest diagnostic accuracy (82%) in the single-agent-with-KB setup; however, reasoning quality was suboptimal across all models. The dual-agent configuration did not enhance reasoning as intended and instead reduced diagnostic performance. While SARA is promising for screening, its lack of explainability limits clinical use. Future work should prioritize developing explainable AI.
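To make the Methods concrete, here is a minimal Python sketch of the single-agent RAG configuration and the dual-agent diagnosis-validation flow described above. The Neo4j vector-index name (case_embeddings), the node properties (summary, diagnosis), and the embed_case/ask_llm stubs are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Methods' frameworks: a single screening agent grounded
# via RAG over a Neo4j vector index, plus a dual-agent diagnosis-validation
# wrapper. All identifiers below (index name, node properties, embed_case,
# ask_llm) are illustrative assumptions, not the paper's code.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def embed_case(case_text: str) -> list[float]:
    """Placeholder: map a case description to an embedding vector
    (any sentence-embedding model would do here)."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder: send the prompt to any chat-completion LLM and
    return its text reply."""
    raise NotImplementedError

def retrieve_similar_cases(case_text: str, k: int = 5) -> list[dict]:
    """Embedding-based retrieval of the k nearest knowledge-base cases
    from a Neo4j 5.x vector index."""
    with driver.session() as session:
        result = session.run(
            """
            CALL db.index.vector.queryNodes('case_embeddings', $k, $embedding)
            YIELD node, score
            RETURN node.summary AS summary, node.diagnosis AS diagnosis, score
            """,
            k=k,
            embedding=embed_case(case_text),
        )
        return [record.data() for record in result]

def screen_for_ra(case_text: str) -> str:
    """Configuration (2): single agent with RAG over the knowledge base."""
    context = "\n".join(
        f"- {c['summary']} -> {c['diagnosis']}"
        for c in retrieve_similar_cases(case_text)
    )
    prompt = (
        "You are a rheumatology screening assistant. Using the reference "
        "cases below, decide whether the patient should be flagged as "
        "possible RA and justify your reasoning.\n\n"
        f"Reference cases:\n{context}\n\nPatient:\n{case_text}"
    )
    return ask_llm(prompt)

def screen_with_validation(case_text: str) -> str:
    """Configuration (3): dual-agent flow in which a second agent reviews
    and validates the first agent's draft verdict."""
    draft = screen_for_ra(case_text)
    return ask_llm(
        "You are a validating rheumatologist. Check the screening verdict "
        "below for clinical soundness and return the final answer.\n\n"
        f"Patient:\n{case_text}\n\nDraft verdict:\n{draft}"
    )
```

In this sketch the validation agent merely critiques the first agent's draft; consistent with the Results, the paper found that this extra step reduced accuracy rather than improving reasoning.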
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations