This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
When Retrieval Hurts: A Critical Analysis of RAG in Medical Question Answering
Citations: 0
Authors: 5
Year: 2025
Abstract
Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large Language Models (LLMs) by incorporating external knowledge, yet its effectiveness in medical question answering remains underexplored. This study presents a systematic evaluation of RAG systems on the MedQA-USMLE dataset, revealing counterintuitive findings that challenge the "always-RAG" paradigm. We evaluated three RAG approaches (BM25, Dense Retrieval, and Hybrid) against a pure LLM baseline using Qwen3-8B-Instruct as the primary LLM, with a knowledge base of 125,847 medical textbook chunks from 18 authoritative sources. Surprisingly, all RAG methods underperformed the pure LLM baseline (60.69%), with BM25-RAG, Dense-RAG, and Hybrid-RAG achieving 59.43%, 60.14%, and 59.91% accuracy, respectively. Our analysis reveals that RAG introduces a dual effect: while correcting 5-6% of questions, it simultaneously misleads 6-7% of cases, resulting in a net negative impact of 7-16 questions. Critically, we demonstrate that retrieval confidence scores fail to predict RAG utility (AUC ≈ 0.53, Cohen’s d = 0.06), indicating current retrieval systems cannot distinguish helpful from harmful contexts. Oracle analysis suggests a theoretical upper bound improvement of +6.05% if perfect RAG selection were achievable, highlighting the urgent need for selective RAG strategies. These findings challenge the assumption that external knowledge universally improves medical AI systems and emphasize the importance of context-aware retrieval mechanisms for clinical safety.
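The abstract's "dual effect" (RAG both corrects and misleads) and the confidence-AUC finding can be illustrated with a minimal sketch. All function names and data below are hypothetical, not from the paper; the sketch assumes only per-question correctness flags for the baseline and RAG runs, plus retrieval confidence scores.

```python
# Hypothetical sketch of a dual-effect analysis like the one described in the
# abstract. Inputs are illustrative toy data, not results from the paper.

def dual_effect(baseline_correct, rag_correct):
    """Count questions RAG corrected (wrong -> right), misled (right -> wrong),
    and the net effect (corrected - misled)."""
    corrected = sum(1 for b, r in zip(baseline_correct, rag_correct) if not b and r)
    misled = sum(1 for b, r in zip(baseline_correct, rag_correct) if b and not r)
    return corrected, misled, corrected - misled

def auc(scores, labels):
    """Rank-based AUC: probability a positive example outscores a negative one,
    counting ties as 0.5. An AUC near 0.5 means the score is uninformative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    baseline = [1, 1, 0, 0, 1, 0, 1, 1]  # toy correctness flags
    rag =      [1, 0, 1, 0, 1, 0, 0, 1]
    print(dual_effect(baseline, rag))    # -> (1, 2, -1): net negative, as in the paper
    # Retrieval confidence vs. "RAG helped" labels:
    print(auc([0.9, 0.8, 0.2, 0.1], [1, 0, 1, 0]))  # -> 0.75
```

An AUC of roughly 0.53, as reported, would mean retrieval confidence barely beats a coin flip at separating helpful from harmful contexts, which is what motivates the paper's call for selective RAG strategies.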