This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Development and comparative evaluation of knowledge graph–enhanced large language models for domain-specific question answering in nursing
Citations: 0
Authors: 13
Year: 2026
Abstract
In busy clinical settings, timely access to comprehensive, actionable guideline recommendations is often limited, which motivates interest in large language models (LLMs) as adjunct tools for scalable evidence access and education. This study evaluated the performance of LLMs on domain-specific question-answering tasks in nursing and assessed how effectively GraphRAG technology improves them. A knowledge graph was constructed from high-quality clinical practice guidelines for pressure injury management and integrated with two base models, Qwen-turbo-0715 and DeepSeek-V3.1, to develop optimized versions (Qwen-turbo-0715-GraphRAG and DeepSeek-V3.1-GraphRAG). Performance was compared among 10 non-specialist nurses, 10 specialist nurses, and the LLMs using a self-developed 25-item questionnaire, designed with expert input, that assessed knowledge of pressure injury management. Group differences were analyzed using the Kruskal–Wallis H test, followed by post hoc pairwise comparisons with Bonferroni correction. Average response times were also recorded for each model. Significant differences were observed among non-specialist nurses, specialist nurses, and LLMs (H = 17.662, P < 0.001). On this structured, guideline-derived benchmark under the study conditions, the LLM groups obtained higher scores than the nurse groups. Qwen-turbo-0715-GraphRAG achieved the highest mean score (98.4), followed by DeepSeek-V3.1-GraphRAG (87.2), Qwen-turbo-0715 (86.4), ChatGPT-5 (82.4), and DeepSeek-V3.1 (77.6). GraphRAG optimization was associated with higher benchmark scores, but at the cost of longer response times. Overall, knowledge graph-enhanced LLMs produced more guideline-aligned and source-grounded outputs on this benchmark. However, these improvements were observed within a standardized, guideline-based assessment framework and may not capture the full complexity of clinical expertise. The findings support further evaluation of knowledge graph–enhanced LLMs in larger, more practice-oriented nursing settings and highlight the importance of considering the relative strengths of different base models before clinical application.
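The statistical comparison named in the abstract can be illustrated with a minimal sketch. It computes the Kruskal–Wallis H statistic with the standard tie correction, converts H to a P-value under the chi-square approximation, and derives a Bonferroni-adjusted significance level for pairwise follow-up tests. Several things here are assumptions and not taken from the paper: that the omnibus test compared three groups (so 2 degrees of freedom), that the chi-square approximation was used, the 0.05 significance level, all function names, and the sample scores, which are invented purely for illustration.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_term = sum(c ** 3 - c for c in Counter(pooled).values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


def chi2_sf_two_df(h):
    """Upper-tail probability of a chi-square variable with 2 degrees of
    freedom, which has the closed form exp(-h / 2). Valid only for 3 groups."""
    return math.exp(-h / 2)


def bonferroni_alpha(alpha, n_groups):
    """Per-comparison significance level for all pairwise comparisons."""
    return alpha / math.comb(n_groups, 2)


if __name__ == "__main__":
    # Hypothetical scores, NOT data from the study.
    non_specialist = [60, 64, 68, 72, 56]
    specialist = [72, 76, 80, 68, 84]
    llms = [98.4, 87.2, 86.4, 82.4, 77.6]  # mean scores quoted in the abstract
    h = kruskal_wallis_h([non_specialist, specialist, llms])
    print("H on illustrative data:", round(h, 3))
    # Assuming 3 groups, the reported H = 17.662 maps to a P-value below 0.001.
    print("P for reported H:", chi2_sf_two_df(17.662))
    print("Bonferroni alpha for 3 groups:", bonferroni_alpha(0.05, 3))
```

Under the three-group assumption, the reported H = 17.662 gives exp(-17.662 / 2), which is below 0.001 and therefore consistent with the P-value stated in the abstract. With small samples an exact test may be preferable to the chi-square approximation.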
Authors
Institutions
- Beijing University of Chinese Medicine (CN)
- Chinese Academy of Medical Sciences & Peking Union Medical College (CN)
- Peking Union Medical College Hospital (CN)
- Second Affiliated Hospital of Zhejiang University (CN)
- Peking University (CN)
- Peking University Third Hospital (CN)
- Guangzhou University of Chinese Medicine (CN)
- Chinese University of Hong Kong, Shenzhen (CN)
- Shenzhen Pingle Orthopedic Hospital (CN)