
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Size doesn’t matter: Assessing the trustworthiness of large language models in medical contexts: A focus on epidural information retrieval

2026 · 0 citations · Artificial Intelligence in Medicine · Open Access

Citations: 0 · Authors: 5 · Year: 2026

Abstract

BACKGROUND: Since the release of ChatGPT, numerous LLMs have emerged, providing easy access to information without the need for technical expertise. However, relying on these systems can influence important life decisions, such as the choice to use epidural analgesia during childbirth. Epidural analgesia is widely regarded as the "gold standard" for pain relief during childbirth. However, limited access to anaesthesiologists and gaps in knowledge may prompt individuals to seek information from unverified sources, including AI systems. Misinformation in this area can discourage the use of effective analgesia, highlighting the need to assess the accuracy of LLM-generated content.

OBJECTIVE: To evaluate the reliability of LLM-generated information regarding epidural analgesia.

METHODS: We posed 10 standardized questions about epidural analgesia to 12 LLMs, each question reformulated 10 times in both Spanish and English, resulting in 2400 responses. Two anaesthesiologists were involved in the assessment process. One expert performed the initial ratings, while the second independently verified the evaluations, which assessed the outputs using an extended SERVQUAL framework.

RESULTS: ChatGPT performed best, followed by Gemini 2. Medium-sized models, such as Phi-3 and OpenChat, outperformed several larger models like Llama-2 or Llama-3, challenging the notion that "bigger is better" and offering potential advantages in low-resource settings (e.g., Phi-3 outperformed Llama-2 with an average increase of 81% across all metrics). Specialized models did not show superior performance. Except for ChatGPT, English responses were generally more reliable than Spanish, with some Spanish outputs incoherent. ChatGPT also exhibited the least variability between responses.

CONCLUSIONS: Despite promising performance, LLMs display limitations in medical contexts. Collaboration between national and international medical societies is crucial to develop evidence-based resources to guide LLM training and improve information trustworthiness.
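
The response count reported in the methods can be checked with a short sketch. It assumes a fully crossed design (every model receives every reformulation of every question in both languages), which the abstract implies but does not state outright; the variable names are illustrative and do not come from the paper.

```python
from itertools import product

# Counts taken from the abstract's METHODS section.
N_MODELS = 12
N_QUESTIONS = 10
N_REFORMULATIONS = 10
LANGUAGES = ("English", "Spanish")

# Assumption: a fully crossed design, one response per combination.
design = list(
    product(range(N_MODELS), range(N_QUESTIONS), range(N_REFORMULATIONS), LANGUAGES)
)

total_responses = len(design)
responses_per_model = total_responses // N_MODELS
responses_per_language = total_responses // len(LANGUAGES)

print(total_responses)         # 2400
print(responses_per_model)     # 200
print(responses_per_language)  # 1200
```

Under this assumption each model contributes 200 rated responses, 100 per language, which is the sample behind each model's per-language reliability comparison.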

Similar works