This is an overview page with metadata about this scientific paper. The full article is available from the publisher.
Variation in Large Language Model Recommendations in Challenging Inpatient Management Scenarios
Citations: 8
Authors: 4
Year: 2025
Abstract
Importance: Large language models (LLMs) are entering clinical workflows, yet their behavior in routine bedside decisions that lack a single “correct” recommendation remains unclear.
Objective: To describe variation within and across commercially available LLMs when confronted with common, judgment-dependent inpatient medicine management scenarios.
Design: Cross-sectional simulation study. Four brief vignettes requiring a binary management decision were posed to each model in five independent sessions. Six LLMs were queried: five general-purpose (GPT-4o, GPT-o1, Claude 3.7 Sonnet, Grok 3, and Gemini 2.0 Flash) and one domain-specific (OpenEvidence).
Exposures: Standardized prompts describing (1) transfusion at borderline hemoglobin, (2) resumption of anticoagulation after gastrointestinal bleed, (3) discharge readiness despite a modest creatinine rise, and (4) peri-procedural bridging in a high-risk patient on apixaban.
Main Measures: Primary outcomes were each model’s overall recommendation (majority across five runs) and its internal consistency (proportion of identical recommendations across runs; range 0–1). Inter-model agreement was the proportion of models giving the same recommendation.
Results: A total of 120 model-vignette interactions were analyzed. Inter-model recommendations diverged in every scenario: transfuse vs observe (67% vs 33% of models), restart vs hold anticoagulation (50% vs 50%), discharge vs delay (50% vs 50%), and bridge vs no-bridge (17% vs 83%). Across five repeated queries of the same vignette, some models changed recommendations in two of five runs (internal consistency as low as 0.60). OpenEvidence was the most internally consistent and concrete in its recommendations; every other model displayed internal variability in one or more vignettes.
Conclusions: For nuanced inpatient management questions, widely used LLMs produced inter- and intra-model variation in their recommendations. Clinicians should view LLM output as one perspective among many, consider sampling multiple models or re-prompting, and retain final responsibility for bedside decisions. Prospective studies are needed to test designs that surface model uncertainty and support safe integration of generative AI into complex decision-making.
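The outcome measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes that internal consistency means the share of a model's five runs that match its majority recommendation, which is consistent with the reported minimum of 0.60 (three of five runs). The model names and run data below are hypothetical placeholders, not study results.

```python
from collections import Counter


def majority_and_consistency(runs):
    """Return (majority recommendation, share of runs matching it).

    Assumes a binary decision and an odd number of runs, so no tie can occur.
    """
    counts = Counter(runs)
    recommendation, votes = counts.most_common(1)[0]
    return recommendation, votes / len(runs)


def inter_model_agreement(overall_by_model):
    """Return the proportion of models giving each overall recommendation."""
    counts = Counter(overall_by_model.values())
    total = len(overall_by_model)
    return {rec: n / total for rec, n in counts.items()}


# Hypothetical data for one vignette: three made-up models, five runs each.
runs_by_model = {
    "model_a": ["transfuse"] * 5,
    "model_b": ["transfuse", "observe", "transfuse", "observe", "transfuse"],
    "model_c": ["observe"] * 5,
}

overall = {}
consistency = {}
for model, runs in runs_by_model.items():
    overall[model], consistency[model] = majority_and_consistency(runs)

agreement = inter_model_agreement(overall)
print(overall)
print(consistency)
print(agreement)
```

In this made-up example, `model_b` switches its answer in two of five runs, so its consistency is 0.6 while its overall recommendation still counts as "transfuse" in the inter-model split.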
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,707 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,613 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,159 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,875 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations