OpenAlex · Updated hourly · Last updated: 19.05.2026, 16:05

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Variation in Large Language Model Recommendations in Challenging Inpatient Management Scenarios

2025 · 8 citations · Journal of General Internal Medicine · Open Access

Citations: 8 · Authors: 4 · Year: 2025

Abstract

Importance: Large language models (LLMs) are entering clinical workflows, yet their behavior in routine bedside decisions that lack a single “correct” recommendation remains unclear.

Objective: To describe variation within and across commercially available LLMs when confronted with common, judgment-dependent inpatient medicine management scenarios.

Design: Cross-sectional simulation study. Four brief vignettes requiring a binary management decision were posed to each model in five independent sessions. Six LLMs were queried: five general-purpose (GPT-4o, GPT-o1, Claude 3.7 Sonnet, Grok 3, and Gemini 2.0 Flash) and one domain-specific (OpenEvidence).

Exposures: Standardized prompts describing (1) transfusion at borderline hemoglobin, (2) resumption of anticoagulation after gastrointestinal bleed, (3) discharge readiness despite a modest creatinine rise, and (4) peri-procedural bridging in a high-risk patient on apixaban.

Main Measures: Primary outcomes were each model’s overall recommendation (majority across five runs) and its internal consistency (proportion of identical recommendations across runs; range 0–1). Inter-model agreement was the proportion of models giving the same recommendation.

Results: A total of 120 model-vignette interactions were analyzed. Inter-model recommendations diverged in every scenario: transfuse vs observe (67% vs 33% of models), restart vs hold anticoagulation (50% vs 50%), discharge vs delay (50% vs 50%), and bridge vs no-bridge (17% vs 83%). Across five repeated queries of the same vignette, some models changed recommendations in two of five runs (internal consistency as low as 0.60). OpenEvidence was the most internally consistent and concrete in its recommendations; every other model displayed internal variability in one or more vignettes.

Conclusions: For nuanced inpatient management questions, widely used LLMs produced inter- and intra-model variation in their recommendations. Clinicians should view LLM output as one perspective among many, consider sampling multiple models or re-prompting, and retain final responsibility for bedside decisions. Prospective studies are needed to test designs that surface model uncertainty and support safe integration of generative AI into complex decision-making.
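
The outcome measures named in the abstract can be illustrated with a minimal Python sketch. It assumes that internal consistency means the share of runs matching the majority recommendation, which agrees with the reported value of 0.60 when two of five runs differ; the abstract does not spell out the exact computation. The function names and the example runs are hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_and_consistency(runs):
    """Return (majority recommendation, share of runs that match it).

    Assumption: internal consistency is the fraction of runs equal to the
    most common recommendation. On a tie, the label seen first wins.
    """
    label, count = Counter(runs).most_common(1)[0]
    return label, count / len(runs)


def inter_model_agreement(overall_by_model):
    """Return the share of models backing each overall recommendation."""
    counts = Counter(overall_by_model.values())
    total = len(overall_by_model)
    return {label: n / total for label, n in counts.items()}


# Hypothetical runs for illustration only; not data from the study.
example_runs = ["transfuse", "observe", "transfuse", "transfuse", "observe"]
label, consistency = majority_and_consistency(example_runs)
print(label, consistency)  # prints: transfuse 0.6
```

With an odd number of runs and a binary decision, as in the study design, a tie within one model cannot occur, so the majority is always well defined.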

Related works