This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating Dutch-Language Ambient Listening in Simulated Clinical Encounters: Comparing Three Providers in a Multi-Speaker, Multi-Dialect Study (Preprint)
Citations: 0
Authors: 7
Year: 2026
Abstract
BACKGROUND: Clinicians spend a substantial share of their time on electronic health record (EHR) documentation, often at the expense of patient interaction. Ambient listening technology uses artificial intelligence to passively record and summarize clinical encounters. While initial studies are promising, there is limited evidence on system performance in complex, non-English settings.

OBJECTIVE: To compare the documentation performance of three commercially available ambient listening systems in simulated Dutch-language outpatient consultations by assessing note completeness, correctness, and conciseness under predefined linguistic and interactional challenges.

METHODS: Standardized audio recordings of ten scripted physician–patient interactions in two specialties were used. Scenarios included multi-speaker dynamics (patient companion), conversational disruptions (nurse interruption), evasive patient communication, and a regional dialect (Gronings). Three distinct AI documentation systems (Provider A, Provider B, and Provider C) processed the audio files. Eight human raters evaluated the resulting AI-generated notes against reference summaries for completeness, conciseness, and correctness using a 5-point ordinal scale. Inter-rater agreement was assessed using Gwet’s AC2. System-level technical characteristics were assessed alongside clinical performance to aid interpretation of between-vendor differences.

RESULTS: Across 351 ratings on a scale of 1 to 5, overall inter-rater agreement was high (Gwet’s AC2 = 0.827). Mean scores were tightly clustered across providers (Provider C: 4.26, Provider B: 4.00, Provider A: 3.82). Mean scores were higher in Otolaryngology (mean 4.36) than in Surgical Oncology (mean 3.68). Across scoring domains, correctness received the highest mean score (4.21), while completeness received the lowest (3.81). Mean scores varied across script scenarios. Dialect-specific scenarios showed the lowest mean score (3.77) and the greatest variability across providers. Median summary generation times ranged from 13.5 seconds (Provider C) to 22.0 seconds (Provider B).

CONCLUSIONS: Ambient listening systems demonstrate good performance in Dutch clinical settings, even under conditions simulating common conversational challenges. While accuracy is generally high, performance is sensitive to linguistic variation. Future deployment studies must prioritize linguistic equity, real-world validation of efficiency gains, and evaluation of both clinician and patient perception to understand how these systems influence consultation dynamics and care delivery across diverse patient populations.
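For readers unfamiliar with the agreement statistic named in the methods, the following is a minimal sketch of how Gwet's AC2 can be computed for several raters on an ordinal scale. It is not the authors' analysis code. The abstract does not state which weighting scheme was used, so the linear weights here are an assumption, and the function names (`linear_weights`, `identity_weights`, `gwet_ac2`) and the example ratings are hypothetical.

```python
def linear_weights(q):
    """Linear agreement weights (an assumption; the abstract names no scheme)."""
    return [[1 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]


def identity_weights(q):
    """Unweighted case: only exact matches count, which reduces AC2 to AC1."""
    return [[1.0 if k == l else 0.0 for l in range(q)] for k in range(q)]


def gwet_ac2(ratings, categories, weights=None):
    """Gwet's AC2 for multiple raters.

    ratings: one list per rated item holding the raters' scores
             (None marks a missing rating).
    categories: ordered list of all possible scores, e.g. [1, 2, 3, 4, 5].
    """
    q = len(categories)
    pos = {c: i for i, c in enumerate(categories)}
    w = weights if weights is not None else linear_weights(q)

    # Count how many raters placed each item in each category.
    counts = []
    for item in ratings:
        row = [0] * q
        for score in item:
            if score is not None:
                row[pos[score]] += 1
        if sum(row) > 0:
            counts.append(row)

    # Weighted observed agreement, using items rated by at least two raters.
    multi = [row for row in counts if sum(row) >= 2]
    if not multi:
        raise ValueError("need at least one item with two or more ratings")
    p_a = 0.0
    for row in multi:
        r_i = sum(row)
        for k in range(q):
            weighted = sum(w[k][l] * row[l] for l in range(q))
            p_a += row[k] * (weighted - 1) / (r_i * (r_i - 1))
    p_a /= len(multi)

    # Chance agreement from the average category prevalence.
    pi = [sum(row[k] / sum(row) for row in counts) / len(counts) for k in range(q)]
    t_w = sum(sum(r) for r in w)
    p_e = t_w / (q * (q - 1)) * sum(p * (1 - p) for p in pi)

    return (p_a - p_e) / (1 - p_e)


# Hypothetical example: three notes, each scored by three raters on a 1-5 scale.
example = [[4, 4, 5], [3, 4, 4], [5, 5, 5]]
ac2 = gwet_ac2(example, categories=[1, 2, 3, 4, 5])
```

With ordinal weights, near misses (for example a 4 versus a 5) count as partial agreement, which is why a weighted coefficient suits a 5-point scale better than an unweighted one.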