This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmarking And Datasets For Ambient Clinical Documentation: A Scoping Review Of Existing Frameworks And Metrics For AI-Assisted Medical Note Generation
Citations: 13
Authors: 1
Year: 2025
Abstract
Background
The increasing adoption of ambient artificial intelligence (AI) scribes in healthcare has created an urgent need for robust evaluation frameworks to assess their performance and clinical utility. While these tools show promise in reducing documentation burden, there remains no standardized approach for measuring their effectiveness and safety.

Objective
To systematically review existing evaluation frameworks and metrics used to assess AI-assisted medical note generation from doctor-patient conversations, and to provide recommendations for future evaluation approaches.

Methods
A scoping review following PRISMA guidelines was conducted across PubMed, IEEE Xplore, Scopus, Web of Science, and Embase to identify studies evaluating ambient scribe technology published between 2020 and 2025. Studies were included if they were peer-reviewed, focused on evaluating clinical ambient scribes from spoken conversation through note production, and described an evaluation approach. Extracted data included evaluation metrics, benchmarking approaches, dataset characteristics, and model performance.

Results
Seven studies met the inclusion criteria. Evaluation approaches varied widely, from traditional natural language processing metrics such as ROUGE and BERTScore to domain-specific measures such as clinical accuracy and bias. Critical gaps identified include: 1) a wide diversity of evaluation metrics, making cross-study comparison difficult; 2) limited integration of clinical relevance into automated metrics; 3) a lack of standardized approaches for crucial metrics such as hallucinations and errors; and 4) minimal diversity in the clinical specialties evaluated. Only two datasets were publicly available for benchmarking.

Conclusions
This review reveals significant heterogeneity in how ambient scribes are evaluated, highlighting the need for standardized evaluation frameworks. We propose recommendations for developing comprehensive evaluation approaches that combine automated metrics with clinical quality measures. Future work should focus on creating public benchmarks across diverse clinical settings and establishing consensus on critical safety and quality metrics.
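To make the automated metrics named in the abstract concrete, the sketch below shows a minimal unigram ROUGE-1 F1 score, one of the overlap metrics the review found in use for comparing AI-generated notes against reference notes. This is an illustrative simplification, not the implementation used in any reviewed study; real evaluations typically rely on established packages, and the example note texts are invented.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference note and a generated note."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Overlap counts each shared token at most min(ref, cand) times.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical clinician note vs. AI-generated note:
reference = "patient reports chest pain for two days"
generated = "patient reports chest pain since yesterday"
print(f"ROUGE-1 F1: {rouge1_f1(reference, generated):.2f}")  # → 0.62
```

A high overlap score does not guarantee clinical accuracy, which is precisely the gap the review identifies: surface-level metrics like this one cannot detect hallucinated findings or omitted safety-relevant details.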
Related Works
Machine Learning in Medicine
2019 · 3,637 citations
Systematic Review: Impact of Health Information Technology on Quality, Efficiency, and Costs of Medical Care
2006 · 3,170 citations
Effects of Computerized Clinical Decision Support Systems on Practitioner Performance and Patient Outcomes
2005 · 2,965 citations
Studies in health technology and informatics
2008 · 2,903 citations
Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success
2005 · 2,688 citations