OpenAlex · Updated hourly · Last updated: May 7, 2026, 21:40

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

MumbleMED: An End-to-End Framework for Fine-Tuning Automatic Speech Recognition Models to Medical Language using Large Language and Text-To-Speech Models (Preprint)

2026 · 0 citations · 16 authors

Abstract

BACKGROUND: Medical speech is a central interface for numerous tasks in modern healthcare workflows, including documentation, report generation, and structured clinical communication. Although documentation is a cornerstone of clinical care, it imposes a substantial administrative burden on healthcare providers. Automatic Speech Recognition (ASR) offers a promising solution for real-time documentation. However, general-purpose models such as OpenAI's Whisper struggle with the linguistic density, domain-specific terminology, and structural conventions characteristic of clinical dictation. Moreover, the development of high-performance medical ASR is constrained by a persistent privacy–utility bottleneck, as access to large-scale, annotated clinical audio data is limited by stringent data protection regulations.

OBJECTIVE: This study aims to develop and evaluate a privacy-preserving framework for fine-tuning ASR models on synthetic clinical speech data, eliminating the need for sensitive patient recordings. We investigate whether LLM-generated, terminology-grounded audio can bridge the performance gap between general-purpose ASR and the demands of real-world clinical dictation across medical specialties.

METHODS: We present MumbleMED, a privacy-by-design framework for scalable and configurable fine-tuning of ASR models. Realistic clinical narratives were synthesized using Large Language Models (LLMs) based on standardized coding systems (ICD-10, OPS, and RadLex) and converted into high-quality audio via a German Text-to-Speech (TTS) system cloning institutional voices. The resulting pipeline was used to fine-tune Whisper model variants (Tiny to Large v2) on 33,898 synthetic samples (32 hours of audio). Performance was evaluated against a gold-standard benchmark comprising 175 professionally recorded clinical reports from five medical specialties: radiology, pathology, discharge letters, progress notes, and surgical reports.

RESULTS: Fine-tuning substantially reduced transcription error rates across all model sizes. On the combined TTS test dataset, the best-performing model, Combined MumbleMED based on Whisper Large v2, reduced the word error rate (WER) from 36.71% (baseline Whisper Large v2) to 8.20% and the character error rate (CER) from 21.07% to 4.20%. On authentic clinical dictation, the baseline Whisper Large v2 model achieved a WER of 68.22%, which decreased to 28.90% after MumbleMED fine-tuning, an absolute reduction of nearly 40 percentage points. Stratified analyses revealed the lowest error rates in surgical and radiology reports (WER ~19%), whereas discharge letters remained more challenging. In a focused evaluation of 6,576 medical terms, term-specific WER decreased from 38.45% to 13.31%. Qualitative assessment showed that the models learned to correctly map verbalized punctuation to symbols and to normalize complex clinical units.

CONCLUSIONS: MumbleMED demonstrates that synthetic clinical speech can effectively bridge the domain gap in ASR performance while remaining compliant with regulatory requirements. By enabling local deployment of high-accuracy ASR within secure clinical infrastructures, the framework provides a practical and privacy-preserving pathway for adapting transcription models to clinical needs.
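The WER and CER figures reported in the abstract are standard edit-distance metrics. As an illustrative sketch (not code from the paper, which may use a dedicated evaluation library), both can be computed with a Levenshtein distance over word tokens or characters:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting all reference tokens up to i
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting all hypothesis tokens up to j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```

For example, one substituted word in a three-word reference yields a WER of 1/3. In practice, reported numbers also depend on text normalization choices (casing, punctuation, unit spelling), which matter especially for clinical dictation.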
