OpenAlex · Updated hourly · Last updated: 26.05.2026, 08:33

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Language-dependent diagnostic safety of medical AI systems: a cross-lingual benchmarking and prospective clinical study

2026 · 0 citations · medRxiv · Open Access

Citations: 0 · Authors: 16 · Year: 2026

Abstract

Background: Patients worldwide receive healthcare in many languages, yet medical AI systems are validated almost exclusively in high-resource languages such as English and Chinese, exposing patients in other linguistic settings to unquantified diagnostic risk. Existing multilingual evaluations rely on translated research-style benchmarks that fail to capture authentic clinical work. We aimed to characterise the patient safety consequences of multilingual medical AI deployment in real-world clinical settings and to develop an auditable detection method for unsafe outputs.

Methods: We evaluated different large language models (LLMs) and visual language models (VLMs) across four real-world clinical tasks (conversational QA, radiology report generation, glaucoma diagnosis, ICU re-intubation prediction) in five languages (English, Chinese, Malay, Thai, Persian). We developed a token-level uncertainty toolkit to localise reasoning instability, compared three inference paradigms (native-language, English chain-of-thought, back-translation pivot), and conducted a prospective study (50 dialogues, 150 physician-reviewed records).

Findings: LLM and VLM performance degraded consistently from high- to low-resource languages across all tasks. Key gaps included: HealthBench score declining from 0·3743 to 0·3180; radiology macro-F1 from 0·2938 to 0·2149–0·2424, consistent with selective pathology suppression; glaucoma accuracy from 50·7% to 32·7%; ICU parameter recall from 100·0% to 48·5%. Multimodal inputs amplified degradation. Qwen3 VL 235B showed attenuated decline with no resource-ordered pattern in glaucoma classification. Token-level analysis localised instability to mid-chain stages (40–70% of the normalised trajectory); perplexity-based confidence failed to flag errors (AUROC 0·41–0·66). Back-translation pivot consistently restored performance. In the prospective study, 98·7% of records required physician edits (overall modification score 53·6%); Thai-pivot correction burden (59·0%) exceeded English-pivot (50·7%, p=0·003) and Chinese-direct (51·0%, p=0·004).

Interpretation: Multilingual deployment produced clinically consequential failures, including missed pathology, distorted physiological extraction, and amplified multimodal misclassification, that were invisible to monolingual validation and not reliably flagged by model confidence. Pre-training data composition may contribute to multilingual safety risk. Language-specific safety auditing should precede deployment in non-dominant-language healthcare settings; the open-source detection toolkit enables this without model retraining.
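To make the reported AUROC range for perplexity-based confidence easier to interpret, here is a minimal sketch of how such a figure could be computed. It is not the study's toolkit: the perplexity definition (exponential of the mean negative token log-probability), the function names, and the toy log-probabilities are all assumptions made for illustration. An AUROC near 0·5 means that perplexity ranks erroneous and correct outputs no better than chance, and a value below 0·5 means that erroneous outputs tend to look more confident than correct ones.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one output: exp of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: token log-probabilities per output, and 1 if the output was erroneous.
outputs = [
    ([-0.1, -0.2, -0.1], 0),
    ([-1.5, -2.0, -1.0], 1),
    ([-0.3, -0.2, -0.4], 1),
    ([-0.9, -1.1, -1.0], 0),
]

ppl = [perplexity(logprobs) for logprobs, _ in outputs]
is_error = [label for _, label in outputs]

# Higher perplexity is treated as the signal that an output may be wrong.
print(auroc(ppl, is_error))
```

The pairwise formulation is quadratic in the number of outputs, which is fine for a small audit set; a rank-based implementation would be preferable for large evaluations.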
