OpenAlex · Updated hourly · Last updated: 26.03.2026, 10:50

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

AI-generated data contamination erodes pathological variability and diagnostic reliability

2026 · 1 citation · medRxiv · Open Access

Citations: 1 · Authors: 19 · Year: 2026

Abstract

Generative artificial intelligence (AI) is rapidly populating medical records with synthetic or partially AI-generated content, creating a feedback loop in which future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability of medical data at population scale. By analysing more than 800,000 synthetic data points across clinical text generation, vision–language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence. Models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies that can be easily integrated into existing clinical workflows, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.
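The self-referential cycle the abstract describes, and the real-data-mixing mitigation it finds effective, can be illustrated with a deliberately simplified toy model. The sketch below is not the paper's actual methodology: it reduces "training" to fitting an empirical label distribution over hypothetical finding labels (the label names and frequencies are illustrative, not the study's data), and "generation" to resampling from that distribution. In the pure self-training loop, a label that ever drops to zero frequency can never reappear, so rare findings tend to go extinct; mixing the real corpus back in at every generation guarantees they remain represented.

```python
import random
from collections import Counter

def fit(corpus):
    """'Train' a toy generator: just the empirical label distribution."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

def generate(model, n, rng):
    """Sample n synthetic records; labels at probability 0 never reappear."""
    labels, weights = zip(*model.items())
    return rng.choices(labels, weights=weights, k=n)

def self_training(real, generations, keep_real, rng):
    """Retrain on synthetic output; optionally mix the real corpus back in."""
    corpus = list(real)
    for _ in range(generations):
        synthetic = generate(fit(corpus), len(real), rng)
        corpus = (list(real) + synthetic) if keep_real else synthetic
    return fit(corpus)

rng = random.Random(0)
# Hypothetical label frequencies loosely echoing the abstract's setting:
# one dominant benign class and rare critical findings.
real = (["no acute finding"] * 180 + ["pleural effusion"] * 12
        + ["cardiomegaly"] * 6 + ["pneumothorax"] * 2)
pure = self_training(real, 40, keep_real=False, rng=rng)   # uncurated loop
mixed = self_training(real, 40, keep_real=True, rng=rng)   # real-data mixing
```

In the pure loop the rarest label often disappears within a few dozen generations, while the mixed loop's support provably always contains every label present in the real corpus. This mirrors, in miniature, the abstract's claim that synthetic volume alone cannot prevent collapse but reinjecting real data preserves diversity.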

Similar works