This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Privacy-Aware Synthetic Tabular Data Generation for Healthcare: Application to Sepsis Detection

2026 · 0 citations · Bioengineering · Open Access

Abstract

Background: Machine learning-based Artificial Intelligence (AI) models have shown significant potential in the biomedical field, offering promising advances in diagnostics, personalized medicine, and patient care. However, to build these models, we have to deal with important challenges, including (1) the scarcity and low quality of available datasets in many important applications and (2) privacy concerns associated with sensitive patient data. Synthetic data (SD) generation has emerged as a promising strategy to address these challenges, yet many existing approaches struggle to simultaneously preserve privacy and accurately model tabular data, the predominant format in healthcare.

Methods: We propose Kernel Density Estimation–K-Nearest Neighbors (KDE-KNN), a privacy-aware tabular data generation method, and evaluate its performance against state-of-the-art techniques. Using sepsis detection as a real-world case study, we assess both data utility and privacy protection.

Results: Models trained on KDE-KNN-generated SD outperformed those trained on real data across both internal testing and external validation. In particular, a support vector machine achieved superior performance when trained on SD relative to real data. This gain is likely driven by the balanced class distribution of the synthetic dataset, underscoring KDE-KNN’s utility as an effective data balancing strategy. Consistent performance in external validation further supports the robustness and generalizability of the proposed approach. Privacy evaluation indicated a lower re-identification risk, with a mean distance to closest record of 4.971 between synthetic and real samples, compared with 2.715 among real samples.

Conclusions: KDE-KNN effectively captures underlying population distributions while generating high-quality SD that preserve statistical fidelity and protect sensitive information. By balancing the trade-off between utility and privacy, the method produces representative datasets without exposing individual records. These findings position KDE-KNN as a valuable tool for data-scarce and privacy-sensitive applications, with broad potential across healthcare and other data-driven domains.
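The privacy figures in the abstract are mean distances to the closest record (DCR). As a rough illustration of how such a number can be computed, the sketch below compares the synthetic-to-real DCR with the real-to-real DCR on toy data. The Euclidean metric, the function names `dcr_between` and `dcr_within`, and the toy rows are assumptions made for this example; the abstract does not state which distance or preprocessing the authors used.

```python
import math
from statistics import mean


def dcr_between(synthetic, real):
    """Mean distance from each synthetic row to its closest real row.

    Assumes numeric rows of equal length and Euclidean distance
    (an illustrative choice, not confirmed by the abstract).
    """
    return mean(min(math.dist(s, r) for r in real) for s in synthetic)


def dcr_within(real):
    """Mean distance from each real row to its closest *other* real row."""
    return mean(
        min(math.dist(a, b) for j, b in enumerate(real) if j != i)
        for i, a in enumerate(real)
    )


# Hypothetical toy data, not taken from the paper.
real_rows = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
synthetic_rows = [(0.0, 0.5), (3.0, 0.0)]

print(dcr_between(synthetic_rows, real_rows))  # 1.75
print(dcr_within(real_rows))
```

A synthetic-to-real DCR that is larger than the real-to-real DCR, as with the reported 4.971 versus 2.715, suggests that synthetic rows do not sit unusually close to individual real records. In practice, features would usually be scaled first so that no single column dominates the distance.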

Topics

Privacy-Preserving Technologies in Data · COVID-19 Digital Contact Tracing · Artificial Intelligence in Healthcare and Education
