This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Ways to Generate Synthetic Data for AI Training without Leaking Information
Citations: 0
Authors: 1
Year: 2025
Abstract
Purpose. The purpose of the article is to determine how to generate training-ready synthetic data without leaking personal information by comparing three families – differentially trained GANs, variational autoencoders (VAEs), and diffusion models – across privacy–utility trade-offs, domains, and audit practices.

Research Methodology. A constrained systematic review of 12 peer-reviewed studies (2022–2025). Titles/abstracts were screened, full texts re-appraised, and reported metrics harmonised. Effect sizes were recalculated against each study’s real-data baseline; qualitative comparative analysis with vote-counting identified Pareto-efficient regions. The privacy evidence considered differential privacy budgets, membership-inference AUC (Area Under the ROC Curve), and duplication checks; no new data were collected.

Scientific novelty. (i) A cross-modal synthesis that maps generator families to privacy–utility frontiers rather than single benchmarks; (ii) evidence that diffusion with calibrated, early-step noise consistently attains lower leakage at comparable utility; (iii) an ‘overlap-free similarity’ metric that combines nearest-neighbour redundancy with DP bounds for audit-ready risk scoring; (iv) domain-aware heuristics showing when KD-tree post-processing can harden legacy GAN pipelines for tabular data.

Conclusions. Diffusion models paired with calibrated privacy noise offer the most favourable privacy–utility balance in high-stakes settings; GANs remain viable under looser risk budgets or tight computational constraints, especially with post-processing; VAE hybrids bridge the middle regimes. Practically, teams can reach production-grade privacy faster by (a) placing noise where model dynamics dissipate it, (b) adopting the proposed audit metric alongside membership-inference tests, and (c) tailoring generators to domain constraints (healthcare images, finance time-series, recommender logs).
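The duplication checks mentioned in the abstract can be illustrated with a minimal nearest-neighbour redundancy audit: count how many synthetic rows sit implausibly close to a real training row. This is a sketch of the general idea only; the function name, the distance metric, and the `threshold` value are illustrative assumptions, not taken from the reviewed studies or the proposed ‘overlap-free similarity’ metric.

```python
import numpy as np

def nearest_neighbour_leak_rate(real, synth, threshold=1e-6):
    """Fraction of synthetic rows that (near-)duplicate a real row.

    A high rate suggests the generator memorised training records.
    `threshold` is a hypothetical distance cut-off chosen for this demo.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Pairwise Euclidean distances, shape (n_synth, n_real)
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)
    # Distance from each synthetic row to its closest real row
    nearest = d.min(axis=1)
    return float((nearest < threshold).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 5))
synth = rng.normal(size=(100, 5))          # independent draws: no copies
leaky = np.vstack([synth, real[:3]])       # inject 3 exact copies of real rows

print(nearest_neighbour_leak_rate(real, synth))  # 0.0
print(nearest_neighbour_leak_rate(real, leaky))  # 3/103, about 0.029
```

In practice such a check would complement, not replace, membership-inference tests and formal DP accounting, since exact-copy detection misses subtler forms of leakage.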
Related Works
k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY
2002 · 8,412 citations
Calibrating Noise to Sensitivity in Private Data Analysis
2006 · 6,916 citations
Deep Learning with Differential Privacy
2016 · 5,640 citations
Federated Machine Learning
2019 · 5,610 citations
Communication-Efficient Learning of Deep Networks from Decentralized Data
2016 · 5,600 citations