This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Ways to Generate Synthetic Data for AI Training without Leaking Information
Citations: 0
Authors: 1
Year: 2025
Abstract
Purpose. The purpose of the article is to determine how to generate training-ready synthetic data without leaking personal information by comparing three families – differentially trained GANs, variational autoencoders (VAEs), and diffusion models – across privacy–utility trade-offs, domains, and audit practices.

Research Methodology. A constrained systematic review of 12 peer-reviewed studies (2022–2025). Titles/abstracts were screened, full texts re-appraised, and reported metrics harmonised. Effect sizes were recalculated against each study’s real-data baseline; qualitative comparative analysis with vote-counting identified Pareto-efficient regions. The privacy evidence considered differential privacy budgets, membership-inference AUC (Area Under the ROC Curve), and duplication checks; no new data were collected.

Scientific novelty. (i) A cross-modal synthesis that maps generator families to privacy–utility frontiers rather than single benchmarks; (ii) evidence that diffusion with calibrated, early-step noise consistently attains lower leakage at comparable utility; (iii) an ‘overlap-free similarity’ metric that combines nearest-neighbour redundancy with DP bounds for audit-ready risk scoring; (iv) domain-aware heuristics showing when KD-tree post-processing can harden legacy GAN pipelines for tabular data.

Conclusions. Diffusion models paired with calibrated privacy noise offer the most favourable privacy–utility balance in high-stakes settings; GANs remain viable under looser risk budgets or tight computational constraints, especially with post-processing; VAE hybrids bridge the middle regimes. Practically, teams can reach production-grade privacy faster by (a) placing noise where model dynamics dissipate it, (b) adopting the proposed audit metric alongside membership-inference tests, and (c) tailoring generators to domain constraints (healthcare images, finance time-series, recommender logs).
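The duplication checks mentioned in the abstract can be illustrated with a minimal nearest-neighbour redundancy audit: count how many synthetic rows sit implausibly close to a real training row. This is a sketch of the general idea only; the function name, the distance metric, and the `threshold` value are illustrative assumptions, not taken from the reviewed studies or the proposed ‘overlap-free similarity’ metric.

```python
import numpy as np

def nearest_neighbour_leak_rate(real, synth, threshold=1e-6):
    """Fraction of synthetic rows that (near-)duplicate a real row.

    A high rate suggests the generator memorised training records.
    `threshold` is a hypothetical distance cut-off chosen for this demo.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Pairwise Euclidean distances, shape (n_synth, n_real)
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)
    # Distance from each synthetic row to its closest real row
    nearest = d.min(axis=1)
    return float((nearest < threshold).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 5))
synth = rng.normal(size=(100, 5))          # independent draws: no copies
leaky = np.vstack([synth, real[:3]])       # inject 3 exact copies of real rows

print(nearest_neighbour_leak_rate(real, synth))  # 0.0
print(nearest_neighbour_leak_rate(real, leaky))  # 3/103, about 0.029
```

In practice such a check would complement, not replace, membership-inference tests and formal DP accounting, since exact-copy detection misses subtler forms of leakage.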
Related Works
k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY
2002 · 8,412 citations
Calibrating Noise to Sensitivity in Private Data Analysis
2006 · 6,916 citations
Deep Learning with Differential Privacy
2016 · 5,640 citations
Federated Machine Learning
2019 · 5,610 citations
Communication-Efficient Learning of Deep Networks from Decentralized Data
2016 · 5,600 citations