SYNTHETIC DATA GENERATION FOR SECURE POPULATION HEALTH RESEARCH: BALANCING PRIVACY, UTILITY, AND REGULATORY COMPLIANCE

Authors

  • Adaeze Ojinika Ezeogu. Affiliation: University of West Georgia, United States of America. Department: MSc. Cybersecurity & Information Management

Keywords:

Synthetic data generation, Privacy-preserving research, Differential privacy, Healthcare GANs, Population health analytics, HIPAA compliance, Data utility metrics, Secure data sharing

Abstract

Balancing population health research against patient privacy is a growing bottleneck in healthcare innovation. This paper proposes a scalable way to advance research while guaranteeing both privacy and data utility: synthetic healthcare datasets. We introduce a machine learning framework that generates high-quality synthetic healthcare data with both statistical and mathematical privacy guarantees. These datasets are sampled from a model of a real population and contain no real patient records, so analytics can be performed without direct access to patient data.

We describe a hybrid Variational Autoencoder-Generative Adversarial Network (VAE-GAN) framework with differential privacy (DP), designed for the particular structures of healthcare data: mixed data types, temporal sequences, and complex correlations. The framework includes "medical constraint layers" that enforce basic clinical rules (e.g., a male patient cannot be pregnant) and preserve population statistics. We validated the method on our prior population health segmentation research and found that the synthetic data retains 96.7% of the utility of the real data across 15 epidemiological metrics under (ε=1.0, δ=10^-5)-DP. The VAE-GAN-DP framework preserves critical relationships: disease comorbidities (r=0.94), population disparities (KL divergence < 0.02), and the natural progression of diseases over time (dynamic time warping (DTW) distance < 0.05).
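To make two of the ideas above concrete, the following is a minimal sketch, not the authors' implementation. It shows one common way to estimate the KL divergence between a real and a synthetic categorical attribute, and a simple post-hoc rule check standing in for a medical constraint. The function names, the record field names (`sex`, `pregnant`), the smoothing constant, and the toy data are all assumptions made for illustration; the paper's constraint layers operate inside the generative model, whereas this sketch only filters finished records.

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, smoothing=1e-9):
    """Estimate KL(P_real || P_synthetic) in nats for one categorical attribute.

    A small additive smoothing term (an assumption of this sketch) avoids
    division by zero when a category is missing from one of the samples.
    """
    categories = set(real) | set(synthetic)
    real_counts = Counter(real)
    synth_counts = Counter(synthetic)
    k = len(categories)
    total_real = len(real) + smoothing * k
    total_synth = len(synthetic) + smoothing * k

    kl = 0.0
    for category in categories:
        p = (real_counts[category] + smoothing) / total_real
        q = (synth_counts[category] + smoothing) / total_synth
        kl += p * math.log(p / q)
    return kl


def violates_medical_constraints(record):
    """Hypothetical rule check: flag records that are male and pregnant."""
    return record.get("sex") == "male" and record.get("pregnant") is True


# Toy, made-up data for illustration only (not from the paper).
real_sex = ["female"] * 50 + ["male"] * 50
synthetic_sex = ["female"] * 60 + ["male"] * 40
print(f"KL divergence: {kl_divergence(real_sex, synthetic_sex):.4f}")

synthetic_records = [
    {"sex": "female", "pregnant": True},
    {"sex": "male", "pregnant": True},   # implausible, should be filtered out
    {"sex": "male", "pregnant": False},
]
valid_records = [r for r in synthetic_records if not violates_medical_constraints(r)]
print(f"Kept {len(valid_records)} of {len(synthetic_records)} records")
```

In practice such a divergence would be computed per attribute or per subgroup and compared against a threshold such as the 0.02 quoted above; how the paper aggregates it across attributes is not stated here.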

We demonstrate research with synthetic data in three case studies: (1) reproducing published population health studies using synthetic data only, with 94% of the original results replicated; (2) training machine learning models on synthetic data that performed within 2.3% of models trained on the real data; and (3) conducting cross-institutional population health studies for which data sharing was previously impossible due to privacy concerns. We also provide a regulatory review of synthetic data under U.S. healthcare law (the HIPAA Safe Harbor method is one of the ways to meet HIPAA de-identification standards) and international data protection laws. We offer the community open-source tools for synthetic data creation, validation, and regulatory compliance documentation.

Our economic impact analysis shows that synthetic data could help population health research progress 3-5x faster while lowering compliance costs by 67%; the U.S. healthcare research industry could save $1.2 billion annually in data preparation and legal expenses.

Published

2025-03-10