Synthetic data generation balancing privacy and utility, using vine copulas

Publication details

  • Event: OCBE internal seminar

High-quality data, analysed with statistical and machine learning (ML) methods, have led to tremendous advances in science, technology and society at large. However, real-world data often cannot be shared with the research community because of privacy restrictions, which impairs progress, especially in biomedical research. Synthetic data can substitute for the sensitive real data, as long as they do not disclose private information, and this approach has proven successful for training downstream ML applications. We propose TVineSynth, a vine copula based synthetic tabular data generator. TVineSynth is designed to balance privacy and utility, using the vine tree structure and its truncation to control the trade-off. In contrast to synthetic data generators that achieve differential privacy (DP) by globally adding noise, TVineSynth performs a controlled approximation of the estimated data-generating distribution. As a result, the synthetic data do not suffer from poor utility in downstream prediction tasks. TVineSynth introduces a targeted bias into the vine copula model. Combined with the specific tree structure of the vine, this bias causes the model to zero out privacy-leaking dependencies while relying on those that are beneficial for utility. We theoretically justify how the construction of TVineSynth ensures privacy. Compared with competitor models, with and without DP, TVineSynth achieves a superior privacy-utility balance.
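
To make the idea of truncation concrete, the following is a minimal sketch on a three-variable Gaussian D-vine with order 1-2-3. It is not the TVineSynth algorithm: the function names, the choice of Gaussian pair copulas, and the parameter values are all assumptions made for illustration. In a Gaussian vine, the pair copula in the second tree is parameterised by the partial correlation of variables 1 and 3 given 2. Truncating after the first tree sets that parameter to zero, which removes the conditional dependence while keeping the two first-tree dependencies.

```python
import numpy as np


def gaussian_dvine_corr(rho12, rho23, rho13_given_2):
    """Correlation matrix implied by a 3-variable Gaussian D-vine 1-2-3.

    Hypothetical helper for illustration only. Tree 1 holds rho12 and rho23;
    tree 2 holds the partial correlation of (1, 3) given 2.
    """
    # Standard conversion from a partial correlation to a plain correlation.
    rho13 = rho12 * rho23 + rho13_given_2 * np.sqrt(
        (1.0 - rho12**2) * (1.0 - rho23**2)
    )
    return np.array(
        [
            [1.0, rho12, rho13],
            [rho12, 1.0, rho23],
            [rho13, rho23, 1.0],
        ]
    )


def partial_corr_13_given_2(samples):
    """Empirical partial correlation of columns 0 and 2 given column 1."""
    r = np.corrcoef(samples, rowvar=False)
    num = r[0, 2] - r[0, 1] * r[1, 2]
    den = np.sqrt((1.0 - r[0, 1] ** 2) * (1.0 - r[1, 2] ** 2))
    return num / den


def sample_gaussian(corr, n, seed=0):
    """Draw n samples from a zero-mean Gaussian with the given correlation."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n)


# Assumed illustrative parameters, not taken from the seminar.
full = gaussian_dvine_corr(0.7, 0.6, 0.5)       # both trees kept
truncated = gaussian_dvine_corr(0.7, 0.6, 0.0)  # truncated after tree 1

x_full = sample_gaussian(full, 20000, seed=1)
x_trunc = sample_gaussian(truncated, 20000, seed=1)

print("partial corr (full):     ", round(partial_corr_13_given_2(x_full), 3))
print("partial corr (truncated):", round(partial_corr_13_given_2(x_trunc), 3))
```

In the truncated model, the dependence between variables 1 and 3 is carried entirely through variable 2. Which variables end up in the retained trees therefore determines which dependencies survive, and this is the kind of lever the abstract describes for trading privacy against utility.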