Abstract
In recent years, more and more synthetic data generators (SDGs) based on various modeling strategies have been implemented as Python libraries or R packages. With this proliferation of ready-made SDGs comes a widely held perception that generating synthetic data is easy. We show that generating synthetic data is a complicated process that requires one to understand both the original dataset and the synthetic data generator. We make two contributions to the literature in this area. First, we show that pre-processing or cleaning the data is just as important as tuning the SDG for creating synthetic data with high levels of utility. Second, we illustrate that it is critical to understand the methodological details of the SDG, both to be aware of potential pitfalls and to know for which types of analysis tasks one can expect high levels of analytical validity.
Notes
- 1.
Throughout the paper, when we refer to data, we are referring to classical microdata (i.e., one observation per individual unit), as opposed to summary tables or images.
- 2.
Note that there are different philosophies about what counts as the original data and how much pre-processing (e.g., dealing with missing values or outliers) one should apply before data synthesis, depending on the synthesis goal (replacing the original data vs. serving as a tool for preparing to work with the original data in a safe environment). In Sect. 3 we describe the data and any pre-processing steps in detail.
- 3.
- 4.
- 5.
Early versions of CTGAN could not be used on data with missing values (https://github.com/sdv-dev/CTGAN/issues/39).
- 6.
By default, CART models are used for synthesis, with a complexity parameter of 0.001 (smaller values grow larger trees) and minbucket = 5 (the minimum number of observations in any terminal node).
- 7.
The record with a BMI of 450 has height (cm) = 149 and weight (kg) = NA. If we calculate weight from BMI and height (weight = BMI × height², with height in meters), then weight equals roughly 999 kg, i.e., about one metric ton.
- 8.
We note that the DataSynthesizer paper [12] states, “when invoked in correlated attribute mode, DataDescriber samples attribute values in appropriate order from the Bayesian network.” However, in the code, data appear to be created by uniform sampling within a bin (https://github.com/DataResponsibly/DataSynthesizer/blob/90722857e7f6ed736aaa25068ecf9e77f34f896a/DataSynthesizer/datatypes/AbstractAttribute.py#L125). This illustrates the challenge of understanding the methodological details of a given SDG.
- 9.
In terms of computing power, SDGs were run on a 2022 MacBook Air with 16 GB of RAM and an M2 chip with an 8-core CPU, 8-core GPU, and a 16-core Neural Engine. All SDGs were run one at a time to avoid resource contention from parallelization.
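The arithmetic in Footnote 7 can be checked directly. A minimal sketch (the variable names are illustrative, not from any library):

```python
# Footnote 7 arithmetic: BMI = weight (kg) / height (m)^2,
# so weight = BMI * height^2.
bmi = 450
height_m = 149 / 100  # height of 149 cm converted to meters
weight_kg = bmi * height_m ** 2
print(round(weight_kg))  # 999, i.e. roughly one metric ton
```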
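The uniform-within-bin sampling described in Footnote 8 can be sketched as follows. This is an illustrative reconstruction of the idea, not DataSynthesizer's actual implementation; `sample_from_bins` and its arguments are hypothetical names:

```python
import random

def sample_from_bins(bin_edges, bin_probs, rng=random):
    """Draw one value by first sampling a histogram bin according to its
    probability mass, then sampling uniformly within that bin."""
    i = rng.choices(range(len(bin_probs)), weights=bin_probs)[0]
    return rng.uniform(bin_edges[i], bin_edges[i + 1])
```

Sampling uniformly within a bin preserves the marginal histogram but discards any within-bin structure, which is the subtlety Footnote 8 points at.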
References
Dankar, F.K., Ibrahim, M.: Fake it till you make it: guidelines for effective synthetic data generation. Appl. Sci. 11(5), 2158 (2021)
Drechsler, J., Haensch, A.C.: 30 years of synthetic data. arXiv preprint arXiv:2304.02107 (2023)
Drechsler, J., Reiter, J.: Disclosure risk and data utility for partially synthetic data: an empirical study using the German IAB establishment survey. J. Official Stat. 25(4), 589–603 (2009)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Jordon, J., et al.: Synthetic data – what, why and how? (2022)
Liew, C.K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM Trans. Database Syst. (TODS) 10(3), 395–411 (1985)
Little, C., Elliot, M., Allmendinger, R.: Comparing the utility and disclosure risk of synthetic data with samples of microdata. In: Domingo-Ferrer, J., Laurent, M. (eds.) PSD 2022. LNCS, vol. 13463, pp. 234–249. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13945-1_17
Little, R.J., et al.: Statistical analysis of masked data. J. Official Stat. 9(2), 407–426 (1993)
Nowok, B., Raab, G.M., Dibben, C.: synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74, 1–26 (2016)
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. arXiv preprint arXiv:1806.03384 (2018)
Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399–410 (2016). https://doi.org/10.1109/DSAA.2016.49
Ping, H., Stoyanovich, J., Howe, B.: Datasynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 1–5 (2017)
Raab, G.M., Nowok, B., Dibben, C.: Guidelines for producing useful synthetic data. arXiv preprint arXiv:1712.04078 (2017)
Rubin, D.B.: Statistical disclosure limitation. J. Official Stat. 9(2), 461–468 (1993)
Snoke, J., Raab, G.M., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. R. Stat. Soc. Ser. A Stat. Soc. 181(3), 663–688 (2018)
Woo, M.J., Reiter, J.P., Oganian, A., Karr, A.F.: Global measures of data utility for microdata masked for disclosure limitation. J. Priv. Confidentiality 1(1) (2009)
Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems (2019)
Young, J., Graham, P., Penny, R.: Using Bayesian networks to create synthetic data. J. Official Stat. 25(4), 549–567 (2009)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. (TODS) 42(4), 1–41 (2017)
Acknowledgments
This work was supported by a grant from the German Federal Ministry of Education and Research (grant number 16KISA096) with funding from the European Union-NextGenerationEU.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
A Appendix: Utility Measures
The propensity score mean squared error (pMSE) is a utility measure that estimates how well one can discriminate between the original and synthetic data based on a classifier [15, 16]; it is implemented in the R package synthpop [9]. This is sometimes called a ‘broad’ [15] or ‘general’ [3] measure of utility, or a measure of ‘statistical fidelity’ [5]. The main steps are to append or stack the original and the synthetic data, add an indicator (1/0) to distinguish between the two, and use a classifier to estimate the propensity of each record in the combined dataset being ‘assigned’ to the synthetic data. The pMSE is the mean squared error of these estimated propensities: \(pMSE = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - c\right)^2\),
where N is the number of records in the combined dataset, \(\hat{p}_i\) is the estimated propensity score for record i, and c is the proportion of data in the merged dataset that is synthetic (in many cases \(c=0.5\)). The pMSE can be estimated using all the variables in the dataset, but it can also be computed using subsets of the variables, e.g., all pairwise combinations of variables to evaluate specifically how well the distribution of these variables is preserved. The smaller the pMSE, the higher the analytical validity of the synthetic data.
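The computation of the pMSE from estimated propensities can be sketched in Python. This is a minimal illustration that assumes the per-record propensities have already been produced by some classifier (the classifier itself, e.g. logistic regression or CART, is left out):

```python
def pmse(propensities, c):
    """Mean squared error of estimated propensity scores around c,
    the share of synthetic records in the combined dataset."""
    n = len(propensities)
    return sum((p - c) ** 2 for p in propensities) / n

# A classifier that cannot tell original and synthetic records apart
# predicts p_i = c for every record, so the pMSE is 0.
print(pmse([0.5, 0.5, 0.5, 0.5], c=0.5))  # 0.0

# A perfect discriminator (p_i = 1 for synthetic, 0 for original)
# attains the maximum value c * (1 - c), i.e. 0.25 for c = 0.5.
print(pmse([1, 1, 0, 0], c=0.5))  # 0.25
```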
Computational efficiency is the run time (in seconds) required to create one synthetic dataset with a given SDG (see Footnote 9). This is sometimes referred to as ‘efficiency’ [5] or ‘output scalability’ [19]. The basic idea is that the algorithms used by SDGs can suffer from the curse of dimensionality.
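A generic way to record this measure is a simple timing wrapper; `fit_and_sample` here is a hypothetical callable wrapping a given SDG's fit-and-generate step, not part of any library discussed above:

```python
import time

def run_time_seconds(fit_and_sample):
    """Wall-clock time (in seconds) needed to create one synthetic
    dataset, i.e. one full fit-and-generate run of an SDG."""
    start = time.perf_counter()
    fit_and_sample()
    return time.perf_counter() - start
```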
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Latner, J., Neunhoeffer, M., Drechsler, J. (2024). Generating Synthetic Data is Complicated: Know Your Data and Know Your Generator. In: Domingo-Ferrer, J., Önen, M. (eds) Privacy in Statistical Databases. PSD 2024. Lecture Notes in Computer Science, vol 14915. Springer, Cham. https://doi.org/10.1007/978-3-031-69651-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69650-3
Online ISBN: 978-3-031-69651-0
eBook Packages: Computer Science (R0)