Abstract
In recent years, more and more synthetic data generators (SDGs) based on various modeling strategies have been implemented as Python libraries or R packages. With this proliferation of ready-made SDGs comes a widely held perception that generating synthetic data is easy. We show that generating synthetic data is a complicated process that requires one to understand both the original dataset and the synthetic data generator. We make two contributions to the literature in this area. First, we show that pre-processing or cleaning the data is just as important as tuning the SDG for creating synthetic data with high levels of utility. Second, we illustrate that it is critical to understand the methodological details of the SDG, both to be aware of potential pitfalls and to know for which types of analysis tasks one can expect high levels of analytical validity.
Notes
- 1.
Throughout the paper, when we refer to data, we are referring to classical microdata (i.e., one observation per individual unit), as opposed to summary tables or images.
- 2.
Note that there are different philosophies about what counts as the original data and how much pre-processing (e.g., dealing with missing values or outliers) one should apply before data synthesis, depending on the synthesis goal (replacing the original data vs. serving as a tool for preparing to work with the original data in a safe environment). In Sect. 3 we describe the data and any pre-processing steps in detail.
- 3.
- 4.
- 5.
Early versions of CTGAN could not be used on data with missing values (https://github.com/sdv-dev/CTGAN/issues/39).
- 6.
By default, CART models are used for synthesis, with a complexity parameter of 0.001 (smaller values grow larger trees) and minbucket = 5 (the minimum number of observations in any terminal node).
- 7.
The record with a BMI of 450 has height (cm) = 149 and weight (kg) = NA. If we calculate weight from BMI and height (weight = BMI × height², with height in meters), then weight equals roughly 999 kg, i.e., about one metric ton.
- 8.
We note that the DataSynthesizer paper [12] states, “when invoked in correlated attribute mode, DataDescriber samples attribute values in appropriate order from the Bayesian network.” However, in the code, data appear to be created by uniform sampling within a bin (https://github.com/DataResponsibly/DataSynthesizer/blob/90722857e7f6ed736aaa25068ecf9e77f34f896a/DataSynthesizer/datatypes/AbstractAttribute.py#L125). This illustrates the challenge of understanding the methodological details of a given SDG.
- 9.
In terms of computing power, SDGs were run on a 2022 MacBook Air with 16 GB of RAM and an M2 chip with an 8-core CPU, 8-core GPU, and a 16-core Neural Engine. All SDGs were run one at a time to avoid resource contention from parallelization.
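The arithmetic in Footnote 7 can be checked directly. A minimal sketch (the variable names are illustrative, not from any library):

```python
# Footnote 7 arithmetic: BMI = weight (kg) / height (m)^2,
# so weight = BMI * height^2.
bmi = 450
height_m = 149 / 100  # height of 149 cm converted to meters
weight_kg = bmi * height_m ** 2
print(round(weight_kg))  # 999, i.e. roughly one metric ton
```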
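The uniform-within-bin sampling described in Footnote 8 can be sketched as follows. This is an illustrative reconstruction of the idea, not DataSynthesizer's actual implementation; `sample_from_bins` and its arguments are hypothetical names:

```python
import random

def sample_from_bins(bin_edges, bin_probs, rng=random):
    """Draw one value by first sampling a histogram bin according to its
    probability mass, then sampling uniformly within that bin."""
    i = rng.choices(range(len(bin_probs)), weights=bin_probs)[0]
    return rng.uniform(bin_edges[i], bin_edges[i + 1])
```

Sampling uniformly within a bin preserves the marginal histogram but discards any within-bin structure, which is the subtlety Footnote 8 points at.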
References
Dankar, F.K., Ibrahim, M.: Fake it till you make it: guidelines for effective synthetic data generation. Appl. Sci. 11(5), 2158 (2021)
Drechsler, J., Haensch, A.C.: 30 years of synthetic data. arXiv preprint arXiv:2304.02107 (2023)
Drechsler, J., Reiter, J.: Disclosure risk and data utility for partially synthetic data: an empirical study using the German IAB establishment survey. J. Official Stat. 25(4), 589–603 (2009)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Jordon, J., et al.: Synthetic data – what, why and how? (2022)
Liew, C.K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM Trans. Database Syst. (TODS) 10(3), 395–411 (1985)
Little, C., Elliot, M., Allmendinger, R.: Comparing the utility and disclosure risk of synthetic data with samples of microdata. In: Domingo-Ferrer, J., Laurent, M. (eds.) PSD 2022. LNCS, vol. 13463, pp. 234–249. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13945-1_17
Little, R.J., et al.: Statistical analysis of masked data. J. Official Stat. 9(2), 407–426 (1993)
Nowok, B., Raab, G.M., Dibben, C.: synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74, 1–26 (2016)
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. arXiv preprint arXiv:1806.03384 (2018)
Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399–410 (2016). https://doi.org/10.1109/DSAA.2016.49
Ping, H., Stoyanovich, J., Howe, B.: Datasynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 1–5 (2017)
Raab, G.M., Nowok, B., Dibben, C.: Guidelines for producing useful synthetic data. arXiv preprint arXiv:1712.04078 (2017)
Rubin, D.B.: Statistical disclosure limitation. J. Official Stat. 9(2), 461–468 (1993)
Snoke, J., Raab, G.M., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. R. Stat. Soc. Ser. A Stat. Soc. 181(3), 663–688 (2018)
Woo, M.J., Reiter, J.P., Oganian, A., Karr, A.F.: Global measures of data utility for microdata masked for disclosure limitation. J. Priv. Confidentiality 1(1) (2009)
Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems (2019)
Young, J., Graham, P., Penny, R.: Using Bayesian networks to create synthetic data. J. Official Stat. 25(4), 549–567 (2009)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. (TODS) 42(4), 1–41 (2017)
Acknowledgments
This work was supported by a grant from the German Federal Ministry of Education and Research (grant number 16KISA096) with funding from the European Union-NextGenerationEU.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
A Appendix: Utility Measures
The propensity score mean squared error (pMSE) is a utility measure that estimates how well one can discriminate between the original and synthetic data based on a classifier [15, 16]; it is implemented in the R package synthpop [9]. This is sometimes called a ‘broad’ [15] or ‘general’ [3] measure of utility, or a measure of ‘statistical fidelity’ [5]. The main steps are to append or stack the original and the synthetic data, add an indicator (1/0) to distinguish between the two, and use a classifier to estimate the propensity of each record in the combined dataset being ‘assigned’ to the synthetic data. The pMSE is the mean squared error of these estimated propensities: \(pMSE = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - c\right)^2\),
where N is the number of records in the combined dataset, \(\hat{p}_i\) is the estimated propensity score for record i, and c is the proportion of data in the merged dataset that is synthetic (in many cases \(c=0.5\)). The pMSE can be estimated using all the variables in the dataset, but it can also be computed using subsets of the variables, e.g., all pairwise combinations of variables to evaluate specifically how well the distribution of these variables is preserved. The smaller the pMSE, the higher the analytical validity of the synthetic data.
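The computation of the pMSE from estimated propensities can be sketched in Python. This is a minimal illustration that assumes the per-record propensities have already been produced by some classifier (the classifier itself, e.g. logistic regression or CART, is left out):

```python
def pmse(propensities, c):
    """Mean squared error of estimated propensity scores around c,
    the share of synthetic records in the combined dataset."""
    n = len(propensities)
    return sum((p - c) ** 2 for p in propensities) / n

# A classifier that cannot tell original and synthetic records apart
# predicts p_i = c for every record, so the pMSE is 0.
print(pmse([0.5, 0.5, 0.5, 0.5], c=0.5))  # 0.0

# A perfect discriminator (p_i = 1 for synthetic, 0 for original)
# attains the maximum value c * (1 - c), i.e. 0.25 for c = 0.5.
print(pmse([1, 1, 0, 0], c=0.5))  # 0.25
```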
Computational efficiency is the run time (in seconds) required to create one synthetic dataset with a given SDG (see Footnote 9). This is sometimes referred to as ‘efficiency’ [5] or ‘output scalability’ [19]. The basic idea is that the algorithms used by SDGs can suffer from the curse of dimensionality.
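A generic way to record this measure is a simple timing wrapper; `fit_and_sample` here is a hypothetical callable wrapping a given SDG's fit-and-generate step, not part of any library discussed above:

```python
import time

def run_time_seconds(fit_and_sample):
    """Wall-clock time (in seconds) needed to create one synthetic
    dataset, i.e. one full fit-and-generate run of an SDG."""
    start = time.perf_counter()
    fit_and_sample()
    return time.perf_counter() - start
```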
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Latner, J., Neunhoeffer, M., Drechsler, J. (2024). Generating Synthetic Data is Complicated: Know Your Data and Know Your Generator. In: Domingo-Ferrer, J., Önen, M. (eds) Privacy in Statistical Databases. PSD 2024. Lecture Notes in Computer Science, vol 14915. Springer, Cham. https://doi.org/10.1007/978-3-031-69651-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69650-3
Online ISBN: 978-3-031-69651-0
eBook Packages: Computer Science (R0)