Abstract
In this paper, we investigate different approaches for generating synthetic microdata from open-source aggregated data. Specifically, we focus on macro-to-micro data synthesis and explore the potential of the Gaussian copula framework to estimate joint distributions from aggregated data. The generated synthetic data is intended for educational and software testing use cases. We propose three scenarios for achieving realistic, high-quality synthetic microdata: (1) zero knowledge, (2) internal knowledge, and (3) external knowledge. The scenarios differ in how much is known about the underlying properties of the real microdata, i.e., the standard deviation and covariates. Our evaluation includes matching tests to assess the privacy of the synthetic datasets. Our results indicate that macro-to-micro synthesis preserves privacy better than other methods, demonstrating both the potential and the challenges of synthetic data generation in maintaining data privacy while providing useful data for analysis.
The views expressed in this paper are those of the authors and do not necessarily reflect the policy of Statistics Netherlands.
Notes
- 4. The Nomenclature of Economic Activities (NACE for short).
Acknowledgments
We are grateful to Guus van de Burgt for providing the data and invaluable insights, and to Arjen de Boer for his effective project management and extensive knowledge, both of which were crucial to the success of this project.
This work was partly supported by the AI, Media, and Democracy Lab, NWA.1332.20.009.
A Appendix
A.1 Dataset Rules
We explain the types of relationships present in the dataset and the rules the synthesizer must account for. We divide the rules into five categories: non-negative rules, summation rules, equality rules, inequality rules, and if-then rules. A non-negative rule requires that a variable be greater than or equal to zero (e.g., the number of workers and the various expenses). A summation rule requires that the sum of certain variables equal another variable by definition; for example, the total expense variable is the sum of the individual expenses, and net revenue is a summation of sales and costs. An equality rule requires that specific variables be equal to another variable (\(=\)), as is the case for variables that are repeated in the survey. An inequality rule requires that some variables be less than or equal to (\(\le \)) or greater than or equal to (\(\ge \)) other variables; for example, the number of workers must be greater than or equal to the number of full-time equivalent workers. An if-then rule imposes one of the previous rule types only when a certain condition is satisfied, since some rules apply only if a variable takes a certain value.
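The five rule categories can be illustrated with a small validity check on a single record. This is only a sketch: the variable names (`workers`, `fte_workers`, `expense_total`, `has_export`, and so on) and the specific if-then condition are hypothetical and are not taken from the survey used in the paper.

```python
from math import isclose

# Hypothetical record; all variable names and values are illustrative only.
record = {
    "workers": 12,
    "workers_repeat": 12,      # same question asked twice in the survey
    "fte_workers": 9.5,
    "expense_staff": 400.0,
    "expense_other": 150.0,
    "expense_total": 550.0,
    "has_export": 0,
    "export_revenue": 0.0,
}

def check_rules(r, tol=1e-6):
    """Return the names of the rule categories that record r violates."""
    rules = {
        # Non-negative rule: every variable must be >= 0.
        "non_negative": all(v >= 0 for v in r.values()),
        # Summation rule: the parts must add up to the total.
        "summation": isclose(
            r["expense_staff"] + r["expense_other"],
            r["expense_total"],
            abs_tol=tol,
        ),
        # Equality rule: repeated variables must agree.
        "equality": r["workers"] == r["workers_repeat"],
        # Inequality rule: head count is at least the full-time equivalent.
        "inequality": r["workers"] >= r["fte_workers"],
        # If-then rule (assumed): without exports, export revenue must be 0.
        "if_then": r["has_export"] != 0 or r["export_revenue"] == 0,
    }
    return [name for name, ok in rules.items() if not ok]

violations = check_rules(record)
```

A synthesizer could apply such a check to every generated record and either reject or repair the records that return a non-empty list.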
A.2 Extra Analytical Validity Results
For micro-to-micro data synthesis, we measure the KS-Complement metric, which assesses the similarity between a real variable and its synthetic counterpart based on their marginal distributions. Our results are provided in Tables 3 and 4 and in Fig. 5.
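As a minimal sketch, the KS-Complement is commonly defined as one minus the two-sample Kolmogorov-Smirnov statistic, i.e., one minus the largest gap between the two empirical distribution functions. The code below assumes that definition and a single numerical variable; the function name `ks_complement` is ours and may not match the implementation used for the reported results.

```python
from bisect import bisect_right

def ks_complement(real, synthetic):
    """1 - two-sample KS statistic; 1.0 means identical empirical marginals."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    ks_stat = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to compare them at every value seen in either sample.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        ks_stat = max(ks_stat, abs(cdf_real - cdf_synth))
    return 1.0 - ks_stat

score = ks_complement([1, 2, 3, 4], [3, 4, 5, 6])
```

Scores close to 1 indicate that the synthetic marginal follows the real one closely, while scores close to 0 indicate that the two samples barely overlap.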
A.3 Residuals from the Regression Models
We analyze the performance of the synthetic data by examining the residuals of the linear regression model. Figure 6 shows the residual plots for the models trained on the real data, CART synthetic data, and CTGAN synthetic data. The residuals represent the differences between the observed and predicted values.
In the residual plot for the model trained on the real data (Fig. 6a), the residuals are evenly distributed around zero, indicating a good fit of the model. The residual plot for the CART synthetic data (Fig. 6b) shows a similar distribution, though with a slightly larger variance, suggesting that the CART synthetic data maintains the relationships in the real data fairly well. The residual plot for the CTGAN synthetic data (Fig. 6c) reveals a different pattern: the residuals are more spread out and exhibit a clear upward trend, indicating that the CTGAN model has a downward bias and struggles to capture the variability in the real data. This pattern suggests that the CTGAN synthetic data may not be as effective in preserving the underlying relationships present in the real data. Overall, these residual plots show that the CART synthetic data provides a better approximation of the real data than the CTGAN synthetic data, in line with the findings from the prediction accuracy analysis.
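To make the reading of these plots concrete, the sketch below computes residuals (observed minus predicted) for a one-predictor least-squares line. It rests on assumptions that the text does not confirm: the model is fitted on one dataset and its residuals are evaluated on the real observations, and a single predictor is used instead of the full regression model. The toy numbers are invented for illustration only.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residuals(model, x, y):
    """Observed minus predicted values for the given model."""
    slope, intercept = model
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Toy data (hypothetical): the synthetic sample understates the real slope.
real_x, real_y = [1, 2, 3, 4], [2, 4, 6, 8]
synth_x, synth_y = [1, 2, 3, 4], [1, 2, 3, 4]

res_real = residuals(fit_line(real_x, real_y), real_x, real_y)
res_synth = residuals(fit_line(synth_x, synth_y), real_x, real_y)
```

In this toy case the model fitted on the real data leaves residuals of zero, whereas the model fitted on the synthetic data under-predicts by a growing amount, which is the kind of upward-trending residual pattern associated with a downward bias.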
The residual plots for the synthetic data generated using the macro-to-micro approaches are shown in Fig. 7. In Scenario 1 (Fig. 7a), where no knowledge of the real data was used, the residuals are widely scattered and show a large deviation from zero. This indicates poor accuracy in the synthetic data, as the model fails to capture the true relationships present in the real data. Scenario 2 (Fig. 7b), which incorporates internal knowledge, shows an improvement in the residuals compared to Scenario 1. The residuals are closer to zero, suggesting that the synthetic data better approximates the real data, although there are still noticeable deviations. Scenario 3 (Fig. 7c), which leverages external knowledge, exhibits the most favorable residuals among the three scenarios. The residuals are tightly clustered around zero, indicating a higher accuracy in the synthetic data. This demonstrates that utilizing external insights significantly enhances the quality of the synthetic data generation process.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Aghaddar, M., Su, L.N., Slokom, M., Barnhoorn, L., de Wolf, PP. (2024). A Case Study Exploring Data Synthesis Strategies on Tabular vs. Aggregated Data Sources for Official Statistics. In: Domingo-Ferrer, J., Önen, M. (eds) Privacy in Statistical Databases. PSD 2024. Lecture Notes in Computer Science, vol 14915. Springer, Cham. https://doi.org/10.1007/978-3-031-69651-0_28
DOI: https://doi.org/10.1007/978-3-031-69651-0_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69650-3
Online ISBN: 978-3-031-69651-0
eBook Packages: Computer Science (R0)