A Case Study Exploring Data Synthesis Strategies on Tabular vs. Aggregated Data Sources for Official Statistics

  • Conference paper
Privacy in Statistical Databases (PSD 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14915)

Abstract

In this paper, we investigate different approaches for generating synthetic microdata from open-source aggregated data. Specifically, we focus on macro-to-micro data synthesis. We explore the potential of the Gaussian copula framework to estimate joint distributions from aggregated data. Our generated synthetic data is intended for educational and software testing use cases. We propose three scenarios to achieve realistic and high-quality synthetic microdata: (1) zero knowledge, (2) internal knowledge, and (3) external knowledge. The three scenarios differ in how much is known about the underlying properties of the real microdata, i.e., the standard deviation and covariates. Our evaluation includes matching tests to assess the privacy of the synthetic datasets. Our results indicate that macro-to-micro synthesis achieves better privacy preservation than other methods, demonstrating both the potential and the challenges of synthetic data generation in maintaining data privacy while providing useful data for analysis.

The views expressed in this paper are those of the authors and do not necessarily reflect the policy of Statistics Netherlands.

Notes

  1. https://opendata.cbs.nl/statline/#/CBS/nl/.

  2. https://synthpop.org.uk/.

  3. https://sdv.dev/.

  4. The Nomenclature of Economic Activities (for short NACE).

  5. https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/newrowsynthesis.

Acknowledgments

We are grateful to Guus van de Burgt for providing the data and invaluable insights, and to Arjen de Boer for his effective project management and extensive knowledge, both of which were crucial to the success of this project.

This work was partly supported by the AI, Media, and Democracy Lab, NWA.1332.20.009.

Author information

Corresponding authors

Correspondence to Mohamed Aghaddar, Liu Nuo Su, Manel Slokom, Lucas Barnhoorn or Peter-Paul de Wolf.

A Appendix

A.1 Dataset Rules

We explain the types of relationships present in the dataset and the rules that the synthesizer must account for. We divide the rules into five categories: non-negative rules, summation rules, equality rules, inequality rules, and if-then rules. A non-negative rule requires a variable to be greater than or equal to zero (e.g., the number of workers and the various expenses). A summation rule requires the sum of certain variables to equal another variable by definition; for example, the total expense variable is the sum of the individual expenses, and the net revenue is a summation of sales and costs. An equality rule requires specific variables to be equal to another variable (\(=\)), which is the case for variables that are repeated in the survey. An inequality rule requires some variables to be less than or equal to (\(\le\)) or greater than or equal to (\(\ge\)) other variables; for example, the number of workers must be greater than or equal to the number of full-time equivalent workers. An if-then rule imposes a rule of one of the previous types, but only when a certain condition is satisfied (some rules apply only if a variable takes a certain value).
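To make the five rule categories concrete, the following minimal sketch checks a single record against one rule of each type. It is an illustration only, not the validation logic used in the paper: the field names (`workers`, `workers_repeat`, `fte`, `wage_costs`, `other_costs`, `total_costs`) and the specific if-then condition are hypothetical and do not come from the survey.

```python
import math


def check_record(rec, tol=1e-9):
    """Return the rule categories that a record violates.

    All field names are hypothetical placeholders for survey variables.
    """
    violations = []

    # Non-negative rule: counts and expenses must be >= 0.
    numeric = ("workers", "fte", "wage_costs", "other_costs", "total_costs")
    if any(rec[name] < 0 for name in numeric):
        violations.append("non-negative")

    # Summation rule: the total must equal the sum of its components.
    parts = rec["wage_costs"] + rec["other_costs"]
    if not math.isclose(parts, rec["total_costs"], abs_tol=tol):
        violations.append("summation")

    # Equality rule: a variable repeated in the survey must match itself.
    if rec["workers"] != rec["workers_repeat"]:
        violations.append("equality")

    # Inequality rule: number of workers >= full-time equivalent workers.
    if rec["workers"] < rec["fte"]:
        violations.append("inequality")

    # If-then rule (hypothetical condition): no workers implies no wage costs.
    if rec["workers"] == 0 and rec["wage_costs"] != 0:
        violations.append("if-then")

    return violations


record = {
    "workers": 10, "workers_repeat": 10, "fte": 8.5,
    "wage_costs": 100.0, "other_costs": 50.0, "total_costs": 150.0,
}
print(check_record(record))  # an empty list means every rule holds
```

A checker of this kind could be run over each synthetic record to count how often each category is violated; how the synthesizers in the paper actually enforce the rules is not specified in this appendix.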

A.2 Extra Analytical Validity Results

For micro-to-micro data synthesis, we measure the KS-Complement metric, which assesses the similarity between a real variable and its synthetic counterpart based on their marginal distributions. Our results are provided in Tables 3 and 4 (see also Fig. 5).
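As a rough illustration of the metric, the sketch below computes a KS-Complement under the usual definition: one minus the two-sample Kolmogorov-Smirnov statistic, i.e., one minus the largest gap between the two empirical cumulative distribution functions. This is an assumption about the definition rather than the authors' code, and the function name `ks_complement` and the toy samples are made up for the example.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - D, where D is the two-sample KS statistic.

    A score of 1.0 means the empirical marginals coincide;
    0.0 means they do not overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)

    # The largest ECDF gap is always attained at an observed value,
    # so it is enough to evaluate both ECDFs at every data point.
    d = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        f_real = bisect_right(real_sorted, x) / n
        f_synth = bisect_right(synth_sorted, x) / m
        d = max(d, abs(f_real - f_synth))
    return 1.0 - d


# Toy samples: the synthetic values are shifted upward by two units.
print(ks_complement([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

Because the score depends only on the marginal distribution of one variable at a time, it says nothing about whether relationships between variables are preserved.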

Table 3. KS-Complements macro-to-micro scaled data
Table 4. KS-Complements macro-to-micro

Fig. 5. Univariate distribution of the gk_SBS variable

A.3 Residuals from the Regression Models

We analyze the performance of the synthetic data by examining the residuals of the linear regression model. Figure 6 shows the residual plots for the models trained on the real data, CART synthetic data, and CTGAN synthetic data. The residuals represent the differences between the observed and predicted values.

In the residual plot for the model trained on the real data (Fig. 6a), the residuals are evenly distributed around zero, indicating a good fit of the model. The residual plot for the CART synthetic data (Fig. 6b) shows a similar distribution, though with a slightly larger variance, suggesting that the CART synthetic data maintains the relationships in the real data fairly well. The residual plot for the CTGAN synthetic data (Fig. 6c) reveals a different pattern: the residuals are more spread out and exhibit a clear upward trend, indicating that the CTGAN model has a downward bias and struggles to capture the variability in the real data. This pattern suggests that the CTGAN synthetic data may not be as effective in preserving the underlying relationships present in the real data. Overall, these residual plots highlight that the CART synthetic data provides a better approximation of the real data than the CTGAN synthetic data, in line with the findings from the prediction accuracy analysis.
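The kind of comparison described above can be sketched as follows. This is a simplified illustration, not the paper's regression setup: it assumes a single predictor, assumes that each model is evaluated on the real data, and uses made-up toy numbers. The helper names `fit_line` and `residuals` are hypothetical.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(model, x, y):
    """Observed minus predicted values for a fitted (slope, intercept)."""
    slope, intercept = model
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Toy data (not from the paper): the synthetic set understates the slope.
x_real, y_real = [1, 2, 3, 4], [3, 5, 7, 9]
x_synth, y_synth = [1, 2, 3, 4], [2, 3, 4, 5]

res_real = residuals(fit_line(x_real, y_real), x_real, y_real)
res_synth = residuals(fit_line(x_synth, y_synth), x_real, y_real)

# Residuals centred on zero suggest a good fit; residuals that grow with x
# mean the model increasingly under-predicts the real values.
print(fmean(res_real), fmean(res_synth))
```

In this toy case the model fitted on the synthetic data under-predicts more strongly as the predictor grows, which is one way an upward trend in the residuals can go together with a downward bias in the predictions.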

Fig. 6. Residual plot, with the prediction model trained on the real and micro-to-micro synthetic datasets

The residual plots for the synthetic data generated using the macro-to-micro approaches are shown in Fig. 7. In Scenario 1 (Fig. 7a), where no knowledge of the real data was used, the residuals are widely scattered and show a large deviation from zero. This indicates poor accuracy in the synthetic data, as the model fails to capture the true relationships present in the real data. Scenario 2 (Fig. 7b), which incorporates internal knowledge, shows an improvement in the residuals compared to Scenario 1. The residuals are closer to zero, suggesting that the synthetic data better approximates the real data, although there are still noticeable deviations. Scenario 3 (Fig. 7c), which leverages external knowledge, exhibits the most favorable residuals among the three scenarios. The residuals are tightly clustered around zero, indicating a higher accuracy in the synthetic data. This demonstrates that utilizing external insights significantly enhances the quality of the synthetic data generation process.

Fig. 7. Residual plot on the synthetic data generated using macro-to-micro scenarios with scaled data.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Aghaddar, M., Su, L.N., Slokom, M., Barnhoorn, L., de Wolf, PP. (2024). A Case Study Exploring Data Synthesis Strategies on Tabular vs. Aggregated Data Sources for Official Statistics. In: Domingo-Ferrer, J., Önen, M. (eds) Privacy in Statistical Databases. PSD 2024. Lecture Notes in Computer Science, vol 14915. Springer, Cham. https://doi.org/10.1007/978-3-031-69651-0_28

  • DOI: https://doi.org/10.1007/978-3-031-69651-0_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-69650-3

  • Online ISBN: 978-3-031-69651-0

  • eBook Packages: Computer Science, Computer Science (R0)
