Synthetic Data: Generate Avatar Data on Demand

  • Conference paper
Web Information Systems Engineering – WISE 2024 (WISE 2024)

Abstract

Anonymization is crucial for sharing personal data in a privacy-aware manner, yet it is a complex task that requires a trade-off between the robustness of anonymization (i.e., the privacy level provided) and the quality of the analysis that can be expected from the anonymized data (i.e., the resulting utility). Synthetic data has emerged as a promising solution to overcome the limits of classical anonymization methods while preserving statistical properties similar to those of the original data. Avatar-based approaches are a specific type of synthetic data generation that relies on local stochastic simulation modeling to generate an avatar for each original record. While these approaches have been used in healthcare, their attack surface is not well documented or understood. In this paper, we provide an extensive assessment of such approaches and compare them against other data synthesis methods. We also propose an improvement based on conditional sampling in the latent space, which allows synthetic data to be generated on demand (i.e., of arbitrary size). Our empirical analysis shows that avatar-generated data are subject to the same utility and privacy trade-off as other data synthesis methods, with a higher privacy risk for edge data, i.e., records that have the fewest alter egos in the original data.
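
To make the idea of generating one avatar per original record more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that records have already been projected into a numeric latent space, and that an avatar can be approximated as a randomly weighted average of a record's k nearest neighbours in that space. The function name `generate_avatars`, the parameters `k` and `seed`, and the exponential weighting scheme are all illustrative assumptions; the conditional sampling improvement proposed in the paper is not covered here.

```python
import math
import random


def generate_avatars(latent, k=3, seed=0):
    """Return one synthetic point per input point (illustrative sketch only).

    `latent` is a list of equal-length numeric lists, assumed to be latent
    coordinates of the original records. Requires k < len(latent).
    """
    rng = random.Random(seed)
    avatars = []
    for i, point in enumerate(latent):
        # Rank every other record by Euclidean distance to the current one.
        others = [j for j in range(len(latent)) if j != i]
        others.sort(key=lambda j: math.dist(point, latent[j]))
        neighbours = others[:k]

        # Draw random positive weights and normalise them to sum to 1
        # (assumed weighting scheme, chosen for simplicity).
        raw = [rng.expovariate(1.0) for _ in neighbours]
        total = sum(raw)
        weights = [r / total for r in raw]

        # The avatar is the weighted average of the selected neighbours.
        avatar = [
            sum(w * latent[j][d] for w, j in zip(weights, neighbours))
            for d in range(len(point))
        ]
        avatars.append(avatar)
    return avatars
```

Calling `generate_avatars(points, k=3, seed=1)` on a list of latent points returns a list of the same length, with each avatar lying inside the convex hull of the neighbours it was built from. In a full pipeline the avatars would then be mapped back from the latent space to the original attributes, a step this sketch omits.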

Acknowledgment

We would like to thank Octopize for access to their API. This work has been supported by the ANR 22-PECY-0002 IPOP (Interdisciplinary Project on Privacy) project of the Cybersecurity PEPR, the Trusty-IA project supported by the Auvergne Rhône-Alpes region, and the Canada Research Chair program through a Discovery Grant from the NSERC and the DEEL Project CRDPJ 537462-18 funded by the NSERC and the CRIAQ.

Corresponding author

Correspondence to Antoine Boutet.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lebrun, T., Béziaud, L., Allard, T., Boutet, A., Gambs, S., Maouche, M. (2025). Synthetic Data: Generate Avatar Data on Demand. In: Barhamgi, M., Wang, H., Wang, X. (eds) Web Information Systems Engineering – WISE 2024. WISE 2024. Lecture Notes in Computer Science, vol 15440. Springer, Singapore. https://doi.org/10.1007/978-981-96-0576-7_15

  • DOI: https://doi.org/10.1007/978-981-96-0576-7_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0575-0

  • Online ISBN: 978-981-96-0576-7

  • eBook Packages: Computer Science, Computer Science (R0)