Abstract
Anonymization is crucial for sharing personal data in a privacy-aware manner, yet it is a complex task that requires striking a trade-off between the robustness of anonymization (i.e., the privacy level provided) and the quality of the analysis that can be expected from the anonymized data (i.e., the resulting utility). Synthetic data has emerged as a promising solution to overcome the limits of classical anonymization methods while preserving statistical properties similar to those of the original data. Avatar-based approaches are a specific type of synthetic data generation that relies on local stochastic simulation modeling to generate an avatar for each original record. While these approaches have been used in healthcare, their attack surface is neither well documented nor well understood. In this paper, we provide an extensive assessment of such approaches and compare them against other data synthesis methods. We also propose an improvement based on conditional sampling in the latent space, which allows synthetic data to be generated on demand (i.e., of arbitrary size). Our empirical analysis shows that avatar-generated data are subject to the same utility and privacy trade-off as other data synthesis methods, with a higher privacy risk for edge data, which correspond to the records that have the fewest alter egos in the original data.
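To make the idea of generating one avatar per original record more concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that each avatar is a random convex combination of the record's k nearest neighbours, computed directly on numeric features. The function name `generate_avatars`, the parameter `k`, and the Dirichlet weighting are illustrative assumptions; the abstract states only that the approach relies on local stochastic simulation and that the proposed improvement samples in a latent space.

```python
import numpy as np


def generate_avatars(X, k=5, seed=0):
    """Return one synthetic 'avatar' per row of X.

    Assumption (not from the paper): each avatar is a random convex
    combination of the k nearest neighbours of the original record,
    using Euclidean distance on the raw numeric features.
    Requires k <= number of rows - 1.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)

    # Exclude each record from its own neighbourhood.
    np.fill_diagonal(d2, np.inf)

    # Indices of the k nearest neighbours of each record, shape (n, k).
    nn = np.argsort(d2, axis=1)[:, :k]

    # Random non-negative weights summing to 1 for each record.
    w = rng.dirichlet(np.ones(k), size=n)

    # Weighted average of the neighbours, shape (n, d).
    return np.einsum("ij,ijd->id", w, X[nn])
```

In this simplified setting, a record that lies far from all others (an edge record in the sense of the abstract) still receives an avatar built from its few, distant neighbours, which gives some intuition for why such records may deserve separate scrutiny in a privacy assessment.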
Acknowledgment
We would like to thank Octopize for access to their API. This work has been supported by the ANR 22-PECY-0002 IPOP (Interdisciplinary Project on Privacy) project of the Cybersecurity PEPR, the Trusty-IA project supported by the Auvergne Rhône-Alpes region, and the Canada Research Chair program through a Discovery Grant from the NSERC and the DEEL Project CRDPJ 537462-18 funded by the NSERC and the CRIAQ.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lebrun, T., Béziaud, L., Allard, T., Boutet, A., Gambs, S., Maouche, M. (2025). Synthetic Data: Generate Avatar Data on Demand. In: Barhamgi, M., Wang, H., Wang, X. (eds) Web Information Systems Engineering – WISE 2024. WISE 2024. Lecture Notes in Computer Science, vol 15440. Springer, Singapore. https://doi.org/10.1007/978-981-96-0576-7_15
DOI: https://doi.org/10.1007/978-981-96-0576-7_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0575-0
Online ISBN: 978-981-96-0576-7
eBook Packages: Computer Science, Computer Science (R0)