Robust Temporal Graph Clustering for Group Record Linkage | SpringerLink
Skip to main content

Robust Temporal Graph Clustering for Group Record Linkage

  • Conference paper
  • First Online:
Advances in Knowledge Discovery and Data Mining (PAKDD 2019)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11440))

Included in the following conference series:

  • 2589 Accesses

Abstract

Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains need to be linked to allow advanced analytics. A popular type of data used in such a context are historical registries containing birth, death, and marriage certificates. Individually, such data sets however limit the types of studies that can be conducted. Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees are available it is possible to, for example, investigate how education, health, mobility, and employment influence the lives of people over two or even more generations. The linkage of historical records is challenging because of data quality issues and because often there are no ground truth data available. Unsupervised techniques need to be employed, which generally are based on similarity graphs generated by comparing individual records. In this paper we present a novel temporal clustering approach aimed at linking records of the same group (such as all births by the same mother) where temporal constraints (such as intervals between births) need to be enforced. We combine a connected component approach with an iterative merging step which considers temporal constraints to obtain accurate clustering results. Experiments on a real Scottish data set show the superiority of our approach over a previous clustering approach for record linkage.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 9380
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 11725
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Antonie, L., Inwood, K., Lizotte, D.J., Ross, J.A.: Tracking people over time in 19th century Canada for longitudinal analysis. Mach. Learn. 95, 129–146 (2014)

    Article  MathSciNet  Google Scholar 

  2. Aslam, J.A., Pelekhov, E., Rus, D.: The star clustering algorithm for static and dynamic information organization. J. Graph Algorithms Appl. 8, 95–129 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  3. Bailey, M., Cole, C., et al.: How well do automated methods perform in historical samples? Evidence from new ground truth. Technical report, NBER (2017)

    Google Scholar 

  4. Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M.: Population Reconstruction. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-319-19884-2

    Book  Google Scholar 

  5. Christen, P.: Data Matching. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31164-2

    Book  Google Scholar 

  6. Christen, V., Groß, A., Fisher, J., Wang, Q., Christen, P., Rahm, E.: Temporal group linkage and evolution analysis for census data. In: EDBT, Venice (2017)

    Google Scholar 

  7. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: ACM ICML, Pittsburgh, pp. 233–240 (2006)

    Google Scholar 

  8. Dibben, C., Williamson, L., Huang, Z.: Digitising Scotland (2012). http://gtr.rcuk.ac.uk/projects?ref=ES/K00574X/2

  9. Dillon, L.Y.: Integrating nineteenth-century Canadian and American census data sets. Comput. Hum. 30(5), 381–392 (1996)

    Article  Google Scholar 

  10. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

    Article  MATH  Google Scholar 

  11. Fu, Z., Christen, P., Zhou, J.: A graph matching method for historical census household linkage. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014. LNCS (LNAI), vol. 8443, pp. 485–496. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06608-0_40

    Chapter  Google Scholar 

  12. Hand, D., Christen, P.: A note on using the f-measure for evaluating record linkage algorithms. Stat. Comput. 28(3), 539–547 (2018)

    Article  MathSciNet  MATH  Google Scholar 

  13. Hassanzadeh, O., Chiang, F., Lee, H.C., Miller, R.J.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)

    Google Scholar 

  14. Kum, H.C., Krishnamurthy, A., Machanavajjhala, A., Ahalt, S.C.: Social genome: putting big data to work for population informatics. IEEE Comput. 47(1), 56–63 (2014)

    Article  Google Scholar 

  15. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)

    Book  Google Scholar 

  16. Nanayakkara, C., Christen, P., Ranbaduge, T.: Temporal graph-based clustering for historical record linkage. In: MLG, held at ACM SIGKDD, London (2018)

    Google Scholar 

  17. On, B.W., Koudas, N., Lee, D., Srivastava, D.: Group linkage. In: IEEE ICDE, Istanbul (2007)

    Google Scholar 

  18. Reid, A., Davies, R., Garrett, E.: Nineteenth-century Scottish demography from linked censuses and civil registers. History Comput. 14(1–2), 61–86 (2002)

    Article  Google Scholar 

  19. Ruggles, S., Fitch, C.A., Roberts, E.: Historical census record linkage. Ann. Rev. Sociol. 44(1), 19–37 (2018)

    Article  Google Scholar 

  20. Saeedi, A., Peukert, E., Rahm, E.: Comparative evaluation of distributed clustering schemes for multi-source entity resolution. In: Kirikova, M., Nørvåg, K., Papadopoulos, G.A. (eds.) ADBIS 2017. LNCS, vol. 10509, pp. 278–293. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66917-5_19

    Chapter  Google Scholar 

  21. Saeedi, A., Peukert, E., Rahm, E.: Using link features for entity clustering in knowledge graphs. In: Gangemi, A., et al. (eds.) ESWC 2018. LNCS, vol. 10843, pp. 576–592. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93417-4_37

    Chapter  Google Scholar 

Download references

Acknowledgements

This work was supported by ESRC grants ES/K00574X/2 Digitising Scotland and ES/L007487/1 ADRC-S. We like to thank Alice Reid of the University of Cambridge and her colleagues Ros Davies and Eilidh Garrett for their work on the Isle of Skye database, and their helpful advice on historical Scottish demography. This work was partially funded by the Australian Research Council under DP160101934.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Charini Nanayakkara .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Nanayakkara, C., Christen, P., Ranbaduge, T. (2019). Robust Temporal Graph Clustering for Group Record Linkage. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science(), vol 11440. Springer, Cham. https://doi.org/10.1007/978-3-030-16145-3_41

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-16145-3_41

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-16144-6

  • Online ISBN: 978-3-030-16145-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics