Entity Matching Across Multiple Heterogeneous Data Sources

Kong, Chao; Gao, Ming; Xu, Chen; Qian, Weining; Zhou, Aoying

doi:10.1007/978-3-319-32025-0_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Entity matching is the problem of identifying which entities in one data source refer to the same real-world entity as entities in other sources. Identifying entities across heterogeneous data sources is essential to applications such as entity profiling and product recommendation. The matching process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but must also handle heterogeneous entity attributes. In this paper, we design an unsupervised approach, called EMAN, to match entities across two or more heterogeneous data sources. The algorithm uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the matching process. To handle the heterogeneous entity attributes, we employ the exponential family to model the similarities between the different attributes. EMAN is highly accurate and efficient even without any ground-truth tuples. We evaluate EMAN on re-identifying entities within the same data source, as well as on matching entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
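To illustrate how locality-sensitive hashing can shrink the set of candidate tuples before any detailed comparison, here is a minimal sketch in Python. The abstract does not say which hash family or parameters EMAN uses, so everything here is an assumption made for illustration: MinHash over character 3-grams with banding, the function names (shingles, minhash_signature, lsh_candidate_pairs), and the default values of num_hashes and bands. It should not be read as the authors' implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(token, seed):
    """Stable 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def lsh_candidate_pairs(records, num_hashes=32, bands=16):
    """Return id pairs whose signatures agree on at least one band.

    records maps a record id to a string (hypothetical input format).
    Only these pairs would be passed on to a detailed similarity model.
    """
    rows = num_hashes // bands  # signature rows per band
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rid)
    pairs = set()
    for ids in buckets.values():
        for left, right in combinations(sorted(ids), 2):
            pairs.add((left, right))
    return pairs


if __name__ == "__main__":
    # Toy records standing in for tuples drawn from different sources.
    records = {
        "a": "International Business Machines",
        "b": "International Business Machines",
        "c": "zzzz qqqq xxxx",
    }
    print(lsh_candidate_pairs(records))
```

Records with identical text always land in the same buckets, while records that share no 3-grams are paired only in the event of a hash collision. Using more bands with fewer rows each admits more candidate pairs (higher recall, more comparisons); fewer bands with more rows each does the opposite.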

Acknowledgements

This work is supported by the National Basic Research Program (973) of China (No. 2012CB316203) and by NSFC under Grant Nos. U1401256, 61402177, 61402180 and 61232002. This work is also supported by the CCF-Tencent Research Program of China (No. AGR20150114), the NSF of Shanghai (No. 14ZR1412600), and a fund of ECNU for overseas scholars, international conferences and domestic scholarly visits.

Author information

Authors and Affiliations

Institute for Data Science and Engineering, ECNU-PINGAN Innovative Research Center for Big Data, East China Normal University, Shanghai, China
Chao Kong, Ming Gao, Weining Qian & Aoying Zhou
Technische Universität Berlin, Berlin, Germany
Chen Xu

Corresponding author

Correspondence to Ming Gao.

Editor information

Editors and Affiliations

Georgia Institute of Technology, Atlanta, Georgia, USA
Shamkant B. Navathe
University of Texas at Dallas, Richardson, Texas, USA
Weili Wu
University of Minnesota, Minneapolis, Minnesota, USA
Shashi Shekhar
Renmin University, Beijing, China
Xiaoyong Du
Fudan University, Shanghai, China
X. Sean Wang
Rutgers, The State University of New Jer, New Brunswick, New Jersey, USA
Hui Xiong

About this paper

Cite this paper

Kong, C., Gao, M., Xu, C., Qian, W., Zhou, A. (2016). Entity Matching Across Multiple Heterogeneous Data Sources. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_9

DOI: https://doi.org/10.1007/978-3-319-32025-0_9
Published: 25 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science, Computer Science (R0)
