Abstract
The disambiguation of named entities is a challenge in many fields, including scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR errors, abbreviations, and omissions. Searching for the names of persons or organizations therefore becomes difficult as soon as a single name can appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that combines soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation of our approach: it cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.
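To make the supervised setting concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier that maps a raw affiliation string to a canonical institution label. It is not the authors' implementation: the tokenizer, the Laplace smoothing, the class name `NaiveBayesAffiliations`, and the training strings and labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(affiliation):
    """Lowercase an affiliation string and split it into alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in affiliation.lower())
    return cleaned.split()


class NaiveBayesAffiliations:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def fit(self, texts, labels):
        # Count how often each label occurs and which tokens occur under it.
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocabulary = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total)
            for token in tokenize(text):
                if token in self.vocabulary:  # tokens never seen in training are ignored
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training set: variant spellings of two invented institutions.
train_texts = [
    "Inst Alpha, Dept Physics, Springfield",
    "Alpha Institute, Physics Department, Springfield",
    "Beta Univ, Fac Medicine, Shelbyville",
    "University of Beta, Faculty of Medicine, Shelbyville",
]
train_labels = ["ALPHA", "ALPHA", "BETA", "BETA"]

model = NaiveBayesAffiliations().fit(train_texts, train_labels)
predicted = model.predict("Alpha Inst., Phys. Dept., Springfield")
```

With this toy data, an unseen variant such as "Alpha Inst., Phys. Dept., Springfield" is assigned to the label whose training strings share most of its tokens. A real bibliographic database would need far more training examples per institution, and the abstract's caveat about highly unbalanced data applies to a sketch like this as well.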
Cite this article
Cuxac, P., Lamirel, JC. & Bonvallot, V. Efficient supervised and semi-supervised approaches for affiliations disambiguation. Scientometrics 97, 47–58 (2013). https://doi.org/10.1007/s11192-013-1025-5
DOI: https://doi.org/10.1007/s11192-013-1025-5