MORE: Toward Improving Author Name Disambiguation in Academic Knowledge Graphs | International Journal of Machine Learning and Cybernetics

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Author name disambiguation (AND) is a fundamental task in knowledge alignment for building a knowledge graph network or an online academic search system. Existing AND algorithms tend to over-split and over-merge papers, severely degrading the performance of downstream tasks. In this paper, we demonstrate the problem of paper over-splitting and over-merging when constructing an academic knowledge graph. To address these problems, we systematically investigate and propose a unified architecture, MORE, which utilizes LightGBM and HAC FOR paper clusteRing as well as HGAT for both cluster alignmEnt and knowledge graph representation learning. Specifically, we first propose a novel representation learning method that leverages OAG-BERT to learn paper entity embeddings and uses SimCSE to regularize the anisotropic space of the pre-trained embeddings. We then apply LightGBM to compute the paper similarity matrix from the entity embeddings. We also use hierarchical agglomerative clustering (HAC) to group clusters and alleviate over-merging. Finally, considering co-author relationships, we improve the HGAT model with a hard-cross graph attention mechanism to generate semantic and structural embeddings. Experimental results on two large real-world datasets show that our proposed method achieves a 6% to 16% improvement in F1-score over the baseline models.
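
To make the clustering step and the over-splitting/over-merging trade-off concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hand-written similarity matrix `SIM` standing in for the scores a pairwise model such as LightGBM might produce, average linkage as the HAC merge rule, and pairwise F1 as the evaluation metric; the function names, the threshold values, and the toy data are all hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise similarities for five papers sharing one author name.
# In the pipeline described above these would come from a learned model;
# here they are hand-written for illustration only.
SIM = [
    [1.00, 0.90, 0.80, 0.30, 0.20],
    [0.90, 1.00, 0.70, 0.25, 0.30],
    [0.80, 0.70, 1.00, 0.40, 0.35],
    [0.30, 0.25, 0.40, 1.00, 0.85],
    [0.20, 0.30, 0.35, 0.85, 1.00],
]
TRUTH = [[0, 1, 2], [3, 4]]  # assumed ground truth: two distinct authors


def hac_average_linkage(sim, threshold):
    """Merge the most similar pair of clusters until none reaches `threshold`."""
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Average similarity over all cross-cluster paper pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[x], clusters[y]) < threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


def pairwise_f1(pred, truth):
    """F1 over same-cluster paper pairs (one common way to score AND)."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(c, 2)}

    p, t = pairs(pred), pairs(truth)
    tp = len(p & t)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for th in (0.95, 0.5, 0.2):
        pred = hac_average_linkage(SIM, th)
        print(th, pred, round(pairwise_f1(pred, TRUTH), 3))
```

On this toy data, a strict threshold leaves every paper in its own cluster (over-splitting), while a loose one collapses both authors into a single cluster (over-merging); an intermediate threshold recovers the assumed ground truth. Whether the paper scores F1 pairwise or per cluster is not stated in the abstract, so the metric here is only an assumption.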

Notes

  1. https://www.aminer.cn.

  2. https://www.aminer.cn/whoiswho.

  3. https://www.microsoft.com/en-us/research/project/open-academic-graph/.

Acknowledgements

This work was supported by the National Key R&D Program of China through grant 2021YFB1714800, the S&T Program of Hebei through grant 21340301D, the Innovation Capability Improvement Plan Project of Hebei Province through grant 22567626H, and the Hebei Natural Science Foundation of China through grant F2022203072.

Author information

Corresponding author

Correspondence to Yi Zhao.

About this article

Cite this article

Gong, J., Fang, X., Peng, J. et al. MORE: Toward Improving Author Name Disambiguation in Academic Knowledge Graphs. Int. J. Mach. Learn. & Cyber. 15, 37–50 (2024). https://doi.org/10.1007/s13042-022-01686-5

  • DOI: https://doi.org/10.1007/s13042-022-01686-5
