Deep context of citations using machine-learning models in scholarly full-text articles

Hassan, Saeed-Ul; Imran, Mubashir; Iqbal, Sehrish; Aljohani, Naif Radi; Nawaz, Raheel

doi:10.1007/s11192-018-2944-y

Published: 29 October 2018

Scientometrics, Volume 117, pages 1645–1662 (2018)

Saeed-Ul Hassan¹ (ORCID: orcid.org/0000-0002-6509-9190),
Mubashir Imran¹,
Sehrish Iqbal¹,
Naif Radi Aljohani² &
Raheel Nawaz³

2119 Accesses
50 Citations

Abstract

Information retrieval systems for scholarly literature rely heavily not only on text matching but also on semantic- and context-based features. Readers are increasingly interested in how important an article is, what its purpose is, and how influential it is in follow-up research. Numerous techniques that tap the power of machine learning and artificial intelligence have been developed to enhance retrieval of the most influential scientific literature. In this paper, we compare and improve on four existing state-of-the-art techniques designed to identify influential citations. We consider 450 citations from the Association for Computational Linguistics corpus, classified by experts as either important or unimportant, and extract 64 features based on the methodology of the four techniques. We apply the Extra-Trees classifier to select the 29 best features and apply the Random Forest and Support Vector Machine classifiers to all selected techniques. Using the Random Forest classifier, our supervised model improves on the state-of-the-art method by 11.25%, with 89% Precision-Recall area under the curve. Finally, we present our deep-learning model, a Long Short-Term Memory network that uses all 64 features to distinguish important from unimportant citations with 92.57% accuracy.
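
The abstract reports its headline result as area under the Precision-Recall curve. As a minimal sketch of what that metric measures, the Python below computes a step-wise estimate of the area (often called average precision) from classifier scores. The function names, the toy labels and scores, and the choice of the step-wise estimator are illustrative assumptions; the abstract does not say how the authors computed the area, and the sketch assumes no two citations share the same score.

```python
from typing import List, Tuple


def precision_recall_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each item, ranked by descending score.

    labels: 1 for an important citation, 0 for an unimportant one (assumed encoding).
    scores: classifier confidence that the citation is important.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank citations from most to least confidently "important".
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = []
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / total_pos, true_pos / rank))
    return points


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Step-wise PR area: sum of precision times the increase in recall."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in precision_recall_points(labels, scores):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy data (hypothetical, not from the paper): six citations, three important.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_area = average_precision(toy_labels, toy_scores)
```

On the toy data the area is 29/36, about 0.806; a ranking that places every important citation ahead of every unimportant one gives 1.0. Unlike accuracy, this measure depends only on how the positive class is ranked, which is why it is often preferred when important citations are the minority class.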

Author information

Authors and Affiliations

Information Technology University, 346-B, Ferozepur Road, Lahore, Pakistan
Saeed-Ul Hassan, Mubashir Imran & Sehrish Iqbal
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
Naif Radi Aljohani
School of Computer Science, Manchester Metropolitan University, Manchester, UK
Raheel Nawaz

Corresponding author

Correspondence to Saeed-Ul Hassan.

Appendix

See Table 13.

About this article

Cite this article

Hassan, SU., Imran, M., Iqbal, S. et al. Deep context of citations using machine-learning models in scholarly full-text articles. Scientometrics 117, 1645–1662 (2018). https://doi.org/10.1007/s11192-018-2944-y

Received: 26 February 2018
Published: 29 October 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s11192-018-2944-y
