An Efficient Framework for Algorithmic Metadata Extraction over Scholarly Documents Using Deep Neural Networks | SN Computer Science

  • Original Research

Abstract

With the development of various search engines, large amounts of text data can now be retrieved efficiently from conventional text documents. However, these traditional search approaches often have lower retrieval accuracy, particularly when documents have characteristics that call for deeper semantic extraction. A search engine for algorithms, AlgorithmSeer, has recently been developed. It collects deep textual metadata and pseudo-codes from research papers. However, such a system cannot serve user queries that seek algorithm-specific information, such as the data sets on which algorithms operate, their effectiveness, or their runtime complexity. This study presents a number of improvements to the previously proposed algorithm search engine. We provide several ways to automatically identify and extract pseudo-codes and the sentences that convey algorithmic metadata, using various machine learning methods. The experiments are conducted on around 89,000 text lines, and we introduce new features for extracting algorithmic pseudo-codes. These features fall into groups that focus on content, font style, and structure. Our proposed pseudo-code extraction method outperforms current approaches by 28% and obtains a classification accuracy of 94.23%. Additionally, we propose a technique for extracting algorithm-related sentences using deep neural networks, which achieves 82% accuracy; the corresponding figures reported for a recent rule-based method and a support vector machine are 23.5% and 21.5%, respectively.
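The abstract names three feature groups (content, font style, structure) but does not list the individual features. The following is only a minimal sketch of what line-level feature extraction along those lines might look like. The keyword list, the `TextLine` fields, the `line_features` function, and every feature name are assumptions made for illustration, not the authors' feature set; the font attributes are assumed to come from some PDF parser.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; the paper's actual content features are not
# given in the abstract.
PSEUDO_KEYWORDS = {
    "for", "while", "if", "else", "then", "do", "end",
    "return", "repeat", "until", "input", "output",
}


@dataclass
class TextLine:
    """One extracted text line; font fields are assumed parser output."""
    text: str
    font_name: str = ""
    font_size: float = 0.0


def line_features(line: TextLine, body_font_size: float) -> dict:
    """Compute illustrative features for a single text line."""
    tokens = re.findall(r"[A-Za-z_]+", line.text.lower())
    stripped = line.text.strip()
    keyword_hits = sum(t in PSEUDO_KEYWORDS for t in tokens)
    return {
        # Content group (assumed): control-flow vocabulary and assignments.
        "keyword_ratio": keyword_hits / max(len(tokens), 1),
        "has_assignment": bool(
            re.search(r"(<-|←|:=|(?<![=<>!])=(?!=))", line.text)
        ),
        # Font-style group (assumed): monospace face, size relative to body.
        "is_monospace": any(
            k in line.font_name.lower() for k in ("mono", "courier")
        ),
        "font_size_delta": line.font_size - body_font_size,
        # Structure group (assumed): indentation, line numbers, punctuation.
        "indent": len(line.text) - len(line.text.lstrip(" ")),
        "starts_with_line_number": bool(re.match(r"\d+[:.)]?\s", stripped)),
        "ends_with_period": stripped.endswith("."),
    }


code_line = TextLine("    3: while i < n do", font_name="Courier", font_size=9.0)
prose_line = TextLine("We evaluate the method on three data sets.",
                      font_name="Times", font_size=10.0)
code_feats = line_features(code_line, body_font_size=10.0)
prose_feats = line_features(prose_line, body_font_size=10.0)
```

Feature dictionaries of this kind could then be fed to any line-level classifier. Which classifier and which features the authors actually use cannot be determined from the abstract alone.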

Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Funding

No funding was received for this research.

Author information

Corresponding author

Correspondence to P. Raghavendra Nayaka.

Ethics declarations

Conflict of Interest

No conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Image Processing, Wireless Networks, Cloud Applications and Network Security” guest edited by P. Raviraj, Maode Ma and Roopashree H R.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Raghavendra Nayaka, P., Ranjan, R. An Efficient Framework for Algorithmic Metadata Extraction over Scholarly Documents Using Deep Neural Networks. SN COMPUT. SCI. 4, 341 (2023). https://doi.org/10.1007/s42979-023-01776-3

  • DOI: https://doi.org/10.1007/s42979-023-01776-3