Abstract
The development of various search engines has made it possible to retrieve large amounts of text data efficiently from conventional text documents. However, these traditional search approaches often retrieve less accurately, particularly when documents have characteristics that call for deeper semantic extraction. A search engine for algorithms, AlgorithmSeer, has recently been developed; it collects deep textual metadata and pseudo-codes from research papers. However, such a system cannot accommodate user searches for algorithm-specific information, such as the data sets on which algorithms operate, their effectiveness, and their runtime complexity. This study presents a number of improvements to the previously proposed algorithm search engine. We provide several ways to automatically identify and extract pseudo-codes and metadata-bearing phrases using machine learning methods. The experiments use about 89,000 text lines, and we introduce new features for extracting algorithmic pseudo-codes. These features fall into groups based on content, font style, and structure. Our proposed pseudo-code extraction method outperforms current strategies by 28% and achieves a classification accuracy of 94.23%. Additionally, we propose a technique for extracting algorithm-related phrases using deep neural networks, which achieves 82% accuracy, against the 23.5% and 21.5% reported for a recent rule-based method and a support vector machine, respectively.
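To make the three feature groups concrete, the following is a minimal sketch of how a single text line might be turned into content, font-style, and structure features before being passed to a classifier. It is not the authors' feature set: the `TextLine` fields, the keyword list, and every feature name are hypothetical, and the sketch assumes that a PDF extractor has already supplied the font name, font size, and left indent of each line.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; the abstract does not enumerate the content features.
PSEUDO_KEYWORDS = {"for", "while", "if", "else", "return", "end", "do", "then"}


@dataclass
class TextLine:
    text: str         # raw line text extracted from the PDF
    font_name: str    # font reported by the PDF extractor (assumed available)
    font_size: float  # point size of the line
    indent: float     # left offset of the line, in points


def line_features(line: TextLine, body_font_size: float) -> dict:
    """Map one text line to illustrative content, font-style and structure features."""
    tokens = re.findall(r"[A-Za-z_]+", line.text.lower())
    n_tokens = max(len(tokens), 1)  # avoid division by zero on symbol-only lines
    font = line.font_name.lower()
    return {
        # Content group: vocabulary and symbols typical of pseudo-code.
        "keyword_ratio": sum(t in PSEUDO_KEYWORDS for t in tokens) / n_tokens,
        "symbol_count": sum(line.text.count(s) for s in ("←", ":=", "(", ")", "[", "]")),
        "starts_with_number": bool(re.match(r"\s*\d+[.:)]?\s", line.text)),
        # Font-style group: pseudo-code is often set in a different face or size.
        "is_monospace": any(k in font for k in ("mono", "courier", "typewriter")),
        "size_delta": line.font_size - body_font_size,
        # Structure group: layout cues such as indentation and line length.
        "indent": line.indent,
        "char_length": len(line.text),
    }


code_line = TextLine("3: for i ← 1 to n do", "CourierNew", 9.0, 36.0)
prose_line = TextLine("We evaluate the method on a large corpus.", "Times-Roman", 10.0, 0.0)
code_feats = line_features(code_line, body_font_size=10.0)
prose_feats = line_features(prose_line, body_font_size=10.0)
```

Feature dictionaries of this kind could then be vectorized and fed to any line-level classifier; which classifier and which features the study actually uses is not stated in the abstract.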





Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
Funding
No funding was received for this research.
Ethics declarations
Conflict of Interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Advances in Computational Approaches for Image Processing, Wireless Networks, Cloud Applications and Network Security” guest edited by P. Raviraj, Maode Ma and Roopashree H R.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Raghavendra Nayaka, P., Ranjan, R. An Efficient Framework for Algorithmic Metadata Extraction over Scholarly Documents Using Deep Neural Networks. SN COMPUT. SCI. 4, 341 (2023). https://doi.org/10.1007/s42979-023-01776-3
DOI: https://doi.org/10.1007/s42979-023-01776-3