Abstract
Many scientific software packages are published in repositories such as Zenodo and figshare. These packages are crucial for the reproducibility of published research. As an additional route to scholarly knowledge graph construction, we propose an approach for the automated extraction of machine-actionable (structured) scholarly knowledge from published software packages by static analysis of their (meta)data and contents, in particular scripts in languages such as Python. The approach can be summarized as follows. First, we extract metadata (software description, programming languages, related references) from software packages by leveraging the Software Metadata Extraction Framework (SOMEF) and the GitHub API. Second, we analyze the extracted metadata to find the research articles associated with the corresponding software repository. Third, for software contained in published packages, we create and analyze the Abstract Syntax Tree (AST) representation to extract information about the procedures performed on data. Fourth, we search for the extracted information in the full text of the related articles to constrain it to scholarly knowledge, i.e., information published in the scholarly literature. Finally, we publish the extracted machine-actionable scholarly knowledge in the Open Research Knowledge Graph (ORKG).
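The third step, AST analysis of a script, can be illustrated with a minimal sketch based on Python's standard `ast` module. This is not the authors' implementation: the sample script and the helper names `dotted_name` and `extract_calls` are hypothetical, and the sketch only lists which library functions a script calls, as a rough stand-in for "procedures performed on data".

```python
import ast

# Hypothetical analysis script, standing in for a file found in a software package.
SCRIPT = '''
import pandas as pd
from scipy import stats

df = pd.read_csv("measurements.csv")
result = stats.ttest_ind(df["a"], df["b"])
print(result)
'''


def dotted_name(node):
    """Return 'a.b.c' for nested Attribute/Name nodes, or None otherwise."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def extract_calls(source):
    """Return sorted (line number, qualified function name) pairs for each call."""
    tree = ast.parse(source)

    # First pass: map local names introduced by imports to their full module paths.
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                aliases[alias.asname or alias.name] = alias.name
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            for alias in node.names:
                full = f"{module}.{alias.name}" if module else alias.name
                aliases[alias.asname or alias.name] = full

    # Second pass: collect calls and expand import aliases in their names.
    calls = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is None:
                continue  # e.g. a call on a subscript or on another call's result
            head, _, rest = name.partition(".")
            if head in aliases:
                name = aliases[head] + ("." + rest if rest else "")
            calls.append((node.lineno, name))
    return sorted(calls)


if __name__ == "__main__":
    for lineno, name in extract_calls(SCRIPT):
        print(lineno, name)
```

Because the script is only parsed and never executed, the analysis stays static. Resolving aliases such as `pd` to `pandas` gives names that are easier to match against the text of a related article, which is the role of the fourth step.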
Acknowledgment
This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and TIB–Leibniz Information Centre for Science and Technology.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Haris, M., Stocker, M., Auer, S. (2022). Scholarly Knowledge Extraction from Published Software Packages. In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_24
DOI: https://doi.org/10.1007/978-3-031-21756-2_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21755-5
Online ISBN: 978-3-031-21756-2
eBook Packages: Computer Science, Computer Science (R0)