Scholarly Knowledge Extraction from Published Software Packages

  • Conference paper
From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries (ICADL 2022)

Abstract

A plethora of scientific software packages are published in repositories such as Zenodo and figshare. These software packages are crucial for the reproducibility of published research. As an additional route to scholarly knowledge graph construction, we propose an approach for the automated extraction of machine-actionable (structured) scholarly knowledge from published software packages by static analysis of their (meta)data and contents (in particular, scripts in languages such as Python). The approach can be summarized as follows. First, we extract metadata (software description, programming languages, related references) from software packages by leveraging the Software Metadata Extraction Framework (SOMEF) and the GitHub API. Second, we analyze the extracted metadata to find the research articles associated with the corresponding software repository. Third, for software contained in published packages, we create and analyze the Abstract Syntax Tree (AST) representation to extract information about the procedures performed on data. Fourth, we search for the extracted information in the full text of the related articles to constrain it to scholarly knowledge, i.e., information published in the scholarly literature. Finally, we publish the extracted machine-actionable scholarly knowledge in the Open Research Knowledge Graph (ORKG).
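
The third and fourth steps can be illustrated with a minimal sketch built on Python's standard `ast` module. It is not the authors' implementation: the helper names (`dotted_name`, `extract_calls`, `mentioned_in_text`), the sample script, and the sample article sentence are all hypothetical, and the case-insensitive substring match is only an assumed stand-in for whatever full-text search the paper actually uses.

```python
import ast


def dotted_name(node):
    """Return a dotted name such as 'pd.read_csv' for Name/Attribute chains, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def extract_calls(source):
    """Statically collect (called name, string-literal arguments) pairs from Python source."""
    tree = ast.parse(source)  # the script is parsed, never executed
    calls = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is None:
                continue
            literals = [
                arg.value
                for arg in node.args
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str)
            ]
            calls.append((name, literals))
    return calls


def mentioned_in_text(calls, full_text):
    """Keep only calls whose last name component occurs in the article text.

    Assumption: a plain case-insensitive substring match stands in for the
    full-text search step.
    """
    text = full_text.lower()
    return [
        (name, literals)
        for name, literals in calls
        if name.split(".")[-1].lower() in text
    ]


# Hypothetical script from a published software package.
script = '''
import pandas as pd
from scipy import stats
df = pd.read_csv("expression.csv")
result = stats.ttest_ind(df["a"], df["b"])
'''

# Hypothetical sentence from the related article's full text.
article = "Group differences were assessed with a two-sided t-test (ttest_ind)."

calls = extract_calls(script)
kept = mentioned_in_text(calls, article)
```

In this sketch, `calls` records both the data-loading call and the statistical test, while `kept` retains only the operation that the article text also mentions. Because the script is only parsed, its dependencies (here pandas and SciPy) need not be installed.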

Notes

  1. https://www.dbpedia.org

  2. https://www.wikidata.org/wiki/Wikidata:Main_Page

  3. https://hi-knowledge.org

  4. https://covid-aqs.fz-juelich.de

  5. https://paperswithcode.org

  6. https://cooperationdatabank.org

  7. https://zenodo.org

  8. https://developers.zenodo.org

  9. https://www.orkg.org/orkg/

  10. https://github.com/gitpython-developers/GitPython

  11. https://zenodo.org/record/5874955

  12. https://docs.python.org/3/library/ast.html

  13. https://api.unpaywall.org/v2/10.1186/s12920-019-0613-5?email=unpaywall_01@example.com

  14. https://orkg.org/paper/R209873

  15. https://orkg.org/content-type/Software/R209880

Acknowledgment

This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and TIB–Leibniz Information Centre for Science and Technology.

Author information

Corresponding author

Correspondence to Muhammad Haris.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Haris, M., Stocker, M., Auer, S. (2022). Scholarly Knowledge Extraction from Published Software Packages. In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_24

  • DOI: https://doi.org/10.1007/978-3-031-21756-2_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21755-5

  • Online ISBN: 978-3-031-21756-2

  • eBook Packages: Computer Science, Computer Science (R0)
