{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,7]],"date-time":"2024-08-07T00:26:32Z","timestamp":1722990392744},"reference-count":38,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2022,9,30]],"date-time":"2022-09-30T00:00:00Z","timestamp":1664496000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Science and Higher Education of Russia","award":["FEWM-2020-0037"]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"This paper is a continuation of our previous work on solving source code authorship identification problems. The analysis of heterogeneous source code is a relevant issue for copyright protection in commercial software development. This is related to the specificity of development processes and the usage of collaborative development tools (version control systems). As a result, there are source codes written according to different programming standards by a team of programmers with different skill levels. Another application field is information security\u2014in particular, identifying the author of computer viruses. We apply our technique based on a hybrid of Inception-v1 and Bidirectional Gated Recurrent Units architectures on heterogeneous source codes and consider the most common commercial development complex cases that negatively affect the authorship identification process. The paper is devoted to the possibilities and limitations of the author\u2019s technique in various complex cases. For situations where a programmer was proficient in two programming languages, the average accuracy was 87%; for proficiency in three or more\u201476%. For the artificially generated source code case, the average accuracy was 81.5%. Finally, the average accuracy for source codes generated from commits was 84%. 
The comparison with state-of-the-art approaches showed that the proposed method has no full-functionality analogs covering actual practical cases.","DOI":"10.3390\/fi14100287","type":"journal-article","created":{"date-parts":[[2022,10,8]],"date-time":"2022-10-08T08:04:56Z","timestamp":1665216296000},"page":"287","source":"Crossref","is-referenced-by-count":1,"title":["Complex Cases of Source Code Authorship Identification Using a Hybrid Deep Neural Network"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"http:\/\/orcid.org\/0000-0001-5619-1836","authenticated-orcid":false,"given":"Anna","family":"Kurtukova","sequence":"first","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-2587-2222","authenticated-orcid":false,"given":"Aleksandr","family":"Romanov","sequence":"additional","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]},{"given":"Alexander","family":"Shelupanov","sequence":"additional","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-7844-4363","authenticated-orcid":false,"given":"Anastasia","family":"Fedotova","sequence":"additional","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2022,9,30]]},"reference":[{"key":"ref_1","first-page":"741","article-title":"Identification author of source code by machine learning methods","volume":"18","author":"Kurtukova","year":"2019","journal-title":"Tr. SPIIRAN"},{"doi-asserted-by":"crossref","unstructured":"Kurtukova, A., Romanov, A., and Shelupanov, A. (2020). Source Code Authorship Identification Using Deep Neural Networks. 
Symmetry, 12.","key":"ref_2","DOI":"10.3390\/sym12122044"},{"doi-asserted-by":"crossref","unstructured":"Abuhamad, M., AbuHmed, T., Mohaisen, A., and Nyang, D. (2018, January 15\u201319). Large-Scale and Language-Oblivious Code Authorship Identification. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada.","key":"ref_3","DOI":"10.1145\/3243734.3243738"},{"unstructured":"Zhen, L., Chen, G., Chen, C., Zou, Y., and Xu, S. (2022, January 25\u201327). RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation. Proceedings of the 2022 IEEE 44th International Conference on Software Engineering (ICSE), Pittsburgh, PA, USA.","key":"ref_4"},{"doi-asserted-by":"crossref","unstructured":"Holland, C., Khoshavi, N., and Jaimes, L.G. (2022, January 18\u201320). Code authorship identification via deep graph CNNs. Proceedings of the 2022 ACM Southeast Conference (ACM SE \u201822), Virtual.","key":"ref_5","DOI":"10.1145\/3476883.3520227"},{"unstructured":"(2022, August 18). Google Code Jam. Available online: https:\/\/codingcompetitions.withgoogle.com\/codejam.","key":"ref_6"},{"key":"ref_7","first-page":"012011","article-title":"Explainable source code authorship attribution algorithm","volume":"2134","author":"Bogdanova","year":"2021","journal-title":"J. Phys."},{"doi-asserted-by":"crossref","unstructured":"Bogdanova, A. (2021, January 17\u201322). Source code authorship attribution using file embeddings. Proceedings of the 2021 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, Zurich, Switzerland.","key":"ref_8","DOI":"10.1145\/3484271.3484981"},{"doi-asserted-by":"crossref","unstructured":"Bogomolov, E., Kovalenko, V., Rebryk, Y., Bacchelli, A., and Bryksin, T. (2021, January 23\u201328). Authorship attribution of source code: A language-agnostic approach and applicability in software engineering. 
Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece.","key":"ref_9","DOI":"10.1145\/3468264.3468606"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"141987","DOI":"10.1109\/ACCESS.2019.2943639","article-title":"Source code authorship attribution using hybrid approach of program dependence graph and deep learning model","volume":"7","author":"Ullah","year":"2019","journal-title":"IEEE Access"},{"doi-asserted-by":"crossref","unstructured":"Bayrami, P., and Rice, J.E. (2021, January 12\u201317). Code authorship attribution using content-based and non-content-based features. Proceedings of the 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Ottawa, ON, Canada.","key":"ref_11","DOI":"10.1109\/CCECE53047.2021.9569061"},{"unstructured":"Caldeira, R.S. (2022, August 18). A Deep Learning Approach to Recognize Source Code Authorship. Available online: https:\/\/maups.GitHub.io\/papers\/tcc_008.pdf.","key":"ref_12"},{"unstructured":"(2022, August 18). Codeforces. Available online: https:\/\/codeforces.com\/.","key":"ref_13"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1016\/j.future.2020.10.020","article-title":"Pkg2Vec: Hierarchical package embedding for code authorship attribution","volume":"116","author":"Mateless","year":"2021","journal-title":"Future Gener. Comput. Syst."},{"unstructured":"Gorshkov, S., Nered, M., Ilyushin, E., Namiot, D., and Sukhomlin, V. (December, January 29). Source code authorship identification using tokenization and boosting algorithms. Proceedings of the International Conference on Modern Information Technology and IT Education, Moscow, Russia.","key":"ref_15"},{"unstructured":"Suman, C., Raj, A., Saha, S., and Bhattacharyya, P. (2020, January 16\u201320). Source Code Authorship Attribution using Stacked classifier. 
Proceedings of the Forum for Information Retrieval Evaluation, FIRE (Working Notes), Hyderabad, India.","key":"ref_16"},{"unstructured":"Garc\u00eda-D\u00edaz, J.A., and Valencia-Garc\u00eda, R. (2020, January 16\u201320). UMUTeam at AI-SOCO \u20182020: Source Code Authorship Identification based on Character N-Grams and Author\u2019s Traits. Proceedings of the Forum for Information Retrieval Evaluation, FIRE (Working Notes), Hyderabad, India.","key":"ref_17"},{"unstructured":"(2022, August 18). GitHub. Available online: https:\/\/GitHub.com\/.","key":"ref_18"},{"unstructured":"(2022, August 18). Gitlab. Available online: https:\/\/gitlab.com\/.","key":"ref_19"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1162\/tacl_a_00313","article-title":"Leveraging pre-trained checkpoints for sequence generation tasks","volume":"8","author":"Rothe","year":"2020","journal-title":"Trans. Assoc. Comput. Linguist."},{"unstructured":"Du, Z. (2021). All nlp tasks are generation tasks: A general pretraining framework. arXiv.","key":"ref_21"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"681","DOI":"10.1007\/s11023-020-09548-1","article-title":"GPT-3: Its nature, scope, limits, and consequences","volume":"30","author":"Floridi","year":"2020","journal-title":"Minds Mach."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"101983","DOI":"10.1016\/j.wpi.2020.101983","article-title":"Patent claim generation by fine-tuning OpenAI GPT-2","volume":"62","author":"Lee","year":"2020","journal-title":"World Pat. Inf."},{"unstructured":"Dusheiko, A. (2022). Lead Generation of News Texts using the ruGPT-3 Neural Network. [Master\u2019s Thesis].","key":"ref_24"},{"unstructured":"Pisarevskaya, D., and Shavrina, T. (2022). WikiOmnia: Generative QA corpus on the whole Russian Wikipedia. 
arXiv.","key":"ref_25"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.3390\/ai2010001","article-title":"Automated source code generation and auto-completion using deep learning: Comparing and discussing current language model-related approaches","volume":"2","year":"2021","journal-title":"AI"},{"unstructured":"(2022, August 18). Open AI. Available online: https:\/\/openai.com\/blog\/openai-codex.","key":"ref_27"},{"unstructured":"(2022, August 18). GitHub Copilot. Available online: https:\/\/copilot.GitHub.com.","key":"ref_28"},{"unstructured":"(2022, August 18). AlphaCode. Available online: https:\/\/deepmind.com\/blog\/article\/Competitive-programming-with-AlphaCode.","key":"ref_29"},{"unstructured":"(2022, August 18). Sber AI ruGPT-3. Available online: https:\/\/developers.sber.ru\/portal\/tools\/rugpt-3.","key":"ref_30"},{"unstructured":"(2022, August 18). Polycoder. Available online: https:\/\/venturebeat.com\/2022\/03\/04\/researchers-open-source-code-generating-ai-they-claim-can-beat-openais-codex\/.","key":"ref_31"},{"key":"ref_32","first-page":"1","article-title":"Identifying authorship by bytelevel n-grams: The source code author profile (SCAP) method","volume":"1","author":"Frantzeskou","year":"2007","journal-title":"Int. J. Digital. Evid."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1016\/j.diin.2015.09.001","article-title":"Scripting DNA: Identifying the JavaScript Programmer","volume":"15","author":"Wisse","year":"2015","journal-title":"Digit. Investig."},{"unstructured":"(2022, August 18). FastText. Available online: https:\/\/fasttext.cc\/.","key":"ref_34"},{"unstructured":"(2022, August 18). BERT. Available online: https:\/\/huggingface.co\/docs\/transformers\/model_doc\/bert.","key":"ref_35"},{"unstructured":"(2022, August 18). VGCN-BERT. Available online: https:\/\/arxiv.org\/abs\/2004.05707.","key":"ref_36"},{"unstructured":"(2022, August 18). 
Bag of Tricks for Efficient Text Classification. Available online: https:\/\/aclanthology.org\/E17-2068\/.","key":"ref_37"},{"unstructured":"Caliskan-Islam, A. (2015, January 12\u201314). Deanonymizing programmers via code stylometry. Proceedings of the 24th USENIX Security Symposium, Washington, DC, USA.","key":"ref_38"}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/14\/10\/287\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T20:01:18Z","timestamp":1722974478000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/14\/10\/287"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,30]]},"references-count":38,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2022,10]]}},"alternative-id":["fi14100287"],"URL":"https:\/\/doi.org\/10.3390\/fi14100287","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2022,9,30]]}}}