{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T08:36:17Z","timestamp":1726475777621},"reference-count":54,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2021,12,22]],"date-time":"2021-12-22T00:00:00Z","timestamp":1640131200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the Ministry of Science and Higher Education of Russia","award":["Government Order for 2020\u20132022, project no. FEWM-2020-0037 (TUSUR)"]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"Authorship attribution is one of the important fields of natural language processing (NLP). Its popularity is due to the relevance of implementing solutions for information security, as well as copyright protection, various linguistic studies, in particular, researches of social networks. The article is a continuation of the series of studies aimed at the identification of the Russian-language text\u2019s author and reducing the required text volume. The focus of the study was aimed at the attribution of textual data created as a product of human online activity. The effectiveness of the models was evaluated on the two Russian-language datasets: literary texts and short comments from users of social networks. Classical machine learning (ML) algorithms, popular neural networks (NN) architectures, and their hybrids, including convolutional neural network (CNN), networks with long short-term memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and fastText, that have not been used in previous studies, were applied to solve the problem. A particular experiment was devoted to the selection of informative features using genetic algorithms (GA) and evaluation of the classifier trained on the optimal feature space. 
Using fastText or a combination of support vector machine (SVM) with GA reduced the time costs by half in comparison with deep NNs with comparable accuracy. The average accuracy for literary texts was 80.4% using SVM combined with GA, 82.3% using deep NNs, and 82.1% using fastText. For social media comments, results were 66.3%, 73.2%, and 68.1%, respectively.","DOI":"10.3390\/fi14010004","type":"journal-article","created":{"date-parts":[[2021,12,23]],"date-time":"2021-12-23T07:01:19Z","timestamp":1640242879000},"page":"4","source":"Crossref","is-referenced-by-count":11,"title":["Authorship Attribution of Social Media and Literary Russian-Language Texts Using Machine Learning Methods and Feature Selection"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"http:\/\/orcid.org\/0000-0001-7844-4363","authenticated-orcid":false,"given":"Anastasia","family":"Fedotova","sequence":"first","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-2587-2222","authenticated-orcid":false,"given":"Aleksandr","family":"Romanov","sequence":"additional","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]},{"given":"Anna","family":"Kurtukova","sequence":"additional","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]},{"given":"Alexander","family":"Shelupanov","sequence":"additional","affiliation":[{"name":"Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2021,12,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Romanov, A., Kurtukova, A., Shelupanov, A., Fedotova, A., and Goncharov, V. (2021). 
Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks. Future Internet, 13.","DOI":"10.3390\/fi13010003"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Romanov, A.S., Kurtukova, A.V., Sobolev, A.A., Shelupanov, A.A., and Fedotova, A.M. (2020). Determining the Age of the Author of the Text Based on Deep Neural Network Models. Information, 11.","DOI":"10.3390\/info11120589"},{"key":"ref_3","unstructured":"Romanov, A., Kurtukova, A., Fedotova, A., and Meshcheryakov, R. (2019, January 27). Natural Text Anonymization Using Universal Transformer with a Self-attention. Proceedings of the III International Conference on Language Engineering and Applied Linguistics (PRLEAL-2019), Saint Petersburg, Russia."},{"key":"ref_4","first-page":"104","article-title":"Method of the artificial texts identification based on the calculation of the belonging measure to the invariants","volume":"49","author":"Shumskaya","year":"2016","journal-title":"Inform. Autom."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Kurtukova, A., Romanov, A., and Shelupanov, A. (2020). Source Code Authorship Identification Using Deep Neural Networks. Symmetry, 12.","DOI":"10.3390\/sym12122044"},{"key":"ref_6","unstructured":"Romanov, A.S., Vasilieva, M.I., Kurtukova, A.V., and Meshcheryakov, R.V. (2017, January 27). Sentiment Analysis of Text Using Machine Learning Techniques. Proceedings of the 2nd International Conference \u201cR. Piotrowski\u2019s Readings LE & AL\u20192017\u201d, Saint Petersburg, Russia."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Khomenko, A., Baranova, Y., Romanov, A., and Zadvornov, K. (2021, January 16\u201319). Linguistic Modeling as a Basis for Creating Authorship Attribution Software. 
Proceedings of the Computational Linguistics and Intellectual Technologies \u201cDialogue\u201d, Moscow, Russia.","DOI":"10.28995\/2075-7182-2021-20-1063-1074"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Varela, P., Justino, E., and Oliveira, L.S. (August, January 31). Selecting syntactic attributes for authorship attribution. Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA.","DOI":"10.1109\/IJCNN.2011.6033217"},{"key":"ref_9","first-page":"30","article-title":"Identification of authorship of Ukrainian-language texts of journalistic style using neural networks","volume":"1","author":"Lupei","year":"2020","journal-title":"East.-Eur. J. Enterp. Technol."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1016\/j.neucom.2017.08.022","article-title":"A topic drift model for authorship attribution","volume":"273","author":"Yang","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1903","DOI":"10.1007\/s10115-019-01408-4","article-title":"Improved algorithms for extrinsic author verification","volume":"62","author":"Potha","year":"2020","journal-title":"Knowl. Inf. Syst."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/j.2517-6161.1977.tb01600.x","article-title":"Maximum likelihood from incomplete data via the EM algorithm","volume":"39","author":"Dempster","year":"1977","journal-title":"J. R. Stat. Soc. Ser. B (Methodol.)"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Halvani, O., and Graner, L. (2021, January 17\u201320). POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis. Proceedings of the 16th International Conference on Availability, Reliability and Security, Vienna, Austria.","DOI":"10.1145\/3465481.3470050"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Bevendorff, J., Hagen, M., Stein, B., and Potthast, M. (August, January 28). 
Bias Analysis and Mitigation in the Evaluation of Authorship Verification. Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy.","DOI":"10.18653\/v1\/P19-1634"},{"key":"ref_15","unstructured":"Radhakrishnan, R., and Penstein, C. (2019). Machine Learning Framework for Authorship Identification from Texts. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"13575","DOI":"10.1007\/s11042-020-10361-2","article-title":"Novel authorship verification model for social media accounts compromised by a human","volume":"80","author":"Alterkav","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Demir, N., and Can, M. (2018). Authorship Authentication of Short Messages from Social Networks Machines. Southeast Eur. J. Soft Comput., 7.","DOI":"10.21533\/scjournal.v7i1.148"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Demir, N. (2016). Authorship Authentication for Twitter Messages Using Support Vector Machine. Southeast Eur. J. Soft Comput., 5.","DOI":"10.21533\/scjournal.v5i2.116"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"858","DOI":"10.1002\/asi.24163","article-title":"Automated language-independent authorship verification (for Indo-European languages)","volume":"70","author":"Adamovic","year":"2019","journal-title":"J. Assoc. Inf. Sci. Technol."},{"key":"ref_20","unstructured":"Boumber, D., Zhang, Y., and Mukherjee, A. (2018, January 7\u201312). Experiments with convolutional neural networks for multi-label authorship attribution. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"ref_21","unstructured":"(2021, November 18). PAN: Shared Tasks. Available online: https:\/\/pan.webis.de\/shared-tasks.html."},{"key":"ref_22","unstructured":"Boenninghoff, B., Nickel, R.M., and Kolossa, D. (2021). 
O2D2: Out-of-distribution detector to capture undecidable trials in authorship verification. arXiv."},{"key":"ref_23","unstructured":"Weerasinghe, J., Singh, R., and Greenstadt, R. (2021, January 21\u201324). Feature vector difference based authorship verification for open-world settings. Proceedings of the CEUR Workshop 2021, Bucharest, Romania."},{"key":"ref_24","first-page":"259","article-title":"Authorship verification of opinion pieces in Estonian","volume":"10","author":"Petmanson","year":"2014","journal-title":"Eest. Raken. Uhin. Aastaraam."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Baj, M., and Walkowiak, T. (2017, January 11\u201315). Computer Based Stylometric Analysis of Texts in Polish Language. Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland.","DOI":"10.1007\/978-3-319-59060-8_1"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Kapo\u010di\u016bt\u0117-Dzikicn\u0117, J., and Dama\u0161evi\u010dius, R. (2018, January 9\u201312). Lithuanian Author Profiling with the Deep Learning. Proceedings of the 2018 Federated Conference on Computer Science and Information Systems (FedCSIS), Pozna\u0144, Poland.","DOI":"10.15439\/2018F22"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Venckauskas, A., Karpavicius, A., Dama\u0161evi\u010dius, R., Marcinkevi\u010dius, R., Kapo\u010di\u016bte-Dzikien\u00e9, J., and Napoli, C. (2017, January 3\u20136). Open class authorship attribution of lithuanian internet comments using one-class classifier. Proceedings of the 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), Prague, Czech Republic.","DOI":"10.15439\/2017F461"},{"key":"ref_28","unstructured":"Dinu, L.P., Popescu, M., and Dinu, A. (June, January 26). Authorship Identification of Romanian Texts with Controversial Paternity. 
Proceedings of the International Conference on Language Resources and Evaluation, Marrakech, Morocco."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"29","DOI":"10.12697\/smp.2018.5.2.02","article-title":"Versification and authorship attribution. A pilot study on Czech, German, Spanish, and English poetry","volume":"5","author":"Bobenhausen","year":"2019","journal-title":"Studia Metr. Poet."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Litvinova, T., Litvinova, O., and Panicheva, P. (2019, January 28\u201330). Authorship attribution of Russian forum posts with different types of n-gram features. Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval, Tokushima, Japan.","DOI":"10.1145\/3342827.3342834"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Pimonova, E., Durandin, O., and Malafeev, A. (2019). Authorship Attribution in Russian with New High-Performing and Fully Interpretable Morpho-Syntactic Features \/\/International Conference on Analysis of Images, Social Networks and Texts, Springer. Chapter 193\u2013204.","DOI":"10.1007\/978-3-030-37334-4_18"},{"key":"ref_32","unstructured":"Panicheva, P., and Litvinova, T. Authorship attribution in Russian in real-world forensics scenario. Proceedings of the International Conference on Statistical Language and Speech Processing;."},{"key":"ref_33","unstructured":"(2021, November 18). FastText: Library for Efficient Text Classification and Representation Learning. Available online: https:\/\/fasttext.cc\/."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Chowdhury, H., Imon, M., and Islam, M. (2018, January 13\u201315). Authorship Attribution in Bengali Literature Using fastText\u2019s Hierarchical Classifier. 
Proceedings of the 2018 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT), Dhaka, Bangladesh.","DOI":"10.1109\/CEEICT.2018.8628109"},{"key":"ref_35","unstructured":"Van Tussenbroek, T. (2020). Who said that? Comparing Performance of TF-IDF and fastText to Identify Authorship of Short Sentences. [Bachelor\u2019s Thesis, Delft University of Technology]."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"8665","DOI":"10.1007\/s00500-021-05717-1","article-title":"A wrapper metaheuristic framework for handwritten signature verification","volume":"25","author":"Hodashinsky","year":"2021","journal-title":"Soft Comput."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Svetlakov, M., Hodashinsky, I., and Slezkin, A. (2021, January 13\u201314). Gender, Age and Number of Participants Effects on Identification Ability of EEG-based Shallow Classifiers. Proceedings of the 2021 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), Yekaterinburg, Russia.","DOI":"10.1109\/USBEREIT51232.2021.9455114"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"22","DOI":"10.29001\/2073-8552-2020-35-4-22-31","article-title":"Fuzzy classifiers in cardiovascular disease diagnostics","volume":"35","author":"Hodashinsky","year":"2020","journal-title":"Sib. J. Clin. Exp. Med."},{"key":"ref_39","first-page":"1989","article-title":"A Hybrid Filter-Wrapper Feature Selection Approach for Authorship Attribution","volume":"15","author":"Ma","year":"2019","journal-title":"Int. J. Innov. Comput. Inf. Control."},{"key":"ref_40","unstructured":"Escalante, H., Montes, M., and Villase\u00f1or, L. Particle swarm model selection for authorship verification. Proceedings of the Iberoamerican Congress on Pattern Recognition;."},{"key":"ref_41","unstructured":"Mart\u00edn-del-Campo-Rodr\u00edguez, C. (2019, January 9\u201312). 
Authorship Attribution through Punctuation n-grams and Averaged Combination of SVM. Proceedings of the CLEF, Lugano, Switzerland."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Hitschler, J., Van Den Berg, E., and Rehbein, I. (2017, January 8). Authorship attribution with convolutional neural networks and POS-eliding. Proceedings of the Workshop on Stylistic Variation, Copenhagen, Denmark.","DOI":"10.18653\/v1\/W17-4907"},{"key":"ref_43","unstructured":"Huang, W., Su, R., and Iwaihara, M. Contribution of improved character embedding and latent posting styles to authorship attribution of short texts. Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Xing, L., and Qiao, Y. (2016, January 23\u201326). Deepwriter: A multi-stream deep CNN for text-independent writer identification. Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China.","DOI":"10.1109\/ICFHR.2016.0112"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"315","DOI":"10.1007\/s10032-019-00335-y","article-title":"An anchor-free region proposal network for Faster R-CNN-based text detection approaches","volume":"22","author":"Zhong","year":"2019","journal-title":"J. Doc. Anal. Recognit."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"143","DOI":"10.1177\/1475921718804132","article-title":"A novel deep learning-based method for damage identification of smart building structures","volume":"18","author":"Yu","year":"2019","journal-title":"Struct. Health Monit."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Breuel, T. (2017, January 9\u201315). High Performance Text Recognition Using a Hybrid Convolutional-lstm Implementation. 
Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan.","DOI":"10.1109\/ICDAR.2017.12"},{"key":"ref_48","unstructured":"(2021, November 18). Library of Maxim Moshkov. Available online: http:\/\/www.lib.ru\/."},{"key":"ref_49","unstructured":"Guo, Q., Qiu, X., Liu, P., Xue, X., and Zhang, Z. (2019). Multi-Scale Self-Attention for Text Classification. arXiv."},{"key":"ref_50","unstructured":"(2021, November 18). Sharov\u2019s Russian Frequency Dictionary. Available online: http:\/\/www.slovorod.ru\/freq-sharov\/index.html."},{"key":"ref_51","unstructured":"Ruder, S., Ghaffari, P., and Breslin, J. (2016). Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1016\/j.physa.2017.12.054","article-title":"On the role of words in the network structure of texts: Application to authorship attribution","volume":"495","author":"Akimushkin","year":"2018","journal-title":"Phys. A Stat. Mech. Its Appl."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"ii4","DOI":"10.1093\/llc\/fqx023","article-title":"Understanding and explaining Delta measures for authorship attribution","volume":"32","author":"Evert","year":"2017","journal-title":"Digit. Scholarsh. Humanit."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"591","DOI":"10.1007\/s10940-017-9346-9","article-title":"The analysis of bounded count data in criminology","volume":"34","author":"Britt","year":"2018","journal-title":"J. Quant. 
Criminol."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/14\/1\/4\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T12:06:16Z","timestamp":1721736376000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/14\/1\/4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12,22]]},"references-count":54,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,1]]}},"alternative-id":["fi14010004"],"URL":"https:\/\/doi.org\/10.3390\/fi14010004","relation":{},"ISSN":["1999-5903"],"issn-type":[{"value":"1999-5903","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,12,22]]}}}