Abstract
The paper presents an experiment that compares the results of topic modelling via non-negative matrix factorization (NMF) with those of manual topic annotation performed by an expert. The experiment was conducted on an annotated corpus of Russian short stories from the first three decades of the 20th century, which contains 310 stories with a total of 1,000,000 tokens written by 300 Russian writers. The scheme used for manual topic annotation includes 89 topics; this list was then reduced to 30 generalized topics, the most frequent of which turned out to be death, relationships, love, social groups, social processes, family, money, human sins, nature, religion, and war. The corpus, divided into three consecutive time periods, was then subjected to NMF topic modelling, which produced a model of 24 topics. The results of the two topic annotations were compared and described. The paper discusses the main findings of the study and the difficulties of topic modelling for fiction that should be taken into account. For example, the experimental results showed that topic modelling via NMF can be recommended primarily for revealing topics that belong to the general background of literary texts (e.g., war, love, nature, family) rather than for detecting topics related to critical events or relations between characters (e.g., death or relationships). Comparing human and automatic topic annotation appears to be an important step toward improving NLP techniques.
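To make the NMF step more concrete, the following is a minimal sketch of how a document-term matrix can be factorized into document-topic and topic-term matrices. It is an illustration only: the abstract does not say which NMF solver or library was used, so the multiplicative update rule, the toy count matrix `v`, the vocabulary `vocab`, and all function names here are assumptions introduced for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH (the reconstruction error)."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (documents x terms) into W (documents x topics) and
    H (topics x terms) with multiplicative updates (an assumed solver)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

def top_terms(h, vocab, n=3):
    """List the n highest-weighted vocabulary items for each topic row of H."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]

# Hypothetical toy data: 4 "stories" x 6 lemmas (raw counts, not real corpus data).
vocab = ["war", "soldier", "front", "love", "kiss", "heart"]
v = [
    [3, 2, 2, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 2, 3],
]
w, h = nmf(v, n_topics=2)
topics = top_terms(h, vocab)
print(topics)
```

Each row of `h` is read as one topic described by its highest-weighted words, and each row of `w` gives the topic weights for one story. In a setting like the one described above, the factorization could be run separately for each time period and the resulting word lists set against the expert's topic labels.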
Acknowledgements
The research is supported by the Russian Foundation for Basic Research, project #17-29-09173 “The Russian language on the edge of radical historical changes: the study of language and style in prerevolutionary, revolutionary and post-revolutionary artistic prose by the methods of mathematical and computer linguistics (a corpus-based research on Russian short stories)”.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Sherstinova, T., Mitrofanova, O., Skrebtsova, T., Zamiraylova, E., Kirina, M. (2020). Topic Modelling with NMF vs. Expert Topic Annotation: The Case Study of Russian Fiction. In: Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science, vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_13
DOI: https://doi.org/10.1007/978-3-030-60887-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60886-6
Online ISBN: 978-3-030-60887-3
eBook Packages: Computer Science, Computer Science (R0)