Topic Modelling with NMF vs. Expert Topic Annotation: The Case Study of Russian Fiction

  • Conference paper
Advances in Computational Intelligence (MICAI 2020)

Abstract

The paper presents an experiment comparing the results of topic modelling via non-negative matrix factorization (NMF) with those of manual topic annotation performed by an expert. The experiment was conducted on an annotated corpus of Russian short stories from the first three decades of the 20th century, which contains 310 stories with a total of 1,000,000 tokens written by 300 Russian writers. The annotation scheme includes 89 topics; this list was then reduced to 30 generalized topics, the most frequent of which turned out to be death, relationships, love, social groups, social processes, family, money, human sins, nature, religion, and war. The corpus, divided into three consecutive time periods, was then subjected to NMF topic modelling, which yielded a model with 24 topics. The results of the two kinds of topic annotation were compared and described. The paper discusses the main findings of the study and the difficulties of fiction topic modelling that should be taken into account. For example, the experimental results showed that topic modelling via NMF can be recommended primarily for revealing topics that refer to the general background of literary texts (e.g., war, love, nature, family) rather than for detecting topics related to critical events or relations between characters (e.g., death or relationships). Comparing human and automatic topic annotation appears to be an important step towards improving techniques related to NLP.
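The abstract names NMF but gives no implementation details, so the following is only a minimal illustrative sketch of the general technique, not the authors' pipeline. It assumes a raw term-count matrix, Lee-Seung multiplicative updates for the squared-error objective, two topics, and a toy corpus of English tokens. The helper names (`transpose`, `matmul`, `nmf`) and all parameter values are hypothetical.

```python
import random

def transpose(M):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply two list-of-lists matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(V, k, iters=300, seed=0, eps=1e-9):
    """Approximate V (documents x terms) as W @ H with non-negative factors.

    W holds document-topic weights, H holds topic-term weights.
    Uses multiplicative updates, which keep every entry non-negative.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

# Toy corpus (assumption): four tiny "documents" already split into lemmas.
docs = [
    "war soldier front army war",
    "soldier army front battle war",
    "love heart kiss letter love",
    "heart love letter kiss wedding",
]
vocab = sorted({w for d in docs for w in d.split()})
V = [[d.split().count(w) for w in vocab] for d in docs]

W, H = nmf(V, k=2)

# Dominant topic of each document: index of the largest weight in its W row.
dominant = [max(range(2), key=lambda a: row[a]) for row in W]

# Describe each topic by its three highest-weighted terms.
for a, weights in enumerate(H):
    top = sorted(zip(weights, vocab), reverse=True)[:3]
    print(f"topic {a}:", [w for _, w in top])
print("dominant topic per document:", dominant)
```

In practice one would more likely rely on an existing implementation such as `sklearn.decomposition.NMF` and on lemmatised, weighted input; the pure-Python version above only makes the factorization step explicit. Comparing the dominant model topic of each story with the expert's label is one conceivable way to set the two annotations side by side, but the abstract does not state how the comparison was actually carried out.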

Acknowledgements

The research is supported by the Russian Foundation for Basic Research, project #17-29-09173 “The Russian language on the edge of radical historical changes: the study of language and style in prerevolutionary, revolutionary and post-revolutionary artistic prose by the methods of mathematical and computer linguistics (a corpus-based research on Russian short stories)”.

Author information

Corresponding author

Correspondence to Tatiana Sherstinova.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Sherstinova, T., Mitrofanova, O., Skrebtsova, T., Zamiraylova, E., Kirina, M. (2020). Topic Modelling with NMF vs. Expert Topic Annotation: The Case Study of Russian Fiction. In: Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science, vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_13

  • DOI: https://doi.org/10.1007/978-3-030-60887-3_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60886-6

  • Online ISBN: 978-3-030-60887-3

  • eBook Packages: Computer Science, Computer Science (R0)
