Quantitative Data on POS Distribution in the Beginnings and the Ends of Utterances in Everyday Russian Speech

  • Conference paper
Speech and Computer (SPECOM 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)

Abstract

The paper presents statistical data on POS distribution at the beginnings and ends of everyday Russian utterances. The material for the study was a morphologically annotated subcorpus of the ORD corpus of spoken Russian, comprising 149,737 tokens and containing fragments of everyday speech of 213 people of different gender, age, and professional groups. The study used n-gram analysis, a method typically employed in computational linguistics to construct probabilistic language models. In the subcorpus as a whole, the most frequent POS turned out to be verbs (17.23%), personal pronouns (15.60%), nouns (14%), particles (13%), and conjunctions (9%). In the initial position of spoken utterances, however, the most frequent POS are particles (19.99%) and conjunctions (12%), while in the final position verbs and nouns are used more often than other POS. Utterance-final verbs are most typical of interrogative (27.66%) and narrative (25.42%) utterances, and utterance-final nouns are frequent in exclamative (29.95%) and narrative (24.28%) utterances. In addition, the paper presents the most typical bigrams and trigrams that begin utterances with a particle, together with their probabilities. The high percentage of syntactic models with a particle in the initial position leads us to assume that these units have special pragmatic functions associated with marking phrase boundaries. The statistical data obtained may be used for modeling everyday utterances in a variety of dialogue systems and for improving Russian speech recognition systems.
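
The positional counts summarized in the abstract can be illustrated with a minimal Python sketch. It assumes each utterance is already available as a list of POS tags. The tag names, the toy data, and the function names are hypothetical and are not taken from the paper, whose actual tooling is not described on this page.

```python
from collections import Counter

def positional_pos_counts(utterances):
    """Count POS tags overall, in utterance-initial and in utterance-final position.

    `utterances` is a list of POS-tag lists, one list per utterance
    (a hypothetical input format chosen for this illustration).
    """
    overall = Counter(tag for u in utterances for tag in u)
    initial = Counter(u[0] for u in utterances if u)
    final = Counter(u[-1] for u in utterances if u)
    return overall, initial, final

def initial_ngram_counts(utterances, n):
    """Count the POS n-grams that open each utterance of length >= n."""
    return Counter(tuple(u[:n]) for u in utterances if len(u) >= n)

def relative(counter):
    """Turn raw counts into relative frequencies (empty dict if no counts)."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()} if total else {}

# Toy data, invented for illustration only (not from the ORD corpus).
toy_utterances = [
    ["PART", "PRON", "VERB"],
    ["CONJ", "PRON", "VERB", "NOUN"],
    ["PART", "VERB"],
    ["PART", "PRON", "VERB", "NOUN"],
]

overall, initial, final = positional_pos_counts(toy_utterances)
initial_share = relative(initial)            # share of each POS in initial position
final_share = relative(final)                # share of each POS in final position
bigram_share = relative(initial_ngram_counts(toy_utterances, 2))
trigram_share = relative(initial_ngram_counts(toy_utterances, 3))
```

Comparing `initial_share` and `final_share` with `relative(overall)` gives the kind of contrast reported in the abstract, where particles are more prominent utterance-initially than in the subcorpus as a whole. The n-gram shares here are plain relative frequencies among utterance openings; a full language model would normally add smoothing.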

Notes

  1. All respondents taking part in the recording had to fill out a sociological questionnaire about themselves and their main interlocutors. However, the information received in this way is available only for 70% of interlocutors.

Acknowledgements

The results described in Sects. 3.24 of this paper were obtained in the framework of a study supported by the Russian Science Foundation, project #18-18-00242 “Pragmatic Markers in Russian Everyday Speech”.

Corresponding author

Correspondence to Tatiana Sherstinova.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Sherstinova, T. (2018). Quantitative Data on POS Distribution in the Beginnings and the Ends of Utterances in Everyday Russian Speech. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_61

  • DOI: https://doi.org/10.1007/978-3-319-99579-3_61

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99578-6

  • Online ISBN: 978-3-319-99579-3

  • eBook Packages: Computer Science, Computer Science (R0)
