Quantitative Data on POS Distribution in the Beginnings and the Ends of Utterances in Everyday Russian Speech

Sherstinova, Tatiana

doi:10.1007/978-3-319-99579-3_61

Tatiana Sherstinova, ORCID: orcid.org/0000-0002-9085-3378

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

The paper presents statistical data on the distribution of parts of speech (POS) at the beginnings and ends of utterances in everyday Russian speech. The material for the study is a morphologically annotated subcorpus of the ORD corpus of spoken Russian, comprising 149,737 tokens and containing fragments of everyday speech from 213 people of different gender, age, and professional groups. The study uses n-gram analysis, a method typically employed in computational linguistics to construct probabilistic language models. In the subcorpus as a whole, the most frequent POS are verbs (17.23%), personal pronouns (15.60%), nouns (14%), particles (13%), and conjunctions (9%). In the initial position of spoken utterances, however, the most frequent POS are particles (19.99%) and conjunctions (12%), whereas in the final position verbs and nouns occur more often than other POS. Of these, verbs are more typical of interrogative (27.66%) and narrative (25.42%) utterances, and nouns are frequent in exclamative (29.95%) and narrative (24.28%) utterances. The paper also presents the most typical utterance-initial bigrams and trigrams that start with a particle, together with their probabilities. The high percentage of syntactic models with a particle in utterance-initial position suggests that these units have special pragmatic functions associated with marking phrase boundaries. The statistical data obtained may be used for modeling everyday utterances in a variety of dialogue systems and for improving Russian speech recognition systems.
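The kind of counting described in the abstract can be illustrated with a minimal Python sketch. It is not the author's pipeline: the utterances, the tag names (PART, CONJ, PRON, VERB, NOUN), and the function names are made up for illustration, and probabilities are taken as plain relative frequencies.

```python
from collections import Counter

# Toy, invented utterances given as lists of POS tags (hypothetical tag set).
# A real study would read these from a morphologically annotated corpus.
UTTERANCES = [
    ["PART", "PRON", "VERB"],
    ["CONJ", "PRON", "VERB", "NOUN"],
    ["PART", "VERB"],
    ["PART", "PRON", "NOUN"],
]


def position_distribution(utterances, index):
    """Relative frequency of POS tags at one position (0 = first, -1 = last)."""
    counts = Counter(u[index] for u in utterances if u)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def initial_ngram_probabilities(utterances, n):
    """Relative frequency of the first n tags, over utterances with >= n tags."""
    grams = Counter(tuple(u[:n]) for u in utterances if len(u) >= n)
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


initial = position_distribution(UTTERANCES, 0)
final = position_distribution(UTTERANCES, -1)
bigrams = initial_ngram_probabilities(UTTERANCES, 2)

print("initial:", initial)
print("final:", final)
print("initial bigrams:", bigrams)
```

Passing `n=3` to `initial_ngram_probabilities` gives the corresponding utterance-initial trigram frequencies; restricting the result to keys whose first tag is `PART` mirrors the particle-initial models mentioned above.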

Notes

1. All respondents taking part in the recording had to fill out a sociological questionnaire about themselves and their main interlocutors. However, the information received in this way is available only for 70% of interlocutors.

Acknowledgements

The results described in Sects. 3.2–4 of this paper were obtained in the framework of a study supported by the Russian Science Foundation, project #18-18-00242 “Pragmatic Markers in Russian Everyday Speech”.

Author information

Authors and Affiliations

Saint Petersburg State University, St. Petersburg, 199034, Russia
Tatiana Sherstinova
National Research University Higher School of Economics, St. Petersburg, 190068, Russia
Tatiana Sherstinova

Corresponding author

Correspondence to Tatiana Sherstinova.

Editor information

Editors and Affiliations

SPIIRAS, St. Petersburg, Russia
Alexey Karpov
Leipzig University of Telecommunications, Leipzig, Germany
Oliver Jokisch
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Sherstinova, T. (2018). Quantitative Data on POS Distribution in the Beginnings and the Ends of Utterances in Everyday Russian Speech. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_61

DOI: https://doi.org/10.1007/978-3-319-99579-3_61
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science, Computer Science (R0)
