Abstract
The paper presents statistical data on the distribution of parts of speech (POS) at the beginnings and ends of everyday Russian utterances. The material is a morphologically annotated subcorpus of the ORD corpus of spoken Russian, comprising 149,737 tokens and containing fragments of everyday speech from 213 people of different gender, age, and professional groups. The study uses n-gram analysis, a method typically employed in computational linguistics to construct probabilistic language models. In the subcorpus as a whole, the most frequent POS are verbs (17.23%), personal pronouns (15.60%), nouns (14%), particles (13%), and conjunctions (9%). In utterance-initial position, however, the most frequent POS are particles (19.99%) and conjunctions (12%), whereas in utterance-final position verbs and nouns occur more often than other POS. Utterance-final verbs are more typical of interrogative (27.66%) and narrative (25.42%) utterances, and utterance-final nouns are frequent in exclamative (29.95%) and narrative (24.28%) utterances. In addition, the most typical utterance-initial bigrams and trigrams beginning with a particle are presented together with their probabilities. The high percentage of syntactic models with a particle in utterance-initial position suggests that these units have special pragmatic functions associated with marking phrase boundaries. The statistical data obtained may be used for modeling everyday utterances in a variety of dialogue systems and for improving Russian speech recognition systems.
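As an illustration only, the position-specific statistics described in the abstract could be counted along the following lines. This is a minimal sketch, assuming each utterance is already available as a list of POS tags. The tag names, the toy data, and the function names `initial_ngram_probs` and `final_pos_probs` are hypothetical and do not reflect the tagset or tooling used in the study.

```python
from collections import Counter


def initial_ngram_probs(utterances, n):
    """Relative frequency of each POS n-gram found at the start of utterances.

    Utterances shorter than n tags are skipped.
    """
    counts = Counter(tuple(tags[:n]) for tags in utterances if len(tags) >= n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def final_pos_probs(utterances):
    """Relative frequency of each POS tag in utterance-final position."""
    counts = Counter(tags[-1] for tags in utterances if tags)
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()} if total else {}


# Toy data (hypothetical tags): one list of POS tags per utterance.
utterances = [
    ["PART", "PRON", "VERB"],
    ["PART", "VERB", "NOUN"],
    ["CONJ", "PRON", "VERB"],
    ["PART", "PRON", "NOUN"],
]

initial_unigrams = initial_ngram_probs(utterances, 1)
initial_bigrams = initial_ngram_probs(utterances, 2)
final_tags = final_pos_probs(utterances)
```

Here the probabilities are plain relative frequencies over all utterances that are long enough. Restricting the result to n-grams whose first tag is a particle would be a simple filter on the returned dictionary keys.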
Notes
1. All respondents taking part in the recording had to fill out a sociological questionnaire about themselves and their main interlocutors. However, the information received in this way is available only for 70% of interlocutors.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Sherstinova, T. (2018). Quantitative Data on POS Distribution in the Beginnings and the Ends of Utterances in Everyday Russian Speech. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_61
DOI: https://doi.org/10.1007/978-3-319-99579-3_61
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science (R0)