Papers by Tatiana Sherstinova
Computational Linguistics and Intellectual Technologies, 2021
The paper presents the results of a study that is part of a large-scale project aimed at studying the changes that took place in the Russian language during the first three decades of the 20th century. In the history of Russia, this period was marked by turbulent events that led to a radical change in the state system and the formation of a new society. To quantify the scale of the changes that occurred in the language as a result of these dramatic events, it is necessary to analyze a representative volume of linguistic data and to compare different chronological periods using quantitative methods. The research was carried out on an annotated sample from the Corpus of the Russian Short Stories of 1900-1930, which contains texts by 300 Russian writers. All the texts in the Corpus are divided into three time frames: 1) the prewar period (1900-1913), 2) the war and revolutionary years (1914-1922) and 3) the early Soviet period (1923-1930). The frequency distribution of significant vocabulary was analyzed across these periods, which made it possible to identify the main tendencies in how the frequencies of individual words and lexical groups change from one historical period to another and to correlate them with the previously identified dynamics of literary themes. The technique makes it possible to trace the influence of large-scale political changes on the vocabulary of literary language, to note the peculiarities and tendencies of the writers' worldview in a given historical period, and to significantly supplement the analysis of the dynamics of literary themes in fiction.
Revolutionary changes in the Russian language and in fiction are examined in the context of the social upheavals of the first third of the 20th century, using the Russian short story as material. The source for this kind of research is the Computer Anthology of the Russian Short Story, which is being developed at the Department of Mathematical Linguistics of St. Petersburg University. The theoretical premises and applied aspects of this large-scale task are considered. The project is based on the ideas of the Russian formal school and, above all, on the systemic ideas of Yu. N. Tynyanov. Criteria are applied that ensure the representativeness of authors and their works on the basis of objective formal procedures, in particular the length of the article devoted to a given fiction writer in the Concise Literary Encyclopedia (Краткая литературная энциклопедия). In the future, the chronological scope of the Anthology is to be extended to cover the entire 20th century, as well as the end of the 19th and the beginning of the 21st century.
Language Resources and Evaluation, 1998
Routledge eBooks, Jun 27, 2023
The article describes a quantitative study of linearized subordination structures and dependency tree width. It concerns the rank distributions of subordination structures, taking into account both the number of dependent members and their right-branching and left-branching variants. The rank structures of different authors are found to be similar despite significant differences in their frequency characteristics. The work also deals with the symmetry properties of these structures and their bias towards mirror symmetry and golden symmetry. An approximate equilibrium of left and right width is revealed in the Russian short story of the first third of the 20th century. Otto Behaghel's law of increasing members was tested and adjusted for the Russian language.
One of the important tasks in creating the Corpus of Russian Short Stories of the first third of the 20th century is to identify and describe the changes that took place in the Russian language and in the stylistics of Russian literature in the chain of dramatic events of World War I, the February and October Revolutions, and the Civil War. The essential principle for creating the corpus is an attempt to include in the database literary texts by the maximum number of authors who wrote stories in 1900-1930. The article describes the principles of writer and text selection for the annotated subcorpus containing stories by 300 Russian prose writers and considers the list of linguistic and stylistic parameters proposed for studying the language of literary texts in synchrony and diachrony.
Вестник Томского государственного университета, Apr 1, 2021
The purpose of this paper is to test the methodological tools provided by the TXM open-source software for research on the dynamics of vocabulary and punctuation marks in diachronic corpora. TXM provides both quantitative and qualitative analysis features. It is shown that the Russian Revolution of 1917 did bring about significant changes in the core vocabulary of the corpus of Russian Short Stories (1901-1930). The same methodology may be used both for diachronic studies of literature and for various NLP tasks.
2019 25th Conference of Open Innovations Association (FRUCT)
Pragmatic markers (PMs) mainly influence the pragmatic aspect of communication and are mostly devoid of their own referential meaning. These markers are indispensable elements of oral communication in any language. The article suggests a typology of pragmatic markers for Russian everyday speech that includes 10 basic types. A frequency study of the use of the various marker types is carried out on the basis of two representative speech corpora: the corpus of Russian everyday speech "One Speech Day" (ORD) and the "Balanced Annotated Collection of Texts" (SAT). Preliminary data on PM distribution in dialogues and monologues were obtained, and the article describes the main difficulties one comes across while annotating PMs according to the authors' methodology. The main requirements for creating a Dictionary of Pragmatic Markers are enumerated. The paper indicates the scope of pragmatic markers and further prospects for their use, which include (but are not limited to) dataset labelling for the development of voice assistants and speech recognition systems.
The paper examines one of the features of Russian spontaneous speech: its tendency toward structural isochrony of its parts, which is especially evident in short phrases and utterances. The research data are taken from two blocks of the Russian Speech Corpus: the SAT balanced annotated corpus and the ORD corpus of Russian everyday communication. The SAT corpus contains monologue speech, whereas the ORD recordings consist of dialogues, polylogues, and isolated utterances. The aim of the research is to find experimental proof for the hypothesis that the isochrony of spoken language is in some cases maintained by the addition of discourse particles such as vot, nu, tam, koroche.
The lexical system is an essential component of any natural language. Frequency word lists are a convenient representation of the functional activity of words in a language as a whole or in a particular text. The parameters and properties of frequency word lists are a focus of attention for NLP experts, since they are used in numerous practical applications related to authorship attribution and automatic text clustering and classification. The article explores frequency word lists of Russian fiction in the period 1900-1930, which was marked by a series of dramatic historical events, and presents unique statistical data on the most frequent words, parts of speech and keywords, and their dynamics. Special attention is paid to the statistical consistency of frequency word list parameters, which becomes especially relevant when studying big text data. The study was carried out on the basis of fiction texts, which by the variety of topics, lexical and stylistic diversity reflects ...
2023 33rd Conference of Open Innovations Association (FRUCT)
2023 33rd Conference of Open Innovations Association (FRUCT)
Lecture Notes in Computer Science, 2014
Text-to-Speech has traditionally been viewed as a “black box” component, where standard “portfolio” voices are typically offered with a professional but “neutral” speaking style. For commercially important languages many different portfolio voices may be offered, all with similar speaking styles. A customer wishing to use TTS will typically choose one of these voices. The only alternative is to opt for a “custom voice” solution. In this case, a customer pays for a TTS voice to be created using their preferred voice talent. Such an approach allows for some “tuning” of the scripts used to create the voice. Limited script elements may be added to provide better coverage of the customer’s expected domain, and “gilded phrases” can be included to ensure that specific phrase fragments are spoken perfectly. However, even with such an approach the recording style is strictly controlled and standard scripts are augmented rather than redesigned from scratch. The “black box” approach to TTS allows for systems to be produced which satisfy the needs of a large number of customers, even if this means that solutions may be limited in the persona they present. Recent advances in conversational agent applications have changed people’s expectations of how a computer voice should sound and interact. Suddenly, it’s much more important for the TTS system to present a persona which matches the goals of the application. Such systems demand a more flamboyant, upbeat and expressive voice. The “black box” approach is no longer sufficient; voices for high-end conversational agents are being explicitly “designed” to meet the needs of such applications. These voices are both expressive and light in tone, and a complete contrast to the more conservative voices available for traditional markets. This paper will describe how Nuance is addressing this new and challenging market.
2022 32nd Conference of Open Innovations Association (FRUCT)
Proceedings of the International Conference IMS-2017
Internet video resources provide wide opportunities for studying the dissemination of national cultures and the impact of these cultures in different countries. For example, YouTube Analytics provides various information for each video published on YouTube, such as performance metrics (views and watch time), engagement metrics (likes, dislikes, comments, etc.) and demographics (the gender and location of viewers). The study described in this paper concerns the dynamics of interest in Russian music videos published on YouTube. The data for this research were taken from a personal music channel, https://www.youtube.com/GMartynenko, devoted to classical and traditional vocal music. The dynamics of interest in Russian music videos was traced over three years (2014-2016). It turned out that among the countries neighboring Russia it is possible to distinguish two groups of countries whose users have similar music preferences. The list of countries with high interest in Russian musical culture is presented. One of the leading positions in this list belongs to Latvia. For this reason, the dynamics of interest expressed by YouTube users from Latvia was analyzed separately for the same time period.
The phrasebook was recorded by three native Nenets speakers, who represent three main dialects of the Nenets language.
The phrasebook contains about 550 Russian phrases, which were translated and pronounced by the three speakers. These phrases relate to 21 main topics, which reflect rather well Nenets traditional culture and way of life.
The dictionary contains more than 3600 words. It reflects rather well Nganasan traditional culture and way of life, which are now almost unknown to the younger generation. The dictionary does not contain new Russian loan words, as they were not influenced by the phonological system of Nganasan and preserve the Russian pronunciation. The dictionary includes all the known Nganasan idioms. Some of the entries are accompanied by phrase or context examples. All words, idioms and phrases are provided with audio examples.
The present Comparative Dictionary is based on the following Nenets and Nganasan School Dictionaries: [Tereshchenko 1982; Kosterkina et al. 2001]. The corpus of the Dictionary contains about 1,000 frequently used words, viz. the words denoting various everyday activities related to the nomadic way of life, reindeer breeding, fishing, hunting; it also contains words for body parts, for details of the dwelling-place, weather, fauna and flora, types of landscapes, and others, including most words from the Swadesh wordlist.