Abstract
The statistical parameters commonly used in diagnostic procedures often cannot be considered statistically consistent, because they depend strongly on sample size. This considerably devalues diagnostic results. This paper addresses the problem of verifying the consistency of parameters at the initial (pre-classification) stage of research. A complete list of parameters that may be useful for describing the lexicostatistical structure of a text was compiled, and each parameter was subjected to a justifiability test. As a result, a number of consistent parameters were selected; together they serve as a tool for describing the system characteristics of any text or corpus. Because they converge rapidly to their limit values, they can be used effectively in classification procedures on text data of arbitrary size. The proposed approximation model also makes it possible to forecast the values of all parameters for any sample size.
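The dependence on sample size described above can be illustrated with a small sketch. The abstract does not name the parameters studied or define the justifiability test, so everything here is an assumption made for illustration: the type-token ratio and mean token length stand in for lexicostatistical parameters, the helper names (`parameter_profile`, `relative_spread`) are hypothetical, and the token sequence is synthetic toy data rather than a real corpus.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (illustrative parameter)."""
    return len(set(tokens)) / len(tokens)


def mean_token_length(tokens):
    """Average token length in characters (illustrative parameter)."""
    return sum(len(t) for t in tokens) / len(tokens)


def parameter_profile(tokens, sizes, parameter):
    """Evaluate `parameter` on growing prefixes of the token sequence."""
    return [(n, parameter(tokens[:n])) for n in sizes if 0 < n <= len(tokens)]


def relative_spread(profile):
    """(max - min) / mean of the values; a small spread suggests stability."""
    values = [value for _, value in profile]
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


# Synthetic toy text (an assumption): a closed ten-word vocabulary, repeated.
vocabulary = ["the", "cat", "sat", "on", "a", "mat", "and", "dog", "ran", "off"]
tokens = [vocabulary[i % len(vocabulary)] for i in range(1000)]
sizes = [10, 100, 1000]

ttr_profile = parameter_profile(tokens, sizes, type_token_ratio)
length_profile = parameter_profile(tokens, sizes, mean_token_length)

for n, value in ttr_profile:
    print(f"n={n:5d}  type-token ratio={value:.3f}")
for n, value in length_profile:
    print(f"n={n:5d}  mean token length={value:.3f}")
```

On this toy data the type-token ratio keeps falling as the sample grows, while the mean token length stays the same at every size. A spread measure such as `relative_spread` is only one possible way to compare the two behaviours and is not the test used in the paper.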
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martynenko, G.Y., Sherstinova, T.Y. (2000). Statistical Parameterisation of Text Corpora. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2000. Lecture Notes in Computer Science, vol 1902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45323-7_17
DOI: https://doi.org/10.1007/3-540-45323-7_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41042-3
Online ISBN: 978-3-540-45323-9
eBook Packages: Springer Book Archive