Abstract
Prediction by partial match (PPM) is an effective tool for addressing the author recognition problem. In this study, we successfully apply the trained PPM technique to author recognition on Turkish texts. We also investigate how the recency and the size of the training text affect the performance of the PPM approach. The results show that more recent and larger training texts lower the compression rate, which in turn increases the success of author recognition. Comparing the two factors, we find that the size of the training text plays the more dominant role in performance.
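The attribution scheme summarized in the abstract can be illustrated with a minimal sketch: train one character model per candidate author, code the disputed text with each model, and pick the author whose model yields the lowest compression rate in bits per character. This is not the authors' implementation. It assumes a fixed-order context model with add-one smoothing as a simplified stand-in for PPM's escape mechanism, and the function names, the `order` parameter, and the `alphabet_size` value are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(text, order=2):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    padded = " " * order + text  # pad so the first characters have a context
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts


def bits_per_char(model, text, order=2, alphabet_size=256):
    """Cross-entropy of a non-empty `text` under `model`, in bits per character.

    Assumption: add-one smoothing over a nominal alphabet replaces the
    escape probabilities that a real PPM coder would use.
    """
    padded = " " * order + text
    total_bits = 0.0
    for i in range(order, len(padded)):
        followers = model.get(padded[i - order:i], Counter())
        seen = followers[padded[i]]
        p = (seen + 1) / (sum(followers.values()) + alphabet_size)
        total_bits -= math.log2(p)
    return total_bits / len(text)


def recognize(training_texts, disputed, order=2):
    """Return the author whose model codes `disputed` most cheaply, plus all scores."""
    scores = {
        author: bits_per_char(train(text, order), disputed, order)
        for author, text in training_texts.items()
    }
    return min(scores, key=scores.get), scores


# Toy data, invented for illustration only.
training_texts = {
    "author_a": "the sea was calm and the sea was grey " * 20,
    "author_b": "markets rallied as traders priced in cuts " * 20,
}
best, scores = recognize(training_texts, "the sea was grey and calm")
print(best, scores)
```

In this setup, the recency and size effects studied in the paper correspond to varying which texts, and how much text, are passed to `train` for each author while the disputed text is held fixed.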
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Celikel, E., Dalkılıç, M.E. (2004). Investigating the Effects of Recency and Size of Training Text on Author Recognition Problem. In: Aykanat, C., Dayar, T., Körpeoğlu, İ. (eds) Computer and Information Sciences - ISCIS 2004. ISCIS 2004. Lecture Notes in Computer Science, vol 3280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30182-0_3
DOI: https://doi.org/10.1007/978-3-540-30182-0_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23526-2
Online ISBN: 978-3-540-30182-0
eBook Packages: Springer Book Archive