Abstract
Linguistic information can improve the evaluation of similarity between documents; however, which kind of linguistic information to use depends on the task. In this paper, we show that distributions of syntactic structures capture the way works are written and correctly identify individual books more than 76% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 66%. However, testing the same features on authorship attribution shows that distributions of syntactic structures are less successful than function words on that task: syntactic structures vary even among the works of the same author, whereas features such as function words are distributed more similarly across an author's works and capture authorship more effectively.
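To make the function-word baseline concrete, the following is a minimal sketch of how a function-word profile of a text might be computed and compared. It is not the feature extraction used in the paper: the word list `FUNCTION_WORDS`, the helper names `function_word_profile` and `cosine_similarity`, and the choice of cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
import math
import re

# Hypothetical, very small function-word list chosen for illustration only;
# it is not the list used in the paper.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "it", "but", "with"]


def function_word_profile(text):
    """Return the relative frequency of each function word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this sketch, two texts whose function words are distributed similarly yield profiles with a cosine similarity close to 1, which is the kind of regularity the abstract attributes to the works of a single author.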
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Uzuner, Ö., Katz, B. (2005). A Comparative Study of Language Models for Book and Author Recognition. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_84
DOI: https://doi.org/10.1007/11562214_84
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)