Abstract
We describe a method for using Genetic Programming (GP) to evolve document classifiers. Each GP individual is a regular-expression-like specification built from particular sequences and patterns of N-Grams (character strings). An individual gains fitness by producing expressions that match documents in a particular category but do not match documents in any other category. Libraries of N-Gram patterns are evolved against sets of pre-categorised training documents and are then used to discriminate between new texts. We describe a basic set of functions and terminals and provide results from a categorisation task using the 20 Newsgroups data.
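The fitness idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_gap` parameter, and the scoring formula (fraction of in-category documents matched minus fraction of out-of-category documents matched) are assumptions chosen for illustration, and the paper's actual function and terminal set may differ.

```python
import re


def ngram_sequence_to_regex(ngrams, max_gap=20):
    """Compile a pattern that matches the N-Grams in order.

    Hypothetical helper: up to `max_gap` arbitrary characters are
    allowed between consecutive N-Grams.
    """
    gap = ".{0,%d}" % max_gap
    return re.compile(gap.join(re.escape(g) for g in ngrams), re.DOTALL)


def fitness(pattern, in_category, out_of_category):
    """Score a pattern on pre-categorised training documents.

    Assumed scoring: reward matches inside the target category and
    penalise matches in any other category.
    """
    def match_rate(docs):
        if not docs:
            return 0.0
        return sum(1 for d in docs if pattern.search(d)) / len(docs)

    return match_rate(in_category) - match_rate(out_of_category)


if __name__ == "__main__":
    # Toy documents, invented for this example only.
    space_docs = ["the space shuttle launch was delayed",
                  "nasa plans a new shuttle orbit"]
    other_docs = ["the hockey team won the game",
                  "a shuttle bus runs to the arena"]
    candidate = ngram_sequence_to_regex(["shut", "orb"], max_gap=10)
    print(fitness(candidate, space_docs, other_docs))
```

In a GP run, a score of this kind would be computed for every individual in the population, and the better-scoring expressions would be favoured during selection.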
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hirsch, L., Saeedi, M., Hirsch, R. (2004). Evolving Text Classifiers with Genetic Programming. In: Keijzer, M., O’Reilly, UM., Lucas, S., Costa, E., Soule, T. (eds) Genetic Programming. EuroGP 2004. Lecture Notes in Computer Science, vol 3003. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24650-3_29
DOI: https://doi.org/10.1007/978-3-540-24650-3_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21346-8
Online ISBN: 978-3-540-24650-3
eBook Packages: Springer Book Archive