Abstract
We describe a novel method for using Genetic Programming to create compact classification rules based on combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers, in terms of precision and recall, when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters-21578 dataset. We also suggest that, because the induced rules are meaningful to a human analyst, they may have uses beyond classification and could provide a basis for text mining applications.
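To make the idea concrete, the following is a minimal sketch of how a rule built from N-Gram tests might be scored against labelled training documents. The abstract does not specify the paper's function and terminal set or its exact fitness formula, so everything here is an assumption for illustration: the helper names (`contains`, `and_`, `or_`, `not_`, `f1_fitness`), the case-insensitive substring match, and the use of the F1 measure to combine precision and recall.

```python
from typing import Callable, List, Tuple

# A rule maps a document's text to True (in category) or False.
Rule = Callable[[str], bool]


def contains(ngram: str) -> Rule:
    """Hypothetical terminal: true when the character n-gram occurs in the text."""
    needle = ngram.lower()
    return lambda doc: needle in doc.lower()


def and_(a: Rule, b: Rule) -> Rule:
    """Hypothetical function node: both sub-rules must match."""
    return lambda doc: a(doc) and b(doc)


def or_(a: Rule, b: Rule) -> Rule:
    """Hypothetical function node: either sub-rule may match."""
    return lambda doc: a(doc) or b(doc)


def not_(a: Rule) -> Rule:
    """Hypothetical function node: the sub-rule must not match."""
    return lambda doc: not a(doc)


def f1_fitness(rule: Rule, docs: List[Tuple[str, bool]]) -> float:
    """Score a rule by F1, one assumed way to combine precision and recall."""
    tp = sum(1 for text, label in docs if label and rule(text))
    fp = sum(1 for text, label in docs if not label and rule(text))
    fn = sum(1 for text, label in docs if label and not rule(text))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy training set (invented for illustration, not taken from Reuters-21578).
training = [
    ("Wheat exports rose sharply", True),
    ("Grain harvest and wheat prices", True),
    ("Crude oil prices fell", False),
    ("Bank raises interest rates", False),
]

good_rule = or_(contains("whea"), contains("grai"))
weak_rule = contains("pri")

good_score = f1_fitness(good_rule, training)
weak_score = f1_fitness(weak_rule, training)
```

In a genetic programming run, a score of this kind would be computed for every candidate rule in the population, and higher-scoring rules would be more likely to be selected for crossover and mutation. Because each rule is a readable combination of short strings, an analyst can inspect which character sequences the evolved classifier relies on.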
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hirsch, L., Saeedi, M., Hirsch, R. (2005). Evolving Rules for Document Classification. In: Keijzer, M., Tettamanzi, A., Collet, P., van Hemert, J., Tomassini, M. (eds) Genetic Programming. EuroGP 2005. Lecture Notes in Computer Science, vol 3447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31989-4_8
DOI: https://doi.org/10.1007/978-3-540-31989-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25436-2
Online ISBN: 978-3-540-31989-4
eBook Packages: Computer Science, Computer Science (R0)