Evolving Rules for Document Classification

  • Conference paper
Genetic Programming (EuroGP 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3447)

Abstract

We describe a novel method for using Genetic Programming to create compact classification rules based on combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers, in terms of precision and recall, when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters-21578 dataset. We also suggest that, because the induced rules are meaningful to a human analyst, they may have uses beyond classification and may provide a basis for text mining applications.
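
To make the idea of an N-Gram rule scored by precision and recall concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not list the paper's functions and terminals, so the `contains`, `and_`, `or_` and `not_` helpers, the use of F1 as the way to combine precision and recall, and the four toy training documents are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# A rule maps a document's text to a yes/no classification decision.
Rule = Callable[[str], bool]


def contains(ngram: str) -> Rule:
    """Hypothetical terminal: true when the character n-gram occurs in the text."""
    return lambda doc: ngram in doc.lower()


def and_(a: Rule, b: Rule) -> Rule:
    """Hypothetical function node: both sub-rules must match."""
    return lambda doc: a(doc) and b(doc)


def or_(a: Rule, b: Rule) -> Rule:
    """Hypothetical function node: either sub-rule may match."""
    return lambda doc: a(doc) or b(doc)


def not_(a: Rule) -> Rule:
    """Hypothetical function node: the sub-rule must not match."""
    return lambda doc: not a(doc)


def precision_recall(rule: Rule, docs: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Score a rule against labelled (text, is_in_category) training pairs."""
    tp = sum(1 for text, label in docs if label and rule(text))
    fp = sum(1 for text, label in docs if not label and rule(text))
    fn = sum(1 for text, label in docs if label and not rule(text))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fitness(rule: Rule, docs: List[Tuple[str, bool]]) -> float:
    """Assumed fitness: F1, the harmonic mean of precision and recall."""
    p, r = precision_recall(rule, docs)
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy training set (invented for this sketch, not taken from Reuters-21578).
TRAIN = [
    ("Wheat exports rose sharply this quarter", True),
    ("Grain and wheat prices fell", True),
    ("Bank raises interest rates", False),
    ("Oil output cut; crude prices climb", False),
]

# A compact, human-readable candidate rule of the kind a GP run might evolve.
candidate = or_(contains("whea"), contains("grai"))

if __name__ == "__main__":
    print(precision_recall(candidate, TRAIN))
    print(fitness(candidate, TRAIN))
```

In a genetic programming run, many such rule trees would be generated, scored with a fitness function of this kind, and recombined. The sketch covers only the scoring of a single candidate.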

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hirsch, L., Saeedi, M., Hirsch, R. (2005). Evolving Rules for Document Classification. In: Keijzer, M., Tettamanzi, A., Collet, P., van Hemert, J., Tomassini, M. (eds) Genetic Programming. EuroGP 2005. Lecture Notes in Computer Science, vol 3447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31989-4_8

  • DOI: https://doi.org/10.1007/978-3-540-31989-4_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25436-2

  • Online ISBN: 978-3-540-31989-4

  • eBook Packages: Computer Science, Computer Science (R0)
