Abstract
There is increasing interest in categorizing texts using learning algorithms. While the majority of approaches rely on learning linear classifiers, there is also some interest in describing document categories by text patterns. We introduce a model for learning patterns for text categorization (the LPT-model) that does not rely on an attribute-value representation of documents but represents documents essentially “as they are”. Based on the LPT-model, we focus on learning patterns within a relatively simple pattern language. We compare different search heuristics and pruning methods known from various symbolic rule learners on a set of representative text categorization problems. The best results were obtained using the m-estimate as the search heuristic combined with the likelihood-ratio statistic for pruning. Even better results can be obtained by replacing the likelihood-ratio statistic with a new pruning measure, which we call the l-measure. In contrast to conventional pruning measures, the l-measure takes properties of the search space into account.
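For readers unfamiliar with the two conventional measures named above, the following is a minimal sketch of their standard textbook forms as used in rule learners such as CN2, for a two-class (category vs. rest) setting. The abstract does not give the exact definitions or parameter settings used in the paper, so the function names, the default `m=2.0`, and the 3.84 significance threshold (95% level, one degree of freedom) are assumptions made for illustration. The l-measure is not sketched, since the abstract does not define it.

```python
import math


def m_estimate(p, n, P, N, m=2.0):
    """m-estimate of a pattern's precision.

    p, n: positive / negative training documents matched by the pattern.
    P, N: total positive / negative training documents.
    m:    weight of the class prior (illustrative default, not from the paper).
    """
    prior = P / (P + N)
    return (p + m * prior) / (p + n + m)


def likelihood_ratio_statistic(p, n, P, N):
    """Likelihood-ratio statistic 2 * sum(observed * ln(observed / expected)).

    Compares the class distribution among matched documents with the
    distribution expected if the pattern matched documents at random.
    """
    covered = p + n
    if covered == 0:
        return 0.0
    expected_pos = covered * P / (P + N)
    expected_neg = covered * N / (P + N)
    stat = 0.0
    for observed, expected in ((p, expected_pos), (n, expected_neg)):
        if observed > 0:  # a zero count contributes nothing to the sum
            stat += observed * math.log(observed / expected)
    return 2.0 * stat


def keep_pattern(p, n, P, N, threshold=3.84):
    """Pruning decision: keep the pattern only if it is significant.

    The threshold is an assumed chi-square cutoff, not a value from the paper.
    """
    return likelihood_ratio_statistic(p, n, P, N) >= threshold
```

In a typical separate-and-conquer learner, `m_estimate` would rank candidate refinements of a pattern during search, while `keep_pattern` would reject refinements whose apparent precision could plausibly be due to chance on a small number of matched documents.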
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Junker, M., Dengel, A. (2001). Preventing Overfitting in Learning Text Patterns for Document Categorization. In: Singh, S., Murshed, N., Kropatsch, W. (eds) Advances in Pattern Recognition — ICAPR 2001. ICAPR 2001. Lecture Notes in Computer Science, vol 2013. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44732-6_14
DOI: https://doi.org/10.1007/3-540-44732-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41767-5
Online ISBN: 978-3-540-44732-0