Preventing Overfitting in Learning Text Patterns for Document Categorization

Junker, Markus; Dengel, Andreas

doi:10.1007/3-540-44732-6_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2013)

Included in the following conference series:

International Conference on Advances in Pattern Recognition

Abstract

There is increasing interest in categorizing texts using learning algorithms. While the majority of approaches rely on learning linear classifiers, there is also some interest in describing document categories by text patterns. We introduce a model for learning patterns for text categorization (the LPT-model) that does not rely on an attribute-value representation of documents but represents documents essentially “as they are”. Based on the LPT-model, we focus on learning patterns within a relatively simple pattern language. We compare different search heuristics and pruning methods known from various symbolic rule learners on a set of representative text categorization problems. The best results were obtained using the m-estimate as the search heuristic combined with the likelihood-ratio statistic for pruning. Even better results can be obtained by replacing the likelihood-ratio statistic with a new pruning measure, which we call the l-measure. In contrast to conventional measures for pruning, the l-measure takes into account properties of the search space.
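To make the two measures named in the abstract concrete, the following is a minimal sketch of the m-estimate and the likelihood-ratio statistic as they are commonly defined in the symbolic rule-learning literature (for example in CN2-style learners). It is not taken from the paper itself: the function names, the parameter names (`p`, `n`, `P`, `N`), and the default `m=2.0` are assumptions made for illustration, and the paper's exact formulation may differ. The l-measure is not sketched because the abstract does not define it.

```python
import math


def m_estimate(p, n, P, N, m=2.0):
    """m-estimate of a pattern's precision (textbook form, assumed here).

    p, n: positive / negative training documents matched by the pattern
    P, N: total positive / negative training documents
    m:    weight given to the class prior (illustrative default)
    """
    prior = P / (P + N)
    return (p + m * prior) / (p + n + m)


def likelihood_ratio_statistic(p, n, P, N):
    """Likelihood-ratio statistic 2 * sum(observed * ln(observed / expected)).

    Compares the class distribution among matched documents with the
    distribution expected from the class prior. Terms with a zero observed
    count contribute nothing (0 * ln 0 is taken as 0).
    """
    covered = p + n
    if covered == 0:
        return 0.0
    expected_pos = covered * P / (P + N)
    expected_neg = covered * N / (P + N)
    stat = 0.0
    for observed, expected in ((p, expected_pos), (n, expected_neg)):
        if observed > 0:
            stat += observed * math.log(observed / expected)
    return 2.0 * stat


if __name__ == "__main__":
    # Hypothetical counts: 10 positive and 90 negative training documents;
    # the candidate pattern matches 8 positives and 2 negatives.
    print(m_estimate(8, 2, 10, 90))
    print(likelihood_ratio_statistic(8, 2, 10, 90))
```

In a setup of this kind, the m-estimate would rank candidate patterns during search, while a pattern whose likelihood-ratio statistic falls below a chosen significance threshold (for instance the chi-square critical value with one degree of freedom) would be pruned as a likely product of chance.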

Author information

Authors and Affiliations

German Research Center for Artificial Intelligence (DFKI) GmbH, P.O. 2080, D-67608, Kaiserslautern, Germany
Markus Junker & Andreas Dengel

Editor information

Editors and Affiliations

Department of Computer Science, University of Exeter, EX4 4PT, Exeter, UK
Sameer Singh
Computational Intelligence Group, Tuiuti University of Parana, Curitiba, Brazil
Nabeel Murshed
Institute of Computer Aided Automation PRIP-Group 1832, Vienna University of Technology, Favoritenstr. 9/2/4, 1040, Wien, Austria
Walter Kropatsch

About this paper

Cite this paper

Junker, M., Dengel, A. (2001). Preventing Overfitting in Learning Text Patterns for Document Categorization. In: Singh, S., Murshed, N., Kropatsch, W. (eds) Advances in Pattern Recognition — ICAPR 2001. ICAPR 2001. Lecture Notes in Computer Science, vol 2013. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44732-6_14

DOI: https://doi.org/10.1007/3-540-44732-6_14
Published: 09 May 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41767-5
Online ISBN: 978-3-540-44732-0
eBook Packages: Springer Book Archive
