Abstract
The paper demonstrates that adding automatically selected word-pairs substantially increases the accuracy of text classification, contrary to most previously reported research. The word-pairs are selected using a technique based on frequencies of n-grams (sequences of characters), which takes into account both the frequencies of the word-pairs and the contexts in which they occur.
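To make the idea of augmenting a word representation with selected word-pairs concrete, here is a minimal Python sketch. It is not the authors' selection technique, which the abstract describes only as based on character n-gram frequencies and context. As a stand-in, it keeps adjacent word-pairs that occur in at least `min_doc_freq` documents. The function names, the threshold, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def select_pairs(docs, min_doc_freq=2):
    """Keep adjacent word-pairs that appear in at least min_doc_freq documents.

    Assumption: a simple document-frequency cut-off stands in for the
    paper's n-gram-based selection, which is not specified in the abstract.
    """
    doc_freq = Counter()
    for text in docs:
        tokens = tokenize(text)
        # Count each pair at most once per document.
        doc_freq.update(set(zip(tokens, tokens[1:])))
    return {pair for pair, count in doc_freq.items() if count >= min_doc_freq}

def augment(text, pairs):
    """Return the single-word features followed by the selected pair features."""
    tokens = tokenize(text)
    extra = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:]) if (a, b) in pairs]
    return tokens + extra

# Hypothetical toy corpus.
docs = [
    "support vector machines classify text",
    "support vector machines need features",
    "text features help nearest neighbours",
]
pairs = select_pairs(docs)
features = augment(docs[0], pairs)
```

The augmented feature list can then be passed to any bag-of-features classifier, so the single-word baseline and the pair-augmented variant differ only in their input representation.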
These improvements are reported for two different classifiers, support vector machines (SVM) and k-nearest neighbours (kNN), and two different text corpora. For the first corpus, a collection of articles from PC Week magazine, the addition of word-pairs increases micro-averaged breakeven accuracy by more than 6 percentage points from a baseline accuracy (without pairs) of around 40%. For the second, the standard Reuters benchmark, an SVM classifier augmented with pairs outperforms all previously reported results.
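The micro-averaged breakeven figure quoted above refers to the point at which precision equals recall, pooled over all categories. The abstract does not give the exact evaluation procedure, so the following Python sketch shows only one common way to compute such a value. For each category it marks the top-scoring documents as positive, using as many as there are true positives in that category, so that precision equals recall. It then pools the counts across categories. The function name, the input layout, and the toy scores are assumptions for illustration.

```python
def micro_breakeven(scores, labels):
    """Pooled precision/recall breakeven over categories.

    scores: dict mapping category -> list of classifier scores, one per document
    labels: dict mapping category -> list of 0/1 true labels, aligned with scores
    """
    tp_total = 0
    pos_total = 0
    for category, cat_scores in scores.items():
        cat_labels = labels[category]
        n_pos = sum(cat_labels)
        # Rank documents by score, highest first.
        ranked = sorted(zip(cat_scores, cat_labels), key=lambda p: p[0], reverse=True)
        # Predict exactly n_pos positives, so precision == recall for this category.
        tp_total += sum(label for _, label in ranked[:n_pos])
        pos_total += n_pos
    return tp_total / pos_total if pos_total else 0.0

# Hypothetical scores for four documents and two categories.
scores = {
    "hardware": [0.9, 0.8, 0.3, 0.1],
    "software": [0.2, 0.7, 0.6, 0.4],
}
labels = {
    "hardware": [1, 0, 1, 0],
    "software": [0, 1, 0, 1],
}
result = micro_breakeven(scores, labels)  # 2 correct out of 4 positives -> 0.5
```

Because the counts are pooled before dividing, large categories weigh more heavily than small ones, which is what distinguishes micro-averaging from macro-averaging.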
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Raskutti, B., Ferrá, H., Kowalczyk, A. (2001). Second Order Features for Maximising Text Classification Performance. In: De Raedt, L., Flach, P. (eds) Machine Learning: ECML 2001. ECML 2001. Lecture Notes in Computer Science, vol 2167. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44795-4_36
DOI: https://doi.org/10.1007/3-540-44795-4_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42536-6
Online ISBN: 978-3-540-44795-5
eBook Packages: Springer Book Archive