Abstract
This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is not robust by itself and faces tuning problems regarding parameters such as the number of selected documents, the number of retraining iterations, and the ratio of positively to negatively classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error for a retrained classifier and on preventing the classifier from being diluted by selecting too many, or too weakly classified, documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.
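To make the tuning parameters named in the abstract concrete, the following is a minimal sketch of a self-training loop, not the authors' implementation. It assumes a toy nearest-centroid classifier over bag-of-words vectors; the abstract does not say which classifier is used. All function and parameter names (`self_train`, `iterations`, `n_pos`, `n_neg`, `loo_error`) are hypothetical. The leave-one-out error is recomputed by brute force here, as a stand-in for the prediction method the paper develops.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized document."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Average term weights of a list of vectors (empty dict for an empty list)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    n = max(len(vectors), 1)
    return {t: c / n for t, c in total.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score(doc, pos, neg):
    """Toy confidence: > 0 means closer to the positive centroid."""
    return cosine(doc, centroid(pos)) - cosine(doc, centroid(neg))

def self_train(pos, neg, unlabeled, iterations=2, n_pos=1, n_neg=1):
    """Move the most confidently classified unlabeled documents into the
    training sets. iterations, n_pos and n_neg correspond to the parameters
    the abstract says need tuning (hypothetical names)."""
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(iterations):
        scored = sorted(((score(d, pos, neg), i) for i, d in enumerate(pool)),
                        reverse=True)
        take_pos = [i for s, i in scored[:n_pos] if s > 0]
        take_neg = [i for s, i in scored[::-1][:n_neg] if s < 0]
        if not take_pos and not take_neg:
            break  # nothing is classified confidently enough
        pos += [pool[i] for i in take_pos]
        neg += [pool[i] for i in take_neg]
        chosen = set(take_pos) | set(take_neg)
        pool = [d for i, d in enumerate(pool) if i not in chosen]
    return pos, neg, pool

def loo_error(pos, neg):
    """Brute-force leave-one-out error of the toy classifier on the labeled set."""
    labeled = [(d, 1) for d in pos] + [(d, -1) for d in neg]
    errors = 0
    for i, (doc, label) in enumerate(labeled):
        rest = labeled[:i] + labeled[i + 1:]
        rest_pos = [d for d, y in rest if y == 1]
        rest_neg = [d for d, y in rest if y == -1]
        predicted = 1 if score(doc, rest_pos, rest_neg) > 0 else -1
        errors += predicted != label
    return errors / len(labeled)

# Usage: one seed document per class, four unlabeled documents.
seed_pos = [vectorize("python code compiler")]
seed_neg = [vectorize("football goal match")]
unlabeled = [vectorize(t) for t in ("python compiler bug", "football match score",
                                    "goal keeper football", "code review python")]
pos, neg, pool = self_train(seed_pos, seed_neg, unlabeled, iterations=2)
print(len(pos), len(neg), len(pool), loo_error(pos, neg))
```

A tuner in this spirit could compare `loo_error` across candidate settings of `iterations`, `n_pos`, and `n_neg` and keep the setting with the lowest estimate. Recomputing the error exactly, as done here, is only practical for very small training sets.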
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Siersdorfer, S., Weikum, G. (2005). Automated Retraining Methods for Document Classification and Their Parameter Tuning. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_38
DOI: https://doi.org/10.1007/11581062_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer Science (R0)