Abstract
The k-nearest neighbor (kNN) rule is a simple and effective classifier for document classification. In this method, a test document is assigned to a class if that class has the maximum representation among the document's k nearest neighbors in the training set. The training documents are ranked by their content similarity to the test document, and the k most similar ones form its nearest neighbors. Document classification is challenging because of the large number of attributes present in the data set. Because the data are sparse, many attributes provide no information about a particular document. Thus, assigning a document to a predefined class for a large value of k may not be accurate when the margin of the majority vote is one or when a tie occurs. This article modifies the kNN rule by placing a threshold on the majority vote, and it proposes a discrimination criterion to prune the actual search space of the test document. The proposed classification rule enhances the confidence of the voting process and makes no prior assumption about the number of nearest neighbors. Experimental evaluation on various well-known text data sets shows that the accuracy of the proposed method is significantly better than that of the traditional kNN method and of some other document classification methods.
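The idea of placing a threshold on the majority vote can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the actual threshold or the discrimination criterion, so the function names `cosine` and `knn_with_margin`, the `min_margin` parameter, the use of cosine similarity, and the choice to return `None` for a low-confidence vote are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_with_margin(test_doc, training, k, min_margin):
    """Majority vote over the k most similar training documents.

    Returns the winning label, or None when the winner leads the
    runner-up by fewer than min_margin votes (hypothetical rule,
    assumed here; not taken from the article).
    """
    ranked = sorted(training, key=lambda pair: cosine(test_doc, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    top = votes.most_common(2)
    if not top:
        return None
    best_label, best_count = top[0]
    runner_up = top[1][1] if len(top) > 1 else 0
    if best_count - runner_up < min_margin:
        return None  # vote too close to call under the assumed threshold
    return best_label


# Toy corpus (made up for illustration): (term-weight dict, class label).
training = [
    ({"ball": 1, "goal": 1}, "sport"),
    ({"ball": 1, "team": 1}, "sport"),
    ({"vote": 1, "party": 1}, "politics"),
    ({"vote": 1, "law": 1}, "politics"),
    ({"goal": 1, "team": 1}, "sport"),
]
test_doc = {"ball": 1, "goal": 1, "team": 1}

print(knn_with_margin(test_doc, training, k=3, min_margin=2))
print(knn_with_margin(test_doc, training, k=5, min_margin=2))
```

With k = 5 the vote in this toy corpus is three to two, a margin of one, so the sketch declines to assign a class. That is the situation the abstract identifies as unreliable for plain kNN.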
Notes
http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz.
The test statistic is of the form \(t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{s^2_1/n_1+s^2_2/n_2}},\) where \(\bar{x}_1, \bar{x}_2\) are the means, \(s_1, s_2\) are the standard deviations, and \(n_1, n_2\) are the numbers of observations.
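As a worked illustration of this statistic, the short Python sketch below evaluates it for two small samples. The function name `t_statistic` and the sample values are hypothetical, and the sample (n − 1) standard deviation is assumed, since the note does not say which estimator is meant.

```python
import math
from statistics import mean, stdev


def t_statistic(sample_1, sample_2):
    """t = (mean_1 - mean_2) / sqrt(s_1^2 / n_1 + s_2^2 / n_2)."""
    x1, x2 = mean(sample_1), mean(sample_2)
    # Assumption: sample standard deviation (n - 1 denominator).
    s1, s2 = stdev(sample_1), stdev(sample_2)
    n1, n2 = len(sample_1), len(sample_2)
    return (x1 - x2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Made-up samples: means 4 and 2, sample variances 4 and 1, n = 3 each,
# so t = 2 / sqrt(4/3 + 1/3) = 2 / sqrt(5/3).
t = t_statistic([2, 4, 6], [1, 2, 3])
print(round(t, 4))
```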
Cite this article
Basu, T., Murthy, C.A. Towards enriching the quality of k-nearest neighbor rule for document classification. Int. J. Mach. Learn. & Cyber. 5, 897–905 (2014). https://doi.org/10.1007/s13042-013-0177-1
DOI: https://doi.org/10.1007/s13042-013-0177-1