Definition
At a high-level the problem of document clustering is defined as follows. Given a set S of n documents, we would like to partition them into a pre-determined number of k subsets S 1, S 2,...,S k , such that the documents assigned to each subset are more similar to each other than the documents assigned to different subsets. Document clustering is an essential part of text mining and has many applications in information retrieval and knowledge management. Document clustering faces two big challenges: the dimensionality of the feature space tends to be high (i.e., a document collection often consists of thousands or tens of thousands unique words); the size of a document collection tends to be large.
Historical Background
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms as well as in facilitating...
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Recommended Reading
Boley D. Principal direction divisive partitioning. Data Mining Knowl. Discov., 2(4), 1998.
Cutting D.R., Pedersen J.O., Karger D.R., and Tukey J.W. Scatter/gather: A cluster-based approach to browsing large document collections. In Proc. 15th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1992, pp. 318–329.
Dempster A.P., Laird N.M., and Rubin D.B. Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc., 39, 1977.
Dhillon I.S. Co-clustering documents and words using bipartite spectral graph partitioning. In Knowledge Discovery and Data Mining, 2001, pp. 269–274.
Ding C., He X., Zha H., Gu M., and Simon H. 1Spectral min-max cut for graph partitioning and data clustering. Technical Report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, 2001.
Duda R.O., Hart P.E., and Stork D.G. Pattern Classification. Wiley, New York, 2001.
Fisher D. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res., 4:147–180, 1996.
Jain A.K. and Dubes R.C. Algorithms for Clustering Data. Prentice Hall, New York, 1988.
Karypis G. Cluto: A clustering toolkit. Technical Report 02-017, Department of Computer Science, University of Minnesota, 2002.
King B. Step-wise clustering procedures. J. Am. Stat. Assoc., 69:86–101, 1967.
MacQueen J. Some methods for classification and analysis of multivariate observations. In Proc. 5th Symp. Math. Stat. Prob., 1967, pp. 281–297.
Salton G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
Sneath P.H. and Sokal R.R. Numerical Taxonomy. Freeman, London, UK, 1973.
Zahn K. Graph-tehoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput., (C-20):68–86, 1971.
Zha H., He X., Ding C., Simon H., and Gu M. Bipartite graph partitioning and data clustering. In Proc. Int. Conf. on Information and Knowledge Management, 2001.
Zhao Y. and Karypis G. Criterion functions for document clustering: Experiments and analysis. Mach. Learn. 55:311–331, 2004.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer Science+Business Media, LLC
About this entry
Cite this entry
Zhao, Y., Karypis, G. (2009). Document Clustering. In: LIU, L., ÖZSU, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_1479
Download citation
DOI: https://doi.org/10.1007/978-0-387-39940-9_1479
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-35544-3
Online ISBN: 978-0-387-39940-9
eBook Packages: Computer ScienceReference Module Computer Science and Engineering