Abstract
The steady growth of information on the WWW and in digital libraries, portals, databases and local intranets has given rise to several methods that help users with Information Retrieval, information organization and browsing. Clustering algorithms are of crucial importance when no labels are associated with textual information or documents. In the text mining domain, the aim of clustering algorithms is to group documents concerning the same topic into the same cluster, producing a flat or hierarchical structure of clusters. In this paper we present a Knowledge Discovery System for document processing and clustering. The clustering algorithm implemented in this system, called Induced Bisecting k-Means, outperforms the Standard Bisecting k-Means and is particularly suitable for online applications where computational efficiency is a crucial aspect.
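For orientation, the baseline named in the abstract can be sketched as follows. This is a minimal illustration of a standard bisecting k-means, not the Induced Bisecting k-Means of the paper, whose details are not given here. Several choices are assumptions made for the sketch: squared Euclidean distance on small dense vectors (document clustering typically uses TF-IDF vectors and cosine similarity), always splitting the largest cluster, and a deterministic farthest-point initialization instead of the usual random seeding. All function names are hypothetical.

```python
def _mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def _sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iters=20):
    """Split `points` into two groups with plain 2-means (Lloyd iterations).

    Assumption for this sketch: the two initial centers are chosen
    deterministically (farthest point from the mean, then the point
    farthest from that one) rather than at random.
    """
    centroid = _mean(points)
    c0 = max(points, key=lambda p: _sqdist(p, centroid))
    c1 = max(points, key=lambda p: _sqdist(p, c0))
    centers = [c0, c1]
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for p in points:
            if _sqdist(p, centers[0]) <= _sqdist(p, centers[1]):
                left.append(p)
            else:
                right.append(p)
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        new_centers = [_mean(left), _mean(right)]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumption: the cluster to split is simply the largest one.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        if len(target) < 2:
            clusters.append(target)
            break
        left, right = two_means(target)
        if not left or not right:
            clusters.append(target)
            break
        clusters.extend([left, right])
    return clusters
```

Calling `bisecting_kmeans(vectors, k)` on a list of numeric tuples returns a flat list of clusters. Recording which cluster each split came from would give the hierarchical (binary tree) structure that the abstract mentions.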
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Archetti, F., Campanelli, P., Fersini, E., Messina, E. (2006). A Hierarchical Document Clustering Environment Based on the Induced Bisecting k-Means. In: Larsen, H.L., Pasi, G., Ortiz-Arroyo, D., Andreasen, T., Christiansen, H. (eds) Flexible Query Answering Systems. FQAS 2006. Lecture Notes in Computer Science, vol 4027. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11766254_22
DOI: https://doi.org/10.1007/11766254_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34638-8
Online ISBN: 978-3-540-34639-5
eBook Packages: Computer Science, Computer Science (R0)