Abstract
Text clustering is one of the most important research areas in text mining: it processes text automatically to discover implicit knowledge, grouping documents into clusters by content without a priori knowledge. In this paper, we study several text clustering methods and three clustering validation criteria, which we use to evaluate the experimental results. We compare the effectiveness of the k-means and FIHC text clustering methods experimentally and discuss the differing quality of the resulting text clusters.
This research is funded by the National Natural Science Foundation of China (Project No. 60402011) and the Ministry of Education Key Laboratory of Information Management and Information Economics (Project No. F0607-01).
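To make the setting in the abstract concrete, the following is a minimal sketch of k-means text clustering followed by an external validation score. It is not the implementation used in the experiments. The toy documents, the cosine similarity on term-frequency vectors, the fixed seed documents, and the choice of purity as the validation measure are all assumptions made for illustration; the abstract does not name the three criteria, and FIHC is not shown.

```python
import math
from collections import Counter

def tf_vector(text):
    """Unit-length term-frequency vector for a whitespace-tokenized document."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def centroid(vectors):
    """Sum of the member vectors, rescaled to unit length."""
    total = Counter()
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}

def kmeans(vectors, seeds, max_iter=20):
    """Assign each vector to its most similar centroid until labels stop changing.

    `seeds` are indices of the documents used as initial centroids
    (an assumption here, chosen to keep the sketch deterministic).
    """
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = centroid(members)
    return labels

def purity(labels, classes):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for lab, cls in zip(labels, classes):
        per_cluster.setdefault(lab, Counter())[cls] += 1
    return sum(max(c.values()) for c in per_cluster.values()) / len(labels)

# Hypothetical toy corpus with known classes, used only to score the result.
docs = [
    "stock market shares trading",
    "market shares fall stock",
    "football match goal team",
    "team wins football match",
]
classes = ["finance", "finance", "sport", "sport"]

vectors = [tf_vector(d) for d in docs]
labels = kmeans(vectors, seeds=(0, 2))
score = purity(labels, classes)
print(labels, score)
```

The class labels are used only for scoring, never for clustering, which matches the "without a priori knowledge" setting: the algorithm sees only the document vectors.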
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zheng, Y., Cheng, X., Huang, R., Man, Y. (2006). A Comparative Study on Text Clustering Methods. In: Li, X., Zaïane, O.R., Li, Z. (eds) Advanced Data Mining and Applications. ADMA 2006. Lecture Notes in Computer Science, vol 4093. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11811305_71
DOI: https://doi.org/10.1007/11811305_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37025-3
Online ISBN: 978-3-540-37026-0
eBook Packages: Computer Science, Computer Science (R0)