Abstract
Text clustering plays an important role in many real-world applications, but it faces several challenges, such as the curse of dimensionality, complex semantics, and large data volume. Much research has addressed these problems by designing new text representation models and clustering algorithms. However, text clustering remains an open research problem because of the complicated properties of text data. In this paper, a text clustering procedure is proposed based on the principle of granular computing with the aid of Wikipedia. The proposed method first identifies text granules, focusing on concepts and words, with the aid of Wikipedia. It then mines latent patterns by computing over these granules. Experimental results on benchmark data sets (20Newsgroups and Reuters-21578) show that the proposed method improves text clustering performance compared with an existing clustering algorithm combined with existing representation models.
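The abstract does not spell out how granules are built or compared, so the following is only an illustrative sketch of the general idea of representing a document at two granularities (words and concepts). The `TOY_CONCEPTS` table, the `granules` and `cosine` helpers, and the `concept_weight` parameter are all hypothetical; in particular, the small dictionary merely stands in for a Wikipedia-derived word-to-concept lookup and is not the authors' method.

```python
import math
from collections import Counter

# Hypothetical word-to-concept table; a toy stand-in for a lookup
# that would be derived from Wikipedia in a real system.
TOY_CONCEPTS = {
    "puma": "Cat",
    "cougar": "Cat",
    "sedan": "Car",
    "coupe": "Car",
}


def granules(text, concept_weight=1.0):
    """Return a bag holding word-level and concept-level granules."""
    words = text.lower().split()
    # Word granules: plain term counts, keyed by ("word", term).
    bag = Counter(("word", w) for w in words)
    # Concept granules: one weighted count per word that maps to a concept.
    for w in words:
        concept = TOY_CONCEPTS.get(w)
        if concept is not None:
            bag[("concept", concept)] += concept_weight
    return bag


def cosine(a, b):
    """Cosine similarity between two granule bags."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["puma hunts at night", "cougar sighted near town", "sedan sales rise"]
bags = [granules(d) for d in docs]
words_only = [granules(d, concept_weight=0.0) for d in docs]

sim_with_concepts = cosine(bags[0], bags[1])
sim_words_only = cosine(words_only[0], words_only[1])
```

In this toy setting the first two documents share no words, so their word-only similarity is zero, while the shared concept granule gives them a positive similarity. Any standard clustering algorithm could then be run on such similarities; which algorithm the paper uses is not stated in the abstract.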
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jing, L., Yu, J. (2011). Text Clustering Based on Granular Computing and Wikipedia. In: Yao, J., Ramanna, S., Wang, G., Suraj, Z. (eds) Rough Sets and Knowledge Technology. RSKT 2011. Lecture Notes in Computer Science, vol 6954. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24425-4_85
DOI: https://doi.org/10.1007/978-3-642-24425-4_85
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24424-7
Online ISBN: 978-3-642-24425-4
eBook Packages: Computer Science, Computer Science (R0)