Abstract
Recent studies have shown that link-based clustering methods can significantly improve the performance of content-based clustering. However, most previous algorithms were developed for fixed data sets and are not applicable to dynamic environments such as data warehouses and online digital libraries.
In this paper, we introduce a novel approach that leverages network structure for incremental clustering. Under this framework, both link and content information are used to determine the host cluster of a new document. Combining the two types of information improves the quality of the clustering results. Furthermore, the status of core members is used to quickly determine whether to split or merge a new cluster. This filtering process eliminates unnecessary and time-consuming checks of textual similarity against the whole corpus, and thus greatly speeds up the entire procedure. We evaluate the proposed approach on several real-world publication data sets and conduct an extensive comparison with both classic content-based and recent link-based algorithms. The experimental results demonstrate the effectiveness and efficiency of our method.
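The host-cluster step described above can be illustrated with a minimal sketch. The abstract does not give the scoring function, so everything here is an assumption made for illustration: the function name `assign_host_cluster`, the weighting parameter `alpha`, the acceptance `threshold`, the use of cluster centroids, and the linear mix of a link vote with cosine similarity are all hypothetical and are not the authors' method. The sketch only shows the general idea that links restrict the candidate clusters, so that textual similarity is computed against a few candidates rather than the whole corpus.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_host_cluster(doc_terms, doc_links, cluster_of, centroids,
                        alpha=0.5, threshold=0.2):
    """Pick a host cluster for a new document (hypothetical scoring).

    doc_terms  : sparse term-weight dict of the new document
    doc_links  : ids of already clustered documents it links to
    cluster_of : dict mapping document id -> cluster id
    centroids  : dict mapping cluster id -> sparse centroid dict
    alpha      : assumed weight of link evidence versus content evidence
    threshold  : assumed minimum score; otherwise None is returned
    """
    # Link filter: only clusters that contain a linked neighbour are candidates.
    votes = Counter(cluster_of[n] for n in doc_links if n in cluster_of)
    if not votes:
        return None  # no link evidence; the caller decides what to do
    total = sum(votes.values())

    best, best_score = None, threshold
    for cid, count in votes.items():
        link_score = count / total
        content_score = cosine(doc_terms, centroids[cid])
        score = alpha * link_score + (1 - alpha) * content_score
        if score > best_score:
            best, best_score = cid, score
    return best


if __name__ == "__main__":
    cluster_of = {"d1": 0, "d2": 0, "d3": 1}
    centroids = {
        0: {"cluster": 1.0, "graph": 1.0},
        1: {"protein": 1.0, "gene": 1.0},
    }
    new_doc = {"graph": 1.0, "cluster": 1.0}
    print(assign_host_cluster(new_doc, ["d1", "d3"], cluster_of, centroids))
```

In this sketch a return value of `None` means that no linked cluster scored above the threshold; how such a document is handled (for example, by starting a new cluster) is left to the caller and is not specified by the abstract.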
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Qian, T., Si, J., Li, Q., Yu, Q. (2012). Leveraging Network Structure for Incremental Document Clustering. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds) Web Technologies and Applications. APWeb 2012. Lecture Notes in Computer Science, vol 7235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29253-8_29
DOI: https://doi.org/10.1007/978-3-642-29253-8_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29252-1
Online ISBN: 978-3-642-29253-8
eBook Packages: Computer Science, Computer Science (R0)