Summary
In this contribution we propose a hierarchical fuzzy clustering algorithm that dynamically supports information filtering. The idea is that document filtering can benefit from a dynamic hierarchical fuzzy clustering of the documents into overlapping topic categories, which correspond to different levels of granularity of the categorisation. Depending on their profile, users can have either general or specific interests, and thus they must be fed documents belonging to the categories of interest. Such a category can correspond to a high-level topic, such as sport news, to a subtopic, such as football news, or even to a very specific topic, such as the football matches of their favourite team. The hierarchical structure of the automatically identified clusters is built so that each level corresponds to a distinct degree of overlap among its clusters. This degree increases as one climbs the hierarchy, since the topics represented in the upper levels are more general, i.e., fuzzier. The hierarchy of fuzzy clusters is used to support filtering criteria that are personalized based on user profiles. Since a filter monitors one or more continuously fed document streams, the clustering must be able both to generate a fuzzy hierarchical classification of a collection of documents and to update the hierarchy of existing categories, either by including newly found documents or by detecting new categories when the contents of such new documents differ from those represented by the existing clusters. The fuzzy clustering algorithm is based on a generalization of the fuzzy C-means algorithm that is iteratively applied to each hierarchical level to identify the clusters of the next higher level. To apply this algorithm to document filtering, it has been extended to use a cosine similarity instead of the usual Euclidean distance, and to automatically estimate the number of clusters to detect at each hierarchical level.
This number is identified based either on an explicit input that specifies the minimum percentage of common index terms that the clusters of the level can share (which is equivalent to indicating a tolerance for overlap between the topics dealt with in each fuzzy cluster) or on a statistical analysis of the cumulative curve of the overlapping degrees between all pairs of clusters of the level. In this way, the main obstacle to applying fuzzy C-means, namely that it requires the desired number of clusters to be specified in advance, is overcome.
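To make the core idea concrete, the following is a minimal sketch of a single-level fuzzy C-means run that uses a cosine dissimilarity in place of the Euclidean distance. It is not the chapter's algorithm: the function names, the seeding of centroids with the first documents, the fuzzifier value m = 2, and the use of 1 - cosine similarity in the role of the squared distance are all assumptions made for illustration, and the hierarchy construction and the automatic estimation of the number of clusters are not covered.

```python
import math


def cosine_dissimilarity(u, v):
    """Return 1 - cos(u, v); 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def fuzzy_c_means_cosine(docs, n_clusters, m=2.0, n_iter=50, eps=1e-9):
    """Fuzzy C-means over term-weight vectors with cosine dissimilarity.

    Returns (memberships, centroids); memberships[k][i] is the degree to
    which document k belongs to cluster i, and each row sums to 1.
    """
    # Assumption: seed the centroids with the first n_clusters documents.
    centroids = [list(d) for d in docs[:n_clusters]]
    memberships = []
    for _ in range(n_iter):
        # Membership update: the closer a centroid, the higher the degree.
        memberships = []
        for d in docs:
            # Clamp at eps so a document equal to a centroid cannot divide by zero.
            diss = [max(cosine_dissimilarity(d, c), eps) for c in centroids]
            inv = [x ** (-1.0 / (m - 1.0)) for x in diss]
            total = sum(inv)
            memberships.append([x / total for x in inv])
        # Centroid update: mean of the documents weighted by membership ** m.
        for i in range(n_clusters):
            weights = [row[i] ** m for row in memberships]
            total_w = sum(weights)
            centroids[i] = [
                sum(w * d[j] for w, d in zip(weights, docs)) / total_w
                for j in range(len(docs[0]))
            ]
    return memberships, centroids


# Hypothetical toy collection: four index terms, two topics, one mixed document.
docs = [
    [3, 1, 0, 0],
    [0, 0, 2, 3],
    [2, 2, 0, 0],
    [0, 0, 3, 1],
    [1, 1, 1, 1],
]
memberships, _ = fuzzy_c_means_cosine(docs, n_clusters=2)
for row in memberships:
    print([round(u, 2) for u in row])
```

In this toy run the mixed document receives a substantial membership degree in both clusters, which is the kind of overlap between topic categories that the hierarchy described above is built on.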
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bordogna, G., Pagani, M., Pasi, G. (2006). A Dynamic Hierarchical Fuzzy Clustering Algorithm for Information Filtering. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_1
DOI: https://doi.org/10.1007/3-540-31590-X_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31588-9
Online ISBN: 978-3-540-31590-2
eBook Packages: Engineering, Engineering (R0)