Abstract
With the growth of web-based applications and the increasing popularity of the World Wide Web (WWW), the Web has become the largest source of information available in the world, which makes extracting relevant information increasingly difficult. Moreover, the content of web sites changes constantly, leading to continual changes in Web users’ behaviour. There is therefore significant interest in analysing web content data to better serve users. Our proposed approach, which is grounded in automatic textual analysis of a web site independently of its usage, attempts to define groups of documents dealing with the same topic. Both document clustering and word clustering are well-studied problems. However, most existing algorithms cluster documents and words separately rather than simultaneously. In this paper, we propose to apply a block clustering algorithm to categorize the pages of a web site according to their content. We report the results of our recent tests of the CROKI2 algorithm on a tourist web site.
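To make the idea of clustering documents and words simultaneously concrete, the following is a minimal sketch of an alternating block clustering loop on a page-by-word count matrix. It is not the CROKI2 algorithm evaluated in the paper: it assumes a plain squared-error criterion around block means, and the page names, words, counts, starting partitions and function names are all hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: alternating block clustering with a squared-error
# criterion. This is NOT CROKI2; all data and names below are hypothetical.

def block_means(matrix, row_labels, col_labels, k, l):
    """Mean count of every (row cluster, column cluster) block."""
    sums = [[0.0] * l for _ in range(k)]
    sizes = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
            sizes[row_labels[i]][col_labels[j]] += 1
    return [[sums[a][b] / sizes[a][b] if sizes[a][b] else 0.0
             for b in range(l)] for a in range(k)]

def best_row_cluster(row, means, col_labels, k):
    """Row cluster whose block means are closest to this row."""
    def cost(a):
        return sum((v - means[a][col_labels[j]]) ** 2 for j, v in enumerate(row))
    return min(range(k), key=cost)

def best_col_cluster(matrix, j, means, row_labels, l):
    """Column cluster whose block means are closest to column j."""
    def cost(b):
        return sum((row[j] - means[row_labels[i]][b]) ** 2
                   for i, row in enumerate(matrix))
    return min(range(l), key=cost)

def block_cluster(matrix, row_labels, col_labels, k, l, max_iter=20):
    """Alternate row and column reassignment until both partitions are stable."""
    row_labels, col_labels = list(row_labels), list(col_labels)
    for _ in range(max_iter):
        means = block_means(matrix, row_labels, col_labels, k, l)
        new_rows = [best_row_cluster(row, means, col_labels, k) for row in matrix]
        means = block_means(matrix, new_rows, col_labels, k, l)
        new_cols = [best_col_cluster(matrix, j, means, new_rows, l)
                    for j in range(len(matrix[0]))]
        if new_rows == row_labels and new_cols == col_labels:
            break
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels

# Hypothetical toy data: word counts for four pages of a tourist web site.
pages = ["hotels.html", "rooms.html", "museums.html", "tickets.html"]
words = ["hotel", "room", "museum", "ticket"]
counts = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
]

# Deliberately imperfect starting partitions (also hypothetical).
page_labels, word_labels = block_cluster(counts, [0, 0, 1, 0], [0, 1, 1, 1], k=2, l=2)
print(dict(zip(pages, page_labels)))
print(dict(zip(words, word_labels)))
```

On this toy matrix the two partitions are refined together: each group of pages is characterised by the group of words it uses most, which is the kind of page and word block structure that block clustering looks for. Separate document and word clusterings would not provide that pairing directly.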
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Charrad, M., Lechevallier, Y., Ahmed, M.b., Saporta, G. (2009). Block Clustering for Web Pages Categorization. In: Corchado, E., Yin, H. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2009. IDEAL 2009. Lecture Notes in Computer Science, vol 5788. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04394-9_32
DOI: https://doi.org/10.1007/978-3-642-04394-9_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04393-2
Online ISBN: 978-3-642-04394-9
eBook Packages: Computer Science, Computer Science (R0)