Abstract
This paper describes variations of the classical k-means clustering algorithm that can be used effectively to address the so-called Web-site Boundary Detection (WBD) problem. The suggested advantages of these techniques are that they can quickly identify most of the pages belonging to a web-site and, in the long run, return a solution whose accuracy is comparable to (if not better than) that of other clustering methods. We analyze our techniques on artificial clones of the web generated using a well-known preferential attachment method.
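To make the notion of a preferential attachment web clone concrete, the following is a minimal sketch of one common variant, in which each new page links to existing pages with probability that grows with their in-degree. The abstract does not say which generator or parameters the authors used, so the function name `preferential_attachment_graph`, the parameter `links_per_page`, and the "in-degree plus one" weighting are all assumptions made for illustration.

```python
import random


def preferential_attachment_graph(n_pages, links_per_page=2, seed=0):
    """Return directed edges (source, target) of a synthetic web graph.

    Illustrative assumption: a new page picks each target with probability
    proportional to (in-degree + 1). This is not the paper's generator.
    """
    if n_pages < 2 or links_per_page < 1:
        raise ValueError("need at least 2 pages and 1 link per page")
    rng = random.Random(seed)

    # Start from two pages, with page 1 linking to page 0.
    edges = [(1, 0)]
    # Sampling pool: each page appears once on creation, plus once per
    # incoming link, so a uniform draw is weighted by (in-degree + 1).
    pool = [0, 1, 0]

    for new_page in range(2, n_pages):
        wanted = min(links_per_page, new_page)
        chosen = set()
        while len(chosen) < wanted:
            chosen.add(rng.choice(pool))  # duplicates are simply redrawn
        for target in sorted(chosen):
            edges.append((new_page, target))
            pool.append(target)
        pool.append(new_page)  # the new page becomes a possible target
    return edges


if __name__ == "__main__":
    graph = preferential_attachment_graph(1000, links_per_page=3, seed=42)
    in_degree = {}
    for _, target in graph:
        in_degree[target] = in_degree.get(target, 0) + 1
    print("edges:", len(graph))
    print("largest in-degree:", max(in_degree.values()))
```

Graphs built this way tend to have a few heavily linked pages and many lightly linked ones, which is the property that makes such models a plausible stand-in for real web data when testing a clustering method.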
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alshukri, A., Coenen, F., Zito, M. (2011). Incremental Web-Site Boundary Detection Using Random Walks. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2011. Lecture Notes in Computer Science, vol 6871. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23199-5_31
DOI: https://doi.org/10.1007/978-3-642-23199-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23198-8
Online ISBN: 978-3-642-23199-5
eBook Packages: Computer Science (R0)