Abstract.
With the explosive growth of the World Wide Web, it is becoming increasingly difficult for users to discover Web pages that are relevant to a topic. To address this problem, we are developing a system that collects and analyzes Web pages related to a particular topic. In this paper we present the system’s overall architecture and introduce the focused crawler it uses. We also discuss the techniques that let the user analyze a collection and gain useful insights about it. Finally, we present some statistics on the collections.
Cite this article
Mukherjea, S. Discovering and Analyzing World Wide Web Collections. Knowledge and Information Systems 6, 230–241 (2004). https://doi.org/10.1007/s10115-003-0112-y
DOI: https://doi.org/10.1007/s10115-003-0112-y