
Discovering and Analyzing World Wide Web Collections

Knowledge and Information Systems

Abstract.

With the explosive growth of the World Wide Web, it is becoming increasingly difficult for users to discover Web pages that are relevant to a topic. To address this problem, we are developing a system that allows the collection and analysis of Web pages related to a particular topic. In this paper, we present the system’s overall architecture and introduce the focused crawler used by the system. We also discuss the various techniques we use to allow the user to analyze and gain useful insights about a collection. Finally, we present some statistics on the collections.



Author information


Correspondence to Sougata Mukherjea.


About this article

Cite this article

Mukherjea, S. Discovering and Analyzing World Wide Web Collections. Knowledge and Information Systems 6, 230–241 (2004). https://doi.org/10.1007/s10115-003-0112-y


  • DOI: https://doi.org/10.1007/s10115-003-0112-y
