Abstract
In this paper, we propose a novel approach to focused crawling that exploits genre and content-related information present in Web pages to guide the crawling process. The effectiveness, efficiency and scalability of this approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi (genre) of computer science courses (content). The results of these experiments show that focused crawlers constructed according to our approach achieve F1 levels above 92% (an average gain of 178% over traditional focused crawlers), and need to analyze no more than 60% of the visited pages to find 90% of the relevant pages (an average gain of 82% over traditional focused crawlers).
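For readers unfamiliar with the metric quoted above, F1 is the harmonic mean of precision and recall. The following is a minimal sketch of that arithmetic and of a relative gain over a baseline. It is not the authors' evaluation code: the function names and the example counts are hypothetical and were chosen only for illustration.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    retrieved = true_positives + false_positives   # pages the crawler judged relevant
    relevant = true_positives + false_negatives    # pages that are actually relevant
    if retrieved == 0 or relevant == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / retrieved
    recall = true_positives / relevant
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Hypothetical counts, not taken from the paper.
example_f1 = f1_score(true_positives=90, false_positives=10, false_negatives=10)
```

With these illustrative counts, precision and recall are both 0.9, so the F1 value is also 0.9. The abstract reports average gains across several experiments, so its figures cannot be reproduced from a single pair of values like this.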
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
de Assis, G.T., Laender, A.H.F., Gonçalves, M.A., da Silva, A.S. (2007). Exploiting Genre in Focused Crawling. In: Ziviani, N., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2007. Lecture Notes in Computer Science, vol 4726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75530-2_6
DOI: https://doi.org/10.1007/978-3-540-75530-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75529-6
Online ISBN: 978-3-540-75530-2
eBook Packages: Computer Science, Computer Science (R0)