Abstract
We propose a new approach to discovering and extracting topic-specific hypertext resources from the WWW. The method, called schema driven and topical crawling, lets a user define a schema and extraction rules for a specific domain of interest, and then automatically searches for and extracts schema-relevant pages from the web. Unlike common approaches that surf solely on web pages, our approach lets the crawler surf a virtual network composed of concept instances and relationships. To achieve this goal, we design an architecture that integrates several techniques, including a web extractor, a meta-search engine, and query expansion, and we provide a toolkit to support it.
This research work is part of the ALVIS project of the EU’s 6th Framework Programme and is funded by the Ministry of Science and Technology of China.
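The idea of crawling over concept instances and relationships, rather than over raw hyperlinks, can be pictured with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its toolkit: the `Paper` and `Person` concepts, the regular-expression extraction rules, the in-memory `PAGES` dictionary standing in for the web, and the `meta_search` stub standing in for a meta-search engine. Query expansion is reduced to using an extracted attribute value as the query.

```python
import re
from collections import deque
from dataclasses import dataclass, field

# Hypothetical in-memory "web": URL -> page text. A real crawler would fetch pages.
PAGES = {
    "http://example.org/paper1": "Title: Focused Crawling. Author: Ann Lee.",
    "http://example.org/ann": "Name: Ann Lee. Affiliation: Example University.",
    "http://example.org/misc": "Weather today: sunny.",
}

@dataclass
class Concept:
    name: str
    rules: dict  # attribute name -> regex with one capture group
    relations: dict = field(default_factory=dict)  # attribute -> target concept name

# Hypothetical user-defined schema: two concepts linked by one relationship.
SCHEMA = {
    "Paper": Concept(
        "Paper",
        {"title": r"Title: ([^.]+)\.", "author": r"Author: ([^.]+)\."},
        {"author": "Person"},
    ),
    "Person": Concept(
        "Person",
        {"name": r"Name: ([^.]+)\.", "affiliation": r"Affiliation: ([^.]+)\."},
    ),
}

def extract(concept, text):
    """Apply every extraction rule; return an instance dict, or None if a rule fails."""
    instance = {}
    for attr, pattern in concept.rules.items():
        match = re.search(pattern, text)
        if match is None:
            return None
        instance[attr] = match.group(1)
    return instance

def meta_search(query):
    """Stand-in for a meta-search engine: URLs of pages whose text contains the query."""
    return [url for url, text in sorted(PAGES.items()) if query in text]

def crawl(seed_url, seed_concept):
    """Breadth-first walk over concept instances, following schema relationships."""
    queue = deque([(seed_url, seed_concept)])
    seen, found = set(), []
    while queue:
        url, cname = queue.popleft()
        if (url, cname) in seen:
            continue
        seen.add((url, cname))
        instance = extract(SCHEMA[cname], PAGES.get(url, ""))
        if instance is None:
            continue  # page does not match the schema, so it is not expanded
        found.append((cname, url, instance))
        for attr, target in SCHEMA[cname].relations.items():
            # The related attribute value becomes the query for the next hop.
            for hit in meta_search(instance[attr]):
                queue.append((hit, target))
    return found

results = crawl("http://example.org/paper1", "Paper")
for cname, url, instance in results:
    print(cname, url, instance)
```

In this sketch the crawler never follows a hyperlink: it moves from a `Paper` instance to a `Person` instance because the schema declares that relationship, and pages that match no concept are never expanded.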
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guo, Q., Guo, H., Zhang, Z., Sun, J., Feng, J. (2005). Schema Driven and Topic Specific Web Crawling. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_55
DOI: https://doi.org/10.1007/11408079_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer Science (R0)