Schema Driven and Topic Specific Web Crawling

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3453)

Abstract

We propose a new approach to discovering and extracting topic-specific hypertext resources from the WWW. The method, called schema driven and topical crawling, allows a user to define a schema and extraction rules for a specific domain of interest. It supports automatically searching for and extracting schema-relevant web pages from the web. Unlike common approaches that surf solely on web pages, our approach lets the crawler surf a virtual network composed of concept instances and relationships. To achieve this goal, we design an architecture that integrates several techniques, including a web extractor, a meta-search engine, and query expansion, and provide a toolkit to support it.
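
The idea of crawling over concept instances rather than hyperlinks can be illustrated with a small sketch. The code below is not the toolkit described in the paper; it is a minimal illustration under stated assumptions. The schema (`Author`, `Paper`), the regex extraction rules, the in-memory `TOY_WEB` corpus that stands in for real pages, and the functions `expand_query`, `meta_search`, `extract`, and `schema_crawl` are all hypothetical names chosen for this example.

```python
import re
from collections import deque

# Hypothetical schema: one regex extraction rule per concept, plus the
# relationships that say which concepts may be reached from an instance.
EXTRACT_RULES = {
    "Author": re.compile(r"author:\s*([A-Z][a-z]+ [A-Z][a-z]+)"),
    "Paper": re.compile(r'paper:\s*"([^"]+)"'),
}
RELATIONSHIPS = {"Author": ["Paper"], "Paper": ["Author"]}

# Stand-in for the web (an assumption): URL -> page text.
TOY_WEB = {
    "http://example.org/a": 'author: Ann Lee\npaper: "Focused Crawling"',
    "http://example.org/b": 'paper: "Focused Crawling"\nauthor: Bob Ray',
    "http://example.org/c": 'author: Bob Ray\npaper: "Web Wrappers"',
    "http://example.org/d": "unrelated page about cooking",
}


def expand_query(concept, value):
    """Toy query expansion: the instance value plus related concept names."""
    return [value] + [c.lower() for c in RELATIONSHIPS.get(concept, [])]


def meta_search(terms):
    """Toy meta-search: URLs whose text contains every query term."""
    return sorted(
        url for url, text in TOY_WEB.items()
        if all(t.lower() in text.lower() for t in terms)
    )


def extract(text):
    """Apply every extraction rule; return a set of (concept, value) pairs."""
    return {
        (concept, match)
        for concept, rule in EXTRACT_RULES.items()
        for match in rule.findall(text)
    }


def schema_crawl(seed, max_instances=50):
    """Breadth-first walk over concept instances instead of hyperlinks."""
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < max_instances:
        concept, value = queue.popleft()
        for url in meta_search(expand_query(concept, value)):
            for instance in extract(TOY_WEB[url]):
                # Follow only instances whose concept the schema relates
                # to the current concept.
                if instance[0] in RELATIONSHIPS[concept] and instance not in seen:
                    seen.add(instance)
                    queue.append(instance)
    return seen
```

Starting from the seed `("Author", "Ann Lee")`, `schema_crawl` reaches the related paper, then its co-author, then that co-author's other paper, without ever visiting the unrelated page. In this sketch the frontier holds concept instances, and pages are only a means of finding the next instance, which is one way to read the "virtual network" described above.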

This research work is part of the ALVIS project of the EU’s 6th Framework Programme and is funded by the Ministry of Science and Technology of China.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guo, Q., Guo, H., Zhang, Z., Sun, J., Feng, J. (2005). Schema Driven and Topic Specific Web Crawling. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_55

  • DOI: https://doi.org/10.1007/11408079_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25334-1

  • Online ISBN: 978-3-540-32005-0

  • eBook Packages: Computer Science, Computer Science (R0)
