TreeWrapper: Automatic Data Extraction Based on Tree Representation

  • Conference paper
AI 2006: Advances in Artificial Intelligence (AI 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4304)

Abstract

This paper introduces a new algorithm that learns to extract data from Web pages with relatively regular data structures. Existing systems require training on either manually labelled pages or at least two similar unlabelled pages, and they often have difficulty handling Web pages with complex formats such as nested tables or lists. Our previous system, AutoWrapper, needs no training and can automatically extract data from any single page. This paper improves on AutoWrapper by handling nested structures and finding multiple regular data areas. The main contributions include a tree-based representation for Web pages, an expressive language for representing information extraction patterns, and a learning algorithm that automatically detects regular data areas by finding similar sub-trees.
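
The idea of detecting regular data areas by finding similar sub-trees can be illustrated with a minimal sketch. This is not the TreeWrapper algorithm itself, whose details are not given in the abstract; it is a simplified stand-in under stated assumptions. The names `Node`, `TreeBuilder`, `signature`, `regular_areas` and `find_regular_areas` are hypothetical, the input is assumed to be well-formed HTML in which every start tag has a matching end tag, and "similar" is reduced to "identical tag structure", whereas a real system would presumably tolerate small differences between sub-trees.

```python
from html.parser import HTMLParser


class Node:
    """A node in a simplified tag tree (text content is ignored)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree from well-formed HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumption: every start tag has a matching end tag.
        if self.current.parent is not None:
            self.current = self.current.parent


def signature(node):
    """Structural signature: the tag plus the signatures of its children."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def regular_areas(node, min_repeats=2):
    """Collect (parent_tag, child_signature, count) for every run of
    adjacent sibling sub-trees that share the same signature."""
    areas = []
    sigs = [signature(c) for c in node.children]
    i = 0
    while i < len(sigs):
        j = i
        while j < len(sigs) and sigs[j] == sigs[i]:
            j += 1
        if j - i >= min_repeats:
            areas.append((node.tag, sigs[i], j - i))
        i = j
    # Recurse so that nested structures and multiple areas are reported.
    for child in node.children:
        areas.extend(regular_areas(child, min_repeats))
    return areas


def find_regular_areas(html):
    """Parse the HTML and return all candidate regular data areas."""
    builder = TreeBuilder()
    builder.feed(html)
    return regular_areas(builder.root)
```

For example, `find_regular_areas("<ul><li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li></ul>")` reports one area under `ul` consisting of two `li(b(),i())` sub-trees. Because the search recurses into every child, a table with repeated rows yields both the row-level area and the cell-level areas inside each row, which is one simple way to surface nested structures from a single page without training data.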

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gao, X., Zhang, M., Cao, M.D. (2006). TreeWrapper: Automatic Data Extraction Based on Tree Representation. In: Sattar, A., Kang, Bh. (eds) AI 2006: Advances in Artificial Intelligence. AI 2006. Lecture Notes in Computer Science, vol 4304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11941439_61

  • DOI: https://doi.org/10.1007/11941439_61

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49787-5

  • Online ISBN: 978-3-540-49788-2

  • eBook Packages: Computer Science, Computer Science (R0)