Abstract
This paper introduces a new algorithm that learns to extract data from Web pages with relatively regular data structures. Existing systems require training on either manually labelled pages or at least two similar unlabelled pages, and they often have difficulty handling Web pages with complex formats such as nested tables or lists. Our previous system, AutoWrapper, needs no training and can automatically extract data from any single page. This paper improves AutoWrapper by handling nested structures and finding multiple regular data areas. The main contributions are a tree-based representation for Web pages, an expressive language for representing information extraction patterns, and a learning algorithm that automatically detects regular data areas by finding similar sub-trees.
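The idea of detecting regular data areas by finding similar sub-trees can be illustrated with a minimal sketch. This is not the TreeWrapper algorithm or its pattern language, neither of which the abstract specifies. It assumes well-formed HTML and substitutes exact structural equality of sibling sub-trees for whatever similarity measure the paper uses. All names here (`Node`, `TreeBuilder`, `signature`, `regular_areas`, `extract`, `PAGE`) are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the page tree (hypothetical helper type)."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes properly nested, well-formed HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural fingerprint: the tag plus the fingerprints of all children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def regular_areas(node, min_repeats=2):
    """Collect runs of adjacent siblings with identical signatures.

    Assumption: exact equality stands in for 'similar'. Runs of leaf
    elements are skipped so that only structured records are reported.
    """
    areas = []
    sigs = [signature(c) for c in node.children]
    i = 0
    while i < len(sigs):
        j = i
        while j + 1 < len(sigs) and sigs[j + 1] == sigs[i]:
            j += 1
        if j - i + 1 >= min_repeats and node.children[i].children:
            areas.append(node.children[i:j + 1])
        i = j + 1
    # Recurse so that areas nested inside other elements are also found.
    for child in node.children:
        areas.extend(regular_areas(child, min_repeats))
    return areas


def leaf_texts(node):
    """Text fragments of a sub-tree in document order."""
    out = [node.text] if node.text else []
    for child in node.children:
        out.extend(leaf_texts(child))
    return out


def extract(html):
    """Return one list of records (each a list of strings) per detected area."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [[leaf_texts(rec) for rec in area]
            for area in regular_areas(builder.root)]


# A single unlabelled page containing two separate regular data areas.
PAGE = """
<ul>
  <li><b>A</b><i>1</i></li>
  <li><b>B</b><i>2</i></li>
</ul>
<table>
  <tr><td>x</td><td>y</td></tr>
  <tr><td>z</td><td>w</td></tr>
  <tr><td>u</td><td>v</td></tr>
</table>
"""

if __name__ == "__main__":
    for area in extract(PAGE):
        print(area)
```

Calling `extract(PAGE)` reports the list items and the table rows as two separate areas without any training pages. Because the sketch demands identical structure, records that differ slightly (for example, nested lists of different lengths) would not be grouped; an approximate sub-tree similarity measure would be needed for that.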
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Gao, X., Zhang, M., Cao, M.D. (2006). TreeWrapper: Automatic Data Extraction Based on Tree Representation. In: Sattar, A., Kang, Bh. (eds) AI 2006: Advances in Artificial Intelligence. AI 2006. Lecture Notes in Computer Science, vol 4304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11941439_61
DOI: https://doi.org/10.1007/11941439_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49787-5
Online ISBN: 978-3-540-49788-2
eBook Packages: Computer Science (R0)