Abstract
Information extraction (IE) is an important research topic in data mining. In this paper we introduce a web information extraction approach based on similar patterns, in which constructing the pattern library is the knowledge-acquisition bottleneck. To address it, we use a method based on similarity computation to acquire patterns automatically from a large-scale corpus. Starting from given seed patterns, relevant patterns are learned from unlabeled training web pages. The generated patterns can be put to use after minor manual correction. Compared with other algorithms, our approach requires much less human intervention and avoids hand-tagging the training corpus. Experimental results show that the acquired patterns achieve an IE precision of 79.45% and a recall of 66.51% in an open test.
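The seed-driven acquisition step can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the pattern representation, or any threshold, so everything below is an assumption made for illustration: patterns are modeled as token tuples, similarity is Jaccard overlap of token sets, and the names `jaccard`, `acquire_patterns`, and the `threshold` value are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: a pattern is a tuple of tokens and slot labels.
Pattern = Tuple[str, ...]


def jaccard(a: Pattern, b: Pattern) -> float:
    """Jaccard overlap of two patterns' token sets (an assumed measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def acquire_patterns(
    seeds: List[Pattern],
    candidates: List[Pattern],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> List[Tuple[Pattern, float]]:
    """Keep candidates whose best similarity to any seed meets the threshold."""
    scored = []
    for cand in candidates:
        best = max((jaccard(cand, seed) for seed in seeds), default=0.0)
        if best >= threshold:
            scored.append((cand, best))
    # Most seed-like candidates first, ready for manual correction.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy data standing in for patterns harvested from unlabeled web pages.
seeds = [("<PERSON>", "was", "appointed", "<POSITION>", "of", "<ORG>")]
candidates = [
    ("<PERSON>", "was", "named", "<POSITION>", "of", "<ORG>"),
    ("<ORG>", "reported", "quarterly", "earnings"),
]
learned = acquire_patterns(seeds, candidates)
```

In this toy run only the first candidate is retained, since it shares five of seven distinct tokens with the seed, while the second shares a single slot label. A real system would presumably use a richer similarity over syntactic or semantic features, followed by the manual correction the abstract mentions.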
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Ye, N., Wu, X., Zhu, J., Chen, W., Yao, T. (2004). Web Information Extraction Based on Similar Patterns. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_67
DOI: https://doi.org/10.1007/978-3-540-27772-9_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive