Abstract
This paper surveys our recent results on knowledge discovery from semistructured texts, which contain heterogeneous structures represented by labeled trees. The aim of our study is to extract useful information from documents on the Web. First, we present theoretical results on learning rewriting rules between labeled trees. Second, we apply our method to learning HTML trees in the framework of wrapper induction. We also evaluate our algorithms on real-world HTML documents and present the results.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Sakamoto, H., Arimura, H., Arikawa, S. (2002). Knowledge Discovery from Semistructured Texts. In: Arikawa, S., Shinohara, A. (eds) Progress in Discovery Science. Lecture Notes in Computer Science, vol 2281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45884-0_45
DOI: https://doi.org/10.1007/3-540-45884-0_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43338-5
Online ISBN: 978-3-540-45884-5
eBook Packages: Springer Book Archive