The ontological key: automatically understanding and integrating forms to access the deep Web

  • Special Issue Paper
The VLDB Journal

Abstract

Forms are our gates to the Web. They enable us to access the deep content of Web sites. Automatic form understanding provides applications, ranging from crawlers and meta-search engines to service integrators, with a key to this content. Yet, it has received little attention other than as a component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL advances the state of the art. For form labeling, it combines features from the text, structure, and visual rendering of a Web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern Web forms, OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect (>97% accuracy in the evaluation domains) form interpretations. Yet, the effort to produce a domain schema is very low, as we provide a datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of OPAL’s form interpretations through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms with no error, yet is implemented with only a handful of translation rules.
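To make the idea of distributing a master query concrete, the following is a minimal sketch and not the system described in the article. It assumes each form interpretation can be reduced to a mapping from a form's own field names to concepts of a shared domain schema. The site identifiers, field names, concepts, and the functions `translate` and `distribute` are all hypothetical, and plain Python stands in for the datalog-based rules mentioned in the abstract.

```python
from typing import Dict

# Hypothetical form interpretations: each maps a form's own field names
# to concepts of a shared domain schema (all names invented for illustration).
FORM_INTERPRETATIONS: Dict[str, Dict[str, str]] = {
    "site_a": {"loc": "location", "min_p": "min_price", "max_p": "max_price"},
    "site_b": {"where": "location", "price_to": "max_price"},
}


def translate(master_query: Dict[str, str],
              interpretation: Dict[str, str]) -> Dict[str, str]:
    """Fill every form field whose schema concept occurs in the master query."""
    return {
        field: master_query[concept]
        for field, concept in interpretation.items()
        if concept in master_query
    }


def distribute(master_query: Dict[str, str],
               interpretations: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Translate one master query into a field assignment per form."""
    return {
        form_id: translate(master_query, interpretation)
        for form_id, interpretation in interpretations.items()
    }


# A master query phrased over schema concepts, not over any form's fields.
master = {"location": "Oxford", "max_price": "500000"}
filled = distribute(master, FORM_INTERPRETATIONS)
```

In this sketch a single generic rule covers both forms because the per-form differences live entirely in the interpretations, which is one plausible reading of why only a few translation rules would be needed once forms share a schema.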

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Community’s FP7/2007-2013/ERC Grant agreement DIADEM, No. 246858, and the Oxford Martin School.

Author information

Corresponding author

Correspondence to Tim Furche.

About this article

Cite this article

Furche, T., Gottlob, G., Grasso, G. et al. The ontological key: automatically understanding and integrating forms to access the deep Web. The VLDB Journal 22, 615–640 (2013). https://doi.org/10.1007/s00778-013-0323-0
