Abstract
Forms are our gates to the Web. They enable us to access the deep content of Web sites. Automatic form understanding provides applications, ranging from crawlers and meta-search engines to service integrators, with a key to this content. Yet it has received little attention other than as a component of specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL advances the state of the art. For form labeling, it combines features from the text, structure, and visual rendering of a Web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern Web forms, OPAL outperforms previous approaches to form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect (\(>\)97% accuracy in the evaluation domains) form interpretations. Yet the effort to produce a domain schema is very low, as we provide a datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of OPAL's form interpretations through a lightweight form integration system that successfully translates and distributes master queries to hundreds of forms without error, yet is implemented with only a handful of translation rules.
Acknowledgments
The research leading to these results has received funding from the European Research Council under the European Community’s FP7/2007-2013/ERC Grant agreement DIADEM, No. 246858, and the Oxford Martin School.
Cite this article
Furche, T., Gottlob, G., Grasso, G. et al. The ontological key: automatically understanding and integrating forms to access the deep Web. The VLDB Journal 22, 615–640 (2013). https://doi.org/10.1007/s00778-013-0323-0
DOI: https://doi.org/10.1007/s00778-013-0323-0