Abstract
To make semantic search a reality, we need to be able to efficiently publish large data sets containing rich semantic structure. Tools exist for translating relational and semi-structured data into RDF, but they are not designed to add the kind of semantics needed to achieve the goals of the Semantic Web and of semantic search over the Web. In this chapter, we present LinQuer, a tool for creating semantic links within a data source and between data sources. We focus on link discovery over structured (relational) data because many Semantic Web sources are the result of publishing relational data as RDF and because relational engines provide the scalability and flexibility we need for large-scale link discovery. The LinQuer framework is based on the declarative specification of linkage requirements by a user. We present algorithms for translating these requirements into queries that can run over relational data sources, potentially using semantic information (such as a class hierarchy or a more general ontology) to enhance the recall of link discovery. We show that this framework is flexible enough to permit linking real data, including dirty data (which is commonly found on the Web) and data with a variety of semantic connections.
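The idea of turning a declarative linkage requirement into a relational query can be illustrated with a minimal sketch. It is not LinQuer's algorithm and does not use its specification language. The helper `link_query`, the column names, and the `synonym` table are hypothetical, chosen only for illustration; the table names `trial` and `dbpedia_disease` follow the simplified schema described in the Notes. The sketch uses Python's built-in `sqlite3` module and assumes that matching on case-folded, trimmed strings, optionally expanded through a synonym table, is an acceptable stand-in for the semantic matching described above.

```python
import sqlite3


def link_query(src, src_key, src_col, tgt, tgt_key, tgt_col, synonyms=None):
    """Build a SQL query linking rows of `src` to rows of `tgt`.

    Hypothetical helper: a link is produced when the two text columns are
    equal after trimming and lower-casing. If `synonyms` names a table with
    columns (term, alt), links reached through a synonym are added as well.
    """
    direct = (
        f"SELECT s.{src_key}, t.{tgt_key} FROM {src} s JOIN {tgt} t "
        f"ON lower(trim(s.{src_col})) = lower(trim(t.{tgt_col}))"
    )
    if synonyms is None:
        return direct
    via_synonym = (
        f"SELECT s.{src_key}, t.{tgt_key} FROM {src} s "
        f"JOIN {synonyms} y ON lower(trim(s.{src_col})) = lower(y.term) "
        f"JOIN {tgt} t ON lower(y.alt) = lower(trim(t.{tgt_col}))"
    )
    # UNION removes duplicate links found by both branches.
    return f"{direct} UNION {via_synonym}"


# Toy data; all rows and identifiers below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trial (id TEXT, condition_name TEXT);
CREATE TABLE dbpedia_disease (uri TEXT, label TEXT);
CREATE TABLE synonym (term TEXT, alt TEXT);
INSERT INTO trial VALUES ('T1', ' Diabetes Mellitus'), ('T2', 'Heart attack');
INSERT INTO dbpedia_disease VALUES
    ('dbp:Diabetes_mellitus', 'diabetes mellitus'),
    ('dbp:Myocardial_infarction', 'Myocardial infarction');
INSERT INTO synonym VALUES ('heart attack', 'myocardial infarction');
""")

spec = ("trial", "id", "condition_name", "dbpedia_disease", "uri", "label")
exact_links = sorted(conn.execute(link_query(*spec)))
all_links = sorted(conn.execute(link_query(*spec, synonyms="synonym")))

print(exact_links)  # only the string-level match
print(all_links)    # string-level match plus the synonym-based match
```

In this toy setting the synonym table plays the role of the semantic information: it recovers a link that string comparison alone misses, which is the sense in which such knowledge can raise recall.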
Notes
1. Part of this work has appeared in Proceedings of the 18th ACM Conference on Information and Knowledge Management [17]. ©2009 Association for Computing Machinery, Inc. Reprinted by permission.
2. To make our example queries simple, we assume that the databases are denormalized and we have a single table for clinical trials (trial), a table storing patient visits (visit), and tables storing DBpedia disease (dbpedia_disease) and drug (dbpedia_drug) data. In reality, the database is normalized and these relations are decomposed into multiple relations.
References
Appelt, D.E.: Introduction to information extraction. AI Commun. 12(3), 161–172 (1999)
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 918–929 (2006)
Auer, S., Dietzold, S., Lehmann, J., Hellmann, S., Aumueller, D.: Triplify: light-weight linked data publication from relational databases. International World Wide Web Conference (WWW), pp. 621–630 (2009)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. International World Wide Web Conference (WWW), pp. 131–140. Banff, Canada (2007)
Bilke, A., Bleiholder, J., Böhm, C., Draba, K., Naumann, F., Weis, M.: Automatic data fusion with HumMer. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 1251–1254 (2005)
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia – a crystallization point for the web of data. J. Web Semant. 7(3), 154–165 (2009)
Bizer, C., Seaborne, A.: D2RQ – treating non-RDF databases as virtual RDF graphs. Proceedings of the International Semantic Web Conference (ISWC) (2004)
Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18(3), 288–321 (2000)
Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 73–78. Acapulco, Mexico (2003)
Das, S., Chong, E.I., Eadon, G., Srinivasan, J.: Supporting ontology-based semantic matching in RDBMS. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 1054–1065 (2004)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Erling, O., Mikhailov, I.: Virtuoso: RDF support in a native RDBMS. Semantic Web Information Management, pp. 501–519. Springer, Berlin, Heidelberg, New York (2009)
Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.A.: Declarative data cleaning: language, model, and algorithms. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 371–380 (2001)
Galperin, M.Y., Cochrane, G.: The 2011 nucleic acids research database issue and the online molecular biology database collection. Nucleic Acids Res. 39(Database-Issue), 1–6 (2011)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 491–500 (2001)
Hassanzadeh, O.: Benchmarking declarative approximate selection predicates. Master’s Thesis, University of Toronto, Toronto, Ontario, Canada (2007)
Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R.J., Wang, M.: A framework for semantic link discovery over relational data. Proceedings of the Conference on Information and Knowledge Management (CIKM), pp. 1027–1036 (2009). URL http://dx.doi.org/10.1145/1645953.1646084
Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R.J., Wang, M.: LinkedCT: a linked data space for clinical trials. CoRR abs/0908.0567 (2009)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. ACM SIGMOD International Conference on Management of Data, pp. 127–138 (1995)
Indyk, P., Motwani, R., Raghavan, P., Vempala, S.: Locality-preserving hashing in multidimensional spaces. ACM Symposium on Theory of Computing (STOC), pp. 618–625 (1997)
Kementsietsidis, A., Lim, L., Wang, M.: Supporting ontology-based keyword search over medical databases. Proceedings of the AMIA 2008 Symposium, pp. 409–413. American Medical Informatics Association (2008)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
Naumann, F., Herschel, M.: An introduction to duplicate detection. Synthesis Lectures on Data Management. Morgan and Claypool Publishers, Seattle, WA USA (2010)
Sioutos, N., de Coronado, S., Haber, M.W., Hartel, F.W., Shaiu, W., Wright, L.W.: NCI thesaurus: a semantic model integrating cancer-related clinical and molecular information. J. Biomed. Inform. 40(1), 30–43 (2007)
Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a large ontology from Wikipedia and WordNet. J. Web Semant. 6(3), 203–217 (2008)
Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and maintaining links on the web of data. Proceedings of the International Semantic Web Conference (ISWC), pp. 650–665 (2009)
Yeganeh, S.H., Hassanzadeh, O., Miller, R.J.: Linking semistructured data on the web. Proceedings of the International Workshop on the Web and Databases (WebDB) (2011)
ClinicalTrials.gov, A Service of the US National Institutes of Health – http://clinicaltrials.gov/ (2011). Accessed 28 July 2011
State of the LOD Cloud. Version 0.2. http://www4.wiwiss.fu-berlin.de/lodcloud/state/ (2011). Accessed 28 July 2011
The LinQuer Project – http://purl.org/linquer. Accessed 28 July 2011
Acknowledgements
This work has been partially supported by the NSERC Business Intelligence Network. Hassanzadeh has been supported by an IBM Graduate Fellowship. We thank Reynold S. Xin for implementation of the LinQuer API and Web interface, and improving the overall design of the system and the LinQL grammar.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Hassanzadeh, O., Miller, R.J., Kementsietsidis, A., Lim, L., Wang, M. (2012). Semantic Link Discovery over Relational Data. In: De Virgilio, R., Guerra, F., Velegrakis, Y. (eds) Semantic Search over the Web. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25008-8_8
DOI: https://doi.org/10.1007/978-3-642-25008-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25007-1
Online ISBN: 978-3-642-25008-8
eBook Packages: Computer Science, Computer Science (R0)