Similarity queries: their conceptual evaluation, transformations, and processing | The VLDB Journal Skip to main content
Log in

Similarity queries: their conceptual evaluation, transformations, and processing

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

Many application scenarios can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, for example join and selection, to be aware of data similarities, there has not been much study on the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this paper is the study of similarity queries that contain one or multiple first-class similarity database operators such as Similarity Selection, Similarity Join, and Similarity Group-by. Particularly, we analyze the implementation techniques of several similarity operators, introduce a consistent and comprehensive conceptual evaluation model for similarity queries, and present a rich set of transformation rules to extend cost-based query optimization to the case of similarity queries.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24
Fig. 25
Fig. 26
Fig. 27
Fig. 28
Fig. 29
Fig. 30
Fig. 31
Fig. 32
Fig. 33
Fig. 34
Fig. 35
Fig. 36
Fig. 37
Fig. 38
Fig. 39
Fig. 40
Fig. 41
Fig. 42
Fig. 43
Fig. 44
Fig. 45
Fig. 46

Similar content being viewed by others

References

  1. Silva, Y.N., Aref, W.G, Ali, M.H.: Similarity group-by. In: Proceedings of the 2009 IEEE International Conference on Data, Engineering, 2009

  2. Silva, Y.N., Aref, W.G., Ali, M.H.: The similarity join database operator. In: Proceedings of the 2010 IEEE International Conference on Data, Engineering, 2010

  3. Silva, Y.N., Arshad, M.U., Aref, W.G.: Exploiting similarity-aware grouping in decision support systems. In: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009

  4. Silva, Y.N., Aly, A.M., Aref, W.G., Larson, P.-A.: Simdb: a similarity-aware database system. In: Proceedings of the 2010 International Conference on Management of data, 2010

  5. Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of data, 1998

  6. Zhang, T., Ramakrishnan, R., Livny, M.: Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of data, 1996

  7. Zhang, C., Huang, Y.: Cluster by: a new sql extension for spatial data aggregation. In: Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic, Information Systems, 2007

  8. Li, C., Wang, M., Lim, L., Wang, H., Chang, K.C.-C.: Supporting ranking and clustering as generalized order-by and group-by. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of data, 2007

  9. Schallehn, E., Sattler, K.-u., Saake, G.: Extensible grouping and aggregation for data reconciliation. In: In Proceedings of 4th International Workshop on Engineering Federated, Information Systems, EFIS01, 2001

  10. Schallehn, E., Sattler, K.-U., Saake, G.: Efficient similarity-based operations for data integration. Data Knowl. Eng. 48, 361–387 (2004)

    Article  Google Scholar 

  11. Jacox, E.H., Samet, H.: Metric space similarity joins. ACM Trans. Datab. Syst. 33, 7:1–7:38 (2008)

    Google Scholar 

  12. Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of data, 1998

  13. Böhm, C., Krebs, F.: The k-nearest neighbour join: turbo charging the kdd process. Knowl. Inf. Syst. 6, 728–749 (2004)

    Article  Google Scholar 

  14. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data, Engineering, 2006

  15. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Data, Bases, 2001

  16. Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: Proceedings of the 2008 IEEE 24th International Conference on Data, Engineering, 2008

  17. Yang, X., Wang, B., Li, C.: Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of data, 2008

  18. Wichterich, M., Assent, I., Kranen, P., Seidl, T.: Efficient emd-based similarity search in multimedia databases via flexible dimensionality reduction. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of data, 2008

  19. Adali, S., Bonatti, P., Sapino, M.L., Subrahmanian, V.S.: A multi-similarity algebra. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998

  20. Ferreira, M.R.P., Traina, C., Jr., Traina, A.J.M.: An efficient framework for similarity query optimization. In: Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic, Information Systems, 2007

  21. Traina, C. Jr., Traina, A.J.M., Vieira, M.R., Arantes, A.S., Faloutsos, C.: Efficient processing of complex similarity queries in rdbms through query rewriting, In: Proceedings of the 15th ACM International Conference on Information and, Knowledge Management, 2006

  22. Barioni, M.C.N., Razente, H., Traina, A., Traina, C. Jr.: Siren: a similarity retrieval engine for complex data. In: Proceedings of the 32nd International Conference on Very Large Data Bases, 2006

  23. Baioco, G.B., Traina, A.J.M., Traina, C. Jr.: Mamcost: Global and local estimates leading to robust cost estimation of similarity queries. In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007

  24. TPC-H version 2.14.3. [Online]. Available: http://www.tpc.org/tpch/

  25. Silva, Y.N., Aref, W.G., Larson, P.-A., Pearson, S.S., Ali, M.H.: Similarity queries—transformation rules and proofs. Arizona State University, Tech. Rep., 2012. [Online]. Available: http://www.public.asu.edu/~ynsilva/tr/SQTRep.pdf

  26. Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: Proceedings of the 20th International Conference on Very Large Data, Bases, 1994

  27. Yan, W.P., Larson, P.-A.: Eager aggregation and lazy aggregation. In: Proceedings of the 21th International Conference on Very Large Data, Bases, 1995

  28. Graefe, G.: The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995)

    Google Scholar 

  29. Graefe, G., McKenna, W.J.: The volcano optimizer generator: Extensibility and efficient search. In: Proceedings of the Ninth International Conference on Data Engineering, pp. 209–218. IEEE Computer Society, Washington, DC (1993)

  30. Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 59–68. ACM, New York, NY (1998)

  31. Lee, H., Ng, R.T., Shim, K.: Similarity join size estimation using locality sensitive hashing. Proc. VLDB Endow. 4, 338–349 (2011)

    Google Scholar 

Download references

Acknowledgments

Walid G. Aref’s research was partially supported by the National Science Foundation under Grants III-1117766, IIS-0964639, and IIS-0811954.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yasin N. Silva.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 712 KB)

Appendix: Similarity-aware transformation rules

Appendix: Similarity-aware transformation rules

See Table 4.

Table 4 Transformation rules for similarity-aware operators—instances of generic rules

Rights and permissions

Reprints and permissions

About this article

Cite this article

Silva, Y.N., Aref, W.G., Larson, PA. et al. Similarity queries: their conceptual evaluation, transformations, and processing. The VLDB Journal 22, 395–420 (2013). https://doi.org/10.1007/s00778-012-0296-4

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-012-0296-4

Keywords

Navigation