{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,4,10]],"date-time":"2024-04-10T07:15:42Z","timestamp":1712733342579},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2007,6]]},"abstract":"\n Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on\n inverse strings<\/jats:italic>\n and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures.\n <\/jats:p>\n \n We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is\n independent<\/jats:italic>\n of the number of database strings and\n linear<\/jats:italic>\n with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.\n <\/jats:p>","DOI":"10.1145\/1242524.1242529","type":"journal-article","created":{"date-parts":[[2007,9,14]],"date-time":"2007-09-14T13:44:55Z","timestamp":1189777495000},"page":"12","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["Estimating the selectivity of approximate string queries"],"prefix":"10.1145","volume":"32","author":[{"given":"Arturas","family":"Mazeika","sequence":"first","affiliation":[{"name":"Free University of Bozen-Bolzano, Bozen-Bolzano BZ, Italy"}]},{"given":"Michael H.","family":"B\u00f6hlen","sequence":"additional","affiliation":[{"name":"Free University of Bozen-Bolzano, Bozen-Bolzano BZ, Italy"}]},{"given":"Nick","family":"Koudas","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, Ontario"}]},{"given":"Divesh","family":"Srivastava","sequence":"additional","affiliation":[{"name":"AT&T Labs--Research, Florham Park, NJ"}]}],"member":"320","published-online":{"date-parts":[[2007,6]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304203"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/829502.830043"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/647819.736184"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the International Conference on Data Engineering (ICDE). 227--239","author":"Chaudhuri S.","unstructured":"Chaudhuri , S. , Ganti , V. , and Gravano , L . 2004. Selectivity estimation for string predicates: Overcoming the underestimation problem . In Proceedings of the International Conference on Data Engineering (ICDE). 227--239 . Chaudhuri, S., Ganti, V., and Gravano, L. 2004. Selectivity estimation for string predicates: Overcoming the underestimation problem. In Proceedings of the International Conference on Data Engineering (ICDE). 227--239."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0022-0000(02)00031-4"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.1994.365694"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the International Conference on Data Engineering (ICDE). 489--499","author":"Cohen E., D., M.","unstructured":"Cohen , E., D., M. , Fujiwara , S. , Gionis , A. , Indyk , P. , Motwani , R. , Ullman , J. D. , and Yang , C . 2000. Finding interesting associations without support pruning . In Proceedings of the International Conference on Data Engineering (ICDE). 489--499 . Cohen, E., D., M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J. D., and Yang, C. 2000. Finding interesting associations without support pruning. In Proceedings of the International Conference on Data Engineering (ICDE). 489--499."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jalgor.2003.12.001"},{"key":"e_1_2_1_9_1","unstructured":"Frakes B. and Yates R. 1992. Information Retrieval Data Structures and Algorithms. Prentice-Hall Upper Saddle River NJ. Frakes B. and Yates R. 1992. Information Retrieval Data Structures and Algorithms. Prentice-Hall Upper Saddle River NJ."},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the International Conference on Very Large Databases (VLDB). 491--500","author":"Gravano L.","unstructured":"Gravano , L. , Ipeirotis , P. G. , Jagadish , H. V. , Koudas , N. , Muthukrishnan , S. , and Srivastava , D . 2001. Approximate string joins in a database (almost) for free . In Proceedings of the International Conference on Very Large Databases (VLDB). 491--500 . Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the International Conference on Very Large Databases (VLDB). 491--500."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2003.1232265"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/s007780000029"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the International Conference on Very Large Databases (VLDB). 275--286","author":"Jagadish H. V.","unstructured":"Jagadish , H. V. , Koudas , N. , Muthukrishnan , S. , Poosala , V. , Sevcik , K. C. , and Suel , T . 1998. Optimal histograms with quality guarantees . In Proceedings of the International Conference on Very Large Databases (VLDB). 275--286 . Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the International Conference on Very Large Databases (VLDB). 275--286."},{"key":"e_1_2_1_14_1","unstructured":"Jain A. and Dubes R. 1988. Algorithms for Clustering Data. Prentice-Hall Upper Saddle River NJ. Jain A. and Dubes R. 1988. Algorithms for Clustering Data. Prentice-Hall Upper Saddle River NJ."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/331499.331504"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the International Conference on Very Large Databases (VLDB). 793--804","author":"Jin L.","unstructured":"Jin , L. , Koudas , N. , Li , C. , and Tung , A. K. H. 2005. Indexing mixed types for approximate retrieval . In Proceedings of the International Conference on Very Large Databases (VLDB). 793--804 . Jin, L., Koudas, N., Li, C., and Tung, A. K. H. 2005. Indexing mixed types for approximate retrieval. In Proceedings of the International Conference on Very Large Databases (VLDB). 793--804."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the International Conference on Very Large Databases (VLDB). 397--408","author":"Jin L.","unstructured":"Jin , L. and Li , C . 2005. Selectivity estimation for fuzzy string predicates in large data sets . In Proceedings of the International Conference on Very Large Databases (VLDB). 397--408 . Jin, L. and Li, C. 2005. Selectivity estimation for fuzzy string predicates in large data sets. In Proceedings of the International Conference on Very Large Databases (VLDB). 397--408."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA). 137","author":"Jin L.","unstructured":"Jin , L. , Li , C. , and Mehrotra , S . 2003. Efficient record linkage in large data sets . In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA). 137 . Jin, L., Li, C., and Mehrotra, S. 2003. Efficient record linkage in large data sets. In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA). 137."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/233269.233341"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/146370.146380"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/276304.276344"},{"key":"e_1_2_1_22_1","unstructured":"Matias Y. Vitter J. S. and Wang M. 2000. Dynamic maintenance of wavelet-based histograms. VLDB J. 101--110. Matias Y. Vitter J. S. and Wang M. 2000. Dynamic maintenance of wavelet-based histograms. VLDB J. 101--110."},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the CleanDB Workshop (in conjunction with) the International Conference on Very Large Databases (VLDB).","author":"Mazeika A.","unstructured":"Mazeika , A. and B\u00f6hlen , M. H . 2006. Cleansing databases of misspelled proper nouns . In Proceedings of the CleanDB Workshop (in conjunction with) the International Conference on Very Large Databases (VLDB). Mazeika, A. and B\u00f6hlen, M. H. 2006. Cleansing databases of misspelled proper nouns. In Proceedings of the CleanDB Workshop (in conjunction with) the International Conference on Very Large Databases (VLDB)."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/50202.50205"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/375360.375365"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/233269.233342"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the International Conference on Data Engineering (ICDE). 125--137","author":"Sahinalp C.","unstructured":"Sahinalp , C. , Tasan , M. , Macker , J. , and Ozsoyoglu , Z. M . 2003. Distance based indexing for string proximity search . In Proceedings of the International Conference on Data Engineering (ICDE). 125--137 . Sahinalp, C., Tasan, M., Macker, J., and Ozsoyoglu, Z. M. 2003. Distance based indexing for string proximity search. In Proceedings of the International Conference on Data Engineering (ICDE). 125--137."},{"key":"e_1_2_1_28_1","volume-title":"Introduction to Modern Information Retrieval","author":"Salton G.","unstructured":"Salton , G. and McGill , M. J. 1986. Introduction to Modern Information Retrieval . McGraw-Hill , New York . Salton, G. and McGill, M. J. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, New York."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/AINA.2005.184"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/647891.739587"},{"key":"e_1_2_1_31_1","unstructured":"Vernica R. and Li C. 2007. Flamingo project. http:\/\/www.ics.uci.edu\/~flamingo\/. Vernica R. and Li C. 2007. Flamingo project. http:\/\/www.ics.uci.edu\/~flamingo\/."}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1242524.1242529","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T21:50:16Z","timestamp":1672264216000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1242524.1242529"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,6]]},"references-count":31,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2007,6]]}},"alternative-id":["10.1145\/1242524.1242529"],"URL":"https:\/\/doi.org\/10.1145\/1242524.1242529","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"value":"0362-5915","type":"print"},{"value":"1557-4644","type":"electronic"}],"subject":[],"published":{"date-parts":[[2007,6]]},"assertion":[{"value":"2007-06-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}