Abstract
The cosine and Tanimoto similarity measures are widely applied in information retrieval, text and Web mining, data cleaning, chemistry, and bioinformatics for finding similar objects and for clustering and classifying them. Recently, a few very efficient methods have been proposed for the lossless determination of such objects, especially in large and very high-dimensional data sets. These methods typically apply to objects that can be represented by (weighted) binary vectors. In this paper, we offer methods suitable for searching among vectors whose domains consist of zero, a positive number, and a negative number; such vectors are a generalization of weighted binary vectors. Our results are not worse than their existing analogs offered for (weighted) binary vectors.
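For orientation, the two measures named in the abstract can be sketched using their standard textbook definitions: the cosine similarity of u and v is u·v / (|u| |v|), and the Tanimoto similarity is u·v / (|u|² + |v|² − u·v). The sketch below only evaluates these formulas on two small vectors whose entries are zero, positive, or negative; it does not reproduce the search methods of the paper. The function names and the sample vectors are illustrative assumptions, and both vectors are assumed to be non-zero.

```python
import math


def dot(u, v):
    """Standard dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: u.v / (|u| * |v|). Assumes non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def tanimoto(u, v):
    """Tanimoto similarity: u.v / (|u|^2 + |v|^2 - u.v). Assumes non-zero vectors."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


# Hypothetical sample vectors (not from the paper): every coordinate is
# zero, a positive number, or a negative number.
u = [2, 0, -1, 2]
v = [2, -3, -1, 0]

print(f"cosine   = {cosine(u, v):.4f}")
print(f"tanimoto = {tanimoto(u, v):.4f}")
```

For these sample vectors, u·v = 5, |u|² = 9, and |v|² = 14, so the Tanimoto similarity is 5/18 and the cosine similarity is 5/(3·√14). Unlike in the non-negative binary case, both measures can be negative here, because coordinates of opposite sign contribute negatively to the dot product.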
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kryszkiewicz, M. (2013). On Cosine and Tanimoto Near Duplicates Search among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number. In: Larsen, H.L., Martin-Bautista, M.J., Vila, M.A., Andreasen, T., Christiansen, H. (eds) Flexible Query Answering Systems. FQAS 2013. Lecture Notes in Computer Science, vol 8132. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40769-7_46
DOI: https://doi.org/10.1007/978-3-642-40769-7_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40768-0
Online ISBN: 978-3-642-40769-7
eBook Packages: Computer Science, Computer Science (R0)