1 Introduction

Nowadays, similarity searching has become an important task for retrieving objects from multimedia databases, with applications in pattern recognition, data mining and computational biology, to name a few. This task can be mapped into a metric space problem. A metric space is a pair \((\mathbb {X}, d)\), where \(\mathbb {X}\) is a universe of objects and d is a distance function \(d:\mathbb {X} \times \mathbb {X} \rightarrow \mathbb {R}^{+} \cup \{0\}\). The distance function is a metric if it satisfies, for all \(x,y,z \in \mathbb {X}\), the following properties: reflexivity, \(d(x,y)=0\) iff \(x=y\); symmetry, \(d(x,y)=d(y,x)\); and the triangle inequality, \(d(x,y)\le d(x,z)+d(z,y)\). The last property is the one that allows discarding objects when solving similarity queries.

In practical applications, we have a subset \(\mathbb {U}\) of n objects taken from the universe \(\mathbb {X}\). The similarity searching problem can then be defined as the problem of finding a small subset \(\mathbb {S} \subset \mathbb {U}\) of objects that are close to a given query q with respect to a particular metric.

Basically, there are two types of queries: the range query \((q,r)_d\) and the k-nearest neighbor query \(kN\!N(q)_d\). The first one retrieves all the objects within a given radius r measured from q, that is, \((q,r)_d = \{u \in \mathbb {U} \mid d(q,u) \le r\}\). The second retrieves the k objects in \(\mathbb {U}\) that are the closest to q. Formally, \(|kN\!N(q)_d|=k\), and \(\forall ~u \in kN\!N(q)_d, v \in \mathbb {U} \setminus kN\!N(q)_d\), \(d(u,q)\le d(v,q)\).

Several indices have been proposed to speed up similarity queries. One of them, the permutation-based index (PBI) [3], has a competitive performance in the hard case of high dimensional spaces. Oddly, in low dimensional spaces this technique performs poorly when compared with, for instance, pivot-based indices (which are particularly well suited for low dimensional spaces, as reported in [4]). Another important drawback of the PBI is that it does not allow discarding objects when solving similarity queries.

Our contribution consists in granting the basic PBI the capability of safely discarding objects. For this sake, we enhance the PBI with distance information in a convenient way, so that the increase in the space requirement is very small. Our technique dramatically improves the retrieval of the PBI in low and medium dimensional metric spaces. As a matter of fact, in the real-world metric space of NASA images, we obtain the true answer of the \(1N\!N(q)_{d}\) using 90 % fewer distance evaluations than AESA, the best pivot-based index.

2 Related Work

2.1 Metric Space Indices

There are three kinds of indices for proximity searching in metric spaces, namely, pivot-based indices, partition-based indices and permutation-based indices. In [4], the reader can find a complete survey of the first two kinds.

Pivot-based indices consider a subset of objects \(A = \{a_1, a_2, \dots , a_{|A|}\} \subseteq ~\mathbb {U}\), called pivots. These indices store all n|A| distances between every object of the dataset \(\mathbb {U}\) and every \(a_{i} \in A\). Later, to solve a range query \((q,r)_{d}\), pivot-based searching algorithms measure \(d(q,a_1)\) and, by virtue of the triangle inequality, lower bound \(d(q,u) \ge |d(q,a_1)-d(u,a_1)|\) for every \(u\in \mathbb {U}\). So, if \(|d(q,a_1)-d(u,a_1)| > r\), then \(d(q,u)>r\) and u is discarded, avoiding the computation of d(q,u). Once done with \(a_1\), they use \(a_{2}\) to try to discard elements from the remaining set, and so on, until all the pivots in A have been used. The elements that still cannot be discarded at this point are directly compared with q.
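The following is a minimal sketch of this filtering loop (in Python; the function and variable names are ours, introduced only for illustration):

```python
# Sketch of pivot-based range search.
# dist_to_pivot[u][i] holds the precomputed distance d(u, a_i).
def pivot_range_search(q, r, dataset, pivots, dist_to_pivot, d):
    candidates = set(dataset)
    for i, a in enumerate(pivots):
        dqa = d(q, a)  # one distance evaluation per pivot
        # Triangle inequality: |d(q,a) - d(u,a)| > r implies d(q,u) > r.
        candidates = {u for u in candidates
                      if abs(dqa - dist_to_pivot[u][i]) <= r}
    # Surviving candidates must be compared with q directly.
    return {u for u in candidates if d(q, u) <= r}
```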

There are many techniques based on this idea. Some of them try to reduce the memory requirement in order to get a small index, while others aim to reduce the number of distance evaluations. These techniques have a competitive performance in low dimensional spaces. Intuitively, a low dimensional space is one that can be embedded in a uniformly distributed vector space of dimension lower than 16 while preserving the relative distances among objects.

Among the pivoting techniques, AESA [10] excels in searching performance. To achieve this, AESA considers every object as a potential pivot, so it stores the whole matrix of distances between every object pair. The matrix requires \(\frac{n(n-1)}{2}\) memory cells. Since every object can operate as a pivot, the authors also define a scheme for sequencing the pivot selection. For this sake, they use an array SumLB which accumulates, for every non-discarded object, the lower bounds of its distance to the query with respect to all the objects in \(\mathbb {U}\) already used as pivots. Formally, if i pivots have been selected so far, \(SumLB(u) = \sum _{j=1}^i |d(q,a_j)-d(u,a_j)|\). The first pivot is an object chosen at random, and from the second pivot on, AESA uses the non-discarded object that minimizes SumLB(u). AESA is the baseline of exact algorithms in terms of distance evaluations. However, in several real-world scenarios AESA is impractical, as its memory requirement is only plausible for datasets of reduced size (up to tens of thousands of objects).
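To make the pivot-selection policy concrete, here is a minimal Python sketch of AESA's search loop as described above (simplified, with our own names; the published algorithm organizes these steps more carefully):

```python
import random

# D[u][v]: precomputed distance matrix over the dataset; d: the metric.
def aesa_range_search(q, r, dataset, D, d):
    alive = set(dataset)
    sum_lb = {u: 0.0 for u in dataset}  # the SumLB accumulator
    result = []
    pivot = random.choice(list(alive))  # first pivot chosen at random
    while alive:
        dqp = d(q, pivot)               # one distance evaluation
        if dqp <= r:
            result.append(pivot)
        alive.discard(pivot)
        survivors = set()
        for u in alive:
            lb = abs(dqp - D[pivot][u])
            if lb <= r:                 # otherwise u is safely discarded
                sum_lb[u] += lb
                survivors.add(u)
        alive = survivors
        if alive:                       # next pivot minimizes SumLB
            pivot = min(alive, key=sum_lb.get)
    return result
```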

Partition-based indices split the space into zones as compact as possible and assign the objects to these zones. To do this, a set of centers \(\{c_1, \ldots , c_m \} \subseteq \mathbb {U}\) is selected, and each remaining object is placed in the zone of its closest center. These indices have a competitive performance in high dimensional spaces.

2.2 The Permutation Based Index (PBI)

Let \(\mathbb {P} \subset \mathbb {U}\) be a subset of permutants of size m. Each element \(u \in \mathbb {U}\) induces a preorder \(\le _u\) given by the distance from u towards each permutant, defined as \(y \le _u z \Leftrightarrow d(u,y)\le d(u,z)\), for any pair \(y,z \in \mathbb {P}\).

Let \(\varPi _u = i_1, i_2, \dots , i_{m}\) be the permutation of u, where permutant \(p_{i_{j}} \le _u p_{i_{j+1}}\). Permutants at the same distance take an arbitrary but consistent order. For every object in \(\mathbb {U}\), its preorder of \(\mathbb {P}\) is computed and associated with a permutation, which is stored in the index (the PBI does not store distances). Thus, a simple implementation needs nm space. Nevertheless, as only the permutant identifiers are necessary, several identifiers can be compacted into a single machine word.
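Computing a permutation is straightforward, as the following sketch shows (our own illustrative code): the permutant identifiers are sorted by their distance to u, breaking distance ties by identifier to keep the order consistent.

```python
def permutation(u, permutants, d):
    dists = [d(u, p) for p in permutants]
    # Identifiers sorted by increasing distance; ties broken consistently.
    return sorted(range(len(permutants)), key=lambda j: (dists[j], j))
```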

The hypothesis of the PBI is that two equal objects are associated to the same permutation, while similar objects are, hopefully, related to similar permutations. So, if \(\varPi _u\) is similar to \(\varPi _v\) one expects that u is close to v.

Later, given a query \(q \in \mathbb {X} \setminus \mathbb {U}\), the PBI search algorithm computes its permutation \(\varPi _q\) and compares it with all the permutations stored in the index. Then, the dataset \(\mathbb {U}\) is traversed in increasing order of permutation dissimilarity, comparing the objects in \(\mathbb {U}\) with the query using the distance d of the particular metric space. Regrettably, the PBI cannot discard objects at query time. Instead, a premature cutoff of the reviewing process yields a probabilistic search algorithm (as it reports the correct answer to the query only with some probability).

There are many similarity measures between permutations. One of them is the \(L_p\) family of distances [5], which obeys Eq. (1), where \(\varPi ^{-1}(i_{j})\) denotes the position of permutant \(i_{j}\) within the permutation \(\varPi \).

$$\begin{aligned} L_p(\varPi _u,\varPi _q) = \sum _{j=1}^{|\mathbb {P}|}|\varPi _u^{-1}(i_j) - \varPi _q^{-1}(i_j)|^p \end{aligned}$$
(1)

With \(p=1\) we obtain the Spearman Footrule (\(S_{F}\)), and with \(p=2\), the Spearman Rho (\(S_{\rho }\)). For example, let \(\varPi _u = (42153)\) and \(\varPi _q = (32154)\) be the permutations of an object u and a query q, respectively. Then \(S_F(\varPi _u, \varPi _q)=8\) and \(S_{\rho }(\varPi _u, \varPi _q)=32\). As reported in [3], \(S_{F}\) has a good performance, requiring less computational effort than \(S_{\rho }\).
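The following sketch reproduces the numbers of this example (illustrative code; permutations are given as tuples of permutant identifiers):

```python
def inverse(perm):
    # inv[e]: position (1-based) of permutant e within the permutation
    return {e: pos for pos, e in enumerate(perm, start=1)}

def l_p(perm_u, perm_q, p):
    inv_u, inv_q = inverse(perm_u), inverse(perm_q)
    return sum(abs(inv_u[e] - inv_q[e]) ** p for e in perm_u)

pu, pq = (4, 2, 1, 5, 3), (3, 2, 1, 5, 4)
print(l_p(pu, pq, 1))  # 8,  Spearman Footrule
print(l_p(pu, pq, 2))  # 32, Spearman Rho
```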

There have been several works trying to improve the PBI's performance. Some techniques select good permutants, obtaining little improvement [1, 7]. Other variants have been proposed with the aim of reducing the PBI's space requirement [2, 8, 9] by discarding some portions of the permutations; however, these techniques lose precision in the query retrieval.

In general terms, all of the PBI’s variants are designed for high dimensional spaces and none of them can prune the dataset when solving similarity queries.

2.3 Distance Quantization

A simple way to reduce the memory requirement when representing the distances among objects is to quantize them. Of course, there is a tradeoff between memory requirement and precision. However, there are cases where the quantization is effective. For instance, in BAESA [6], the authors quantize the whole distance matrix of AESA [10]. Using only four bits per distance, that is, eight times less space than AESA, BAESA needs just 2.5 times the distance computations of AESA.
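As a rough illustration of the idea (not BAESA's exact scheme, which is defined in [6]), a distance can be mapped to one of \(2^4 = 16\) levels as follows:

```python
def quantize(dist, d_max, bits=4):
    # Map a distance in [0, d_max] to one of 2^bits uniform levels.
    levels = (1 << bits) - 1
    return min(levels, int(dist / d_max * levels))
```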

3 Our Contribution

Essentially, we introduce a novel modification into the PBI. It consists in enhancing it with quantized distance information. Since we now have distance data, we can safely discard objects, and this dramatically improves the performance of the PBI search algorithm in low and medium dimensionality metric spaces.

In the following, we describe the modification of the PBI and also how to adapt its searching algorithm in order to benefit from the extra data.

3.1 Enhancing PBI with Distance Quantization

For each object \(u \in \mathbb {U}\), the index stores not only its permutation but also quantized distance information. To do that, for each permutant \(p \in \mathbb {P}\), we define \(\zeta \) concentric zones \(z_1, \dots , z_\zeta \) limited by \(\zeta -1\) radii \(r_1, \dots , r_{\zeta -1}\) and two extra values \(r_{0}=0\) and \(r_{\zeta }=\infty \) (delimiting the first and last zones). So, each object \(u \in \mathbb {U}\) is located in the zone \(z_{i}\) such that \(r_{i-1}< d(u,p) \le r_{i}\).

To compute the zones, we do not need extra distance computations; in fact, we compute them during the calculation of the permutation. We call \(\varPi _u\) the permutation of u and \(Z_u\) the zone information of u. For the permutant in the j-th cell of the permutation, \(Z_u(j)\) indicates in which of its zones the object u lies. Hence, every object has its permutation and, as new extra information, the zone where it belongs according to every permutant.
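A sketch of the index construction for one object follows (our own illustrative code; radii[j] is assumed to hold the increasing radii \(r_1, \dots , r_{\zeta -1}\) of the j-th permutant):

```python
import bisect

def pzbi_entry(u, permutants, radii, d):
    dists = [d(u, p) for p in permutants]  # the same distances used for Pi_u
    perm = sorted(range(len(permutants)), key=lambda j: (dists[j], j))
    # Z_u(j): zone (1-based) of u w.r.t. the permutant in the j-th cell of perm;
    # zone i is the one satisfying r_{i-1} < d(u,p) <= r_i.
    zones = [bisect.bisect_left(radii[pj], dists[pj]) + 1 for pj in perm]
    return perm, zones
```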

Geometrically, for each \(p_{i} \in \mathbb {P}\), this can be seen as partitioning the space with \(\zeta \) distance rings centered at \(p_{i}\). Figure 1 illustrates the idea, where the dataset is \(\mathbb {U} = \{p_1, p_2, p_3, u_1, \dots , u_7\}\) and the first three objects are the permutants (\(\mathbb {P} = \{p_1, p_2, p_3\}\)). Each one has 3 zones (denoted z1, z2, z3). For each object in \(\mathbb {U}\), we show its permutation \(\varPi _{u}\) in the sequence closed by [ ] and its zone information \(Z_{u}\) in the one closed by ( ). For example, the permutation of \(u_6\) is [2, 1, 3], and for permutants \(p_2, p_{1}\) and \(p_{3}\), \(u_6\) belongs to zones 1, 2 and 2, respectively. Notice that with our proposal we can now figure out whether two objects with equal permutations are close or not.

Fig. 1. Example of our technique: permutations and zones for every object.

Our proposal has two significant advantages: (i) we can now discard some elements without computing their distances to the query, and (ii) we improve the predictive power of the PBI.

In terms of space, besides the table of permutations, we need to store the distance information. Each object needs \(m \lceil \log _{2} m\rceil \) bits for the permutation (recall that \(m=|\mathbb {P}|\)) plus \(m \lceil \log _{2} \zeta \rceil \) bits to represent its m zones, one per permutant (where \(\zeta \) is the number of zones). Finally, for each zone of a permutant we need a float number to store its radius, which accounts for \(32\,m \zeta \) bits. Adding up for the n objects, we obtain \(n m (\lceil \log _{2} m\rceil + \lceil \log _{2} \zeta \rceil ) + 32\,m \zeta \) bits in total.

To determine an equivalence in space requirement between permutants with and without zones, we follow this rationale. The standard PBI uses \(n m \lceil \log _{2} m\rceil \) bits. The extra memory used by the zones would allow adding more permutants to the plain PBI, but it is not enough to double their number. Assuming that m is a power of 2, any extra permutant forces the use of another bit; we also assume that \(\zeta \) is a power of 2. So, in the numerator of Eq. (2) we have the space used by the PBI with zones, and in the denominator, the space used by a plain PBI with \(m^{*}\) permutants, where \(m^{*} \in (m, 2m)\).

$$\begin{aligned} \frac{n m (\log _{2} m + \log _{2} \zeta ) + 32\,m \zeta }{n m^{*} (\log _{2} m + 1)} = \frac{m (\log _{2} m + \log _{2} \zeta )}{m^{*} (\log _{2} m+ 1)} + \frac{32\,m \zeta }{n m^{*} (\log _{2} m + 1)} \end{aligned}$$
(2)

The second term is \(O\left( \frac{\zeta }{n \log _{2}m}\right) \), so it is negligible. The first term is \(\frac{m (\log _{2} m + \log _{2} \zeta )}{m^{*} (\log _{2} m+ 1)}\) \(= \frac{m}{m^{*}} \left( 1 + \frac{\log _{2} \zeta -1}{\log _{2} m + 1}\right) \). For a fair memory comparison, we set this term to 1. So, the number \(m^{*}\) of equivalent permutants for m permutants and \(\zeta \) zones is

$$\begin{aligned} m^{*} = m \left( 1+ \frac{\log _{2} \zeta -1}{\log _{2}m+1}\right) \end{aligned}$$
(3)
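For instance, with \(m = 8\) permutants and \(\zeta = 8\) zones, Eq. (3) yields \(m^{*} = 8 \left( 1 + \frac{3-1}{3+1}\right) = 12\); that is, a plain PBI with 12 permutants uses roughly the same memory as our index with 8 permutants and 8 zones.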

In the rest of this section, we show how to use the PBI enhanced with quantized distance information at query time. For brevity, we call this index the PZBI.

Object Discarding. In order to discard objects in the new PZBI, we adapt the pivot excluding criterion. To do so, after computing the distances from q to all the permutants in order to produce \(\varPi _{q}\), we compute the zones where q belongs (\(Z_{q}\)) and the zones that the query ball \((q,r)_{d}\) intersects. For this sake, we maintain two arrays FZ and LZ, storing, for each permutant, the first and last zone, respectively, that the query ball intersects. Therefore, given the query radius r, for each permutant \(p \in \mathbb {P}\), \(FZ_{p}\) and \(LZ_{p}\) store the numbers of the zones that contain \(d(p,q)-r\) and \(d(p,q)+r\), respectively.

With this, when reviewing the objects, for each permutant \(p \in \mathbb {P}\) we discard, by virtue of the triangle inequality, every element in \(\mathbb {U}\) that belongs to a zone outside the range \([FZ_{p},LZ_{p}]\). This allows discarding some elements without performing the direct comparison with the query.
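A minimal sketch of this criterion follows (illustrative names; zone_of maps a distance to its zone, as in the construction sketch above):

```python
import bisect

def zone_of(radii_p, x):
    return bisect.bisect_left(radii_p, x) + 1

def build_fz_lz(q, r, permutants, radii, d):
    # The distances d(q,p) are the same ones already paid to build Pi_q.
    FZ, LZ = {}, {}
    for j, p in enumerate(permutants):
        dqp = d(q, p)
        FZ[j] = zone_of(radii[j], max(dqp - r, 0.0))
        LZ[j] = zone_of(radii[j], dqp + r)
    return FZ, LZ

def can_discard(zones_u, FZ, LZ):
    # zones_u[j]: zone of u w.r.t. permutant j, here assumed in
    # permutant-identifier order (recoverable from Z_u via Pi_u).
    return any(z < FZ[j] or z > LZ[j] for j, z in enumerate(zones_u))
```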

Note that we can simulate \(kN\!N(q)_d\) queries with range queries whose initial radius is \(\infty \) and whose radius shrinks to the final one as the search process progresses.

Improving the Permutation Distance. Since we have more information per object (its permutation and the corresponding zones), we can use the zone information to improve the reviewing order when solving a similarity query. We propose several strategies, as follows (see the sketch after this list). Let us define \(Z_{D} (Z_{u},Z_{v}) = \sum _{j=1}^{|\mathbb {P}|}|Z_u(j) - Z_v(j)|\); that is, \(Z_{D}\) accumulates the absolute differences of the zone identifiers of the objects u and \(v \in \mathbb {X}\) over all the permutants. With this, we define the following variants:

  • PsF : Computing the Spearman Footrule \(S_{F}\) alone, as in the plain PBI; see Eq. (1).

  • SPZ : It is the sum of \(S_F(\varPi _u, \varPi _q)\) and \(Z_{D}(Z_{u}, Z_{q})\).

  • PZ : We compute \(S_F(\varPi _u, \varPi _q)\) and \(Z_{D}(Z_{u}, Z_{q})\), separately. Next, we sort by \(S_{F}\) in increasing order and use \(Z_{D}\) to break ties.

  • ZP : Analogous to PZ, but we sort by \(Z_{D}(Z_{u}, Z_{q})\) and break ties with \(S_F(\varPi _u, \varPi _q)\).
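A minimal sketch of the four orders (our own illustrative code; s_f computes the Spearman Footrule of Eq. (1)):

```python
def z_d(zu, zq):
    # Z_D: sum of absolute zone differences over all permutants.
    return sum(abs(a - b) for a, b in zip(zu, zq))

def review_order(entries, perm_q, zones_q, s_f, strategy):
    # entries: (object, perm_u, zones_u) tuples taken from the index.
    keys = {
        "PsF": lambda e: s_f(e[1], perm_q),
        "SPZ": lambda e: s_f(e[1], perm_q) + z_d(e[2], zones_q),
        "PZ":  lambda e: (s_f(e[1], perm_q), z_d(e[2], zones_q)),
        "ZP":  lambda e: (z_d(e[2], zones_q), s_f(e[1], perm_q)),
    }
    return sorted(entries, key=keys[strategy])  # tuples break ties naturally
```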

Computing Radii. Another important issue is how to establish the radii of the concentric zones. The radii can be assigned in one of the following two ways (see the sketch after this list):

  • Uniformly distributed by distances (REQ). We obtain a sample of objects and compute their distances to the permutant. This gives us a good approximation of the distribution of distances with respect to that permutant. Let \(r_{max}\) and \(r_{min}\) be the maximum and minimum distances found in the sample for that permutant. Then, we use \((r_{max} - r_{min}) / \zeta \) as the increment from one radius to the next.

  • Uniformly distributed by elements (EEQ). Once again, we obtain a sample of objects and compare them with the permutant. Then, we sort the distances and pick the radii at regular intervals (every fraction \(\frac{1}{\zeta }\) of the sample size).
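Both policies reduce to choosing \(\zeta - 1\) cut points from a sample of distances to the permutant, as in this sketch (illustrative code):

```python
def radii_req(sample_dists, zeta):
    # REQ: radii uniformly spaced between the min and max sampled distances.
    r_min, r_max = min(sample_dists), max(sample_dists)
    step = (r_max - r_min) / zeta
    return [r_min + i * step for i in range(1, zeta)]

def radii_eeq(sample_dists, zeta):
    # EEQ: radii chosen so each zone receives about the same number of objects.
    s = sorted(sample_dists)
    return [s[len(s) * i // zeta] for i in range(1, zeta)]
```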

4 Experimental Evaluation

We tested our contribution with two kinds of databases: uniformly distributed vectors in the unit cube of dimensions 8 to 32, and NASA images. The first one allows us to study the performance of the search algorithm and how the parameters affect the results. The second one gives us a good idea of how our technique works in real life.

4.1 Unit Cube

We use a dataset of 80,000 vectors uniformly distributed in the unit cube under the Euclidean distance. We tested \(1N\!N(q)_d\) queries with a query set of size 100, in dimensions 8 to 32.

In Fig. 2, we show the performance of the PZBI using 8 permutants in dimension 8, with the REQ partitioning. In Fig. 2(a), the y-axis shows the average percentage of elements retrieved by the \(1N\!N(q)_d\) as we review an increasing portion of the database (x-axis). Figure 2(b) shows the impact of the number of zones on the performance of the PZBI. Notice that we can process \(1N\!N(q)_{d}\) queries faster than the pivot-based algorithm; for instance, with 8 zones and EEQ, the PZBI requires up to 81 % fewer distance evaluations. Pivots are represented with a horizontal line just to allow a visual comparison of the results. In this plot, we can also notice that the strategy SPZ is better than PZ or ZP.

Fig. 2. Searching performance in the space of vectors of dimension 8, using 8 permutants/pivots.

In Fig. 3, we consider dimensions 8, 16 and 32. We show that the PBI is beaten by the PZBI using 8 zones and SPZ in low dimensional spaces, while the PBI becomes better in high dimensions. The pivot-based algorithm in dimension 16 is out of the plot, with almost 54,000 distance evaluations. In this figure we use \(m^{*}\) permutants, and we show that in high dimensions it is better to use more permutants, whereas in low dimensions it is better to split the space with zones.

Fig. 3. Searching performance in the space of vectors, with 8 zones, using SPZ, for dimensions 8, 16 and 32.

4.2 NASA Images

We use a dataset of 40,150 20-dimensional feature vectors generated from images downloaded from NASA, where duplicate vectors were eliminated. We also use the Euclidean distance.

In Fig. 4, we use 8 and 16 permutants with 8 zones (Fig. 4(a) and (b), respectively). Notice that using the strategy SPZ we obtain an excellent improvement in distance evaluations. In this figure, we compare our results with AESA (see Sect. 2.1) and its quantized version BAESA (see Sect. 2.3). Notice that our PZBI needs \(\frac{1}{3}\) of the distance computations used by AESA, as can be seen when using 8 zones, EEQ space partitioning and SPZ. AESA and pivots are represented with horizontal lines in order to simplify the visual comparison.

Fig. 4. NASA image dataset. Changing the number of zones and the impact of the selection of REQ or EEQ.

In Fig. 5, we test \(kN\!N(q)_{d}\) queries, varying \(k \in [1,10]\) (that is, increasing the size of the query answer) and the number of zones \(\zeta \in [8, 32]\), with two different numbers of permutants, \(m = |\mathbb {P}| = 8\) and 64. In Fig. 5(a), we notice that using \(m=8\) permutants we compute less than 50 % of the distance evaluations of AESA for \(1N\!N(q)_{d}\), and just 3 times as many for \(10N\!N(q)_{d}\), while using significantly less space in the index. We also notice that with just 8 permutants and 8 zones, we can retrieve up to \(4N\!N(q)_{d}\) faster than AESA. On the other hand, using \(m=64\) permutants with 32 zones (5 bits per element) we compute just 25 % of the distance evaluations of the pivot technique even in the hardest case.

In Fig. 5(b), we use 64 permutants and obtain an even better performance. For example, with 32 zones, 64 permutants and REQ, we use just 10 % of the distance evaluations of AESA, that is to say, 90 % less than the reference algorithm for metric spaces.

Fig. 5. NASA image dataset. \(kN\!N(q)_{d}\) queries, varying k.

In both plots (Figs. 4 and 5), our technique beats AESA. This is really surprising, as AESA has long been considered the lower bound for exact searching in metric spaces.

5 Conclusions and Future Work

The basic permutation-based index (PBI) has shown an excellent performance in high dimensional metric spaces. Its main drawback is that it does not allow safely discarding objects when solving similarity queries. Our contribution consists in granting the basic PBI this capability. For this sake, we enhance the PBI with distance information in a quantized form.

Our technique dramatically improves the retrieval of the PBI in low and medium dimensional metric spaces.

To illustrate the benefits of our technique, in the real-world metric space of NASA images the PBI with quantized distances is capable of beating the AESA search algorithm. As a matter of fact, with 32 zones and 64 permutants we use just 10 % of the distance evaluations of AESA, that is to say, 90 % less than the reference algorithm for metric spaces.

Future Work. We plan to develop a searching algorithm to efficiently solve k-nearest neighbor queries based on our PBI with quantized distances.

Another avenue is to develop efficient mechanisms to avoid sorting the whole set of non-discarded objects.