1 Introduction

Recently, surveillance systems have become ubiquitous in public places such as airports, railway stations, college campuses, and office buildings [1]. These systems contain large numbers of cameras that produce huge amounts of video data. Analyzing the video data obtained from a surveillance system often requires the ability to track people across multiple cameras, so person re-identification (Re-ID) has attracted increasing interest [2,3,4]. Re-ID is the process of establishing correspondences between images of a person taken by different cameras. In the past five years, a large number of models have been proposed for Re-ID systems [5, 6]; they fall broadly into two categories: (1) designing discriminative, descriptive and robust visual descriptors to characterize a person's appearance [7, 8]; and (2) learning suitable distance metrics that maximize the chance of a correct correspondence [9,10,11]. In this paper, we focus on the second category: given a set of features extracted from each person image, we seek to quantify and differentiate these features by learning the distance measure that is most likely to give correct matches.

Many metric learning algorithms have been proposed to learn distance or similarity functions over person re-identification features. For example, Pedagadi et al. applied Local Fisher Discriminant Analysis (LFDA) [12] to the person re-identification problem. The authors of [13] introduced the KISSME method, which learns from equivalence constraints from a statistical inference perspective. Dikmen et al. proposed a metric learning framework that obtains a robust Mahalanobis metric for Large Margin Nearest Neighbour classification with rejection (LMNN) [14]. The authors of [15] presented an information-theoretic approach to learning a Mahalanobis distance function. Zheng et al. proposed the Relative Distance Comparison (RDC) approach [16], which maximizes the likelihood that a pair of true matches has a smaller distance than a wrongly matched pair in a soft discriminant manner. Chen [17] formulated an asymmetric distance model that learns camera-specific projections to transform the unmatched features of each view into a common space. Chen [18] presented a relevance metric learning method with listwise constraints (RMLLC) that adopts listwise similarities based on predefined similarity lists.

However, the existing Re-ID methods discussed above share two shortcomings.

  (1) As is well known, person re-identification datasets usually contain far more inter-class image pairs than intra-class ones. Such imbalanced data can drastically affect the modelling of machine learning methods.

  (2) The training set consists of images of matched people across different camera views. Capturing the large intra-class and inter-class variations therefore yields a large-scale learning problem that challenges existing machine learning algorithms.

To address these problems, we propose a novel relative distance metric learning method based on clustering centralization and projection vector learning for person re-identification (RDML-CCPVL). First, when traditional metric learning methods construct the training set, there are usually far more inter-class instances than intra-class ones, and constructing the counterexamples of each instance requires computing its distances to all other instances, which greatly increases the training time. Using FCM [19], we divide the counterexamples of each instance into clusters, which reduces their number while retaining the important structural information, thereby relieving the overfitting problem caused by class imbalance.
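As an illustration, the following Python sketch reduces the counterexamples of each instance to FCM cluster centers and builds the pairwise distance vectors. The element-wise absolute difference follows RDC [16]; the function names (`fcm_centers`, `build_pairs`) and the per-anchor clustering strategy are our own illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def fcm_centers(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means (FCM); returns the c cluster centers of X (rows = samples)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)                   # fuzzy memberships, rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # weighted means of the samples
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)    # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            return centers
        U = U_new
    return centers

def build_pairs(X, y, c=2):
    """For each anchor x_t: distance vectors to its intra-class mates (positives)
    and to the FCM centers of its counterexamples (cluster-centralized negatives).
    X: (m, d) feature array; y: (m,) integer identity labels."""
    pairs = []
    for t in range(len(X)):
        idx = np.flatnonzero(y == y[t])
        centers = fcm_centers(X[y != y[t]], c)          # c centers instead of all negatives
        d_pos = [np.abs(X[t] - X[i]) for i in idx if i != t]
        d_neg = [np.abs(X[t] - z) for z in centers]     # element-wise |.| as in RDC [16]
        pairs.append((d_pos, d_neg))
    return pairs
```

Clustering the complement set once per anchor is done here for clarity; caching one clustering per identity would be considerably cheaper.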

Second, traditional projection matrix learning methods usually incur high storage and computational costs. In this work, we decompose the projection matrix into low-rank components via eigenvalue decomposition. Under our iterative optimization method, updating the distance vectors of the instance features only requires learning one new projection vector from the updated training data at each step, stopping once sufficient accuracy is achieved. The conjugate gradient method is used to learn each projection vector and only needs to compute the initial gradient once; for a quadratic function, it converges to the target precision quickly owing to its quadratic termination property. Our method therefore effectively reduces both computational complexity and storage. In addition, our algorithm approximately preserves the orthogonality of the vectors obtained from the eigenvalue decomposition.

2 Relative Distance Metric Learning Based on Clustering Centralization and Projection Vectors Learning

2.1 Learning Function Based on Clustering Centralization

The person re-identification problem can be cast as a distance comparison problem [16]. Suppose we have a training set of m pedestrian images with feature set \(\mathbf{X} = \{x_i, i = 1,\dots,m\}\), \(x_i \in \mathbb {R}^d\), where d is the dimension of the features, and label set \(\mathbf{Y} = \{y_i, i = 1,\dots,m\}\). For an instance \(x_a\) of pedestrian A, we want to find another instance \(x_b\) of pedestrian A captured elsewhere in space and time by learning the distance \(dis(x_a, x_b)\) between the two instances. The distances between intra-class instances should generally be smaller than the distances between inter-class instances, i.e. \(dis(x_a, x_b) < dis(x_a, x_c)\), where \(x_c\) is an instance of any pedestrian other than A. We therefore construct a pairwise dataset \(\mathbb {S} = \{ \mathbb {S}_t = ({\varvec{d}}_t^{pos},{\varvec{d}}_t^{neg})\} _{t = 1}^m\) to describe the distances between instances, where \({\varvec{d}}_t^{pos}\) is the distance of \(x_t\) to intra-class instances and \({\varvec{d}}_t^{neg}\) is the distance of \(x_t\) to inter-class instances.

However, the number of inter-class instances is much larger than the number of intra-class instances, so the pairwise dataset \(\mathbb {S} = \{ \mathbb {S}_t = ({\varvec{d}}_t^{pos},{\varvec{d}}_t^{neg})\} _{t = 1}^m\) is class-imbalanced. Training with such a dataset leads to overfitting of the inter-class instances and underfitting of the intra-class instances, which degrades the performance of the learning algorithm. We choose FCM [19] to alleviate this class imbalance. The principle of our clustering centralization is shown in Fig. 1: the traditional construction of training datasets [16] (green lines) includes far more inter-class instances, whereas our method (red lines) builds the training dataset from cluster centers, which effectively alleviates the class imbalance of existing Re-ID algorithms. A further advantage of our clustering centralization is that the number of clusters can be adjusted per Re-ID dataset to obtain optimal performance, as discussed in Sect. 4.3. After clustering centralization, we have \(\mathbb {S} = \{ \mathbb {S}_t = ({\varvec{d}}_t^{pos},\bar{{\varvec{d}}}_t^{neg})\} _{t = 1}^m\), where \(\bar{{\varvec{d}}}_t^{neg}\) is the distance of \(x_t\) to the centers of the clusters of its counterexamples. To maximize the inter-class distances and minimize the intra-class distances simultaneously, we formulate the following minimization problem:

$$\begin{aligned} \min _g \sum _{t = 1}^{\left| \mathbb {S} \right| } \left( g({\varvec{d}}_t^{pos}) - g(\bar{{\varvec{d}}}_t^{neg})\right) , \end{aligned}$$
(1)

where \(g(\cdot )\) is a distance function. The function in Eq. (1) is unbounded, so it cannot guarantee convergence during iteration. We therefore pass the difference through a continuous sigmoid function as follows:

$$\begin{aligned} P(\mathbb {S}_t) = \left( 1 + \exp \left( g({\varvec{d}}_t^{pos}) - g(\bar{{\varvec{d}}}_t^{neg})\right) \right) ^{-1}. \end{aligned}$$
(2)

Fig. 1. The principle of the clustering centralization (Color figure online)

For computational convenience, Eq. (1) is then transformed into a logistic form as follows:

$$\begin{aligned} F(g) = \sum _{t = 1}^{\left| \mathbb {S} \right| } \log \left( 1 + \exp \left( g({\varvec{d}}_t^{pos}) - g(\bar{{\varvec{d}}}_t^{neg})\right) \right) . \end{aligned}$$
(3)

We can see that minimizing Eq. (1) drives the probability in Eq. (2) toward one, and that maximizing Eq. (2) is equivalent to minimizing Eq. (3).

Because the Mahalanobis matrix of the Mahalanobis distance function has good projection and learning properties, we choose the Mahalanobis distance as the distance function \(g\):

$$\begin{aligned} g({\varvec{d}}_t^{pos}) = {({\varvec{d}}_t^{pos})^T}\mathbf{M }({\varvec{d}}_t^{pos}), \end{aligned}$$
(4)

where M is a positive semi-definite matrix. Our goal is to learn M in Eq. (4) by minimizing the function defined in Eq. (3). By performing eigenvalue decomposition on M, we obtain \(\mathbf{M} = \mathbf{P}\mathbf{P}^T\), where P is a matrix with orthogonal columns. The number of orthogonal basis vectors may be smaller than the rank of M; therefore \(\mathbf{P} \in \mathbb {R}^{d \times d'}\) can be regarded as a dimension-reduction matrix, where \(d'\) is the number of orthogonal basis vectors after dimension reduction. Equation (4) can then be rewritten as follows:

$$\begin{aligned} g({\varvec{d}}_t^{pos}) = {({\varvec{d}}_t^{pos})^T}\mathbf{M }({\varvec{d}}_t^{pos}) = {({\varvec{d}}_t^{pos})^T}\mathbf{P }{\mathbf{P }^T}({\varvec{d}}_t^{pos}) = {\left\| {{\mathbf{P }^T}{\varvec{d}}_t^{pos}} \right\| ^2}. \end{aligned}$$
(5)

In addition, on a small dataset the function in Eq. (3) may overfit. To alleviate the risk of overfitting and to ensure the sparsity of the projection matrix, we introduce the regularization term \(r{\left\| \mathbf P \right\| ^2}\), where r is the regularization factor. The objective function can then be formulated as:

$$\begin{aligned} F(\mathbf{P }) = \sum _{t = 1}^{\left| \mathbb {S} \right| } \log \left( 1 + \exp \left( {\left\| {\mathbf{P }^T{\varvec{d}}_t^{pos}} \right\| ^2} - {\left\| {\mathbf{P }^T\bar{{\varvec{d}}}_t^{neg}} \right\| ^2}\right) \right) + r{\left\| \mathbf{P } \right\| ^2}. \end{aligned}$$
(6)
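To make Eq. (6) concrete, here is a minimal Python sketch of the objective, assuming the positive and negative distance vectors have been stacked into row-aligned arrays; `objective` is our illustrative name, not code from the paper.

```python
import numpy as np

def objective(P, d_pos, d_neg, r=0.01):
    """Eq. (6): logistic relative-distance loss under projection P plus r * ||P||^2.

    P      : (d, d') projection matrix with M = P P^T.
    d_pos  : (T, d) intra-class distance vectors.
    d_neg  : (T, d) cluster-center distance vectors, row-aligned with d_pos.
    """
    gp = np.sum((d_pos @ P) ** 2, axis=1)       # ||P^T d_t^pos||^2 for every t
    gn = np.sum((d_neg @ P) ** 2, axis=1)       # ||P^T d_t^neg||^2
    loss = np.logaddexp(0.0, gp - gn).sum()     # numerically stable log(1 + exp(.))
    return loss + r * np.sum(P ** 2)            # squared Frobenius regularizer
```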

2.2 An Iterative Optimization Algorithm for Projection Vector Learning

In this paper, we adopt an iterative optimization method similar to that of [16] to learn an optimal P.

Step 1. Assume that after \(l-1\) iterations a set of orthogonal vectors \(\varvec{p}_1,\varvec{p}_2,\dots ,\varvec{p}_{l-1}\) has been learned. To learn the next vector \(\varvec{p}_l\), let

$$\begin{aligned} {\varvec{d}}_t^{s,l} = {\varvec{d}}_t^{s,l - 1} - \frac{{{\varvec{p}_{l - 1}}{\varvec{p}}_{l - 1}^T}}{{{{\left\| {{\varvec{p}_{l - 1}} + u} \right\| }^2}}}{\varvec{d}}_t^{s,l - 1}, \end{aligned}$$
(7)

where \(s \in \{ pos,neg\} \) and \(t = 1,\dots ,\left| \mathbb {S} \right| \).
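A minimal sketch of this deflation step follows; since the paper specifies only that u is a small perturbation, we treat it here as a small scalar added elementwise, which is an assumption on our part.

```python
import numpy as np

def deflate(d_vectors, p_prev, u=1e-4):
    """Eq. (7): remove from each distance vector its component along p_{l-1}.

    The perturbation u keeps ||p_{l-1} + u||^2 slightly larger than ||p_{l-1}||^2,
    so the removal is deliberately inexact and the projection spaces stay related.
    """
    denom = np.sum((p_prev + u) ** 2)           # ||p_{l-1} + u||^2, u broadcast as a scalar
    coeff = d_vectors @ p_prev / denom          # p_{l-1}^T d_t^{s,l-1} for every row
    return d_vectors - np.outer(coeff, p_prev)  # the updated d_t^{s,l}
```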

Step 2. After obtaining \({\varvec{d}}_t^{s,l}\) from Eq. (7), we use the conjugate gradient method to learn the lth projection vector by minimizing Eq. (8):

$$\begin{aligned} f({\varvec{p}}_l) = \sum _{t = 1}^{\left| \mathbb {S} \right| } \log \left( 1 + \exp \left( ({\varvec{p}}_l^T{\varvec{d}}_t^{pos,l})^2 - ({\varvec{p}}_l^T{\varvec{d}}_t^{neg,l})^2\right) \right) + r{\left\| {\varvec{p}}_l \right\| ^2}. \end{aligned}$$
(8)

The projection vector after the kth iteration is updated as follows:

$$\begin{aligned} {\varvec{p}}_l^{k + 1} = {\varvec{p}}_l^k + {\alpha _k}{{\varvec{q}}_k}, \end{aligned}$$
(9)

where \({\alpha _k}\) is obtained by minimizing \(f({\varvec{p}}_l^k + \alpha {{\varvec{q}}_k})\) with a one-dimensional exact line search, \({{\varvec{q}}_k}\) is the search direction at the kth iteration, and the conjugate direction is computed with the Polak–Ribière–Polyak (PRP) formula:

$$\begin{aligned} {{\varvec{q}}_k} = - {g_k} + {\beta _{k - 1}}{{\varvec{q}}_{k - 1}},\quad {\beta _{k - 1}} = {\left\{ \begin{array}{ll} \frac{g_k^T({g_k} - {g_{k - 1}})}{g_{k - 1}^T{g_{k - 1}}}, &{} k > 1,\\ 0, &{} k = 1, \end{array}\right. } \end{aligned}$$
(10)

where \({g_k}\) is the gradient of f at \({\varvec{p}}_l^k\):

$$\begin{aligned} g_k = \nabla f({\varvec{p}}_l^k) = \sum _{t = 1}^{\left| \mathbb {S} \right| } \frac{2\left( \left( ({\varvec{p}}_l^k)^T{\varvec{d}}_t^{pos,l}\right) {\varvec{d}}_t^{pos,l} - \left( ({\varvec{p}}_l^k)^T{\varvec{d}}_t^{neg,l}\right) {\varvec{d}}_t^{neg,l}\right) }{1 + \exp \left( \left( ({\varvec{p}}_l^k)^T{\varvec{d}}_t^{neg,l}\right) ^2 - \left( ({\varvec{p}}_l^k)^T{\varvec{d}}_t^{pos,l}\right) ^2\right) } + 2r{\varvec{p}}_l^k. \end{aligned}$$
(11)

If \(|f({\varvec{p}}_l^k) - f({\varvec{p}}_l^{k - 1})| < {\varepsilon _g}\), the iteration is terminated.
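The following Python sketch shows the conjugate gradient loop of Eqs. (9) and (10) together with the stopping test above; the one-dimensional search is delegated to `scipy.optimize.minimize_scalar`, and the function is a generic sketch rather than the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cg_prp(f, grad, p0, eps_g=1e-6, max_iter=50):
    """Conjugate gradient with the PRP formula: Eq. (9) update, Eq. (10) direction."""
    p = p0
    g = grad(p)
    q = -g                                      # k = 1: steepest-descent direction
    f_prev = f(p)
    for _ in range(max_iter):
        alpha = minimize_scalar(lambda a: f(p + a * q)).x   # 1-D search for alpha_k
        p = p + alpha * q                       # Eq. (9)
        g_new = grad(p)
        beta = g_new @ (g_new - g) / (g @ g)    # PRP coefficient beta_{k-1}, Eq. (10)
        q = -g_new + beta * q
        g = g_new
        f_cur = f(p)
        if abs(f_cur - f_prev) < eps_g:         # termination: |f(p^k) - f(p^{k-1})| < eps_g
            break
        f_prev = f_cur
    return p
```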

The initial value of \({{\varvec{p}}_l}\) is set as follows:

(12)

Unlike the iterative optimization algorithm of [16], a small perturbation term u is added in Eq. (7), which keeps the successive projection spaces related to each other and makes the method better suited to real-world learning problems.

According to Eqs. (9) and (11), \({{\varvec{p}}_l} \in span\{ {\varvec{d}}_i^{s,l}\}\), where \(span\{ {\varvec{d}}_i^{s,l}\} \) is the range space of \(\{ {\varvec{d}}_i^{pos,l}\} \cup \{ {\varvec{d}}_i^{neg,l}\} \), \(s \in \{ pos,neg\}\), \(i \in 1,\dots ,\left| \mathbb {S} \right| \). According to Eq. (7), \({\varvec{p}}_j^T{\varvec{d}}_i^{s,j + 1} \approx 0\) for \(j = 1,\dots ,l - 1\), and \(span\{ {\varvec{d}}_i^{s,l}\} \subseteq span\{ {\varvec{d}}_i^{s,l - 1}\} \subseteq \dots \subseteq span\{ {\varvec{d}}_i^{s,0}\} \); hence \({{\varvec{p}}_l}\) and \({{\varvec{p}}_j}\), \(j = 1,\dots ,l - 1\), are approximately orthogonal.

3 Learning Algorithm for RDML-CCPVL

Based on the above iterative optimization, the learning algorithm of the proposed RDML-CCPVL is presented in Algorithm 1.

Algorithm 1. The learning algorithm of RDML-CCPVL
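Since the algorithm listing appears as a figure in the original, here is a compact Python sketch of how the pieces fit together, reusing the illustrative helpers sketched above (`objective`, `deflate`, `cg_prp`); the random initialization and the numerical gradient are stand-ins for the paper's Eqs. (12) and (11).

```python
import numpy as np

def numerical_grad(f, p, h=1e-6):
    """Central-difference gradient; a stand-in for the closed-form Eq. (11)."""
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

def rdml_ccpvl(d_pos, d_neg, n_vectors=20, r=0.01, seed=0):
    """Sketch of Algorithm 1 on a pre-built, row-aligned pairwise set:
    learn projection vectors one at a time, deflating the data after each."""
    rng = np.random.default_rng(seed)
    dp, dn, P = d_pos.copy(), d_neg.copy(), []
    for _ in range(n_vectors):
        f = lambda p, dp=dp, dn=dn: objective(p[:, None], dp, dn, r)  # Eq. (8)
        p0 = rng.standard_normal(dp.shape[1])   # stand-in init; the paper uses Eq. (12)
        p = cg_prp(f, lambda p, f=f: numerical_grad(f, p), p0)
        p /= np.linalg.norm(p)                  # keep unit length, as after eigendecomposition
        P.append(p)
        dp, dn = deflate(dp, p), deflate(dn, p)  # Eq. (7): move to the next subspace
    return np.column_stack(P)                    # columns of P, with M = P P^T
```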

4 Experiments and Analysis

4.1 Datasets and Experimental Setting

Six popular datasets are selected for our experiments: VIPeR [20], CUHK01 [21], 3DPeS [22], CAVIAR4REID [23], Town Centre [24] and Market-1501 [25]. These datasets exhibit different problems such as viewing angle, lighting, occlusion and low resolution. For each image, we use six types of feature descriptor: RGB, YCbCr, HSV, Lab, YIQ and Gabor [26]. We then use PCA to compress them into 2688-dimensional feature vectors.

Fig. 2. Illustration of the different datasets; from left to right: VIPeR, CUHK01, 3DPeS, CAVIAR4REID, Town Centre and Market-1501

We compare the proposed RDML-CCPVL with seven existing person re-identification methods: ITML [15], LMNN [14], KISSME [13], PRDC [16], LFDA [12], CVDCA [17] and RMLLC [18]. The performance of all methods is evaluated in terms of the cumulative matching characteristic (CMC), a standard measurement for Re-ID. The CMC curve represents the probability of finding the correct match within the top r ranks of the gallery, with r varying from 1 to 30. To evaluate the efficiency of our algorithm, we also compare its training time against that of the existing algorithms. Since the number of clusters c used for clustering centralization is an important parameter of our algorithm, the Normalized Discounted Cumulative Gain (NDCG) [27] is chosen to evaluate the performance of RDML-CCPVL with a varying number of clusters c. We run all the benchmarked algorithms in MATLAB 7 on a 1.90 GHz machine with 8 GB RAM.
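For reference, a minimal sketch of a single-shot CMC evaluation under the learned projection is given below; each probe is assumed to have exactly one true match in the gallery, and the function name and signature are ours.

```python
import numpy as np

def cmc_curve(probe, gallery, probe_ids, gallery_ids, P, max_rank=30):
    """CMC: fraction of probes whose true match appears in the top r candidates.

    Ranking by ||P^T (x - z)|| is equivalent to ranking by the learned
    Mahalanobis distance with M = P P^T.
    """
    Zp, Zg = probe @ P, gallery @ P             # project both camera views
    hits = np.zeros(max_rank)
    for i, z in enumerate(Zp):
        d = np.linalg.norm(Zg - z, axis=1)      # distances to all gallery images
        order = gallery_ids[np.argsort(d)]      # gallery identities, nearest first
        rank = int(np.flatnonzero(order == probe_ids[i])[0])
        if rank < max_rank:
            hits[rank:] += 1                    # a match at rank k counts for all r >= k
    return hits / len(probe)                    # CMC(r) for r = 1..max_rank
```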

4.2 Comparisons of Performance on Different Re-ID Datasets

Comparisons of the CMC and training time on the different Re-ID datasets are shown in Tables 1, 2, 3, 4, 5 and 6, each with the optimal value of c. Since clustering centralization is used to solve the imbalanced-data problem of the training sets, our algorithm is considerably superior to the other algorithms: the CMC of RDML-CCPVL is at least 3.2%–4.5% higher than the other algorithms at rank 30. Because clustering centralization also reduces the size of the training sets, the training time of RDML-CCPVL is much lower than that of the other algorithms, with the exception of KISSME, which is a one-pass metric learning method.

Table 1. Comparisons of performance on dataset VIPeR with \(c=2\) (%)
Table 2. Comparisons of performance on dataset CUHK01 with \(c=2\) (%)
Table 3. Comparisons of performance on dataset 3DPeS with \(c=10\) (%)
Table 4. Comparisons of performance on dataset CAVIAR4REID with \(c=15\) (%)
Table 5. Comparisons of performance on dataset Town Centre with \(c=2\) (%)
Table 6. Comparisons of performance on dataset Market-1501 with \(c=2\) (%)

4.3 Experiments with Varying Number of Clusters

In this section, we report how the performance of RDML-CCPVL changes as it is run on the different datasets with a varying number of clusters c. VIPeR contains two images per person and CUHK01 contains four, so we use two clusters per person for these two datasets. From Fig. 3 we can see that the proposed RDML-CCPVL is considerably sensitive to c on all datasets except Town Centre. The performance on 3DPeS, CAVIAR4REID and Market-1501 improves as c increases, because more clusters capture more information about the datasets. However, when c becomes too large, the curves in Fig. 3 start to go down: c has an optimal range for each dataset, and too many clusters no longer generate useful distribution information. Town Centre contains only images from video sequences, so the number of clusters of intra-class instances of the same person makes little difference; the proposed algorithm achieves its optimal performance on Town Centre with \(c=2\).

Fig. 3. Comparison of the performance with a varying number of clusters c: (a) 3DPeS; (b) CAVIAR4REID; (c) Town Centre; (d) Market-1501.

5 Conclusion

In this work, we proposed a relative distance metric learning algorithm based on clustering centralization and projection vector learning for the person re-identification problem. We have shown that clustering centralization improves both the performance and the efficiency of person re-identification while reducing computational complexity and storage, and that the conjugate gradient method is effective for projection vector learning. The proposed approach shows a significant improvement over existing algorithms.