Abstract
The fundamental difficulty in person re-identification (ReID) lies in learning the correspondence among individual cameras. It strongly demands costly inter-camera annotations, yet the trained models are not guaranteed to transfer well to previously unseen cameras. These problems significantly limit the application of ReID. This paper rethinks the working mechanism of conventional ReID approaches and puts forward a new solution. With an effective operator named Camera-based Batch Normalization (CBN), we force the image data of all cameras to fall onto the same subspace, so that the distribution gap between any camera pair is largely shrunk. This alignment brings two benefits. First, the trained model enjoys better abilities to generalize across scenarios with unseen cameras as well as transfer across multiple training sets. Second, we can rely on intra-camera annotations, which have been undervalued before due to the lack of cross-camera information, to achieve competitive ReID performance. Experiments on a wide range of ReID tasks demonstrate the effectiveness of our approach. The code is available at https://github.com/automan000/Camera-based-Person-ReID.
1 Introduction
Person re-identification (ReID) aims at matching identities across disjoint cameras. Generally, it is achieved by mapping images from the same and different cameras into a feature space, where features of the same identity are closer than those of different identities. Learning the relations between identities from all cameras involves two different objectives: learning identity relations within the same camera and learning identity relations across cameras.
However, there is an inconsistency between these two objectives. As shown in Fig. 1(a), due to the large appearance variation caused by illumination conditions, camera views, etc., images from different cameras are subject to distinct distributions. Handling the distribution gap between cameras is crucial for inter-camera identity matching, yet learning within a single camera is much easier. As a consequence, conventional ReID approaches mainly focus on associating different cameras, which demands costly inter-camera annotations. Besides, after learning on a training set, part of the learned knowledge is strongly correlated with the connections among these particular cameras, making the model generalize poorly in scenarios consisting of unseen cameras. As shown in Fig. 1(b), a ReID model learned on one dataset often has a limited ability to describe images from other datasets, i.e., its generalization ability across datasets is limited. For simplicity, we denote this formulation, which neglects within-dataset inconsistencies, as the dataset-based formulation. We emphasize that the inability to bridge the distribution gap between all cameras from all datasets leads to two problems: unsatisfying generalization ability and excessive dependence on inter-camera annotations. To tackle these problems simultaneously, we propose to align the distributions of all cameras explicitly. As shown in Fig. 1(c), we eliminate the distribution inconsistency between all cameras, so the ReID knowledge can always be learned, accumulated, and verified under the same input distribution, which facilitates generalization across different ReID scenarios. Moreover, with the distributions of all cameras aligned, intra- and inter-camera annotations can be regarded as the same thing, i.e., labeling image relations under the same input distribution. This allows us to approximate the effect of inter-camera annotations with only intra-camera annotations, which may relieve the exhaustive human labor required for costly inter-camera annotations.
We denote our solution, which disassembles ReID datasets and aligns each camera independently, as the camera-based formulation. We implement it via an improved version of Batch Normalization (BN) [9] named Camera-based Batch Normalization (CBN). In training, CBN disassembles each mini-batch and standardizes the corresponding input according to its camera labels. In testing, CBN utilizes a few samples to approximate the BN statistics of every testing camera and standardizes the input to the training distribution. In practice, multiple ReID tasks benefit from our work, such as fully-supervised learning [1, 36, 51, 53, 54, 58], direct transfer [8, 21], domain adaptation [3, 4, 33, 41, 52, 57], and incremental learning [12, 15, 28]. Extensive experiments indicate that our method improves the performance of these tasks simultaneously, e.g., \(0.9\%\), \(5.7\%\), and \(14.2\%\) average Rank-1 accuracy improvements on fully-supervised learning, domain adaptation, and direct transfer, respectively, and \(9.7\%\) less forgetting on Rank-1 accuracy for incremental learning. Last but not least, even without inter-camera annotations, a weakly-supervised pipeline [60] with our formulation can achieve competitive performance on multiple ReID datasets, which suggests that intra-camera annotations may have been undervalued in the previous literature. To conclude, our contribution is three-fold:
- In this paper, we emphasize the importance of aligning the distribution of all cameras and propose a camera-based formulation. It can learn discriminative knowledge for ReID tasks while excluding training-set-specific information.
- We implement our formulation with Camera-based Batch Normalization. It facilitates the generalization and transfer ability of ReID models across different scenarios and makes better use of intra-camera annotations. It provides a new solution for ReID tasks without costly inter-camera annotations.
- Experiments on fully-supervised, weakly-supervised, direct transfer, domain adaptation, and incremental learning tasks validate our method, which confirms the universality and effectiveness of our camera-based formulation.
2 Related Work
Our formulation aligns the distribution per camera. In training, it eliminates the distribution gap between all cameras. ReID models can treat intra- and inter-camera annotations equally and make better use of them, which benefits both fully-supervised and weakly-supervised ReID tasks. It also guarantees that the distribution of each testing camera is aligned to the same training distribution. Thus, the knowledge can better generalize and transfer across datasets. It helps direct transfer, domain adaptation, and incremental learning. In this section, we briefly categorize and summarize previous works on the above ReID topics.
Supervision. The supervision in ReID tasks is usually in the form of identity annotations. Although there are many outstanding unsupervised methods [44,45,46,47] that do not need annotations, it is usually hard for them to achieve performance competitive with supervised ReID methods. For better performance, many previous methods [1, 11, 36, 42, 51, 53, 54, 58] utilized fully-supervised learning, in which identity labels are annotated manually across all training cameras. Many of them designed spatial alignment [34, 37, 49], visual attention [13, 19], and semantic segmentation [11, 31, 38] for extracting accurate and fine-grained features. GAN-based methods [10, 20, 23] were also utilized for data augmentation. However, although these methods achieved remarkable performance on ReID tasks, they required costly inter-camera annotations. To reduce the cost of human labor, ReID researchers began to investigate weakly-supervised learning. SCT [48] presumes that each identity appears in only one camera. In ICS [60], an intra-camera supervision task is studied in which an identity could have different labels under different cameras. In [17, 18], pseudo labels are used to supervise the ReID model.
Generalization. The generalization ability in ReID tasks denotes how well a trained model functions on unseen datasets, which is usually examined by direct transfer tasks. Researchers found that many fully-supervised ReID models perform poorly on unseen datasets [3, 32, 41]. To improve the generalization ability, various strategies were adopted as additional constraints to avoid over-fitting, such as label smoothing [21] and sophisticated part alignment approaches [8].
Transfer. The transfer ability in ReID tasks corresponds to the capability of ReID models transferring and preserving the discriminative knowledge across multiple training sets. There are two related tasks. Domain adaptation transfers knowledge from labeled source domains to unlabeled target domains. One solution [3, 41, 57] bridged the domain gap by transferring source images to the target image style. Other solutions [4, 6, 16, 33, 40] utilized the knowledge learned from the source domain to mine the identity relations in target domains. Incremental learning [12, 15, 28] also values the transfer ability. Its goal is to preserve the previous knowledge and accumulate the common knowledge for all seen datasets. A recent ReID work that relates to incremental learning is MASDF [43], which distilled and incorporated the knowledge from multiple datasets.
3 Methodology
3.1 Conventional ReID: Learning Camera-Related Knowledge
ReID is a task of retrieving identities according to their appearance. Given a training set consisting of disjoint cameras, learning a ReID model on it requires two types of annotations: inter-camera annotations and intra-camera annotations. The conventional ReID formulation regards a ReID dataset as a whole and learns the relations between identities as well as the connections between training cameras. Given an image \(\mathbf {I}^{\mathcal {D}_j}_i\) from any training set \(\mathcal {D}_j\), the training goal of this formulation is:

\(\min \; \mathcal {L}\left( \mathbf {g}^{\mathcal {D}_{j}}\left( \mathbf {f}^{\mathcal {D}_{j}}\left( \mathbf {I}^{\mathcal {D}_{j}}_{i}\right) \right) , \mathbf {y}^{\mathcal {D}_{j}}_{i}\right) \),   (1)

where \(\mathbf {f}^{\mathcal {D}_{j}}\left( \cdot \right) \) and \(\mathbf {g}^{\mathcal {D}_{j}}\left( \cdot \right) \) are the corresponding feature extractor and classifier for \(\mathcal {D}_{j}\), respectively, \(\mathcal {L}\left( \cdot , \cdot \right) \) is the classification loss, and \(\mathbf {y}^{\mathcal {D}_{j}}_{i}\) denotes the identity label of the image \(\mathbf {I}^{\mathcal {D}_{j}}_{i}\).
In our opinion, this formulation has three drawbacks. First, images from different cameras, even of the same identity, are subject to distinct distributions. To associate images across cameras, conventional approaches strongly demand the costly inter-camera annotations. Meanwhile, the intra-camera annotations are less exploited since they provide little information across cameras. Second, such learned knowledge not only discriminates the identities in the training set but also encodes the connections between training cameras. These connections are associated with the particular training cameras and are hard to generalize to other cameras, since the learned knowledge may not apply to the distribution of previously unseen cameras. For example, when a ReID model trained on Market-1501 is transferred to DukeMTMC-reID, it produces a poor Rank-1 accuracy of \(37.0\%\) without fine-tuning. Third, the learned knowledge is hard to preserve when the model is fine-tuned. For instance, after fine-tuning the aforementioned model on DukeMTMC-reID, the Rank-1 accuracy on Market-1501 drops by \(14.2\%\), because the model shifts to fitting the relations between the cameras in DukeMTMC-reID. Analyzing these three problems, we find that the particular relations between training cameras are their primary cause. Thus, we believe that the conventional way of handling these camera-related relations may need a redesign.
3.2 Our Insight: Towards Camera-Independent ReID
We rethink the relations between cameras. More specifically, we believe that the exclusive knowledge for bridging the distribution gap between the particular training cameras should be suppressed during training. Such knowledge is associated with the cameras in the training set and sacrifices the discriminative and generalization ability on unseen scenarios.
To this end, we propose to align the distribution of all cameras explicitly, so that the distribution gap between all cameras is eliminated, and much less camera-specific knowledge will be learned during training. We denote this formulation as the camera-based formulation. To align the distribution of each camera, we estimate the raw distribution of each camera and standardize images from each camera with the corresponding distribution statistics. We use \(\varvec{\eta }\left( \cdot \right) \) to denote the estimated statistics related to the distribution of a camera. Then, given a related image \(\mathbf {I}_{i}^{\left( c\right) }\), aligning the camera-wise distribution will transform this image as:

\(\tilde{\mathbf {I}}_{i}^{\left( c\right) } = \mathbf {DA}\left( \mathbf {I}_{i}^{\left( c\right) }, \varvec{\eta }\left( c\right) \right) \),   (2)

where \(\mathbf {DA}\left( \cdot \right) \) represents a distribution alignment mechanism, \(\tilde{\mathbf {I}}_{i}^{\left( c\right) }\) denotes the aligned \(\mathbf {I}_{i}^{\left( c\right) }\), and \(\varvec{\eta }\left( c\right) \) denotes the estimated alignment parameters for camera c. For any training set \(\mathcal {D}_{j}\), we can now learn the ReID knowledge from this aligned distribution by replacing \(\mathbf {I}^{\mathcal {D}_{j}}_{i}\) in Eq. 1 with \(\tilde{\mathbf {I}}^{\left( c\right) }_{i}\).
With the distributions of all cameras aligned by \(\mathbf {DA}\left( \cdot \right) \), images from all these cameras can be regarded as distributed on a “standardized camera”. By learning on this “standardized camera”, we eliminate the distribution gap between cameras, so the raw learning objectives within the same and across different cameras can be treated equally, making the training procedure more efficient and effective. Besides, without the disturbance caused by training-camera-related connections, the learned knowledge generalizes better across various ReID scenarios. Last but not least, since the additional knowledge for associating diverse distributions is far less necessary, our formulation can make better use of the intra-camera annotations. It may relieve human labor for the costly inter-camera annotations, and it provides a solution for ReID in a large-scale camera network with fewer demands for inter-camera annotations.
3.3 Camera-Based Batch Normalization
In practice, a possible solution for aligning camera-related distributions is to conduct batch normalization in a camera-wise manner. We propose the Camera-based Batch Normalization (CBN) for aligning the distribution of all training and testing cameras. It is modified from the conventional Batch Normalization [9], and estimates camera-related statistics rather than dataset-related statistics.
Batch Normalization Revisited. Batch Normalization [9] is designed to reduce the internal covariate shift. In training, it standardizes the data with the mini-batch statistics and records them for approximating the global statistics. During testing, given an input \(\mathbf {x}_{i}\), the output of the BN layer is:

\(\hat{\mathbf {x}}_{i} = \gamma \cdot \frac{\mathbf {x}_{i} - \hat{\mu }}{\sqrt{\hat{\sigma }^{2} + \epsilon }} + \beta \),   (3)

where \(\mathbf {x}_{i}\) is the input and \(\hat{\mathbf {x}}_{i}\) is the corresponding output, \(\hat{\mu }\) and \(\hat{\sigma }^{2}\) are the global mean and variance of the training set, \(\epsilon \) is a small constant for numerical stability, and \(\gamma \) and \(\beta \) are two parameters learned during training. In ReID tasks, BN has significant limitations. It assumes and requires that all testing images are subject to the same training distribution. However, this assumption is satisfied only when the cameras in the testing set and the training set are exactly the same. Otherwise, the standardization fails.
Batch Normalization within Cameras. Our Camera-based Batch Normalization (CBN) aligns all training and testing cameras independently. It guarantees an invariant input distribution for learning, accumulating, and verifying the ReID knowledge. Given images or corresponding intermediate features \(\mathbf {x}_{m}^{\left( c\right) }\) from camera c, CBN standardizes them according to the camera-related statistics:

\(\hat{\mathbf {x}}_{m}^{\left( c\right) } = \gamma \cdot \frac{\mathbf {x}_{m}^{\left( c\right) } - \mu _{\left( c\right) }}{\sqrt{\sigma _{\left( c\right) }^{2} + \epsilon }} + \beta \),   (4)

where \(\mu _{\left( c\right) }\) and \(\sigma _{\left( c\right) }^{2}\) denote the mean and variance related to this camera c. During training, we disassemble each mini-batch and calculate the camera-related mean and variance for each involved camera. A camera with only one sampled image is ignored. During testing, before employing the learned ReID model to extract features, the above statistics have to be renewed for every testing camera. Specifically, we collect several unlabeled images and calculate the camera-related statistics per testing camera. Then, we employ these statistics and the learned weights to generate the final features.
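To make the camera-wise standardization concrete, the following is a minimal PyTorch sketch of a CBN layer under our reading of Eq. 4. The module name, the pass-through handling of single-image cameras, and passing camera labels directly to forward are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraBatchNorm2d(nn.Module):
    """Minimal sketch of Camera-based Batch Normalization (CBN).

    Samples from each camera are standardized with that camera's own batch
    statistics, while the affine parameters (gamma, beta) are shared across
    all cameras. Hypothetical implementation, not the authors' code.
    """

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))   # gamma
        self.bias = nn.Parameter(torch.zeros(num_features))    # beta

    def forward(self, x, cam_ids):
        # x: (N, C, H, W) features; cam_ids: (N,) camera label of each sample.
        out = torch.zeros_like(x)
        for cam in torch.unique(cam_ids):
            mask = cam_ids == cam
            if mask.sum() < 2:
                # The paper ignores cameras with a single sampled image;
                # here such samples are simply passed through unchanged.
                out[mask] = x[mask]
                continue
            # Standardize this camera's samples with its own mean/variance.
            out[mask] = F.batch_norm(
                x[mask], running_mean=None, running_var=None,
                weight=self.weight, bias=self.bias,
                training=True, eps=self.eps)
        return out
```

During training, the per-camera statistics come directly from the mini-batch; during testing, they would instead be fixed to the values estimated from a few unlabeled images of each testing camera, as described above.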
3.4 Applying CBN to Multiple ReID Scenarios
The proposed CBN is generic and nearly cost-free for existing methods on multiple ReID tasks. To demonstrate its superiority, we set up a bare-bones baseline, which only contains a deep neural network, an additional BN layer as the bottleneck, and a fully connected layer as the classifier. As shown in Fig. 2(a), our camera-based formulation can be implemented by simply replacing all BN layers in a usual convolutional network with CBN layers.
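For reference, a bare-bones baseline of this kind could be sketched as follows: a ResNet-50 backbone, a BN bottleneck, and a fully connected classifier. The class name and wiring are assumptions for illustration; in the camera-based formulation, every BN layer (including the bottleneck) would be replaced by the CBN sketch above and would receive camera labels at forward time.

```python
import torch.nn as nn
from torchvision.models import resnet50

class BaselineReID(nn.Module):
    """Sketch of the bare-bones baseline: backbone + BN bottleneck + FC classifier."""

    def __init__(self, num_identities, feat_dim=2048):
        super().__init__()
        backbone = resnet50(pretrained=True)
        # Drop the global pooling and the ImageNet classifier of ResNet-50.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bottleneck = nn.BatchNorm1d(feat_dim)  # swapped for CBN in our formulation
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, x):
        feat = self.pool(self.backbone(x)).flatten(1)  # (N, feat_dim)
        feat = self.bottleneck(feat)                   # features used for retrieval
        return feat, self.classifier(feat)             # logits used for the ID loss
```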
With a modified network mentioned above, our camera-based formulation can be applied to many popular tasks, such as fully-supervised learning, weakly-supervised learning, direct transfer, and domain adaptation. Apart from them, we also evaluate a rarely discussed ReID task, i.e., incremental learning. It studies the problem of learning knowledge incrementally from a sequence of training sets while preserving and accumulating the previously learned knowledge. As shown in Fig. 2, we propose two settings. (1) Data-Free: once we finish the training procedure on a dataset, the training data along with the corresponding classifier are abandoned. When training the model on the subsequent training sets, the old data will never show up again. (2) Replay: unlike Data-Free, we construct an exemplar set from each old training set. The exemplar set and the corresponding classifier are preserved and used during the entire training sequence.
3.5 Discussions
Bridging ReID Tasks. We briefly present our understanding of the relations between ReID tasks and how we bridge these tasks. Different ReID tasks handle different combinations of training and testing sets. Since datasets have distinct cameras, previous methods have to learn exclusive relations between particular training cameras and adapt them to specific testing camera sets. Our formulation aligns the distribution of all cameras for learning and testing ReID knowledge, and suppresses the exclusive training-camera relations. It may reveal the latent connections between ReID tasks. First, by aligning the distribution of seen and unseen cameras, fully-supervised learning and direct transfer are united, since training and testing distributions are always aligned in a camera-wise manner. Second, since there is no need to learn relations between distinct camera-related distributions, intra- and inter-camera annotations can be treated almost equally. Knowledge is better shared among cameras, which helps both fully- and weakly-supervised learning. Third, with the aligned training and testing distributions, it is more efficient to learn, accumulate, and preserve knowledge across datasets. This offers an elegant solution to preserve old knowledge (incremental learning) and absorb new knowledge (domain adaptation) in the same model.
Relationship to Previous Works. Two types of previous works closely relate to ours: camera-related methods and BN variants. Camera-related methods such as CamStyle [57] and CAMEL [45] noticed the camera view discrepancy inside the dataset. CamStyle augmented the dataset by transferring image styles in a camera-to-camera manner, but still learned ReID models in the dataset-based formulation. Consequently, transferring across datasets remains difficult. CAMEL [45] is the work most similar to ours: it learned camera-related projections and mapped camera-related distributions into an implicit common distribution. However, these projections are associated with the training cameras, limiting its ability to transfer across datasets. BN variants such as AdaBN [14] also inspire us. AdaBN aligned the distribution of the entire dataset; it neither eliminated the camera-related relations in training nor handled the camera-related distribution gap in testing. Unlike them, CBN is specially designed for our camera-based formulation and is much more general and precise for ReID tasks. More comparisons are provided in Sects. 4.2 and 4.3.
4 Experiments
4.1 Experiment Setup
Datasets. We utilize three large-scale ReID datasets: Market-1501 [50], DukeMTMC-reID [52], and MSMT17 [41]. The Market-1501 dataset has \(1\mathrm {,}501\) identities in total; 751 identities are used for training and the rest for testing. Its training set contains \(12\mathrm {,}936\) images and its testing set contains \(15\mathrm {,}913\) images. The DukeMTMC-reID dataset contains \(16\mathrm {,}522\) images of 702 identities for training, and \(1\mathrm {,}110\) identities with \(17\mathrm {,}661\) images are used for testing. The MSMT17 dataset is currently the largest ReID dataset, with \(126\mathrm {,}441\) images of \(4\mathrm {,}101\) identities from 15 cameras. For short, we denote Market-1501 as Market, DukeMTMC-reID as Duke, and MSMT17 as MSMT in the rest of this paper. It is worth noting that in these datasets, the training and testing subsets contain the same camera combinations. This could be the reason why previous dataset-based methods achieve remarkable fully-supervised performance but catastrophic direct transfer results.
Implementation Details. All experiments are conducted with PyTorch. The image size is \(256\times 128\) and the batch size is 64. In training, we sample 4 images for each identity. The baseline network presented in Sect. 3.4 uses ResNet-50 [7] as the backbone. To train this network, we adopt the SGD optimizer with a momentum [27] of 0.9 and a weight decay of \(5\times 10^{-4}\). The initial learning rate is 0.01, and it decays by a factor of 10 after the 40th epoch. For all experiments, training lasts 60 epochs. For incremental learning, we include a warm-up stage, in which we freeze the backbone and only fine-tune the classifier(s) to avoid damaging the previously learned knowledge. During testing, our framework first samples a few unlabeled images from each camera and uses them to approximate the camera-related statistics. These statistics are then fixed and employed to process the corresponding testing images. Following convention, mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC) curves are used for evaluation.
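The optimization recipe above can be condensed into a short sketch. The hyper-parameters are those listed in this section, while model, train_loader, and the plain cross-entropy identity loss are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed: `model` is the baseline network of Sect. 3.4 and `train_loader`
# yields (images, identity_labels, camera_labels) batches of size 64,
# with 4 images sampled per identity.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Decay the learning rate by a factor of 10 after the 40th epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(60):
    for images, pids, cam_ids in train_loader:
        feats, logits = model(images)
        loss = F.cross_entropy(logits, pids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```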
4.2 Performance on Different ReID Tasks
We evaluate our proposed method on five types of ReID tasks, i.e., fully-supervised learning, weakly-supervised learning, direct transfer, domain adaptation, and incremental learning. The corresponding experiments are organized as follows. First, we demonstrate the importance of aligning the distribution of all cameras from all datasets, and simultaneously conduct fully-supervised learning and direct transfer on multiple ReID datasets. Second, we demonstrate that it is possible to learn discriminative knowledge with only intra-camera annotations. We utilize the network architecture in Sect. 3.4 to compare the fully-supervised learning and weakly-supervised learning. To evaluate the generalization ability, direct transfer is also conducted for these two settings. Third, we evaluate the transfer ability of our method. This part of experiments includes domain adaptation, i.e., transferring the knowledge from the old domain to new domains, and incremental learning, i.e., preserving the old knowledge and accumulating the common knowledge for all training sets.
Note that, for simplicity, we denote the results of training and testing the model on the same dataset with fully annotated data as the fully-supervised learning results. For similar experiments that only use the intra-camera annotations, we denote their results as the weakly-supervised learning results.
Supervisions and Generalization. In this section, we evaluate and analyze the supervisions and the generalization ability in ReID tasks. For all experiments in this section, the testing results on both the training domain and other unseen testing domains are always obtained by the same learned model. We first conduct experiments on fully-supervised learning and direct transfer. As shown in Table 1, our proposed method shows clear advantages, e.g., there is an average \(1.1\%\) improvement in Rank-1 accuracy for the fully-supervised learning task. Meanwhile, without bells and whistles, there is an average \(13.6\%\) improvement in Rank-1 accuracy for the direct transfer task. We acknowledge that our method has to collect a few unlabeled samples from each testing camera to estimate the camera-related statistics. However, this process is fast and nearly cost-free.
Our method can also boost previous methods. Take BoT [21], a recent state-of-the-art method, as an example. We integrate our proposed CBN into BoT and conduct experiments with almost the same settings as in the original paper, including the network architecture, objective functions, and training strategies. The only difference is that we disable Random Erasing [54] due to its consistently negative effect on direct transfer. The results of fully-supervised learning on Market and Duke are shown in Table 2. It should be pointed out that in fully-supervised learning, the training and testing subsets contain the same cameras. Therefore, there is no significant shift between the BN statistics of the training set and the testing set, which favors the conventional formulation. Even so, our method still improves the performance on both Market and Duke. We believe that both aligning camera-wise distributions and better utilizing all annotations contribute to these improvements. Moreover, we also present direct transfer results in Table 4. It is clear that our method improves BoT significantly, e.g., there is a \(15.3\%\) Rank-1 improvement when training on Duke and testing on Market. These improvements on both fully-supervised learning and direct transfer demonstrate the advantages of our camera-based formulation.
Weak Supervisions. As demonstrated in Sect. 3.1, the conventional ReID formulation strongly demands inter-camera annotations for associating identities under distinct camera-related distributions. Since our method eliminates the distribution gap between cameras, the intra-camera annotations can be better used for learning the appearance features. We compare the performance of using all annotations (fully-supervised learning) and only intra-camera annotations (weakly-supervised learning). The results are in Table 3. For the weakly-supervised experiments, we follow the same settings as MT [60]. Since there are no inter-camera annotations, the identity labels of different cameras are independent, and we assign a separate classifier to each individual camera. Each of these classifiers is supervised by the corresponding intra-camera identity labels. Surprisingly, even without inter-camera annotations, weakly-supervised learning achieves competitive performance. According to these results, we believe that the importance of intra-camera annotations has been significantly undervalued.
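The per-camera classifiers of this weakly-supervised setting can be sketched as follows: a shared feature extractor feeds one linear head per camera, and each head is trained only with that camera's intra-camera labels. The class and method names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraCameraHeads(nn.Module):
    """One classifier per camera over that camera's own identity labels (sketch)."""

    def __init__(self, feat_dim, num_ids_per_cam):
        super().__init__()
        # num_ids_per_cam[c] = number of identities annotated under camera c.
        self.heads = nn.ModuleList(nn.Linear(feat_dim, n) for n in num_ids_per_cam)

    def intra_camera_loss(self, feats, cam_ids, intra_labels):
        # feats: (N, D); cam_ids and intra_labels: (N,), where intra_labels
        # index identities independently within each camera.
        cams = torch.unique(cam_ids)
        loss = 0.0
        for c in cams:
            mask = cam_ids == c
            logits = self.heads[int(c)](feats[mask])
            loss = loss + F.cross_entropy(logits, intra_labels[mask])
        return loss / len(cams)
```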
Transfer. In this section, we evaluate the ability to transfer ReID knowledge between the old and new datasets. First, we evaluate the ability to transfer previous knowledge to new domains. The related task is domain adaptation, which usually involves a labeled source training set and another unlabeled target training set. We integrate our formulation into a recent state-of-the-art method ECN [56]. The results are shown in Table 4. By aligning the distributions of source labeled images and target unlabeled images, the performance of ECN is largely boosted, e.g., when transferring from Duke to Market, the Rank-1 accuracy and mAP are improved by \(6.6\%\) and \(9.0\%\), respectively. Meanwhile, compared to other methods that also utilize camera labels, such as CamStyle [57] and CASCL [44], our method outperforms them significantly. These improvements demonstrate the effectiveness of our camera-based formulation in domain adaptation.
Second, we evaluate the ability to preserve old knowledge as well as accumulate common knowledge for all seen datasets when being fine-tuned. Incremental learning, which fine-tunes a model on a sequence of training sets, is used for this evaluation. Experiments are designed as follows. Given three large-scale ReID datasets, there are in total six training sequences of length 2, such as (Market\(\rightarrow \)Duke) and six sequences of length 3, such as (Market\(\rightarrow \)Duke\(\rightarrow \)MSMT). We use the baseline method described in Sect. 3.4 and train it on all sequences separately. After training on each dataset of every sequence, we evaluate the latest model on the first dataset of the corresponding sequence and record the performance decreases. Both the Data-Free and Replay settings are tested. For the Replay settings, the exemplars are selected by randomly sampling one image for each identity. Compared to the original training sets, the size of the exemplar set for Market, Duke, and MSMT is only \(5.5\%\), \(4.2\%\), and \(3.4\%\), respectively. Note that in Replay settings, the old classifiers will also be updated in training. The corresponding results are shown in Table 5. To better demonstrate our improvements, we report the averaged results of the sequences that are of the same length and share the same initial dataset, e.g., averaging the results of testing Market on the sequences Market\(\rightarrow \)Duke and Market\(\rightarrow \)MSMT. In short, our formulation outperforms the dataset-based formulation in all experiments. These results further demonstrate the effectiveness of our formulation.
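The exemplar construction used in the Replay setting is simple enough to sketch directly: one image is kept per identity, chosen at random from the old training set. The tuple layout of dataset is an assumption.

```python
import random
from collections import defaultdict

def build_exemplar_set(dataset, seed=0):
    """Randomly keep one image per identity (sketch of the Replay setting).

    `dataset` is assumed to be a list of (image_path, pid, cam_id) tuples.
    """
    rng = random.Random(seed)
    by_pid = defaultdict(list)
    for image_path, pid, cam_id in dataset:
        by_pid[pid].append((image_path, pid, cam_id))
    return [rng.choice(items) for items in by_pid.values()]
```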
4.3 Ablation Study
The experiments above demonstrate that our camera-based formulation boosts all the mentioned tasks. Now, we conduct more ablation studies to validate CBN.
Comparisons Between CBN and Other BN Variants. We compare CBN with three types of BN variants. (1) BN [9] and IBN [25] correspond to methods that use training-set-specific statistics to normalize all testing data. (2) AdaBN [14] is a dataset-wise adaptation that utilizes the testing-set-wise statistics to align the entire testing set. (3) The combination of BN and our CBN is included to verify the importance of training ReID models with CBN. As shown in Table 6, training and testing the ReID model with CBN achieves the best performance in both fully-supervised learning and direct transfer.
Samples Required for CBN Approximation. We conduct experiments on approximating the camera-related statistics with different numbers of samples. Note that if a camera contains fewer than the required number of images, we simply use all available images rather than duplicating them. We repeat all experiments 10 times and list the averaged results in Table 7. As demonstrated, the performance is better and more stable when more samples are used to estimate the camera-related statistics. Besides, the results are already good enough when only very few samples are utilized, e.g., 10 mini-batches. To balance simplicity and performance, we adopt 10 mini-batches for approximation in all experiments.
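One plausible way to realize this test-time approximation, assuming the deployed model exposes standard BatchNorm modules whose statistics can be re-estimated per camera, is sketched below; the helper name and the reset-then-forward strategy are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_to_camera(model, camera_loader, num_batches=10):
    """Re-estimate normalization statistics for one testing camera (sketch).

    Running statistics are reset and then accumulated from a few unlabeled
    mini-batches of this camera; afterwards the model is switched back to
    eval mode so the frozen per-camera statistics are used for extraction.
    """
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.reset_running_stats()
            m.momentum = None          # cumulative moving average over the batches seen
    model.train()                      # batch statistics are accumulated in train mode
    for i, images in enumerate(camera_loader):
        if i >= num_batches:
            break
        model(images)                  # forward pass only; gradients are not needed
    model.eval()
    return model
```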
Compatibility with Different Backbones. Apart from ResNet [7], which is used in the above experiments, we further evaluate the compatibility of CBN with other commonly used backbones: MobileNet V2 [29] and ShuffleNet V2 [22]. We embed CBN into these networks and evaluate their performance on fully-supervised learning and direct transfer. As shown in Table 8, the performance is also boosted significantly.
5 Conclusions
In this paper, we advocate a novel camera-based formulation for person re-identification and present a simple yet effective solution named camera-based batch normalization. With only a small additional cost, our approach shrinks the gap between intra-camera learning and inter-camera learning. It significantly boosts the performance on multiple ReID tasks, regardless of the source of supervision and whether the trained model is tested on the same dataset or transferred to another dataset. Our research delivers two key messages. First, it is crucial to align all camera-related distributions in ReID tasks, so that ReID models enjoy better abilities to generalize across different scenarios as well as transfer across multiple datasets. Second, with the aligned distributions, we unleash the potential of intra-camera annotations, which may have been undervalued in the community. With promising performance under the weakly-supervised setting (only intra-camera annotations are available), our approach provides a practical solution for deploying ReID models in large-scale, real-world scenarios.
References
Almazan, J., Gajic, B., Murray, N., Larlus, D.: Re-id done right: towards good practices for person re-identification. arXiv preprint arXiv:1801.05339 (2018)
Chang, X., Hospedales, T.M., Xiang, T.: Multi-level factorisation net for person re-identification. In: CVPR. IEEE (2018)
Deng, W., Zheng, L., Ye, Q., Kang, G., Yang, Y., Jiao, J.: Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In: CVPR. IEEE (2018)
Fan, H., Zheng, L., Yan, C., Yang, Y.: Unsupervised person re-identification: Clustering and fine-tuning. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 14(4), 83 (2018)
Fan, X., Luo, H., Zhang, X., He, L., Zhang, C., Jiang, W.: SCPNet: spatial-channel parallelism network for joint holistic and partial person re-identification. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11362, pp. 19–34. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20890-5_2
Fu, Y., Wei, Y., Wang, G., Zhou, Y., Shi, H., Huang, T.S.: Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. In: ICCV. IEEE (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. IEEE (2016)
Huang, H., et al.: EANet: enhancing alignment for cross-domain person re-identification. arXiv preprint arXiv:1812.11369 (2018)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
Jiao, J., Zheng, W.S., Wu, A., Zhu, X., Gong, S.: Deep low-resolution person re-identification. In: AAAI (2018)
Kalayeh, M.M., Basaran, E., Gökmen, M., Kamasak, M.E., Shah, M.: Human semantic parsing for person re-identification. In: CVPR. IEEE (2018)
Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017)
Li, W., Zhu, X., Gong, S.: Harmonious attention network for person re-identification. In: CVPR. IEEE (2018)
Li, Y., Wang, N., Shi, J., Liu, J., Hou, X.: Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779 (2016)
Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40(12), 2935–2947 (2018)
Lin, S., Li, H., Li, C.T., Kot, A.C.: Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. In: BMVC (2018)
Lin, Y., Dong, X., Zheng, L., Yan, Y., Yang, Y.: A bottom-up clustering approach to unsupervised person re-identification. In: AAAI (2019)
Lin, Y., Xie, L., Wu, Y., Yan, C., Tian, Q.: Unsupervised person re-identification via softened similarity learning. In: CVPR. IEEE (2020)
Liu, H., Feng, J., Qi, M., Jiang, J., Yan, S.: End-to-end comparative attention networks for person re-identification. IEEE Trans. Image Process. 26(7), 3492–3506 (2017)
Liu, J., Ni, B., Yan, Y., Zhou, P., Cheng, S., Hu, J.: Pose transferrable person re-identification. In: CVPR. IEEE (2018)
Luo, H., Gu, Y., Liao, X., Lai, S., Jiang, W.: Bag of tricks and a strong baseline for deep person re-identification. In: CVPRW. IEEE (2019)
Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_8
Mao, S., Zhang, S., Yang, M.: Resolution-invariant person re-identification. arXiv preprint arXiv:1906.09748 (2019)
Miao, J., Wu, Y., Liu, P., Ding, Y., Yang, Y.: Pose-guided feature alignment for occluded person re-identification. In: ICCV. IEEE (2019)
Pan, X., Luo, P., Shi, J., Tang, X.: Two at once: enhancing learning and generalization capacities via IBN-Net. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 484–500. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_29
Peng, P., et al.: Unsupervised cross-dataset transfer learning for person re-identification. In: CVPR. IEEE (2016)
Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Netw. 12(1), 145–151 (1999)
Rannen, A., Aljundi, R., Blaschko, M.B., Tuytelaars, T.: Encoder based lifelong learning. In: ICCV. IEEE (2017)
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR. IEEE (2018)
Shen, Y., Li, H., Yi, S., Chen, D., Wang, X.: Person re-identification with deep similarity-guided graph neural network. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 508–526. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_30
Song, C., Huang, Y., Ouyang, W., Wang, L.: Mask-guided contrastive attention model for person re-identification. In: CVPR. IEEE (2018)
Song, J., Yang, Y., Song, Y.Z., Xiang, T., Hospedales, T.M.: Generalizable person re-identification by domain-invariant mapping network. In: CVPR. IEEE (2019)
Song, L., et al.: Unsupervised domain adaptive re-identification: theory and practice. Pattern Recogn. (2020)
Suh, Y., Wang, J., Tang, S., Mei, T., Lee, K.M.: Part-aligned bilinear representations for person re-identification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 418–437. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_25
Sun, H., Chen, Z., Yan, S., Xu, L.: MVP matching: a maximum-value perfect matching for mining hard samples, with application to person re-identification. In: ICCV. IEEE (2019)
Sun, Y., Zheng, L., Deng, W., Wang, S.: SVDNet for pedestrian retrieval. In: ICCV. IEEE (2017)
Sun, Y., Zheng, L., Yang, Y., Tian, Q., Wang, S.: Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 501–518. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_30
Tian, M., et al.: Eliminating background-bias for robust person re-identification. In: CVPR. IEEE (2018)
Van Der Maaten, L.: Accelerating t-SNE using tree-based algorithms. JMLR 15(1), 3221–3245 (2014)
Wang, J., Zhu, X., Gong, S., Li, W.: Transferable joint attribute-identity deep learning for unsupervised person re-identification. In: CVPR. IEEE (2018)
Wei, L., Zhang, S., Gao, W., Tian, Q.: Person transfer GAN to bridge domain gap for person re-identification. In: CVPR. IEEE (2018)
Wei, L., Zhang, S., Yao, H., Gao, W., Tian, Q.: Glad: global-local-alignment descriptor for pedestrian retrieval. In: ACMMM. ACM (2017)
Wu, A., Zheng, W.S., Guo, X., Lai, J.H.: Distilled person re-identification: towards a more scalable system. In: CVPR. IEEE (2019)
Wu, A., Zheng, W.S., Lai, J.H.: Unsupervised person re-identification by camera-aware similarity consistency learning. In: ICCV. IEEE (2019)
Yu, H.X., Wu, A., Zheng, W.S.: Cross-view asymmetric metric learning for unsupervised person re-identification. In: ICCV. IEEE (2017)
Yu, H.X., Wu, A., Zheng, W.S.: Unsupervised person re-identification by deep asymmetric metric embedding. TPAMI (2018)
Yu, H.X., Zheng, W.S., Wu, A., Guo, X., Gong, S., Lai, J.H.: Unsupervised person re-identification by soft multilabel learning. In: CVPR (2019)
Zhang, T., Xie, L., Wei, L., Zhang, Y., Li, B., Tian, Q.: Single camera training for person re-identification. In: AAAI (2020)
Zhang, X., et al.: AlignedReID: surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184 (2017)
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: a benchmark. In: ICCV. IEEE (2015)
Zheng, Z., Zheng, L., Yang, Y.: A discriminatively learned CNN embedding for person reidentification. ACM Trans. Multimedia Comput. Commun. Appl. 14(1), 13 (2017)
Zheng, Z., Zheng, L., Yang, Y.: Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In: ICCV. IEEE (2017)
Zhong, Z., Zheng, L., Cao, D., Li, S.: Re-ranking person re-identification with k-reciprocal encoding. In: CVPR. IEEE (2017)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: AAAI (2020)
Zhong, Z., Zheng, L., Li, S., Yang, Y.: Generalizing a person retrieval model hetero- and homogeneously. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 176–192. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_11
Zhong, Z., Zheng, L., Luo, Z., Li, S., Yang, Y.: Invariance matters: exemplar memory for domain adaptive person re-identification. In: CVPR. IEEE (2019)
Zhong, Z., Zheng, L., Zheng, Z., Li, S., Yang, Y.: Camera style adaptation for person re-identification. In: CVPR. IEEE (2018)
Zhou, J., Yu, P., Tang, W., Wu, Y.: Efficient online local metric adaptation via negative samples for person reidentification. In: ICCV. IEEE (2017)
Zhou, K., Yang, Y., Cavallaro, A., Xiang, T.: Omni-scale feature learning for person re-identification. In: ICCV. IEEE (2019)
Zhu, X., Zhu, X., Li, M., Murino, V., Gong, S.: Intra-camera supervised person re-identification: a new benchmark. In: ICCVW. IEEE (2019)
Zhu, Z., et al.: Viewpoint-aware loss with angular regularization for person re-identification. In: AAAI (2020)
Acknowledgements
This work was supported by National Science Foundation of China under grant No. 61521002.