Abstract
The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2020 challenge was composed of five sub-challenges focusing on different tracking domains: (i) the VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) the VOT-RT2020 challenge focused on “real-time” short-term tracking in RGB, (iii) the VOT-LT2020 challenge focused on long-term tracking, namely coping with target disappearance and reappearance, (iv) the VOT-RGBT2020 challenge focused on short-term tracking in RGB and thermal imagery, and (v) the VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only the VOT-ST2020 dataset was refreshed. A significant novelty is the introduction of a new VOT short-term tracking evaluation methodology and of segmentation ground truth in the VOT-ST2020 challenge – bounding boxes are no longer used in the VOT-ST challenges. A new VOT Python toolkit implementing all these novelties was introduced. The performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).
Notes
- 10.
The target was sought in a window centered at its estimated position in the previous frame. This is the simplest dynamic model, which assumes that all positions within the search region have an equal prior probability of containing the target.
References
Babenko, B., Yang, M.H., Belongie, S.: Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1619–1632 (2011)
Berg, A., Ahlberg, J., Felsberg, M.: A thermal object tracking benchmark. In: 12th IEEE International Conference on Advanced Video- and Signal-based Surveillance, Karlsruhe, Germany, 25–28 August 2015. IEEE (2015)
Berg, A., Johnander, J., de Gevigney, F.D., Ahlberg, J., Felsberg, M.: Semi-automatic annotation of objects in visual-thermal video. In: IEEE International Conference on Computer Vision, ICCV Workshops (2019)
Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional Siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: IEEE International Conference on Computer Vision, ICCV (2019)
Bhat, G., Johnander, J., Danelljan, M., Khan, F.S., Felsberg, M.: Unveiling the power of deep tracking. In: ECCV, pp. 483–498 (2018)
Bhat, G., et al.: Learning what to learn for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 777–794. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_46
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018)
Chen, K., et al.: Hybrid task cascade for instance segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818 (2018)
Dai, K., Zhang, Y., Wang, D., Li, J., Lu, H., Yang, X.: High-performance long-term tracking with meta-updater. In: CVPR (2020)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ECO: efficient convolution operators for tracking. In: CVPR, pp. 6638–6646 (2017)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: CVPR, pp. 4660–4669 (2019)
Danelljan, M., Gool, L.V., Timofte, R.: Probabilistic regression for visual tracking. In: CVPR (2020)
Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell. 39(8), 1561–1575 (2016)
Dunnhofer, M., Martinel, N., Luca Foresti, G., Micheloni, C.: Visual tracking by means of deep reinforcement learning and an expert demonstrator. In: The IEEE International Conference on Computer Vision (ICCV) Workshops, October 2019
Dunnhofer, M., Martinel, N., Micheloni, C.: A distilled model for tracking and tracker fusion (2020)
Fan, H., et al.: Lasot: a high-quality benchmark for large-scale single object tracking. In: Computer Vision Pattern Recognition (2019)
Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S.: Need for speed: a benchmark for higher frame rate object tracking. CoRR abs/1703.05884 (2017). http://arxiv.org/abs/1703.05884
Goyette, N., Jodoin, P.M., Porikli, F., Konrad, J., Ishwar, P.: Changedetection.net: a new change detection benchmark dataset. In: CVPR Workshops, pp. 1–8. IEEE (2012)
Guo, C., Zhang, L.: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Process. 19(1), 185–198 (2009)
Gustafsson, F.K., Danelljan, M., Bhat, G., Schön, T.B.: Energy-based models for deep probabilistic regression. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 325–343. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_20
Gustafsson, F.K., Danelljan, M., Timofte, R., Schön, T.B.: How to train your energy-based model for regression. CoRR abs/2005.01698 (2020). https://arxiv.org/abs/2005.01698
Henriques, J., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. PAMI 37(3), 583–596 (2015)
Huang, L., Zhao, X., Huang, K.: Got-10k: a large high-diversity benchmark for generic object tracking in the wild. arXiv:1810.11981 (2018)
Huang, L., Zhao, X., Huang, K.: GlobalTrack: a simple and strong baseline for long-term tracking. In: AAAI (2020)
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360 (2016)
Jack, V., et al.: Long-term tracking in the wild: A benchmark. arXiv:1803.09502 (2018)
Jung, I., Son, J., Baek, M., Han, B.: Real-time MDNet. In: ECCV, pp. 83–98 (2018)
Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 34(7), 1409–1422 (2012). https://doi.org/10.1109/TPAMI.2011.239
Kristan, M., et al.: The seventh visual object tracking vot2019 challenge results. In: ICCV2019 Workshops, Workshop on Visual Object Tracking Challenge (2019)
Kristan, M., et al.: The visual object tracking vot2018 challenge results. In: ECCV2018 Workshops, Workshop on Visual Object Tracking Challenge (2018)
Kristan, M., et al.: The visual object tracking vot2017 challenge results. In: ICCV2017 Workshops, Workshop on Visual Object Tracking Challenge (2017)
Kristan, M., et al.: The visual object tracking vot2016 challenge results. In: ECCV2016 Workshops, Workshop on Visual Object Tracking Challenge (2016)
Kristan, M., et al.: The visual object tracking vot2015 challenge results. In: ICCV2015 Workshops, Workshop on Visual Object Tracking Challenge (2015)
Kristan, M., et al.: The visual object tracking vot2013 challenge results. In: ICCV2013 Workshops, Workshop on Visual Object Tracking Challenge, pp. 98–111 (2013)
Kristan, M., et al.: The visual object tracking vot2014 challenge results. In: ECCV2014 Workshops, Workshop on Visual Object Tracking Challenge (2014)
Kristan, M., et al.: A novel performance evaluation methodology for single-target trackers. IEEE Trans. Pattern Anal. Mach. Intell. 38(11), 2137–2155 (2016)
Leal-Taixé, L., Milan, A., Reid, I.D., Roth, S., Schindler, K.: Motchallenge 2015: towards a benchmark for multi-target tracking. CoRR abs/1504.01942 (2015). http://arxiv.org/abs/1504.01942
Li, A., Li, M., Wu, Y., Yang, M.H., Yan, S.: Nus-pro: a new visual tracking challenge. IEEE-PAMI (2015)
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4282–4291 (2019)
Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8971–8980, June 2018
Li, C., Liang, X., Lu, Y., Zhao, N., Tang, J.: RGB-T object tracking: benchmark and baseline. Pattern Recogn. (2019, submitted)
Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: algorithms and benchmark. IEEE Trans. Image Process. 24(12), 5630–5644 (2015)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lukežič, A., Kart, U., Kämäräinen, J., Matas, J., Kristan, M.: CDTB: a color and depth visual object tracking dataset and benchmark. In: ICCV (2019)
Lukežič, A., Vojíř, T., Čehovin Zajc, L., Matas, J., Kristan, M.: Discriminative correlation filter with channel and spatial reliability. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6309–6318, July 2017
Lukežič, A., Čehovin Zajc, L., Vojíř, T., Matas, J., Kristan, M.: Now you see me: evaluating performance in long-term visual tracking. CoRR abs/1804.07056 (2018). http://arxiv.org/abs/1804.07056
Lukezic, A., Cehovin Zajc, L., Vojir, T., Matas, J., Kristan, M.: Performance evaluation methodology for long-term single object tracking. IEEE Trans. Cybern. (2020)
Lukezic, A., Matas, J., Kristan, M.: D3S - a discriminative single shot segmentation tracker. In: CVPR (2020)
Memarmoghadam, A., Moallem, P.: Size-aware visual object tracking via dynamic fusion of correlation filter-based part regressors. Signal Process. 164, 84–98 (2019). https://doi.org/10.1016/j.sigpro.2019.05.021. http://www.sciencedirect.com/science/article/pii/S0165168419301872
Moudgil, A., Gandhi, V.: Long-term visual object tracking benchmark. arXiv preprint arXiv:1712.01358 (2017)
Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 445–461. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_27
Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In: ECCV, pp. 300–317 (2018)
Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: CVPR, pp. 4293–4302 (2016)
Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: ICCV (2019)
Pernici, F., del Bimbo, A.: Object tracking by oversampling local features. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2538–2551 (2013). https://doi.org/10.1109/TPAMI.2013.250
Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 22(10), 1090–1104 (2000)
Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: Computer Vision and Pattern Recognition, pp. 7464–7473 (2017)
Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., Felsberg, M.: Learning fast and robust target models for video object segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Computer Vision Foundation, June 2020
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Ross, D.A., Lim, J., Lin, R.S., Yang, M.H.: Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77(1–3), 125–141 (2008)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Seoung, W.O., Lee, J.Y., Kim, S.J.: Fast video object segmentation by reference-guided mask propagation. In: Computer Vision Pattern Recognition, pp. 7376–7385 (2018)
Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: an experimental survey. TPAMI (2013). https://doi.org/10.1109/TPAMI.2013.230
Solera, F., Calderara, S., Cucchiara, R.: Towards the evaluation of reproducible robustness in tracking-by-detection. In: Advanced Video and Signal Based Surveillance, pp. 1–6 (2015)
Song, S., Xiao, J.: Tracking revisited using RGBD camera: unified benchmark and baselines. In: ICCV (2013)
Tao, R., Gavves, E., Smeulders, A.W.M.: Tracking for half an hour. CoRR abs/1711.10217 (2017). http://arxiv.org/abs/1711.10217
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355 (2019)
Čehovin, L., Kristan, M., Leonardis, A.: Is my new tracker really better than yours? Technical report 10, ViCoS Lab, University of Ljubljana, October 2013. http://prints.vicos.si/publications/302
Čehovin, L.: TraX: The visual Tracking eXchange Protocol and Library. Neurocomputing (2017). https://doi.org/10.1016/j.neucom.2017.02.036
Čehovin, L., Leonardis, A., Kristan, M.: Visual object tracking performance measures revisited. IEEE Trans. Image Process. 25(3), 1261–1274 (2016)
Vojíř, T., Noskova, J., Matas, J.: Robust scale-adaptive mean-shift for tracking. Pattern Recogn. Lett. 49, 250–258 (2014)
Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: a unifying approach. In: CVPR, pp. 1328–1338 (2019)
Wang, X., Kong, T., Shen, C., Jiang, Y., Li, L.: SOLO: segmenting objects by locations. arXiv preprint arXiv:1912.04488 (2019)
Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: Computer Vision Pattern Recognition (2013)
Wu, Y., Lim, J., Yang, M.H.: Object tracking benchmark. PAMI 37(9), 1834–1848 (2015)
Xiao, J., Stolkin, R., Gao, Y., Leonardis, A.: Robust fusion of color and depth data for RGB-D target tracking using adaptive range-invariant depth models and spatio-temporal consistency constraints. IEEE Trans. Cybern. 48, 2485–2499 (2018)
Xu, N., Price, B., Yang, J., Huang, T.: Deep grabcut for object selection. In: Proceedings of British Machine Vision Conference (2017)
Xu, T., Feng, Z.H., Wu, X.J., Kittler, J.: AFAT: adaptive failure-aware tracker for robust visual object tracking. arXiv preprint arXiv:2005.13708 (2020)
Xu, Y., Wang, Z., Li, Z., Ye, Y., Yu, G.: SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. arXiv preprint arXiv:1911.06188 (2019)
Yan, B., Wang, D., Lu, H., Yang, X.: Alpha-refine: boosting tracking performance by precise bounding box estimation. arXiv preprint arXiv:2007.02024 (2020)
Yan, B., Zhao, H., Wang, D., Lu, H., Yang, X.: Skimming-Perusal Tracking: a framework for real-time and robust long-term tracking. In: IEEE International Conference on Computer Vision (ICCV) (2019)
Yang, Z., Liu, S., Hu, H., Wang, L., Lin, S.: RepPoints: point set representation for object detection. In: The IEEE International Conference on Computer Vision (ICCV), pp. 9657–9666, October 2019
Yiming, L., Shen, J., Pantic, M.: Mobile face tracking: a survey and benchmark. arXiv:1805.09749v1 (2018)
Young, D.P., Ferryman, J.M.: PETS Metrics: on-line performance evaluation service. In: Proceedings of the 14th International Conference on Computer Communications and Networks, ICCCN 2005, pp. 317–324 (2005)
Zhang, L., Danelljan, M., Gonzalez-Garcia, A., van de Weijer, J., Khan, F.S.: Multi-modal fusion for end-to-end RGB-T tracking. In: IEEE International Conference on Computer Vision, ICCV Workshops (2019)
Zhang, P., Zhao, J., Wang, D., Lu, H., Yang, X.: Jointly modeling motion and appearance cues for robust RGB-T tracking. CoRR abs/2007.02041 (2020)
Zhang, Y., Wu, Z., Peng, H., Lin, S.: A transductive approach for video object segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4000–4009, June 2020
Zhang, Y., Wang, D., Wang, L., Qi, J., Lu, H.: Learning regression and verification networks for long-term visual tracking. CoRR abs/1809.04320 (2018)
Zhang, Z., Peng, H.: Deeper and wider Siamese networks for real-time visual tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4591–4600, June 2019
Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: object-aware anchor-free tracking. arXiv preprint arXiv:2006.10721 (2020)
Zhu, P., Wen, L., Bian, X., Haibin, L., Hu, Q.: Vision meets drones: a challenge. arXiv preprint arXiv:1804.07437 (2018)
Acknowledgements
This work was supported in part by the following research programs and projects: Slovenian research agency research programs P2-0214, Z2-1866, P2-0094, and Slovenian research agency project J2-8175. Jiří Matas and Ondrej Drbohlav were supported by the Czech Science Foundation Project GACR P103/12/G084. Aleš Leonardis was supported by the MURI project financed by MoD/Dstl and EPSRC through grant EP/N019415/1. Michael Felsberg and Linbo He were supported by WASP, VR (ELLIIT and NCNN), and SSF (SymbiCloud). Roman Pflugfelder and Gustavo Fernández were supported by the AIT Strategic Research Programme 2020 Visual Surveillance and Insight. The challenge was sponsored by the Faculty of Computer Science, University of Ljubljana, Slovenia.
Appendices
A VOT-ST2020 and VOT-RT2020 Submissions
This appendix provides a short summary of trackers considered in the VOT-ST2020 and VOT-RT2020 challenges.
1.1 A.1 Discriminative Single-Shot Segmentation Tracker (D3S)
A. Lukezic
alan.lukezic@fri.uni-lj.si
Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but they are restricted to bounding-box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker named D3S [50], which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties: one is invariant to a broad range of transformations, including non-rigid deformations, while the other assumes a rigid object, so that high robustness and online target segmentation are achieved simultaneously.
1.2 A.2 Visual Tracking by Means of Deep Reinforcement Learning and an Expert Demonstrator (A3CTDmask)
M. Dunnhofer, G. Foresti, C. Micheloni
{matteo.dunnhofer, gianluca.foresti, christian.micheloni}@uniud.it
A3CTDmask is the combination of the A3CTD tracker [16] with a one-shot segmentation method for target object mask generation. A3CTD is a real-time tracker built on a deep recurrent regression network architecture trained offline using a reinforcement learning based framework. After training, the proposed tracker is capable of producing bounding box estimates through the learned policy or by exploiting the demonstrator. A3CTDmask exploits SiamMask [74] by reinterpreting it as a one-shot segmentation module. The target object mask is generated inside a frame patch obtained through the bounding box estimates given by A3CTD.
1.3 A.3 Deep Convolutional Descriptor Aggregation for Visual Tracking (DCDA)
Y. Li, X. Ke
liyuezhou.cm@gmail.com, kex@fzu.edu.cn
This work aims to mine the target representation capability of a pre-trained VGG16 model for visual tracking. Based on spatial and semantic priors, a central attention mask is designed for robustness-aware feature aggregation, and an edge attention mask is used for accuracy-aware feature aggregation. To make full use of the scene context, a regression loss is developed to learn discriminative features for complex scenes. The DCDA tracker is implemented on a Siamese network with feature fusion and template enhancement strategies.
1.4 A.4 IOU Guided Siamese Networks for Visual Object Tracking (IGS)
M. Dasari, R. Gorthi
{ee18d001, rkg}@iittp.ac.in
In the proposed IOU-SiamTrack framework, a new block called the ‘IoU module’ is introduced. This module takes the feature-domain response maps and converts them into the image domain with the help of anchor boxes, as is done in the inference stage of [41, 42]. Using the classification response map, the top-K ‘probable’ bounding boxes, i.e. those with the top-K responses, are selected. The IoU module then calculates the IoU of each probable bounding box w.r.t. the estimated bounding box and outputs the one with the maximum IoU score as the predicted bounding box. As training progresses, the predicted box becomes better aligned with the ground truth, since the network is guided to minimise the IoU loss.
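As a rough illustration of this selection step, the following sketch picks the output box from the top-K classification responses by maximum IoU with the estimated box; the box format and helper names are assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code) of IoU-guided box selection.
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_box(cls_scores, candidate_boxes, estimated_box, k=5):
    """Keep the top-k boxes by classification score, then return the one
    with the highest IoU w.r.t. the estimated box."""
    top_k = np.argsort(cls_scores)[::-1][:k]
    ious = [iou(candidate_boxes[i], estimated_box) for i in top_k]
    return candidate_boxes[top_k[int(np.argmax(ious))]]
```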
1.5 A.5 SiamMask_SOLO (SiamMask_S)
Y. Jiang, Z. Feng, T. Xu, X. Song
yj.jiang@stu.jiangnan.edu.cn, {z.feng, tianyang.xu}@surrey.ac.uk,
x.song@jiangnan.edu.cn
The SiamMask_SOLO tracker is based on the SiamMask algorithm. It utilizes a multi-layer aggregation module to make full use of different levels of deep CNN features. In addition, to balance the three branches, the mask branch is replaced by a SOLO [75] head that uses CoordConv and an FCN, which improves the performance of SiamMask_SOLO in terms of both accuracy and robustness. The original refinement module is kept for a further performance boost.
1.6 A.6 Diverse Ensemble Tracker (DET50)
N. Wang, W. Zhou, H. Li
wn6149@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn
In this work, we leverage an ensemble of diverse models to learn manifold representations for robust object tracking. Based on the DiMP method, a shared backbone network (ResNet-50) is used for feature extraction, and multiple head networks make independent predictions. To reduce the representational overlap among the models, both model-diversity and response-diversity regularization terms are used during training. The ensemble framework is trained end-to-end in a data-driven manner. After box-level prediction, we use SiamMask for mask generation.
1.7 A.7 VPU_SiamM: Robust Template Update Strategy for Efficient Object Tracking (VPU_SiamM)
A. Gebrehiwot, J. Bescos, Á. García-Martín
awet.gebrehiwot@estudiante.uam.es, {j.bescos, alvaro.garcia}@uam.es
The VPU_SiamM tracker is an improved version of SiamMask [74], which tracks without any template update strategy. In order to obtain more discriminant features and to enhance robustness, VPU_SiamM applies a target template update strategy that leverages both the initial ground-truth template and a supplementary updatable template. The initial template provides highly reliable information and increases robustness against model drift, while the updatable template integrates new target information from the target location predicted in the current frame. During online tracking, VPU_SiamM applies both forward and backward tracking strategies by updating the updatable target template with the predicted target. The tracking decision for the next frame is taken where both templates yield a high response (score) in the search region. A data augmentation strategy was used when training the refinement branch to make it robust to motion blur and low-resolution inputs during inference.
1.8 A.8 RPT: Learning Point Set Representation for Siamese Visual Tracking (RPT)
H. Zhang, L. Wang, Z. Ma, W. Lu, J. Yin, M. Cheng
1067166127@qq.com, {wanglinyuan, kobebean, lwhfh01}@zju.edu.cn, {yin_jun, cheng_miao}@dahuatech.com
The RPT tracker is formulated with a two-stage structure. The first stage is composed of two parallel subnets: one for target estimation with RepPoints [84] in an offline-trained embedding space, the other trained online to provide high robustness against distractors [13]. The online classification subnet is a lightweight 2-layer convolutional neural network. The target estimation head is constructed with Siamese-based feature extraction and matching. In the second stage, the set of RepPoints with the highest confidence (i.e. online classification score) is fed into a modified D3S [50] to obtain the segmentation mask. The segmentation map is obtained by combining an enhanced target location channel with target and background similarity channels. The backbone is a ResNet-50 pre-trained on ImageNet, while the target estimation head is trained using pairs of frames from the YouTube-BoundingBoxes [59], COCO [45] and ImageNet VID [63] datasets.
1.9 A.9 Tracking Student and Teacher (TRASTmask)
M. Dunnhofer, G. Foresti, C. Micheloni
{matteo.dunnhofer, gianluca.foresti, christian.micheloni}@uniud.it
TRASTmask is the combination of the TRAST tracker [17] with a one-shot segmentation method for target object mask generation. The TRAST tracker consists of two components: (i) a fast CNN-based tracker, i.e. the Student, and (ii) an off-the-shelf tracker, i.e. the Teacher. The Student is trained offline based on knowledge distillation and reinforcement learning, where multiple tracking Teachers are exploited. TRASTmask uses DiMP [5] as the Teacher. The target object mask is generated inside a frame patch obtained from the bounding box estimates given by the TRAST tracker.
1.10 A.10 Ocean: Object-aware Anchor-free Tracking (Ocean)
Z. Zhang, H. Peng
zhangzhipeng2017@ia.ac.cn, houwen.peng@microsoft.com
We extend our object-aware anchor-free tracking framework [92] with novel transduction and segmentation networks, enabling it to predict an accurate target mask. The transduction network is introduced to infuse the knowledge of the mask given in the first frame. Inspired by the recent work TVOS [89], it compares pixel-wise feature similarities between the template and search features, and then transfers the template mask to an attention map based on these similarities. We add the attention map to the backbone features to learn target-background aware representations. Finally, a U-Net-shaped segmentation pathway is designed to progressively refine the enhanced backbone features into a target mask. The code will be released at https://github.com/researchmm/TracKit.
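A minimal sketch of such mask transduction is given below; the tensor shapes, feature normalization and softmax temperature are assumptions rather than the authors' implementation.

```python
# Sketch: transfer the template mask to the search frame through pixel-wise
# feature similarities (assumed shapes; illustrative only).
import torch
import torch.nn.functional as F

def transduce_mask(feat_t, feat_s, mask_t, temperature=0.1):
    """feat_t: (C, Ht, Wt) template features, feat_s: (C, Hs, Ws) search
    features, mask_t: (Ht, Wt) first-frame mask. Returns an (Hs, Ws)
    attention map obtained by transferring the template mask."""
    C, Ht, Wt = feat_t.shape
    _, Hs, Ws = feat_s.shape
    ft = F.normalize(feat_t.reshape(C, -1), dim=0)      # (C, Ht*Wt)
    fs = F.normalize(feat_s.reshape(C, -1), dim=0)      # (C, Hs*Ws)
    sim = fs.t() @ ft                                    # (Hs*Ws, Ht*Wt)
    weights = torch.softmax(sim / temperature, dim=1)    # over template pixels
    attn = weights @ mask_t.float().reshape(-1, 1)       # (Hs*Ws, 1)
    return attn.reshape(Hs, Ws)
```

The resulting attention map would then be added to the backbone features before the segmentation pathway, as described above.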
1.11 A.11 Tracking by Student FUSing Teachers (TRASFUSTm)
M. Dunnhofer, G. Foresti, C. Micheloni
{matteo.dunnhofer, gianluca.foresti, christian.micheloni}@uniud.it
The tracker TRASFUSTm is the combination of the TRASFUST tracker [17] with a one-shot segmentation method for target object mask generation. The TRASFUST tracker consists of two components: (i) a fast CNN-based tracker, i.e. the Student, and (ii) a pool of off-the-shelf trackers, i.e. the Teachers. The Student is trained offline based on knowledge distillation and reinforcement learning, where multiple tracking Teachers are exploited. After learning, through the learned evaluation method, the Student is capable of selecting the prediction of the best Teacher in the pool, thus performing robust fusion. The trackers DiMP [5] and ECO [12] were chosen as Teachers. The target object mask is generated inside a frame patch obtained from the bounding box estimates given by the TRASFUST tracker.
1.12 A.12 Alpha-Refine (AlphaRef)
B. Yan, D. Wang, H. Lu, X. Yang
yan_bin@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn,
xyang@remarkholdings.com
We propose a simple yet powerful two-stage tracker, which consists of a robust base tracker (SuperDiMP) and an accurate refinement module named Alpha-Refine [82]. In the first stage, SuperDiMP robustly locates the target and generates an initial bounding box. In the second stage, based on this result, Alpha-Refine crops a small search region and predicts a high-quality mask for the tracked target. Alpha-Refine exploits pixel-wise correlation for fine feature aggregation and uses a non-local layer to capture global context information. In addition, Alpha-Refine deploys a delicate mask prediction head [60] to generate high-quality masks. The complete code and trained models of Alpha-Refine will be released at github.com/MasterBin-IIAU/AlphaRefine.
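The pixel-wise correlation can be pictured as treating every spatial position of the template feature map as a 1×1 kernel; the sketch below illustrates this under assumed tensor shapes and is not the released Alpha-Refine code.

```python
# Sketch of pixel-wise correlation: one output channel per template position.
import torch

def pixelwise_corr(feat_t, feat_s):
    """feat_t: (C, Ht, Wt), feat_s: (C, Hs, Ws) ->
    correlation map of shape (Ht*Wt, Hs, Ws)."""
    C, Ht, Wt = feat_t.shape
    kernels = feat_t.reshape(C, Ht * Wt).permute(1, 0)       # (Ht*Wt, C)
    corr = torch.einsum('kc,chw->khw', kernels, feat_s)      # (Ht*Wt, Hs, Ws)
    return corr
```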
1.13 A.13 Hierarchical Representations with Discriminative Meta-Filters in Dual Path Network for Tracking (DPMT)
F. Xie, N. Wang, K. Yang, Y. Yao
220191672@seu.edu.cn, 20181222016@nuist.edu.cn,
yangkang779@163.com, 220191672@seu.edu.cn
We propose a novel dual-path network with discriminative meta-filters and hierarchical representations. The DPMT tracker consists of two pathways: (i) a Geographical Sensitivity Pathway (GASP) and (ii) a Geometrical Sensitivity Pathway (GESP). The modules in GASP are more sensitive to the spatial location of targets and distractors, while the subnetworks in GESP are designed to refine the bounding box to fit the target. In this dual-path design, GASP is trained to have more discriminative power between foreground and background, while GESP focuses more on the appearance model of the object.
1.14 A.14 SiamMask (siammask)
Q. Wang, L. Zhang, L. Bertinetto, P.H.S. Torr, W. Hu
qiang.wang@nlpr.ia.ac.cn, {lz, luca}@robots.ox.ac.uk, philip.torr@eng.ox.ac.uk, wmhu@nlpr.ia.ac.cn
Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. In this way, our tracker gains a better instance-level understanding of the object to track by exploiting rich object mask representations offline. Once trained, SiamMask relies solely on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes. Code is publicly available at https://github.com/foolwood/SiamMask.
1.15 A.15 OceanPlus: Online Object-Aware Anchor-Free Tracking (OceanPlus)
Z. Zhang, H. Peng, Z. Wu, K. Liu, J. Fu, B. Li, W. Hu
zhangzhipeng2017@ia.ac.cn, houwen.peng@microsoft.com,
Wu.Zhirong@microsoft.com, liukaiwen2019@ia.ac.cn, jianf@microsoft.com,
bli@nlpr.ia.ac.cn, wmhu@nlpr.ia.ac.cn
This model is an extension of the Ocean tracker (A.10). Inspired by recent online models, we introduce an online branch to accommodate changes in object scale and position. Specifically, the online branch inherits the structure and parameters of the first three stages of the Siamese backbone network. The fourth stage keeps the same structure as the original ResNet-50, but its initial parameters are obtained through the pre-training strategy proposed in [5]. The segmentation refinement pathway is the same as in Ocean. We refer the reader to the Ocean tracker (A.10) and https://github.com/researchmm/TracKit for more details.
1.16 A.16 fastOcean: Fast Object-Aware Anchor-Free Tracking (fastOcean)
Z. Zhang, H. Peng
zhangzhipeng2017@ia.ac.cn, houwen.peng@microsoft.com
To speed up the inference of our submitted tracker OceanPlus, we use TensorRT to re-implement the model. The structure and model parameters are the same as for OceanPlus. Please refer to OceanPlus (A.15) and Ocean (A.10) for more details.
1.17 A.17 Siamese Tracker with Discriminative Feature Embedding and Mask Prediction (SiamMargin)
G. Chen, F. Wang, C. Qian
{chenguangqi, wangfei, qianchen}@sensetime.com
SiamMargin is based on the SiamRPN++ algorithm [41]. In the training stage, a discrimination loss is added to the embedding layer, so that a discriminative embedding is learned offline. In the inference stage, the template feature of the object in the current frame is obtained by RoIAlign from the features of the current search region and is updated via a moving-average strategy. The discriminative embedding features thus accommodate appearance changes with proper online updating. Finally, the SiamMask [74] model is appended to obtain a pixel-level mask prediction.
1.18 A.18 Siamese Tracker with Enhanced Template and Generalized Mask Generator (SiamEM)
Y. Li, Y. Ye, X. Ke
liyuezhou.cm@gmail.com, yyfzu@foxmail.com, kex@fzu.edu.cn
SiamEM is a Siamese tracker with an enhanced template and a generalized mask generator. SiamEM improves SiamFC++ [81] by obtaining feature results for the template and the flipped template in the network head, while making decisions based on quality scores to predict bounding boxes. The segmentation network presented in [10] is used as the mask generator.
1.19 A.19 TRacker by Using ATtention (TRAT)
H. Saribas, H. Cevikalp, B. Uzun
{hasansaribas48, hakan.cevikalp, eee.bedirhan}@gmail.com
The tracker ‘TRacker by using ATtention’ uses a two-stream network, consisting of a 2D-CNN and a 3D-CNN, to exploit both spatial and temporal information in video streams. To obtain temporal (motion) information, the 3D-CNN is fed by stacking the previous four frames with a stride of one. To extract spatial information, the 2D-CNN is used. The outputs of the two streams are then fused by an attention module. We use the ATOM [13] tracker with a ResNet backbone as a baseline. Code is available at https://github.com/Hasan4825/TRAT.
1.20 A.20 InfoGAN Based Tracker: InfoVITAL (InfoVital)
H. Kuchibhotla, M. Dasari, R. Gorthi
{ee18m009, ee18d001, rkg}@iittp.ac.in
The InfoGAN architecture (Generator, Discriminator and a Q-Network) is incorporated into a tracking-by-detection framework, using the mutual-information concept to bind two latent-code distributions to the target and background samples. The additional Q-Network helps to properly estimate the assigned distributions, and the network is trained offline in an adversarial fashion. During online testing, the additional information from the Q-Network is used to locate the target in subsequent frames. This greatly helps to assess the drift from the exact target location from frame to frame and also during occlusion.
1.21 A.21 Learning Discriminative Model Prediction for Tracking (DiMP)
G. Bhat, M. Danelljan, L. Van Gool, R. Timofte
{goutam.bhat, martin.danelljan, vangool, timofter}@vision.ee.ethz.ch
DiMP is an end-to-end tracking architecture capable of fully exploiting both target and background appearance information for target model prediction. The target model here constitutes the weights of a convolution layer which performs target-background classification. These weights are predicted by the model prediction network, which is derived from a discriminative learning loss by applying an iterative optimization procedure. The model prediction network employs a steepest-descent based methodology that computes an optimal step length in each iteration to provide fast convergence. The online learned target model is applied in each frame to perform target-background classification. The final bounding box is then estimated using the overlap maximization approach of [13]. See [5] for more details about the tracker.
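For a quadratic (ridge-regression style) loss, the optimal steepest-descent step length has a closed form, which the following simplified sketch illustrates; the loss, shapes and regularization value are stand-ins and not the actual DiMP optimizer.

```python
# Hedged sketch: one steepest-descent step with an analytically optimal step
# length for a quadratic filter-learning loss.
import torch

def steepest_descent_step(w, A, b, reg=0.1):
    """One step on L(w) = 0.5*||A w - b||^2 + 0.5*reg*||w||^2.
    For a quadratic loss the optimal step length is
    alpha = (g^T g) / (g^T H g), with H = A^T A + reg*I."""
    residual = A @ w - b
    g = A.t() @ residual + reg * w            # gradient
    Hg = A.t() @ (A @ g) + reg * g            # Hessian-vector product
    alpha = (g @ g) / (g @ Hg + 1e-12)        # optimal step length
    return w - alpha * g
```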
1.22 A.22 SuperDiMP (SuperDiMP)
G. Bhat, M. Danelljan, F. Gustafsson, T. B. Schön, L. Van Gool, R. Timofte
{goutam.bhat, martin.danelljan}@vision.ee.ethz.ch, {fredrik.gustafsson,
thomas.schon}@it.uu.se, {vangool, timofter}@vision.ee.ethz.ch
SuperDiMP [23] combines the standard DiMP classifier from [5] with the EBM-based bounding-box regressor from [14, 22]. Instead of training the bounding box regression network to predict the IoU with an \(L_2\) loss [5], it is trained using the NCE+ approach [23] to minimize the negative log-likelihood. Furthermore, the tracker uses better training and inference settings.
1.23 A.23 Learning What to Learn for Video Object Segmentation (LWTL)
G. Bhat, F. Jaremo Lawin, M. Danelljan, A. Robinson, M. Felsberg, L. Van Gool, R. Timofte
goutam.bhat@vision.ee.ethz.ch, felix.jaremo-lawin@liu.se,
martin.danelljan@vision.ee.ethz.ch, {andreas.robinson, michael.felsberg}@liu.se, {vangool, timofter}@vision.ee.ethz.ch
LWTL is an end-to-end trainable video object segmentation (VOS) architecture which captures the current target object information in a compact parametric model. It integrates a differentiable few-shot learner module, which predicts the target model parameters using the first-frame annotation. The learner is designed to explicitly optimize the error between the target model prediction and a ground-truth label, which ensures a powerful model of the target object. Given a new frame, the target model predicts an intermediate representation of the target mask, which is input to the offline-trained segmentation decoder to generate the final segmentation mask. LWTL learns the ground-truth labels used by the few-shot learner to train the target model. Furthermore, a network module is trained to predict spatial importance weights for the different elements of the few-shot learning loss. All modules in the architecture are trained end-to-end by maximizing segmentation accuracy on annotated VOS videos. See [7] for more details.
1.24 A.24 Adaptive Failure-Aware Tracker (AFAT)
T. Xu, S. Zhao, Z. Feng, X. Wu, J. Kittler
tianyang.xu@surrey.ac.uk, zsc960813@163.com, z.feng@surrey.ac.uk,
wu_xiaojun@jiangnan.edu.cn, j.kittler@surrey.ac.uk
The Adaptive Failure-Aware Tracker [80] is based on a Siamese structure. First, a multi-RPN module is employed to predict the central location using ResNet-50 features. Second, a 2-cell LSTM is used to perform quality prediction with an additional motion model. Third, a fused mask branch is exploited for segmentation.
1.25 A.25 Ensemble Correlation Filter Tracking Based on Temporal Confidence Learning (TCLCF)
C. Tsai
chiyi_tsai@gms.tku.edu.tw
TCLCF is a real-time ensemble correlation filter tracker based on a temporal confidence learning method. In the current implementation, four different correlation filters collaboratively track the same target. TCLCF is a fast and robust generic object tracker that does not require GPU acceleration, so it can be deployed on embedded platforms with limited computing resources.
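One plausible way to realise such confidence-weighted fusion is sketched below, using the peak-to-sidelobe ratio as the per-filter confidence; this measure and the update rate are assumptions, not necessarily the authors' exact criterion.

```python
# Illustrative sketch: fuse several correlation-filter response maps with
# temporally smoothed confidence weights.
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio of a 2D response map."""
    peak = response.max()
    sidelobe = np.delete(response.ravel(), response.argmax())
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-9)

def fuse_responses(responses, confidences, lr=0.02):
    """responses: list of 2D maps from the individual filters;
    confidences: running per-filter confidences, updated and returned."""
    for i, r in enumerate(responses):
        confidences[i] = (1 - lr) * confidences[i] + lr * psr(r)
    w = np.asarray(confidences) / (np.sum(confidences) + 1e-9)
    fused = sum(wi * r for wi, r in zip(w, responses))
    return fused, confidences
```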
1.26 A.26 AFOD: Adaptive Focused Discriminative Segmentation Tracker (AFOD)
Y. Chen, J. Xu, J. Yu
{yiwei.chen, jingtao.xu, jiaqian.yu}@samsung.com
The proposed tracker is based on D3S and DiMP [5], employing ResNet-50 as the backbone. AFOD calculates the feature similarity to the foreground and background of the template as proposed in D3S. For discriminative features, AFOD updates the target model online. AFOD adaptively utilizes different strategies during tracking to update the scale of the search region and to adjust the prediction. Moreover, the Lovász hinge loss is used to learn the IoU score in offline training. The segmentation module is trained on the YouTube-VOS 2019 and DAVIS 2016 datasets. The offline training process includes two stages: (i) the BCE loss is used for optimization and (ii) the Lovász hinge is applied for further fine-tuning. For inference, two ResNet-50 models are used: one for segmentation and another for the target.
1.27 A.27 Fast Saliency-Guided Continuous Correlation Filter-Based Tracker (FSC2F)
A. Memarmoghadam
a.memarmoghadam@yahoo.com
The tracker FSC2F is based on the ECOhc approach [12]. A fast spatio-temporal saliency map is added using the PQFT approach [21]. The PQFT model utilizes intensity, colour, and motion features for a quaternion representation of the search-image context around the previous pose of the tracked object. The attentional regions in the coarse saliency map thus constrain the target confidence peaks. Moreover, a faster scale estimation algorithm is obtained by enhancing the fast fDSST method [15] via joint learning of sparsely-sampled scale spaces.
1.28 A.28 Adaptive Visual Tracking and Instance Segmentation (DESTINE)
S.M. Marvasti-Zadeh, J. Khaghani, L. Cheng, H. Ghanei-Yakhdan, S. Kasaei
mojtaba.marvasti@ualberta.ca, khaghani@ualberta.ca, lcheng5@ualberta.ca,
hghaneiy@yazd.ac.ir, kasaei@sharif.edu
DESTINE is a two-stage method consisting of axis-aligned bounding box estimation and mask prediction. First, DiMP50 [5] is used as the baseline tracker, switching to ATOM [13] when the IoU and normalized L1-distance between the results meet predefined thresholds. Then, to segment the estimated bounding box, the segmentation network of FRTM-VOS [60] uses the mask predicted by SiamMask [74] as its scores. Finally, DESTINE selects the best target mask according to the ratio of foreground pixels in the two predictions. The code is publicly released at https://github.com/MMarvasti/DESTINE.
1.29 A.29 Scale Adaptive Mean-Shift Tracker (ASMS)
Submitted by VOT Committee
The mean-shift tracker optimizes the Hellinger distance between the template histogram and the target candidate in the image. This optimization is done by gradient descent. ASMS [73] addresses the problem of scale adaptation and presents a novel, theoretically justified scale estimation mechanism which relies solely on the mean-shift procedure for the Hellinger distance. ASMS also introduces two improvements of the mean-shift tracker that make the scale estimation more robust in the presence of background clutter: a novel histogram colour weighting and a forward-backward consistency check. Code available at https://github.com/vojirt/asms.
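For reference, the histogram similarity that mean-shift optimizes can be sketched as follows; the Hellinger distance is computed from the Bhattacharyya coefficient of the normalized template and candidate histograms (a generic formulation, not the ASMS source code).

```python
# Sketch: Hellinger distance between colour histograms and the per-pixel
# weights used by a mean-shift update.
import numpy as np

def hellinger_distance(hist_template, hist_candidate):
    p = hist_template / (hist_template.sum() + 1e-12)
    q = hist_candidate / (hist_candidate.sum() + 1e-12)
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))   # Hellinger distance

def meanshift_weights(pixel_bins, hist_template, hist_candidate):
    """Per-pixel weights sqrt(template[b]/candidate[b]) for the histogram
    bin b of each pixel (pixel_bins: integer bin index per pixel)."""
    p = hist_template / (hist_template.sum() + 1e-12)
    q = hist_candidate / (hist_candidate.sum() + 1e-12)
    return np.sqrt(p[pixel_bins] / (q[pixel_bins] + 1e-12))
```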
1.30 A.30 ATOM: Accurate Tracking by Overlap Maximization (ATOM)
Submitted by VOT Committee
ATOM separates the tracking problem into two sub-tasks: (i) target classification, where the aim is to robustly distinguish the target from the background, and (ii) target estimation, where an accurate bounding box for the target is determined. Target classification is performed by training a discriminative classifier online. Target estimation is performed by an overlap maximization approach, where a network module is trained offline to predict the overlap between the target object and a bounding box estimate, conditioned on the target appearance in the first frame. See [13] for more details.
1.31 A.31 Discriminative Correlation Filter with Channel and Spatial Reliability - C++ (CSRpp)
Submitted by VOT Committee
The CSRpp tracker is the C++ implementation of the Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF) tracker [47].
1.32 A.32 Incremental Learning for Robust Visual Tracking (IVT)
Submitted by VOT Committee
The idea of the IVT tracker [62] is to incrementally learn a low-dimensional sub-space representation, adapting on-line to changes in the appearance of the target. The model update, based on incremental algorithms for principal component analysis, includes two features: a method for correctly updating the sample mean, and a forgetting factor to ensure less modelling power is expended fitting older observations.
1.33 A.33 Kernelized Correlation Filter (KCF)
Submitted by VOT Committee
This tracker is a C++ implementation of the Kernelized Correlation Filter [24] operating on simple HOG features and Colour Names. The KCF tracker is equivalent to kernel ridge regression trained with thousands of sample patches around the object at different translations. The implementation adds multi-thread multi-scale support, sub-cell peak estimation, and a more robust model update scheme in place of simple linear interpolation. Code available at https://github.com/vojirt/kcf.
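The closed-form Fourier-domain training and detection steps of KCF can be summarised by the following sketch (single-channel features assumed for brevity; the released C++ implementation differs in details such as multi-channel features and windowing).

```python
# Sketch of KCF training/detection with a Gaussian kernel.
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    """Kernel correlation k^{xz} over all circular shifts, via FFTs."""
    xf, zf = np.fft.fft2(x), np.fft.fft2(z)
    cross = np.real(np.fft.ifft2(xf * np.conj(zf)))
    d2 = np.maximum(0.0, np.sum(x**2) + np.sum(z**2) - 2.0 * cross)
    return np.exp(-d2 / (sigma**2 * x.size))

def kcf_train(x, y, lam=1e-4):
    """Closed-form ridge-regression solution in the Fourier domain:
    alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    k = gaussian_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def kcf_detect(alpha_hat, x_model, z):
    """Response map over all circular shifts of the search patch z."""
    k = gaussian_correlation(z, x_model)
    return np.real(np.fft.ifft2(alpha_hat * np.fft.fft2(k)))
```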
1.34 A.34 Multiple Instance Learning tracker (MIL)
Submitted by VOT Committee
The MIL tracker [1] uses a tracking-by-detection approach, more specifically multiple instance learning instead of traditional supervised learning, and shows improved robustness to inaccuracies of the tracker and to incorrectly labelled training samples.
1.35 A.35 Robust Siamese Fully Convolutional Tracker (RSiamFC)
Submitted by VOT Committee
The RSiamFC tracker is an extended SiamFC tracker [4] with a robust training method that applies a transformation to each training sample to generate a pair of samples for feature extraction.
1.36 A.36 VOS SOTA Method (STM)
Submitted by VOT Committee
Please see the original paper for details [56].
1.37 A.37 (UPDT)
Submitted by VOT Committee
Please see the original paper for details [6].
B VOT-LT2020 Submissions
This appendix provides a short summary of trackers considered in the VOT-LT2020 challenge.
1.1 B.1 Long-Term Visual Tracking with Assistant Global Instance Search (Megtrack)
Z. Mai, H. Bai, K. Yu, X. Qiu
marchihjun@gmail.com, 522184271@qq.com, valjean1832@outlook.com,
qiuxi@megvii.com
The Megtrack tracker applies a two-stage method that consists of local tracking and multi-level search. The local tracker is based on the ATOM [13] algorithm, improved by initializing the online correlation filters with backbone feature maps and by inserting a bounding-box calibration branch into the target estimation module. SiamMask [74] is cascaded to further refine the bounding box after locating the centre of the target. The multi-level search uses an RPN-based regression network to generate candidate proposals before applying GlobalTrack [26]. Appearance scores are calculated using both the online-learned RT-MDNet [29] and the offline-learned one-shot matching module, and are linearly combined to leverage the former's high robustness and the latter's discriminative power. Using a pre-defined threshold, the highest-scored proposal is taken as the current tracker state and used to re-initialize the local tracker for subsequent tracking.
1.2 B.2 Skimming-Perusal Long-Term Tracker (SPLT)
B. Yan, H. Zhao, D. Wang, H. Lu, X. Yang
{yan_bin, haojie_zhao}@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn,
xyang@remarkholdings.com
This is the original SPLT tracker [83] without modification. SPLT consists of a perusal module and a skimming module. The perusal module aims at obtaining precise bounding boxes and determining the target’s state in a local search region. The skimming module is designed to quickly filter out most unreliable search windows, speeding up the whole pipeline.
1.3 B.3 A Baseline Long-Term Tracker with Meta-Updater (LTMU_B)
K. Dai, D. Wang, J. Li, H. Lu, X. Yang
dkn2014@mail.dlut.edu.cn, {wdice, jianhual}@dlut.edu.cn, lhchuan@dlut.edu.cn,
xyang@remarkholdings.com
The tracker LTMU_B is a simplified version of LTMU [11] and LT_DSE with comparable performance, adding an RPN-based regression network, a sliding-window based re-detection module and a complex mechanism for updating models and target re-localization. The short-term part of LTMU_B contains two components. The first is for target localization and is based on the DiMP algorithm [5], using ResNet-50 as the backbone network; the update of DiMP is controlled by the meta-updater proposed in LTMU. The second component is the SiamMask network [74], used for refining the bounding box after locating the centre of the target; it takes the local search region as input and outputs tight bounding boxes of candidate proposals. For the verifier, the MDNet network [55] is adopted, which uses VGG-M as the backbone and is pre-trained on the ILSVRC VID dataset. The classification score is obtained by feeding the tracking result's features to three fully connected layers. GlobalTrack [26] is utilised as the global detector.
1.4 B.4 Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLTDiMP)
S. Choi, J. Lee, Y. Lee, A. Hauptmann
seokeon@kaist.ac.kr, {ljhyun33, swack9751}@korea.ac.kr, alex@cs.cmu.edu
We propose an improved Discriminative Model Prediction method for robust long-term tracking based on a pre-trained short-term tracker. The baseline tracker is SuperDiMP which combines the bounding-box regressor of PrDiMP [14] with the standard DiMP [5] classifier. To make our model more discriminative and robust, we introduce uncertainty reduction using random erasing, background augmentation for more discriminative feature learning, and random search with spatio-temporal constraints. Code available at https://github.com/bismex/RLT-DIMP.
1.5 B.5 Long-Term MDNet (ltMDNet)
H. Fan, H. Ling
{hefan, hling}@cs.stonybrook.edu
We design a long-term tracker by adapting MDNet [55]. Specifically, we utilize an instance-aware detector [26] to generate target proposals. These proposals are then forwarded to MDNet for classification. Since the detector performs detection on the full image, the final tracker can locate the target anywhere in the image and can thus robustly deal with full occlusion and out-of-view cases. The instance-aware detector is implemented on Faster R-CNN with a ResNet-50 backbone. MDNet is implemented as in the original paper.
1.6 B.6 (CLGS)
Submitted by VOT Committee
In this work, we develop a complementary local-global search (CLGS) framework for robust long-term tracking, which consists of a robust local tracker based on SiamMask [74], a global detector based on Cascade R-CNN [8], and an online verifier based on Real-time MDNet [29]. During online tracking, the SiamMask model locates the target in a local region and estimates the size of the target according to the predicted mask. The online verifier judges whether the target is found or lost. Once the target is lost, the global R-CNN detector (without class prediction) generates region proposals over the whole image, and the online verifier finds the target among these proposals again. In addition, we design an effective online update strategy to improve the discrimination power of the verifier.
1.7 B.7 (LT_DSE)
Submitted by VOT Committee
This algorithm divides each long-term sequence into several short episodes and tracks the target in each episode using short-term tracking techniques. Whether the target is visible or not is judged from the outputs of the short-term local tracker and an online-updated classification-based verifier. If the target disappears, an image-wide re-detection is conducted, which outputs the possible location and size of the target. Based on these, the tracker crops the local search region that may include the target and sends it to the RPN-based regression network. The candidate proposals from the regression network are then scored by the online-learned verifier. If the candidate with the maximum score is above a pre-defined threshold, the tracker regards it as the target and re-initializes the short-term components. Finally, the tracker conducts short-term tracking until the target disappears again.
1.8 B.8 (SiamDW_LT)
Submitted by VOT Committee
SiamDW_LT is a long-term tracker that utilizes deeper and wider backbone networks with fast online model updates. The basic tracking module is a short-term Siamese tracker, which returns confidence scores indicating the tracking reliability. When the Siamese tracker is uncertain about its tracking accuracy, an online correction module is triggered to refine the results. When the Siamese tracker fails, a global re-detection module is activated to search for the target over the whole image. Object disappearance and occlusion are also detected from the tracking confidence. In addition, we introduce a model ensemble to further improve tracking accuracy and robustness.
C VOT-RGBT2020 Submissions
This appendix provides a short summary of trackers considered in the VOT-RGBT2020 challenge.
1.1 C.1 Multi-model Continuous Correlation Filter for RGBT Visual Object Tracking (M2C2Frgbt)
A. Memarmoghadam
a.memarmoghadam@yahoo.com
Inspired by the ECO tracker [12], we propose a robust yet efficient tracker, named M2C2Frgbt, that utilizes multiple models of the tracked object and estimates its position in every frame by a weighted cumulative fusion of their respective regressors in a ridge-regression optimization problem [51]. Moreover, to accelerate tracking, we propose a faster scale estimation method in which the target scale filter is jointly learned via sparsely sampled scale spaces constructed from the thermal infrared data only. Our scale estimation approach improves the running speed of the baseline fDSST [15] algorithm by more than 20% while maintaining the tracking performance. To suppress unwanted samples, which mostly belong to occlusions or other non-object data, we conservatively update each model on-the-fly in a non-uniform, sparse manner.
1.2 C.2 Jointly Modelling Motion and Appearance Cues for Robust RGB-T Tracking (JMMAC)
P. Zhang, S. Chen, D. Wang, H. Lu, X. Yang
pyzhang@mail.dlut.edu.cn, shuhaochn@mail.dlut.edu.cn, wdice@dlut.edu.cn,
lhchuan@dlut.edu.cn, xyang@remarkholdings.com
Our tracker is based on [88] and consists of two components, i.e. multimodal fusion of appearance trackers and camera motion estimation. In multimodal fusion, we develop a late fusion method to infer the fusion weight maps of the RGB and thermal (T) modalities. The fusion weights are determined by offline-trained global and local Multimodal Fusion Networks (MFNet) and are then used to linearly combine the response maps of the RGB and T modalities obtained from ECO trackers. In MFNet, a truncated VGG-M network is used as the backbone to extract deep features. In camera motion estimation, when drastic camera motion is detected, we compensate for the movement and correct the search region using a key-point-based image registration technique. Finally, we employ YOLOv2 to refine the bounding box. The scale estimation and model updating methods are taken from ECO by default.
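A hedged sketch of this late response-map fusion is given below; the weight maps are assumed to come from the fusion network, and the normalization is illustrative rather than the authors' exact formulation.

```python
# Sketch: blend RGB and thermal response maps with predicted weight maps.
import numpy as np

def fuse_rgbt_responses(resp_rgb, resp_t, w_rgb, w_t):
    """resp_*: response maps of the two monomodal trackers;
    w_*: fusion weight maps of the same spatial size (e.g. from MFNet)."""
    w_sum = w_rgb + w_t + 1e-9
    fused = (w_rgb * resp_rgb + w_t * resp_t) / w_sum
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, peak
```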
1.3 C.3 Accurate Multimodal Fusion for RGB-T Object Tracking (AMF)
P. Zhang, S. Chen, B. Yan, D. Wang, H. Lu, X. Yang
{pyzhang, shuhaochn, yan_bin}@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn, xyang@remarkholdings.com
We achieve multimodal fusion for RGB-T tracking by linearly combining the response maps obtained from two monomodal base trackers, i.e. DiMP. The fusion weight is obtained by the Multimodal Fusion Network proposed in [88]. To achieve high accuracy, the bounding box obtained from the fused DiMP is then refined by a refinement module in the visible modality. The refinement module, namely Alpha-Refine, aggregates features via a pixel-level correlation layer and a non-local layer, and adaptively selects the most adequate result from three branches, namely bounding box, corner and mask heads, which can predict more accurate bounding boxes. Note that the target scale estimated by the IoU-Net in DiMP is also applied in the visible modality, followed by Alpha-Refine, and the model updating method is taken from DiMP by default.
1.4 C.4 SqueezeNet Based Discriminative Correlation Filter Tracker (SNDCFT)
A. Varfolomieiev
a.varfolomieiev@kpi.ua
The tracker uses FHOG and convolutional features extracted from both the video and infrared modalities. As convolutional features, the output of the ‘fire2/concat’ layer of the original SqueezeNet network [27] is used (no additional pre-training of the network is performed). The core of the tracker is a spatially regularized discriminative correlation filter, computed using an ADMM optimizer. The DCF filter is calculated independently for the different feature modalities and is updated in each frame using simple exponential forgetting.
1.5 C.5 Decision Fusion Adaptive Tracker (DFAT)
H. Li, Z. Tang, T. Xu, X. Zhu, X. Wu, J. Kittler
hui_li_jnu@163.com, 1030415519@vip.jiangnan.edu.cn, tianyang.xu@surrey.ac.uk, xuefeng_zhu95@163.com, wu_xiaojun@jiangnan.edu.cn, j.kittler@surrey.ac.uk
The Decision Fusion Adaptive Tracker is based on a Siamese structure. First, multi-layer deep features are extracted by ResNet-50. Then, a multi-RPN module is employed to predict the central location from the multi-layer deep features. Finally, an adaptive weight strategy for decision-level fusion is utilized to generate the final result. In addition, the template features are updated by a linear template update strategy.
1.6 C.6 Multi-modal Fusion for End-to-End RGB-T Tracking (mfDiMP)
Submitted by VOT Committee
The mfDiMP tracker contains an end-to-end tracking framework for fusing the RGB and TIR modalities in RGB-T tracking [87]. It fuses the modalities at the feature level in both the IoU predictor and the model predictor of DiMP, and won the VOT-RGBT2019 challenge.
1.7 C.7 Online Deeper and Wider Siamese Networks for RGBT Visual Tracking (SiamDW-T)
Submitted by VOT Committee
SiamDW-T is based on previous work by Zhang and Peng [91] and extends it with two fusion strategies for RGBT tracking. A simple fully connected layer is appended to classify each fused feature as background or foreground. SiamDW-T achieved second place in VOT-RGBT2019 and its code is available at https://github.com/researchmm/VOT2019.
D VOT-RGBD2020 Submissions
This appendix provides a short summary of trackers considered in the VOT-RGBD2020 challenge.
1.1 D.1 Accurate Tracking by Category-Agnostic Instance Segmentation for RGBD Image (ATCAIS)
Y. Wang, L. Wang, D. Wang, H. Lu, X. Yang
{wym097,wlj,wdice,lhchuan}@dlut.edu.cn, xyang@remarkholdings.com
The proposed tracker combines instance segmentation and depth information for accurate tracking. ATCAIS is based on the ATOM tracker and the HTC instance segmentation method, which is re-trained in a category-agnostic manner. The instance segmentation results are used to detect background distractors and to refine the target bounding boxes to prevent drifting. The depth values are used to detect target occlusion or disappearance and to re-find the target.
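One way such a depth cue could be used is sketched below, comparing the median depth inside the predicted box with a running estimate of the target depth; this mechanism and its threshold are assumptions, not the submitted code.

```python
# Illustrative sketch of a depth-based occlusion/disappearance check.
import numpy as np

def depth_occlusion_check(depth_map, box, target_depth, rel_tol=0.3):
    """box = (x1, y1, x2, y2) in pixels; target_depth is a running estimate
    of the target's depth. Returns (occluded, updated_target_depth)."""
    x1, y1, x2, y2 = box
    region = depth_map[y1:y2, x1:x2]
    region = region[region > 0]                # ignore invalid depth pixels
    if region.size == 0:
        return True, target_depth
    current = float(np.median(region))
    occluded = abs(current - target_depth) > rel_tol * target_depth
    new_depth = target_depth if occluded else 0.9 * target_depth + 0.1 * current
    return occluded, new_depth
```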
1.2 D.2 Depth Enhanced DiMP for RGBD Tracking (DDiMP)
S. Qiu, Y. Gu, X. Zhang
{shoumeng, gyz, xlzhang}@mail.sim.ac.cn
DDiMP is based on SuperDiMP, which combines the standard DiMP classifier from [5] with the bounding-box regressor from [14]. The model update strategy during tracking is enhanced by using the model's confidence in the current tracking result. The output of the IoU-Net is used to decide whether to fine-tune the shape, size, and position of the target. To handle scale variations, the target is searched over five scales \(1.025^{\{-2, -1, 0, 1, 2\}}\), and depth information is utilized to prevent the scale from changing too quickly. Finally, two trackers with different model-update confidence thresholds run in parallel, and the output with the higher confidence is selected as the tracking result for the current frame.
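The multi-scale search with a depth-based limit on the scale change can be sketched as follows; the scoring interface and the use of the frame-to-frame depth ratio as the cap are assumptions.

```python
# Sketch: scale search over the five scale factors, capped by a depth-derived
# limit on per-frame scale change.
import numpy as np

SCALE_FACTORS = 1.025 ** np.array([-2, -1, 0, 1, 2])

def estimate_scale(score_fn, prev_size, depth_ratio, base_rate=1.05):
    """score_fn(size): tracker confidence for a candidate target size.
    depth_ratio: previous / current median target depth (>1 means the target
    came closer, so some growth in size is expected)."""
    candidates = prev_size * SCALE_FACTORS
    scores = [score_fn(s) for s in candidates]
    best = candidates[int(np.argmax(scores))]
    # Limit the per-frame scale change around the change suggested by depth.
    lo = prev_size * depth_ratio / base_rate
    hi = prev_size * depth_ratio * base_rate
    return float(np.clip(best, lo, hi))
```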
1.3 D.3 Complementary Local-Global Search for RGBD Visual Tracking (CLGS-D)
H. Zhao, Z. Wang, B. Yan, D. Wang, H. Lu, X. Yang
{haojie_zhao, zzwang, yan_bin}@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn,
xyang@remarkholdings.com
The CLGS-D tracker is based on SiamMask, FlowNetv2, CenterNet, Real-time MDNet and a novel box refinement module. The SiamMask model is used as the base tracker. MDNet is used to judge whether the target is found or lost. Once the target is lost, CenterNet is used to generate region proposals over the whole image. FlowNetv2 estimates the motion of the target by generating a flow map. The region proposals are then filtered with the aid of the flow and depth maps. Finally, an online “verifier” finds the target among the remaining region proposals. A novel module is also used in this work to refine the bounding box.
1.4 D.4 Siamese Network for Long-term RGB-D Tracking (Siam_LTD)
X.-F. Zhu, H. Li, S. Zhao, T. Xu, X.-J. Wu
{xuefeng_zhu95,hui_li_jnu,zsc960813,wu_xiaojun}@163.com,
tianyang.xu@surrey.ac.uk
Siam_LTD employs ResNet-50 to extract backbone features and an RPN branch to locate the centre. In addition, a re-detection mechanism is introduced.