Abstract
In recent years, Siamese network-based trackers have achieved significant improvements in real-time tracking. Despite their success, performance bottlenecks caused by unavoidably complex scenarios in target-tracking tasks are becoming increasingly non-negligible. For example, occlusion and fast motion can easily cause tracking failures and are labeled as challenging attributes in many high-quality tracking benchmarks. In addition, Siamese trackers tend to suffer from high memory costs, which restricts their applicability to mobile devices with tight memory budgets. To address these issues, we propose a Specialized-teachers Distilled Siamese Tracker (SDST) framework that learns a student tracker which is small, fast, and performs better on challenging attributes. SDST introduces two types of teachers for multi-teacher distillation: a general teacher and specialized teachers. The former imparts basic knowledge to the student, while the latter transfer specialized knowledge that improves the student's performance on challenging attributes. To let the student efficiently capture critical knowledge from both types of teachers, SDST is equipped with a carefully designed multi-teacher knowledge distillation model comprising two processes: general teacher-student knowledge transfer and specialized teachers-student knowledge transfer. Extensive empirical evaluations on several popular Siamese trackers demonstrate the generality and effectiveness of our framework. Moreover, results on the Large-scale Single Object Tracking (LaSOT) benchmark show that the proposed method achieves significant improvements of 2–4% on most challenging attributes. SDST also maintains high overall performance while achieving compression rates of up to 8x, a frame rate of 252 FPS, and outstanding accuracy on all challenging attributes.
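The two transfer processes described above, general teacher-student and specialized teachers-student, can be sketched as a combined soft-label distillation objective. The following is a minimal illustration only, not the paper's actual formulation: the temperature `T`, weight `alpha`, and the simple averaging over specialized teachers are assumptions for the sketch.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def multi_teacher_kd_loss(student_logits, general_logits, specialized_logits_list,
                          T=4.0, alpha=0.5):
    # General teacher-student transfer: match the general teacher's soft labels.
    q = softmax(student_logits, T)
    general_loss = kl_div(softmax(general_logits, T), q)
    # Specialized teachers-student transfer: one attribute-specific teacher per
    # challenging attribute (e.g., occlusion, fast motion), averaged here.
    spec_loss = sum(kl_div(softmax(t, T), q) for t in specialized_logits_list)
    spec_loss /= max(len(specialized_logits_list), 1)
    # alpha balances the two knowledge-transfer processes.
    return alpha * general_loss + (1 - alpha) * spec_loss
```

For example, when the student's logits already match every teacher, the loss is near zero; any disagreement with either the general or a specialized teacher increases it.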
Acknowledgements
This work was supported by JST CREST Grant Number JPMJCR22D1 and JSPS KAKENHI Grant Number JP22H00551, Japan.
Cite this article
Li, Y., Shimada, A., Minematsu, T. et al. A framework of specialized knowledge distillation for Siamese tracker on challenging attributes. Machine Vision and Applications 35, 94 (2024). https://doi.org/10.1007/s00138-024-01578-4