Abstract
Despite great recent advances in visual tracking, its further development, in both algorithm design and evaluation, is limited by the lack of dedicated large-scale benchmarks. To address this problem, we present LaSOT, a high-quality Large-scale Single Object Tracking benchmark. LaSOT contains a diverse selection of 85 object classes and offers 1,550 videos totaling more than 3.87 million frames. Each video frame is carefully and manually annotated with a bounding box, making LaSOT, to our knowledge, the largest densely annotated tracking benchmark. Our goal in releasing LaSOT is to provide a dedicated, high-quality platform for both training and evaluation of trackers. The average video length in LaSOT is around 2,500 frames, and each video contains the various challenge factors found in real-world footage, such as targets disappearing and re-appearing. These long videos allow for the assessment of long-term trackers. To take advantage of the close connection between visual appearance and natural language, we provide a language specification for each video in LaSOT. We believe such annotations will enable future research on using linguistic features to improve tracking. Two protocols, full-overlap and one-shot, are designed for flexible assessment of trackers. We extensively evaluate 48 baseline trackers on LaSOT with in-depth analysis, and the results reveal that significant room for improvement remains. The complete benchmark, tracking results, and analysis are available at http://vision.cs.stonybrook.edu/~lasot/.
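For concreteness, below is a minimal sketch of the success (AUC) metric commonly used to score trackers on benchmarks such as LaSOT. This is an illustration with hypothetical function names, not the benchmark's released evaluation toolkit, which is distributed at the URL above and remains authoritative (e.g., this sketch omits the handling of frames where the target is absent).

```python
import numpy as np

def iou_xywh(pred, gt):
    """Per-frame intersection-over-union for [x, y, w, h] boxes (shape: N x 4)."""
    px1, py1 = pred[:, 0], pred[:, 1]
    px2, py2 = px1 + pred[:, 2], py1 + pred[:, 3]
    gx1, gy1 = gt[:, 0], gt[:, 1]
    gx2, gy2 = gx1 + gt[:, 2], gy1 + gt[:, 3]
    iw = np.maximum(0.0, np.minimum(px2, gx2) - np.maximum(px1, gx1))
    ih = np.maximum(0.0, np.minimum(py2, gy2) - np.maximum(py1, gy1))
    inter = iw * ih
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def success_auc(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Success curve: the fraction of frames whose IoU exceeds each overlap
    threshold; its mean approximates the area under the curve (AUC), the
    score used to rank trackers."""
    overlaps = iou_xywh(pred, gt)
    curve = np.array([(overlaps > t).mean() for t in thresholds])
    return float(curve.mean())
```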
Notes
Note that for a tracking benchmark using the full-overlap split protocol, category bias should be suppressed in both the training and evaluation of trackers; for a benchmark using the one-shot split protocol, category bias needs to be suppressed only in the training of trackers.
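As a toy illustration of the two split protocols (hypothetical code, not LaSOT's release tooling), a list of (video, class) pairs could be partitioned as follows: under full-overlap every class contributes videos to both splits, while under one-shot the test classes are withheld from training entirely.

```python
import random
from collections import defaultdict

def split_videos(videos, protocol, test_classes=(), test_ratio=0.2, seed=0):
    """Partition (video_id, class_name) pairs into train/test lists.

    'full-overlap': every class contributes videos to both splits, so
        category bias must be controlled in training and evaluation.
    'one-shot': train and test classes are disjoint, so test categories
        are unseen and bias only needs to be controlled in training.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for vid, cls in videos:
        by_class[cls].append(vid)

    train, test = [], []
    if protocol == "full-overlap":
        for vids in by_class.values():
            rng.shuffle(vids)
            k = max(1, int(len(vids) * test_ratio))
            test.extend(vids[:k])
            train.extend(vids[k:])
    elif protocol == "one-shot":
        for cls, vids in by_class.items():
            (test if cls in test_classes else train).extend(vids)
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    return train, test
```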
Acknowledgements
We thank the anonymous reviewers for their insightful suggestions, and Jeremy Chu for proofreading the final draft. Ling was partially supported by the Amazon AWS Machine Learning Research Award.
Additional information
Communicated by Konrad Schindler.
Cite this article
Fan, H., Bai, H., Lin, L. et al. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. Int J Comput Vis 129, 439–461 (2021). https://doi.org/10.1007/s11263-020-01387-y