Abstract
We introduce CoTracker, a transformer-based model that tracks a large number of 2D points in long video sequences. Unlike most existing approaches, which track points independently, CoTracker tracks them jointly, accounting for their dependencies. We show that joint tracking significantly improves tracking accuracy and robustness, and allows CoTracker to track occluded points and points outside the camera view. We also introduce several innovations for this class of trackers, including token proxies that significantly improve memory efficiency and allow CoTracker to track 70k points jointly at inference on a single GPU. CoTracker is an online algorithm that operates causally on short windows; however, it is trained by unrolling the windows like a recurrent network, so it maintains tracks for long periods of time even when points are occluded or leave the field of view. Quantitatively, CoTracker substantially outperforms prior trackers on standard point-tracking benchmarks. Code and model weights are available at https://co-tracker.github.io/.
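Since the code and weights are public, a typical way to exercise joint tracking is to feed a whole clip together with many query points and let the model estimate all trajectories at once. The minimal sketch below assumes the PyTorch Hub entry point and argument names (`cotracker2`, `grid_size`, `queries`) advertised in the public repository; treat these names and the tensor conventions as assumptions about the released interface rather than a definitive API.

```python
import torch

# Usage sketch; the hub entry point and argument names below are assumptions
# based on the public repository linked from https://co-tracker.github.io/.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy video tensor: batch x time x channels x height x width, values in [0, 255].
video = torch.randint(0, 255, (1, 48, 3, 384, 512), dtype=torch.float32, device=device)

# Load the tracker from PyTorch Hub (assumed entry point name).
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2").to(device)

# Option A: track a regular grid of points over the whole clip, jointly.
pred_tracks, pred_visibility = cotracker(video, grid_size=30)  # ~900 points

# Option B: track specific query points, each given as (start frame t, x, y) in pixels.
queries = torch.tensor([[[0., 100., 150.],
                         [10., 320., 240.],
                         [24., 50., 300.]]], device=device)
pred_tracks, pred_visibility = cotracker(video, queries=queries)

# Assumed outputs: pred_tracks (B, T, N, 2) point coordinates,
# pred_visibility (B, T, N) per-frame visibility flags.
print(pred_tracks.shape, pred_visibility.shape)
```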
Notes
- 1.
We assume that \(T\) is even. The last window is shorter if \(T/2\) does not divide \(T'\).
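For concreteness, the windowing arithmetic described in this note can be sketched as follows; the helper `window_spans` is an illustration of the stated scheme (windows of length \(T\) with stride \(T/2\)), not the released code.

```python
def window_spans(num_frames: int, window_len: int):
    """Split a clip of `num_frames` frames (T') into windows of length
    `window_len` (T) that overlap by half, i.e. stride T/2. The last
    window is shorter when T/2 does not divide T'."""
    assert window_len % 2 == 0, "T is assumed to be even"
    stride = window_len // 2
    spans = []
    start = 0
    while start < num_frames:
        spans.append((start, min(start + window_len, num_frames)))
        if start + window_len >= num_frames:
            break
        start += stride
    return spans

# Example: T' = 21 frames, T = 8 -> the last window (16, 21) has only 5 frames.
print(window_spans(21, 8))  # [(0, 8), (4, 12), (8, 16), (12, 20), (16, 21)]
```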
Acknowledgments
We thank Laurynas Karazija for evaluating model efficiency, Luke Melas-Kyriazi and Jianyuan Wang for their comments on the paper, and Roman Shapovalov, Iurii Makarov, Shalini Maiti, and Adam W. Harley for insightful discussions. Christian Rupprecht was supported by ERC-CoG UNION 101001212 and VisualAI EP/T028572/1.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C. (2025). CoTracker: It Is Better to Track Together. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_2
DOI: https://doi.org/10.1007/978-3-031-73033-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73032-0
Online ISBN: 978-3-031-73033-7
eBook Packages: Computer Science, Computer Science (R0)