Abstract
We introduce CoTracker, a transformer-based model that tracks a large number of 2D points in long video sequences. Unlike most existing approaches, which track points independently, CoTracker tracks them jointly, accounting for their dependencies. We show that joint tracking significantly improves tracking accuracy and robustness, and allows CoTracker to track occluded points and points outside the camera view. We also introduce several innovations for this class of trackers, including token proxies that significantly improve memory efficiency and allow CoTracker to track 70k points jointly at inference on a single GPU. CoTracker is an online algorithm that operates causally on short windows; however, it is trained by unrolling the windows like a recurrent network, so it maintains tracks for long periods of time even when points are occluded or leave the field of view. Quantitatively, CoTracker substantially outperforms prior trackers on standard point-tracking benchmarks. Code and model weights are available at https://co-tracker.github.io/.
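Since the code and weights are public, a typical way to exercise joint tracking is to feed a whole clip together with many query points and let the model estimate all trajectories at once. The minimal sketch below assumes the PyTorch Hub entry point and argument names (`cotracker2`, `grid_size`, `queries`) advertised in the public repository; treat these names and the tensor conventions as assumptions about the released interface rather than a definitive API.

```python
import torch

# Usage sketch; the hub entry point and argument names below are assumptions
# based on the public repository linked from https://co-tracker.github.io/.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy video tensor: batch x time x channels x height x width, values in [0, 255].
video = torch.randint(0, 255, (1, 48, 3, 384, 512), dtype=torch.float32, device=device)

# Load the tracker from PyTorch Hub (assumed entry point name).
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2").to(device)

# Option A: track a regular grid of points over the whole clip, jointly.
pred_tracks, pred_visibility = cotracker(video, grid_size=30)  # ~900 points

# Option B: track specific query points, each given as (start frame t, x, y) in pixels.
queries = torch.tensor([[[0., 100., 150.],
                         [10., 320., 240.],
                         [24., 50., 300.]]], device=device)
pred_tracks, pred_visibility = cotracker(video, queries=queries)

# Assumed outputs: pred_tracks (B, T, N, 2) point coordinates,
# pred_visibility (B, T, N) per-frame visibility flags.
print(pred_tracks.shape, pred_visibility.shape)
```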
Notes
- 1.
We assume that \(T\) is even. The last window is shorter if \(T/2\) does not divide \(T'\).
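For concreteness, the windowing arithmetic described in this note can be sketched as follows; the helper `window_spans` is an illustration of the stated scheme (windows of length \(T\) with stride \(T/2\)), not the released code.

```python
def window_spans(num_frames: int, window_len: int):
    """Split a clip of `num_frames` frames (T') into windows of length
    `window_len` (T) that overlap by half, i.e. stride T/2. The last
    window is shorter when T/2 does not divide T'."""
    assert window_len % 2 == 0, "T is assumed to be even"
    stride = window_len // 2
    spans = []
    start = 0
    while start < num_frames:
        spans.append((start, min(start + window_len, num_frames)))
        if start + window_len >= num_frames:
            break
        start += stride
    return spans

# Example: T' = 21 frames, T = 8 -> the last window (16, 21) has only 5 frames.
print(window_spans(21, 8))  # [(0, 8), (4, 12), (8, 16), (12, 20), (16, 21)]
```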
Acknowledgments
We thank Laurynas Karazija for evaluating model efficiency, Luke Melas-Kyriazi and Jianyuan Wang for their comments on the paper, and Roman Shapovalov, Iurii Makarov, Shalini Maiti, and Adam W. Harley for insightful discussions. Christian Rupprecht was supported by ERC-CoG UNION 101001212 and VisualAI EP/T028572/1.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C. (2025). CoTracker: It Is Better to Track Together. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_2
DOI: https://doi.org/10.1007/978-3-031-73033-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73032-0
Online ISBN: 978-3-031-73033-7
eBook Packages: Computer Science, Computer Science (R0)