
Long-Term Temporal Context Gathering for Neural Video Compression

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Most existing neural video codecs (NVCs) extract only short-term temporal context through optical flow-based motion compensation. Such short-term temporal context suffers from error propagation and lacks awareness of long-term relevant information, which limits performance, particularly over a long prediction chain. In this paper, we address the issue by facilitating the synergy of long-term and short-term temporal contexts during feature propagation. Specifically, we introduce the DCVC-LCG framework, which uses a Long-term temporal Context Gathering (LCG) module to search for diverse and relevant context in the long-term reference feature. The searched long-term context is then integrated into the short-term reference feature to refine feature propagation, which enhances reconstruction quality and mitigates propagation errors. During the search, distinguishing helpful context from irrelevant information is both challenging and vital. To this end, we cluster the reference feature and perform the search in an intra-cluster fashion to improve context mining. This synergistic integration of long-term and short-term temporal contexts significantly enhances temporal correlation modeling. Additionally, to improve probability estimation in variable-bitrate coding, we introduce the quantization parameter as an extra prior to the entropy model. Comprehensive evaluations demonstrate the effectiveness of our method, which offers an average 11.3% bitrate saving over ECM on 1080p video datasets under the single intra-frame setting.
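The intra-cluster search described in the abstract can be sketched in rough terms: cluster the long-term reference tokens, assign each short-term token to its nearest cluster, and gather context via attention restricted to tokens of the same cluster. The sketch below is purely illustrative and is not the authors' implementation; the crude k-means routine, token shapes, scaling, and the residual update are all assumptions.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Crude k-means over token vectors; returns centers and a label per token."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(iters):
        # assign each token to its nearest center, then recompute centers
        labels = ((x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return centers, labels

def intra_cluster_gather(short_feat, long_feat, k=4):
    """For each short-term token, attend only over long-term reference tokens
    in the same cluster, adding the gathered context as a residual refinement."""
    centers, long_labels = kmeans(long_feat, k)
    # assign short-term tokens to the nearest long-term cluster center
    short_labels = ((short_feat[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    out = short_feat.copy()
    d = short_feat.shape[1]
    for j in range(k):
        q = short_feat[short_labels == j]
        kv = long_feat[long_labels == j]
        if len(q) == 0 or len(kv) == 0:
            continue  # nothing to gather for this cluster
        # scaled dot-product attention restricted to cluster j
        attn = q @ kv.T / np.sqrt(d)
        attn = np.exp(attn - attn.max(1, keepdims=True))
        attn /= attn.sum(1, keepdims=True)
        out[short_labels == j] = q + attn @ kv  # residual context injection
    return out
```

In the actual framework the refined feature would feed the conditional coding pipeline; here the function simply returns a tensor of the same shape as the short-term reference feature.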

L. Qi and Z. Jia—This work was done when Linfeng Qi and Zhaoyang Jia were interns at Microsoft Research Asia.



Author information


Corresponding author

Correspondence to Linfeng Qi.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1339 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Qi, L., Jia, Z., Li, J., Li, B., Li, H., Lu, Y. (2025). Long-Term Temporal Context Gathering for Neural Video Compression. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_18


  • DOI: https://doi.org/10.1007/978-3-031-72848-8_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72847-1

  • Online ISBN: 978-3-031-72848-8

  • eBook Packages: Computer Science (R0)
