
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Generating high-quality videos that depict the desired realistic content is a challenging task due to their intricate high dimensionality and complexity. Several recent diffusion-based methods have shown comparable performance by compressing videos into a lower-dimensional latent space using traditional video autoencoder architectures. However, such methods, which employ standard frame-wise 2D or 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which captures spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video, comprising: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on standard video generation benchmarks such as UCF101, SkyTimelapse, and TaiChi demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control). The source code and pre-trained models will be made publicly available upon acceptance.
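
To make the disentangled representation concrete, the snippet below is a minimal, illustrative sketch of the two branches the abstract describes: a 2D triplane-style projection that captures global context, and a single-level 3D Haar wavelet decomposition that separates coarse volume information from high-frequency detail. It is not the authors' implementation; the function names and the parameter-free averaging projections are assumptions made for illustration, whereas HVDM itself uses learned projection layers and 3D convolutions on the wavelet subbands.

```python
# Illustrative sketch only (hypothetical names, not the HVDM codebase):
# a triplane projection of a video tensor and a single-level 3D Haar split.
import torch


def project_to_triplane(video: torch.Tensor):
    """video: (B, C, T, H, W) -> three 2D maps obtained by averaging out one
    axis each (a simple stand-in for learned projection layers)."""
    xy = video.mean(dim=2)   # (B, C, H, W): collapse time
    xt = video.mean(dim=3)   # (B, C, T, W): collapse height
    yt = video.mean(dim=4)   # (B, C, T, H): collapse width
    return xy, xt, yt


def _haar_pair(x: torch.Tensor, dim: int):
    """Split one axis into low/high-frequency halves (averaging convention)."""
    even = x.index_select(dim, torch.arange(0, x.size(dim), 2, device=x.device))
    odd = x.index_select(dim, torch.arange(1, x.size(dim), 2, device=x.device))
    return (even + odd) / 2.0, (even - odd) / 2.0


def haar_wavelet_3d(video: torch.Tensor):
    """Single-level 3D Haar decomposition of (B, C, T, H, W) into 8 subbands."""
    bands = {"": video}
    for dim in (2, 3, 4):                      # time, height, width
        new_bands = {}
        for key, band in bands.items():
            lo, hi = _haar_pair(band, dim)
            new_bands[key + "L"] = lo
            new_bands[key + "H"] = hi
        bands = new_bands
    return bands  # keys "LLL" (coarse volume) ... "HHH" (finest detail)


if __name__ == "__main__":
    vid = torch.randn(1, 3, 16, 64, 64)        # toy video batch
    planes = project_to_triplane(vid)
    subbands = haar_wavelet_3d(vid)
    print([tuple(p.shape) for p in planes])
    print({k: tuple(v.shape) for k, v in subbands.items()})
```

In the full model, subbands like these would be encoded by 3D convolutions and fused with the 2D projected latent; the example above only prints the resulting tensor shapes to show how the video volume is factored.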



Acknowledgements

This research was supported by the National Supercomputing Center in Korea (KISTI) (TS-2023-RG-0004), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1C1C1008496, RS-2024-00346597), Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2020-II201336, Artificial Intelligence graduate school support (UNIST), RS-2021-II212068, Comprehensive Video Understanding and Generation with Knowledge-based Deep Logic Neural Network), and Culture, Sports, and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism (Research on neural watermark technology for copyright protection of generative AI 3D content, RS-2024-00348469, RS-2024-00333068).

Author information


Corresponding authors

Correspondence to Seungryong Kim or Jaejun Yoo.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 111721 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kim, K. et al. (2025). Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15110. Springer, Cham. https://doi.org/10.1007/978-3-031-72943-0_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72943-0_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72942-3

  • Online ISBN: 978-3-031-72943-0

  • eBook Packages: Computer Science, Computer Science (R0)
