Real-Time Video Generation with Pyramid Attention Broadcast

1National University of Singapore,  2Purdue University
{xuanlei, kai.wang, youy}@comp.nus.edu.sg   jin509@purdue.edu
(* indicates equal contribution)

Real-Time Video Generation: Achieved 🥳! We introduce Pyramid Attention Broadcast (PAB), the first approach to achieve real-time DiT-based video generation. By mitigating redundant attention computation, PAB delivers up to 21.6 FPS with a 10.6x speedup, without sacrificing quality, across popular DiT-based video generation models including Open-Sora, Open-Sora-Plan, and Latte. Notably, as a training-free approach, PAB can empower any future DiT-based video generation model with real-time capabilities.


Video 1: Comparison of video generation speed between the original method and ours. We test on Open-Sora with 5 videos of 4 s (96 frames) at 480p resolution. The baseline uses 1 NVIDIA H100 GPU; ours uses 8.

Pyramid Attention Broadcast

Motivation

Recently, Sora and other DiT-based video generation models have attracted significant attention. However, in contrast to image generation, there are few studies focused on accelerating the inference of DiT-based video generation models. Additionally, the inference cost for generating a single video can be substantial, often requiring tens of GPU minutes or even hours. Therefore, accelerating the inference of video generation models has become urgent for broader GenAI applications.


Figure 1: We compare the attention output differences between the current and previous diffusion steps. Differences are quantified using the Mean Squared Error (MSE) and averaged across all layers for each diffusion step.

Implementation

Our study reveals two key observations about attention mechanisms in video diffusion transformers. First, attention differences across time steps exhibit a U-shaped pattern, with significant variations occurring during the first and last 15% of steps, while the middle 70% of steps are very stable with minor differences. Second, within the stable middle segment, the differences vary among attention types: spatial attention varies the most, involving high-frequency elements like edges and textures; temporal attention exhibits mid-frequency variations related to movement and dynamics in videos; cross-modal attention is the most stable, linking text with video content, analogous to low-frequency signals reflecting textual semantics.
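As a rough illustration of how these per-step differences can be measured, the sketch below computes the MSE between attention outputs of consecutive diffusion steps, averaged over layers, as in Figure 1. This is a generic sketch rather than our evaluation code; collecting per-layer outputs (e.g., via forward hooks) is an assumption here.

```python
import torch

def attention_output_mse(prev_outputs, curr_outputs):
    """MSE between the attention outputs of two consecutive diffusion steps,
    averaged over all transformer layers.

    prev_outputs / curr_outputs: lists of tensors, one per layer, holding
    that layer's attention output at the previous / current step.
    """
    diffs = [
        torch.mean((prev - curr) ** 2).item()
        for prev, curr in zip(prev_outputs, curr_outputs)
    ]
    return sum(diffs) / len(diffs)

# Hypothetical usage: outputs[t] collects per-layer attention outputs at step t
# (e.g., via forward hooks); plotting the per-step averages reproduces the
# U-shaped curve described above.
# mse_per_step = [attention_output_mse(outputs[t - 1], outputs[t]) for t in range(1, T)]
```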


Figure 2: We propose pyramid attention broadcast (shown on the right side) which sets different broadcast ranges for three attentions based on their differences. The smaller the variation in attention, the broader the broadcast range. During runtime, we broadcast attention results to the next several steps (shown on the left side) to avoid redundant attention computations. \( x_t \) refers to the features at timestep \( t \).
Building on these insights, we propose Pyramid Attention Broadcast to avoid unnecessary attention computations. In the middle segment, where attention outputs show minor differences, we broadcast one diffusion step's attention outputs to several subsequent steps, significantly reducing computational cost. Furthermore, for more efficient computation and minimal quality loss, we set different broadcast ranges for the three attention types based on their stability and differences: the more stable the attention, the broader its broadcast range. This simple yet effective strategy achieves up to a 35% speedup with negligible quality loss, even without post-training.
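The sketch below illustrates the broadcasting idea with a small wrapper around an attention module: within the stable middle segment, a cached attention output is reused for the next several steps instead of being recomputed. This is a minimal sketch, not the released PAB implementation; the class name, broadcast ranges, and stable-segment bounds are illustrative assumptions.

```python
import torch.nn as nn

class BroadcastAttention(nn.Module):
    """Minimal sketch of attention-output broadcasting (illustrative, not the official PAB code).

    `attn` is any attention module; `broadcast_range` is how many consecutive
    diffusion steps share one computed output. More stable attention types
    (cross > temporal > spatial) get larger broadcast ranges.
    """

    def __init__(self, attn, broadcast_range, stable_steps):
        super().__init__()
        self.attn = attn
        self.broadcast_range = broadcast_range
        self.stable_steps = stable_steps      # (start, end) of the stable middle segment
        self.cache = None
        self.last_computed_step = None

    def forward(self, x, step, **kwargs):
        start, end = self.stable_steps
        in_stable_segment = start <= step <= end
        can_reuse = (
            in_stable_segment
            and self.cache is not None
            and step - self.last_computed_step < self.broadcast_range
        )
        if can_reuse:
            return self.cache                 # broadcast: skip the attention computation
        out = self.attn(x, **kwargs)          # recompute and refresh the cache
        self.cache, self.last_computed_step = out, step
        return out

# Illustrative ranges only (not the paper's exact hyperparameters):
# spatial_attn  = BroadcastAttention(spatial_module,  broadcast_range=2, stable_steps=(5, 25))
# temporal_attn = BroadcastAttention(temporal_module, broadcast_range=4, stable_steps=(5, 25))
# cross_attn    = BroadcastAttention(cross_module,    broadcast_range=6, stable_steps=(5, 25))
```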

Parallelism


Figure 3: Comparison between the original Dynamic Sequence Parallelism (DSP) and ours. When temporal attention is broadcast, we can avoid all communication.
To further enhance video generation speed, we improve sequence parallelism based on Dynamic Sequence Parallelism (DSP). Sequence parallelism splits a video into different parts across multiple GPUs, reducing the workload on each GPU and decreasing generation latency. However, DSP introduces significant communication overhead, requiring two all-to-all communications for temporal attention. By broadcasting temporal attention with PAB, we eliminate these communications, as temporal attention no longer needs to be computed. This reduces communication overhead by more than 50%, enabling more efficient distributed inference for real-time video generation.
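A simplified sketch of this idea is shown below: under sequence parallelism, temporal attention normally needs an all-to-all before and after the computation to gather the full frame dimension, but on broadcast steps the cached output is returned and both communications are skipped. The sharding layout, tensor shapes, and function name here are assumptions for illustration, not DSP's actual implementation.

```python
import torch
import torch.distributed as dist

def temporal_attention_sp(x, attn, cached_output=None):
    """Sequence-parallel temporal attention with optional broadcasting (sketch).

    Assumed layout: each rank holds x of shape (frames_local, tokens, dim),
    i.e. the frame dimension is sharded across ranks. Temporal attention needs
    every frame, so we re-shard along tokens via all-to-all, attend, and
    re-shard back. On broadcast steps, we skip both compute and communication.
    """
    if cached_output is not None:
        return cached_output  # broadcast step: no computation, no all-to-all

    world_size = dist.get_world_size()

    # all-to-all #1: re-shard so each rank holds all frames for a token slice
    send = [c.contiguous() for c in x.chunk(world_size, dim=1)]
    recv = [torch.empty_like(c) for c in send]
    dist.all_to_all(recv, send)
    full_frames = torch.cat(recv, dim=0)      # (frames, tokens_local, dim)

    out = attn(full_frames)                   # temporal attention over all frames

    # all-to-all #2: re-shard back to the original (frames_local, tokens, dim) layout
    send = [c.contiguous() for c in out.chunk(world_size, dim=0)]
    recv = [torch.empty_like(c) for c in send]
    dist.all_to_all(recv, send)
    return torch.cat(recv, dim=1)
```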

Evaluations

Speedups

We measure the total latency of PAB for generating a single video with different models on 8 NVIDIA H100 GPUs. With a single GPU, we achieve a speedup ranging from 1.26x to 1.32x, which remains stable across different schedulers. Scaling to multiple GPUs, our method achieves a speedup of up to 10.6x, which scales almost linearly with the number of GPUs thanks to our efficient improvement of sequence parallelism.

Qualitative Results



Quantitative Results


Related Works

  • Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing Efficient Video Production for All. GitHub, 2024.
  • PKU-Yuan Lab and Tuzhan AI etc. Open-Sora-Plan. GitHub, 2024.
  • Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent Diffusion Transformer for Video Generation. arXiv, 2024.
  • Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, and Yang You. DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers. arXiv, 2024.
  • Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, and Song Han. DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models. In CVPR, 2024.
  • Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating Diffusion Models for Free. In CVPR, 2024.
  • Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive Benchmark Suite for Video Generative Models. In CVPR, 2024.
  • David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation. arXiv, 2024.
  • Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data. In ICLR, 2023.
  • Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A Space-Time Diffusion Model for Video Generation. arXiv, 2024.
  • Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. In CVPR, 2023.
  • Zhengqing Yuan, Ruoxi Chen, Zhaoxu Li, Haolong Jia, Lifang He, Chi Wang, and Lichao Sun. Mora: Enabling Generalist Video Generation via A Multi-Agent Framework. arXiv, 2024.
Acknowledgments

Xuanlei, Xiaolong, and Kai contributed equally to this work. Kai and Yang advised equally.

BibTeX

    @misc{zhao2024pab,
          title={Real-Time Video Generation with Pyramid Attention Broadcast},
          author={Xuanlei Zhao and Xiaolong Jin and Kai Wang and Yang You},
          year={2024},
          eprint={2408.12588},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2408.12588},
    }