Real-Time Video Generation: Achieved 🥳! We introduce Pyramid Attention Broadcast (PAB), the first approach that achieves real-time DiT-based video generation. By mitigating redundant attention computation, PAB achieves up to 21.6 FPS with a 10.6x acceleration, without sacrificing quality, across popular DiT-based video generation models including Open-Sora, Open-Sora-Plan, and Latte. Notably, as a training-free approach, PAB can empower any future DiT-based video generation model with real-time capabilities.
Recently, Sora and other DiT-based video generation models have attracted significant attention. However, in contrast to image generation, there are few studies focused on accelerating the inference of DiT-based video generation models. Additionally, the inference cost for generating a single video can be substantial, often requiring tens of GPU minutes or even hours. Therefore, accelerating the inference of video generation models has become urgent for broader GenAI applications.
Our study reveals two key observations about attention mechanisms in video diffusion transformers. First, attention differences across time steps exhibit a U-shaped pattern: significant variations occur during the first and last 15% of steps, while the middle 70% of steps are very stable with only minor differences. Second, within this stable middle segment, the differences vary among attention types: spatial attention varies the most, involving high-frequency elements like edges and textures; temporal attention exhibits mid-frequency variations related to movements and dynamics in videos; and cross-modal attention, which links text with video content, is the most stable, analogous to low-frequency signals reflecting textual semantics.
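These observations suggest a simple skipping strategy: within the stable middle segment, reuse (broadcast) each attention output for a number of subsequent steps, with a longer broadcast range for more stable attention types. The sketch below is a hypothetical illustration of this idea, not the official PAB implementation; the function name, the 15%/70%/15% split, and the per-type broadcast ranges are illustrative assumptions.

```python
def run_denoising(num_steps=50, warmup_frac=0.15, cooldown_frac=0.15,
                  broadcast_range={"spatial": 2, "temporal": 4, "cross": 6}):
    """Count real vs. broadcast (reused) attention computations per type.

    Hypothetical sketch: cross-modal attention gets the longest broadcast
    range (most stable), spatial the shortest (most variable).
    """
    cache = {}  # step at which each attention type was last actually computed
    stats = {t: {"computed": 0, "reused": 0} for t in broadcast_range}
    stable_start = int(num_steps * warmup_frac)
    stable_end = int(num_steps * (1 - cooldown_frac))

    for step in range(num_steps):
        in_stable = stable_start <= step < stable_end
        for attn_type, rng in broadcast_range.items():
            last = cache.get(attn_type)
            # Broadcast: reuse the cached output while inside the stable
            # segment and within this attention type's broadcast range.
            if in_stable and last is not None and step - last < rng:
                stats[attn_type]["reused"] += 1
            else:
                # Placeholder for the real attention forward pass.
                stats[attn_type]["computed"] += 1
                cache[attn_type] = step
    return stats
```

Running this toy loop, cross-modal attention ends up reused far more often than spatial attention, mirroring the observed stability ordering while the first and last 15% of steps are always recomputed.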
We measure the total latency of PAB across different models when generating a single video on 8 NVIDIA H100 GPUs. On a single GPU, we achieve a speedup ranging from 1.26x to 1.32x, which remains stable across different schedulers. Scaling to multiple GPUs, our method achieves a speedup of up to 10.6x, scaling almost linearly with the number of GPUs thanks to our improved sequence parallelism.
Xuanlei, Xiaolong, and Kai contributed equally to this work. Kai and Yang are equal advisors.
@misc{zhao2024pab,
    title={Real-Time Video Generation with Pyramid Attention Broadcast},
    author={Xuanlei Zhao and Xiaolong Jin and Kai Wang and Yang You},
    year={2024},
    eprint={2408.12588},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2408.12588},
}