Abstract
This paper identifies significant redundancy in the query-key interactions within the self-attention mechanisms of diffusion transformers, particularly during the early denoising steps. In response to this observation, we present a novel diffusion transformer framework that incorporates an additional set of mediator tokens, which engage with queries and keys separately. By modulating the number of mediator tokens over the denoising phases, our model begins the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, the mediator tokens reduce the complexity of the attention module to linear scale, enhancing the efficiency of global attention. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further reduces the FLOPs required for generation, while also facilitating the generation of high-quality images under varied inference budgets. Extensive experiments demonstrate that the proposed method improves generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. The source code is available at https://github.com/LeapLabTHU/Attention-Mediators.
Y. Pu, Z. Xia, J. Guo—Equal contribution.
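To make the mechanism in the abstract concrete, below is a minimal PyTorch-style sketch of attention routed through a small set of mediator tokens: the mediators first attend to the keys and values, and the queries then attend to the mediators, so queries and keys never interact directly and the cost scales linearly with sequence length. This is an illustrative sketch of the general idea, not the authors' released implementation; all class, parameter, and variable names (MediatorAttention, num_mediators, etc.) are assumptions.

import torch
from torch import nn


class MediatorAttention(nn.Module):
    """Two-stage attention through m learnable mediator tokens (single-head sketch)."""

    def __init__(self, dim: int, num_mediators: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable mediator tokens; their count could in principle be varied
        # per denoising step to trade detail for compute (hypothetical knob).
        self.mediators = nn.Parameter(torch.randn(num_mediators, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim); cost is O(n * m) rather than O(n^2).
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        med = self.mediators.unsqueeze(0).expand(b, -1, -1)            # (b, m, d)

        # Stage 1: mediators gather information from keys/values.
        attn_mk = torch.softmax(med @ k.transpose(1, 2) / d ** 0.5, dim=-1)
        med_v = attn_mk @ v                                            # (b, m, d)

        # Stage 2: queries read from the summarized mediator values.
        attn_qm = torch.softmax(q @ med.transpose(1, 2) / d ** 0.5, dim=-1)
        out = attn_qm @ med_v                                          # (b, n, d)
        return self.out_proj(out)

Under this reading, the step-wise dynamic schedule described in the abstract would correspond to selecting a different num_mediators at different denoising time steps, e.g. fewer mediators early in denoising and more in the later, detail-oriented steps.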
References
Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125 (2022)
Bao, F., et al.: All are worth words: a ViT backbone for diffusion models. In: IEEE CVPR (2023)
Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: ICLR (2019)
Brooks, T., et al.: Video generation models as world simulators (2024). https://openai.com/research/video-generation-models-as-world-simulators
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: MaskGIT: masked generative image transformer. In: IEEE CVPR (2022)
Chen, J., et al.: PixArt-\(\sigma \): weak-to-strong training of diffusion transformer for 4k text-to-image generation. In: ECCV (2024)
Chen, J., et al.: PixArt-\(\delta \): fast and controllable image generation with latent consistency models. In: ICML (2024)
Chen, J., et al.: PixArt-\(\alpha \): fast training of diffusion transformer for photorealistic text-to-image synthesis. In: ICLR (2024)
Crowson, K., Baumann, S.A., Birch, A., Abraham, T.M., Kaplan, D.Z., Shippole, E.: Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In: ICML (2024)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE CVPR (2009)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Esser, P., et al.: Scaling rectified flow transformers for high-resolution image synthesis (2024). https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf
Fang, Y., et al.: EVA: exploring the limits of masked visual representation learning at scale. In: IEEE CVPR (2023)
Gao, S., Zhou, P., Cheng, M.M., Yan, S.: Masked diffusion transformer is a strong image synthesizer. In: IEEE ICCV (2023)
Guo, J., et al.: Zero-shot generative model adaptation via image-specific prompt learning. In: IEEE CVPR (2023)
Guo, J., et al.: Smooth diffusion: crafting smooth latent spaces in diffusion models. In: IEEE CVPR (2024)
Han, D., Pan, X., Han, Y., Song, S., Huang, G.: FLatten transformer: vision transformer using focused linear attention. In: IEEE ICCV (2023)
Han, D., Ye, T., Han, Y., Xia, Z., Song, S., Huang, G.: Agent attention: on the integration of softmax and linear attention. In: ECCV (2024)
Han, Y., et al.: Dynamic perceiver for efficient visual recognition. In: IEEE ICCV (2023)
Han, Y., Huang, G., Song, S., Yang, L., Wang, H., Wang, Y.: Dynamic neural networks: a survey. In: IEEE TPAMI (2021)
Han, Y., Huang, G., Song, S., Yang, L., Zhang, Y., Jiang, H.: Spatially adaptive feature refinement for efficient inference. In: IEEE TIP (2021)
Han, Y., et al.: Latency-aware unified dynamic networks for efficient image recognition. In: IEEE TPAMI (2024)
Han, Y., et al.: Learning to weight samples for dynamic early-exiting networks. In: ECCV (2022)
Han, Y., Yuan, Z., Pu, Y., Xue, C., Song, S., Sun, G., Huang, G.: Latency-aware spatial-wise dynamic networks. In: NeurIPS (2022)
Hansen, C., Hansen, C., Alstrup, S., Simonsen, J.G., Lioma, C.: Neural speed reading with structural-jump-LSTM. In: ICLR (2019)
Hassani, A., Walton, S., Li, J., Li, S., Shi, H.: Neighborhood attention transformer. In: IEEE CVPR (2023)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NeurIPS (2017)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. In: JMLR (2022)
Hoogeboom, E., Heek, J., Salimans, T.: simple diffusion: end-to-end diffusion for high resolution images. In: ICML (2023)
Huang, G., Chen, D., Li, T., Wu, F., Van Der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: ICLR (2018)
Huang, G., et al.: Glance and focus networks for dynamic visual recognition. In: IEEE TPAMI (2022)
Jabri, A., Fleet, D., Chen, T.: Scalable adaptive computation for iterative generation. In: ICML (2023)
Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: fast autoregressive transformers with linear attention. In: ICML (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Kingma, D.P., Gao, R.: Understanding the diffusion objective as a weighted integral of ELBOs. In: NeurIPS (2023)
Kirillov, A., et al.: Segment anything. In: IEEE ICCV (2023)
Li, Y., et al.: Efficient and explicit modelling of image hierarchies for image restoration. In: IEEE CVPR (2023)
Li, Z., et al.: Hunyuan-DiT: a powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748 (2024)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: IEEE ICCV (2021)
Liu, Z., et al.: SCoFT: self-contrastive fine-tuning for equitable image generation. In: IEEE CVPR (2024)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Lu, H., Yang, G., Fei, N., Huo, Y., Lu, Z., Luo, P., Ding, M.: VDT: general-purpose video diffusion transformers via mask modeling. In: ICLR (2023)
Lu, J., et al.: SOFT: softmax-free transformer with linear complexity. In: NeurIPS (2021)
Lu, Z., Wang, Z., Huang, D., Wu, C., Liu, X., Ouyang, W., Bai, L.: FiT: flexible vision transformer for diffusion model. In: ICML (2024)
Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: exploring flow and diffusion-based generative models with scalable interpolant transformers. In: ECCV (2024)
Ma, X., et al.: Latte: latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024)
Michel, P., Levy, O., Neubig, G.: Are sixteen heads really better than one? In: NeurIPS (2019)
Mo, S., Xie, E., Chu, R., Hong, L., Niessner, M., Li, Z.: DiT-3D: exploring plain diffusion transformers for 3D shape generation. In: NeurIPS (2023)
Oquab, M., et al.: DINOv2: learning robust visual features without supervision. In: TMLR (2024)
Pan, X., Ye, T., Xia, Z., Song, S., Huang, G.: Slide-Transformer: hierarchical vision transformer with local self-attention. In: IEEE CVPR (2023)
Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: IEEE ICCV (2023)
Pu, Y., Han, Y., Wang, Y., Feng, J., Deng, C., Huang, G.: Fine-grained recognition with learnable semantic data augmentation. In: IEEE TIP (2023)
Pu, Y., et al.: Rank-detr for high quality object detection. In: NeurIPS (2024)
Pu, Y., et al.: Adaptive rotated convolution for rotated object detection. In: IEEE ICCV (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. In: JMLR (2020)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE CVPR (2022)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: MICCAI (2015)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Sauer, A., Schwarz, K., Geiger, A.: StyleGAN-XL: scaling StyleGAN to large diverse datasets. In: SIGGRAPH (2022)
Shen, Z., Zhang, M., Zhao, H., Yi, S., Li, H.: Efficient attention: attention with linear complexities. In: WACV (2021)
Song, L., et al.: Dynamic grained encoder for vision transformers. In: NeurIPS (2021)
Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, C., Yang, Q., Huang, R., Song, S., Huang, G.: Efficient knowledge distillation from model checkpoints. In: NeurIPS (2022)
Wang, J., et al.: GRA: detecting oriented objects through group-wise rotating and attention. In: ECCV (2024)
Wang, S., Wu, L., Cui, L., Shen, Y.: Glancing at the patch: anomaly localization with global and local feature comparison. In: IEEE CVPR (2021)
Wang, Y., Chen, Z., Jiang, H., Song, S., Han, Y., Huang, G.: Adaptive focus for efficient video recognition. In: IEEE ICCV (2021)
Wang, Y., Han, Y., Wang, C., Song, S., Tian, Q., Huang, G.: Computation-efficient deep learning for computer vision: a survey. In: Cybernetics and Intelligence (2023)
Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G.: Not all images are worth 16x16 words: dynamic transformers for efficient image recognition. In: NeurIPS (2021)
Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: GSVA: generalized segmentation via multimodal large language models. In: IEEE CVPR (2024)
Xia, Z., Pan, X., Jin, X., He, Y., Xue, H., Song, S., Huang, G.: Budgeted training for vision transformer. In: ICLR (2023)
Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: Vision transformer with deformable attention. In: IEEE CVPR (2022)
Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: DAT++: spatially dynamic vision transformer with deformable attention. arXiv preprint arXiv:2309.01430 (2023)
Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., Singh, V.: Nyströmformer: a Nyström-based algorithm for approximating self-attention. In: AAAI (2021)
Xue, S., Yi, M., Luo, W., Zhang, S., Sun, J., Li, Z., Ma, Z.M.: SA-Solver: stochastic Adams solver for fast sampling of diffusion models. In: NeurIPS (2023)
Yang, Q., Wang, S., Lin, M.G., Song, S., Huang, G.: Boosting offline reinforcement learning with action preference query. In: ICML (2023)
Yang, Q., Wang, S., Zhang, Q., Huang, G., Song, S.: Hundreds guide millions: adaptive offline reinforcement learning with expert guidance. In: IEEE TNNLS (2023)
Yang, X., Shih, S.M., Fu, Y., Zhao, X., Ji, S.: Your ViT is secretly a hybrid discriminative-generative diffusion model. arXiv:2208.07791 (2022)
You, H., et al.: Castling-ViT: compressing self-attention via switching towards linear-angular attention during vision transformer inference. In: IEEE CVPR (2023)
Zhang, T., Huang, H.Y., Feng, C., Cao, L.: Enlivening redundant heads in multi-head self-attention for machine translation. In: EMNLP (2021)
Zheng, H., Nie, W., Vahdat, A., Anandkumar, A.: Fast training of diffusion models with masked transformers. In: TMLR (2024)
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grants 62321005 and 42327901.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pu, Y. et al. (2025). Efficient Diffusion Transformer with Step-Wise Dynamic Attention Mediators. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_24
DOI: https://doi.org/10.1007/978-3-031-72633-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72632-3
Online ISBN: 978-3-031-72633-0
eBook Packages: Computer Science, Computer Science (R0)