
Efficient Diffusion Transformer with Step-Wise Dynamic Attention Mediators

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

This paper identifies significant redundancy in the query-key interactions within the self-attention mechanisms of diffusion transformer models, particularly during the early denoising steps. In response to this observation, we present a novel diffusion transformer framework that incorporates an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens across the denoising phases, our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, the mediator tokens reduce the attention module’s complexity to linear, enhancing the efficiency of global attention. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further decreases the FLOPs required for generation and facilitates the synthesis of high-quality images under varied inference budgets. Extensive experiments demonstrate that the proposed method improves generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. The source code is available at https://github.com/LeapLabTHU/Attention-Mediators.
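To make the mechanism described above concrete, here is a minimal PyTorch sketch of attention routed through mediator tokens. It assumes the two-stage softmax form the abstract suggests (mediators engage with the keys, then the queries engage with the mediators); all names are illustrative, and this is a sketch of the idea, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def mediator_attention(q, k, v, mediators):
        # q, k, v:    (B, N, d) queries, keys, values over N image tokens.
        # mediators:  (B, m, d) mediator tokens, with m << N.
        # Cost is O(N * m * d) instead of the O(N^2 * d) of full attention.
        scale = q.shape[-1] ** -0.5
        # Stage 1: mediators attend to all keys and summarize the values.
        ctx = F.softmax((mediators @ k.transpose(-2, -1)) * scale, dim=-1) @ v
        # Stage 2: queries attend only to the m mediators to read the summary.
        return F.softmax((q @ mediators.transpose(-2, -1)) * scale, dim=-1) @ ctx

The step-wise adjustment would then amount to picking the mediator count per denoising step. Consistent with the redundancy the abstract observes in early steps, a hypothetical schedule might use fewer mediators early and more later:

    def mediators_for_step(t, T, m_min=16, m_max=64):
        # Hypothetical linear schedule (not from the paper): early, highly
        # redundant steps (large t) get few mediators; late, detail-heavy
        # steps (t near 0) get more.
        frac = 1.0 - t / max(T - 1, 1)  # 0 at the noisiest step, 1 at the last
        return int(m_min + frac * (m_max - m_min))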

Y. Pu, Z. Xia, and J. Guo contributed equally.



Notes

  1. https://github.com/openai/guided-diffusion/tree/main/evaluations.

  2. https://github.com/GaParmar/clean-fid.


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grants 62321005 and 42327901.

Author information


Corresponding author

Correspondence to Gao Huang.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Pu, Y. et al. (2025). Efficient Diffusion Transformer with Step-Wise Dynamic Attention Mediators. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_24


  • DOI: https://doi.org/10.1007/978-3-031-72633-0_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72632-3

  • Online ISBN: 978-3-031-72633-0

  • eBook Packages: Computer Science (R0)
