Abstract
This paper identifies significant redundancy in the query-key interactions within the self-attention mechanisms of diffusion transformers, particularly during the early denoising steps. In response to this observation, we present a novel diffusion transformer framework that incorporates an additional set of mediator tokens, which engage with queries and keys separately. By modulating the number of mediator tokens over the denoising phases, our model begins the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, the mediator tokens reduce the complexity of the attention module to linear scale, enhancing the efficiency of global attention. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further reduces the FLOPs required for generation, while also facilitating the generation of high-quality images under varied inference budgets. Extensive experiments demonstrate that the proposed method improves generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. The source code is available at https://github.com/LeapLabTHU/Attention-Mediators.
Y. Pu, Z. Xia, J. Guo—Equal contribution.
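To make the mechanism in the abstract concrete, below is a minimal PyTorch-style sketch of attention routed through a small set of mediator tokens: the mediators first attend to the keys and values, and the queries then attend to the mediators, so queries and keys never interact directly and the cost scales linearly with sequence length. This is an illustrative sketch of the general idea, not the authors' released implementation; all class, parameter, and variable names (MediatorAttention, num_mediators, etc.) are assumptions.

import torch
from torch import nn


class MediatorAttention(nn.Module):
    """Two-stage attention through m learnable mediator tokens (single-head sketch)."""

    def __init__(self, dim: int, num_mediators: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable mediator tokens; their count could in principle be varied
        # per denoising step to trade detail for compute (hypothetical knob).
        self.mediators = nn.Parameter(torch.randn(num_mediators, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim); cost is O(n * m) rather than O(n^2).
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        med = self.mediators.unsqueeze(0).expand(b, -1, -1)            # (b, m, d)

        # Stage 1: mediators gather information from keys/values.
        attn_mk = torch.softmax(med @ k.transpose(1, 2) / d ** 0.5, dim=-1)
        med_v = attn_mk @ v                                            # (b, m, d)

        # Stage 2: queries read from the summarized mediator values.
        attn_qm = torch.softmax(q @ med.transpose(1, 2) / d ** 0.5, dim=-1)
        out = attn_qm @ med_v                                          # (b, n, d)
        return self.out_proj(out)

Under this reading, the step-wise dynamic schedule described in the abstract would correspond to selecting a different num_mediators at different denoising time steps, e.g. fewer mediators early in denoising and more in the later, detail-oriented steps.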
References
Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125 (2022)
Bao, F., et al.: All are worth words: a ViT backbone for diffusion models. In: IEEE CVPR (2023)
Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: ICLR (2019)
Brooks, T., et al.: Video generation models as world simulators (2024). https://openai.com/research/video-generation-models-as-world-simulators
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: MaskGIT: masked generative image transformer. In: IEEE CVPR (2022)
Chen, J., et al.: PixArt-\(\sigma \): weak-to-strong training of diffusion transformer for 4k text-to-image generation. In: ECCV (2024)
Chen, J., et al.: PixArt-\(\delta \): fast and controllable image generation with latent consistency models. In: ICML (2024)
Chen, J., et al.: PixArt-\(\alpha \): fast training of diffusion transformer for photorealistic text-to-image synthesis. In: ICLR (2024)
Crowson, K., Baumann, S.A., Birch, A., Abraham, T.M., Kaplan, D.Z., Shippole, E.: Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In: ICML (2024)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE CVPR (2009)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Esser, P., et al.: Scaling rectified flow transformers for high-resolution image synthesis (2024). https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf
Fang, Y., et al.: EVA: exploring the limits of masked visual representation learning at scale. In: IEEE CVPR (2023)
Gao, S., Zhou, P., Cheng, M.M., Yan, S.: Masked diffusion transformer is a strong image synthesizer. In: IEEE ICCV (2023)
Guo, J., et al.: Zero-shot generative model adaptation via image-specific prompt learning. In: IEEE CVPR (2023)
Guo, J., et al.: Smooth diffusion: crafting smooth latent spaces in diffusion models. In: IEEE CVPR (2024)
Han, D., Pan, X., Han, Y., Song, S., Huang, G.: FLatten transformer: vision transformer using focused linear attention. In: IEEE ICCV (2023)
Han, D., Ye, T., Han, Y., Xia, Z., Song, S., Huang, G.: Agent attention: on the integration of softmax and linear attention. In: ECCV (2024)
Han, Y., et al.: Dynamic perceiver for efficient visual recognition. In: IEEE ICCV (2023)
Han, Y., Huang, G., Song, S., Yang, L., Wang, H., Wang, Y.: Dynamic neural networks: a survey. In: IEEE TPAMI (2021)
Han, Y., Huang, G., Song, S., Yang, L., Zhang, Y., Jiang, H.: Spatially adaptive feature refinement for efficient inference. In: IEEE TIP (2021)
Han, Y., et al.: Latency-aware unified dynamic networks for efficient image recognition. In: IEEE TPAMI (2024)
Han, Y., et al.: Learning to weight samples for dynamic early-exiting networks. In: ECCV (2022)
Han, Y., Yuan, Z., Pu, Y., Xue, C., Song, S., Sun, G., Huang, G.: Latency-aware spatial-wise dynamic networks. In: NeurIPS (2022)
Hansen, C., Hansen, C., Alstrup, S., Simonsen, J.G., Lioma, C.: Neural speed reading with structural-jump-LSTM. In: ICLR (2019)
Hassani, A., Walton, S., Li, J., Li, S., Shi, H.: Neighborhood attention transformer. In: IEEE CVPR (2023)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NeurIPS (2017)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. In: JMLR (2022)
Hoogeboom, E., Heek, J., Salimans, T.: simple diffusion: end-to-end diffusion for high resolution images. In: ICML (2023)
Huang, G., Chen, D., Li, T., Wu, F., Van Der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: ICLR (2018)
Huang, G., et al.: Glance and focus networks for dynamic visual recognition. In: IEEE TPAMI (2022)
Jabri, A., Fleet, D., Chen, T.: Scalable adaptive computation for iterative generation. In: ICML (2023)
Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: fast autoregressive transformers with linear attention. In: ICML (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Kingma, D.P., Gao, R.: Understanding the diffusion objective as a weighted integral of ELBOs. In: NeurIPS (2023)
Kirillov, A., et al.: Segment anything. In: IEEE ICCV (2023)
Li, Y., et al.: Efficient and explicit modelling of image hierarchies for image restoration. In: IEEE CVPR (2023)
Li, Z., et al.: Hunyuan-DiT: a powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748 (2024)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: IEEE ICCV (2021)
Liu, Z., et al.: SCoFT: self-contrastive fine-tuning for equitable image generation. In: IEEE CVPR (2024)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Lu, H., Yang, G., Fei, N., Huo, Y., Lu, Z., Luo, P., Ding, M.: VDT: general-purpose video diffusion transformers via mask modeling. In: ICLR (2023)
Lu, J., et al.: SOFT: softmax-free transformer with linear complexity. In: NeurIPS (2021)
Lu, Z., Wang, Z., Huang, D., Wu, C., Liu, X., Ouyang, W., Bai, L.: FiT: flexible vision transformer for diffusion model. In: ICML (2024)
Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: exploring flow and diffusion-based generative models with scalable interpolant transformers. In: ECCV (2024)
Ma, X., et al.: Latte: latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024)
Michel, P., Levy, O., Neubig, G.: Are sixteen heads really better than one? In: NeurIPS (2019)
Mo, S., Xie, E., Chu, R., Hong, L., Niessner, M., Li, Z.: DiT-3D: exploring plain diffusion transformers for 3D shape generation. In: NeurIPS (2023)
Oquab, M., et al.: DINOv2: learning robust visual features without supervision. In: TMLR (2024)
Pan, X., Ye, T., Xia, Z., Song, S., Huang, G.: Slide-Transformer: hierarchical vision transformer with local self-attention. In: IEEE CVPR (2023)
Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: IEEE ICCV (2023)
Pu, Y., Han, Y., Wang, Y., Feng, J., Deng, C., Huang, G.: Fine-grained recognition with learnable semantic data augmentation. In: IEEE TIP (2023)
Pu, Y., et al.: Rank-detr for high quality object detection. In: NeurIPS (2024)
Pu, Y., et al.: Adaptive rotated convolution for rotated object detection. In: IEEE ICCV (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. In: JMLR (2020)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE CVPR (2022)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: MICCAI (2015)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Sauer, A., Schwarz, K., Geiger, A.: StyleGAN-XL: scaling StyleGAN to large diverse datasets. In: SIGGRAPH (2022)
Shen, Z., Zhang, M., Zhao, H., Yi, S., Li, H.: Efficient attention: attention with linear complexities. In: WACV (2021)
Song, L., et al.: Dynamic grained encoder for vision transformers. In: NeurIPS (2021)
Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, C., Yang, Q., Huang, R., Song, S., Huang, G.: Efficient knowledge distillation from model checkpoints. In: NeurIPS (2022)
Wang, J., et al.: GRA: detecting oriented objects through group-wise rotating and attention. In: ECCV (2024)
Wang, S., Wu, L., Cui, L., Shen, Y.: Glancing at the patch: anomaly localization with global and local feature comparison. In: IEEE CVPR (2021)
Wang, Y., Chen, Z., Jiang, H., Song, S., Han, Y., Huang, G.: Adaptive focus for efficient video recognition. In: IEEE ICCV (2021)
Wang, Y., Han, Y., Wang, C., Song, S., Tian, Q., Huang, G.: Computation-efficient deep learning for computer vision: a survey. In: Cybernetics and Intelligence (2023)
Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G.: Not all images are worth 16x16 words: dynamic transformers for efficient image recognition. In: NeurIPS (2021)
Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: GSVA: generalized segmentation via multimodal large language models. In: IEEE CVPR (2024)
Xia, Z., Pan, X., Jin, X., He, Y., Xue, H., Song, S., Huang, G.: Budgeted training for vision transformer. In: ICLR (2023)
Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: Vision transformer with deformable attention. In: IEEE CVPR (2022)
Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: DAT++: spatially dynamic vision transformer with deformable attention. arXiv preprint arXiv:2309.01430 (2023)
Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., Singh, V.: Nyströmformer: a Nyström-based algorithm for approximating self-attention. In: AAAI (2021)
Xue, S., Yi, M., Luo, W., Zhang, S., Sun, J., Li, Z., Ma, Z.M.: SA-Solver: stochastic Adams solver for fast sampling of diffusion models. In: NeurIPS (2023)
Yang, Q., Wang, S., Lin, M.G., Song, S., Huang, G.: Boosting offline reinforcement learning with action preference query. In: ICML (2023)
Yang, Q., Wang, S., Zhang, Q., Huang, G., Song, S.: Hundreds guide millions: adaptive offline reinforcement learning with expert guidance. In: IEEE TNNLS (2023)
Yang, X., Shih, S.M., Fu, Y., Zhao, X., Ji, S.: Your ViT is secretly a hybrid discriminative-generative diffusion model. arXiv:2208.07791 (2022)
You, H., et al.: Castling-ViT: compressing self-attention via switching towards linear-angular attention during vision transformer inference. In: IEEE CVPR (2023)
Zhang, T., Huang, H.Y., Feng, C., Cao, L.: Enlivening redundant heads in multi-head self-attention for machine translation. In: EMNLP (2021)
Zheng, H., Nie, W., Vahdat, A., Anandkumar, A.: Fast training of diffusion models with masked transformers. In: TMLR (2024)
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grants 62321005 and 42327901.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pu, Y. et al. (2025). Efficient Diffusion Transformer with Step-Wise Dynamic Attention Mediators. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_24
DOI: https://doi.org/10.1007/978-3-031-72633-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72632-3
Online ISBN: 978-3-031-72633-0
eBook Packages: Computer Science, Computer Science (R0)