
Towards Multi-modal Transformers in Federated Learning

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Multi-modal transformers mark significant progress in different domains, but privacy concerns over high-quality data hinder their further improvement. Federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, a considerable research direction regarding unpaired uni-modal clients and the transformer architecture in FL remains unexplored. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is utilized and introduce a novel framework, Federated modality complementary and collaboration (FedCola), which addresses the in-modality and cross-modality gaps among clients. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous approaches, offering new perspectives on future federated training of multi-modal transformers. Code is available at https://github.com/imguangyu/FedCola.
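To make the scenario concrete, below is a minimal, hypothetical PyTorch sketch of the general transfer MFL setup the abstract describes: an image-only client and a text-only client each train a uni-modal transformer locally, and the server aggregates only the shared transformer blocks in a plain FedAvg manner. All names here (e.g. UniModalClientModel, aggregate_shared) and the simple parameter-averaging rule are illustrative assumptions for exposition; this is not the paper's FedCola method, which additionally addresses the in-modality and cross-modality gaps among clients (see the linked repository for the actual implementation).

```python
# Hypothetical sketch, NOT the official FedCola implementation: one way to set up
# transfer multi-modal FL, where uni-modal clients share transformer blocks and
# the server averages only those shared parameters (plain FedAvg on the backbone).
import copy
import torch
import torch.nn as nn

EMBED_DIM = 64

def make_shared_blocks() -> nn.Module:
    # Transformer encoder blocks shared by every client, regardless of modality.
    layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

class UniModalClientModel(nn.Module):
    """Modality-specific embedder + shared transformer blocks + local task head."""
    def __init__(self, embedder: nn.Module, num_classes: int):
        super().__init__()
        self.embedder = embedder                       # kept local (image or text)
        self.shared = make_shared_blocks()             # federated across clients
        self.head = nn.Linear(EMBED_DIM, num_classes)  # kept local (task-specific)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.embedder(x)                      # (batch, seq, EMBED_DIM)
        pooled = self.shared(tokens).mean(dim=1)       # mean-pool token features
        return self.head(pooled)

def local_update(model: nn.Module, inputs, labels, lr=1e-3, steps=1):
    # A few local SGD steps on the client's own uni-modal data.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(inputs), labels).backward()
        opt.step()

def aggregate_shared(global_shared: nn.Module, clients, weights):
    # FedAvg restricted to the shared transformer blocks.
    avg_state = copy.deepcopy(global_shared.state_dict())
    for key in avg_state:
        avg_state[key] = sum(w * c.shared.state_dict()[key] for c, w in zip(clients, weights))
    global_shared.load_state_dict(avg_state)

if __name__ == "__main__":
    # One image-only client (toy patch embedder) and one text-only client.
    image_client = UniModalClientModel(
        nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(16, EMBED_DIM)), num_classes=10)
    text_client = UniModalClientModel(nn.Embedding(1000, EMBED_DIM), num_classes=4)

    global_shared = make_shared_blocks()
    for round_idx in range(3):                         # communication rounds
        for client in (image_client, text_client):     # broadcast shared weights
            client.shared.load_state_dict(global_shared.state_dict())
        local_update(image_client, torch.randn(8, 5, 4, 4), torch.randint(0, 10, (8,)))
        local_update(text_client, torch.randint(0, 1000, (8, 12)), torch.randint(0, 4, (8,)))
        aggregate_shared(global_shared, [image_client, text_client], weights=[0.5, 0.5])
        print(f"round {round_idx}: aggregated shared transformer blocks")
```

In this toy setup only the backbone is communicated; the modality-specific embedders and task heads stay on each client, which is one common way to accommodate unpaired uni-modal clients. How the shared parameters should actually be complemented and collaborated across modalities is the subject of the paper itself.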



Acknowledgement

This work is partially supported by the NSF/Intel Partnership on MLWiNS under Grant No. 2003198.

Author information

Corresponding author

Correspondence to Guangyu Sun.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1169 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sun, G., Mendieta, M., Dutta, A., Li, X., Chen, C. (2025). Towards Multi-modal Transformers in Federated Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_13


  • DOI: https://doi.org/10.1007/978-3-031-72633-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72632-3

  • Online ISBN: 978-3-031-72633-0

  • eBook Packages: Computer Science, Computer Science (R0)
