Towards Multi-modal Transformers in Federated Learning

Sun, Guangyu; Mendieta, Matias; Dutta, Aritra; Li, Xin; Chen, Chen

doi:10.1007/978-3-031-72633-0_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15073)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Multi-modal transformers have driven significant progress across many domains, but privacy concerns around high-quality data hinder their further improvement. Federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite this potential, FL with unpaired uni-modal clients and transformer architectures remains a largely unexplored research direction. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, in which clients hold data of different modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is used and introduce a novel framework, Federated modality complementary and collaboration (FedCola), which addresses the in-modality and cross-modality gaps among clients. In extensive experiments across various FL settings, FedCola outperforms previous approaches, offering new perspectives on the future federated training of multi-modal transformers. Code is available at https://github.com/imguangyu/FedCola.
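The abstract does not spell out FedCola's aggregation rule, so the following is only a hypothetical sketch of the setting it describes: uni-modal clients (image-only or text-only) that share transformer blocks. As an assumption, parameters whose names start with a hypothetical `blocks.` prefix are averaged across all clients (FedAvg-style, weighted by local dataset size), while the remaining modality-specific parameters (e.g. embeddings) are averaged only among clients of the same modality. The names `aggregate_round`, `weighted_average`, and the parameter keys are illustrative inventions, not the authors' code; see the linked repository for the actual method.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical representation of a model: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def weighted_average(states: List[Params], weights: List[float], keys: Iterable[str]) -> Params:
    """Element-wise weighted mean of the selected parameters across clients."""
    total = float(sum(weights))
    averaged: Params = {}
    for key in keys:
        dim = len(states[0][key])
        averaged[key] = [
            sum(w * s[key][i] for s, w in zip(states, weights)) / total
            for i in range(dim)
        ]
    return averaged


def aggregate_round(
    client_states: List[Params],
    client_modalities: List[str],
    client_sizes: List[int],
    shared_prefix: str = "blocks.",  # hypothetical naming convention for shared blocks
):
    """One hypothetical server aggregation step for uni-modal clients (not FedCola).

    Parameters whose names start with `shared_prefix` are averaged across all
    clients (cross-modality); every other parameter is averaged only among
    clients of the same modality (in-modality).
    """
    shared_keys = [k for k in client_states[0] if k.startswith(shared_prefix)]
    global_shared = weighted_average(client_states, client_sizes, shared_keys)

    # Group client indices by modality, preserving first-seen order.
    groups = defaultdict(list)
    for idx, modality in enumerate(client_modalities):
        groups[modality].append(idx)

    per_modality: Dict[str, Params] = {}
    for modality, idxs in groups.items():
        states = [client_states[i] for i in idxs]
        sizes = [client_sizes[i] for i in idxs]
        private_keys = [k for k in states[0] if not k.startswith(shared_prefix)]
        per_modality[modality] = weighted_average(states, sizes, private_keys)
    return global_shared, per_modality


# Toy round: two image-only clients and one text-only client with twice the data.
clients = [
    {"blocks.0.w": [1.0, 2.0], "embed.image": [0.0]},
    {"blocks.0.w": [3.0, 4.0], "embed.image": [2.0]},
    {"blocks.0.w": [5.0, 6.0], "embed.text": [1.0]},
]
shared, per_mod = aggregate_round(clients, ["image", "image", "text"], [1, 1, 2])
print(shared)   # {'blocks.0.w': [3.5, 4.5]}
print(per_mod)  # {'image': {'embed.image': [1.0]}, 'text': {'embed.text': [1.0]}}
```

Keeping modality-specific parameters out of the cross-client average is one simple way to separate in-modality from cross-modality aggregation; the paper's own strategy may differ.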

Acknowledgement

This work is partially supported by the NSF/Intel Partnership on MLWiNS under Grant No. 2003198.

Author information

Authors and Affiliations

Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA
Guangyu Sun, Matias Mendieta & Chen Chen
Department of Mathematics, University of Central Florida, Orlando, FL, USA
Aritra Dutta & Xin Li

Corresponding author

Correspondence to Guangyu Sun.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Electronic supplementary material

Supplementary material 1 (PDF, 1169 KB)

About this paper

Cite this paper

Sun, G., Mendieta, M., Dutta, A., Li, X., Chen, C. (2025). Towards Multi-modal Transformers in Federated Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_13

DOI: https://doi.org/10.1007/978-3-031-72633-0_13
Published: 22 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72632-3
Online ISBN: 978-3-031-72633-0
eBook Packages: Computer Science, Computer Science (R0)