Abstract
Multi-modal transformers have driven significant progress across domains, but privacy concerns around high-quality data hinder their further improvement. Federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite this potential, FL with unpaired uni-modal clients and transformer architectures remains unexplored. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is used and introduce a novel framework, Federated modality complementary and collaboration (FedCola), which addresses the in-modality and cross-modality gaps among clients. In extensive experiments across various FL settings, FedCola outperforms previous approaches, offering new perspectives on future federated training of multi-modal transformers. Code is available at https://github.com/imguangyu/FedCola.
Acknowledgement
This work is partially supported by the NSF/Intel Partnership on MLWiNS under Grant No. 2003198.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, G., Mendieta, M., Dutta, A., Li, X., Chen, C. (2025). Towards Multi-modal Transformers in Federated Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15073. Springer, Cham. https://doi.org/10.1007/978-3-031-72633-0_13
DOI: https://doi.org/10.1007/978-3-031-72633-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72632-3
Online ISBN: 978-3-031-72633-0
eBook Packages: Computer Science (R0)