Abstract
With the emergence of the Mixtral-8x7B model, Mixture of Experts (MoE) models have attracted attention for their ability to outperform Large Language Models (LLMs) with far larger parameter counts while consuming fewer resources. Specifically, constructing an MoE model involves adding gate layers and replacing the feed-forward network (FFN) layers of an LLM with sparse MoE layers. Existing methods for constructing MoE models primarily reuse well-trained neural network layers and initialize the gate layers randomly. However, this process still demands a considerable amount of training, and additional training techniques must be applied to ensure that each expert is sufficiently trained. To address the high training cost and low fault tolerance of existing MoE methods, we propose a simple, efficient, and effective MoE construction method: the data-driven MoE construction method. Rather than relying on random initialization, the gate layers are initialized with supervised fine-tuning data that covers the scenarios in which the MoE model will be used, and the sparse MoE layers are constructed by replicating the LLM's FFN layers with added noise. The proposed method offers a straightforward, convenient, and efficient approach to constructing MoE models. Experimental results show that the proposed data-driven MoE model surpasses existing approaches on both domain-specific and general tasks, with maximum ROUGE improvements of 7.26, 9.96, 8.99, and 5.25 in the medical, math, law, and general domains, respectively.
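As a rough illustration of the construction described in the abstract, the following PyTorch-style sketch replicates a trained FFN into several noisy expert copies and initializes a gate from mean hidden representations of domain-specific fine-tuning data. This is a minimal sketch under stated assumptions, not the paper's implementation; the function names (make_noisy_experts, init_gate_from_data), the Gaussian noise scheme, and the use of mean hidden states for gate initialization are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn


def make_noisy_experts(ffn: nn.Module, num_experts: int, noise_std: float = 0.01) -> nn.ModuleList:
    """Replicate a trained FFN layer into several experts, adding small Gaussian
    noise to each copy so the experts can diverge during further training.
    (Illustrative sketch; the paper's exact noise scheme may differ.)"""
    experts = nn.ModuleList()
    for _ in range(num_experts):
        expert = copy.deepcopy(ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(noise_std * torch.randn_like(p))
        experts.append(expert)
    return experts


def init_gate_from_data(hidden_size: int, domain_hiddens: list) -> nn.Linear:
    """Initialize the gate from data rather than randomly: each expert's routing
    vector is set to the mean hidden state of one domain's supervised fine-tuning
    data. `domain_hiddens[i]` is assumed to be a (num_tokens, hidden_size) tensor
    of hidden states collected for domain i."""
    num_experts = len(domain_hiddens)
    gate = nn.Linear(hidden_size, num_experts, bias=False)
    with torch.no_grad():
        for i, h in enumerate(domain_hiddens):
            gate.weight[i] = h.mean(dim=0)
    return gate
```

At inference time, the gate's scores over these experts would select which noisy FFN copies process each token, as in a standard sparse MoE layer.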
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Teng, Z., Yan, Z., Song, Y., Ye, X., Ouyang, Y. (2024). Data-Driven MoE: A Data-Driven Approach to Construct MoE by a Single LLM. In: Huang, D.S., Si, Z., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14878. Springer, Singapore. https://doi.org/10.1007/978-981-97-5672-8_30
DOI: https://doi.org/10.1007/978-981-97-5672-8_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5671-1
Online ISBN: 978-981-97-5672-8