Abstract
With the emergence of the Mixtral-8x7B model, Mixture of Experts (MoE) models have attracted attention for their ability to outperform Large Language Models (LLMs) with far larger parameter counts while consuming fewer resources. Specifically, constructing an MoE model involves adding gate layers and replacing the feed-forward network (FFN) layers of an LLM with sparse MoE layers. Existing methods for constructing MoE models primarily reuse well-trained neural network layers and initialize the gate layers randomly. However, this process still demands a considerable amount of training, and additional training techniques must be applied to ensure that each expert is sufficiently trained. To address the high training cost and low fault tolerance of existing MoE methods, we propose a simple, efficient, and effective MoE construction method: the data-driven MoE construction method. Rather than relying on random initialization, the gate layers are initialized with supervised fine-tuning data that covers the scenarios in which the MoE model will be used, and the sparse MoE layers are constructed by replicating the LLM's FFN layers with added noise. The proposed method offers a straightforward, convenient, and efficient approach to constructing MoE models. Experimental results show that the proposed data-driven MoE model surpasses existing approaches on both domain-specific and general tasks, with maximum ROUGE improvements of 7.26, 9.96, 8.99, and 5.25 in the medical, math, law, and general domains, respectively.
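As a rough illustration of the construction described in the abstract, the following PyTorch-style sketch replicates a trained FFN into several noisy expert copies and initializes a gate from mean hidden representations of domain-specific fine-tuning data. This is a minimal sketch under stated assumptions, not the paper's implementation; the function names (make_noisy_experts, init_gate_from_data), the Gaussian noise scheme, and the use of mean hidden states for gate initialization are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn


def make_noisy_experts(ffn: nn.Module, num_experts: int, noise_std: float = 0.01) -> nn.ModuleList:
    """Replicate a trained FFN layer into several experts, adding small Gaussian
    noise to each copy so the experts can diverge during further training.
    (Illustrative sketch; the paper's exact noise scheme may differ.)"""
    experts = nn.ModuleList()
    for _ in range(num_experts):
        expert = copy.deepcopy(ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(noise_std * torch.randn_like(p))
        experts.append(expert)
    return experts


def init_gate_from_data(hidden_size: int, domain_hiddens: list) -> nn.Linear:
    """Initialize the gate from data rather than randomly: each expert's routing
    vector is set to the mean hidden state of one domain's supervised fine-tuning
    data. `domain_hiddens[i]` is assumed to be a (num_tokens, hidden_size) tensor
    of hidden states collected for domain i."""
    num_experts = len(domain_hiddens)
    gate = nn.Linear(hidden_size, num_experts, bias=False)
    with torch.no_grad():
        for i, h in enumerate(domain_hiddens):
            gate.weight[i] = h.mean(dim=0)
    return gate
```

At inference time, the gate's scores over these experts would select which noisy FFN copies process each token, as in a standard sparse MoE layer.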
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Teng, Z., Yan, Z., Song, Y., Ye, X., Ouyang, Y. (2024). Data-Driven MoE: A Data-Driven Approach to Construct MoE by a Single LLM. In: Huang, D.S., Si, Z., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14878. Springer, Singapore. https://doi.org/10.1007/978-981-97-5672-8_30
DOI: https://doi.org/10.1007/978-981-97-5672-8_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5671-1
Online ISBN: 978-981-97-5672-8