Data-Driven MoE: A Data-Driven Approach to Construct MoE by a Single LLM

  • Conference paper
Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14878)


Abstract

With the emergence of the Mixtral-8x7B model, Mixture of Experts (MoE) models have attracted attention for their ability to outperform Large Language Models (LLMs) with much larger parameter counts while consuming fewer resources. Specifically, constructing an MoE model involves adding gate layers and replacing the feed-forward network (FFN) layers of an LLM with sparse MoE layers. Existing methods for constructing MoE models primarily reuse well-trained neural network layers and initialize the gate layers randomly. However, this process still demands a considerable amount of training, and additional training techniques must be applied to ensure that each expert is sufficiently trained. To address the high training cost and low fault tolerance of existing MoE methods, we propose a simple, efficient, and effective MoE construction method: the data-driven MoE construction method. In this method, rather than relying on random initialization, the gate layers are initialized from supervised fine-tuning data that covers the scenarios in which the MoE model will be used. The sparse MoE layers are then constructed by replicating the single LLM's FFN layers with added noise. The proposed method offers a straightforward, convenient, and efficient way to construct MoE models. Experimental results show that the proposed data-driven MoE model surpasses existing approaches on both domain-specific and general tasks, with maximum Rouge improvements of 7.26, 9.96, 8.99, and 5.25 in the medical, math, law, and general domains, respectively.
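
To make the two construction steps in the abstract concrete, the following PyTorch sketch shows one plausible reading of the method: experts are created by copying a single pretrained FFN and perturbing it with small noise, and the gate is initialized from hidden representations of supervised fine-tuning data rather than at random. The names (DataDrivenMoELayer, build_gate_from_sft, noise_scale) and the specific choice of mean SFT hidden states as gate rows are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the data-driven MoE construction described above.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DataDrivenMoELayer(nn.Module):
    """Sparse MoE layer built from a single pretrained dense FFN."""

    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 num_experts: int, top_k: int = 2, noise_scale: float = 1e-3):
        super().__init__()
        # Experts: copies of the pretrained FFN, each perturbed with small
        # Gaussian noise so they can diverge during later fine-tuning.
        self.experts = nn.ModuleList()
        for _ in range(num_experts):
            expert = copy.deepcopy(dense_ffn)
            with torch.no_grad():
                for p in expert.parameters():
                    p.add_(noise_scale * torch.randn_like(p))
            self.experts.append(expert)
        # Gate: a linear router; its weights are set from SFT data via
        # build_gate_from_sft instead of random initialization.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size); route each token to its top-k experts.
        scores = F.softmax(self.gate(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, k] == e
                if mask.any():
                    out[mask] += topk_scores[mask, k:k + 1] * expert(x[mask])
        return out


def build_gate_from_sft(layer: DataDrivenMoELayer,
                        domain_hidden_states: list) -> None:
    """Data-driven gate initialization: one gate weight row per expert, taken
    as the mean hidden state of SFT examples from that expert's target domain
    (an assumed reading of the paper's procedure)."""
    with torch.no_grad():
        rows = [h.mean(dim=0) for h in domain_hidden_states]
        layer.gate.weight.copy_(torch.stack(rows))
```

In this sketch, `dense_ffn` would be the MLP sub-module of one transformer block of the source LLM, and `domain_hidden_states` would hold the hidden states obtained by running a few SFT examples from each target domain (e.g., medical, math, law, general) through the frozen LLM up to that layer.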

Author information

Corresponding author

Correspondence to Yong Song.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Teng, Z., Yan, Z., Song, Y., Ye, X., Ouyang, Y. (2024). Data-Driven MoE: A Data-Driven Approach to Construct MoE by a Single LLM. In: Huang, DS., Si, Z., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol. 14878. Springer, Singapore. https://doi.org/10.1007/978-981-97-5672-8_30

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-5672-8_30

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5671-1

  • Online ISBN: 978-981-97-5672-8

  • eBook Packages: Computer Science, Computer Science (R0)
