A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning

  • Conference paper
  • First Online:
Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2023)

Abstract

With the development of machine learning technology in various fields, such as medical care and smart manufacturing, the volume of data has grown explosively. Training a deep learning model for different application domains on large-scale data with the limited resources of a single device is challenging. Distributed machine learning, in which a parameter server and multiple clients train a model collaboratively, is an effective way to address this problem. However, it requires substantial communication between devices whose communication resources are limited. The stale synchronous parallel method is a mainstream approach to reducing this communication cost, but it often suffers from high synchronization delay and low computing efficiency because the staleness threshold is set by the user based on experience and is frequently inappropriate. This paper proposes a synchronous parallel method with parameters communication prediction for distributed machine learning. It predicts the optimal timing for synchronization, which avoids the long synchronization waits caused by inappropriate threshold settings in the stale synchronous parallel method. Moreover, it allows fast nodes to continue local training while global synchronization is performed, which improves the resource utilization of worker nodes. Experimental results show that, compared with the stale synchronous parallel method, our method significantly improves training time, model quality, and resource usage.
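To make the synchronization problem concrete, the sketch below simulates the stale synchronous parallel (SSP) bound that the paper takes as its baseline: a fast worker may run ahead of the slowest worker by at most a fixed staleness threshold. This is a minimal illustrative sketch, not the authors' implementation; the worker loop, the STALENESS constant, and the comment marking where a prediction-based synchronization point would replace the fixed threshold are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of the stale synchronous parallel (SSP)
# bound the paper starts from. Names such as STALENESS, ssp_wait and worker are
# illustrative assumptions.
import random
import threading
import time

STALENESS = 3      # fixed staleness threshold; the paper argues this is hard to tune
NUM_WORKERS = 4
NUM_CLOCKS = 20    # local iterations ("clocks") each worker performs

clocks = [0] * NUM_WORKERS          # per-worker iteration counters
cond = threading.Condition()

def ssp_wait(worker_id: int) -> None:
    """Block a fast worker until it is at most STALENESS clocks ahead of the
    slowest worker -- the classic SSP consistency bound."""
    with cond:
        while clocks[worker_id] - min(clocks) > STALENESS:
            cond.wait()

def worker(worker_id: int) -> None:
    for _ in range(NUM_CLOCKS):
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for one local SGD step
        with cond:
            clocks[worker_id] += 1
            cond.notify_all()                    # let waiting workers recheck the bound
        # A prediction-based variant would replace this fixed bound with a
        # predicted synchronization clock and let fast workers keep training
        # locally while global synchronization is in progress.
        ssp_wait(worker_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final clocks per worker:", clocks)
```

In the method proposed here, the fixed bound enforced by ssp_wait would instead be a synchronization step predicted from parameter-communication behaviour, and fast nodes would continue local training while global synchronization proceeds.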

Acknowledgment

I would like to express my gratitude to all those who helped me during the writing of this work. This work is supported by the Key Technology Research and Development Program of China under Grant No. 2022YFB2901200.

Author information

Corresponding author

Correspondence to Meiting Xue.

Copyright information

© 2024 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Zeng, Y. et al. (2024). A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning. In: Gao, H., Wang, X., Voros, N. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 563. Springer, Cham. https://doi.org/10.1007/978-3-031-54531-3_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-54531-3_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-54530-6

  • Online ISBN: 978-3-031-54531-3

  • eBook Packages: Computer Science, Computer Science (R0)
