Abstract
With the spread of machine learning into fields such as medical care and smart manufacturing, data volumes have exploded. Training a deep learning model on large-scale data with the limited resources of a single device is therefore a challenge. Distributed machine learning, in which a parameter server and multiple clients train a model collaboratively, is an effective way to address this problem, but it requires a large amount of communication between devices whose communication resources are limited. The stale synchronous parallel method is a mainstream communication scheme for this setting, yet it often suffers from high synchronization delay and low computing efficiency because the staleness threshold is set by the user based on experience and is frequently inappropriate. This paper proposes a synchronous parallel method with parameters communication prediction for distributed machine learning. It predicts the optimal timing for synchronization, which avoids the long synchronization waiting time caused by inappropriate threshold settings in the stale synchronous parallel method. Moreover, it allows fast nodes to continue local training while global synchronization is in progress, which improves the resource utilization of worker nodes. Experimental results show that, compared with the stale synchronous parallel method, our method significantly improves training time, training quality, and resource usage.
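As a rough illustration of the two synchronization policies mentioned in the abstract, the sketch below contrasts the fixed staleness-bound check used in stale synchronous parallel training with a simple predicted synchronization point that picks the next global update so that communication can overlap with local steps. It is a minimal, self-contained Python sketch under assumed names (ParameterServer, ssp_should_wait, predicted_sync_iteration, the exponential-moving-average timing estimates), not the prediction model or implementation from the paper; gradients and timings are simulated.

```python
# Illustrative sketch only: toy scalar "model", simulated gradients, and
# assumed helper names; this is not the paper's algorithm or code.

import random
import threading
import time


class ParameterServer:
    """Toy parameter server holding a single scalar model parameter."""

    def __init__(self):
        self.value = 0.0
        self.lock = threading.Lock()

    def push_and_pull(self, update):
        # Apply a worker's accumulated update and return the fresh global value.
        with self.lock:
            self.value += update
            return self.value


def ssp_should_wait(my_clock, slowest_clock, staleness):
    """Classic SSP rule: block if this worker is more than `staleness` steps ahead."""
    return my_clock - slowest_clock > staleness


def predicted_sync_iteration(avg_step_time, comm_time, steps_done):
    """Assumed stand-in for a prediction rule: pick the next synchronization step
    so that the expected communication delay spans roughly that many local steps."""
    overlap_steps = max(1, round(comm_time / max(avg_step_time, 1e-6)))
    return steps_done + overlap_steps


def worker_loop(ps, steps=20, seed=0):
    rng = random.Random(seed)
    local_model, pending_update = 0.0, 0.0
    avg_step_time, comm_time = 0.01, 0.03          # running time estimates (seconds)
    next_sync = predicted_sync_iteration(avg_step_time, comm_time, 0)

    for step in range(1, steps + 1):
        t0 = time.perf_counter()
        grad = rng.uniform(-1.0, 1.0)              # simulated gradient
        local_model -= 0.1 * grad
        pending_update += -0.1 * grad
        avg_step_time = 0.9 * avg_step_time + 0.1 * (time.perf_counter() - t0)

        if step >= next_sync:
            t1 = time.perf_counter()
            # In the overlapped scheme this exchange would run in the background
            # while the worker keeps training on its local copy.
            local_model = ps.push_and_pull(pending_update)
            pending_update = 0.0
            comm_time = 0.9 * comm_time + 0.1 * (time.perf_counter() - t1)
            next_sync = predicted_sync_iteration(avg_step_time, comm_time, step)

    return local_model


if __name__ == "__main__":
    # Fixed-threshold SSP check for contrast: 4 steps ahead with staleness 3 blocks.
    print("SSP would block:", ssp_should_wait(my_clock=6, slowest_clock=2, staleness=3))
    server = ParameterServer()
    print("final local model:", worker_loop(server))
```

In the overlapped scheme described in the abstract, the push/pull exchange would run asynchronously so that a fast worker keeps training on its local copy while the server aggregates updates; here it is kept synchronous to keep the sketch short.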
Acknowledgment
We would like to express our gratitude to all those who helped us during the writing of this work. This work is supported by the Key Technology Research and Development Program of China under Grant No. 2022YFB2901200.
Copyright information
© 2024 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Cite this paper
Zeng, Y. et al. (2024). A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning. In: Gao, H., Wang, X., Voros, N. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 563. Springer, Cham. https://doi.org/10.1007/978-3-031-54531-3_21