Abstract
High-performance computing (HPC) networking is critical to scaling HPC applications across multiple nodes. Most HPC applications deployed on traditional supercomputers or clusters adopt RDMA-capable interconnects such as InfiniBand for inter-node networking to reduce the latency of frequent communication. As cloud-based HPC continues to emerge as a significant trend, utilizing RDMA in the cloud has become a challenging problem. To address this problem, we propose an efficient elastic RDMA protocol (eRDMA) that brings the merits of RDMA to HPC applications in the cloud. eRDMA applies the direct data movement (DDM) of the cloud infrastructure processing unit (CIPU), the overlay of the virtual private cloud (VPC), and compatibility with RDMA verbs to combine the elastic resources of the cloud with the features of an RDMA network for HPC scenarios. The effectiveness of eRDMA is demonstrated by experimental results across different platforms for HPC and general TCP applications.
Data availability
The data that support the findings of this study are not openly available due to reasons of sensitivity and are available from the corresponding author upon reasonable request. Data are located in controlled access data storage at Alibaba Cloud Intelligence Group.
Acknowledgements
We would like to thank Cheng Xu, Yunqi Han, Kai Shen, Jinhu Li, and Xiangzheng Sun for their help in illustrating eRDMA, CIPU, and NetACC. We would also like to thank the THPC reviewers for their helpful comments.
Funding
No funding was received for conducting this study.
Ethics declarations
Conflict of interest
The authors are employees of Alibaba Group and received salaries from Alibaba Group. The research conducted in this paper is related to the products and services of Alibaba Group.
About this article
Cite this article
Cao, H., Xu, C., Han, Y. et al. An efficient cloud-based elastic RDMA protocol for HPC applications. CCF Trans. HPC 6, 45–53 (2024). https://doi.org/10.1007/s42514-023-00170-y
DOI: https://doi.org/10.1007/s42514-023-00170-y