An efficient cloud-based elastic RDMA protocol for HPC applications

  • Regular Paper
CCF Transactions on High Performance Computing

Abstract

High-performance computing (HPC) networking is of great importance in scaling many HPC applications across multiple nodes. Most HPC applications deployed on traditional supercomputers or clusters adopt RDMA protocols such as InfiniBand for inter-node networking to reduce the latency of frequent communication. As cloud-based HPC continues to emerge as a significant trend, utilizing RDMA in the cloud has become a challenging problem. To address this problem, we propose an efficient elastic RDMA protocol (eRDMA) that brings RDMA's merits to HPC applications in the cloud. eRDMA applies the direct data movement (DDM) of the cloud infrastructure processing unit (CIPU), the overlay of the virtual private cloud (VPC), and compatibility with RDMA verbs to fully utilize elastic cloud resources while retaining the features of an RDMA network for HPC scenarios. The effectiveness of eRDMA is demonstrated by experimental results across different platforms for many HPC and general TCP applications.

Data availability

The data that support the findings of this study are not openly available due to reasons of sensitivity and are available from the corresponding author upon reasonable request. Data are located in controlled access data storage at Alibaba Cloud Intelligence Group.

Acknowledgements

We would like to thank Cheng Xu, Yunqi Han, Kai Shen, Jinhu Li, and Xiangzheng Sun for their help with illustrating eRDMA, CIPU, and NetACC. We would also like to thank the THPC reviewers for their helpful comments.

Funding

No funding was received for conducting this study.

Author information

Corresponding author

Correspondence to Hang Cao.

Ethics declarations

Conflict of interest

The authors are employees of Alibaba Group and received salaries from Alibaba Group. The research conducted in this paper is related to the products and services of Alibaba Group.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cao, H., Xu, C., Han, Y. et al. An efficient cloud-based elastic RDMA protocol for HPC applications. CCF Trans. HPC 6, 45–53 (2024). https://doi.org/10.1007/s42514-023-00170-y

  • DOI: https://doi.org/10.1007/s42514-023-00170-y
