DCU-CHK: checkpointing for large-scale CPU-DCU heterogeneous computing systems

Jia, Jie; Lin, Xinyuan; Lin, Fang; Liu, Yi

doi:10.1007/s42514-023-00178-4

Regular Paper
Published: 07 January 2024

Volume 6, pages 519–532, (2024)
CCF Transactions on High Performance Computing

Jie Jia^1,2,
Xinyuan Lin^1,2,
Fang Lin^1,2 &
…
Yi Liu^1,2

264 Accesses
2 Citations

Abstract

By exploiting the superior computing power of accelerators, heterogeneous architectures have become increasingly popular in high-performance computing (HPC) systems. Meanwhile, the scale of HPC systems continues to grow, which poses challenges to resilience. The Hygon DCU, a domestically developed accelerator, has been used in a growing number of Chinese-made supercomputers. Therefore, it is crucial to provide checkpointing support for the CPU-DCU platform. This paper proposes DCU-CHK, a novel checkpointing scheme for large-scale CPU-DCU heterogeneous computing systems. The scheme provides transparent checkpointing for HIP applications and employs an address translation mechanism to ensure robust restarts. The scheme is implemented on top of DMTCP. Experimental results demonstrate the effectiveness and scalability of the scheme.
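
The abstract does not detail the address translation mechanism, so the following is only a minimal sketch of the general idea, not DCU-CHK's implementation. It assumes the application is handed stable virtual addresses while a table maps them to real device addresses, which may differ after a restart. The class `AddressTranslator`, its methods, and the `allocate` callback are hypothetical names introduced for illustration; no HIP or DMTCP calls are used.

```python
class AddressTranslator:
    """Hypothetical table mapping stable virtual addresses to real device addresses."""

    def __init__(self, base=0x7000_0000_0000):
        self.base = base      # start of the virtual range (assumed value)
        self.next_offset = 0  # next free offset in the virtual range
        self.table = {}       # virtual start -> (real start, size)

    def register(self, real_addr, size):
        """Record a device allocation and return the virtual address given to the app."""
        virt = self.base + self.next_offset
        self.next_offset += size
        self.table[virt] = (real_addr, size)
        return virt

    def translate(self, virt_addr):
        """Convert a virtual address (possibly inside a buffer) to the real address."""
        for start, (real, size) in self.table.items():
            if start <= virt_addr < start + size:
                return real + (virt_addr - start)
        raise KeyError(f"unmapped virtual address {virt_addr:#x}")

    def checkpoint(self):
        """Save only the virtual layout; real addresses are not valid after restart."""
        return [(virt, size) for virt, (_, size) in self.table.items()]

    def restart(self, layout, allocate):
        """Re-allocate every buffer and rebind the saved virtual addresses."""
        self.table = {virt: (allocate(size), size) for virt, size in layout}
        self.next_offset = max(
            (virt + size - self.base for virt, size in layout), default=0
        )


# Before the checkpoint: the (simulated) device returned 0x1000.
before = AddressTranslator()
handle = before.register(real_addr=0x1000, size=256)
saved = before.checkpoint()

# After restart: the (simulated) allocator returns a different address.
after = AddressTranslator()
after.restart(saved, allocate=lambda size: 0x9000)
```

Because the application only ever stores `handle`, a pointer saved in the checkpoint image stays meaningful even though the underlying device buffer now lives at a different real address.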

Data availability

All data analyzed are included in this manuscript.

Acknowledgements

The research presented in this paper is supported by the GHFund A (no. ghfund202107010337).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Jie Jia, Xinyuan Lin, Fang Lin & Yi Liu
Sino-German Joint Software Institute, Beihang University, Beijing, 100191, China
Jie Jia, Xinyuan Lin, Fang Lin & Yi Liu

Corresponding author

Correspondence to Jie Jia.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest, financial or otherwise. On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jia, J., Lin, X., Lin, F. et al. DCU-CHK: checkpointing for large-scale CPU-DCU heterogeneous computing systems. CCF Trans. HPC 6, 519–532 (2024). https://doi.org/10.1007/s42514-023-00178-4

Received: 16 June 2023
Accepted: 24 November 2023
Published: 07 January 2024
Issue Date: October 2024
DOI: https://doi.org/10.1007/s42514-023-00178-4
