Revisiting the old to learn the new
- Fast Multi-GPU collectives with NCCL
- NCCL: Collective Operations
- Collective communication: theory, practice, and experience
- A Generalization of the Allreduce Operation
- TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce
- Recent Improvements of MPI Communication for DDLS
- Optimization of Collective Communication Operations in MPICH
- Sparse allreduce: Efficient scalable communication for power-law data
Fast Multi-GPU collectives with NCCL
"Fast Multi-GPU collectives with NCCL | NVIDIA Technical Blog." NVIDIA Technical Blog, 21 Aug. 2022, developer.nvidia.com/blog/fast-multi-gpu-collectives-nccl.
NCCL: Collective Operations
"Collective Operations — NCCL 2.18.1 documentation." 6 May. 2023, docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html.
- AllReduce
- Broadcast
- Reduce
- AllGather
- ReduceScatter
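
All five collectives share the same calling pattern in the NCCL C API. As a concrete reference point, here is a minimal sketch of an AllReduce driven from a single process that owns two GPUs, mirroring the single-process example style of the NCCL user guide; the device count, element count, and omitted error checking are simplifying assumptions for illustration.

```c
// Minimal single-process AllReduce across 2 GPUs via NCCL.
// Assumptions: 2 visible CUDA devices, float buffers, no error checking.
#include <cuda_runtime.h>
#include <nccl.h>

#define NDEV  2
#define COUNT 1024

int main(void) {
  int devs[NDEV] = {0, 1};
  ncclComm_t comms[NDEV];
  cudaStream_t streams[NDEV];
  float *sendbuf[NDEV], *recvbuf[NDEV];

  // One communicator per device, all within this single process.
  ncclCommInitAll(comms, NDEV, devs);

  for (int i = 0; i < NDEV; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void **)&sendbuf[i], COUNT * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], COUNT * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-device calls so NCCL launches them as one collective.
  ncclGroupStart();
  for (int i = 0; i < NDEV; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], COUNT, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  // The collective is asynchronous; wait on each stream for completion.
  for (int i = 0; i < NDEV; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < NDEV; ++i) {
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

Swapping ncclAllReduce for ncclBroadcast, ncclReduce, ncclAllGather, or ncclReduceScatter follows the same grouped-call pattern, with argument lists as described in the documentation above.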
Collective communication: theory, practice, and experience
Chan, Ernie, et al. "Collective communication: theory, practice, and experience." Concurrency and Computation: Practice and Experience 19.13 (2007): 1749-1783.
https://www.cs.utexas.edu/~pingali/CSE392/2011sp/lectures/Conc_Comp.pdf
- Figure 5. Minimum-spanning tree algorithm for reduce
- Figure 9. Recursive-doubling algorithm for reduce-scatter
- Figure 14. Bidirectional exchange algorithm for allreduce
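
To make the bidirectional-exchange idea of Figure 14 concrete, the sketch below implements a recursive-doubling allreduce in C with MPI point-to-point calls: in each of the log2(p) steps, a rank swaps its full vector with the partner whose rank differs in one bit, then reduces locally. It assumes a power-of-two communicator size and sum reduction; it illustrates the algorithm family, and is not code from the paper.

```c
// Recursive-doubling (bidirectional exchange) allreduce sketch.
// Assumptions: power-of-two number of ranks, float data, sum reduction.
#include <mpi.h>
#include <stdlib.h>

static void allreduce_recursive_doubling(float *data, int count, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  float *tmp = (float *)malloc(count * sizeof(float));

  // log2(size) rounds; after round k, each rank holds the reduction
  // over its 2^(k+1)-rank subgroup.
  for (int mask = 1; mask < size; mask <<= 1) {
    int partner = rank ^ mask;  // partner differs from us in one bit
    MPI_Sendrecv(data, count, MPI_FLOAT, partner, 0,
                 tmp,  count, MPI_FLOAT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    for (int i = 0; i < count; ++i)
      data[i] += tmp[i];        // fold the partner's partial sums into ours
  }
  free(tmp);
}
```

Each step moves the entire vector, so this variant is latency-friendly but not bandwidth-optimal; composing a reduce-scatter (as in Figure 9) with an allgather lowers the bandwidth cost for long vectors, which is the trade-off the paper analyzes.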
A Generalization of the Allreduce Operation
Kolmakov, Dmitry, and Xuecang Zhang. "A generalization of the allreduce operation." arXiv preprint arXiv:2004.09362 (2020).
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
Shah, Aashaka, et al. "TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches." 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 2023.
A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce
Wang, Guozheng, et al. "A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce." Data Science and Engineering (2023): 1-12.
Recent Improvements of MPI Communication for DDLS
Kim, Hyejin. "Recent Improvements of MPI Communication for DDLS." Medium, 6 Jan. 2022, hk3342.medium.com/recent-improvements-of-mpi-communication-74e3c4a1ccb4.
Optimization of Collective Communication Operations in MPICH
Thakur, Rajeev, Rolf Rabenseifner, and William Gropp. "Optimization of collective communication operations in MPICH." The International Journal of High Performance Computing Applications 19.1 (2005): 49-66.
Sparse allreduce: Efficient scalable communication for power-law data
Zhao, Huasha, and John Canny. "Sparse allreduce: Efficient scalable communication for power-law data." arXiv preprint arXiv:1312.3020 (2013).