Abstract
Clusters of symmetric multiprocessors (SMPs) are more commonplace than ever in high-performance computing, and the scientific applications that run on them make extensive use of collective communications. Shared memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches to meeting the growing demand for intra-node and inter-node communication, and thereby to boosting the performance of collectives on emerging multi-core SMP clusters. This paper designs and evaluates two classes of collective communication algorithms implemented directly at the Elan user level over multi-rail Quadrics QsNetII with message striping: 1) RDMA-based traditional multi-port algorithms for the gather, all-gather, and all-to-all collectives on medium to large messages, and 2) RDMA-based, SMP-aware multi-port all-gather algorithms for small to medium messages.
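To make the striping idea concrete, the following is a minimal, self-contained C sketch of how one message can be striped across the rails of a multi-rail network. The rdma_put() primitive, its signature, and the two-rail configuration are assumptions for illustration only (the put is stubbed with a printf trace); the paper's actual implementation works directly at the Elan user level over QsNetII and is not reproduced here.

#include <stdio.h>
#include <stddef.h>

#define NUM_RAILS 2   /* assumed two rails, as in a dual-rail QsNetII node */

/* Hypothetical stand-in for a per-rail RDMA write to a destination
 * virtual process; stubbed here so the sketch compiles and runs. */
static void rdma_put(int rail, size_t src_off, size_t len, int dest_vp)
{
    printf("rail %d: put %zu bytes at offset %zu to vp %d\n",
           rail, len, src_off, dest_vp);
}

/* Stripe one message over all rails in roughly equal contiguous chunks,
 * so the rails carry their stripes concurrently. */
static void striped_put(size_t len, int dest_vp)
{
    size_t chunk = (len + NUM_RAILS - 1) / NUM_RAILS;   /* ceiling division */
    for (int rail = 0; rail < NUM_RAILS; rail++) {
        size_t off = (size_t)rail * chunk;
        if (off >= len)
            break;                                      /* message exhausted */
        size_t n = (len - off < chunk) ? len - off : chunk;
        rdma_put(rail, off, n, dest_vp);
    }
}

int main(void)
{
    striped_put(64 * 1024 + 5, 3);   /* stripe a 64 KB + 5 B message to vp 3 */
    return 0;
}

With two rails, the second rail receives the (possibly shorter) trailing chunk, so both rails finish at roughly the same time for large messages.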
The multi-port RDMA-based Direct algorithms for the gather and all-to-all collectives achieve improvement factors of up to 2.15 over elan_gather() for 4 KB messages and up to 2.26 over elan_alltoall() for 2 KB messages, respectively. For the all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms, including elan_gather(), for 512 B to 8 KB messages, with an improvement factor of 1.96 for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB messages, outperforming elan_gather() by a factor of 1.49 for 32 KB messages. Experiments with real applications show that a communication speedup of up to 1.47 can be achieved with the proposed all-gather algorithms.
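For reference, the k-port Bruck all-gather named above follows a (k+1)-ary dissemination pattern; the short C sketch below prints that communication schedule. The print_bruck_schedule() helper is hypothetical and illustrative only: when p is not a power of k+1, the final round transfers only partial blocks (not modeled here), and the paper's SMP-aware refinement, in which only one leader process per node takes part in inter-node rounds, is omitted.

#include <stdio.h>

/* In round r, each process sends its accumulated blocks to k peers at
 * distances j*(k+1)^r (j = 1..k) and receives from the mirror peers, so
 * the gathered data grows by a factor of up to k+1 per round and the
 * algorithm needs ceil(log_{k+1} p) rounds. */
static void print_bruck_schedule(int p, int k)
{
    for (int r = 0, stride = 1; stride < p; r++, stride *= (k + 1)) {
        printf("round %d:", r);
        for (int j = 1; j <= k && j * stride < p; j++)
            printf(" [send to (me-%d)%%p, recv from (me+%d)%%p]",
                   j * stride, j * stride);
        printf("\n");
    }
}

int main(void)
{
    print_bruck_schedule(16, 3);   /* 16 processes, 3 ports: two rounds */
    return 0;
}

With p = 16 and k = 3, the sketch prints two rounds, matching the fourfold growth of the gathered data per round (1, 4, then 16 blocks per process).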
Cite this article
Qian, Y., Afsahi, A. Efficient shared memory and RDMA based collectives on multi-rail QsNetII SMP clusters. Cluster Comput 11, 341–354 (2008). https://doi.org/10.1007/s10586-008-0065-8