Efficient shared memory and RDMA based collectives on multi-rail QsNetII SMP clusters

  • Published in: Cluster Computing

Abstract

Clusters of Symmetric Multiprocessors (SMPs) are more commonplace than ever in high-performance computing. Scientific applications running on clusters make extensive use of collective communications. Shared-memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches to meeting the increasing demand for intra-node and inter-node communication, and thereby to boosting the performance of collectives on emerging multi-core SMP clusters. In this regard, this paper designs and evaluates two classes of collective communication algorithms directly at the Elan user level over multi-rail Quadrics QsNetII with message striping: (1) RDMA-based traditional multi-port algorithms for the gather, all-gather, and all-to-all collectives for medium to large messages, and (2) RDMA-based and SMP-aware multi-port all-gather algorithms for small to medium size messages.
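The message striping mentioned above can be illustrated with a toy sketch: a message is split into near-equal contiguous chunks, one per rail, so the chunks can be pushed over the rails concurrently. The `stripe` helper below is purely illustrative and not part of the Elan API; the paper's implementation operates on registered RDMA buffers at the Elan user level, not on Python byte strings.

```python
def stripe(buf, num_rails):
    """Split a message into near-equal contiguous chunks, one per rail.

    Illustrative only: a hypothetical helper, not an Elan library call.
    """
    base, rem = divmod(len(buf), num_rails)
    chunks, off = [], 0
    for r in range(num_rails):
        size = base + (1 if r < rem else 0)  # first `rem` rails carry one extra byte
        chunks.append(buf[off:off + size])
        off += size
    return chunks

# Each chunk would then be posted as an RDMA write on its own rail,
# so the rails transfer their chunks in parallel.
```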

The multi-port RDMA-based Direct algorithms for the gather and all-to-all collectives achieve improvement factors of up to 2.15 over elan_gather() for 4 KB messages and up to 2.26 over elan_alltoall() for 2 KB messages, respectively. For the all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms, including elan_gather(), for 512 B to 8 KB messages, with an improvement factor of 1.96 for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB messages, outperforming elan_gather() by a factor of 1.49 for 32 KB messages. Experimentation with real applications has shown that a communication speedup of up to 1.47 can be achieved using the proposed all-gather algorithms.
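The Bruck all-gather pattern underlying the SMP-aware algorithm can be sketched as a single-port simulation. In round k = 1, 2, 4, …, each rank i sends its current buffer to rank (i - k) mod p and appends the buffer received from rank (i + k) mod p, completing in ⌈log2 p⌉ rounds followed by a local rotation. The sketch below only shows this communication schedule; the paper's variant is multi-port and RDMA-based over multiple rails, which this toy version does not model.

```python
def bruck_allgather(blocks):
    """Simulate a single-port Bruck all-gather over p ranks.

    blocks[i] is rank i's initial block; returns each rank's final,
    correctly ordered buffer. Illustrative only -- the paper's algorithm
    transfers several blocks per step over RDMA, not over Python lists.
    """
    p = len(blocks)
    bufs = [[b] for b in blocks]   # each rank starts with only its own block
    k = 1
    while k < p:                   # ceil(log2 p) rounds, doubling the distance
        send = min(k, p - k)       # partial send in the last round if p is not a power of two
        # All ranks exchange simultaneously: rank i appends the first `send`
        # blocks of rank (i + k) mod p's buffer from the start of this round.
        bufs = [bufs[i] + bufs[(i + k) % p][:send] for i in range(p)]
        k *= 2
    # Rank i now holds [b_i, b_{i+1}, ..., b_{i+p-1}] (indices mod p);
    # a local rotation restores the canonical order [b_0, ..., b_{p-1}].
    return [[bufs[i][(j - i) % p] for j in range(p)] for i in range(p)]
```

Note that the round buffers are rebuilt in one list comprehension, so every rank reads its peer's buffer as it stood at the start of the round, mirroring the pairwise exchanges of the real algorithm.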



Author information

Correspondence to Ahmad Afsahi.

Cite this article

Qian, Y., Afsahi, A. Efficient shared memory and RDMA based collectives on multi-rail QsNetII SMP clusters. Cluster Comput 11, 341–354 (2008). https://doi.org/10.1007/s10586-008-0065-8
