Abstract
Fine-grain MPI (FG-MPI) extends the execution model of MPI to allow interleaved execution of multiple concurrent MPI processes inside an OS-process. It provides a runtime that is integrated into the MPICH2 middleware and uses lightweight coroutines to implement an MPI-aware scheduler. In this paper we describe the FG-MPI runtime system and discuss the main design issues in its implementation. FG-MPI enables the expression of function-level parallelism, which, together with the runtime scheduler, can be used to simplify MPI programming and achieve performance without adding complexity to the program. As an example, we use FG-MPI to re-structure a typical use of non-blocking communication and show that the integrated scheduler relieves the programmer of scheduling computation and communication inside the application, moving these performance concerns out of the program specification and into the runtime.
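To make the kind of restructuring described above concrete, the sketch below contrasts a hand-scheduled non-blocking pattern with a blocking version intended to run as co-located FG-MPI processes. It is only an illustration of the idea, not the paper's actual example: the function names, message layout, and the two stand-in work routines are assumptions, and the FG-MPI launch options for mapping several MPI processes into one OS-process are omitted.

#include <mpi.h>

/* Stand-in work routines (hypothetical, for illustration only). */
static void work_independent(double *v, int n) { for (int i = 0; i < n; i++) v[i] *= 2.0; }
static void work_dependent(const double *in, double *out, int n) { for (int i = 0; i < n; i++) out[i] += in[i]; }

/* Conventional MPI: the programmer overlaps computation with communication
 * by hand, issuing a non-blocking receive and deciding where to wait. */
static void stage_nonblocking(double *in, double *out, int n, int prev, int next)
{
    MPI_Request req;
    MPI_Irecv(in, n, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req);
    work_independent(out, n);   /* overlap scheduled inside the application */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    work_dependent(in, out, n);
    MPI_Send(out, n, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
}

/* Restructured for FG-MPI: the stage is written with plain blocking calls.
 * When several such MPI processes are co-located in one OS-process, a
 * process blocked in MPI_Recv yields to the MPI-aware scheduler, which runs
 * another co-located process, so the overlap happens in the runtime. */
static void stage_blocking(double *in, double *out, int n, int prev, int next)
{
    MPI_Recv(in, n, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    work_dependent(in, out, n);
    MPI_Send(out, n, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
}

In the second version the independent work would itself live in other co-located processes, so which computation runs while a message is in flight becomes a scheduling decision of the runtime rather than of the program text.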
Notes
We will be using the terms “node” and “machine” interchangeably in this paper to refer to a single computational node with multiple processor cores, operating under a single operating system.
MPI processes sharing the same address space are referred to as co-located processes.
We do not support MPI dynamic process management functionality.
Cite this article
Kamal, H., Wagner, A. An integrated fine-grain runtime system for MPI. Computing 96, 293–309 (2014). https://doi.org/10.1007/s00607-013-0329-x