Abstract
Holistic tuning and optimization of hybrid MPI and OpenMP applications is becoming a focus for parallel code developers as the number of cores and hardware threads in the processing nodes of high-end systems continues to increase. For example, a Cray XE6 node with Interlagos processors supports 32 hardware threads, while the IBM Blue Gene/Q system can support up to 64 threads per node. Note that, by default, OpenMP threads and MPI tasks are pinned to processor cores on these high-end systems, and throughout the paper we assume fixed bindings of threads to physical cores. A number of OpenMP runtimes also support user-specified bindings of threads to physical cores. Parallel and node efficiencies on these systems for hybrid MPI and OpenMP applications largely depend on balancing and overlapping computation and communication workloads. This issue is further intensified when the nodes have a non-uniform memory access (NUMA) model and I/O accelerator devices. In these environments, where access to I/O devices, such as a GPU for code acceleration and a network interface for MPI communication and parallel file I/O, is managed and scheduled by a host CPU, application developers could introduce innovative solutions to overlap CPU and I/O operations and so improve node and parallel efficiencies. For example, in a production-level application called BigDFT, the developers have introduced a master-slave model to explicitly overlap blocking collective communication operations with local multi-threaded computation. Similarly, some applications parallelized with MPI, OpenMP and GPU acceleration could assign a management thread for GPU data and control orchestration and an MPI control thread for communication management while the CPU threads perform overlapping calculations, and could potentially set aside a background thread for file-I/O-based fault tolerance.
Considering these emerging application design needs, we would like to motivate the OpenMP standards committee, through examples and empirical results, to introduce thread and task heterogeneity in the language specification. This will allow code developers, especially those programming for large-scale distributed-memory HPC systems and accelerator devices, to design and develop portable solutions with overlapping control and data flow for their applications without resorting to custom solutions.
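The master-slave overlap mentioned in the abstract can be illustrated with a minimal sketch in C with OpenMP. It is not the BigDFT implementation: the names `overlap_step` and `blocking_collective` are hypothetical, and `blocking_collective` is only a stand-in for a blocking MPI collective such as `MPI_Allreduce`, so that the example runs without an MPI installation. Thread 0 is assumed to play the communication role while the remaining threads split the local loop among themselves by hand; if only one thread is available (or OpenMP is disabled), thread 0 does both.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical stand-in for a blocking collective (e.g. MPI_Allreduce).
 * Here it just doubles the value so the result is easy to check. */
static double blocking_collective(double local_value)
{
    return 2.0 * local_value;
}

/* One step in which thread 0 "communicates" while the other threads
 * compute out[i] = 2*in[i] + 1 on a manually partitioned index range. */
double overlap_step(const double *in, double *out, int n, double local_value)
{
    double reduced = 0.0;
    assert(in != NULL && out != NULL && n >= 0);

    #pragma omp parallel shared(reduced)
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        /* Workers are threads 1..nth-1; a lone thread is its own worker. */
        int nworkers = (nth > 1) ? nth - 1 : 1;
        int wid = (nth > 1) ? tid - 1 : 0;

        if (tid == 0) {
            /* Communication role: blocks only this thread. */
            reduced = blocking_collective(local_value);
        }
        if (nth == 1 || tid != 0) {
            /* Computation role: static block partition over the workers.
             * A worksharing "omp for" is avoided on purpose, because it
             * would also hand iterations to the communicating thread. */
            int chunk = (n + nworkers - 1) / nworkers;
            int lo = wid * chunk;
            int hi = (lo + chunk < n) ? lo + chunk : n;
            for (int i = lo; i < hi; ++i) {
                out[i] = 2.0 * in[i] + 1.0;
            }
        }
    } /* implicit barrier: both 'reduced' and 'out' are complete here */

    return reduced;
}
```

The manual partition and the role test on the thread number are the kind of custom solution the abstract refers to: the existing worksharing constructs treat all threads in a team alike, so assigning distinct roles has to be coded by hand. In a real hybrid code, calling MPI from inside a parallel region would additionally require an adequate thread support level to be requested at MPI initialization.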
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alam, S.R., Fourestey, G., Videau, B., Genovese, L., Goedecker, S., Dugan, N. (2012). Overlapping Computations with Communications and I/O Explicitly Using OpenMP Based Heterogeneous Threading Models. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds) OpenMP in a Heterogeneous World. IWOMP 2012. Lecture Notes in Computer Science, vol 7312. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30961-8_23
DOI: https://doi.org/10.1007/978-3-642-30961-8_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30960-1
Online ISBN: 978-3-642-30961-8
eBook Packages: Computer Science (R0)