Abstract
The US Department of Energy (DOE) started operating two GPU-based pre-exascale supercomputers in 2018 and plans to deploy another pre-exascale system in 2020 and three exascale supercomputers in 2021/2022. All of these systems are GPU-enabled, and they are planned to provide optimized, vendor-promoted programming models for their GPUs, such as CUDA, HIP and SYCL. However, because these models have limited functional portability, it is challenging for HPC application developers to maintain their applications efficiently and productively across all US DOE pre-exascale/exascale systems. Directive-based programming models for accelerators can be one of the solutions for HPC applications on the DOE supercomputers. In this study, we employ the OpenMP and OpenACC offloading models to port and re-implement the RI-MP2 Fortran kernel of the GAMESS application on a pre-exascale GPU system, Summit. We compare the performance of the restructured offloading kernels with that of the original OpenMP threading kernel. We also evaluate the performance of multiple math libraries on the NVIDIA V100 GPU in the RI-MP2 kernel. With the optimized directive-based offloading implementations, the RI-MP2 kernel on a single V100 GPU runs more than 7 times faster than on dual-socket Power9 processors, which is near the theoretical speed-up based on peak performance ratios. MPI+directive-based offloading implementations of the RI-MP2 kernel perform more than 40 times faster than an MPI+OpenMP threading implementation on the same number of Summit nodes. This study demonstrates that directive-based offloading implementations can perform close to what is expected from machine peak ratios.
Acknowledgment
This work was supported by the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357, and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by a grant from the Department of Energy Exascale Computing Project (ECP), administered by the Ames Laboratory. We also gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Last but not least, we would like to thank the Exascale Computing Project (ECP) and the Oak Ridge Leadership Computing Facility (OLCF) for organizing the 2019 ECP/OLCF OpenMP Hackathon in Knoxville, TN, and we give special thanks to our mentors, Dmytro Bykov from OLCF and Vivek Kale from BNL, for their contributions to this work.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Kwack, J., Bertoni, C., Pham, B., Larkin, J. (2020). Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries. In: Wienke, S., Bhalachandra, S. (eds) Accelerator Programming Using Directives. WACCPD 2019. Lecture Notes in Computer Science, vol 12017. Springer, Cham. https://doi.org/10.1007/978-3-030-49943-3_5
DOI: https://doi.org/10.1007/978-3-030-49943-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49942-6
Online ISBN: 978-3-030-49943-3
eBook Packages: Computer Science (R0)