Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries

  • Conference paper
Accelerator Programming Using Directives (WACCPD 2019)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12017)

Abstract

The US Department of Energy (DOE) started operating two GPU-based pre-exascale supercomputers in 2018 and plans to deploy another pre-exascale system in 2020 and three exascale supercomputers in 2021/2022. All of these systems are GPU-enabled, and they plan to provide optimized vendor-promoted programming models for their GPUs, such as CUDA, HIP, and SYCL. However, because these models have limited functional portability, it is challenging for HPC application developers to maintain their applications efficiently and productively across all US DOE pre-exascale and exascale systems. Directive-based programming models for accelerators are one possible solution for HPC applications on the DOE supercomputers. In this study, we use the OpenMP and OpenACC offloading models to port and re-implement the RI-MP2 Fortran kernel of the GAMESS application on a pre-exascale GPU system, Summit. We compare the performance of the restructured offloading kernels with that of the original OpenMP threading kernel. We also evaluate the performance of multiple math libraries on the NVIDIA V100 GPU in the RI-MP2 kernel. With the optimized directive-based offloading implementations, the RI-MP2 kernel on a single V100 GPU runs more than 7 times faster than on dual-socket Power9 processors, which is near the theoretical speed-up based on peak performance ratios. MPI+directive-based offloading implementations of the RI-MP2 kernel run more than 40 times faster than an MPI+OpenMP threading implementation on the same number of Summit nodes. This study demonstrates that directive-based offloading implementations can perform close to what machine peak ratios predict.
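
To make the idea of directive-based offloading concrete, the following is a minimal sketch of a closed-shell RI-MP2 correlation energy loop annotated with an OpenMP `target` directive. It is not the authors' kernel: it is written in C rather than Fortran to stay compact, it forms the integrals with explicit loops instead of the math-library matrix multiplications that the abstract refers to, and the function name `rimp2_energy`, the argument names, and the flat `b[(i*nvirt + a)*naux + p]` layout are all assumptions made for illustration. The energy expression is the standard textbook form, with (ia|jb) approximated as a sum over auxiliary functions of B(i,a,P) B(j,b,P).

```c
/* Illustrative sketch only: names and data layout are hypothetical.
 * b     : three-index RI factors, b[(i*nvirt + a)*naux + p]
 * eocc  : occupied orbital energies, length nocc
 * evirt : virtual orbital energies, length nvirt
 * Returns sum_{ijab} (ia|jb) * [2(ia|jb) - (ib|ja)] / (e_i + e_j - e_a - e_b). */
double rimp2_energy(const double *b, const double *eocc, const double *evirt,
                    int nocc, int nvirt, int naux)
{
    double e = 0.0;

    /* Offload the occupied-pair loops; without OpenMP support the pragma
     * is ignored and the loops simply run on the host. */
    #pragma omp target teams distribute parallel for collapse(2) \
        reduction(+:e) map(tofrom: e) \
        map(to: b[0:nocc*nvirt*naux], eocc[0:nocc], evirt[0:nvirt])
    for (int i = 0; i < nocc; ++i) {
        for (int j = 0; j < nocc; ++j) {
            for (int a = 0; a < nvirt; ++a) {
                for (int c = 0; c < nvirt; ++c) {
                    double iajb = 0.0;  /* (ia|jc) */
                    double icja = 0.0;  /* (ic|ja), the exchange-type term */
                    for (int p = 0; p < naux; ++p) {
                        iajb += b[(i*nvirt + a)*naux + p] * b[(j*nvirt + c)*naux + p];
                        icja += b[(i*nvirt + c)*naux + p] * b[(j*nvirt + a)*naux + p];
                    }
                    e += iajb * (2.0 * iajb - icja)
                         / (eocc[i] + eocc[j] - evirt[a] - evirt[c]);
                }
            }
        }
    }
    return e;
}
```

In a production kernel the inner contraction over `p` would typically be replaced by a matrix-matrix multiplication per occupied pair, which is where a GPU math library becomes relevant; the directive then mainly governs data movement and the remaining reduction.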

Acknowledgment

This work was supported by the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357, by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by a grant from the Department of Energy Exascale Computing Project (ECP), administered by the Ames Laboratory. We also gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Last but not least, we would like to thank the Exascale Computing Project (ECP) and the Oak Ridge Leadership Computing Facility (OLCF) for organizing the 2019 ECP/OLCF OpenMP Hackathon in Knoxville, TN, and we give special thanks to our mentors, Dmytro Bykov from OLCF and Vivek Kale from BNL, for their contributions to this work.

Author information

Corresponding author

Correspondence to JaeHyuk Kwack.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 301 KB)

Appendix I

Table 12. Fortran wrapper for cuBLAS and cuBLASXT functions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Kwack, J., Bertoni, C., Pham, B., Larkin, J. (2020). Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries. In: Wienke, S., Bhalachandra, S. (eds) Accelerator Programming Using Directives. WACCPD 2019. Lecture Notes in Computer Science, vol 12017. Springer, Cham. https://doi.org/10.1007/978-3-030-49943-3_5

  • DOI: https://doi.org/10.1007/978-3-030-49943-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-49942-6

  • Online ISBN: 978-3-030-49943-3

  • eBook Packages: Computer Science, Computer Science (R0)
