GPU thread throttling for page-level thrashing reduction via static analysis | The Journal of Supercomputing


Abstract

Unified virtual memory was introduced in modern GPUs to give programmers a simpler programming model. It migrates memory pages between the GPU and CPU automatically, reducing the complexity of data management for programmers. However, when a GPU program generates a memory footprint that exceeds the GPU memory capacity, thrashing can occur, leading to significant performance degradation. To address this issue, this paper proposes a thread throttling technique that restricts the number of active thread groups, thereby alleviating memory oversubscription and improving performance. The proposed method adjusts the number of active thread groups at compile time so that their combined memory footprint fits within the available memory capacity. The effectiveness of the proposed method was evaluated using GPU programs that experience memory oversubscription. The results show that our approach improves the performance of the original programs by 3.44\(\times\) on average, which is a 1.53\(\times\) performance improvement over static thread throttling.
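
As a rough illustration of the footprint-fitting idea described in the abstract, the following Python sketch computes how many thread groups could be active at once without exceeding a given memory capacity. It is not the paper's static analysis, which this preview does not show. The function name, the fixed per-group footprint, the 4 KiB page size, and the assumption that thread groups touch disjoint pages are all hypothetical simplifications.

```python
def max_active_groups(footprint_per_group_bytes: int,
                      gpu_memory_bytes: int,
                      total_groups: int,
                      page_bytes: int = 4096) -> int:
    """Return how many thread groups may run concurrently so that their
    combined, page-rounded footprint fits in GPU memory.

    Assumptions (not from the paper): every group has the same footprint,
    groups touch disjoint pages, and at least one group must always run.
    """
    if footprint_per_group_bytes <= 0 or page_bytes <= 0 or total_groups <= 0:
        raise ValueError("footprint, page size, and group count must be positive")

    # Memory is migrated in whole pages, so round each footprint up to pages.
    pages_per_group = -(-footprint_per_group_bytes // page_bytes)
    bytes_per_group = pages_per_group * page_bytes

    # Number of groups whose pages fit in memory at the same time.
    fitting_groups = gpu_memory_bytes // bytes_per_group

    # Never throttle below one group, never exceed what the kernel launches.
    return max(1, min(total_groups, fitting_groups))


if __name__ == "__main__":
    # Hypothetical numbers: 6000-byte footprint per group, 64 pages of memory.
    print(max_active_groups(6000, 64 * 4096, total_groups=100))
```

In this simplified model, throttling only takes effect when the launched groups do not all fit. Otherwise the function returns the full group count and the kernel runs unthrottled.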

Availability of data and materials

The datasets generated during and/or analyzed during the current study are available from Hyunjun Kim on reasonable request.

Funding

This work is supported by an NRF grant (2021R1A2C2008877) and an IITP grant (2021000773) funded by the Korea government (MSIT).

Author information

Contributions

HK conceived the presented idea. HK developed and evaluated the idea. HK wrote the initial manuscript and HH helped to improve the writing and the structure of the manuscript. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Hyunjun Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, H., Han, H. GPU thread throttling for page-level thrashing reduction via static analysis. J Supercomput 80, 9829–9847 (2024). https://doi.org/10.1007/s11227-023-05787-y

  • DOI: https://doi.org/10.1007/s11227-023-05787-y
