Abstract
Unified virtual memory (UVM) was introduced in modern GPUs to provide a new programming model in which memory pages are migrated between the GPU and CPU automatically, reducing the complexity of data management for programmers. However, when a GPU program generates a memory footprint that exceeds the GPU memory capacity, page thrashing can occur, leading to significant performance degradation. To address this issue, this paper proposes a thread-throttling technique that restricts the number of active thread groups, thereby alleviating memory oversubscription and improving performance. The proposed method adjusts the number of active thread groups at compile time so that their combined memory footprint fits within the available memory capacity. We evaluated the method on GPU programs that suffer from memory oversubscription. The results show that our approach improves the performance of the original programs by 3.44× on average, a 1.53× improvement over static thread throttling.
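The throttling decision described in the abstract amounts to capping the number of concurrently active thread groups so that their combined footprint stays within device memory. A minimal sketch of that sizing calculation, assuming a simple per-group footprint model (the function name, parameters, and example figures are our own illustration, not the paper's actual compile-time analysis):

```python
def max_active_groups(per_group_footprint_bytes: int,
                      gpu_memory_bytes: int,
                      total_groups: int) -> int:
    """Largest number of thread groups whose combined footprint
    still fits in GPU memory (at least one group must run)."""
    if per_group_footprint_bytes <= 0:
        # No measurable footprint: no throttling needed.
        return total_groups
    groups_that_fit = gpu_memory_bytes // per_group_footprint_bytes
    return max(1, min(total_groups, groups_that_fit))

# Hypothetical example: a 16 GiB GPU, each group touching 256 MiB,
# and a kernel launched with 128 groups. Only 64 groups can be
# resident without oversubscribing device memory.
print(max_active_groups(256 * 2**20, 16 * 2**30, 128))  # → 64
```

In the paper's setting this bound is derived statically from the kernel's memory-access analysis, whereas the sketch above takes the per-group footprint as a given input.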
Availability of data and materials
The datasets generated and/or analyzed during the current study are available from Hyunjun Kim upon reasonable request.
Funding
This work was supported by an NRF grant (2021R1A2C2008877) and an IITP grant (2021000773) funded by the Korea government (MSIT).
Author information
Contributions
HK conceived the presented idea, developed and evaluated it, and wrote the initial manuscript. HH helped to improve the writing and the structure of the manuscript. All authors discussed the results and contributed to the final manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kim, H., Han, H. GPU thread throttling for page-level thrashing reduction via static analysis. J Supercomput 80, 9829–9847 (2024). https://doi.org/10.1007/s11227-023-05787-y