Abstract
General-purpose graphics processing units (GPGPUs) have become one of the most important high-performance platforms for high-throughput applications. However, contention for on-chip resources often arises because a GPGPU runs a large number of threads concurrently, and this contention has become an important factor limiting GPGPU performance. We propose a memory-aware TLP throttling and cache bypassing (MATB) mechanism that exploits data temporal locality and memory bandwidth. It aims to keep cache blocks with good data locality in the L1D cache longer while maintaining on-chip resource utilization. On the one hand, it alleviates cache contention by preventing memory warps with poor data reuse from being scheduled while cache contention and on-chip network congestion occur. On the other hand, it uses memory bandwidth more effectively through cache bypassing. Experimental results show that, at low hardware cost, MATB achieves average performance improvements of 26.6% and 14.2% over GTO and DYNCTA, respectively.
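To make the two decisions described above concrete, the following C++ sketch models how a warp scheduler could combine a contention signal with a per-warp reuse estimate. It is an illustrative reconstruction, not the paper's hardware design: every struct field, counter, and threshold here is hypothetical.

```cpp
// Minimal sketch of an MATB-style throttle/bypass decision.
// All state and thresholds are hypothetical stand-ins for the
// hardware counters a real design would sample.
#include <cstdint>

struct WarpState {
    std::uint32_t reuse_count;  // hypothetical counter of L1D hits on this warp's blocks
    bool is_memory_warp;        // warp currently issuing or stalled on a memory request
};

struct SMState {
    double l1d_miss_rate;       // sampled L1D miss rate
    double noc_occupancy;       // sampled on-chip network buffer occupancy
};

// Hypothetical thresholds; a real design would tune these empirically.
constexpr double MISS_RATE_HIGH = 0.6;
constexpr double NOC_BUSY_HIGH  = 0.7;
constexpr std::uint32_t REUSE_LOW = 2;

// Throttling: hold back memory warps with poor reuse while both the
// cache and the network are under pressure, so blocks with good
// temporal locality stay resident in L1D longer.
bool should_throttle(const WarpState& w, const SMState& sm) {
    bool contended = sm.l1d_miss_rate > MISS_RATE_HIGH &&
                     sm.noc_occupancy > NOC_BUSY_HIGH;
    return contended && w.is_memory_warp && w.reuse_count < REUSE_LOW;
}

// Bypassing: requests from low-reuse warps skip L1D allocation and go
// straight to the lower memory hierarchy, spending bandwidth instead
// of polluting the cache.
bool should_bypass_l1d(const WarpState& w) {
    return w.reuse_count < REUSE_LOW;
}
```

The design intuition is that throttling and bypassing are complementary: throttling protects locality-friendly blocks already in L1D, while bypassing routes locality-unfriendly traffic around the cache so the otherwise idle memory bandwidth absorbs it.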
Acknowledgements
The authors would like to thank the reviewers for their valuable suggestions, which helped improve this work greatly. This work was partially supported by the National Natural Science Foundation of China (Project Nos. 61373039, 61662002, 61462004), the Natural Science Foundation of Jiangxi Province, China (Project Nos. 20151BAB207042, 20161BAB212056), the Key Research and Development Plan of the Science Department of Jiangxi Province, China (No. 20161BBE50063), and the Science and Technology Project of the Education Department of Jiangxi Province, China (Project No. GJJ150605). Yanxiang He is the corresponding author.
Cite this article
Zhang, J., He, Y., Shen, F. et al. Memory-aware TLP throttling and cache bypassing for GPUs. Cluster Comput 22 (Suppl 1), 871–883 (2019). https://doi.org/10.1007/s10586-017-1396-0