{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,11,1]],"date-time":"2023-11-01T18:02:22Z","timestamp":1698861742619},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2016,12,2]],"date-time":"2016-12-02T00:00:00Z","timestamp":1480636800000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NRF","award":["2013R1A2A2A01069132, CNS-1525412 and CNS-1319501"]},{"name":"Spanish MINECO","award":["TIN2016-78799-P"]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2016,12,28]]},"abstract":"In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphic Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a GPU accesses the CPU memory only if it does not find its required data in the directories associated with its high-bandwidth memory, or the NMOESI coherency protocol limits the access to that data. Using UMH with NMOESI improves performance of a CPU-multiGPU system by at least 1.92 \u00d7 in comparison to alternative software-based approaches. 
It also allows the CPU to access the GPUs' modified data by at least 13 \u00d7 faster.","DOI":"10.1145\/2996190","type":"journal-article","created":{"date-parts":[[2016,12,2]],"date-time":"2016-12-02T19:13:45Z","timestamp":1480706025000},"page":"1-25","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["UMH"],"prefix":"10.1145","volume":"13","author":[{"given":"Amir Kavyan","family":"Ziabari","sequence":"first","affiliation":[{"name":"Northeastern University, Boston, MA"}]},{"given":"Yifan","family":"Sun","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, MA"}]},{"given":"Yenai","family":"Ma","sequence":"additional","affiliation":[{"name":"Boston University, Boston, MA"}]},{"given":"Dana","family":"Schaa","sequence":"additional","affiliation":[{"name":"Northeastern University"}]},{"given":"Jos\u00e9 L.","family":"Abell\u00e1n","sequence":"additional","affiliation":[{"name":"Universidad Cat\u00f3lica San Antonio de Murcia, Murcia, Spain"}]},{"given":"Rafael","family":"Ubal","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, MA"}]},{"given":"John","family":"Kim","sequence":"additional","affiliation":[{"name":"KAIST, South Korea"}]},{"given":"Ajay","family":"Joshi","sequence":"additional","affiliation":[{"name":"Boston University, Boston, MA"}]},{"given":"David","family":"Kaeli","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, MA"}]}],"member":"320","published-online":{"date-parts":[[2016,12,2]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1535\/itj.1003.02"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 354--365","author":"Agarwal N.","unstructured":"N. Agarwal, D. Nellans, M. O\u2019Connor, S. W. Keckler, and T. F. Wenisch. 2015. 
Unlocking bandwidth for GPUs in CC-NUMA systems. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 354--365."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2688500.2688527"},{"key":"e_1_2_1_4_1","unstructured":"AMD. 2012. AMD Graphics Cores Next (GCN) Architecture. White paper. https:\/\/www.amd.com\/Documents\/GCN_Architecture_whitepaper.pdf."},{"key":"e_1_2_1_5_1","unstructured":"AMD. 2014a. AMD Launches World\u2019s Fastest Graphics Card. Retrieved from http:\/\/www.amd.com\/en-us\/press-releases\/Pages\/fastest-graphics-card-2014apr8.aspx."},{"key":"e_1_2_1_6_1","unstructured":"AMD. 2014b. AMD Radeon HD 7800 Series Graphic Cards. (2014). Retrieved from http:\/\/www.amd.com\/en-us\/products\/graphics\/desktop\/7000\/7800."},{"key":"e_1_2_1_7_1","unstructured":"AMD. 2015a. High Bandwidth Memory. Retrieved from http:\/\/www.amd.com\/en-us\/innovations\/software-technologies\/hbm."},{"key":"e_1_2_1_8_1","unstructured":"AMD. 2015b. 
High-Bandwidth Memory (HBM): Reinventing Memory Technology. (2015). https:\/\/www.amd.com\/Documents\/High-Bandwidth-Memory-HBM.pdf."},{"key":"e_1_2_1_9_1","unstructured":"AMD. 2016. AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK). Retrieved from http:\/\/developer.amd.com\/sdks\/amdappsdk\/."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2015.72"},{"key":"e_1_2_1_11_1","volume-title":"AMD","author":"Boudier Pierre","year":"2011","unstructured":"Pierre Boudier and Graham Sellers. 2011. MEMORY SYSTEM ON FUSION APUS: The Benefits of Zero Copy. AMD, June 2011. Web. Nov. 11, 2016. http:\/\/developer.amd.com\/wordpress\/media\/2013\/06\/1004_final.pdf."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751218"},{"key":"e_1_2_1_13_1","volume-title":"Xilinx stacked silicon interconnect technology delivers breakthrough FPGA capacity, bandwidth, and power efficiency. Xilinx White Paper: Virtex-7 FPGAs","author":"Dorsey Patrick","year":"2010","unstructured":"Patrick Dorsey. 2010. Xilinx stacked silicon interconnect technology delivers breakthrough FPGA capacity, bandwidth, and power efficiency. Xilinx White Paper: Virtex-7 FPGAs (2010), 1--10."},{"key":"e_1_2_1_14_1","volume-title":"NVIDIA Updates GPU Roadmap","author":"Gupta NVIDIA","year":"2014","unstructured":"NVIDIA Gupta, Sumit. 2015. NVIDIA Updates GPU Roadmap; Announces Pascal. 
Retrieved from http:\/\/blogs.nvidia.com\/blog\/2014\/03\/25\/gpu-roadmap-pascal\/."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815968"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 14th International Conference on High Performance Computing (HiPC\u201907)","author":"Harish Pawan","unstructured":"Pawan Harish and P. J. Narayanan. 2007. Accelerating large graph algorithms on the GPU using CUDA. In Proceedings of the 14th International Conference on High Performance Computing (HiPC\u201907). Springer-Verlag, Berlin, 197--208."},{"key":"e_1_2_1_17_1","unstructured":"NVIDIA Harris Mark. 2013. Unified Memory in CUDA 6. Retrieved from http:\/\/devblogs.nvidia.com\/parallelforall\/unified-memory-in-cuda-6\/."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74735-2_15"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.51"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485957"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 1995 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD\u201995)","author":"Kadiyala M.","unstructured":"M. 
Kadiyala and L. N. Bhuyan. 1995. A dynamic cache sub-block design to reduce false sharing. In Proceedings of the 1995 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD\u201995). 313--318."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/2523721.2523744"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.55"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2038037.1941591"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture. IEEE, 243--252","author":"Kim Jesung","year":"1995","unstructured":"Jesung Kim, Sang Lyul Min, Sanghoon Jeon, Byoungchu Ahn, Deog Kyoon Jeong, and Chong Sang Kim. 1995. U-cache: A cost-effective solution to synonym problem. In Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture. IEEE, 243--252."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835963"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/L-CA.2013.19"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750374"},{"key":"e_1_2_1_29_1","volume-title":"Getting the Most from OpenCLTM 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel\u00ae Processor Graphics. Intel","author":"Lake Adam","year":"2014","unstructured":"Adam Lake. 2014. 
Getting the Most from OpenCLTM 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel\u00ae Processor Graphics. Intel 2014, Web. Nov. 11, 2016. https:\/\/software.intel.com\/sites\/default\/files\/managed\/f1\/25\/opencl-zero-copy-in-opencl-1-2.pdf."},{"key":"e_1_2_1_30_1","first-page":"2014","article-title":"Understanding performance of PCI express systems","volume":"28","author":"Lawley Jason","year":"2014","unstructured":"Jason Lawley. 2014. Understanding performance of PCI express systems. Xilinx, October 28, 2014. Web. Nov. 11, 2016. http:\/\/www.xilinx.com\/support\/documentation\/white_papers\/wp350.pdf.","journal-title":"Xilinx"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the 2015 15th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). 1092--1098","author":"Li Wenqiang","unstructured":"Wenqiang Li, Guanghao Jin, Xuewen Cui, and S. See. 2015. An evaluation of unified memory technology on NVIDIA GPUs. In Proceedings of the 2015 15th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). 1092--1098."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155673"},{"key":"e_1_2_1_33_1","unstructured":"Milo M. K. Martin. 2003. Token Coherence. Ph.D. Dissertation. University of Wisconsin--Madison."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2578948.2560689"},{"key":"e_1_2_1_35_1","unstructured":"Dan Negrut. 
2014. Unified Memory in CUDA 6.0: A Brief Overview. Retrieved from http:\/\/www.drdobbs.com\/parallel\/unified-memory-in-cuda-6-a-brief-overvie\/."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.88"},{"key":"e_1_2_1_37_1","unstructured":"NVIDIA. 2012. NVIDIA Maximus System Builder\u2019s Guide. Retrieved from http:\/\/www.nvidia.com\/content\/quadro\/maximus\/di-06471-001_v02.pdf."},{"key":"e_1_2_1_38_1","volume-title":"Whitepaper: Nvidia NVLink high-speed interconnect: application performance. NVIDIA","author":"NVIDIA.","year":"2014","unstructured":"NVIDIA. 2014. Whitepaper: Nvidia NVLink high-speed interconnect: application performance. NVIDIA Nov. 2014, Web. Nov. 11, 2016. http:\/\/info.nvidianews.com\/rs\/nvidia\/images\/NVIDIA%20NVLink%20High-Speed%20Interconnect%20Application%20Performance%20Brief.pdf."},{"key":"e_1_2_1_39_1","unstructured":"NVIDIA. 2015a. NVIDIA CUDA C Programming Guide: Version 7.5. (2015). Retrieved from http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/"},{"key":"e_1_2_1_40_1","unstructured":"NVIDIA. 2015b. Tesla K80 GPU Accelerator. 
Retrieved from https:\/\/images.nvidia.com\/content\/pdf\/kepler\/Tesla-K80-BoardSpec-07317-001-v05.pdf."},{"key":"e_1_2_1_41_1","unstructured":"Sreepathi Pai. 2014. Microbenchmarking Unified Memory in CUDA 6.0. Retrieved from http:\/\/users.ices.utexas.edu\/sreepai\/automem\/."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541942"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540747"},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914)","author":"Power Jonathan","unstructured":"Jonathan Power, Mark D. Hill, and David A. Wood. 2014. Supporting x86-64 address translation for 100s of GPU lanes. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914). IEEE, 568--578."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.30"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2009.5161068"},{"key":"e_1_2_1_48_1","volume-title":"X86 Instruction Set Architecture","author":"Shanley Tom","unstructured":"Tom Shanley. 2010. X86 Instruction Set Architecture. Mindshare Press."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830821"},{"key":"e_1_2_1_51_1","volume-title":"High bandwidth memory (HBM) dram. JESD235","author":"Standard JEDEC","year":"2013","unstructured":"JEDEC Standard. 2013. 
High bandwidth memory (HBM) dram. JESD235 (2013)."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.102"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2016.7581262"},{"key":"e_1_2_1_54_1","unstructured":"NVIDIA SuperMicro. 2016. Revolutionising High Performance Computing with Supermicro Solutions Using Nvidia Tesla. Retrieved from http:\/\/goo.gl\/2YEKIq."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370865"},{"key":"e_1_2_1_56_1","unstructured":"Rafael Ubal and David Kaeli. 2015. The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing. 
Retrieved from www.multi2sim.org."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750399"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2996190","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,31]],"date-time":"2022-12-31T06:43:26Z","timestamp":1672469006000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2996190"}},"subtitle":["A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs"],"short-title":[],"issued":{"date-parts":[[2016,12,2]]},"references-count":55,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,12,28]]}},"alternative-id":["10.1145\/2996190"],"URL":"https:\/\/doi.org\/10.1145\/2996190","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,12,2]]},"assertion":[{"value":"2016-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-12-02","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}