{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T13:11:18Z","timestamp":1740143478880,"version":"3.37.3"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,2,13]],"date-time":"2019-02-13T00:00:00Z","timestamp":1550016000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004663","name":"Ministry of Science and Technology of Taiwan","doi-asserted-by":"crossref","award":["MOST-106-2218-E-002-040"],"id":[{"id":"10.13039\/501100004663","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,3,31]]},"abstract":"Single instruction multiple data (SIMD) has been adopted for decades because of its superior performance and power efficiency. The SIMD capability (i.e., width, number of registers, and advanced instructions) has diverged rapidly on different SIMD instruction-set architectures (ISAs). Therefore, migrating existing applications to another host ISA that has fewer but longer SIMD registers and more advanced instructions raises the issues of asymmetric SIMD capability. To date, this issue has been overlooked and the host SIMD capability is underutilized, resulting in suboptimal performance. In this article, we present a novel binary translation technique called spill-aware superword level parallelism (saSLP), which combines short ARMv8 instructions and registers in the guest binaries to exploit the x86 AVX2 host\u2019s parallelism, register capacity, and gather instructions. 
Our experiment results show that saSLP improves the performance by 1.6\u00d7 (2.3\u00d7) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 to x86 AVX2 (AVX-512) translation. Furthermore, with AVX2 (AVX-512) gather instructions, saSLP speeds up several data-irregular applications that cannot be vectorized on ARMv8 NEON by up to 3.9\u00d7 (4.2\u00d7).","DOI":"10.1145\/3301488","type":"journal-article","created":{"date-parts":[[2019,2,14]],"date-time":"2019-02-14T19:36:17Z","timestamp":1550172977000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation"],"prefix":"10.1145","volume":"16","author":[{"given":"Yu-Ping","family":"Liu","sequence":"first","affiliation":[{"name":"National Taiwan University, Daan Dist. Taipei, Taiwan"}]},{"given":"Ding-Yong","family":"Hong","sequence":"additional","affiliation":[{"name":"Academia Sinica, Nankang Dist. Taipei, Taiwan"}]},{"given":"Jan-Jan","family":"Wu","sequence":"additional","affiliation":[{"name":"Academia Sinica, Nankang Dist. Taipei, Taiwan"}]},{"given":"Sheng-Yu","family":"Fu","sequence":"additional","affiliation":[{"name":"National Taiwan University, Daan Dist. Taipei, Taiwan"}]},{"given":"Wei-Chung","family":"Hsu","sequence":"additional","affiliation":[{"name":"National Taiwan University, Daan Dist. Taipei, Taiwan"}]}],"member":"320","published-online":{"date-parts":[[2019,2,13]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"ARM Ltd. 2015. ARM Cortex-A Series Programmer\u2019s Guide for ARMv8-A. ARM Ltd. 2015. ARM Cortex-A Series Programmer\u2019s Guide for ARMv8-A."},{"key":"e_1_2_1_2_1","unstructured":"ARM Ltd. 2017. ARM Architecture Reference Manual Supplement: The Scalable Vector Extension (SVE) for ARMv8-A. ARM Ltd. 2017. 
ARM Architecture Reference Manual Supplement: The Scalable Vector Extension (SVE) for ARMv8-A."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1177\/109434209100500306"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/349299.349303"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/956417.956550"},{"volume-title":"Proceedings of the USENIX Annual Technical Conference. USENIX, 41--46","year":"2005","author":"Bellard Fabrice","key":"e_1_2_1_6_1"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.671403"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346199"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/776261.776263"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2843859.2843867"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"Agner Fog. 2018. Lists of Instruction Latencies Throughputs and Micro-operation Breakdowns for Intel AMD and VIA CPUs. https:\/\/www.agner.org\/optimize\/instruction_tables.pdf. Agner Fog. 2018. Lists of Instruction Latencies Throughputs and Micro-operation Breakdowns for Intel AMD and VIA CPUs. https:\/\/www.agner.org\/optimize\/instruction_tables.pdf.","DOI":"10.1063\/1.5046674"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2015.70"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-016-0480-z"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/SAMOS.2015.7363680"},{"key":"e_1_2_1_16_1","unstructured":"Israel Hirsh and S. Gideon. 2017. Intel Architecture Code Analyzer User\u2019s Guide. Israel Hirsh and S. Gideon. 2017. 
Intel Architecture Code Analyzer User\u2019s Guide."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2016.0115"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2259016.2259030"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.14"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995896.1995900"},{"key":"e_1_2_1_21_1","unstructured":"Intel Corp. 2018a. Intel 64 and IA-32 Architectures Optimization Reference Manual. Intel Corp. 2018a. Intel 64 and IA-32 Architectures Optimization Reference Manual."},{"key":"e_1_2_1_22_1","unstructured":"Intel Corp. 2018b. Intel 64 and IA-32 Architectures Software Developer\u2019s Manual. Intel Corp. 2018b. Intel 64 and IA-32 Architectures Software Developer\u2019s Manual."},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"J. Jeffers J. Reinders and A. Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Elsevier Science. J. Jeffers J. Reinders and A. Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. 
Elsevier Science.","DOI":"10.1016\/B978-0-12-809194-4.00002-8"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145824"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/349299.349320"},{"volume-title":"Proceedings of the International Symposium on Code Generation and Optimization (CGO\u201904)","author":"Lattner Chris","key":"e_1_2_1_26_1"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2006.27"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2017.15"},{"key":"e_1_2_1_29_1","first-page":"1","article-title":"Design and implementation of a lightweight dynamic optimization system","volume":"6","author":"Lu Jiwei","year":"2004","journal-title":"Journal of Instruction-Level Parallelism"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065010.1065034"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.68"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/DATE.2011.5763274"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250746"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/2190025.2190062"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/545215.545246"},{"volume-title":"Proceedings of the International Symposium on Code Generation and Optimization (CGO\u201915)","author":"Porpodas Vasileios","key":"e_1_2_1_36_1"},{"key":"e_1_2_1_37_1","unstructured":"RISC-V Foundation. 2016. RISC-V Vector Extension Proposal. RISC-V Foundation. 2016. RISC-V Vector Extension Proposal."},{"volume-title":"Proceedings of the GCC Developers Summit. 
Red Hat Inc., 131--142","year":"2007","author":"Rosen Ira","key":"e_1_2_1_38_1"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2005.33"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Cheng Wang Shiliang Hu Ho-Seop Kim Sreekumar R. Nair Mauricio Breternitz Jr. Zhiwei Ying and Youfeng Wu. 2007. StarDBT: An efficient multi-platform dynamic binary translation system. In Proceedings of the Asia-Pacific Computer Systems Architecture Conference Lecture Notes in Computer Science Vol. 4697. Springer Berlin 4--15.","DOI":"10.1007\/978-3-540-74309-5_3"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2886101"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2854038.2854054"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3301488","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,1]],"date-time":"2023-01-01T11:31:36Z","timestamp":1672572696000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3301488"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,2,13]]},"references-count":41,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,3,31]]}},"alternative-id":["10.1145\/3301488"],"URL":"https:\/\/doi.org\/10.1145\/3301488","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2019,2,13]]},"assertion":[{"value":"2018-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}