Abstract
In the past years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimize loops by using software pipelining approaches. They are proven to be effective in reducing the total execution time of multimedia and signal processing applications. However, the run-time reconfigurability of CGRAs is hampered overheads introduced by the needed translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. We propose a greedy approach, since the modulo scheduling for CGRA is an NP-complete problem. In addition to read-after-write dependencies, the dynamic modulo scheduling faces new challenges, such as register insertion to solve recurrence dependences and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility as it supports different source ISAs. As proof of concept of scaling, a change in the memory bandwidth has been evaluated. In this analysis it is demonstrated that when changing from one memory access per cycle to two memory accesses per cycle, the modulo scheduling algorithm is able to exploit this increase in memory bandwidth and enhance performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements and including an 4-issue VLIW host processor) is only 1.11 × bigger than a standalone 8-issue VLIW softcore processor.
Similar content being viewed by others
References
Ahn, M., Yoon, J.W., Paek, Y., Kim, Y., Kiemb, M., Choi, K. (2006). A spatial mapping algorithm for heterogeneous coarse-grained reconfigurable architectures. In: Proceedings DATE, pp. 363–368.
Arnold, O., Matus, E., Noethen, B., Winter, M., Limberg, T., Fettweis, G. (2014). Tomahawk: Parallelism and heterogeneity in communications signal processing mpsocs. ACM Transactions on Embedded Computing Systems, 13(3s), 107:1–107:241.
Beck, A.C.S., Rutzig, M.B., Gaydadjiev, G., Carro, L. (2008). Transparent reconfigurable acceleration for heterogeneous embedded applications. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1208–1213.
Bispo, J., Paulino, N., Cardoso, J.M., Ferreira, J.C. (2013). Transparent runtime migration of loop-based traces of processor instructions to reconfigurable processing units. International Journal of Reconfigurable Computing
Bispo, J., Paulino, N., Ferreira, J., Cardoso, J. (2012). Transparent trace-based binary acceleration for reconfigurable hw/sw systems. IEEE Transactions on Industrial Informatics.
Boppu, S., Hannig, F., Teich, J. (2014). Compact code generation for tightly-coupled processor arrays. Journal of Signal Processing Systems, 77(1-2), 5–29.
Bouwens, F., Berekovic, M., Kanstein, A., Gaydadjiev, G. (2007). Architectural exploration of the adres coarse-grained reconfigurable array. In Proceedings ARC pp. 1–13.
Chen, L., & Mitra, T. (2012). Graph minor approach for application mapping on cgras. In: Proceedings FPT.
De Sutter, B., Coene, P., Vander Aa, T., Mei, B. (2008). Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In Proceedings LCTES (pp. 151–160).
Ferreira, R., Duarte, V., Meireles, W., Pereira, M., Carro, L., Wong, S. (2013). A just-in-time modulo scheduling for virtual coarse-grained reconfigurable architectures. In SAMOS XIII.
Ferreira, R., Vendramini, J.G., Mucida, L., Pereira, M.M., Carro, L. (2011). An fpga-based heterogeneous coarse-grained dynamically reconfigurable architecture. In: Proceedings CASES.
Friedman, S., Carroll, A., Van Essen, B., Ylvisaker, B., Ebeling, C., Hauck, S. (2009). Spr: an architecture-adaptive cgra mapping tool. In: Proceeding of the ACM/SIGDA international symposium on field programmable gate arrays, FPGA ’09 (pp. 191–200). New York: ACM
Goel, N., Kumar, A., Panda, P.R. (2014). Shared-port register file architecture for low-energy vliw processors. ACM Transactions Architectural Code Optimization, 11(1).
Hamzeh, M., Shrivastava, A., Vrudhula, S. (2012). Epimap: Using epimorphism to map applications on CGRAs. In Proceeding of DAC conference (pp. 1280–1287).
Hamzeh, M., Shrivastava, A., Vrudhula, S. (2014). Branch-aware loop mapping on CGRAs. In Proceeding of DAC conference on design automation conference (pp. 1–6). ACM.
Hamzeh, M., Shrivastava, A., Vrudhula, S.B. (2013). Regimap: register-aware application mapping on coarse-grained reconfigurable architectures (CGRAs). In Proceeding of DAC conference (p. 18).
Hartenstein, R. (2001). Coarse grain reconfigurable architecture (embedded tutorial). In: Proceedings of the 2001 asia and south pacific design automation conference, ASP-DAC ’01.
Hatanaka, A., & Bagherzadeh, N. (2007). A modulo scheduling algorithm for a coarse-grain reconfigurable array template. In IPDPS 2007 (pp. 1–8).
Hoogerbrugge, J., & Corporaal, H. (1994). Register file port requirements of transport triggered architectures. In: Proceedings of the 27th annual international symposium on microarchitecture. (pp. 191–195). ACM.
Jääskeläinen, P., Kultala, H., Viitanen, T., Takala, J. (2014). Code density and energy efficiency of exposed datapath architectures. Journal of Signal Processing Systems, 1–16.
Kim, Y., Lee, J., Shrivastava, A., Yoon, J., Cho, D., Paek, Y. (2011). High throughput data mapping for coarse-grained reconfigurable architectures. IEEE Transactions on CAD of International Circuits and Systems, 30 (11), 1599 –1609. doi:10.1109/TCAD.2011.2161217.
Laboratories, & H.P. (2014). Vex toolchain. http://www.hpl.hp.com/downloads/vex/.
Lee, G., Choi, K., Dutt, N. (2011). Mapping multi-domain applications onto coarse-grained reconfigurable architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(5), 637–650.
Lin, T.J., Chen, S.K., Kuo, Y.T., Liu, C.W., Hsiao, P.C. (2008). Design and implementation of a high-performance and complexity-effective vliw dsp for multimedia applications. Journal of Signal Processing Systems, 51(3), 209–223.
Loeffler, C., Ligtenberg, A., Moschytz, G.S. (1989). Practical fast 1-d dct algorithms with 11 multiplications. IEEE international conference on acoustics, speech, and signal processing, 1989. ICASSP-89 (pp. 988–991).
McCool, M. (2007). Signal processing and general-purpose computing and gpus [exploratory dsp]. IEEE Signal Processing Magazine, 24(3), 109–114.
Mei, B., Vernalde, S., Verkest, D., De Man, H., Lauwereins, R. (2002). Dresc: a retargetable compiler for coarse-grained reconfigurable architectures. In Proceedings FPT pp. 166–173.
Mei, B., Vernalde, S., Verkest, D., Man, H.D., Lauwereins, R. (2003). Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In: Proceedings DATE.
Oh, T., Egger, B., Park, H., Mahlke, S. (2009). Recurrence cycle aware modulo scheduling for coarse-grained reconfigurable architectures. In Proceedings LCTES pp. 21–30.
Paek, J.K., Choi, K., Lee, J. (2011). Binary acceleration using coarse-grained reconfigurable architecture. SIGARCH Computers Architecture News, 38(4), 33–39.
Park, H., Fan, K., Kudlur, M., Mahlke, S. (2006). Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures. In Proceedings CASES (pp. 136–146).
Park, H., Fan, K., Mahlke, S.A., Oh, T., Kim, H., Kim, H.s. (2008). Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In: Proceedings PACT.
Park, H., Park, Y., Mahlke, S. (2009). Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications. In Proceedings MICRO (pp. 370–380).
Rau, B.R. (1994). Iterative modulo scheduling: an algorithm for software pipelining loops. In Proceedings MICRO (pp. 63–74).
Wong, S., Van As, T., Brown, G. (2008). ρ-vex: A reconfigurable and extensible softcore vliw processor. In International conference on field-programmable technology FPT (pp. 369–372). IEEE.
Yoon, J., Shrivastava, A., Park, S., Ahn, M., Jeyapaul, R., Paek, Y. (2008). Spkm: A novel graph drawing based algorithm for application mapping onto coarse-grained reconfigurable architectures. In Proceedings ASPDAC (pp. 776–782).
Zhou, L., Liu, H., Zhang, J. (2013). Loop acceleration by cluster-based cgra. IEICE Electronics Express, 10(16).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Ferreira, R., Denver, W., Pereira, M. et al. A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility. J Sign Process Syst 85, 45–66 (2016). https://doi.org/10.1007/s11265-015-0974-8
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11265-015-0974-8