Abstract
As chip multi-processors (CMPs) become increasingly complex, software solutions such as parallel programming models are attracting significant attention. Task-based parallel programming models offer an appealing way to utilize complex CMPs. However, the growing number of cores on modern CMPs is pushing research towards the use of fine-grained parallelism, and task-based programming models need to handle such workloads while delivering performance and scalability. Using specialized hardware to boost the performance of task-based programming models is a common practice in the research community.
This paper observes that task creation becomes a bottleneck when executing fine-grained parallel applications with many task-based programming models. As the number of cores increases, the time spent generating tasks accounts for a growing fraction of the total execution time. To overcome this issue, we propose TaskGenX, which minimizes task creation overheads by relying on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities and offloads it to specialized hardware. We derive the requirements this hardware must meet to boost the execution of highly parallel applications. Evaluating 11 parallel workloads on both symmetric and asymmetric multicore systems, we obtain performance improvements of up to 15×, averaging 3.1× over the baseline.
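To make the decoupling concrete, the following is a minimal software sketch of the idea, not the paper's implementation: the master thread emits lightweight task descriptors instead of fully constructing tasks, and a dedicated creator thread (standing in here for the TaskGenX hardware unit) materializes and launches them. All names (TaskDescriptor, CreationQueue) are illustrative assumptions.

```cpp
// Software emulation of decoupled task creation. The creator thread plays
// the role of the specialized hardware; all identifiers are hypothetical.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Lightweight request the master thread produces instead of building the
// full task object itself.
struct TaskDescriptor {
    std::function<void()> work;  // body of the task
};

class CreationQueue {
public:
    void push(TaskDescriptor d) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(d)); }
        cv_.notify_one();
    }
    // Blocks until a descriptor is available or the queue is closed and empty.
    bool pop(TaskDescriptor& d) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        d = std::move(q_.front());
        q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::queue<TaskDescriptor> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

int main() {
    CreationQueue creation_queue;   // master -> creator channel
    std::vector<std::thread> workers;

    // Dedicated creator thread standing in for the TaskGenX hardware unit:
    // it materializes ready tasks so the master thread only emits descriptors.
    std::thread creator([&] {
        TaskDescriptor d;
        while (creation_queue.pop(d)) {
            // A real runtime would allocate the task, resolve dependences and
            // insert it into a ready queue; spawning a thread per task is a
            // deliberate simplification here.
            workers.emplace_back(std::move(d.work));
        }
    });

    // Master thread: creating a task is now just a cheap descriptor push.
    for (int i = 0; i < 8; ++i)
        creation_queue.push({[i] { std::cout << "task " << i << " ran\n"; }});

    creation_queue.close();
    creator.join();
    for (auto& w : workers) w.join();
}
```

The design point the sketch captures is that the producer side (the master thread) pays only the cost of a queue push, while the expensive task materialization happens concurrently on the consumer side, which is exactly the work TaskGenX proposes to move into hardware.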
Notes
1. Details about the benchmarks used are in Sect. 4.
2. The experimental set-up is explained in Sect. 4.
3. Nanos++ also supports nested parallelism, so any of the worker threads can potentially create tasks (a minimal illustration follows this list). However, the majority of existing parallel applications are not implemented using nested parallelism.
4. Section 6 further describes these proposals.
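For readers unfamiliar with the nested task creation mentioned in note 3, here is a generic OpenMP illustration (OmpSs/Nanos++ offer analogous task pragmas); it is an assumed example, not code from the paper.

```cpp
// Nested task creation: the task spawned by the "master-like" single thread
// creates further tasks itself. Compile with e.g. g++ -fopenmp nested.cpp
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task            // outer task, created by the single thread
        {
            // Nested parallelism: a worker thread, not the master,
            // now creates tasks of its own.
            for (int i = 0; i < 4; ++i) {
                #pragma omp task firstprivate(i)
                std::printf("nested task %d on thread %d\n",
                            i, omp_get_thread_num());
            }
            #pragma omp taskwait    // wait for the nested tasks
        }
    }
    return 0;
}
```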
Acknowledgements
This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 779877. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104. Finally, the authors would like to thank Thomas Grass for his valuable help with the simulator.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Chronaki, K., Casas, M., Moreto, M., Bosch, J., Badia, R.M. (2018). TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol. 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_20