Abstract
Heterogeneous computing is consolidating within parallel computing, with architectures such as the Cell Broadband Engine (Cell BE) and Graphics Processing Units (GPUs) now present in a myriad of high-performance computing developments. These platforms provide a Software Development Kit (SDK) that maximizes performance at the expense of exposing complex, low-level architectural details, which makes software development a daunting task. This paper explores stencil computations in several heterogeneous programming models (Cell SDK, CellSs, ALF and CUDA) to optimize the Jacobi method for solving Laplace’s differential equation. We describe the programming techniques used to extract maximum performance on the Cell BE and the GPU, and compare their computing paradigms. Experimental results are shown on two Nvidia Teslas and one IBM BladeCenter QS20 blade which incorporates two 3.2 GHz Cell BEs v 5.1. The speed-up factor for our set of GPU optimizations reaches 3–4×, and the GPU execution times beat those of the Cell BE by an order of magnitude, while also scaling well towards newer GPU generations and/or more demanding problem sizes.
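For orientation, the kernel at the heart of this study can be sketched in a few lines. The following is only a minimal sequential reference, assuming a 2D grid, the classic 5-point stencil and fixed (Dirichlet) boundary values; the abstract does not state these details, and the names `jacobi_laplace` and `make_grid` are hypothetical. None of the Cell BE or GPU optimizations discussed in the paper is shown.

```python
def jacobi_laplace(grid, iterations):
    """Run `iterations` Jacobi sweeps of the 5-point stencil on a 2D grid.

    Assumption: boundary cells hold fixed values and are never updated;
    each interior cell becomes the average of its four neighbours.
    The input grid is left unmodified.
    """
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]  # buffer read during a sweep
    updated = [row[:] for row in grid]  # buffer written during a sweep
    for _ in range(iterations):
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                updated[i][j] = 0.25 * (
                    current[i - 1][j] + current[i + 1][j]
                    + current[i][j - 1] + current[i][j + 1]
                )
        # Swap buffers: both carry the same boundary values, and every
        # interior cell is overwritten on the next sweep.
        current, updated = updated, current
    return current


def make_grid(n, top=1.0):
    """Hypothetical test problem: n x n grid, top edge at `top`, rest 0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [top] * n
    return grid


if __name__ == "__main__":
    result = jacobi_laplace(make_grid(5), 100)
    print(result[2][2])  # value at the centre of the grid
```

Every interior update within a sweep reads only the previous iterate, so all cells can be computed independently. That independence is what makes the Jacobi method a natural fit for data-parallel platforms such as the ones compared here.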
Acknowledgements
This work has been jointly supported by the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under projects 00001/CS/2007 and 15290/PI/2010 and under fellowship 12461/FPI/09, and by the Spanish MICINN and European Commission FEDER funds under projects Consolider Ingenio-2010 CSD2006-00046 and TIN2009-14475-C04. We also thank NVIDIA for the hardware donation under the Professor Partnership 2008–2010 and the CUDA Teaching Center Award 2011–2012.
Cite this article
Cecilia, J.M., Abellán, J.L., Fernández, J. et al. Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE. J Supercomput 62, 787–803 (2012). https://doi.org/10.1007/s11227-012-0749-y
DOI: https://doi.org/10.1007/s11227-012-0749-y