Abstract
The Cell Broadband Engine (Cell BE) is a heterogeneous multi-core processor specifically designed to exploit thread-level parallelism. Its memory model comprises a common shared main memory and eight small private local memories. Programming the Cell BE involves managing multiple threads and explicit data movement through DMA transfers, which makes the task very challenging. The situation gets even worse on dual Cell-based blades. In this context, fast and efficient collective primitives are indispensable to reduce complexity and optimize performance.
In this paper, we describe the design and implementation of three collective operations: barrier, broadcast, and reduce. Their design takes into account the architectural peculiarities and asymmetries of dual Cell-based blades, while their implementation requires minimal resources: a signal register and a buffer. Experimental results on a dual Cell-based blade with 16 SPEs show low latencies and high bandwidths: a synchronization latency of 637 ns, a broadcast bandwidth of 38.33 GB/s for 16 KB messages, and a reduce latency of 1535 ns with 32 floats.
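To make the idea of a barrier built on a single signal register more concrete, the following is a minimal host-side sketch, not the implementation described in the paper. It assumes a centralized gather-and-release scheme and models the signal register as a 32-bit word written in OR mode; the names `SignalRegister`, `write_or`, `wait_for`, and `run_barrier` are hypothetical, and Python threads stand in for SPEs.

```python
import threading


class SignalRegister:
    """Hypothetical software stand-in for a 32-bit signal register in OR mode."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def write_or(self, bits):
        # A write ORs the new bits into the current register contents.
        with self._cond:
            self._value = (self._value | bits) & 0xFFFFFFFF
            self._cond.notify_all()

    def wait_for(self, mask):
        # Block until every bit in mask is set, then clear those bits.
        with self._cond:
            self._cond.wait_for(lambda: self._value & mask == mask)
            self._value &= ~mask


def run_barrier(num_workers=16, rounds=3):
    """Run several barrier rounds and return the (round, rank) arrival log."""
    regs = [SignalRegister() for _ in range(num_workers)]
    all_arrived = (1 << num_workers) - 1
    log = []
    log_lock = threading.Lock()

    def worker(rank):
        for rnd in range(rounds):
            with log_lock:
                log.append((rnd, rank))
            # Arrival: set this worker's bit in the root's register.
            regs[0].write_or(1 << rank)
            if rank == 0:
                # Root waits for all bits, then releases every peer.
                regs[0].wait_for(all_arrived)
                for peer in range(1, num_workers):
                    regs[peer].write_or(1)
            else:
                # Non-root workers wait for the release bit in their own register.
                regs[rank].wait_for(1)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    arrivals = run_barrier(16, 3)
    print(len(arrivals), "arrivals recorded")
```

Because no worker can leave a round before all workers have signalled their arrival, the round numbers in the returned log never decrease. On real hardware the waiting would be done on the SPE's signal notification channel rather than on a condition variable, and a tree-based scheme might be preferred over this flat one.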
This work has been jointly supported by the Spanish MEC under grant “TIN2006-15516-C04-03” and by European Commission FEDER funds under grant “Consolider Ingenio-2010 CSD2006-00046”. Epifanio Gaona is supported by fellowship 09503/FPI/08 from Comunidad Autónoma de la Región de Murcia (Fundación Séneca, Agencia Regional de Ciencia y Tecnología).
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gaona, E., Fernández, J., Acacio, M.E. (2009). Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades. In: Sips, H., Epema, D., Lin, HX. (eds) Euro-Par 2009 Parallel Processing. Euro-Par 2009. Lecture Notes in Computer Science, vol 5704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03869-3_83
DOI: https://doi.org/10.1007/978-3-642-03869-3_83
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03868-6
Online ISBN: 978-3-642-03869-3
eBook Packages: Computer Science, Computer Science (R0)