Abstract
The MapReduce model is a parallel programming model initially developed for large-scale web content processing. Data analysis faces the problem of how to compute over extremely large datasets, and the arrival of MapReduce makes it possible to use commodity hardware for massively parallel data analysis applications. Translating relational algebra operators into MapReduce programs, and optimizing the result, is still an open and active research field. In this paper, we focus on a special type of data analysis query, namely the multiple group by query. We first study the communication cost of the MapReduce model, then give an initial implementation of the multiple group by query. We then propose an optimized version that addresses the communication cost issues. The optimized version shows better speedup and better scalability than the initial implementation.
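To make the query type concrete, the following is a minimal single-process sketch of a multiple group by query expressed as map, shuffle, and reduce steps. It is an illustration, not the implementation evaluated in the paper: the function names, the sample records, and the choice of SUM as the aggregate are all assumptions. Each input record emits one intermediate pair per grouping column, so the number of pairs sent from mappers to reducers grows with both the number of records and the number of grouping columns, which is where communication cost arises.

```python
from collections import defaultdict


def map_record(record, group_by_columns, measure):
    """Map step: emit one ((column, value), measure) pair per grouping column."""
    for column in group_by_columns:
        yield (column, record[column]), record[measure]


def shuffle(pairs):
    """Group intermediate values by key, as a MapReduce framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: aggregate all values sharing one key (SUM is an assumption)."""
    return key, sum(values)


def multiple_group_by(records, group_by_columns, measure):
    """Answer several GROUP BY queries over the same data in one pass."""
    pairs = (
        pair
        for record in records
        for pair in map_record(record, group_by_columns, measure)
    )
    return dict(reduce_group(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample data: two grouping columns and one measure.
records = [
    {"region": "north", "product": "a", "sales": 10},
    {"region": "north", "product": "b", "sales": 5},
    {"region": "south", "product": "a", "sales": 7},
]

result = multiple_group_by(records, ["region", "product"], "sales")
for key in sorted(result):
    print(key, result[key])
```

With three records and two grouping columns, the map step emits six intermediate pairs. In a real cluster those pairs cross the network during the shuffle.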
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pan, J., Magoulès, F., Le Biannic, Y. (2010). Executing Multiple Group by Query Using MapReduce Approach: Implementation and Optimization. In: Bellavista, P., Chang, RS., Chao, HC., Lin, SF., Sloot, P.M.A. (eds) Advances in Grid and Pervasive Computing. GPC 2010. Lecture Notes in Computer Science, vol 6104. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13067-0_67
DOI: https://doi.org/10.1007/978-3-642-13067-0_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13066-3
Online ISBN: 978-3-642-13067-0
eBook Packages: Computer Science, Computer Science (R0)