Abstract
MapReduce is the most popular framework for distributed processing. Recently, the scalability of data mining and machine learning algorithms has significantly improved with help from MapReduce. However, MapReduce does not handle iterative algorithms very efficiently. The problem is that many data mining and machine learning algorithms are iterative by nature. In order to overcome the limitations of MapReduce, many advanced distributed systems have been developed, including HaLoop, iMapReduce, Twister, and Spark. In this paper, we identify and categorize the limitations of MapReduce in handling iterative algorithms, and then, experimentally investigate the consequences of these limitations by using the most flexible and stable distributed system, Spark. According to our experiment results, the network I/O overhead was the primary factor that affected system performance the most. The disk I/O overhead also affected system performance, but it was less significant than the network I/O overhead. For the synchronization overhead, it affected system performance only when the static data was not cached.









Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Doulkeridis, C., Nørvåg, K.: A survey of large-scale analytical query processing in MapReduce. VLDB J. Int. J. Very Large Data Bases 23(3), 355–380 (2014)
Lee, S., Kim, J., Moon, Y.S., Lee, W.: Efficient level-based top-down data cube computation using MapReduce. Trans. Large-Scale Data-Knowl.-Cent. Syst. XXI, pp. 1–9 (2015)
Shim, K.: MapReduce algorithms for big data analysis. Proc. VLDB Endow. 5(12), 2016–2017 (2012)
Apache. Apache Hadoop. https://hadoop.apache.org/
Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3(1–2), 285–296 (2010)
Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. Int. J. Very Large Data Bases 21(2), 169–190 (2012)
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: (2010, April) MapReduce Online. NSDI 10(4), 20 (2010)
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818 (2010, June)
Lee, H., Kang, M., Youn, S.B., Lee, J. G., Kwon, Y.: An experimental comparison of iterative MapReduce frameworks. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 2089–2094 (2016, October)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, vol. 10, pp. 10 (2010, June)
Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapreduce: a distributed computing framework for iterative computation. J. Grid Comput.10(1), 47–68 (2012)
Jiang, X., Li, C., Sun, J.: A modified K-means clustering for mining of multimedia databases based on dimensionality reduction and similarity measures. Clust. Comput. 1–8 (2017)
Miner, D., Shook, A.: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, Inc. (2012)
Li, M., Tan, J., Wang, Y., Zhang, L., Salapura, V.: SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics. Clust. Comput. 1–15 (2017)
Kang, M., Lee, J.: A comparative analysis of iterative MapReduce systems. In: Proceedings of the 6th International Conference on Emerging Databases: Technologies, Applications, and Theory (EDB), pp. 61–64 (2016)
Lin, J., Dyer, C.: Data-intensive text processing with MapReduce. Synth. Lect. Hum. Lang. Technol. 3(1), 1–177 (2010)
Apache. Apache Spark. https://spark.apache.org/
Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146 (2010, June)
Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data, June (2014)
The Lemur Project. The ClueWeb09 Collection. http://lemurproject.org/clueweb09, May (2011)
Kwon, Y., Nunley, D., Gardner, J.P., Balazinska, M., Howe, B., Loebman, S.: Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. In: Scientific and Statistical Database Management, pp. 132–150. Springer, Berlin, Heidelberg (2010, January)
Kim, J., Lee, W., Song, J.J., Lee, S.B.: Optimized combinatorial clustering for stochastic processes. Clust. Comput. 20(2), 1135–1148 (2017)
Chu, C., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. Adv. Neural Inf. Process. Syst. 6, 281–288 (2007)
Karau, H., Warren, R.: High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark. O’Reilly Media, Inc. (2016)
Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G., ICSI, V.: Making sense of performance in data analytics frameworks. NSDI 15, 293–307 (2015, May)
Han, M., Daudjee, K., Ammar, K., Özsu, M.T., Wang, X., Jin, T.: An experimental comparison of pregel-like graph processing systems. Proc. VLDB Endow. 7(12), 1047–1058 (2014)
Acknowledgements
This research, “Geospatial Big Data Management, Analysis and Service Platform Technology Development”, was supported by the MOLIT(The Ministry of Land, Infrastructure and Transport), Korea, under the national spatial information research program supervised by the KAIA(Korea Agency for Infrastructure Technology Advancement)”(17NSIP-B081011-04). In addition, this project was supported by Microsoft Research through “Azure for Research” global RFP program.
Author information
Authors and Affiliations
Corresponding author
Additional information
This paper is a revised and expanded version of a paper entitled ‘A Comparative Analysis of Iterative MapReduce Systems’ The paper proudly received the Runner-Up Paper Award. Presented at the 6th International Conference on Emerging Databases: Technologies, Applications, and Theory (EDB 2016), 17–19 October 2016, Jeju Island, Korea.
Rights and permissions
About this article
Cite this article
Kang, M., Lee, JG. An experimental analysis of limitations of MapReduce for iterative algorithms on Spark. Cluster Comput 20, 3593–3604 (2017). https://doi.org/10.1007/s10586-017-1167-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10586-017-1167-y