Abstract
Many enterprises now extract actionable knowledge from huge datasets as part of their core business activities. Applications span very different domains, such as fraud detection and one-to-one marketing, and include business analytics and decision support in both the private and public sectors. The MapReduce framework, and in particular its open source implementation Apache Hadoop, holds a central place in these scenarios. Such environments raise new challenges in job performance prediction, driven by the need to provide Service Level Agreement guarantees to the end user and to avoid wasting computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, that estimate job performance under a number of scenarios of interest, including unreliable resources. We evaluate the accuracy of our models using the TPC-DS industry benchmark, with experiments run on Amazon EC2 and at the CINECA Italian supercomputing center. The results show that the average prediction error is in the range 9–14%.
Acknowledgments
This work has received funding from the European Union Horizon 2020 research and innovation program under grant agreement No. 644869 (DICE). Experimental data are available as open data at https://zenodo.org/record/58847#.V5i0wmXA45Q.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Ardagna, D., Bernardi, S., Gianniti, E., Karimian Aliabadi, S., Perez-Palacin, D., Requeno, J.I. (2016). Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets. In: Carretero, J., Garcia-Blas, J., Ko, R., Mueller, P., Nakano, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10048. Springer, Cham. https://doi.org/10.1007/978-3-319-49583-5_47
DOI: https://doi.org/10.1007/978-3-319-49583-5_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49582-8
Online ISBN: 978-3-319-49583-5
eBook Packages: Computer Science, Computer Science (R0)