Abstract
The improvement of Hadoop performance has received considerable attention from researchers in cloud computing. Most studies have focused on improving the performance of a Hadoop cluster. Notably, Hadoop requires various configuration parameters, which must be adjusted to improve performance. This paper proposes a mechanism that improves Hadoop performance through job scheduling and through resource allocation and utilization. Specifically, we present an improved ant colony optimization method that schedules jobs according to job size and expected execution time. Priority is given to the job with the minimum data size and minimum response time. The resource usage and running jobs of each data node are predicted using an artificial neural network, and job activity and resource usage are monitored using the resource manager. Moreover, we enhance the performance of the Hadoop name node by adding an aggregator node to the default HDFS framework architecture. The modified architecture involves four entities: the name node, the secondary name node, aggregator nodes, and data nodes. The aggregator node is responsible for assigning jobs among the data nodes, and the name node tracks only the aggregator nodes. We test the overall scheme on Amazon EC2 and S3 and report throughput and CPU response time for different data sizes. Finally, we show that the proposed approach achieves a significant improvement compared with native Hadoop and other approaches.
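The priority rule stated in the abstract (smallest data size and shortest response time first) can be illustrated with a minimal sketch. This is not the paper's ant colony optimization method: it is a plain deterministic sort standing in for it. The `Job` fields, the job names, and the choice to break ties on data size by expected time are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str             # hypothetical job identifier
    data_size_mb: float   # input data size (assumed unit: MB)
    expected_secs: float  # expected execution time (assumed unit: seconds)


def priority_order(jobs):
    """Order jobs by smallest data size, then shortest expected time.

    Assumption: the two criteria are combined lexicographically; the
    abstract does not say how the paper weighs them against each other.
    """
    return sorted(jobs, key=lambda j: (j.data_size_mb, j.expected_secs))


# Hypothetical workload used only to show the ordering.
jobs = [
    Job("wordcount", 512.0, 40.0),
    Job("grep", 128.0, 15.0),
    Job("sort", 128.0, 30.0),
]

ordered = [j.name for j in priority_order(jobs)]
print(ordered)
```

In this sketch the two 128 MB jobs come before the 512 MB job, and the one with the shorter expected time runs first.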
Change history
28 September 2023
A Correction to this paper has been published: https://doi.org/10.1007/s42979-023-02168-3
Acknowledgements
This research was supported by the MIST (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW supervised by the IITP (Institute for Information & communications Technology Promotion) (2017-0-00091). This work was also supported by the “Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), with financial resources granted by the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20174030201740).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Alanazi, R., Alhazmi, F., Chung, H. et al. A Multi-Optimization Technique for Improvement of Hadoop Performance with a Dynamic Job Execution Method Based on Artificial Neural Network. SN COMPUT. SCI. 1, 184 (2020). https://doi.org/10.1007/s42979-020-00182-3
DOI: https://doi.org/10.1007/s42979-020-00182-3