Abstract
In recent years, the volume of information has grown faster than ever before, moving from small datasets to huge volumes of data. This growth has forced researchers to look for new ways to process and store data, since traditional techniques are limited by the size and structure of the information. At the same time, the parallel computing power of processors has steadily increased, from single-processor architectures to architectures with multiple processors, cores, and threads. This development has allowed machine learning techniques to exploit the parallel processing capabilities of new architectures on large volumes of data. This paper reviews parallel machine learning approaches applied to large volumes of data and proposes a classification of them based on the hardware infrastructure they use.
Acknowledgements
This work has been funded by the Spanish Government grant TIN2016-76515-R for the COMBAHO project, supported with FEDER funds.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Salvador, J., Ruiz, Z., Garcia-Rodriguez, J. (2017). Big Data Infrastructure: A Survey. In: Ferrández Vicente, J., Álvarez-Sánchez, J., de la Paz López, F., Toledo Moreo, J., Adeli, H. (eds) Biomedical Applications Based on Natural and Artificial Computing. IWINAC 2017. Lecture Notes in Computer Science, vol 10338. Springer, Cham. https://doi.org/10.1007/978-3-319-59773-7_26
DOI: https://doi.org/10.1007/978-3-319-59773-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59772-0
Online ISBN: 978-3-319-59773-7
eBook Packages: Computer Science (R0)