Big Data Infrastructure: A Survey

  • Conference paper
Biomedical Applications Based on Natural and Artificial Computing (IWINAC 2017)

Abstract

In recent years, the volume of information has grown faster than ever before, moving from small datasets to huge volumes of data. This growth has forced researchers to look for new ways to process and store data, since traditional techniques are limited by the size and structure of the information. At the same time, the parallel computing power of new processors has gradually increased, from single-processor architectures to multiple processors, cores, and threads. This has allowed machine learning techniques to exploit the parallel processing capabilities of new architectures on large volumes of data. This paper reviews works that apply parallel machine learning approaches to large volumes of data and proposes a classification of them based on the hardware infrastructure they use.

Notes

  1. http://mahout.apache.org/

  2. http://spark.apache.org/mllib/

  3. https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/index.html

  4. http://hadoop.apache.org/

  5. http://spark.apache.org/

  6. http://www.h2o.ai/h2o/

  7. http://storm.apache.org/

  8. https://flink.apache.org/

  9. http://www.cs.waikato.ac.nz/ml/weka/

  10. https://rapidminer.com/

Acknowledgements

This work was funded by the Spanish Government grant TIN2016-76515-R for the COMBAHO project, with support from FEDER funds.

Author information

Corresponding author

Correspondence to Jose Garcia-Rodriguez.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Salvador, J., Ruiz, Z., Garcia-Rodriguez, J. (2017). Big Data Infrastructure: A Survey. In: Ferrández Vicente, J., Álvarez-Sánchez, J., de la Paz López, F., Toledo Moreo, J., Adeli, H. (eds) Biomedical Applications Based on Natural and Artificial Computing. IWINAC 2017. Lecture Notes in Computer Science, vol 10338. Springer, Cham. https://doi.org/10.1007/978-3-319-59773-7_26

  • DOI: https://doi.org/10.1007/978-3-319-59773-7_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59772-0

  • Online ISBN: 978-3-319-59773-7

  • eBook Packages: Computer Science, Computer Science (R0)
