{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T15:42:12Z","timestamp":1740152532411,"version":"3.37.3"},"reference-count":32,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2018,12,17]],"date-time":"2018-12-17T00:00:00Z","timestamp":1545004800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"The Hammer prototype is a query engine for corpora of Open Data that provides users with the concept of blind querying. Since data sets published on Open Data portals are heterogeneous, users wishing to find out interesting data sets are blind: queries cannot be fully specified, as in the case of databases. Consequently, the query engine is responsible for rewriting and adapting the blind query to the actual data sets, by exploiting lexical and semantic similarity. The effectiveness of this approach was discussed in our previous works. In this paper, we report our experience in developing the query engine. In fact, in the very first version of the prototype, we realized that the implementation of the retrieval technique was too slow, even though corpora contained only a few thousands of data sets. We decided to adopt the Map-Reduce paradigm, in order to parallelize the query engine and improve performances. We passed through several versions of the query engine, either based on the Hadoop framework or on the Spark framework. Hadoop and Spark are two very popular frameworks for writing and executing parallel algorithms based on the Map-Reduce paradigm. 
In this paper, we present our study about the impact of adopting the Map-Reduce approach and its two most famous frameworks to parallelize the Hammer query engine; we discuss various implementations of the query engine, either obtained without significantly rewriting the algorithm or obtained by completely rewriting the algorithm by exploiting high level abstractions provided by Spark. The experimental campaign we performed shows the benefits provided by each studied solution, with the perspective of moving toward Big Data in the future. The lessons we learned are collected and synthesized into behavioral guidelines for developers approaching the problem of parallelizing algorithms by means of Map-Reduce frameworks.","DOI":"10.3390\/a11120209","type":"journal-article","created":{"date-parts":[[2018,12,18]],"date-time":"2018-12-18T07:15:59Z","timestamp":1545117359000},"page":"209","source":"Crossref","is-referenced-by-count":5,"title":["Hadoop vs. Spark: Impact on Performance of the Hammer Query Engine for Open Data Corpora"],"prefix":"10.3390","volume":"11","author":[{"given":"Mauro","family":"Pelucchi","sequence":"first","affiliation":[{"name":"Tabulaex, A Burning Glass Company, 20126 Milano, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9228-560X","authenticated-orcid":false,"given":"Giuseppe","family":"Psaila","sequence":"additional","affiliation":[{"name":"Dipartimento di Ingegneria Gestionale, dell\u2019Informazione e della Produzione (DIGIP), University of Bergamo, 24044 Dalmine, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5769-1311","authenticated-orcid":false,"given":"Maurizio","family":"Toccu","sequence":"additional","affiliation":[{"name":"Dipartimento di Ingegneria Gestionale, dell\u2019Informazione e della Produzione (DIGIP), University of Bergamo, 24044 Dalmine, Italy"}]}],"member":"1968","published-online":{"date-parts":[[2018,12,17]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Pelucchi, M., Psaila, 
G., and Toccu, M.P. (2017, January 25\u201327). Building a query engine for a corpus of open data. Proceedings of the 13th International Conference on Web Information Systems and Technologies (WEBIST 2017), Porto, Portugal.","DOI":"10.5220\/0006308801260136"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Pelucchi, M., Psaila, G., and Toccu, M. (2017, January 25\u201327). Enhanced Querying of Open Data Portals. Proceedings of the International Conference on Web Information Systems and Technologies (WEBIST 2017), Porto, Portugal.","DOI":"10.1007\/978-3-319-93527-0_9"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Pelucchi, M., Psaila, G., and Maurizio, T. (2017, January 24\u201326). The challenge of using Map-Reduce to query open data. Proceedings of the 6th International Conference on Data Science, Technology and Applications (DATA 2017), Madrid, Spain.","DOI":"10.5220\/0006487803310342"},{"key":"ref_4","unstructured":"Braunschweig, K., Eberius, J., Thiele, M., and Lehner, W. (2012, January 16\u201320). The State of Open Data. Proceedings of the 21st World Wide Web 2012 (WWW2012) Conference, Lyon, France."},{"key":"ref_5","unstructured":"Liu, J., Dong, X., and Halevy, A.Y. (2006, January 30). Answering Structured Queries on Unstructured Data. Proceedings of the WebDB 2006, Chicago, IL, USA."},{"key":"ref_6","unstructured":"Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (June, January 29). FedX: A federation layer for distributed query processing on linked open data. Proceedings of the Extended Semantic Web Conference, Heraklion, Crete, Greece."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1145\/1327452.1327492","article-title":"MapReduce: Simplified data processing on large clusters","volume":"51","author":"Dean","year":"2008","journal-title":"Commun. ACM"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Shvachko, K., Kuang, H., Radia, S., and Chansler, R. 
(2010, January 3\u20137). The hadoop distributed file system. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA.","DOI":"10.1109\/MSST.2010.5496972"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"285","DOI":"10.14778\/1920841.1920881","article-title":"HaLoop: Efficient iterative data processing on large clusters","volume":"3","author":"Bu","year":"2010","journal-title":"Proc. VLDB Endow."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Borthakur, D., Gray, J., Sarma, J.S., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., Ranganathan, K., Molkov, D., Menon, A., and Rash, S. (2011, January 12\u201316). Apache Hadoop goes realtime at Facebook. Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece.","DOI":"10.1145\/1989323.1989438"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1145\/2934664","article-title":"Apache spark: A unified engine for big data processing","volume":"59","author":"Zaharia","year":"2016","journal-title":"Commun. ACM"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Gu, L., and Li, H. (2013, January 13\u201315). Memory or time: Performance evaluation for iterative operation on hadoop and spark. Proceedings of the 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), Zhangjiajie, China.","DOI":"10.1109\/HPCC.and.EUC.2013.106"},{"key":"ref_13","unstructured":"Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., and Isard, M. (2016, January 2\u20134). Tensorflow: A system for large-scale machine learning. Proceedings of the OSDI 2016, Savannah, GA, USA."},{"key":"ref_14","unstructured":"Low, Y., Gonzalez, J.E., Kyrola, A., Bickson, D., Guestrin, C.E., and Hellerstein, J. (arXiv, 2014). 
Graphlab: A new framework for parallel machine learning, arXiv."},{"key":"ref_15","unstructured":"Burdick, D.R., Ghoting, A., Krishnamurthy, R., Pednault, E.P.D., Reinwald, B., Sindhwani, V., Tatikonda, S., Tian, Y., and Vaithyanathan, S. (2013). Systems and Methods for Processing Machine Learning Algorithms in a MapReduce Environment. (8,612,368), U.S. Patent."},{"key":"ref_16","first-page":"1235","article-title":"Mllib: Machine learning in apache spark","volume":"17","author":"Meng","year":"2016","journal-title":"J. Mach. Learn. Res."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1109\/TKDE.2013.109","article-title":"Data mining with big data","volume":"26","author":"Wu","year":"2014","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Lin, X. (2014, January 27\u201329). Mr-apriori: Association rules algorithm based on mapreduce. Proceedings of the 2014 5th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China.","DOI":"10.1109\/ICSESS.2014.6933531"},{"key":"ref_19","unstructured":"Oruganti, S., Ding, Q., and Tabrizi, N. (June, January 27). Exploring Hadoop as a platform for distributed association rule mining. Proceedings of the Fifth International Conference on Future Computational Technologies and Applications (FUTURE COMPUTING 2013), Valencia, Spain."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Chang, L., Wang, Z., Ma, T., Jian, L., Ma, L., Goldshuv, A., Lonergan, L., Cohen, J., Welton, C., and Sherry, G. (2014, January 22\u201327). HAWQ: A massively parallel processing SQL engine in hadoop. 
Proceedings of the 2014 ACM SIGMOD International Conference On Management of Data, Snowbird, UT, USA.","DOI":"10.1145\/2588555.2595636"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"489","DOI":"10.1007\/s10515-013-0135-x","article-title":"JackHare: A framework for SQL to NoSQL translation using MapReduce","volume":"21","author":"Chung","year":"2014","journal-title":"Autom. Softw. Eng."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Kononenko, O., Baysal, O., Holmes, R., and Godfrey, M. (2014, January 29\u201330). Mining Modern Repositories with Elasticsearch. Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), Hyderabad, India.","DOI":"10.1145\/2597073.2597091"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Shahi, D. (2015). Apache Solr: An Introduction. Apache Solr, Springer.","DOI":"10.1007\/978-1-4842-1070-3"},{"key":"ref_24","unstructured":"Croft, W.B., Metzler, D., and Strohman, T. (2010). Search Engines: Information Retrieval in Practice, Addison-Wesley Reading."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Manning, C.D., Raghavan, P., and Sch\u00fctze, H. (2008). Introduction to Information Retrieval, Cambridge University Press.","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_26","unstructured":"Winkler, W.E. (1999). The State of Record Linkage and Current Research Problems."},{"key":"ref_27","unstructured":"White, T. (2012). Hadoop: The Definitive Guide, O\u2019Reilly Media, Inc."},{"key":"ref_28","first-page":"21","article-title":"The hadoop distributed file system: Architecture and design","volume":"11","author":"Borthakur","year":"2007","journal-title":"Hadoop Proj. Website"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., and Seth, S. (2013, January 1\u20133). Apache hadoop yarn: Yet another resource negotiator. 
Proceedings of the 4th annual Symposium on Cloud Computing, Santa Clara, CA, USA.","DOI":"10.1145\/2523616.2523633"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., and Fox, G. (2010, January 21\u201325). Twister: A runtime for iterative mapreduce. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, IL, USA.","DOI":"10.1145\/1851476.1851593"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1007\/s10723-012-9204-9","article-title":"Imapreduce: A distributed computing framework for iterative computation","volume":"10","author":"Zhang","year":"2012","journal-title":"J. Grid Comput."},{"key":"ref_32","first-page":"95","article-title":"Spark: Cluster computing with working sets","volume":"10","author":"Zaharia","year":"2010","journal-title":"HotCloud"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/11\/12\/209\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,14]],"date-time":"2024-06-14T21:01:15Z","timestamp":1718398875000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/11\/12\/209"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,12,17]]},"references-count":32,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2018,12]]}},"alternative-id":["a11120209"],"URL":"https:\/\/doi.org\/10.3390\/a11120209","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2018,12,17]]}}}