{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,3,2]],"date-time":"2024-03-02T22:05:48Z","timestamp":1709417148810},"reference-count":31,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,1,19]],"date-time":"2021-01-19T00:00:00Z","timestamp":1611014400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,1,19]],"date-time":"2021-01-19T00:00:00Z","timestamp":1611014400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2021,12]]},"abstract":"The MapReduce programming paradigm is frequently used in order to process and analyse a huge amount of data. This paradigm relies on the ability to apply the same operation in parallel on independent chunks of data. The consequence is that the overall performances greatly depend on the way data are partitioned among the various computation nodes. The default partitioning technique, provided by systems like Hadoop or Spark, basically performs a random subdivision of the input records, without considering the nature and correlation between them. Even if such approach can be appropriate in the simplest case where all the input records have to be always analyzed, it becomes a limit for sophisticated analyses, in which correlations between records can be exploited to preliminarily prune unnecessary computations. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which takes care of data correlation in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). 
More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and the query execution times.","DOI":"10.1186\/s40537-021-00410-4","type":"journal-article","created":{"date-parts":[[2021,1,19]],"date-time":"2021-01-19T11:05:11Z","timestamp":1611054311000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["CoPart: a context-based partitioning technique for big data"],"prefix":"10.1186","volume":"8","author":[{"ORCID":"http:\/\/orcid.org\/0000-0003-3675-7243","authenticated-orcid":false,"given":"Sara","family":"Migliorini","sequence":"first","affiliation":[]},{"given":"Alberto","family":"Belussi","sequence":"additional","affiliation":[]},{"given":"Elisa","family":"Quintarelli","sequence":"additional","affiliation":[]},{"given":"Damiano","family":"Carra","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,1,19]]},"reference":[{"key":"410_CR1","unstructured":"White T. Hadoop: the definitive guide. 4th edn. O\u2019Reilly Media, Inc.; 2015."},{"key":"410_CR2","unstructured":"Chambers B, Zaharia M. Spark: the definitive guide big data processing made simple. 1st ed. O\u2019Reilly Media, Inc.; 2018."},{"issue":"4","key":"410_CR3","doi-asserted-by":"publisher","first-page":"785","DOI":"10.1007\/s10707-018-0325-6","volume":"22","author":"L Alarabi","year":"2018","unstructured":"Alarabi L, Mokbel MF, Musleh M. ST-Hadoop: a MapReduce framework for spatio-temporal data. GeoInformatica. 2018;22(4):785\u2013813.","journal-title":"GeoInformatica"},{"issue":"2","key":"410_CR4","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s10109-019-00292-4","volume":"21","author":"M Bakli","year":"2019","unstructured":"Bakli M, Sakr M, Soliman TH. 
HadoopTrajectory: a Hadoop spatiotemporal data processing extension. J Geogr Syst. 2019;21(2):211\u201335.","journal-title":"J Geogr Syst"},{"key":"410_CR5","doi-asserted-by":"crossref","unstructured":"Beck M, Hao W, Campan A. Accelerating the mobile cloud: using amazon mobile analytics and k-means clustering. In: 2017 IEEE 7th annual computing and communication workshop and conference (CCWC); 2017. p. 1\u20137.","DOI":"10.1109\/CCWC.2017.7868372"},{"issue":"2","key":"410_CR6","doi-asserted-by":"publisher","first-page":"322","DOI":"10.1145\/93605.98741","volume":"19","author":"N Beckmann","year":"1990","unstructured":"Beckmann N, Kriegel HP, Schneider R, Seeger B. The r*-tree: an efficient and robust access method for points and rectangles. SIGMOD Rec. 1990;19(2):322\u201331. https:\/\/doi.org\/10.1145\/93605.98741.","journal-title":"SIGMOD Rec"},{"key":"410_CR7","unstructured":"Belussi A, Carra D, Migliorini S, Negri M, Pelagatti G. What makes spatial data big? A discussion on how to partition spatial data. In: 10th international conference on geographic information science (GIScience 2018); 2018, p. 2:1\u20135."},{"issue":"3","key":"410_CR8","doi-asserted-by":"publisher","first-page":"523","DOI":"10.1007\/s10707-011-0140-9","volume":"16","author":"A Belussi","year":"2012","unstructured":"Belussi A, Migliorini S. A framework for integrating multi-accuracy spatial data in geographical applications. Geoinformatica. 2012;16(3):523\u201361.","journal-title":"Geoinformatica"},{"key":"410_CR9","doi-asserted-by":"crossref","unstructured":"Belussi A, Migliorini S, Eldawy A. Detecting skewness of big spatial data in SpatialHadoop. In: Proceedings of the 26th ACM SIGSPATIAL international conference on advances in geographic information systems; 2018, p. 
432\u20135.","DOI":"10.1145\/3274895.3274923"},{"issue":"4","key":"410_CR10","doi-asserted-by":"publisher","first-page":"201","DOI":"10.3390\/ijgi9040201","volume":"9","author":"A Belussi","year":"2020","unstructured":"Belussi A, Migliorini S, Eldawy A. Skewness-based partitioning in spatialHadoop. ISPRS Int J Geo-Inf. 2020;9(4):201. https:\/\/doi.org\/10.3390\/ijgi9040201.","journal-title":"ISPRS Int J Geo-Inf"},{"key":"410_CR11","doi-asserted-by":"crossref","unstructured":"Belussi A, Migliorini S, Negri M, Pelagatti G. Validation of spatial integrity constraints in city models. In: Proceedings of the 4th ACM SIGSPATIAL international workshop on mobile geographic information systems; 2015, p. 70\u20139.","DOI":"10.1145\/2834126.2834137"},{"issue":"1","key":"410_CR12","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1016\/j.is.2012.05.004","volume":"38","author":"C Bolchini","year":"2013","unstructured":"Bolchini C, Quintarelli E, Tanca L. CARVE: context-aware automatic view definition over relational databases. Inf Syst. 2013;38(1):45\u201367.","journal-title":"Inf Syst"},{"issue":"1","key":"410_CR13","doi-asserted-by":"publisher","first-page":"87","DOI":"10.1609\/aimag.v16i1.1127","volume":"16","author":"P Br\u00e9zillon","year":"1995","unstructured":"Br\u00e9zillon P, Abu-Hakima S. Using knowledge in its context: report on the IJCAI-93 workshop. AI Mag. 1995;16(1):87\u201391. https:\/\/doi.org\/10.1609\/aimag.v16i1.1127.","journal-title":"AI Mag"},{"key":"410_CR14","doi-asserted-by":"publisher","unstructured":"Curino C, Zhang Y, Jones EPC, Madden S. Schism: a workload-driven approach to database replication and partitioning. In: Proceedings of the VLDB endow. 2010; 3(1): 48\u201357. https:\/\/doi.org\/10.14778\/1920841.1920853. 
http:\/\/www.vldb.org\/pvldb\/vldb2010\/pvldb_vol3\/R04.pdf.","DOI":"10.14778\/1920841.1920853"},{"issue":"5","key":"410_CR15","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1080\/02693799108927841","volume":"2","author":"MJ Egenhofer","year":"1991","unstructured":"Egenhofer MJ, Franzosa R. Point-set topological spatial relations. Int J Geogr Inf Syst. 1991;2(5):161\u201374.","journal-title":"Int J Geogr Inf Syst"},{"issue":"12","key":"410_CR16","doi-asserted-by":"publisher","first-page":"1602","DOI":"10.14778\/2824032.2824057","volume":"8","author":"A Eldawy","year":"2015","unstructured":"Eldawy A, Alarabi L, Mokbel MF. Spatial partitioning techniques in SpatialHadoop. Proc VLDB Endow. 2015;8(12):1602\u20135. https:\/\/doi.org\/10.14778\/2824032.2824057","journal-title":"Proc VLDB Endow"},{"key":"410_CR17","doi-asserted-by":"crossref","unstructured":"Eldawy A, Mokbel MF. SpatialHadoop: a mapreduce framework for spatial data. In: 2015 IEEE 31st international conference on data engineering; 2015, p. 1352\u201363.","DOI":"10.1109\/ICDE.2015.7113382"},{"issue":"2","key":"410_CR18","doi-asserted-by":"publisher","first-page":"177","DOI":"10.1145\/335191.335412","volume":"29","author":"C Faloutsos","year":"2000","unstructured":"Faloutsos C, Seeger B, Traina A, Traina C Jr. Spatial join selectivity using power laws. SIGMOD Rec. 2000;29(2):177\u201388.","journal-title":"SIGMOD Rec"},{"key":"410_CR19","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1016\/j.is.2014.07.006","volume":"47","author":"IAT Hashem","year":"2015","unstructured":"Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU. The rise of \u201cbig data\u201d on cloud computing: review and open research issues. Inf Syst. 2015;47:98\u2013115. 
https:\/\/doi.org\/10.1016\/j.is.2014.07.006.","journal-title":"Inf Syst"},{"key":"410_CR20","doi-asserted-by":"publisher","first-page":"164229","DOI":"10.1109\/ACCESS.2019.2945338","volume":"7","author":"JH Huh","year":"2019","unstructured":"Huh JH, Seo YS. Understanding edge computing: engineering evolution with artificial intelligence. IEEE Access. 2019;7:164229\u201345.","journal-title":"IEEE Access"},{"issue":"Suppl 1","key":"410_CR21","doi-asserted-by":"publisher","first-page":"1011","DOI":"10.1007\/s10586-017-1183-y","volume":"22","author":"CV Huynh","year":"2019","unstructured":"Huynh CV, Huh J. B+-tree construction on massive data with hadoop. Clust Comput. 2019;22(Suppl 1):1011\u201321. https:\/\/doi.org\/10.1007\/s10586-017-1183-y.","journal-title":"Clust Comput"},{"key":"410_CR22","doi-asserted-by":"publisher","unstructured":"Ienco D, Pensa RG, Meo R. Context-based distance learning for categorical data clustering. In: Adams NM, Robardet C, Siebes A, Boulicaut J, editors. Advances in intelligent data analysis VIII, 8th international symposium on intelligent data analysis, IDA 2009, Lyon, France, August 31\u2013September 2, 2009. proceedings, Lecture Notes in Computer Science, vol. 5772. Berlin: Springer; 2009, p. 83\u201394. https:\/\/doi.org\/10.1007\/978-3-642-03915-7_8.","DOI":"10.1007\/978-3-642-03915-7_8"},{"issue":"8","key":"410_CR23","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1145\/1536616.1536632","volume":"52","author":"A Jacobs","year":"2009","unstructured":"Jacobs A. The pathologies of big data. Commun ACM. 2009;52(8):36\u201344. https:\/\/doi.org\/10.1145\/1536616.1536632.","journal-title":"Commun ACM"},{"issue":"6","key":"410_CR24","doi-asserted-by":"publisher","first-page":"845","DOI":"10.1007\/s00778-014-0362-1","volume":"23","author":"KA Kumar","year":"2014","unstructured":"Kumar KA, Quamar A, Deshpande A, Khuller S. SWORD: workload-aware data placement and replica selection for cloud data management systems. VLDB J. 
2014;23(6):845\u201370. https:\/\/doi.org\/10.1007\/s00778-014-0362-1.","journal-title":"VLDB J"},{"key":"410_CR25","doi-asserted-by":"crossref","unstructured":"Migliorini S, Belussi A, Negri M, Pelagatti G. Towards massive spatial data validation with SpatialHadoop. In: Proceedings of the 5th ACM SIGSPATIAL international workshop on analytics for big geospatial data; 2016, p. 18\u201327.","DOI":"10.1145\/3006386.3006392"},{"key":"410_CR26","doi-asserted-by":"publisher","unstructured":"Migliorini S, Belussi A, Quintarelli E, Carra D. A context-based approach for partitioning big data. In: Proceedings of the 23rd international conference on extending database technology, EDBT 2020; 2020, p. 431\u20134. OpenProceedings.org. https:\/\/doi.org\/10.5441\/002\/edbt.2020.50.","DOI":"10.5441\/002\/edbt.2020.50"},{"key":"410_CR27","doi-asserted-by":"publisher","unstructured":"Mountasser I, Ouhbi B, Frikh B. Hybrid large-scale ontology matching strategy on big data environment. In: Anderst-Kotsis G, editor. Proceedings of the 18th international conference on information integration and web-based applications and services, iiWAS 2016, Singapore, November 28\u201330. New York: ACM; 2016, p. 282\u20137. https:\/\/doi.org\/10.1145\/3011141.3011185.","DOI":"10.1145\/3011141.3011185"},{"key":"410_CR28","doi-asserted-by":"crossref","unstructured":"Ramdane Y, Boussaid O, Kabachi N, Bentayeb F. Partitioning and bucketing techniques to speed up query processing in spark-sql. In: 2018 IEEE 24th international conference on parallel and distributed systems (ICPADS); 2018, p. 142\u201351.","DOI":"10.1109\/PADSW.2018.8644891"},{"key":"410_CR29","doi-asserted-by":"publisher","unstructured":"Sun L, Franklin MJ, Krishnan S, Xin RS. Fine-grained partitioning for aggressive data skipping. In: Dyreson CE, Li F, \u00d6zsu MT, editors. International conference on management of data, SIGMOD 2014, Snowbird, UT, USA, June 22\u201327, 2014. New York: ACM; 2014, p. 1115\u201326. 
https:\/\/doi.org\/10.1145\/2588555.2610515.","DOI":"10.1145\/2588555.2610515"},{"issue":"1","key":"410_CR30","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1109\/TKDE.2013.109","volume":"26","author":"X Wu","year":"2014","unstructured":"Wu X, Zhu X, Wu G, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng. 2014;26(1):97\u2013107. https:\/\/doi.org\/10.1109\/TKDE.2013.109.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"1","key":"410_CR31","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1007\/s10707-018-0330-9","volume":"23","author":"J Yu","year":"2019","unstructured":"Yu J, Zhang Z, Sarwat M. Spatial data management in apache spark: the geospark perspective and beyond. Geoinformatica. 2019;23(1):37\u201378.","journal-title":"Geoinformatica"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-021-00410-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s40537-021-00410-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-021-00410-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,1,19]],"date-time":"2021-01-19T11:09:48Z","timestamp":1611054588000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-021-00410-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,19]]},"references-count":31,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["410"],"URL":"https:\/\/doi.org\/10.1186\/s40537-021-00410-4","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-91158\/v1","asserted-by":"objec
t"}]},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,1,19]]},"assertion":[{"value":"8 October 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 January 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 January 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"This manuscript uses a real-world dataset containing the swipes of a city pass, called VeronaCard. Such data have been collected and provided by the tourist office of Verona, a municipality in Northern Italy. The used dataset contains only aggregated and anonymous information which cannot be directly associated with any specific tourist.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and Consent to participate"}},{"value":"Not applicable. The manuscript does not contain data which is traceable to any individual person.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"No financial and non-financial competing interests exist regarding this work.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"21"}}