{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,11,19]],"date-time":"2024-11-19T17:59:44Z","timestamp":1732039184669},"reference-count":44,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2020,5,2]],"date-time":"2020-05-02T00:00:00Z","timestamp":1588377600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,5,2]],"date-time":"2020-05-02T00:00:00Z","timestamp":1588377600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010665","name":"H2020 Marie Sklodowska-Curie Actions","doi-asserted-by":"publisher","award":["702527"],"id":[{"id":"10.13039\/100010665","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Knowl Inf Syst"],"published-print":{"date-parts":[[2020,8]]},"abstract":"A big challenge in the knowledge discovery process is to perform data pre-processing, specifically feature selection, on a large amount of data and high dimensional attribute set. A variety of techniques have been proposed in the literature to deal with this challenge with different degrees of success as most of these techniques need further information about the given input data for thresholding, need to specify noise levels or use some feature ranking procedures. To overcome these limitations, rough set theory (RST) can be used to discover the dependency within the data and reduce the number of attributes enclosed in an input data set while using the data alone and requiring no supplementary information. However, when it comes to massive data sets, RST reaches its limits as it is highly computationally expensive. 
In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically for feature selection, under the Spark framework. In our detailed experiments, data sets with up to 10,000 attributes have been considered, revealing that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance. Thus, making it relevant to big data.","DOI":"10.1007\/s10115-020-01467-y","type":"journal-article","created":{"date-parts":[[2020,5,2]],"date-time":"2020-05-02T07:02:20Z","timestamp":1588402940000},"page":"3321-3386","update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["A scalable and effective rough set theory-based approach for big data pre-processing"],"prefix":"10.1007","volume":"62","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-2551-6586","authenticated-orcid":false,"given":"Zaineb","family":"Chelly\u00a0Dagdia","sequence":"first","affiliation":[]},{"given":"Christine","family":"Zarges","sequence":"additional","affiliation":[]},{"given":"Ga\u00ebl","family":"Beck","sequence":"additional","affiliation":[]},{"given":"Mustapha","family":"Lebbah","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,5,2]]},"reference":[{"issue":"5","key":"1467_CR1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.5936\/csbj.201301010","volume":"4","author":"FM Afendi","year":"2013","unstructured":"Afendi FM, Ono N, Nakamura Y, Nakamura K, Darusman LK, Kibinge N, Morita AH, Tanaka K, Horai H, Altaf-Ul-Amin M et al (2013) Data mining methods for omics and knowledge of crude medicinal plants toward big data biology. 
Comput Struct Biotechnol J 4(5):1\u201314","journal-title":"Comput Struct Biotechnol J"},{"issue":"3","key":"1467_CR2","doi-asserted-by":"crossref","first-page":"6843","DOI":"10.1016\/j.eswa.2008.08.022","volume":"36","author":"MH Aghdam","year":"2009","unstructured":"Aghdam MH, Ghasem-Aghaee N, Basiri ME (2009) Text feature selection using ant colony optimization. Expert Syst Appl 36(3):6843\u20136853","journal-title":"Expert Syst Appl"},{"key":"1467_CR3","doi-asserted-by":"crossref","unstructured":"Ahmed S, Zhang M, Peng L (2013) Enhanced feature selection for biomarker discovery in LC-MS data using GP. In: Evolutionary computation (CEC), 2013 IEEE congress on. IEEE, pp 584\u2013591","DOI":"10.1109\/CEC.2013.6557621"},{"key":"1467_CR4","unstructured":"Asuncion A, Newman DJ (2007) UCI machine learning repository. http:\/\/www.ics.uci.edu\/~mlearn\/MLRepository.html"},{"issue":"1","key":"1467_CR5","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1016\/j.ijpe.2009.11.023","volume":"124","author":"C Bai","year":"2010","unstructured":"Bai C, Sarkis J (2010) Integrating sustainability into supplier selection with grey system and rough set methodologies. Int J Produ Econ 124(1):252\u2013264","journal-title":"Int J Produ Econ"},{"issue":"2","key":"1467_CR6","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1007\/s10115-017-1140-3","volume":"56","author":"V Bol\u00f3n-Canedo","year":"2018","unstructured":"Bol\u00f3n-Canedo V, Rego-Fern\u00e1ndez D, Peteiro-Barral D, Alonso-Betanzos A, Guijarro-Berdi\u00f1as B, S\u00e1nchez-Maro\u00f1o N (2018) On the scalability of feature selection methods on high-dimensional data. Knowl Inf Syst 56(2):395\u2013442","journal-title":"Knowl Inf Syst"},{"issue":"2","key":"1467_CR7","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1007\/s11036-013-0489-0","volume":"19","author":"M Chen","year":"2014","unstructured":"Chen M, Mao S, Liu Y (2014) Big data: a survey. 
Mobile Netw Appl 19(2):171\u2013209","journal-title":"Mobile Netw Appl"},{"key":"1467_CR8","doi-asserted-by":"publisher","unstructured":"Dagdia ZC, Zarges C, Beck G, Lebbah M (2017) A distributed rough set theory based algorithm for an efficient big data pre-processing under the spark framework. In: 2017 IEEE international conference on big data, BigData 2017, Boston, MA, USA, December 11\u201314, 2017, pp 911\u2013916. https:\/\/doi.org\/10.1109\/BigData.2017.8258008","DOI":"10.1109\/BigData.2017.8258008"},{"issue":"1\u20134","key":"1467_CR9","doi-asserted-by":"crossref","first-page":"131","DOI":"10.3233\/IDA-1997-1302","volume":"1","author":"M Dash","year":"1997","unstructured":"Dash M, Liu H (1997) Feature selection for classification. Intell Data Anal 1(1\u20134):131\u2013156","journal-title":"Intell Data Anal"},{"issue":"1","key":"1467_CR10","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1145\/1629175.1629198","volume":"53","author":"J Dean","year":"2010","unstructured":"Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72\u201377","journal-title":"Commun ACM"},{"issue":"28","key":"1467_CR11","first-page":"281","volume":"43","author":"I D\u00fcntsch","year":"2000","unstructured":"D\u00fcntsch I, Gediga G (2000) Rough set data analysis. Encycl Comput Sci Technol 43(28):281\u2013301","journal-title":"Encycl Comput Sci Technol"},{"key":"1467_CR12","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1016\/j.simpat.2016.01.010","volume":"64","author":"ESM El-Alfy","year":"2016","unstructured":"El-Alfy ESM, Alshammari MA (2016) Towards scalable rough set based attribute subset selection for intrusion detection using parallel genetic algorithm in MapReduce. 
Simul Model Pract Theory 64:18\u201329","journal-title":"Simul Model Pract Theory"},{"issue":"2","key":"1467_CR13","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2481244.2481246","volume":"14","author":"W Fan","year":"2013","unstructured":"Fan W, Bifet A (2013) Mining big data: current status, and forecast to the future. ACM sIGKDD Explor Newsl 14(2):1\u20135","journal-title":"ACM sIGKDD Explor Newsl"},{"issue":"5","key":"1467_CR14","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1002\/widm.1134","volume":"4","author":"A Fern\u00e1ndez","year":"2014","unstructured":"Fern\u00e1ndez A, del R\u00edo S, L\u00f3pez V, Bawakid A, del Jesus MJ, Ben\u00edtez JM, Herrera F (2014) Big data with cloud computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdiscip Rev Data Min Knowl Discov 4(5):380\u2013409","journal-title":"Wiley Interdiscip Rev Data Min Knowl Discov"},{"issue":"4","key":"1467_CR15","doi-asserted-by":"crossref","first-page":"1969","DOI":"10.1016\/j.asoc.2012.11.042","volume":"13","author":"A Ghosh","year":"2013","unstructured":"Ghosh A, Datta A, Ghosh S (2013) Self-adaptive differential evolution for feature selection in hyperspectral image data. Appl Soft Comput 13(4):1969\u20131977","journal-title":"Appl Soft Comput"},{"issue":"4","key":"1467_CR16","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1145\/332051.332082","volume":"43","author":"JW Grzymala-Busse","year":"2000","unstructured":"Grzymala-Busse JW, Ziarko W (2000) Data mining and rough set theory. Commun ACM 43(4):108\u2013109","journal-title":"Commun ACM"},{"issue":"Mar","key":"1467_CR17","first-page":"1157","volume":"3","author":"I Guyon","year":"2003","unstructured":"Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(Mar):1157\u20131182","journal-title":"J Mach Learn Res"},{"key":"1467_CR18","unstructured":"https:\/\/spark.apache.org\/mllib\/. 
Mllib website"},{"key":"1467_CR19","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1016\/j.knosys.2015.10.025","volume":"94","author":"J Hu","year":"2016","unstructured":"Hu J, Pedrycz W, Wang G, Wang K (2016) Rough sets in distributed decision information systems. Knowl-Based Syst 94:13\u201322","journal-title":"Knowl-Based Syst"},{"key":"1467_CR20","unstructured":"John GH, Kohavi R, Pfleger K et al (1994) Irrelevant features and the subset selection problem. In: Machine learning: proceedings of the eleventh international conference, pp 121\u2013129"},{"key":"1467_CR21","doi-asserted-by":"crossref","DOI":"10.1002\/9781118874059","volume-title":"Discovering knowledge in data: an introduction to data mining","author":"DT Larose","year":"2014","unstructured":"Larose DT (2014) Discovering knowledge in data: an introduction to data mining. Wiley, New York"},{"issue":"3","key":"1467_CR22","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1023\/A:1011219918340","volume":"16","author":"P Lingras","year":"2001","unstructured":"Lingras P (2001) Unsupervised rough set classification using GAs. J Intell Inf Syst 16(3):215\u2013228","journal-title":"J Intell Inf Syst"},{"key":"1467_CR23","doi-asserted-by":"crossref","unstructured":"Lingras P (2002) Rough set clustering for web mining. In: Fuzzy systems, 2002. FUZZ-IEEE\u201902. Proceedings of the 2002 IEEE international conference on, vol\u00a02. IEEE, pp 1039\u20131044","DOI":"10.1109\/FUZZ.2002.1006647"},{"key":"1467_CR24","unstructured":"Liu H, Motoda H, Setiono R, Zhao Z (2010) Feature selection: an ever evolving frontier in data mining. In: Feature selection in data mining, pp 4\u201313"},{"issue":"4","key":"1467_CR25","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1109\/TKDE.2005.66","volume":"17","author":"H Liu","year":"2005","unstructured":"Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. 
IEEE Trans Knowl Data Eng 17(4):491\u2013502","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"1467_CR26","doi-asserted-by":"crossref","unstructured":"Liu H, Zhao Z (2012) Manipulating data and dimension reduction methods: feature selection. In: Computational complexity. Springer, Berlin, pp 1790\u20131800","DOI":"10.1007\/978-1-4614-1800-9_115"},{"key":"1467_CR27","volume-title":"Rough sets: theoretical aspects of reasoning about data","author":"Z Pawlak","year":"2012","unstructured":"Pawlak Z (2012) Rough sets: theoretical aspects of reasoning about data, vol 9. Springer, Berlin"},{"issue":"1","key":"1467_CR28","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1016\/j.ins.2006.06.003","volume":"177","author":"Z Pawlak","year":"2007","unstructured":"Pawlak Z, Skowron A (2007) Rudiments of rough sets. Inf Sci 177(1):3\u201327","journal-title":"Inf Sci"},{"issue":"8","key":"1467_CR29","doi-asserted-by":"crossref","first-page":"1226","DOI":"10.1109\/TPAMI.2005.159","volume":"27","author":"H Peng","year":"2005","unstructured":"Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226\u20131238","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1467_CR30","doi-asserted-by":"crossref","unstructured":"Peralta D, del R\u00edo S, Ram\u00edrez-Gallego S, Triguero I, Benitez JM, Herrera F (2015) Evolutionary feature selection for big data classification: a mapreduce approach. Math Probl Eng. https:\/\/doi.org\/10.1155\/2015\/246139","DOI":"10.1155\/2015\/246139"},{"issue":"9","key":"1467_CR31","doi-asserted-by":"crossref","first-page":"597","DOI":"10.1016\/j.artint.2010.04.018","volume":"174","author":"Y Qian","year":"2010","unstructured":"Qian Y, Liang J, Pedrycz W, Dang C (2010) Positive approximation: an accelerator for attribute reduction in rough set theory. 
Artif Intell 174(9):597\u2013618","journal-title":"Artif Intell"},{"key":"1467_CR32","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1016\/j.ijar.2018.01.008","volume":"97","author":"Y Qian","year":"2018","unstructured":"Qian Y, Liang X, Wang Q, Liang J, Liu B, Skowron A, Yao Y, Ma J, Dang C (2018) Local rough set: a solution to rough data analysis in big data. Int J Approx Reason 97:38\u201363","journal-title":"Int J Approx Reason"},{"issue":"3","key":"1467_CR33","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1109\/SURV.2011.032211.00087","volume":"13","author":"S Sakr","year":"2011","unstructured":"Sakr S, Liu A, Batista DM, Alomari M (2011) A survey of large scale data management approaches in cloud environments. IEEE Commun Surv Tutor 13(3):311\u2013336","journal-title":"IEEE Commun Surv Tutor"},{"issue":"5","key":"1467_CR34","doi-asserted-by":"crossref","first-page":"1273","DOI":"10.1007\/s10618-015-0441-y","volume":"30","author":"P Sch\u00e4fer","year":"2016","unstructured":"Sch\u00e4fer P (2016) Scalable time series classification. Data Min Knowl Discov 30(5):1273\u20131298","journal-title":"Data Min Knowl Discov"},{"issue":"4","key":"1467_CR35","doi-asserted-by":"crossref","first-page":"972","DOI":"10.1007\/s10618-017-0498-x","volume":"31","author":"J Schneider","year":"2017","unstructured":"Schneider J, Vlachos M (2017) Scalable density-based clustering with quality guarantees using random projections. Data Mining Knowl Discov 31(4):972\u20131005","journal-title":"Data Mining Knowl Discov"},{"key":"1467_CR36","unstructured":"Shanahan JG, Dai L (2015) Large scale distributed data science using apache spark. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 2323\u20132324"},{"key":"1467_CR37","volume-title":"MPI-the complete reference: the MPI core","author":"M Snir","year":"1998","unstructured":"Snir M (1998) MPI-the complete reference: the MPI core, vol 1. 
MIT Press, Cambridge"},{"issue":"5","key":"1467_CR38","doi-asserted-by":"crossref","first-page":"1024","DOI":"10.1007\/s10618-016-0466-x","volume":"30","author":"N Talukder","year":"2016","unstructured":"Talukder N, Zaki MJ (2016) A distributed approach for graph mining in massive networks. Data Mini Knowl Discov 30(5):1024\u20131052","journal-title":"Data Mini Knowl Discov"},{"issue":"1","key":"1467_CR39","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.asoc.2008.05.006","volume":"9","author":"K Thangavel","year":"2009","unstructured":"Thangavel K, Pethalakshmi A (2009) Dimensionality reduction based on rough set theory: a review. Appl Soft Comput 9(1):1\u201312","journal-title":"Appl Soft Comput"},{"issue":"6","key":"1467_CR40","doi-asserted-by":"crossref","first-page":"1520","DOI":"10.1007\/s10618-016-0453-2","volume":"30","author":"NX Vinh","year":"2016","unstructured":"Vinh NX, Chan J, Romano S, Bailey J, Leckie C, Ramamohanarao K, Pei J (2016) Discovering outlying aspects in large datasets. Data Min Knowl Discov 30(6):1520\u20131555","journal-title":"Data Min Knowl Discov"},{"issue":"1","key":"1467_CR41","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1109\/TKDE.2013.109","volume":"26","author":"X Wu","year":"2014","unstructured":"Wu X, Zhu X, Wu GQ, Ding W (2014) Data mining with big data. IEEE Trans Knowl Data Eng 26(1):97\u2013107","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"1467_CR42","unstructured":"Xu X, J\u00e4ger J, Kriegel HP (1999) A fast parallel clustering algorithm for large spatial databases. In: High performance data mining. Springer, pp 263\u2013290"},{"issue":"5","key":"1467_CR43","doi-asserted-by":"crossref","first-page":"1242","DOI":"10.1007\/s10618-017-0500-7","volume":"31","author":"T Zhai","year":"2017","unstructured":"Zhai T, Gao Y, Wang H, Cao L (2017) Classification of high-dimensional evolving data streams via a resource-efficient online ensemble. 
Data Mining Knowl Discov 31(5):1242\u20131265","journal-title":"Data Mining Knowl Discov"},{"issue":"2","key":"1467_CR44","doi-asserted-by":"crossref","first-page":"465","DOI":"10.1007\/s10618-016-0481-y","volume":"31","author":"J Zhang","year":"2017","unstructured":"Zhang J, Wang S, Chen L, Gallinari P (2017) Multiple bayesian discriminant functions for high-dimensional massive data classification. Data Min Knowl Discov 31(2):465\u2013501","journal-title":"Data Min Knowl Discov"}],"container-title":["Knowledge and Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10115-020-01467-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10115-020-01467-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10115-020-01467-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,30]],"date-time":"2023-09-30T15:09:45Z","timestamp":1696086585000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10115-020-01467-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,2]]},"references-count":44,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2020,8]]}},"alternative-id":["1467"],"URL":"https:\/\/doi.org\/10.1007\/s10115-020-01467-y","relation":{},"ISSN":["0219-1377","0219-3116"],"issn-type":[{"value":"0219-1377","type":"print"},{"value":"0219-3116","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,5,2]]},"assertion":[{"value":"13 September 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 March 
2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 March 2020","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 May 2020","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}