{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T08:30:20Z","timestamp":1745915420128,"version":"3.37.3"},"reference-count":74,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2015,11,24]],"date-time":"2015-11-24T00:00:00Z","timestamp":1448323200000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"funder":[{"name":"National Research Project","award":["TIN2014-57251-P","TIN2012-37954","TIN2013-47210-P"]},{"name":"Andalusian Research Plan","award":["P10-TIC-6858","P11-TIC-7765","P12-TIC-2958"]},{"DOI":"10.13039\/501100010801","name":"Xunta de Galicia","doi-asserted-by":"crossref","award":["GRC 2014\/035","FEDER funds of the European Union","POS-A\/2013\/196","ED481B 2014\/164-0"],"id":[{"id":"10.13039\/501100010801","id-type":"DOI","asserted-by":"crossref"}]},{"name":"FPU scholarship from the Spanish Ministry of Education and Science","award":["FPU13\/00047"]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["WIREs Data Min & Knowl"],"published-print":{"date-parts":[[2016,1]]},"abstract":"Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise data representations as categories which are adequate for the learning task retaining as much information in the original continuous attribute as possible. In this article, we present an updated overview of discretization techniques in conjunction with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as data preprocessing technique, few elementary approaches have been developed in the literature for Big Data. 
The purpose of this article is twofold: first, a comprehensive taxonomy of discretization techniques is presented to help practitioners in the use of the algorithms; second, the article aims to demonstrate that standard discretization methods can be parallelized in Big Data platforms such as Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of one of the most well\u2010known discretizers based on Information Theory, obtaining better results than the one produced by: the entropy minimization discretizer proposed by Fayyad and Irani. Our scheme goes beyond a simple parallelization and it is intended to be the first to face the Big Data challenge. WIREs Data Mining Knowl Discov 2016, 6:5\u201321. doi: 10.1002\/widm.1173\nThis article is categorized under:\nTechnologies > Classification\nTechnologies > Data Preprocessing","DOI":"10.1002\/widm.1173","type":"journal-article","created":{"date-parts":[[2015,11,24]],"date-time":"2015-11-24T09:56:49Z","timestamp":1448359009000},"page":"5-21","source":"Crossref","is-referenced-by-count":96,"title":["Data discretization: taxonomy and big data challenge"],"prefix":"10.1002","volume":"6","author":[{"given":"Sergio","family":"Ram\u00edrez\u2010Gallego","sequence":"first","affiliation":[{"name":"Department of Computer Science and Artificial Intelligence University of Granada Granada Spain"}]},{"given":"Salvador","family":"Garc\u00eda","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Artificial Intelligence University of Granada Granada Spain"}]},{"given":"H\u00e9ctor","family":"Mouri\u00f1o\u2010Tal\u00edn","sequence":"additional","affiliation":[{"name":"Department of Computer Science University of A Coru\u00f1a A Coru\u00f1a Spain"}]},{"given":"David","family":"Mart\u00ednez\u2010Rego","sequence":"additional","affiliation":[{"name":"Department of 
Computer Science University of A Coru\u00f1a A Coru\u00f1a Spain"},{"name":"Department of Computer Science University College London London UK"}]},{"given":"Ver\u00f3nica","family":"Bol\u00f3n\u2010Canedo","sequence":"additional","affiliation":[{"name":"Department of Computer Science University of A Coru\u00f1a A Coru\u00f1a Spain"}]},{"given":"Amparo","family":"Alonso\u2010Betanzos","sequence":"additional","affiliation":[{"name":"Department of Computer Science University of A Coru\u00f1a A Coru\u00f1a Spain"}]},{"given":"Jos\u00e9 Manuel","family":"Ben\u00edtez","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Artificial Intelligence University of Granada Granada Spain"}]},{"given":"Francisco","family":"Herrera","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Artificial Intelligence University of Granada Granada Spain"}]}],"member":"311","published-online":{"date-parts":[[2015,11,24]]},"reference":[{"key":"e_1_2_9_2_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1016304305535"},{"key":"e_1_2_9_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.35"},{"key":"e_1_2_9_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10247-4"},{"key":"e_1_2_9_5_1","doi-asserted-by":"publisher","DOI":"10.1201\/9781420089653"},{"volume-title":"C4.5: Programs for Machine Learning","year":"1993","author":"Ross Quinlan J","key":"e_1_2_9_6_1"},{"key":"e_1_2_9_7_1","unstructured":"AgrawalR SrikantR. Fast algorithms for mining association rules. 
In:Proceedings of the 20th Very Large Data Bases conference (VLDB) Santiago de Chile Chile 1994 pages 487\u2013499."},{"key":"e_1_2_9_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-008-5083-5"},{"key":"e_1_2_9_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.24"},{"key":"e_1_2_9_10_1","first-page":"101","volume-title":"Data Mining and Knowledge Discovery Handbook","author":"Yang Y","year":"2010"},{"key":"e_1_2_9_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2013.109"},{"key":"e_1_2_9_12_1","unstructured":"DeanJ GhemawatS. Mapreduce: simplified data processing on large clusters. In: San Francisco CA OSDI 2004 pages 137\u2013150."},{"key":"e_1_2_9_13_1","unstructured":"Apache Hadoop Project. Apache Hadoop 2015. [Onlinehttps:\/\/hadoop.apache.org\/; Accessed March 2015]."},{"volume-title":"Hadoop, The Definitive Guide","year":"2012","author":"White T","key":"e_1_2_9_14_1"},{"key":"e_1_2_9_15_1","unstructured":"Apache Spark: lightning\u2010fast cluster computing. Apache spark 2015. [Onlinehttp:\/\/spark.apache.org\/; Accessed March 2015]."},{"volume-title":"Learning Spark: Lightning\u2010Fast Big Data Analytics","year":"2015","author":"Hamstra M","key":"e_1_2_9_16_1"},{"key":"e_1_2_9_17_1","unstructured":"Apache Mahout Project. Apache Mahout 2015. [Onlinehttp:\/\/mahout.apache.org\/; Accessed March 2015]."},{"key":"e_1_2_9_18_1","unstructured":"Machine Learning Library (MLlib) for Spark. Mllib 2015. [Onlinehttps:\/\/spark.apache.org\/docs\/1.2.0\/mllib-guide.html; Accessed March 2015]."},{"key":"e_1_2_9_19_1","unstructured":"FayyadUM IraniKB. Multi\u2010interval discretization of continuous\u2010valued attributes for classification learning. 
In:Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI) San Francisco CA 1993 pages 1022\u20131029."},{"key":"e_1_2_9_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.1987.4767986"},{"key":"e_1_2_9_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.88569"},{"key":"e_1_2_9_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/BFb0017012"},{"key":"e_1_2_9_23_1","unstructured":"KerberR. Chimerge: discretization of numeric attributes. In:National Conference on Artifical Intelligence American Association for Artificial Intelligence (AAAI) San Jose California 1992 pages 123\u2013128."},{"key":"e_1_2_9_24_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1022631118932"},{"key":"e_1_2_9_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.391407"},{"key":"e_1_2_9_26_1","doi-asserted-by":"crossref","unstructured":"PfahringerB. Compression\u2010based discretization of continuous attributes. In:Proceedings of the 12th International Conference on Machine Learning (ICML) Tahoe City California 1995 pages 456\u2013463.","DOI":"10.1016\/B978-1-55860-377-6.50063-3"},{"key":"e_1_2_9_27_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/39.8.688"},{"key":"e_1_2_9_28_1","unstructured":"FriedmanN GoldszmidtM. Discretizing continuous attributes while learning Bayesian networks. In:Proceedings of the 13th International Conference on Machine Learning (ICML) Bari Italy 1996 pages 157\u2013165."},{"key":"e_1_2_9_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0888-613X(96)00074-6"},{"key":"e_1_2_9_30_1","unstructured":"HoKM ScottPD. Zeta: a global method for discretization of continuous variables. In:Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD) Newport Beach California 1997 pages 191\u2013194."},{"key":"e_1_2_9_31_1","unstructured":"CerquidesJ De MantarasRL. Proposal and empirical comparison of a parallelizable distance\u2010based discretization method. 
In:Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD) Newport Beach California 1997 pages 139\u2013142."},{"key":"e_1_2_9_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.617056"},{"key":"e_1_2_9_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.634751"},{"key":"e_1_2_9_34_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0218488598000264"},{"key":"e_1_2_9_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/PL00011680"},{"key":"e_1_2_9_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2002.1000349"},{"key":"e_1_2_9_37_1","first-page":"275","volume-title":"Frontiers in Artificial Intelligence and Applications","author":"Gir\u00e1ldez R","year":"2002"},{"key":"e_1_2_9_38_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:MACH.0000019804.29836.05"},{"key":"e_1_2_9_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2004.1269594"},{"key":"e_1_2_9_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.39"},{"key":"e_1_2_9_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.135"},{"key":"e_1_2_9_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.153"},{"key":"e_1_2_9_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-006-8364-x"},{"key":"e_1_2_9_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2006.70"},{"key":"e_1_2_9_45_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2006.06.005"},{"key":"e_1_2_9_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.250582"},{"key":"e_1_2_9_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2008.66"},{"key":"e_1_2_9_48_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2007.09.004"},{"key":"e_1_2_9_49_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2008.06.063"},{"key":"e_1_2_9_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-008-0142-6"},{"key":"e_1_2_9_51_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2011.08.008"},{"key":"e_1_2_9_52_1","doi-asserted-by":"publisher","DO
I":"10.1016\/j.asoc.2011.11.001"},{"key":"e_1_2_9_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2011.101"},{"key":"e_1_2_9_54_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2011.12.008"},{"key":"e_1_2_9_55_1","doi-asserted-by":"publisher","DOI":"10.1142\/S021800141350002X"},{"key":"e_1_2_9_56_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2012.10.036"},{"key":"e_1_2_9_57_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2013.12.005"},{"key":"e_1_2_9_58_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2014.02.113"},{"key":"e_1_2_9_59_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-014-0350-5"},{"key":"e_1_2_9_60_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2014.10.014"},{"key":"e_1_2_9_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-014-0380-z"},{"key":"e_1_2_9_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2015.2410143"},{"key":"e_1_2_9_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/584091.584093"},{"key":"e_1_2_9_64_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00500-008-0323-y"},{"key":"e_1_2_9_65_1","doi-asserted-by":"crossref","unstructured":"Ver\u00f3nicaBol\u00f3n\u2010Canedo NoeliaS\u00e1nchez\u2010Maro\u00f1o andAmparoAlonso\u2010Betanzos. On the effectiveness of discretization on gene selection of microarray data. In:International Joint Conference on Neural Networks IJCNN 2010 Barcelona Spain 18\u201323 July 2010 pages 1\u20138.","DOI":"10.1109\/IJCNN.2010.5596825"},{"key":"e_1_2_9_66_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2010.11.028"},{"key":"e_1_2_9_67_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-014-1151-8"},{"key":"e_1_2_9_68_1","doi-asserted-by":"crossref","unstructured":"ZhangY CheungY\u2010M. Discretizing numerical attributes in decision tree for big data analysis. In:ICDM Workshops Shenzhen China 2014 pages 1150\u20131157.","DOI":"10.1109\/ICDMW.2014.103"},{"key":"e_1_2_9_69_1","unstructured":"BeyerM.A. LaneyD. 
3D data management: controlling data volume velocity and variety 2001. [Onlinehttp:\/\/blogs.gartner.com\/doug\u2010laney\/files\/2012\/01\/ad949\u20103D\u2010Data\u2010Management\u2010Controlling\u2010Data\u2010Volume\u2010Velocity\u2010and\u2010Variety.pdf; Accessed March 2015]."},{"key":"e_1_2_9_70_1","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1134"},{"key":"e_1_2_9_71_1","article-title":"Mapreduce is good enough? If all you have is a hammer, throw away everything that's not a nail!","author":"Lin J","year":"2012","journal-title":"Clin Orthop Relat Res"},{"key":"e_1_2_9_72_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2014.03.043"},{"key":"e_1_2_9_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/1961189.1961199"},{"volume-title":"Pattern Classification and Scene Analysis","year":"1973","author":"Duda RO","key":"e_1_2_9_74_1"},{"volume-title":"Readings in Machine Learning","year":"1990","author":"Quinlan JR","key":"e_1_2_9_75_1"}],"container-title":["WIREs Data Mining and Knowledge 
Discovery"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fwidm.1173","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fwidm.1173","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/widm.1173","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/full-xml\/10.1002\/widm.1173","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/wires.onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/widm.1173","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,28]],"date-time":"2023-08-28T13:27:27Z","timestamp":1693229247000},"score":1,"resource":{"primary":{"URL":"https:\/\/wires.onlinelibrary.wiley.com\/doi\/10.1002\/widm.1173"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,11,24]]},"references-count":74,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,1]]}},"alternative-id":["10.1002\/widm.1173"],"URL":"https:\/\/doi.org\/10.1002\/widm.1173","archive":["Portico"],"relation":{},"ISSN":["1942-4787","1942-4795"],"issn-type":[{"type":"print","value":"1942-4787"},{"type":"electronic","value":"1942-4795"}],"subject":[],"published":{"date-parts":[[2015,11,24]]}}}