Abstract
Classification algorithms are among the most commonly used data mining models, widely applied to extract valuable knowledge from large amounts of data. The criteria used to evaluate classifiers are mostly accuracy, computational complexity, robustness, scalability, integration, comprehensibility, stability, and interestingness. This study compares classification algorithms in terms of accuracy, speed (CPU time consumed), and robustness across various datasets and implementation techniques. The data miner selects a model mainly with respect to classification accuracy; therefore, the performance of each classifier plays a crucial role in selection. Complexity is mostly dominated by the time required for classification; in terms of complexity, the CPU time consumed by each classifier is considered here. The study first applies selected classification models to multiple datasets in three stages: first, running the algorithms on the original datasets; second, running them on the same datasets with continuous variables discretised; and third, running them on the same datasets after principal component analysis is applied. The accuracies and speeds of the results are then compared. The relationship between dataset characteristics and implementation attributes on the one hand, and accuracy and CPU time on the other, is also examined and discussed. Moreover, a regression model is introduced to show the combined effect of dataset and implementation conditions on classifier accuracy and CPU time. Finally, the study addresses the robustness of the classifiers, measured by repeated experiments on both noisy and cleaned datasets.
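The three-stage comparison described above can be sketched in code. The following is an illustrative example only, not the paper's exact experimental setup: it uses scikit-learn (a hypothetical substitution for the tools used in the study), a small built-in dataset, and three stand-in classifiers, measuring cross-validated accuracy and CPU time on the original features, on discretised features, and on PCA-transformed features.

```python
# Sketch of the study's three-stage comparison: for each classifier,
# measure accuracy and CPU time on (1) original features,
# (2) discretised features, (3) PCA-transformed features.
import time
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
}
# The three preprocessing stages examined in the study.
stages = {
    "original": None,
    "discretised": KBinsDiscretizer(n_bins=5, encode="ordinal",
                                    strategy="uniform"),
    "PCA": PCA(n_components=2),
}

results = {}
for stage_name, transform in stages.items():
    for clf_name, clf in classifiers.items():
        model = clf if transform is None else make_pipeline(transform, clf)
        start = time.process_time()  # CPU time, as in the study
        acc = cross_val_score(model, X, y, cv=5).mean()
        cpu = time.process_time() - start
        results[(stage_name, clf_name)] = (acc, cpu)
        print(f"{stage_name:12s} {clf_name:14s} acc={acc:.3f} cpu={cpu:.3f}s")
```

In a fuller experiment, the (accuracy, CPU time) pairs collected in `results` would feed the regression model relating dataset and implementation conditions to performance, and the loop would be repeated on noise-injected copies of the data to assess robustness.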
Dogan, N., Tanrikulu, Z. A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Inf Technol Manag 14, 105–124 (2013). https://doi.org/10.1007/s10799-012-0135-8