Abstract
In this work we address an extension of box clustering in supervised classification problems that makes use of optimization problems to refine the results obtained by agglomerative techniques. The central concept of box clustering is that of homogeneous boxes that give rise to overtrained classifiers under some conditions. Thus, we focus our attentions on the issue of pruning out redundant boxes, using the information gleaned from the other boxes generated under the hypothesis that such a choice would identify simpler models with good predictive power. We propose a pruning method based on an integer optimization problem and a family of sub problems derived from the main one. The overall performances are then compared to the accuracy levels of competing methods on a wide range of real data sets. The method has proven to be robust, making it possible to derive a more compact system of boxes in the instance space with good performance on training and test data.
Similar content being viewed by others
References
Almuallim H, Dietterich TG (1991) Learning with many irrelevant features. In: Proceedings of the ninth national conference on artificial intelligence, vol 2. AAAI Press, Menlo Park, CA, pp 547–552
Bache K, Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html
Bertolazzi P, Felici G, Festa P, Lancia G (2008) Logic classification and feature selection for biomedical data. Comput Math Appl 55(5):889–899
Boros E, Hammer PL, Ibaraki T, Kogan A (1997) Logical analysis of numerical data. Math Progr 79:163–190
Boros E, Hammer PL, Ibaraki T, Kogan A, Mayoraz E, Muchnik I (2000) An implementation of logical analysis of data. Knowl Data Eng IEEE Trans 12(2):292–306
Davenport MA, Baraniuk MG, Scott CD (2006a) Controlling false alarms with support vector machines. In: IEEE international conference on acoustics, speech, and signal processing (ICASSP), Toulouse, France
Davenport MA, Baraniuk MG, Scott CD (2006b) Learning minimum volume sets with support vector machines. In: IEEE workshop on machine learning for signal processing (MLSP), Maynooth, Ireland
Davenport MA, Baraniuk RG, Scott C (2010) Tuning support vector machines for minimax and neyman-pearson classification. IEEE Trans Pattern Anal Mach Intell 32(10):1888–1898. http://www.ece.rice.edu/~md/np_svm.php
Eckstein J, Hammer PL, Liu Y, Nediak M, Simeone B (2002) The maximum box problem and its application to data analysis. Comput Optim Appl 23(3):285–298
Felici G, Simeone B, Spinelli V (2010) Classification techniques and error control in logic mining. In: Stahlbock R, Crone SF, Lessmann S (eds) Data mining, Annals of information systems, vol 8. Springer, London, pp 99–119. ISBN:978-1-4419-1279-4
Grudzinski K, Grochowski M, Duch W (2010) Pruning classification rules with reference vector selection methods. In: Rutkowski L, Scherer R, Tadeusiewicz R, Zadeh LA, Zurada JM (eds) ICAISC (1), Lecture notes in computer science, vol 6113. Springer, Berlin, Heidelberg, pp 347–354
Hammer PL, Liu Y, Simeone B, Szedmàk S (2004) Saturated systems of homogeneous boxes and the logical analysis of numerical data. Discret Appl Math 144:103–109
Harris E (2002) Information gain versus gain ratio: a study of split method biases. In: ISAIM. http://dblp.uni-trier.de/db/conf/isaim/isaim2002.html
Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. Knowl Data Eng IEEE Trans 17(3):299–310
Kaneko A, Kano M (2003) Discrete geometry on red and blue points in the plane–a survey. In: Discrete and computational geometry. Springer, Berlin, Heidelberg, pp 551–570
Kohavi R, Provost F (1998) Glossary of terms. Mach Learn 30:271–274
Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets
Maravalle M, Ricca F, Simeone B, Spinelli V (2014) Carpal tunnel syndrome automatic classification: electromyography vs. ultrasound imaging. In: Proceedings of TOP, pp 1–24
McCarthy K, Zabar B, Weiss G (2005) Does cost-sensitive learning beat sampling for classifying rare classes? In: Proceedings of the 1st international workshop on utility-based data mining, pp 69–77
Nitesh VC (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) The data mining and knowledge discovery handbook. Springer, New York, pp 853–867
Schaffer C (1994) A conservative law for generalization performance. In: Cohen WW, Hirsh H (eds) Eleventh international conference on machine learning, ICML. Morgan Kaufmann, San Francisco, CA, pp 259–265. ISBN:1-55860-335-2
Shah D, Lakshmanan LVS, Ramamritham K, Sudarshan S (1999) Interestingness and pruning of mined patterns. In: 1999 ACM SIGMOD workshop on research issues in data mining and knowledge discovery. http://dblp.uni-trier.de/db/conf/dmkd/dmkd1999.html
Simeone B, Spinelli V (2007) The optimization problem framework for box clustering approach in logic mining. In: Book of abstract of Euro XXII–22nd European conference on operational research, Euro XXII
Simeone B, Felici G, Spinelli V (2007) A graph coloring approach for box clustering techniques in logic mining. In: Book of Abstract of Euro XXII–22nd European conference on operational research, Euro XXII
Sogaard A (2013) Semi-supervised learning and domain adaptation in natural language processing. Morgan & Claypool, San Rafael
Weka (2013) Machine learning group– data mining software in java. University of Waikato, Department of Computer Science. http://www.cs.waikato.ac.nz/ml/weka
Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco, CA. ISBN:0123748569, 9780123748560
Wu S, Flach P (2005) A scored AUC metric for classifier evaluation and selection In: ICML’05 workshop on ROC Analysis in Machine Learning, Bonn, Germany, August 2005
Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: Proceedings of the third IEEE international conference on data mining
Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73–83
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Spinelli, V. Pruning boxes in a box-based classification method. Adv Data Anal Classif 10, 285–304 (2016). https://doi.org/10.1007/s11634-014-0193-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11634-014-0193-3