Abstract
In recent years, feature selection has become essential for tackling the dimensionality problem by removing irrelevant and redundant information. Ranker methods are a commonly used approach for this purpose because they do not compromise computational efficiency. A ranker returns an ordered ranking of all the features, so a threshold must be set to reduce the number of features that are retained. In this work, a practical subset of features is selected according to three different data complexity measures, which frees the user from having to choose a fixed threshold in advance. The proposed approach was tested on six DNA microarray datasets, which pose a difficult challenge for researchers because of the high number of gene expression features and the low number of patients. The adequacy of the proposed approach, in terms of classification error, was checked using an ensemble of ranker methods with a Support Vector Machine as the classifier. The study shows that the approach achieves competitive results compared with those obtained with a fixed threshold, which is the standard in most research works.
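To make the difference between a fixed threshold and a data-driven one concrete, the following is a minimal illustrative sketch, not the authors' method. The abstract does not name the three complexity measures or say how they determine the cutoff, so everything here is an assumption: Fisher's discriminant ratio (one measure from the data complexity literature) is used as the per-feature score, and the hypothetical rule `data_driven_threshold` keeps the shortest prefix of the ranking whose scores account for a share `alpha` of the total. All function names and the toy data are invented for illustration.

```python
from statistics import mean, pvariance


def fisher_ratio(values, labels):
    """Fisher's discriminant ratio of one feature for a two-class problem (assumed score)."""
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    gap = (mean(a) - mean(b)) ** 2
    spread = pvariance(a) + pvariance(b)
    # Small constant avoids division by zero for constant features.
    return gap / (spread + 1e-12)


def rank_features(X, y):
    """Score every feature and return (ranking, scores), best feature first."""
    n_features = len(X[0])
    scores = [fisher_ratio([row[j] for row in X], y) for j in range(n_features)]
    ranking = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranking, scores


def fixed_threshold(ranking, fraction):
    """Standard approach: keep a fraction of the ranking chosen in advance."""
    k = max(1, round(fraction * len(ranking)))
    return ranking[:k]


def data_driven_threshold(ranking, scores, alpha=0.75):
    """Hypothetical rule: keep the shortest prefix covering `alpha` of the total score."""
    total = sum(scores)
    if total == 0:
        return ranking[:1]
    kept, running = [], 0.0
    for j in ranking:
        kept.append(j)
        running += scores[j]
        if running >= alpha * total:
            break
    return kept


# Toy data: 6 samples, 4 features, two classes (invented for illustration).
X = [
    [1.0, 2.0, 1.0, 1.0],
    [1.1, 3.0, 2.0, 1.0],
    [0.9, 2.5, 3.0, 1.0],
    [5.0, 2.4, 3.0, 1.0],
    [5.1, 2.1, 4.0, 1.0],
    [4.9, 3.1, 5.0, 1.0],
]
y = [0, 0, 0, 1, 1, 1]

ranking, scores = rank_features(X, y)
print("ranking:", ranking)
print("fixed 50%:", fixed_threshold(ranking, 0.5))
print("data-driven:", data_driven_threshold(ranking, scores))
```

In this toy case the fixed rule keeps half of the features regardless of how informative they are, while the data-driven rule stops as soon as the retained features carry most of the measured separability. The cutoff therefore adapts to each dataset instead of being set beforehand.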
Acknowledgments
This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN2015-65069-C2-1-R), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (research project GRC2014/035).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Seijo-Pardo, B., Bolón-Canedo, V., Alonso-Betanzos, A. (2016). Using Data Complexity Measures for Thresholding in Feature Selection Rankers. In: Luaces, O., et al. Advances in Artificial Intelligence. CAEPIA 2016. Lecture Notes in Computer Science, vol 9868. Springer, Cham. https://doi.org/10.1007/978-3-319-44636-3_12
DOI: https://doi.org/10.1007/978-3-319-44636-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44635-6
Online ISBN: 978-3-319-44636-3
eBook Packages: Computer Science, Computer Science (R0)