Abstract
Conditional mutual information (CMI) maximization is a promising criterion for stepwise, computationally efficient feature selection, but it is difficult to apply in full because probability estimates are imprecise and the computational load is heavy. Many dimension-reduced CMI-based and mutual information (MI)-based methods have been reported to achieve state-of-the-art classification performance; however, dimension reduction introduces model deviations into their CMI and MI formulations. In this paper, we start from the full-dimensional CMI for the feature selection problem, so as to retain the full inter-feature and feature-label mutual information when selecting new features. To overcome the difficulties of maximizing the original full-dimensional CMI, the cost function is approximated and simplified from a mathematical perspective. We establish a relationship between the proposed feature selection criterion and one based on the Hilbert-Schmidt independence criterion, which explains qualitatively how the new criterion achieves relevance maximization and redundancy minimization simultaneously. Experiments on real-world datasets demonstrate the superiority of the proposed method over existing ones.
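For intuition about the stepwise criterion discussed above, the following is a minimal sketch of generic greedy CMI-based forward selection: at each step it adds the feature maximizing the empirical I(X_f; Y | X_S) given the already selected set S. It uses a plug-in estimator for discrete data and a single label (the K = 1 case noted below); it illustrates only the baseline criterion, not the paper's approximated full-dimensional formulation, and all function and variable names are our own illustrative choices.

```python
# Illustrative sketch only: generic stepwise CMI forward selection,
# not the paper's simplified full-dimensional criterion.
import numpy as np
from collections import Counter

def empirical_cmi(x, y, z_rows):
    """Plug-in estimate of I(X;Y|Z) for discrete samples; z_rows is (n, |S|)."""
    n = len(x)
    # With no conditioning features, all samples fall into one Z-group (plain MI).
    z_keys = [tuple(r) for r in z_rows] if z_rows.size else [()] * n
    cmi = 0.0
    for z in set(z_keys):
        idx = [i for i, k in enumerate(z_keys) if k == z]
        pz = len(idx) / n                      # empirical P(Z = z)
        xz, yz = x[idx], y[idx]
        pxy, px, py = Counter(zip(xz, yz)), Counter(xz), Counter(yz)
        m = len(idx)
        for (a, b), c in pxy.items():
            p_ab = c / m
            # p_ab / (p_a * p_b) within the Z = z slice, in closed count form
            cmi += pz * p_ab * np.log(c * m / (px[a] * py[b]))
    return cmi

def greedy_cmi_select(X, y, k):
    """Select k feature indices by stepwise CMI maximization."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        scores = [empirical_cmi(X[:, f], y, X[:, selected]) for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(200, 6))
    y = (X[:, 0] + X[:, 2]) % 3        # labels depend on features 0 and 2
    print(greedy_cmi_select(X, y, 2))  # typically recovers features 0 and 2
```

The plug-in estimator above is exactly the kind of imprecise, combinatorially expensive probability calculation that motivates the paper's approximation: the number of Z-groups grows exponentially with the selected set, which is why the full-dimensional cost function must be simplified.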
Notes
In this paper, we consider the multi-label problem without loss of generality; the single-label problem is the special case obtained by setting K = 1 and Y = y.
A multiplier of K is added to weight the inter-feature mutual information of independent labels, because K-label models are considered in this paper rather than the single-label model in [27].
Available at: http://pubchem.ncbi.nlm.nih.gov
Assay IDs include: 1416 (PERK), 1446 (JAK2), 1481 (ATPase), 1531 (MEK)
More detailed descriptions of the two datasets can be found in [16].
References
Bache K, Lichman M (2013) UCI machine learning repository
Bennasar M, Hicks Y, Setchi R (2015) Feature selection using joint mutual information maximisation. Expert Syst Appl 42(22):8520–8532
Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1):245–271
Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66
Bu Z, Li HJ, Zhang C, Cao J, Li A, Shi Y (2019) Graph k-means based on leader identification, dynamic game and opinion dynamics. IEEE Trans Knowl Data Eng, pp 1–1. https://doi.org/10.1109/TKDE.2019.2903712
Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40(1):16–28
Chen Y, Bi J, Wang J (2006) MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 28(12):1931–1947
Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555
Gretton A, Bousquet O, Smola A, Schölkopf B (2005) Measuring statistical dependence with Hilbert-Schmidt norms. In: International conference on algorithmic learning theory. Springer, pp 63–77
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37
Janecek A, Gansterer WN, Demel M, Ecker G (2008) On the relationship between feature selection and classification accuracy. FSDM 4:90–105
Kalousis A, Prados J, Hilario M (2007) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12(1):95–116
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1-2):273–324
Koller D, Sahami M (1996) Toward optimal feature selection. Technical report, Stanford InfoLab
Kong X, Yu PS (2010) Multi-label feature selection for graph classification. In: 2010 IEEE 10th international conference on data mining (ICDM). IEEE, pp 274–283
Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159
Li HJ, Bu Z, Wang Z, Cao J (2020) Dynamical clustering in electronic commerce systems via optimization and leadership expansion. IEEE Trans Ind Inf 16(8):5327–5334
Liu H, Motoda H, Setiono R, Zhao Z (2010) Feature selection: an ever evolving frontier in data mining. In: Feature selection in data mining, pp 4–13
Liu H, Sun J, Liu L, Zhang H (2009) Feature selection with dynamic mutual information. Pattern Recogn 42(7):1330–1339
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Yamada M, Jitkrittum W, Sigal L, Xing EP, Sugiyama M (2014) High-dimensional feature selection by feature-wise kernelized lasso. Neural Comput 26(1):185–207
Mitra P, Murthy C, Pal SK (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312
Nakariyakul S, Casasent DP (2009) An improvement on floating search algorithms for feature subset selection. Pattern Recogn 42(9):1932–1940
Neumann J, Schnörr C, Steidl G (2005) Combined svm-based feature selection and classification. Mach Learn 61(1):129–150
Pappu V, Pardalos PM (2014) High-dimensional data classification. In: Clusters, orders, and trees: Methods and applications. Springer, pp 119–150
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Pudil P, Novovičová J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Song L, Smola A, Gretton A, Bedo J, Borgwardt K (2012) Feature selection via dependence maximization. J Mach Learn Res 13:1393–1434
Sugiyama M (2012) Machine learning with squared-loss mutual information. Entropy 15(1):80–112
Suzuki T, Sugiyama M, Kanamori T, Sese J (2009) Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinform 10(1):S52
Torkkola K (2003) Feature extraction by non-parametric mutual information maximization. J Mach Learn Res 3:1415–1438
Tu CJ, Chuang LY, Chang JY, Yang CH et al (2007) Feature selection using PSO-SVM. Int J Comput Sci
Unler A, Murat A, Chinnam RB (2011) mr2PSO: a maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Inf Sci 181(20):4625–4641
Vergara JR, Estévez PA (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24(1):175–186
Wang J, Wei JM, Yang Z, Wang SQ (2017) Feature selection by maximizing independent classification information. IEEE Trans Knowl Data Eng 29(4):828–841
Wang T, Lu J, Zhang G (2018) Two-stage fuzzy multiple kernel learning based on Hilbert-Schmidt independence criterion. IEEE Trans Fuzzy Syst, pp 1–1
Yan X, Cheng H, Han J, Yu PS (2008) Mining significant graph patterns by leap search. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, pp 433–444
Yan X, Han J (2002) gSpan: graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE international conference on data mining (ICDM). IEEE, pp 721–724
Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: ICML, vol 3, pp 856–863
Zhang ML, Zhou ZH (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn 40(7):2038–2048
Zhang ML, Zhou ZH (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837
Zhang Y, Zhou ZH (2010) Multilabel dimensionality reduction via dependence maximization. ACM Trans Knowl Discov Data (TKDD) 4(3):14
Zhou Y, Jin R, Hoi SC (2010) Exclusive lasso for multi-task feature selection. In: AISTATS, vol 9, pp 988–995