MOTiFS: Monte Carlo Tree Search Based Feature Selection
Abstract
:1. Introduction
- The novel feature selection algorithm, MOTiFS, is proposed which combines the robustness of MCTS with the accuracy of wrapper methods.
- MOTiFS searches through the feature space efficiently and find the best feature subset within a few iterations, relatively.
- Only two hyper-parameters, scaling factor and termination criteria, are required to be tuned, making MOTiFS simple and flexible to handle.
- MOTiFS is tested on 25 benchmark datasets and results are also compared with other established methods. The promising results demonstrate the superiority of MOTiFS.
2. Literature Review
3. Background
3.1. Working Procedure of Monte Carlo Tree Search (MCTS)
- Selection: Starting from the root node, the algorithm traverses the tree by selecting nodes with the highest approximated values, until a non-terminal node with unexpanded children is reached.
- Expansion: A new child node is added to expand the tree, according to the available set of actions.
- Simulation: From the new child node, a random simulation is performed until the terminal node is reached, to approximate the reward.
- Backpropagation: The simulation result (reward) is backpropagated through the selected nodes to update the tree.
3.2. Upper Confidence Bounds for Trees (UCT) Algorithm
4. MOTiFS (Monte Carlo Tree Search Based Feature Selection)
- The root is , which means no features are selected.
- Any node at levelhas two children,and, where.
4.1. Feature Subset Generation
4.1.1. Selection
4.1.2. Expansion
4.1.3. Simulation
4.2. Reward Calculation and Backpropagation
Algorithm 1 The MOTiFS Algorithm |
Load dataset and preprocess Initialize SCALAR, BUDGET //Scaling factor & Number of MCTS simulations (hyper parameters) function MOTiFS (featuresList) create rootNode maxReward, bestFeatureSubset ← UCTSEARCH (rootNode) return (maxReward, bestFeatureSubset) function UCTSEARCH (rootNode) Initialize maxReward, bestFeatureSubset while within computational budget do frontNode ← TREEPOLICY (rootNode) reward, featureSubset ← DEFAULTPOLICY (frontNode.state) BACKUP (frontNode, reward) if reward is greater than maxReward then maxReward ← reward bestFeatureSubset ← featureSubset return (maxReward, bestFeatureSubset) function TREEPOLICY (node) while node is non-terminal do if node not fully expanded then return EXPAND (node) else node ← BESTCHILD (node, SCALAR) return node function EXPAND (node) choose a untried actions from A(node.state) add a newChild with f(node.state, a) return newChild function BESTCHILD (, ) return function DEFAULTPOLICY (state) while state is non-terminal do choose a A(state) uniformly at random state ← f(state, a) traverse state.path if ai is equal to fi+1 then featureSubset ← INCLUDE (fi+1) reward ← REWARD (featureSubset) return (reward, featureSubset) function BACKUP (node, reward) while node is not null do node.visits ← node.visits + 1 if reward > node.reward then node.reward ← reward node ← node.parent return |
5. Experiment and Results
5.1. Datasets
5.2. Experimental Procedure and Parameter Setup for MOTiFS
5.3. Comparison Methods
5.4. Results and Comparisons
5.5. Discussion
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Yu, L.; Liu, H. Efficient Feature Selection via Analysis of Relevance and Redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224. [Google Scholar]
- Gasca, E.; Sánchez, J.S.; Alonso, R. Eliminating redundancy and irrelevance using a new MLP-based feature selection method. Pattern Recognit. 2006, 39, 313–315. [Google Scholar] [CrossRef]
- Gheyas, I.A.; Smith, L.S. Feature subset selection in large dimensionality domains. Pattern Recognit. 2010, 43, 5–13. [Google Scholar] [CrossRef]
- Zheng, Y.; Kwoh, C.K. A Feature Subset Selection Method Based On High-Dimensional Mutual Information. Entropy 2011, 13, 860–901. [Google Scholar] [CrossRef]
- Robnik-Šikonja, M.; Kononenko, I. Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69. [Google Scholar] [CrossRef]
- Sluga, D.; Lotrič, U. Quadratic mutual information feature selection. Entropy 2017, 19, 157. [Google Scholar] [CrossRef]
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning. Elements 2009, 1, 337–387. [Google Scholar]
- Guo, Y.; Berman, M.; Gao, J. Group subset selection for linear regression. Comput. Stat. Data Anal. 2014, 75, 39–52. [Google Scholar] [CrossRef]
- Saganowski, S.; Gliwa, B.; Bródka, P.; Zygmunt, A.; Kazienko, P.; Kozlak, J. Predicting community evolution in social networks. Entropy 2015, 17, 3053–3096. [Google Scholar] [CrossRef]
- Reif, M.; Shafait, F. Efficient feature size reduction via predictive forward selection. Pattern Recognit. 2014, 47, 1664–1673. [Google Scholar] [CrossRef]
- Śmieja, M.; Warszycki, D. Average information content maximization-a new approach for fingerprint hybridization and reduction. PLoS ONE 2016, 11, e0146666. Available online: http://ww2.ii.uj.edu.pl/~smieja/aic/ (accessed on 5 May 2018). [CrossRef] [PubMed]
- Dash, M.; Choi, K.; Scheuermann, P.; Liu, H. Feature selection for clustering-a filter solution. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM 2003, Maebashi City, Japan, 9–12 December 2002; pp. 115–122. [Google Scholar]
- Kim, Y.; Street, W.N.; Menczer, F. Feature selection in unsupervised learning via evolutionary search. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20–23 August 2000; pp. 365–369. [Google Scholar]
- Xue, B.; Zhang, M.; Browne, W.N.; Yao, X. A Survey on Evolutionary Computation Approaches to Feature Selection. IEEE Trans. Evol. Comput. 2016, 20, 606–626. [Google Scholar] [CrossRef]
- Hamdani, T.M.; Won, J.M.; Alimi, A.M.; Karray, F. Hierarchical genetic algorithm with new evaluation function and bi-coded representation for the selection of features considering their confidence rate. Appl. Soft Comput. J. 2011, 11, 2501–2509. [Google Scholar] [CrossRef]
- Hong, J.H.; Cho, S.B. Efficient huge-scale feature selection with speciated genetic algorithm. Pattern Recognit. Lett. 2006, 27, 143–150. [Google Scholar] [CrossRef]
- Kabir, M.M.; Shahjahan, M.; Murase, K. A new hybrid ant colony optimization algorithm for feature selection. Expert Syst. Appl. 2012, 39, 3747–3763. [Google Scholar] [CrossRef]
- Tabakhi, S.; Moradi, P. Relevance-redundancy feature selection based on ant colony optimization. Pattern Recognit. 2015, 48, 2798–2811. [Google Scholar] [CrossRef]
- Unler, A.; Murat, A.; Chinnam, R.B. Mr2PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Inf. Sci. 2011, 181, 4625–4641. [Google Scholar] [CrossRef]
- Zhang, Y.; Gong, D.; Hu, Y.; Zhang, W. Feature selection algorithm based on bare bones particle swarm optimization. Neurocomputing 2015, 148, 150–157. [Google Scholar] [CrossRef]
- Xue, B.; Zhang, M.; Browne, W.N. Single feature ranking and binary particle swarm optimisation based feature subset ranking for feature selection. Conf. Res. Pract. Inf. Technol. Ser. 2012, 122, 27–36. [Google Scholar]
- Paul, S.; Das, S. Simultaneous feature selection and weighting—An evolutionary multi-objective optimization approach. Pattern Recognit. Lett. 2015, 65, 51–59. [Google Scholar] [CrossRef]
- Cordon, O.; Herrera, F.; del Jesus, M.J.; Villar, P. A multiobjective genetic algorithm for feature selection and granularity learning in fuzzy-rule based classification systems. In Proceedings of the IFSA World Congress and 20th NAFIPS International Conference, Vancouver, BC, Canada, 25–28 July 2001; Volume 3, pp. 1253–1258. [Google Scholar]
- Xue, B.; Fu, W.; Zhang, M. Multi-objective Feature Selection in Classification: A Differential Evolution Approach. In Proceedings of the 10th International Conference on Simulated Evolution and Learning, Dunedin, New Zealand, 15–18 December 2014; pp. 516–528. [Google Scholar]
- Nakamura, R.Y.M.; Pereira, L.A.M.; Costa, K.A.; Rodrigues, D.; Papa, J.P.; Yang, X.S. BBA: A binary bat algorithm for feature selection. In Proceedings of the Brazilian Symposium of Computer Graphic and Image Processing, Campinas, Brazil, 17–20 October 2012; pp. 291–297. [Google Scholar]
- Rodrigues, D.; Pereira, L.A.M.; Nakamura, R.Y.M.; Costa, K.A.P.; Yang, X.; Souza, A.N.; Papa, J.P. A wrapper approach for feature selection based on Bat Algorithm and Optimum-Path Forest. Expert Syst. Appl. 2014, 41, 2250–2258. [Google Scholar] [CrossRef]
- Montazeri, M. HHFS: Hyper-heuristic feature selection. Intell. Data Anal. 2016, 20, 953–974. [Google Scholar] [CrossRef]
- Browne, C.; Powley, E. A survey of monte carlo tree search methods. IEEE Trans. Intell. AI Games 2012, 4, 1–49. [Google Scholar] [CrossRef] [Green Version]
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
- Hall, M. Correlation-based Feature Selection for Machine Learning. Methodology 1999, 21i195-i20, 1–5. [Google Scholar]
- Senawi, A.; Wei, H.L.; Billings, S.A. A new maximum relevance-minimum multicollinearity (MRmMC) method for feature selection and ranking. Pattern Recognit. 2017, 67, 47–61. [Google Scholar] [CrossRef]
- Zhao, G.D.; Wu, Y.; Chen, F.Q.; Zhang, J.M.; Bai, J. Effective feature selection using feature vector graph for classification. Neurocomputing 2015, 151, 376–389. [Google Scholar] [CrossRef]
- Huang, C.L.; Wang, C.J. A GA-based feature selection and parameters optimizationfor support vector machines. Expert Syst. Appl. 2006, 31, 231–240. [Google Scholar] [CrossRef]
- Durbha, S.S.; King, R.L.; Younan, N.H. Wrapper-based feature subset selection for rapid image information mining. IEEE Geosci. Remote Sens. Lett. 2010, 7, 43–47. [Google Scholar] [CrossRef]
- Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324. [Google Scholar] [CrossRef]
- Bermejo, P.; Gámez, J.A.; Puerta, J.M. A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets. Pattern Recognit. Lett. 2011, 32, 701–711. [Google Scholar] [CrossRef]
- Solorio-Fernández, S.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F. A new hybrid filter–wrapper feature selection method for clustering based on ranking. Neurocomputing 2016, 214, 866–880. [Google Scholar] [CrossRef]
- Almuallim, H.; Dietterich, T.G. Learning Boolean concepts in the presence of many irrelevant features. Artif. Intell. 1994, 69, 279–305. [Google Scholar] [CrossRef]
- Li, F.; Zhang, Z.; Jin, C. Feature selection with partition differentiation entropy for large-scale data sets. Inf. Sci. 2016, 329, 690–700. [Google Scholar] [CrossRef]
- Gaudel, R.; Sebag, M. Feature Selection as a One-Player Game. In Proceedings of the International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 359–366. [Google Scholar]
- Szenkovits, A.; Meszlenyi, R.; Buza, K.; Gasko, N.; Lung, R.I.; Suciu, M. Feature Selection with a Genetic Algorithm for Classification of Brain Imaging Data; Springer: Cham, Switzerland, 2018. [Google Scholar]
- Buza, K.; Alexandros, N.; Lars, S.-T. Time-series classification based on individualized error prediction. In Proceedings of the IEEE 13th International conference on Computational Science and Engineering (CSE), Hong Kong, China, 11–13 December 2010; pp. 48–54. [Google Scholar]
- Chen, G.H.; Stanislav, N.; Devavrat, S. A latent source model for nonparametric time series classification. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2013; pp. 1088–1098. [Google Scholar]
- Devroye, L.; Gyorfi, L.; Krzyzak, A.; Lugosi, G. On the Strong Universal Consistency of Nearest Neighbor Regression Function Estimates. Ann. Stat. 1994, 22, 1371–1385. [Google Scholar] [CrossRef]
- Chang, C.; Lin, C. Retrieved from LIBSVM—A Library for Support Vector Machines. 2001. Available online: https://www.csie.ntu.edu.tw/~cjlin/libsvm/ (accessed on 18 July 2017).
- Machine Learning Repository. Retrieved from University of California, Irvine. Available online: http://archive.ics.uci.edu/ml/index.php (accessed on 18 July 2017).
- Klekota, J.; Roth, F.P. Chemical substructures that enrich for biological activity. Bioinformatics 2008, 24, 2518–2525. [Google Scholar] [CrossRef] [PubMed]
- Tahir, M.A.; Bouridane, A.; Kurugollu, F. Simultaneous feature selection and feature weighting using Hybrid Tabu Search/K-nearest neighbor classifier. Pattern Recognit. Lett. 2007, 28, 438–446. [Google Scholar] [CrossRef]
- Mitra, P.; Murthy, C.A.; Pal, S.K. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 301–312. [Google Scholar] [CrossRef]
- Chen, Y.-C.; Pal, N.R.; Chung, I.-F. An Integrated Mechanism for Feature Selection and Fuzzy Rule Extraction for Classification. IEEE Trans. Fuzzy Syst. 2012, 20, 683–698. [Google Scholar] [CrossRef]
Notation | Interpretation |
---|---|
Original feature set | |
Total number of features | |
Node at tree level | |
Action taken at | |
Simulation reward |
# | Dataset | No. of Features | No. of Instances | No. of Classes |
---|---|---|---|---|
1 | Spambase | 57 | 4701 | 2 |
2 | WBC | 9 | 699 | 2 |
3 | Ionosphere | 34 | 351 | 2 |
4 | Arrhythmia | 195 | 452 | 16 |
5 | Multiple features | 649 | 2000 | 10 |
6 | Waveform | 40 | 5000 | 3 |
7 | WBDC | 30 | 569 | 2 |
8 | Glass | 9 | 214 | 6 |
9 | Wine | 13 | 178 | 3 |
10 | Australian | 14 | 690 | 2 |
11 | German number | 24 | 1000 | 2 |
12 | Zoo | 17 | 101 | 7 |
13 | Breast cancer | 10 | 683 | 2 |
14 | DNA | 180 | 2000 | 2 |
15 | Vehicle | 18 | 846 | 4 |
16 | Sonar | 60 | 208 | 2 |
17 | Hillvalley | 100 | 606 | 2 |
18 | Musk 1 | 166 | 476 | 2 |
19 | Splice | 60 | 1000 | 2 |
20 | KRFP * | 4860 | 215 | 2 |
21 | Soybean-small | 35 | 47 | 4 |
22 | Liver disorders | 6 | 345 | 2 |
23 | Credit | 15 | 690 | 2 |
24 | Tic-tac-toe | 9 | 985 | 2 |
25 | Libras movement | 90 | 360 | 15 |
Parameter | Values Used for Different Datasets |
---|---|
Scaling factor, C | (0.1, 0.05, 0.02) |
Termination criteria | (500, 1000, 10,000) iterations |
Method | Description |
---|---|
SFS, SBS | Sequential Forward Selection and Sequential Backward Selection * [22] (2015) |
FS-FS | Feature Similarity Technique [49] (2002) |
FR-FS | Fuzzy Rule Based Technique [50] (2012) |
SFSW | An Evolutionary Multi-Objective Optimization Approach [22] (2015) |
DEMOFS | Differential Evolution Based Multi-Objective Feature Selection * [22] (2014) |
BA | Bat Algorithm and Optimum-Path Forest Based Wrapper Approach [26] (2014) |
PSO | Particle Swarm Optimization Based Method * [26] (2014) |
SCE, CCE | Shannon’s Entropy Reduction, Complementary Entropy Reduction * [39] (2016) |
PDE-2 | Partition Differential Entropy Based Method [39] (2016) |
Dataset | Avg. Acc. # Sel. Feat. | ||||||||
---|---|---|---|---|---|---|---|---|---|
MOTiFS | SFSW [22] | SFS [22] | SBS [22] | FS-FS [49] | FR-FS [50] | DEMOFS [22] | BA [26] | PSO [26] | |
Spambase | 0.907 ± 0.003 | 0.885 | 0.874 | 0.870 | 0.900 | ||||
31.5 | 26.0 | 35.7 | 37.3 | 29.0 | |||||
WBC | 0.968 ± 0.001 | 0.961 | 0.960 | 0.951 | 0.956 | ||||
5.52 | 4.2 | 6.4 | 7.3 | 4.0 | |||||
Ionosphere | 0.889 ± 0.007 | 0.883 | 0.887 | 0.859 | 0.788 | 0.844 | 0.780 | 0.790 | |
12.32 | 11.5 | 1.2 | 9.1 | 16.0 | 4.33 | 21.0 | 14.0 | ||
Arrhythmia | 0.650 ± 0.003 | 0.658 | 0.599 | 0.580 | 0.589 | ||||
94.4 | 100.0 | 89.4 | 49.2 | 100.0 | |||||
Multiple features | 0.980 ± 0.001 | 0.979 | 0.903 | 0.912 | 0.783 | ||||
321.84 | 270.0 | 210.0 | 305.0 | 325.0 | |||||
Waveform | 0.816 ±0.002 | 0.837 | 0.778 | 0.785 | 0.752 | ||||
19.42 | 16.0 | 18.4 | 18.3 | 20.0 | |||||
WBDC | 0.967 ± 0.004 | 0.941 | 0.901 | 0.898 | 0.936 | ||||
15.42 | 13.5 | 13.9 | 17.8 | 2.14 | |||||
Glass | 0.705 ± 0.003 | 0.678 | 0.631 | 0.636 | 0.615 | ||||
4.80 | 4.4 | 5.8 | 7.0 | 6.96 | |||||
Wine | 0.963 ± 0.004 | 0.961 | 0.914 | 0.914 | 0.955 | 0.897 | |||
7.52 | 6.9 | 6.0 | 7.5 | 4.38 | 6.0 | ||||
Australian | 0.850 ± 0.002 | 0.846 | 0.830 | 0.828 | 0.773 | ||||
6.98 | 4.7 | 3.7 | 3.0 | 4.0 | |||||
German number | 0.725 ± 0.008 | 0.713 | 0.682 | 0.658 | 0.701 | ||||
11.46 | 10.5 | 12.2 | 10.8 | 1.0 | |||||
Zoo | 0.920 ± 0.022 | 0.954 | 0.949 | 0.980 | 0.954 | ||||
9.06 | 11.0 | 9.0 | 13.0 | 11.0 | |||||
Breast cancer | 0.967 ± 0.003 | 0.965 | 0.951 | 0.949 | 0.940 | 0.930 | |||
6.14 | 4.3 | 6.10 | 6.10 | 5.0 | 5.0 | ||||
DNA | 0.810 ± 0.006 | 0.831 | 0.822 | 0.823 | 0.760 | 0.760 | |||
89.26 | 71.8 | 18.8 | 20.6 | 96.0 | 91.0 | ||||
Vehicle | 0.721 ± 0.008 | 0.653 | 0.686 | 0.673 | |||||
10.14 | 9.1 | 10.8 | 10.7 | ||||||
Sonar | 0.850 ± 0.002 | 0.827 | 0.729 | 0.786 | |||||
28.96 | 20.0 | 5.85 | 10.0 | ||||||
Hillvalley | 0.535 ± 0.003 | 0.575 | 0.605 | ||||||
45.18 | 40.0 | 26.0 | |||||||
Musk 1 | 0.852 ± 0.003 | 0.815 | 0.835 | ||||||
81.34 | 59.3 | 58.0 | |||||||
Splice | 0.778 ± 0.002 | 0.680 | 0.670 | ||||||
25.66 | 28.0 | 28.0 | |||||||
KRFP | 0.896 ± 0.001 | 0.842 * | 0.884 * | ||||||
2390.2 | 6.0 | 1866 |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chaudhry, M.U.; Lee, J.-H. MOTiFS: Monte Carlo Tree Search Based Feature Selection. Entropy 2018, 20, 385. https://doi.org/10.3390/e20050385
Chaudhry MU, Lee J-H. MOTiFS: Monte Carlo Tree Search Based Feature Selection. Entropy. 2018; 20(5):385. https://doi.org/10.3390/e20050385
Chicago/Turabian StyleChaudhry, Muhammad Umar, and Jee-Hyong Lee. 2018. "MOTiFS: Monte Carlo Tree Search Based Feature Selection" Entropy 20, no. 5: 385. https://doi.org/10.3390/e20050385
APA StyleChaudhry, M. U., & Lee, J. -H. (2018). MOTiFS: Monte Carlo Tree Search Based Feature Selection. Entropy, 20(5), 385. https://doi.org/10.3390/e20050385