Abstract
Random forest (RF) models have been shown to perform well in both classification and regression. However, because randomization is used both in drawing bagged samples and in selecting features, the performance of RFs can deteriorate on high-dimensional data. In this paper, we propose a new feature sampling approach for RFs to deal with high-dimensional data. We first use \(p\)-values to assess feature importance and to find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using statistical measures. When sampling the feature subspace for learning RFs, features from all three groups (highly informative, informative, and less informative) are taken into account. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in regression problems for robustness to outliers. The experimental results demonstrate that the proposed approach for learning random forests significantly reduces prediction errors and outperforms most existing random forest methods when dealing with high-dimensional data.
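The three-group subspace sampling outlined in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's exact procedure: the function names, the threshold `alpha`, the use of the mean importance score to split informative features into two groups, and the rule of drawing at least one feature from each non-empty group are all assumptions made for the example. The \(p\)-values and importance scores are taken as given inputs.

```python
import numpy as np


def partition_features(p_values, scores, alpha=0.05):
    """Split feature indices into (high, medium, low) groups.

    Assumptions for illustration only: features with p < alpha count as
    informative; informative features whose score is at or above the mean
    score of the informative set count as highly informative.
    """
    p_values = np.asarray(p_values, dtype=float)
    scores = np.asarray(scores, dtype=float)
    informative = np.flatnonzero(p_values < alpha)
    low = np.flatnonzero(p_values >= alpha)
    if informative.size == 0:
        return informative, informative, low
    cut = scores[informative].mean()
    high = informative[scores[informative] >= cut]
    medium = informative[scores[informative] < cut]
    return high, medium, low


def sample_subspace(groups, mtry, rng):
    """Draw mtry distinct features with at least one from each non-empty group."""
    non_empty = [g for g in groups if g.size > 0]
    total = sum(g.size for g in non_empty)
    if not non_empty or not (len(non_empty) <= mtry <= total):
        raise ValueError("mtry must cover every non-empty group and not exceed the feature count")
    # One guaranteed draw per group keeps every group represented.
    picked = [int(rng.choice(g)) for g in non_empty]
    n_extra = mtry - len(picked)
    if n_extra > 0:
        # Fill the rest uniformly at random from the features not yet picked.
        rest = np.setdiff1d(np.concatenate(non_empty), picked)
        extra = rng.choice(rest, size=n_extra, replace=False)
        picked.extend(int(i) for i in extra)
    return np.sort(np.array(picked, dtype=int))


# Hypothetical inputs: six features with made-up p-values and scores.
rng = np.random.default_rng(0)
groups = partition_features(
    p_values=[0.001, 0.002, 0.03, 0.04, 0.5, 0.9],
    scores=[10.0, 9.0, 2.0, 1.0, 0.5, 0.1],
)
subspace = sample_subspace(groups, mtry=4, rng=rng)
```

Each tree node would receive a fresh `subspace` drawn this way, so the candidate split features always include some informative ones while the less informative group still contributes to the diversity of the forest.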
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Nguyen, TT., Zhao, H., Huang, J.Z., Nguyen, T.T., Li, M.J. (2015). A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_36
DOI: https://doi.org/10.1007/978-3-319-18032-8_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18031-1
Online ISBN: 978-3-319-18032-8
eBook Packages: Computer Science (R0)