A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data

Nguyen, Thanh-Tung; Zhao, He; Huang, Joshua Zhexue; Nguyen, Thuy Thi; Li, Mark Junjie

doi:10.1007/978-3-319-18032-8_36

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9078)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Random forest (RF) models have been shown to perform well in both classification and regression. However, because RF randomizes both the bagged samples and the feature selection, its performance can deteriorate on high-dimensional data. In this paper, we propose a new feature sampling approach for RF to deal with high-dimensional data. We first use \(p\)-values to assess feature importance and to find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative, using statistical measures. When sampling the feature subspace for learning RFs, features from all three groups (highly informative, informative, and less informative) are taken into account. The new subspace sampling method maintains the diversity and randomness of the forest and makes it possible to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem, for robustness to outliers. The experimental results demonstrate that the proposed approach for learning random forests significantly reduces prediction errors and outperforms most existing random forest methods on high-dimensional data.
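The three-group subspace sampling described in the abstract can be illustrated with a minimal sketch. It assumes that per-feature \(p\)-values and importance scores have already been computed. The cut-off `alpha`, the median split of the informative features, and the "one feature per group, then fill at random" rule are illustrative assumptions, not the statistical measures or the sampling scheme used in the paper. The function names are hypothetical.

```python
import random


def partition_features(p_values, scores, alpha=0.05):
    """Split feature indices into (high, medium, weak) groups.

    p_values[i] and scores[i] are assumed to be precomputed for feature i.
    alpha and the median split are illustrative choices only.
    """
    informative = [i for i, p in enumerate(p_values) if p < alpha]
    weak = [i for i, p in enumerate(p_values) if p >= alpha]
    # Rank informative features by importance score, best first.
    ranked = sorted(informative, key=lambda i: scores[i], reverse=True)
    half = (len(ranked) + 1) // 2
    return ranked[:half], ranked[half:], weak


def sample_subspace(groups, mtry, rng):
    """Draw mtry distinct features so that each non-empty group is represented."""
    non_empty = [g for g in groups if g]
    pool = [f for g in non_empty for f in g]
    mtry = min(mtry, len(pool))
    # One feature per non-empty group first (as many groups as mtry allows).
    chosen = [rng.choice(g) for g in non_empty[:mtry]]
    # Fill the remaining slots uniformly from the features not yet chosen.
    rest = [f for f in pool if f not in chosen]
    chosen += rng.sample(rest, mtry - len(chosen))
    return chosen


# Toy inputs: six features with made-up p-values and importance scores.
p_values = [0.001, 0.2, 0.03, 0.5, 0.01, 0.9]
scores = [0.9, 0.1, 0.4, 0.05, 0.7, 0.02]

high, medium, weak = partition_features(p_values, scores)
subspace = sample_subspace((high, medium, weak), mtry=4, rng=random.Random(0))
print(high, medium, weak)
print(sorted(subspace))
```

In a forest, `sample_subspace` would be called again for each new subspace, so the trees still see different random feature subsets while every subset contains some informative features.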

Author information

Authors and Affiliations

Faculty of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
Thanh-Tung Nguyen
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People’s Republic of China
He Zhao
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Joshua Zhexue Huang & Mark Junjie Li
Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi, Vietnam
Thuy Thi Nguyen

Corresponding author

Correspondence to Mark Junjie Li.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tru Cao
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu-Bao Ho
The University of Hong Kong, Hong Kong, Hong Kong SAR
David Cheung
Osaka University, Osaka, Japan
Hiroshi Motoda

About this paper

Cite this paper

Nguyen, TT., Zhao, H., Huang, J.Z., Nguyen, T.T., Li, M.J. (2015). A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_36

DOI: https://doi.org/10.1007/978-3-319-18032-8_36
Published: 09 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18031-1
Online ISBN: 978-3-319-18032-8
eBook Packages: Computer Science, Computer Science (R0)
