Gene selection and classification of microarray data using random forest

doi:10.1186/1471-2105-7-3

BMC Bioinformatics. 2006 Jan 6;7:3.

Ramón Díaz-Uriarte¹, Sara Alvarez de Andrés

PMID: 16398926
PMCID: PMC1363357
DOI: 10.1186/1471-2105-7-3

Ramón Díaz-Uriarte et al. BMC Bioinformatics. 2006.

Affiliation

¹ Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernandez Almagro 3, Madrid, 28029, Spain. rdiaz@ligarto.org

Abstract

Background: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

Results: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.

Conclusion: Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
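The kind of procedure the Results paragraph describes can be sketched roughly as follows: fit a random forest, rank genes by variable importance, and repeatedly discard the least important genes while tracking the out-of-bag (OOB) error. The abstract does not spell out the elimination rule, so the backward-elimination loop, the 20% drop fraction, the tie-breaking rule, the synthetic data, and the use of scikit-learn's `RandomForestClassifier` (whose `feature_importances_` are impurity-based) are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_toy_expression(n_samples=60, n_genes=200, n_informative=5, seed=0):
    """Hypothetical 'microarray': mostly noise genes plus a few informative ones."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1, 2], n_samples // 3)        # three classes
    X = rng.normal(size=(n_samples, n_genes))
    for g in range(n_informative):
        X[:, g] += 2.0 * (y == g % 3)               # shift one class per informative gene
    return X, y


def backward_eliminate(X, y, drop_fraction=0.2, min_genes=2, n_trees=200, seed=0):
    """Drop the least important genes step by step (assumed rule).

    Returns the gene subset with the lowest OOB error seen; ties go to the
    smaller subset because later (smaller) subsets overwrite earlier ones.
    """
    genes = np.arange(X.shape[1])
    best_genes, best_err = genes, np.inf
    while True:
        rf = RandomForestClassifier(
            n_estimators=n_trees, oob_score=True, random_state=seed
        )
        rf.fit(X[:, genes], y)
        oob_err = 1.0 - rf.oob_score_               # OOB error of this subset
        if oob_err <= best_err:
            best_genes, best_err = genes, oob_err
        if len(genes) <= min_genes:
            break
        # Keep the most important (1 - drop_fraction) share of the current genes.
        n_keep = max(min_genes, int(len(genes) * (1 - drop_fraction)))
        order = np.argsort(rf.feature_importances_)[::-1]
        genes = genes[np.sort(order[:n_keep])]
    return best_genes, best_err


if __name__ == "__main__":
    X, y = make_toy_expression()
    selected, err = backward_eliminate(X, y)
    print(f"{len(selected)} genes selected, OOB error {err:.3f}")
```

Because the OOB error is reused to choose the subset, it is an optimistic estimate of the final error; an honest estimate would wrap the whole selection loop in an outer resampling scheme.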

Figures

**Figure 1**
**Out-of-Bag (OOB) vs *mtryFactor* for the nine microarray data sets**. *mtryFactor* is the multiplicative factor of the default *mtry* ($\sqrt{\text{number of genes}}$); thus, an *mtryFactor* of 3 means the number of genes tried at each split is $3 \cdot \sqrt{\text{number of genes}}$; an *mtryFactor* of 0 means the number of genes tried was 1; the *mtryFactors* examined were {0, 0.05, 0.1, 0.17, 0.25, 0.33, 0.5, 0.75, 0.8, 1, 1.15, 1.33, 1.5, 2, 3, 4, 5, 6, 8, 10, 13}. Results are shown for six different values of *ntree* = {1000, 2000, 5000, 10000, 20000, 40000}, with *nodesize* = 1.
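The two quantities on the axes of Figure 1 can be made concrete with a small sketch. The first function turns an *mtryFactor* into the number of genes tried at each split, following the caption (a factor of 0 maps to 1 gene). The second draws one bootstrap sample and reports which observations are left out-of-bag for that tree. The function names, the use of `floor` for rounding, and the cap at the total number of genes are assumptions; the caption does not state them.

```python
import math
import random


def mtry_from_factor(mtry_factor, n_genes):
    """Genes tried per split: mtry_factor * sqrt(n_genes), clamped to [1, n_genes].

    Flooring and the upper cap are assumptions; the caption only states that
    a factor of 0 corresponds to trying a single gene.
    """
    mtry = math.floor(mtry_factor * math.sqrt(n_genes))
    return min(max(mtry, 1), n_genes)


def bootstrap_with_oob(n_samples, rng):
    """Draw one bootstrap sample; indices never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


if __name__ == "__main__":
    for factor in (0, 0.5, 1, 3, 13):
        print(factor, mtry_from_factor(factor, n_genes=2000))
    in_bag, oob = bootstrap_with_oob(50, random.Random(0))
    print(f"{len(oob)} of 50 samples are out-of-bag for this tree")
```

With 2000 genes, the default *mtry* under these assumptions is 44 and an *mtryFactor* of 3 gives 134. Each tree leaves roughly a third of the samples out-of-bag, and the OOB error is the error of predictions made for each sample using only the trees that did not see it.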

Cited by

Functional knowledge transfer for high-accuracy prediction of under-studied biological processes.
Park CY, Wong AK, Greene CS, Rowland J, Guan Y, Bongo LA, Burdine RD, Troyanskaya OG. PLoS Comput Biol. 2013;9(3):e1002957. doi: 10.1371/journal.pcbi.1002957. Epub 2013 Mar 14. PMID: 23516347.
Transcriptome classification reveals molecular subtypes in psoriasis.
Ainali C, Valeyev N, Perera G, Williams A, Gudjonsson JE, Ouzounis CA, Nestle FO, Tsoka S. BMC Genomics. 2012 Sep 12;13:472. doi: 10.1186/1471-2164-13-472. PMID: 22971201.
Robust edge-based biomarker discovery improves prediction of breast cancer metastasis.
Adnan N, Lei C, Ruan J. BMC Bioinformatics. 2020 Sep 30;21(Suppl 14):359. doi: 10.1186/s12859-020-03692-2. PMID: 32998692.
Random Forests for Global and Regional Crop Yield Predictions.
Jeong JH, Resop JP, Mueller ND, Fleisher DH, Yun K, Butler EE, Timlin DJ, Shim KM, Gerber JS, Reddy VR, Kim SH. PLoS One. 2016 Jun 3;11(6):e0156571. doi: 10.1371/journal.pone.0156571. eCollection 2016. PMID: 27257967.
Automated identification of protein-ligand interaction features using Inductive Logic Programming: a hexose binding case study.
A Santos JC, Nassif H, Page D, Muggleton SH, E Sternberg MJ. BMC Bioinformatics. 2012 Jul 11;13:162. doi: 10.1186/1471-2105-13-162. PMID: 22783946.
