Abstract
The recently developed serial analysis of gene expression (SAGE) technology makes it possible to quantify the expression levels of thousands of genes simultaneously in a population of cells, and SAGE data are useful for classifying different types of cancer. This task poses two main challenges. First, the number of available samples is small compared with the very large number of genes, many of which are irrelevant for classification. Second, appropriate statistical methods that take the specific properties of SAGE data into account are lacking. We propose an efficient solution that selects relevant genes by information gain and builds a multinomial event model for SAGE data. The proposed model yielded promising results in terms of classification accuracy.
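The two steps named in the abstract, gene selection by information gain followed by a multinomial event naive Bayes classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the present/absent binarization used when computing information gain, the Laplace smoothing constant `alpha`, and the toy tag counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(counts, labels, tag):
    """Gain from splitting the libraries on whether `tag` was observed at all.

    Binarizing each tag as present/absent is an assumption of this sketch.
    """
    present = [y for row, y in zip(counts, labels) if row[tag] > 0]
    absent = [y for row, y in zip(counts, labels) if row[tag] == 0]
    conditional = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional


def select_tags(counts, labels, k):
    """Indices of the k tags with the highest information gain."""
    n_tags = len(counts[0])
    ranked = sorted(
        range(n_tags), key=lambda t: information_gain(counts, labels, t), reverse=True
    )
    return ranked[:k]


def train_multinomial_nb(counts, labels, tags, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log tag probabilities per class."""
    model = {}
    for c in set(labels):
        rows = [row for row, y in zip(counts, labels) if y == c]
        tag_totals = [sum(row[t] for row in rows) for t in tags]
        denom = sum(tag_totals) + alpha * len(tags)
        model[c] = (
            math.log(len(rows) / len(labels)),
            [math.log((n + alpha) / denom) for n in tag_totals],
        )
    return model


def predict(model, tags, row):
    """Return the class with the highest log score for one library.

    The multinomial coefficient is omitted because it is the same for every class.
    """
    def score(c):
        log_prior, log_probs = model[c]
        return log_prior + sum(row[t] * lp for t, lp in zip(tags, log_probs))

    return max(model, key=score)


# Toy tag-count matrix (rows: libraries, columns: tags); the values are invented.
X = [
    [30, 0, 5, 2],
    [25, 1, 4, 0],
    [0, 40, 6, 1],
    [2, 35, 5, 0],
]
y = ["tumor", "tumor", "normal", "normal"]

selected = select_tags(X, y, k=2)
model = train_multinomial_nb(X, y, selected)
prediction = predict(model, selected, [28, 2, 5, 1])
```

In the multinomial event model each library is treated as a bag of tag occurrences, so a tag's raw count enters the class score as a multiplier on its log probability. This differs from a multivariate Bernoulli model, which would use only whether a tag occurred.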
Cite this article
Jin, X., Zhou, W. & Bie, R. Multinomial event naive Bayesian modeling for SAGE data classification. Computational Statistics 22, 133–143 (2007). https://doi.org/10.1007/s00180-007-0029-0
DOI: https://doi.org/10.1007/s00180-007-0029-0