Abstract
Promoters are modular DNA structures containing complex regulatory elements required for gene transcription initiation. Hence, identifying promoters with machine learning approaches is important for improving genome annotation and understanding transcriptional regulation. In recent years, many methods have been proposed for predicting eukaryotic and prokaryotic promoters. However, the performance of these methods is still far from satisfactory. In this article, we develop a hybrid approach (called IPMD) that combines a position correlation score function and the increment of diversity with a modified Mahalanobis discriminant to predict eukaryotic and prokaryotic promoters. Applying the proposed method to Drosophila melanogaster, Homo sapiens, Caenorhabditis elegans, Escherichia coli, and Bacillus subtilis promoter sequences, we achieve sensitivities and specificities of 90.6% and 97.4% for D. melanogaster, 88.1% and 94.1% for H. sapiens, 83.3% and 95.2% for C. elegans, 84.9% and 91.4% for E. coli, and 80.4% and 91.3% for B. subtilis, respectively. These high accuracies indicate that IPMD is an efficient method for identifying eukaryotic and prokaryotic promoters. The approach can also be extended to predict promoters in other species.
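The abstract names the increment of diversity as one ingredient of IPMD but does not give its formula. As a rough illustration only, the sketch below uses the commonly cited definition of the diversity measure, D(X) = N log N − Σ nᵢ log nᵢ, and ID(X, Y) = D(X + Y) − D(X) − D(Y), applied to k-mer counts. The helper names (`kmer_counts`, `diversity`, `increment_of_diversity`), the choice of k, and the use of k-mer counts as the feature are assumptions made for this example and are not taken from the article.

```python
import math
from collections import Counter


def kmer_counts(seq, k=2):
    """Count overlapping k-mers in a DNA sequence (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def diversity(counts):
    """Diversity measure D(X) = N*log(N) - sum(n_i*log(n_i)).

    Terms with n_i == 0 are skipped, i.e. 0*log(0) is treated as 0.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts.values() if n > 0
    )


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y).

    Values near 0 mean the two count profiles have similar proportions;
    larger values mean they differ more.
    """
    merged = Counter(x)
    merged.update(y)  # element-wise sum of the two count profiles
    return diversity(merged) - diversity(x) - diversity(y)


# Illustrative usage with made-up sequences (not data from the article).
query = kmer_counts("TATAAAGGCTATAAT")
reference = kmer_counts("TTGACATATAATGCG")
score = increment_of_diversity(query, reference)
```

In a classifier of this kind, a query sequence would typically be compared against a promoter profile and a non-promoter profile, and the resulting scores passed on as features to the discriminant; how IPMD does this in detail is not stated in the abstract.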
Acknowledgments
The authors are grateful to the anonymous reviewers for their valuable suggestions and comments, which have led to the improvement of this article. This study was supported in part by the Fundamental Research Funds for the Central Universities (ZYGX2009J081) and the National Natural Science Foundation of China (61063016).
Cite this article
Lin, H., Li, QZ. Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci. 130, 91–100 (2011). https://doi.org/10.1007/s12064-010-0114-8
DOI: https://doi.org/10.1007/s12064-010-0114-8