Abstract
DNA sequence classification involves attributing sub-strings or words within a sequence to known distinct sequence classes. A query sequence was classified by comparing all of its words to words in databases representative of three classes of DNA, transcriptional promoters, exons and introns. The efficiency of this comparision was increased by constructing directed, acyclic word graphs (DAWGs) of all sequences and databases. The resulting landscape was scored to determine the preference of words in the query sequence for any one particular database class. Using this approach it was possible to detect 94% of a test set of individual promoter sequences, with only 4% incorrect detection of test exon sequences as promoters. Preliminary attempts were made to parse genomic DNA into promoter, exon and intron regions. Initial results indicate that a reasonably high degree of correlation exists between the predicted regions and known promoter-exon-intron domains.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D. and McConnell, R.: Building the minimal DFA for the set of all subwords of a word on-line in linear time. Lect. Notes Comp. Sci. 172 (1984) 109–118
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T. and Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoret. Comp. Sci. 40 (1985) 31–55
Bucher, P.: The eukaryotic promoter database EPD, EMBL nucleotide sequence data library, Release 48.
Burset, M. and Guigó, R.: Evaluation of gene structure prediction programs. Genomics, 34 (1996) 353–367
Chen, Q.K., Hertz, G.Z. and Stormo, G.D.: PromFD 1.0: a computer program that predicts eukaryotic pol II promoters using strings and IMD matrices. CABIOS (1997) In Press
Clift, B., Haussler, D., McConnell, R., Schneider, T.D. and Stormo, G.D.: Sequence landscapes. Nucl. Acids Res. 14 (1986) 141–158
Fickett, J.W. and Tung, C.S.: Assessment of protein coding measures. Nucl. Acids Res. 20 (1992) 6441–5450
Fields, C.A. and Soderlund, C.A.: Gm: a practical tool for automating DNA sequence analysis. Comp. Appl. Biosci. 6 (1990) 263–270
Guigó, R., Knudsen, S., Drake, N. and Smith, T.: Prediction of gene structure. J. Mol. Biol. 226 (1992) 141–157
Hutchinson, G.B.: The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comp. Appl. Biosci. 12 (1996) 391–398
Manber, U. and Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22 (1993) 935–948
Mathews, B.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405 (1975) 442–451
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P.: Numerical recipes in C. The art of scientific computing. 2nd Edition. Cambridge University Press. pp1–994
Prestridge, D.: Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol. 249 (1995) 923–932
Reese, M.G., Eeckman, F.H., Kulp, D. and Haussler, D.: Improved splice site detection in Genie. Recomb 97 (1997) 232–240
Snyder, E.E. and Stormo, G.D.: Identificationof Protein Coding Regions in Genomic DNA. J. Mol. Biol. 248 (1995) 1–18
Snyder, E.E. and Stormo, G.D.: Identification of coding regions in genomics DNA sequences: an application of dynamic programming and neural networks. Nucl. Acids Res. 21 (1993) 607–613
Solovyev, V.V., Salamov, A.A. and Lawrence, C.B.: Predicting internal exons by oligonucleotide composition and discriminant analysis of splicable open reading frames. Nucl. Acids Res. 22 (1994) 5156–5163
Stormo, G.D. and Haussler, D.: Optimally Parsing a Sequence into Different Classes Based on Multiple Types of Evidence. In: Proceedings of the Second International Conference on Intelligent Systems in Molecular Biology, pp. 369–375.
Uberbacher, E.C. and Mural, R.J.: Locating protein coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. 388 (1991) 11261–11265
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Levy, S., Stormo, G.D. (1997). DNA sequence classification using DAWGs. In: Mycielski, J., Rozenberg, G., Salomaa, A. (eds) Structures in Logic and Computer Science. Lecture Notes in Computer Science, vol 1261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63246-8_21
Download citation
DOI: https://doi.org/10.1007/3-540-63246-8_21
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63246-7
Online ISBN: 978-3-540-69242-3
eBook Packages: Springer Book Archive