DNA sequence classification using DAWGs | SpringerLink
Skip to main content

DNA sequence classification using DAWGs

  • Computational Molecular Biology
  • Chapter
  • First Online:
Structures in Logic and Computer Science

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1261))

  • 154 Accesses

Abstract

DNA sequence classification involves attributing sub-strings or words within a sequence to known distinct sequence classes. A query sequence was classified by comparing all of its words to words in databases representative of three classes of DNA, transcriptional promoters, exons and introns. The efficiency of this comparision was increased by constructing directed, acyclic word graphs (DAWGs) of all sequences and databases. The resulting landscape was scored to determine the preference of words in the query sequence for any one particular database class. Using this approach it was possible to detect 94% of a test set of individual promoter sequences, with only 4% incorrect detection of test exon sequences as promoters. Preliminary attempts were made to parse genomic DNA into promoter, exon and intron regions. Initial results indicate that a reasonably high degree of correlation exists between the predicted regions and known promoter-exon-intron domains.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  • Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D. and McConnell, R.: Building the minimal DFA for the set of all subwords of a word on-line in linear time. Lect. Notes Comp. Sci. 172 (1984) 109–118

    Google Scholar 

  • Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T. and Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoret. Comp. Sci. 40 (1985) 31–55

    Google Scholar 

  • Bucher, P.: The eukaryotic promoter database EPD, EMBL nucleotide sequence data library, Release 48.

    Google Scholar 

  • Burset, M. and Guigó, R.: Evaluation of gene structure prediction programs. Genomics, 34 (1996) 353–367

    Google Scholar 

  • Chen, Q.K., Hertz, G.Z. and Stormo, G.D.: PromFD 1.0: a computer program that predicts eukaryotic pol II promoters using strings and IMD matrices. CABIOS (1997) In Press

    Google Scholar 

  • Clift, B., Haussler, D., McConnell, R., Schneider, T.D. and Stormo, G.D.: Sequence landscapes. Nucl. Acids Res. 14 (1986) 141–158

    Google Scholar 

  • Fickett, J.W. and Tung, C.S.: Assessment of protein coding measures. Nucl. Acids Res. 20 (1992) 6441–5450

    Google Scholar 

  • Fields, C.A. and Soderlund, C.A.: Gm: a practical tool for automating DNA sequence analysis. Comp. Appl. Biosci. 6 (1990) 263–270

    Google Scholar 

  • Guigó, R., Knudsen, S., Drake, N. and Smith, T.: Prediction of gene structure. J. Mol. Biol. 226 (1992) 141–157

    Google Scholar 

  • Hutchinson, G.B.: The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comp. Appl. Biosci. 12 (1996) 391–398

    Google Scholar 

  • Manber, U. and Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22 (1993) 935–948

    Google Scholar 

  • Mathews, B.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405 (1975) 442–451

    Google Scholar 

  • Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P.: Numerical recipes in C. The art of scientific computing. 2nd Edition. Cambridge University Press. pp1–994

    Google Scholar 

  • Prestridge, D.: Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol. 249 (1995) 923–932

    Google Scholar 

  • Reese, M.G., Eeckman, F.H., Kulp, D. and Haussler, D.: Improved splice site detection in Genie. Recomb 97 (1997) 232–240

    Google Scholar 

  • Snyder, E.E. and Stormo, G.D.: Identificationof Protein Coding Regions in Genomic DNA. J. Mol. Biol. 248 (1995) 1–18

    Google Scholar 

  • Snyder, E.E. and Stormo, G.D.: Identification of coding regions in genomics DNA sequences: an application of dynamic programming and neural networks. Nucl. Acids Res. 21 (1993) 607–613

    Google Scholar 

  • Solovyev, V.V., Salamov, A.A. and Lawrence, C.B.: Predicting internal exons by oligonucleotide composition and discriminant analysis of splicable open reading frames. Nucl. Acids Res. 22 (1994) 5156–5163

    Google Scholar 

  • Stormo, G.D. and Haussler, D.: Optimally Parsing a Sequence into Different Classes Based on Multiple Types of Evidence. In: Proceedings of the Second International Conference on Intelligent Systems in Molecular Biology, pp. 369–375.

    Google Scholar 

  • Uberbacher, E.C. and Mural, R.J.: Locating protein coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. 388 (1991) 11261–11265

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Jan Mycielski Grzegorz Rozenberg Arto Salomaa

Rights and permissions

Reprints and permissions

Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Levy, S., Stormo, G.D. (1997). DNA sequence classification using DAWGs. In: Mycielski, J., Rozenberg, G., Salomaa, A. (eds) Structures in Logic and Computer Science. Lecture Notes in Computer Science, vol 1261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63246-8_21

Download citation

  • DOI: https://doi.org/10.1007/3-540-63246-8_21

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63246-7

  • Online ISBN: 978-3-540-69242-3

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics