Abstract
In computational biology, an important problem is to identify a word of length k present in each of a given set of sequences. Here, we investigate the problem of calculating the probability that such a word exists in a set of r random strings. Existing methods to approximate this probability are either inaccurate when r > 2 or are restricted to Bernoulli models. We introduce two new methods for computing this probability under Bernoulli and Markov models. We present generalizations of the methods to compute the probability of finding a word of length k shared among q of r sequences, and to allow mismatches. We show through simulations that our approximations are significantly more accurate than methods previously published.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990)
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)
Arratia, R., Waterman, M.S.: An Erdős-Rényi law with shifts. Advances in Mathematics 55, 13–23 (1985)
Arratia, R., Waterman, M.S.: Critical Phenomena in sequence matching. The Annals of Probability 13, 1236–1249 (1985)
Blais, E.: Computing Probabilities for Common Substrings in Random Strings. M.Sc. Thesis, McGill University (2006)
Erdős, P., Rényi, A.: On a new law of large numbers. Journal d’Analyse Mathématique 22, 103–111 (1970)
Erdős, P., Révész, P.: On the length of the longest head run. Topics in Information Theory. Coll. Math. Soc. János Bolyai 16, 219–228 (1975)
Feller, W.: An Introduction to Probability Theory and its Applications, 3rd edn., vol. 1. John Wiley & Sons, Chichester (1968)
Fishman, G.S.: Monte Carlo: Concepts, Algorithms, and Apps. Springer, Heidelberg (1996)
Guibas, L.J., Odlyzko, A.M.: String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory, Series A 30, 183–208 (1981)
Harary, F.: Graphical Enumeration. Academic Press, London (1973)
Karlin, S., Ost, F.: Maximal length of common words among random letter sequences. The Annals of Probability 16, 535–563 (1988)
Morgenstern, B., Frech, K., Dress, A., Werner, T.: DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14, 290–294 (1998)
Naus, J., Sheng, K.-N.: Matching among multiple random sequences. Bulletin of Mathematical Biology 59, 483–496 (1997)
Nicodème, P., Salvy, B., Flajolet, P.: Motif statistics. Theoretical Computer Science 287, 593–617 (2002)
Nijenhuis, A., Wilf, H.: Combinatorial Algorithms for Computers and Calculators. Academic Press, London (1978)
Pevzner, P.A., Sze, S.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proc. 8th Inter. Conf. on Int. Sys. for Mol. Biol., pp. 269–278 (2000)
Régnier, M.: A unified approach to word occurrence probabilities. Discrete Applied Mathematics 104, 259–280 (2000)
Régnier, M., Szpankowski, W.: On pattern frequency occurrences in a Markovian sequence. Algorithmica 22, 631–649 (1998)
Sinha, S., Tompa, M.: Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research 30, 5549–5560 (2002)
van Aardenne-Ehrenfest, T., de Bruijn, N.G.: Circuits and trees in oriented linear graphs. Simon Stevin 28, 203–217 (1951)
van Helden, J., André, B., Collado-Vides, J.: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology 281, 827–842 (1998)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Blais, E., Blanchette, M. (2006). Common Substrings in Random Strings. In: Lewenstein, M., Valiente, G. (eds) Combinatorial Pattern Matching. CPM 2006. Lecture Notes in Computer Science, vol 4009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780441_13
Download citation
DOI: https://doi.org/10.1007/11780441_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35455-0
Online ISBN: 978-3-540-35461-1
eBook Packages: Computer ScienceComputer Science (R0)