Abstract
Although real-world text datasets, such as DNA sequences, are far from being uniformly random, string searching average-case algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in \(\mathcal {O}(n\frac{(c\log n)^k}{k!})\) time on average, where \(c>1\) is a constant, using \(\mathcal {O}(n)\) space. Finally, we show that our technique is applicable to several algorithmic problems found in computational biology and elsewhere. The importance of our technique lies on the fact that it is the first one achieving this bound for non-constant k and using \(\mathcal {O}(n)\) space.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
The reason for imposing “\(\alpha >4\)” instead of “\(\alpha >3\)” becomes clear in Sect. 3.3.
References
Abboud, A., Williams, R., Yu, H.: More applications of the polynomial method to algorithm design. In: SODA, SODA 2015, pp. 218–230. Society for Industrial and Applied Mathematics (2015)
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discret. Algorithms 2(1), 53–86 (2004)
Alamro, H., Ayad, L.A.K., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with k-mismatches and applications. In: Tjoa, A.M., Bellatreche, L., Biffl, S., van Leeuwen, J., Wiedermann, J. (eds.) SOFSEM 2018. LNCS, vol. 10706, pp. 636–649. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73117-9_45
Alzamel, M., et al.: Efficient computation of sequence mappability. In: Gagie, T., et al. (eds.) SPIRE 2018. LNCS, vol. 11147, pp. 12–26. Springer, Cham (2018)
Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P., Radoszewski, J., Sung, W.-K.: Faster algorithms for 1-mappability of a sequence. In: Gao, X., Du, H., Han, M. (eds.) COCOA 2017. LNCS, vol. 10628, pp. 109–121. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71147-8_8
Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded hamming distance. Theor. Comput. Sci. 638, 76–90 (2016). Pattern Matching, Text Data Structures and Compression
Apostolico, A., Guerra, C., Pizzi, C.: Alignment free sequence similarity with bounded hamming distance. In: DCC, pp. 183–192. IEEE (2014)
Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000). https://doi.org/10.1007/10719839_9
Bollobás, B., Letzter, S.: Longest common extension. Eur. J. Comb. 68, 242–248 (2018)
Charalampopoulos, P., et al.: Linear-time algorithm for long LCF with \(k\) mismatches. In: CPM. LIPIcs, vol. 105, pp. 23:1–23:16. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC, STOC 2004, pp. 91–100. ACM (2004)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
Derrien, T., et al.: Fast computation and applications of genome mappability. PLoS ONE 7(1), e30377 (2012)
Eades, P., McKay, B.D.: An algorithm for generating subsets of fixed size with a strong minimal change property. Inf. Process. Lett. 19(3), 131–133 (1984)
Farach, M.: Optimal suffix tree construction with large alphabets. In: FOCS, pp. 137–143. IEEE Computer Society (1997)
Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv 45(2), 13:1–13:42 (2013)
Fischer, J.: Inducing the LCP-array. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011. LNCS, vol. 6844, pp. 374–385. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22300-6_32
Flouri, T., Giaquinta, E., Kobert, K., Ukkonen, E.: Longest common substrings with \(k\) mismatches. Inf. Process. Lett. 115(6–8), 643–647 (2015)
Grabowski, S.: A note on the longest common substring with \(k\)-mismatches problem. Inf. Process. Lett. 115(6–8), 640–642 (2015)
Horwege, S., et al.: Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 42(Webserver-Issue), 7–11 (2014)
Karlin, S., Ghandour, G., Ost, F., T, S., Korn, L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80, 5660–5664 (1983)
Kociumaka, T., Radoszewski, J., Starikovskaya, T.A.: Longest common substring with approximately \(k\) mismatches. CoRR, abs/1712.08573 (2017)
Kolpakov, R., Bana, G., Kucherov, G.: MREPS: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31(13), 3672–3678 (2003)
Kucherov, G., Tsur, D.: Improved filters for the approximate suffix-prefix overlap problem. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 139–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11918-2_14
Leimeister, C., Morgenstern, B.: Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014)
Liang, K.-H.: Bioinformatics for Biomedical Science and Clinical Applications. Woodhead Publishing Series in Biomedicine. Woodhead Publishing (2013)
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
Navarro, G., Baeza-Yates, R.A.: A hybrid indexing method for approximate string matching. J. Discret. Algorithms 1(1), 21–49 (2000)
Nong, G., Zhang, S., Chan, W.H.: Linear suffix array construction by almost pure induced-sorting. In: DCC, pp. 193–202. IEEE (2009)
Pizzi, C.: Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms Mol. Biol. 11(1), 6 (2016)
Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient \(q\)-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13(2), 296–308 (2006)
Smit, A.F.: Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9(6), 657–663 (1999)
Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S.: Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: Raphael, B.J. (ed.) RECOMB 2018. LNCS, vol. 10812, pp. 211–224. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89929-9_14
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016)
Thankachan, S.V., Chockalingam, S.P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation. J. Comput. Biol., 23(6), 452–460 (2016)
Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 13(2), 336–350 (2006)
Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Inf. Comput. 213, 49–58 (2012)
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space theta(n). Inf. Process. Lett. 17(2), 81–84 (1983)
Acknowledgements
We warmly thank Cyril Nicaud (Université Paris-Est) for useful discussions.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
A K-Combinations Generation Process
A K-Combinations Generation Process
We generate all combinations of size K of the set [N] using the following folklore algorithm. We build a tree in a recursive way. Each node v has a label \(\ell (v) \in [N]\), apart from the root which is labeled with 0. We say that a node v represents the set \(r(v)=\{\ell (v)\} \cup \{\ell (u)|u\text { is an ancestor of }v\}\setminus \{0\}\). By construction, no two nodes will represent the same set. At the end of the process each of the \(\mathcal {O}(\genfrac(){0.0pt}1{N}{K})\) leaves will represent a distinct K-combination. The procedure works as follows:
-
1.
It takes as input a node v with satellite data \(\ell (v)\) and |r(v)|.
-
2.
It creates \(N-\ell (v)-K+|r(v)|\) child nodes of v, labeled \(\ell (v)+1\) to \(N-K+|r(v)|+1\).
-
3.
For each newly created node u, if \(|r(u)|=|r(v)|+1<K\), the procedure recursively calls itself.
Initiating this procedure with an input node u, with \(\ell (u)=0\) and \(|r(u)|=0\), all K-combinations are generated in time \(\mathcal {O}(\genfrac(){0.0pt}1{N}{K})\) if \(K\le N/2\) as shown below. Note that each node of the tree is associated with a unique subset of [N] and neighbouring nodes only differ on whether they contain the largest element of their union.
To upper bound the time complexity, it suffices to bound the number of nodes in the tree with a single child, since the rest internal nodes are trivially upper bounded by the number of leaves. A node with label \(N-i\) will have a single child only if it is the \((K-i)\)th element added in a combination. This can happen \(\mathcal {O}(\genfrac(){0.0pt}1{N-i-1}{K-i-1})\) times. Given our assumption that \(K\le N/2\) and using Pascal’s identity \(\genfrac(){0.0pt}1{m}{j}=\genfrac(){0.0pt}1{m-1}{j-1}+\genfrac(){0.0pt}1{m-1}{j}\), we bound the number of nodes with a single child as follows:
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ayad, L.A.K., Barton, C., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P. (2018). Longest Common Prefixes with k-Errors and Applications. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science(), vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_3
Download citation
DOI: https://doi.org/10.1007/978-3-030-00479-8_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00478-1
Online ISBN: 978-3-030-00479-8
eBook Packages: Computer ScienceComputer Science (R0)