Abstract
Solving the problem of reporting all occurrences of patterns containing variable length gaps in an input text T efficiently is important for various applications in a broad range of domains such as Bioinformatics or Natural Language Processing. In this paper we present an efficient solution for static inputs which utilizes the wavelet tree of the suffix array. The algorithm partially traverses the wavelet tree to find matches and can be easily adapted to several variants of the problem. We explore the practical properties of our solution in an experimental study where we compare to online and semi-indexed solutions using standard datasets. The experiments show that our approach is the best choice for searching patterns with many gaps in large texts.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
I.e. any two match tuples \({\langle }i_0\ldots i_{k-1}{\rangle }\) and \({\langle }i'_0\ldots i'_{k-1}{\rangle }\) spanning the intervals \([i_0,i_{k-1}+m_{k-1}-1]\) and \([i'_0,i'_{k-1}+m_{k-1}-1]\) do not overlap.
References
Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)
Baeza-Yates, R.: A fast set intersection algorithm for sorted sequences. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 400–408. Springer, Heidelberg (2004)
Bille, P., Gørtz, I.L.: Substring range reporting. Algorithmica 69(2), 384–396 (2014)
Bille, P., Thorup, M.: Regular expression matching with multi-strings and intervals. In: Proceedings of SODA, pp. 1297–1308 (2010)
Bille, P., Gørtz, I.L., Vildhøj, H.W., Wind, D.K.: String matching with variable length gaps. Theor. Comput. Sci. 443, 25–34 (2012)
Fredriksson, K., Grabowski, S.: Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retrieval 11(4), 335–357 (2008)
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Heidelberg (2014)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of SODA, pp. 841–850 (2003)
Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P.S., Pagni, M., Sigrist, C.J.A.: The PROSITE database. Nucleic Acids Res. 34(suppl 1), D227–D230 (2006)
Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Soft. Prac. Exp. 45(1), 1–29 (2015)
Lewenstein, M.: Indexing with gaps. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 135–143. Springer, Heidelberg (2011)
Lopez, A.: Hierarchical phrase-based translation with suffix arrays. In: Proceedings of EMNLP-CoNLL, pp. 976–985 (2007)
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Metzler, D., Croft, W.B.: A Markov random field model for term dependencies. In: Proceedings of SIGIR, pp. 472–479 (2005)
Mihalcea, R., Tarau, P., Figa, E.: Pagerank on semantic networks, with application to word sense disambiguation. In: Proceedings of COLING (2004)
Morgante, M., Policriti, A., Vitacolonna, N., Zuccolo, A.: Structured motifs search. J. Comput. Biol. 12(8), 1065–1082 (2005)
Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)
Rahman, M.S., Iliopoulos, C.S., Lee, I., Mohamed, M., Smyth, W.F.: Finding patterns with variable length gaps or don’t cares. In: Chen, D.Z., Lee, D.T. (eds.) COCOON 2006. LNCS, vol. 4112, pp. 146–155. Springer, Heidelberg (2006)
Thompson, K.: Regular expression search algorithm. Commun. ACM 11(6), 419–422 (1968)
Acknowledgement
We are grateful to Timo Bingmann for profiling our initial implementation. This work was supported under the Australian Research Council’s Discovery Projects scheme (project DP140103256) and Deutsche Forschungsgemeinschaft.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Bader, J., Gog, S., Petri, M. (2016). Practical Variable Length Gap Pattern Matching. In: Goldberg, A., Kulikov, A. (eds) Experimental Algorithms. SEA 2016. Lecture Notes in Computer Science(), vol 9685. Springer, Cham. https://doi.org/10.1007/978-3-319-38851-9_1
Download citation
DOI: https://doi.org/10.1007/978-3-319-38851-9_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-38850-2
Online ISBN: 978-3-319-38851-9
eBook Packages: Computer ScienceComputer Science (R0)