Abstract
We describe a compression-aware method to find all-vs-all maximal exact matches (MEM) among strings of a repetitive collection \(\mathcal {T}\). The key concept in our work is the construction of a fully-balanced grammar \(\mathcal {G}\) from \(\mathcal {T}\) that meets a property that we call fix-free: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of \(\mathcal {T}\) incrementally over \(\mathcal {G}\) using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. [7], we show how we can build \(\mathcal {G}\) from \(\mathcal {T}\) in linear time and space. We also demonstrate that our MEM algorithm runs on top of \(\mathcal {G}\) in \(O(G +occ)\) time and uses \(O((G+occ)\log G)\) bits, where G is the grammar size, and occ is the number of MEMs in \(\mathcal {T}\). In the conclusions, we discuss how to modify our idea to perform approximate pattern matching in compressed space.
Supported by Academy of Finland (Grants 323233 and 339070), and by Basal Funds FB0001, Chile (first author).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53–86 (2004)
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
Batu, T., Ergun, F., Sahinalp, C.: Oblivious string embeddings and edit distance approximations. In: Proceedings of the 17th Symposium on Discrete Algorithms (SODA), pp. 792–801 (2006)
Boucher, C., et al.: PHONI: streamed matching statistics with multi-genome references. In: Proceedings of the 21st Data Compression Conference (DCC), pp. 193–202 (2021)
Chang, W.I., Lawler, E.L.: Sublinear approximate string matching and biological applications. Algorithmica 12(4), 327–344 (1994)
Charikar, M., et al.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)
Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms 17(1), 1–39 (2020)
Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19
Claude, F., Navarro, G., Pacheco, A.: Grammar-compressed indexes with logarithmic search time. J. Comput. Syst. Sci. 118, 53–74 (2021)
Cole, R., Vishkin, U.: Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In: Proceedings of the 18th Annual Symposium on Theory of Computing (STOC), pp. 206–219 (1986)
Díaz-Domínguez, D., Navarro, G.: A grammar compressor for collections of reads with applications to the construction of the BWT. In: Proceedings of the 31st Data Compression Conference (DCC), pp. 83–92 (2021)
Fischer, J.: Optimal succinctness for range minimum queries. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 158–169. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12200-2_16
Gagie, T., Navarro, G., Prezza, N.: Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1) (2020). Article 2
Jeż, A.: Approximation of grammar-based compression via recompression. Theor. Comput. Sci. 592, 115–134 (2015)
Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 827–840 (2018)
Kent, W.J.: BLAT-the BLAST-like alignment tool. Genome Res. 12(4), 656–664 (2002)
Kieffer, J., Yang, E.-H.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)
Kurtz, S., et al.: Versatile and open software for comparing large genomes. Genome Biol. 5, 1–9 (2004)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A. (ed.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48194-X_17
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with bowtie 2. Nat. Methods 9(4), 357–359 (2012)
Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013)
Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)
Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design. Cambridge University Press, Cambridge (2015)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Navarro, G.: Computing MEMs on repetitive text collections. In: Proceedings of the 34th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. article 22 (2023)
Nong, G., Zhang, S., Chan, W.H.; Linear suffix array construction by almost pure induced-sorting. In; Proceedings of the 19th Data Compression Conference (DCC), pp. 193–202 (2009)
Nunes, D.S.N., Louza, F., Gog, S., Ayala-Rincón, M., Navarro, G.: A grammar compression algorithm based on induced suffix sorting. In: Proceedings of the 28th Data Compression Conference (DCC), pp. 42–51 (2018)
Ohlebusch, E., Fischer, J., Gog, S.: CST++. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 322–333. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16321-0_34
Rossi, M., Oliva, M., Bonizzoni, P., Langmead, B., Gagie, T., Boucher, C.: Finding maximal exact matches using the r-index. J. Comput. Biol. 29(2), 188–194 (2022)
Rossi, M., Oliva, M., Langmead, B., Gagie, T., Boucher, C.: MONI: a pangenomic index for finding maximal exact matches. J. Comput. Biol. 29(2), 169–187 (2022)
Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
Sahinalp, S.C., Vishkin, U.: On a parallel-algorithms method for string matching problems (overview). In: Bonuccelli, M., Crescenzi, P., Petreschi, R. (eds.) CIAC 1994. LNCS, vol. 778, pp. 22–32. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-57811-0_3
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT), pp. 1–11 (1973)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Díaz-Domínguez, D., Salmela, L. (2023). Computing All-vs-All MEMs in Grammar-Compressed Text. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_13
Download citation
DOI: https://doi.org/10.1007/978-3-031-43980-3_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer ScienceComputer Science (R0)