Computing All-vs-All MEMs in Grammar-Compressed Text

Díaz-Domínguez, Diego; Salmela, Leena

doi:10.1007/978-3-031-43980-3_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14240)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

We describe a compression-aware method to find all-vs-all maximal exact matches (MEMs) among the strings of a repetitive collection \(\mathcal {T}\). The key concept in our work is the construction of a fully-balanced grammar \(\mathcal {G}\) from \(\mathcal {T}\) that meets a property we call fix-free: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., a set that is both prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of \(\mathcal {T}\) incrementally over \(\mathcal {G}\) using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. [7], we show how to build \(\mathcal {G}\) from \(\mathcal {T}\) in linear time and space. We also demonstrate that our MEM algorithm runs on top of \(\mathcal {G}\) in \(O(G + occ)\) time and uses \(O((G + occ)\log G)\) bits, where \(G\) is the grammar size and \(occ\) is the number of MEMs in \(\mathcal {T}\). In the conclusions, we discuss how to modify our idea to perform approximate pattern matching in compressed space.
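The two definitions the abstract relies on, a fix-free set and a maximal exact match, can be illustrated on plain strings. The sketch below is only a brute-force illustration of those definitions, not the grammar-based algorithm of the paper: it works on uncompressed text, takes quadratic time or worse, and the function names `is_fix_free` and `naive_mems` are hypothetical names chosen for this example.

```python
from itertools import permutations


def is_fix_free(strings):
    """Return True if no string is a proper prefix or proper suffix of another.

    Duplicates are collapsed first, so every compared pair is distinct.
    """
    items = set(strings)
    for a, b in permutations(items, 2):
        if b.startswith(a) or b.endswith(a):
            return False
    return True


def naive_mems(x, y, min_len=1):
    """Brute-force MEMs between x and y, reported as (i, j, length) triples.

    A match x[i:i+l] == y[j:j+l] is maximal when it can be extended
    neither to the left nor to the right.
    """
    mems = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            # A match that can be extended to the left is not left-maximal.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(x) and j + length < len(y)
                   and x[i + length] == y[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    print(is_fix_free(["ab", "ba"]))    # neither is a prefix or suffix of the other
    print(is_fix_free(["ab", "abc"]))   # "ab" is a prefix of "abc"
    print(naive_mems("abcab", "cabc"))
```

In this toy setting, a set such as {"ab", "ba"} is fix-free, while {"ab", "abc"} is not because "ab" is a prefix of "abc". The abstract's contribution is to obtain the same MEM output while working on grammar rules instead of on the decompressed strings.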

Supported by Academy of Finland (Grants 323233 and 339070), and by Basal Funds FB0001, Chile (first author).

References

[1] Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53–86 (2004)
[2] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
[3] Batu, T., Ergun, F., Sahinalp, C.: Oblivious string embeddings and edit distance approximations. In: Proceedings of the 17th Symposium on Discrete Algorithms (SODA), pp. 792–801 (2006)
[4] Boucher, C., et al.: PHONI: streamed matching statistics with multi-genome references. In: Proceedings of the 21st Data Compression Conference (DCC), pp. 193–202 (2021)
[5] Chang, W.I., Lawler, E.L.: Sublinear approximate string matching and biological applications. Algorithmica 12(4), 327–344 (1994)
[6] Charikar, M., et al.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)
[7] Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms 17(1), 1–39 (2020)
[8] Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19
[9] Claude, F., Navarro, G., Pacheco, A.: Grammar-compressed indexes with logarithmic search time. J. Comput. Syst. Sci. 118, 53–74 (2021)
[10] Cole, R., Vishkin, U.: Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In: Proceedings of the 18th Annual Symposium on Theory of Computing (STOC), pp. 206–219 (1986)
[11] Díaz-Domínguez, D., Navarro, G.: A grammar compressor for collections of reads with applications to the construction of the BWT. In: Proceedings of the 31st Data Compression Conference (DCC), pp. 83–92 (2021)
[12] Fischer, J.: Optimal succinctness for range minimum queries. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 158–169. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12200-2_16
[13] Gagie, T., Navarro, G., Prezza, N.: Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), Article 2 (2020)
[14] Jeż, A.: Approximation of grammar-based compression via recompression. Theor. Comput. Sci. 592, 115–134 (2015)
[15] Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 827–840 (2018)
[16] Kent, W.J.: BLAT - the BLAST-like alignment tool. Genome Res. 12(4), 656–664 (2002)
[17] Kieffer, J., Yang, E.-H.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)
[18] Kurtz, S., et al.: Versatile and open software for comparing large genomes. Genome Biol. 5, 1–9 (2004)
[19] Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A. (ed.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48194-X_17
[20] Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–359 (2012)
[21] Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013)
[22] Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)
[23] Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design. Cambridge University Press, Cambridge (2015)
[24] Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
[25] McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
[26] Navarro, G.: Computing MEMs on repetitive text collections. In: Proceedings of the 34th Annual Symposium on Combinatorial Pattern Matching (CPM), Article 22 (2023)
[27] Nong, G., Zhang, S., Chan, W.H.: Linear suffix array construction by almost pure induced-sorting. In: Proceedings of the 19th Data Compression Conference (DCC), pp. 193–202 (2009)
[28] Nunes, D.S.N., Louza, F., Gog, S., Ayala-Rincón, M., Navarro, G.: A grammar compression algorithm based on induced suffix sorting. In: Proceedings of the 28th Data Compression Conference (DCC), pp. 42–51 (2018)
[29] Ohlebusch, E., Fischer, J., Gog, S.: CST++. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 322–333. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16321-0_34
[30] Rossi, M., Oliva, M., Bonizzoni, P., Langmead, B., Gagie, T., Boucher, C.: Finding maximal exact matches using the r-index. J. Comput. Biol. 29(2), 188–194 (2022)
[31] Rossi, M., Oliva, M., Langmead, B., Gagie, T., Boucher, C.: MONI: a pangenomic index for finding maximal exact matches. J. Comput. Biol. 29(2), 169–187 (2022)
[32] Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
[33] Sahinalp, S.C., Vishkin, U.: On a parallel-algorithms method for string matching problems (overview). In: Bonuccelli, M., Crescenzi, P., Petreschi, R. (eds.) CIAC 1994. LNCS, vol. 778, pp. 22–32. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-57811-0_3
[34] Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT), pp. 1–11 (1973)

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Helsinki, Finland
Diego Díaz-Domínguez & Leena Salmela

Corresponding author

Correspondence to Diego Díaz-Domínguez.

Editor information

Editors and Affiliations

ISTI-CNR, Pisa, Italy
Franco Maria Nardini
University of Pisa, Pisa, Italy
Nadia Pisanti
University of Pisa, Pisa, Italy
Rossano Venturini

About this paper

Cite this paper

Díaz-Domínguez, D., Salmela, L. (2023). Computing All-vs-All MEMs in Grammar-Compressed Text. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_13

DOI: https://doi.org/10.1007/978-3-031-43980-3_13
Published: 20 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer Science, Computer Science (R0)
