Abstract
Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression \(n H_k + o(n\log \sigma )\) with \(k = o(\log _\sigma n)\), where n is the input length and \(\sigma\) is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be \(\Omega (\log n)\) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ.
Similar content being viewed by others
Notes
Hereafter, \(\log\) denote logarithm with base 2 if it is not explicitly stated otherwise.
To conform with the indexation scheme used throughout the paper, we do not follow the standard practice to index the least significant bit as zeroth.
References
Alakuijala, J., Farruggia, A., Ferragina, P., Kliuchnikov, E., Obryk, R., Szabadka, Z., Vandevenne, L.: Brotli: a general-purpose data compressor. ACM Trans. Inf. Syst. 37(1), 4 (2018). https://doi.org/10.1145/3231935
Amir, A., Landau, G.M., Ukkonen, E.: Online timestamped text indexing. Inf. Process. Lett. 82(5), 253–259 (2002). https://doi.org/10.1016/S0020-0190(01)00275-7
Bannai, H., Gagie, T., I, T.: Online LZ77 parsing and matching statistics with RLBWTs. In: Proceedings of the CPM 2018, LIPIcs, vol. 105, pp. 7:1–7:12. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.7
Belazzougui, D., Puglisi, S.J.: Range predecessor and Lempel–Ziv parsing. In: Proceedings of the SODA 2016, pp. 2053–2071. SIAM (2016). https://doi.org/10.1137/1.9781611974331.ch143
Bille, P., Cording, P.H., Fischer, J., Gørtz, I.L.: Lempel–Ziv compression in a sliding window. In: Proceedings of the CPM 2017, LIPIcs, vol. 78. Schloss Dagstuhl–Leibniz–Zentrum für Informatik (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.15
Deorowicz, S., Danek, A., Niemiec, M.: GDC 2: compression of large collections of genomes. Sci. Rep. 5, 11565 (2015). https://doi.org/10.1038/srep11565
Deorowicz, S., Grabowski, S.: Robust relative compression of genomes with random access. Bioinformatics 27(21), 2979–2986 (2011). https://doi.org/10.1093/bioinformatics/btr505
Duda, J.: Asymmetric numeral systems as close to capacity low state entropy coders. CoRR abs/1311.2540 (2013). http://arxiv.org/abs/1311.2540
Elias, P.: Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21(2), 194–203 (1975). https://doi.org/10.1109/TIT.1975.1055349
Ferragina, P., Nitto, I., Venturini, R.: On the bit-complexity of Lempel-Ziv compression. SIAM J. Comput. 42(4), 1521–1541 (2013). https://doi.org/10.1137/120869511
Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: Proceedings of the ESA 2015, LNCS, vol. 9294, pp. 533–544. Springer (2015). https://doi.org/10.1007/978-3-662-48350-3_45
Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006). https://doi.org/10.1016/j.ipl.2006.04.008
Gagie, T., Manzini, G.: Space-conscious compression. In: Proc. MFCS 2007, LNCS, vol. 4708, pp. 206–217. Springer (2007). https://doi.org/10.1007/978-3-540-74456-6_20
Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel–Ziv parsing. In: Proceedings of the LATIN 2018, LNCS, vol. 10807, pp. 490–503. Springer (2018). https://doi.org/10.1007/978-3-319-77404-6_36
Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the SODA 2018, pp. 1459–1477. SIAM (2018). https://doi.org/10.1137/1.9781611975031.96
Gagie, T., Puglisi, S.J., Valenzuela, D.: Analyzing relative Lempel–Ziv reference construction. In: Proceedings of the SPIRE 2016, LNCS, vol. 9954, pp. 160–165. Springer (2016). https://doi.org/10.1007/978-3-319-46049-9_16
Gańczorz, M.: Entropy bounds for grammar compression. CoRR abs/1804.08547 (2018). http://arxiv.org/abs/1804.08547
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: Plug and play with succinct data structures. In: Proceedings of the SEA 2014, LNCS, vol. 8504, pp. 326–337. Springer (2014). https://doi.org/10.1007/978-3-319-07959-2_28
Hoobin, C., Puglisi, S.J., Zobel, J.: Relative Lempel–Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endow. 5(3), 265–273 (2011). https://doi.org/10.14778/2078331.2078341
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: Simple, fast, small. In: Proceedings of the CPM 2013, LNCS, vol. 7922, pp. 189–200. Springer (2013). https://doi.org/10.1007/978-3-642-38905-4_19
Karkkainen, J., Kempa, D., Puglisi, S.J.: Lempel–Ziv parsing in external memory. In: Proceedings of the DCC 2014, pp. 153–162. IEEE (2014). https://doi.org/10.1109/DCC.2014.78
Kempa, D., Kosolobov, D.: LZ-End parsing in compressed space. In: Proceedings of the DCC 2017, pp. 350–359. IEEE (2017). https://doi.org/10.1109/DCC.2017.73
Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the STOC 2018, pp. 827–840. ACM (2018). https://doi.org/10.1145/3188745.3188814
Kosaraju, S.R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999). https://doi.org/10.1137/S0097539797331105
Kosolobov, D.: Relations between greedy and bit-optimal LZ77 encodings. In: Proceedings of the STACS 2018, LIPIcs, vol. 96, pp. 46:1–46:14. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.STACS.2018.46
Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the DCC 2010, pp. 239–248. IEEE (2010). https://doi.org/10.1109/DCC.2010.29
Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel–Ziv compression of genomes for large-scale storage and retrieval. In: Proceedings of the SPIRE 2010, LNCS, vol. 6393, pp. 201–206. Springer (2010). https://doi.org/10.1007/978-3-642-16321-0_20
Kuruppu, S., Puglisi, S.J., Zobel, J.: Optimized relative Lempel–Ziv compression of genomes. In: Australasian Computer Science Conference, pp. 91–98. Australian Computer Society, Inc. (2011)
Larsson, N.J.: Most recent match queries in on-line suffix trees. In: Proceedings of the CPM 2014, LNCS, vol. 8486, pp. 252–261 (2014). https://doi.org/10.1007/978-3-319-07566-2_26
Larsson, N.J., Sadakane, K.: Faster suffix sorting. Theor. Comput. Sci. 387(3), 258–272 (2007). https://doi.org/10.1016/j.tcs.2007.07.017
Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Softw. Pract. Exp. 45(1), 1–29 (2015)
Levenshtein, V.I.: On the redundancy and delay of decodable coding of natural numbers. Syst. Theory Res. 20, 149–155 (1968)
Liao, K., Petri, M., Moffat, A., Wirth, A.: Effective construction of relative Lempel–Ziv dictionaries. In: Proceedings of the WWW 2016, pp. 807–816. International World Wide Web Conferences Steering Committee (2016). https://doi.org/10.1145/2872427.2883042
Mäkinen, V., Navarro, G.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007). https://doi.org/10.1145/1216370.1216372
Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001). https://doi.org/10.1145/382780.382782
Navarro, G.: Indexing highly repetitive collections. In: Proceedings of the IWOCA 2012, LNCS, vol. 7643, pp. 274–279 (2012). https://doi.org/10.1007/978-3-642-35926-2_29
Ochoa, C., Navarro, G.: RePair and all irreducible grammars are upper bounded by high-order empirical entropy. IEEE Trans. Inf. Theory (2018). https://doi.org/10.1109/TIT.2018.2871452
Policriti, A., Prezza, N.: Fast online Lempel–Ziv factorization in compressed space. In: Proceedings of the SPIRE 2015, LNCS, vol. 9309, pp. 13–20. Springer (2015). https://doi.org/10.1007/978-3-319-23826-5_2
Policriti, A., Prezza, N.: LZ77 computation based on the run-length encoded BWT. Algorithmica 80(7), 1986–2011 (2018). https://doi.org/10.1007/s00453-017-0327-z
Puglisi, S.J.: Lempel-Ziv compression. In: Kao, M.-Y. (ed.) Encyclopedia of algorithms, pp. 1095–1100., Springer, New York (2016). https://doi.org/10.1007/978-1-4939-2864-4_634
Shields, P.C.: Performance of LZ algorithms on individual sequences. IEEE Trans. Inf. Theory 45(4), 1283–1288 (1999). https://doi.org/10.1109/18.761286
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982). https://doi.org/10.1145/322344.322346
Tillson, T.W.: A hamiltonian decomposition of \(K^*_{2m}\), \(2m \ge 8\). J. Combin. Theory Ser. B 29(1), 68–74 (1980). https://doi.org/10.1016/0095-8956(80)90044-1
Valenzuela, D.: CHICO: A compressed hybrid index for repetitive collections. In: Proceedings of the SEA 2016, LNCS, vol. 9685, pp. 326–338. Springer (2016). https://doi.org/10.1007/978-3-319-38851-9_22
Wandelt, S., Leser, U.: FRESCO: referential compression of highly similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinf. 10(5), 1275–1288 (2013). https://doi.org/10.1109/TCBB.2013.122
Wyner, A.J.: The redundancy and distribution of the phrase lengths of the fixed-database Lempel–Ziv algorithm. IEEE Trans. Inf. Theory 43(5), 1452–1464 (1997). https://doi.org/10.1109/18.623144
Yann Collet: Zstandard. (2016). Retrieved from: https://facebook.github.io/zstd/. Accessed 2018-09-17
Yuta Mori: libdivsufsort. https://github.com/y-256/libdivsufsort/. Accessed 22 May 2020
Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977). https://doi.org/10.1109/TIT.1977.1055714
Acknowledgements
This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454).
Rights and permissions
About this article
Cite this article
Kosolobov, D., Valenzuela, D., Navarro, G. et al. Lempel–Ziv-Like Parsing in Small Space. Algorithmica 82, 3195–3215 (2020). https://doi.org/10.1007/s00453-020-00722-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-020-00722-6