A Framework of Dynamic Data Structures for String Processing

A Framework of Dynamic Data Structures for String Processing

Author Nicola Prezza



PDF
Thumbnail PDF

File

LIPIcs.SEA.2017.11.pdf
  • Filesize: 0.5 MB
  • 15 pages

Document Identifiers

Author Details

Nicola Prezza

Cite As Get BibTex

Nicola Prezza. A Framework of Dynamic Data Structures for String Processing. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 11:1-11:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.SEA.2017.11

Abstract

In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of five recently-published compression algorithms implemented using DYNAMIC with those of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks.

Subject Classification

Keywords
  • C++
  • dynamic
  • compression
  • data structure
  • bitvector
  • string

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Timo Beller, Maike Zwerger, Simon Gog, and Enno Ohlebusch. Space-Efficient Construction of the Burrows-Wheeler Transform. In String Processing and Information Retrieval, pages 5-16. Springer, 2013. Google Scholar
  2. Daniel K. Blandford and Guy E. Blelloch. Compact representations of ordered sets. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 11-19. Society for Industrial and Applied Mathematics, 2004. Google Scholar
  3. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm, 1994. Google Scholar
  4. Ho-Leung Chan, Wing-Kai Hon, Tak-Wah Lam, and Kunihiko Sadakane. Compressed indexes for dynamic text collections. ACM Transactions on Algorithms (TALG), 3(2):21, 2007. Google Scholar
  5. Joshimar Cordova and Gonzalo Navarro. Practical dynamic entropy-compressed bitvectors with applications. In International Symposium on Experimental Algorithms, pages 105-117. Springer, 2016. Google Scholar
  6. dbwt: direct construction of the BWT. http://researchmap.jp/muuw41s7s-1587/#_1587. Accessed: 2016-11-17.
  7. ds-vector: C++ library for dynamic succinct vector. https://code.google.com/archive/p/ds-vector/. Accessed: 2016-11-17.
  8. DYNAMIC: dynamic succinct/compressed data structures library. https://github.com/xxsds/DYNAMIC. Accessed: 2017-01-22.
  9. Paolo Ferragina, Travis Gagie, and Giovanni Manzini. Lightweight data indexing and compression in external memory. Algorithmica, 63(3):707-730, 2012. Google Scholar
  10. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390-398. IEEE, 2000. Google Scholar
  11. bitvector: succinct dynamic bitvector implementation. https://github.com/nicola-gigante/bitvector. Accessed: 2016-11-17.
  12. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014. URL: http://dx.doi.org/10.1007/978-3-319-07959-2_28.
  13. Roberto Grossi, Rajeev Raman, Satti Srinivasa Rao, and Rossano Venturini. Dynamic compressed strings with random access. In International Colloquium on Automata, Languages, and Programming, pages 504-515. Springer, 2013. Google Scholar
  14. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lightweight Lempel-Ziv parsing. In Experimental Algorithms, pages 139-150. Springer, 2013. Google Scholar
  15. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Linear time Lempel-Ziv factorization: Simple, fast, small. In Combinatorial Pattern Matching. Springer, 2013. Google Scholar
  16. Dominik Kempa and Simon J. Puglisi. Lempel-Ziv factorization: Simple, fast, practical. In Proceedings of the Meeting on Algorithm Engineering &Expermiments, pages 103-112. Society for Industrial and Applied Mathematics, 2013. Google Scholar
  17. Patrick Klitzke and Patrick K. Nicholson. A general framework for dynamic succinct and compressed data structures. Proceedings of the 18th ALENEX, pages 160-173, 2016. Google Scholar
  18. libcds: compact data structures library. https://github.com/fclaude/libcds. Accessed: 2016-11-17.
  19. LZ77 factorization algorithms. https://www.cs.helsinki.fi/group/pads/lz77.html. Accessed: 2016-05-20.
  20. Veli Mäkinen and Gonzalo Navarro. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms (TALG), 4(3):32, 2008. Google Scholar
  21. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. of Computational Biology, 17(3):281-308, 2010. Google Scholar
  22. Memoria: C++14 framework providing general purpose dynamic data structures. https://bitbucket.org/vsmirnov/memoria/wiki/Home. Accessed: 2016-11-17.
  23. Y. Mori. Short description of improved two-stage suffix sorting algorithm, 2005. Google Scholar
  24. Gonzalo Navarro and Yakov Nekrich. Optimal dynamic sequence representations. SIAM Journal on Computing, 43(5):1781-1806, 2014. Google Scholar
  25. Pizza&Chili corpus. http://pizzachili.dcc.uchile.cl. Accessed: 2016-07-25.
  26. A. Policriti and N. Prezza. Computing LZ77 in Run-Compressed Space. In 2016 Data Compression Conference (DCC), pages 23-32, March 2016. URL: http://dx.doi.org/10.1109/DCC.2016.30.
  27. Alberto Policriti, Nicola Gigante, and Nicola Prezza. Average linear time and compressed space construction of the Burrows-Wheeler transform. In International Conference on Language and Automata Theory and Applications, pages 587-598. Springer, 2015. Google Scholar
  28. Alberto Policriti and Nicola Prezza. Fast online Lempel-Ziv factorization in compressed space. In International Symposium on String Processing and Information Retrieval, pages 13-20. Springer, 2015. Google Scholar
  29. Andreas Poyias, Simon J Puglisi, and Rajeev Raman. Compact Dynamic Rewritable (CDRW) Arrays. In 2017 Proceedings of the Ninteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 109-119. SIAM, 2017. Google Scholar
  30. Andreas Poyias and Rajeev Raman. Improved practical compact dynamic tries. In International Symposium on String Processing and Information Retrieval, pages 324-336. Springer, 2015. Google Scholar
  31. Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct dynamic data structures. In Workshop on Algorithms and Data Structures, pages 426-437. Springer, 2001. Google Scholar
  32. Jouni Sirén, Niko Välimäki, Veli Mäkinen, and Gonzalo Navarro. Run-length compressed indexes are superior for highly repetitive sequence collections. In String Processing and Information Retrieval, pages 164-175. Springer, 2009. Google Scholar
  33. succinct library. https://github.com/ot/succinct. Accessed: 2016-11-17.
  34. sux library. http://sux.di.unimi.it/. Accessed: 2016-11-17.
  35. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on information theory, 23(3):337-343, 1977. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail