Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs

Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs

Authors Adrián Goga, Andrej Baláž



PDF
Thumbnail PDF

File

LIPIcs.WABI.2022.18.pdf
  • Filesize: 0.67 MB
  • 12 pages

Document Identifiers

Author Details

Adrián Goga
  • Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia
Andrej Baláž
  • Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia

Acknowledgements

We want to thank Travis Gagie for the conception of the idea during his data structures course and helpful remarks throughout the realisation of this project. Our thanks also go to Broňa Brejová for useful advice during the writing process and Uwe Baier for kindly responding to our questions via email. Finally, we thank Lucas Pansani Ramos for the aid he provided in the early days of the project.

Cite As Get BibTex

Adrián Goga and Andrej Baláž. Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 18:1-18:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.WABI.2022.18

Abstract

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019).
Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. 
To find an adequate set of blocks in less time, we propose a new method based on the prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP representation of the text is generally much shorter than the original since individual phrases are used many times in the parse, thus reducing the size of the dictionary. 
To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.

Subject Classification

ACM Subject Classification
  • Theory of computation → Theory and algorithms for application domains
Keywords
  • Wheeler graphs
  • BWT tunnelling
  • prefix-free parsing
  • pangenomic graphs

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 911-930. SIAM, 2020. Google Scholar
  2. Uwe Baier. On Undetected Redundancy in the Burrows-Wheeler Transform. Annual Symposium on Combinatorial Pattern Matching (CPM 2018), 105:3:1-3:15, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.3.
  3. Uwe Baier, Thomas Büchler, Enno Ohlebusch, and Pascal Weber. Edge minimization in de Bruijn graphs. In 2020 Data Compression Conference (DCC), pages 223-232. IEEE, 2020. Google Scholar
  4. Uwe Baier and Kadir Dede. BWT Tunnel Planning is hard but manageable. In 2019 Data Compression Conference (DCC), pages 142-151. IEEE, 2019. Google Scholar
  5. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):1-15, 2019. Google Scholar
  6. Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. In Digital SRC Research Report. Citeseer, 1994. Google Scholar
  7. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st annual symposium on foundations of computer science, pages 390-398. IEEE, 2000. Google Scholar
  8. Johannes Fischer and Florian Kurpicz. Dismantling divsufsort. arXiv preprint, 2017. URL: http://arxiv.org/abs/1710.01896.
  9. Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical computer science, 698:67-78, 2017. Google Scholar
  10. Erik Garrison, Jouni Sirén, Adam M Novak, Glenn Hickey, Jordan M Eizenga, Eric T Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F Lin, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9):875-879, 2018. Google Scholar
  11. Daniel Gibney and Sharma V Thankachan. On the complexity of recognizing wheeler graphs. Algorithmica, 84(3):784-814, 2022. Google Scholar
  12. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
  13. Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In International colloquium on automata, languages, and programming, pages 943-955. Springer, 2003. Google Scholar
  14. Ben Langmead. Aligning short sequencing reads with Bowtie. Current protocols in bioinformatics, 32(1):11-7, 2010. Google Scholar
  15. Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357-359, 2012. Google Scholar
  16. Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint, 2013. URL: http://arxiv.org/abs/1303.3997.
  17. Felipe A Louza, Simon Gog, and Guilherme P Telles. Inducing enhanced suffix arrays for string collections. Theoretical Computer Science, 678:22-39, 2017. Google Scholar
  18. Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems (TOIS), 31(3):1-15, 2013. Google Scholar
  19. Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In 2009 data compression conference, pages 193-202. IEEE, 2009. Google Scholar
  20. Alberto Policriti and Nicola Prezza. From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), volume 78 of Leibniz International Proceedings in Informatics (LIPIcs), pages 17:1-17:10, Dagstuhl, Germany, 2017. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.17.
  21. Jarno Alanko; Nicola Cotumaccio; Nicola Prezza. Linear-time minimization of Wheeler DFAs, 2022. URL: https://sigport.org/documents/linear-time-minimization-wheeler-dfas.
  22. Nayanah Siva. 1000 genomes project. Nature biotechnology, 26(3):256-257, 2008. Google Scholar
  23. Daniel Valenzuela and Veli Mäkinen. CHIC: a short read aligner for pan-genomic references. biorxiv, page 178129, 2017. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail