{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,5]],"date-time":"2024-08-05T02:14:16Z","timestamp":1722824056713},"reference-count":27,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2021,7,12]],"date-time":"2021-07-12T00:00:00Z","timestamp":1626048000000},"content-version":"vor","delay-in-days":11,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001711","name":"Swiss National Science Foundation","doi-asserted-by":"publisher","award":["407540_167331"],"id":[{"id":"10.13039\/501100001711","id-type":"DOI","asserted-by":"publisher"}]},{"name":"ETH core funding"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8,4]]},"abstract":"Abstract<\/jats:title>\n \n Motivation<\/jats:title>\n Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving.<\/jats:p>\n <\/jats:sec>\n \n Results<\/jats:title>\n In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.<\/jats:p>\n <\/jats:sec>\n \n Availability and implementation<\/jats:title>\n RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https:\/\/github.com\/ratschlab\/row_diff.<\/jats:p>\n <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab330","type":"journal-article","created":{"date-parts":[[2021,5,3]],"date-time":"2021-05-03T19:46:46Z","timestamp":1620071206000},"page":"i169-i176","source":"Crossref","is-referenced-by-count":6,"title":["Topology-based sparsification of graph annotations"],"prefix":"10.1093","volume":"37","author":[{"given":"Daniel","family":"Danciu","sequence":"first","affiliation":[{"name":"Department of Computer Science, Biomedical Informatics Group, ETH Zurich , Zurich, Switzerland"},{"name":"Biomedical Informatics Research, University Hospital Zurich , Zurich, Switzerland"}]},{"given":"Mikhail","family":"Karasikov","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Biomedical Informatics Group, ETH Zurich , Zurich, Switzerland"},{"name":"Biomedical Informatics Research, University Hospital Zurich , Zurich, Switzerland"},{"name":"Swiss Institute of Bioinformatics , Zurich, Switzerland"}]},{"given":"Harun","family":"Mustafa","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Biomedical Informatics Group, ETH Zurich , Zurich, Switzerland"},{"name":"Biomedical Informatics Research, University Hospital Zurich , Zurich, Switzerland"},{"name":"Swiss Institute of Bioinformatics , Zurich, Switzerland"}]},{"given":"Andr\u00e9","family":"Kahles","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Biomedical Informatics Group, ETH Zurich , Zurich, Switzerland"},{"name":"Biomedical Informatics Research, University Hospital Zurich , Zurich, Switzerland"},{"name":"Swiss Institute of Bioinformatics , Zurich, Switzerland"}]},{"given":"Gunnar","family":"R\u00e4tsch","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Biomedical Informatics Group, ETH Zurich , Zurich, Switzerland"},{"name":"Biomedical Informatics Research, University Hospital Zurich , Zurich, Switzerland"},{"name":"Swiss Institute of Bioinformatics , Zurich, Switzerland"},{"name":"Department of Biology, ETH Zurich , Zurich, Switzerland"}]}],"member":"286","published-online":{"date-parts":[[2021,7,12]]},"reference":[{"key":"2023062410300804600_btab330-B1","first-page":"18:1","volume-title":"17th International Workshop on Algorithms in Bioinformatics (WABI 2017), Volume 88 of Leibniz International Proceedings in Informatics (LIPIcs)","author":"Almodaresi","year":"2017"},{"key":"2023062410300804600_btab330-B2","volume-title":"Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol 11467.","author":"Almodaresi","year":"2019"},{"key":"2023062410300804600_btab330-B3","doi-asserted-by":"crossref","first-page":"288","DOI":"10.1186\/s12859-015-0709-7","article-title":"Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph","volume":"16","author":"Benoit","year":"2015","journal-title":"BMC Bioinformatics"},{"key":"2023062410300804600_btab330-B4","volume-title":"String Processing and Information Retrieval. SPIRE 2019.","author":"Bingmann","year":"2019"},{"key":"2023062410300804600_btab330-B5","volume-title":"Lecture Notes in Computer Science, vol 7534. Springer, Berlin, Heidelberg","author":"Bowe","year":"2012"},{"key":"2023062410300804600_btab330-B6","doi-asserted-by":"crossref","first-page":"152","DOI":"10.1038\/s41587-018-0010-1","article-title":"Ultrafast search of all deposited bacterial and viral genomic data","volume":"37","author":"Bradley","year":"2019","journal-title":"Nat. Biotechnol"},{"key":"2023062410300804600_btab330-B7","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1186\/s13059-018-1568-0","article-title":"Krakenuniq: confident and fast metagenomics classification using unique k-mer counts","volume":"19","author":"Breitwieser","year":"2018","journal-title":"Genome Biol"},{"key":"2023062410300804600_btab330-B8","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1186\/1748-7188-8-22","article-title":"Space-efficient and exact de Bruijn graph representation based on a bloom filter","volume":"8","author":"Chikhi","year":"2013","journal-title":"Algorith. Mol. Biol"},{"key":"2023062410300804600_btab330-B9","doi-asserted-by":"crossref","first-page":"1415","DOI":"10.1093\/bioinformatics\/bts173","article-title":"Large-scale compression of genomic sequence databases with the burrows\u2013wheeler transform","volume":"28","author":"Cox","year":"2012","journal-title":"Bioinformatics"},{"key":"2023062410300804600_btab330-B10","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1145\/321812.321820","article-title":"Efficient storage and retrieval by content and address of static files","volume":"21","author":"Elias","year":"1974","journal-title":"JACM"},{"key":"2023062410300804600_btab330-B11","volume-title":"On the Number of Bits Required to Implement an Associative Memory","author":"Fano","year":"1971"},{"key":"2023062410300804600_btab330-B12","volume-title":"Lecture Notes in Computer Science, vol 8504","author":"Gog","year":"2014"},{"key":"2023062410300804600_btab330-B13","doi-asserted-by":"crossref","first-page":"721","DOI":"10.1093\/bioinformatics\/btz662","article-title":"Improved representation of sequence bloom trees","volume":"36","author":"Harris","year":"2020","journal-title":"Bioinformatics"},{"key":"2023062410300804600_btab330-B14","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1038\/ng.1028","article-title":"De novo assembly and genotyping of variants using colored de Bruijn graphs","volume":"44","author":"Iqbal","year":"2012","journal-title":"Nat. Genet"},{"key":"2023062410300804600_btab330-B15","article-title":"Metagraph: indexing and analysing nucleotide archives at petabase-scale","author":"Karasikov","year":"2020","journal-title":"bioRxiv, 10.1101\/2020.10.01.322164"},{"key":"2023062410300804600_btab330-B16","doi-asserted-by":"crossref","first-page":"626","DOI":"10.1089\/cmb.2019.0324","article-title":"Sparse binary relation representations for genome graph annotation","volume":"27","author":"Karasikov","year":"2020","journal-title":"J. Comput. Biol"},{"key":"2023062410300804600_btab330-B17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1101\/gr.260604.119","article-title":"Data structures based on k-mers for querying large collections of sequencing datasets","volume":"31","author":"Marchet","year":"2021","journal-title":"Genome Res"},{"key":"2023062410300804600_btab330-B18","doi-asserted-by":"crossref","first-page":"3181","DOI":"10.1093\/bioinformatics\/btx067","article-title":"Succinct colored de Bruijn graphs","volume":"33","author":"Muggli","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062410300804600_btab330-B19","doi-asserted-by":"crossref","first-page":"i51","DOI":"10.1093\/bioinformatics\/btz350","article-title":"Building large updatable colored de Bruijn graphs via merging","volume":"35","author":"Muggli","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062410300804600_btab330-B20","doi-asserted-by":"crossref","first-page":"407","DOI":"10.1093\/bioinformatics\/bty632","article-title":"Dynamic compression schemes for graph coloring","volume":"35","author":"Mustafa","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062410300804600_btab330-B21","doi-asserted-by":"crossref","first-page":"D733","DOI":"10.1093\/nar\/gkv1189","article-title":"Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation","volume":"44","author":"O\u2019Leary","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023062410300804600_btab330-B22","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using minhash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol"},{"key":"2023062410300804600_btab330-B23","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1016\/j.cels.2018.05.021","article-title":"Mantis: a fast, small, and exact large-scale sequence-search index","volume":"7","author":"Pandey","year":"2018","journal-title":"Cell Syst"},{"key":"2023062410300804600_btab330-B24","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1145\/1290672.1290680","article-title":"Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets","volume":"3","author":"Raman","year":"2007","journal-title":"ACM Trans. Algorithms (TALG)"},{"key":"2023062410300804600_btab330-B25","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pbio.1002195","article-title":"Big data: astronomical or genomical?","volume":"13","author":"Stephens","year":"2015","journal-title":"PLoS Biol"},{"key":"2023062410300804600_btab330-B26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/sdata.2017.203","article-title":"The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans","volume":"5","author":"Tully","year":"2018","journal-title":"Sci. Data"},{"key":"2023062410300804600_btab330-B27","doi-asserted-by":"crossref","first-page":"2556","DOI":"10.1093\/bioinformatics\/bty157","article-title":"Integrating long-range connectivity information into de Bruijn graphs","volume":"34","author":"Turner","year":"2018","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i169\/50694228\/btab330.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i169\/50694228\/btab330.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,25]],"date-time":"2023-06-25T00:22:21Z","timestamp":1687652541000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/Supplement_1\/i169\/6319677"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,1]]},"references-count":27,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2021,8,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab330","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.11.17.386649","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,7,1]]},"published":{"date-parts":[[2021,7,1]]}}}