Author:
Description:
Ever-decreasing genome sequencing costs have led to an explosion in sequencing throughput, with the global sequencing capacity expected to exceed one exabyte per year in the next decade. This includes samples from clinical and human genome sequencing projects, model and non-model organisms, as well as metagenomes that pool sequences from communities of microorganisms. Much of this data is deposited in data banks in the form of short, inexact sequence fragments that traditionally needed to be assembled into longer genome fragments before they could be analysed. Due to both the limited computational capacity to assemble genomes using established algorithms and the lack of established indexing practices for unassembled data, a majority of the raw data in these data banks is effectively inaccessible to a large portion of the scientific community. Tools that index unassembled sequencing data typically organise the data into an annotated sequence graph structure. Graph nodes represent words extracted from the indexed sequences, while annotations store metadata by associating nodes with different attributes or sample labels. Due to the heterogeneous nature of biological sequence data, many sequence graph tools optimise for either high compression or efficient querying of a database holding one specific type of data, spawning many index formats with limited inter-compatibility. To provide a more unified approach to sequence graph-based indexing that can accommodate many different sequence types, we have developed the MetaGraph framework for constructing, querying, and assembling sequences from annotated De Bruijn graphs. MetaGraph further pushes the boundaries of sequence collection representation, providing highly compressed, lossless, succinct representations accompanied by scalable construction and optimised query procedures to ensure query performance comparable to or better than that of existing methods.
We provide representations tailored towards a variety of data types, including sequences originating from reference ...
Publisher:
ETH Zurich
Contributors:
Rätsch, Gunnar ; Kahles, André ; Stanke, Mario ; Birol, Inanc
Year of Publication:
2022
Document Type:
info:eu-repo/semantics/doctoralThesis ; [Doctoral and postdoctoral thesis]
Language:
en
Subjects:
succinct data structures ; metagenomics ; algorithms ; high-throughput sequencing ; genome graphs ; info:eu-repo/classification/ddc/004 ; info:eu-repo/classification/ddc/570 ; Data processing ; computer science ; Life sciences
Rights:
info:eu-repo/semantics/openAccess ; http://creativecommons.org/licenses/by-nc/4.0/ ; Creative Commons Attribution-NonCommercial 4.0 International
Terms of Re-use:
CC-BY-NC
Relations:
info:eu-repo/grantAgreement/SNF/NFP 75: Gesuch/167331 ; http://hdl.handle.net/20.500.11850/588880
Content Provider:
ETH Zürich Research Collection
- URL: https://www.research-collection.ethz.ch/
- Research Organization Registry (ROR): ETH Zurich
- Continent: Europe
- Country: ch
- Latitude / Longitude: 47.384000 / 8.654000
- Number of documents: 119,223
- Open Access: 93,755 (79%)
- Type: Academic publications
- System: DSpace XOAI
- Content provider indexed in BASE since:
- BASE URL: https://www.base-search.net/Search/Results?q=coll:ftethz