{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,1,26]],"date-time":"2023-01-26T05:20:52Z","timestamp":1674710452891},"reference-count":57,"publisher":"Oxford University Press (OUP)","issue":"18","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":1490,"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,9,15]]},"abstract":"Abstract<\/jats:title>\n Motivation: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics.<\/jats:p>\n Results: For the first time, we adopt the approximate string matching paradigm of geometric embedding to read mapping, thus rephrasing it to nearest neighbor queries in a q-gram frequency vector space. Using the L1 distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times, which match the state-of-the-art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof-of-concept that geometric embedding is a promising paradigm for read mapping and that L1 distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurate than many popular read mappers over a wide range of structural variants.<\/jats:p>\n Availability and implementation: TreQ will be released under the GNU Public License (GPL), and precomputed genome indices will be provided for download at http:\/\/treq.sf.net.<\/jats:p>\n Contact: \u00a0pavelm@cs.rutgers.edu<\/jats:p>\n Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts380","type":"journal-article","created":{"date-parts":[[2012,9,7]],"date-time":"2012-09-07T20:35:22Z","timestamp":1347050122000},"page":"i325-i332","source":"Crossref","is-referenced-by-count":3,"title":["Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd<\/i>-trees"],"prefix":"10.1093","volume":"28","author":[{"given":"Md Pavel","family":"Mahmud","sequence":"first","affiliation":[{"name":"1 Department of Computer Science"}]},{"given":"John","family":"Wiedenhoeft","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science"}]},{"given":"Alexander","family":"Schliep","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science"},{"name":"2 BioMaPS Institute for Quantitative Biology, Rutgers University, New Jersey, USA"}]}],"member":"286","published-online":{"date-parts":[[2012,9,3]]},"reference":[{"key":"2023012513000377600_B1","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1145\/777792.777828","article-title":"Cache-oblivious data structures for orthogonal range searching","volume-title":"Proceedings of the nineteenth annual symposium on Computational geometry","author":"Agarwal","year":"2003"},{"key":"2023012513000377600_B2","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1038\/ng.437","article-title":"Personalized copy number and segmental duplication maps using next-generation sequencing","volume":"41","author":"Alkan","year":"2009","journal-title":"Nat. Genet."},{"key":"2023012513000377600_B3","doi-asserted-by":"crossref","first-page":"363","DOI":"10.1038\/nrg2958","article-title":"Genome structural variation discovery and genotyping","volume":"12","author":"Alkan","year":"2011","journal-title":"Nat. Rev. Genet."},{"key":"2023012513000377600_B4","article-title":"Cache-oblivious data structures","volume-title":"Handbook of Data Structures and Applications","author":"Arge","year":"2005"},{"key":"2023012513000377600_B5","first-page":"28","article-title":"The x-tree : An index structure for high-dimensional data","volume-title":"VLDB\u201896, Proceedings of 22th International Conference on Very Large Data Bases, September 3\u20136, 1996, Mumbai, India","author":"Berchtold","year":"1996"},{"key":"2023012513000377600_B6","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1016\/0020-0190(93)90222-U","article-title":"Approximate closest-point queries in high dimensions","volume":"45","author":"Bern","year":"1993","journal-title":"Inf. Process Lett."},{"key":"2023012513000377600_B7","doi-asserted-by":"crossref","first-page":"322","DOI":"10.1145\/502807.502809","article-title":"Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases","volume":"33","author":"B\u00f6hm","year":"2001","journal-title":"ACM Comput. Surv."},{"key":"2023012513000377600_B8","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1963190.1963191","article-title":"Indexing methods for approximate dictionary searching","volume":"16","author":"Boytsov","year":"2011","journal-title":"J. Exp. Algorithmics"},{"key":"2023012513000377600_B9","doi-asserted-by":"crossref","first-page":"1.1:1.1","DOI":"10.1145\/1963190.1963191","article-title":"Indexing methods for approximate dictionary searching: Comparative analysis","volume":"16","author":"Boytsov","year":"2011","journal-title":"J. Exp. Algorithmics"},{"key":"2023012513000377600_B10","first-page":"28","article-title":"A spatial index for approximate multiple string matching","volume":"1","author":"Bugnion","year":"1995","journal-title":"J. Brazilian Chem. Soc."},{"key":"2023012513000377600_B11","first-page":"51","article-title":"Better filtering with gapped q-grams","volume":"56","author":"Burkhardt","year":"2002","journal-title":"Fundam. Inf."},{"key":"2023012513000377600_B12","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1007\/s11042-008-0226-z","article-title":"Improving the space cost of k -nn search in metric spaces by using distance estimators","volume":"41","author":"Bustos","year":"2009","journal-title":"Multimedia Tools Appl."},{"key":"2023012513000377600_B13","doi-asserted-by":"crossref","first-page":"677","DOI":"10.1038\/nmeth.1363","article-title":"Breakdancer: an algorithm for high-resolution mapping of genomic structural variation","volume":"6","author":"Chen","year":"2009","journal-title":"Nat. Methods"},{"key":"2023012513000377600_B14","doi-asserted-by":"crossref","first-page":"186","DOI":"10.1101\/gr.8.3.186","article-title":"Base-calling of automated sequencer traces using phred. II. error probabilities","volume":"8","author":"Ewing","year":"1998","journal-title":"Genome Res."},{"key":"2023012513000377600_B15","first-page":"285","article-title":"Cache-oblivious algorithms","volume-title":"Proceedings of the 40th Annual Symposium on Foundations of Computer Science","author":"Frigo","year":"1999"},{"key":"2023012513000377600_B16","doi-asserted-by":"crossref","first-page":"e100","DOI":"10.1093\/nar\/gkq010","article-title":"Incorporating sequence quality data into alignment improves DNA read mapping","volume":"38","author":"Frith","year":"2010","journal-title":"Nucleic Acids Res."},{"key":"2023012513000377600_B17","doi-asserted-by":"crossref","first-page":"656","DOI":"10.1093\/bioinformatics\/bts028","article-title":"Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts","volume":"28","author":"Goke","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012513000377600_B18","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511574931","volume-title":"Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology","author":"Gusfield","year":"1997"},{"key":"2023012513000377600_B19","doi-asserted-by":"crossref","first-page":"576","DOI":"10.1038\/nmeth0810-576","article-title":"mrsfast: a cache-oblivious algorithm for short-read mapping","volume":"7","author":"Hach","year":"2010","journal-title":"Nat. Method."},{"key":"2023012513000377600_B20","doi-asserted-by":"crossref","first-page":"3085","DOI":"10.1093\/bioinformatics\/btr537","article-title":"Probabilistic alignments with quality scores: an application to short-read mapping toward accurate snp\/indel detection","volume":"27","author":"Hamada","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012513000377600_B21","doi-asserted-by":"crossref","first-page":"1270","DOI":"10.1101\/gr.088633.108","article-title":"Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes","volume":"19","author":"Hormozdiari","year":"2009","journal-title":"Genome Res."},{"key":"2023012513000377600_B22","doi-asserted-by":"crossref","first-page":"i350","DOI":"10.1093\/bioinformatics\/btq216","article-title":"Next-generation variationhunter: combinatorial algorithms for transposon insertion discovery","volume":"26","author":"Hormozdiari","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012513000377600_B23","first-page":"619","article-title":"Fast approximate similarity search in extremely high-dimensional data sets","volume-title":"ICDE","author":"Houle","year":"2005"},{"key":"2023012513000377600_B24","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1002\/(SICI)1520-684X(19980615)29:6<59::AID-SCJ6>3.0.CO;2-K","article-title":"Sr-tree: An index structure for nearest-neighbor searching of high-dimensional point data","volume":"29","author":"Katayama","year":"1998","journal-title":"Sys. Comput. Japan"},{"key":"2023012513000377600_B25","first-page":"140","article-title":"An empirical comparison of exact nearest neighbour algorithms","volume-title":"Knowledge Discovery in Databases: PKDD 2007, volume 4702 of Lecture Notes in Computer Science","author":"Kibriya","year":"2007"},{"key":"2023012513000377600_B26","doi-asserted-by":"crossref","first-page":"R23","DOI":"10.1186\/gb-2009-10-2-r23","article-title":"Pemer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data","volume":"10","author":"Korbel","year":"2009","journal-title":"Genome Biol."},{"key":"2023012513000377600_B27","doi-asserted-by":"crossref","first-page":"R25","DOI":"10.1186\/gb-2009-10-3-r25","article-title":"Ultrafast and memory-efficient alignment of short dna sequences to the human genome","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol."},{"key":"2023012513000377600_B28","doi-asserted-by":"crossref","first-page":"473","DOI":"10.1038\/nmeth.f.256","article-title":"Modil: detecting small indels from clone-end sequencing with mixtures of distributions","volume":"6","author":"Lee","year":"2009","journal-title":"Nat. Methods"},{"key":"2023012513000377600_B29","first-page":"564","article-title":"The spectrum kernel: a string kernel for svm protein classification","volume-title":"Proceedings of Pacific Symposium on Biocomputing","author":"Leslie","year":"2002"},{"key":"2023012513000377600_B30","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with burrows\u2013wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012513000377600_B31","doi-asserted-by":"crossref","first-page":"473","DOI":"10.1093\/bib\/bbq015","article-title":"A survey of sequence alignment algorithms for next-generation sequencing","volume":"11","author":"Li","year":"2010","journal-title":"Brief. Bioinformatics"},{"key":"2023012513000377600_B32","doi-asserted-by":"crossref","first-page":"1966","DOI":"10.1093\/bioinformatics\/btp336","article-title":"Soap2: an improved ultrafast tool for short read alignment","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012513000377600_B33","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1016\/j.jtbi.2011.06.020","article-title":"New powerful statistics for alignment-free sequence comparison under a pattern transfer model","volume":"284","author":"Liu","year":"2011","journal-title":"J. Theo. Biol."},{"key":"2023012513000377600_B34","doi-asserted-by":"crossref","first-page":"936","DOI":"10.1101\/gr.111120.110","article-title":"Stampy: A statistical algorithm for sensitive and fast mapping of illumina sequence reads","volume":"21","author":"Lunter","year":"2011","journal-title":"Genome Res."},{"key":"2023012513000377600_B35","first-page":"1029","article-title":"High quality SNP calling using Illumina data at shallow coverage","volume":"26","author":"Malhis","year":"2010","journal-title":"Bioinformatics (Oxford, England)"},{"key":"2023012513000377600_B36","article-title":"Ann: A library for approximate nearest neighbor searching","author":"Mount","year":"2010"},{"key":"2023012513000377600_B37","first-page":"331","article-title":"Fast approximate nearest neighbors with automatic algorithm configuration","volume-title":"International Conference on Computer Vision Theory and Application (VISSAPP\u201809)","author":"Muja","year":"2009"},{"key":"2023012513000377600_B38","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1145\/316542.316550","article-title":"A fast bit-vector algorithm for approximate string matching based on dynamic programming","volume":"46","author":"Myers","year":"1999","journal-title":"J. ACM"},{"key":"2023012513000377600_B39","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1145\/375360.375365","article-title":"A guided tour to approximate string matching","volume":"33","author":"Navarro","year":"2001","journal-title":"ACM Comput. Surv."},{"key":"2023012513000377600_B40","doi-asserted-by":"crossref","first-page":"266","DOI":"10.1016\/j.tcs.2005.11.037","article-title":"A metric index for approximate string matching","volume":"352","author":"Navarro","year":"2006","journal-title":"Theo. Comput. Sci."},{"key":"2023012513000377600_B41","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1016\/j.jda.2004.08.003","article-title":"Indexing text with approximate q-grams","volume":"3","author":"Navarro","year":"2005","journal-title":"J. Discrete Algorithms"},{"key":"2023012513000377600_B42","doi-asserted-by":"crossref","first-page":"1725","DOI":"10.1101\/gr.194201","article-title":"Ssaha: A fast search method for large dna databases","volume":"11","author":"Ning","year":"2001","journal-title":"Genome Res."},{"key":"2023012513000377600_B43","first-page":"359","article-title":"Effective indexing and filtering for similarity search in large biosequence databases","volume-title":"BIBE","author":"Ozturk","year":"2003"},{"key":"2023012513000377600_B44","doi-asserted-by":"crossref","first-page":"811","DOI":"10.1142\/S0218213005002405","article-title":"Vector space indexing for biosequence similarity searches","volume":"14","author":"Ozturk","year":"2005","journal-title":"Int. J. Artificial Intel. Tool"},{"key":"2023012513000377600_B45","doi-asserted-by":"crossref","first-page":"1348","DOI":"10.1016\/j.patrec.2010.04.004","article-title":"Locality sensitive hashing: a comparison of hash function types and querying mechanisms","volume":"31","author":"Paulev\u00e9","year":"2010","journal-title":"Pattern Reco. Lett."},{"key":"2023012513000377600_B46","doi-asserted-by":"crossref","first-page":"1615","DOI":"10.1089\/cmb.2009.0198","article-title":"Alignment-free sequence comparison (i): statistics and power","volume":"16","author":"Reinert","year":"2009","journal-title":"J. Comput. Biol."},{"key":"2023012513000377600_B47","first-page":"507","article-title":"The r+-tree: A dynamic index for multi-dimensional objects","volume-title":"Proceedings of the 13th International Conference on Very Large Data Bases","author":"Sellis","year":"1987"},{"key":"2023012513000377600_B48","first-page":"178","article-title":"On the collapse of q-Gram filtration","volume-title":"FUN with Algorithms","author":"Sutinen","year":"1998"},{"key":"2023012513000377600_B49","doi-asserted-by":"crossref","first-page":"525","DOI":"10.1016\/j.ygeno.2009.01.009","article-title":"Estimation of bacterial species phylogeny through oligonucleotide frequency distances","volume":"93","author":"Takahashi","year":"2009","journal-title":"Genomics"},{"key":"2023012513000377600_B50","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1038\/nature09534","article-title":"A map of human genome variation from population-scale sequencing","volume":"467","author":"The 1000 Genomes Project Consortium","year":"2010","journal-title":"Nature"},{"key":"2023012513000377600_B51","doi-asserted-by":"crossref","first-page":"1299","DOI":"10.1038\/nature04226","article-title":"A haplotype map of the human genome","volume":"437","author":"The International HapMap Consortium","year":"2005","journal-title":"Nature"},{"key":"2023012513000377600_B52","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/0304-3975(92)90143-4","article-title":"Approximate string matching with q-grams and maximal matches","volume":"92","author":"Ukkonen","year":"1992","journal-title":"Theor. Comput. Sci."},{"key":"2023012513000377600_B53","first-page":"75","article-title":"Preserving order in a forest in less than logarithmic time","volume-title":"Proceedings of the 16th Annual Symposium on Foundations of Computer Science","author":"van Emde Boas","year":"1975"},{"key":"2023012513000377600_B54","first-page":"99","article-title":"Design and implementation of an efficient priority queue","volume":"10","author":"van Emde Boas","year":"1976","journal-title":"Theo. Comput. Syst."},{"key":"2023012513000377600_B55","doi-asserted-by":"crossref","first-page":"1467","DOI":"10.1089\/cmb.2010.0056","article-title":"Alignment-free sequence comparison (ii): theoretical power of comparison statistics","volume":"17","author":"Wan","year":"2010","journal-title":"J. Comput. Biol."},{"key":"2023012513000377600_B56","doi-asserted-by":"crossref","first-page":"1646","DOI":"10.1101\/gr.088823.108","article-title":"Razers fast read mapping with sensitivity control","volume":"19","author":"Weese","year":"2009","journal-title":"Genome Res."},{"key":"2023012513000377600_B57","first-page":"545","article-title":"Approximate string search in spatial databases","volume-title":"ICDE","author":"Yao","year":"2010"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/18\/i325\/48884733\/bioinformatics_28_18_i325.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/18\/i325\/48884733\/bioinformatics_28_18_i325.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T18:40:04Z","timestamp":1674672004000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/18\/i325\/245595"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,9,3]]},"references-count":57,"journal-issue":{"issue":"18","published-print":{"date-parts":[[2012,9,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts380","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2012,9,15]]},"published":{"date-parts":[[2012,9,3]]}}}