Abstract
Availability of extensive genetics data across multiple individuals and populations is driving the growing importance of graph based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Though research on sequence to graph alignments is nascent, it can draw from related work on pattern matching in hypertext. In this paper, we study sequence to graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in query alone, graph alone, or in both. We prove that when changes are permitted in graphs either standalone or in conjunction with changes in the query, the sequence to graph alignment problem is \(\mathcal {NP}\)-complete under both Hamming and edit distance models for alphabets of size \({\ge }2\). For the case where only changes to the sequence are permitted, we present an \(O(|V|+m|E|)\) time algorithm, where m denotes the query size, and V and E denote the vertex and edge sets of the graph, respectively. Our result is generalizable to both linear and affine gap penalty functions, and improves upon the run-time complexity of existing algorithms.
C. Jain and H. Zhang—Should be regarded as joint first-authors.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Amir, A., Lewenstein, M., Lewenstein, N.: Pattern matching in hypertext. J. Algorithms 35(1), 82–99 (2000)
Antipov, D., Korobeynikov, A., McLean, J.S., Pevzner, P.A.: hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32(7), 1009–1015 (2015)
Backurs, A., Indyk, P.: Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 51–58. ACM (2015)
Beretta, S., Bonizzoni, P., Denti, L., Previtali, M., Rizzi, R.: Mapping RNA-seq data to a transcript graph via approximate pattern matching to a hypertext. In: Figueiredo, D., Martín-Vide, C., Pratas, D., Vega-Rodríguez, M.A. (eds.) AlCoB 2017. LNCS, vol. 10252, pp. 49–61. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58163-7_3
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)
Dilthey, A., Cox, C., Iqbal, Z., Nelson, M.R., McVean, G.: Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47(6), 682 (2015)
Eggertsson, H.P., et al.: Graphtyper enables population-scale genotyping using pangenome graphs. Nat. Genet. 49(11), 1654 (2017)
Garg, S., Rautiainen, M., Novak, A.M., Garrison, E., Durbin, R., Marschall, T.: A graph-based approach to diploid genome assembly. Bioinformatics 34(13), i105–i114 (2018)
Garrison, E., et al.: Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018)
Gotoh, O.: An improved algorithm for matching biological sequences. J. Mol. Biol. 162(3), 705–708 (1982)
Heydari, M., Miclotte, G., Van de Peer, Y., Fostier, J.: BrownieAligner: accurate alignment of illumina sequencing data to de Bruijn graphs. BMC Bioinform. 19(1), 311 (2018)
Huang, L., Popic, V., Batzoglou, S.: Short read alignment with populations of genomes. Bioinformatics 29(13), i361–i370 (2013)
Kuosmanen, A., Paavilainen, T., Gagie, T., Chikhi, R., Tomescu, A., Mäkinen, V.: Using minimum path cover to boost dynamic programming on DAGs: co-linear chaining extended. In: Raphael, B.J. (ed.) RECOMB 2018. LNCS, vol. 10812, pp. 105–121. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89929-9_7
Lee, C., Grasso, C., Sharlow, M.F.: Multiple sequence alignment using partial order graphs. Bioinformatics 18(3), 452–464 (2002)
Limasset, A., Cazaux, B., Rivals, E., Peterlongo, P.: Read mapping on de Bruijn graphs. BMC Bioinform. 17(1), 237 (2016)
Liu, B., Guo, H., Brudno, M., Wang, Y.: deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics 32(21), 3224–3232 (2016)
Manber, U., Wu, S.: Approximate string matching with arbitrary costs for text and hypertext. In: Advances in Structural and Syntactic Pattern Recognition, pp. 22–33. World Scientific (1992)
Myers, E.W.: An overview of sequence comparison algorithms in molecular biology. University of Arizona, Department of Computer Science (1991)
Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(Suppl\(\_\)2), ii79–ii85 (2005)
Navarro, G.: Improved approximate pattern matching on hypertext. Theoret. Comput. Sci. 237(1–2), 455–463 (2000)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. (CSUR) 33(1), 31–88 (2001)
Nguyen, N., et al.: Building a pan-genome reference for a population. J. Comput. Biol. 22(5), 387–401 (2015)
Novak, A.M., et al.: Genome graphs. Preprint at bioRxiv (2017). https://doi.org/10.1101/101378
Park, K., Kim, D.K.: String matching in hypertext. In: Galil, Z., Ukkonen, E. (eds.) CPM 1995. LNCS, vol. 937, pp. 318–329. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-60044-2_51
Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001)
Rautiainen, M., Marschall, T.: Aligning sequences to general graphs in O(V + mE) time. Preprint at bioRxiv (2017). https://doi.org/10.1101/216127
Rowe, W.P., Winn, M.D.: Indexed variation graphs for efficient and accurate resistome profiling. Bioinformatics 1, 8 (2018)
Salmela, L., Rivals, E.: LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24), 3506–3514 (2014)
Sirén, J., Välimäki, N., Mäkinen, V.: Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 11(2), 375–388 (2014)
Thachuk, C.: Indexing hypertext. J. Discrete Algorithms 18, 113–122 (2013)
Vaddadi, K., Tayal, K., Srinivasan, R., Sivadasan, N.: Sequence alignment on directed graphs. J. Comput. Biol. 26(1), 53–67 (2018)
Wang, J.R., Holt, J., McMillan, L., Jones, C.D.: FMLRC: hybrid long read error correction using an FM-index. BMC Bioinform. 19(1), 50 (2018)
Wick, R.R., Judd, L.M., Gorrie, C.L., Holt, K.E.: Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13(6), e1005595 (2017)
Zhang, H., Jain, C., Aluru, S.: A comprehensive evaluation of long read error correction methods. Preprint at bioRxiv (2019). https://doi.org/10.1101/519330
Acknowledgements
This work is supported in part by US National Science Foundation grant CCF-1816027. Yu Gao was supported by the ACO Program at Georgia Institute of Technology.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
A Appendix
A Appendix
See Fig. 4.
An example to illustrate the construction of an alignment graph for sequence to graph alignment using affine gap penalty. The alignment graph now contains three sub-graphs separated by the gray dash lines. The deletion and insertion weighted edges in the alignment graph for linear gap penalty are shifted to deletion sub-graph and insertion sub-graph, respectively. Their weights are also changed to the gap extension penalty. Besides, more edges are added to connect the sub-graphs with each other. For simplicity, we use the highlighted vertex as an example to illustrate how to open a gap and extend it. The weight of magenta colored edges is the sum of gap open penalty and gap extension penalty, and the weight of the black colored edges is 0. (Color figure online)
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Jain, C., Zhang, H., Gao, Y., Aluru, S. (2019). On the Complexity of Sequence to Graph Alignment. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_6
Download citation
DOI: https://doi.org/10.1007/978-3-030-17083-7_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer ScienceComputer Science (R0)