On the Complexity of Sequence to Graph Alignment | SpringerLink
Skip to main content

On the Complexity of Sequence to Graph Alignment

  • Conference paper
  • First Online:
Research in Computational Molecular Biology (RECOMB 2019)

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11467))

Abstract

Availability of extensive genetics data across multiple individuals and populations is driving the growing importance of graph based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Though research on sequence to graph alignments is nascent, it can draw from related work on pattern matching in hypertext. In this paper, we study sequence to graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in query alone, graph alone, or in both. We prove that when changes are permitted in graphs either standalone or in conjunction with changes in the query, the sequence to graph alignment problem is \(\mathcal {NP}\)-complete under both Hamming and edit distance models for alphabets of size \({\ge }2\). For the case where only changes to the sequence are permitted, we present an \(O(|V|+m|E|)\) time algorithm, where m denotes the query size, and V and E denote the vertex and edge sets of the graph, respectively. Our result is generalizable to both linear and affine gap penalty functions, and improves upon the run-time complexity of existing algorithms.

C. Jain and H. Zhang—Should be regarded as joint first-authors.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 7549
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 9437
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Amir, A., Lewenstein, M., Lewenstein, N.: Pattern matching in hypertext. J. Algorithms 35(1), 82–99 (2000)

    Article  MathSciNet  MATH  Google Scholar 

  2. Antipov, D., Korobeynikov, A., McLean, J.S., Pevzner, P.A.: hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32(7), 1009–1015 (2015)

    Article  Google Scholar 

  3. Backurs, A., Indyk, P.: Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 51–58. ACM (2015)

    Google Scholar 

  4. Beretta, S., Bonizzoni, P., Denti, L., Previtali, M., Rizzi, R.: Mapping RNA-seq data to a transcript graph via approximate pattern matching to a hypertext. In: Figueiredo, D., Martín-Vide, C., Pratas, D., Vega-Rodríguez, M.A. (eds.) AlCoB 2017. LNCS, vol. 10252, pp. 49–61. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58163-7_3

    Chapter  Google Scholar 

  5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)

    MATH  Google Scholar 

  6. Dilthey, A., Cox, C., Iqbal, Z., Nelson, M.R., McVean, G.: Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47(6), 682 (2015)

    Article  Google Scholar 

  7. Eggertsson, H.P., et al.: Graphtyper enables population-scale genotyping using pangenome graphs. Nat. Genet. 49(11), 1654 (2017)

    Article  Google Scholar 

  8. Garg, S., Rautiainen, M., Novak, A.M., Garrison, E., Durbin, R., Marschall, T.: A graph-based approach to diploid genome assembly. Bioinformatics 34(13), i105–i114 (2018)

    Article  Google Scholar 

  9. Garrison, E., et al.: Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018)

    Article  Google Scholar 

  10. Gotoh, O.: An improved algorithm for matching biological sequences. J. Mol. Biol. 162(3), 705–708 (1982)

    Article  Google Scholar 

  11. Heydari, M., Miclotte, G., Van de Peer, Y., Fostier, J.: BrownieAligner: accurate alignment of illumina sequencing data to de Bruijn graphs. BMC Bioinform. 19(1), 311 (2018)

    Article  Google Scholar 

  12. Huang, L., Popic, V., Batzoglou, S.: Short read alignment with populations of genomes. Bioinformatics 29(13), i361–i370 (2013)

    Article  Google Scholar 

  13. Kuosmanen, A., Paavilainen, T., Gagie, T., Chikhi, R., Tomescu, A., Mäkinen, V.: Using minimum path cover to boost dynamic programming on DAGs: co-linear chaining extended. In: Raphael, B.J. (ed.) RECOMB 2018. LNCS, vol. 10812, pp. 105–121. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89929-9_7

    Chapter  Google Scholar 

  14. Lee, C., Grasso, C., Sharlow, M.F.: Multiple sequence alignment using partial order graphs. Bioinformatics 18(3), 452–464 (2002)

    Article  Google Scholar 

  15. Limasset, A., Cazaux, B., Rivals, E., Peterlongo, P.: Read mapping on de Bruijn graphs. BMC Bioinform. 17(1), 237 (2016)

    Article  Google Scholar 

  16. Liu, B., Guo, H., Brudno, M., Wang, Y.: deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics 32(21), 3224–3232 (2016)

    Article  Google Scholar 

  17. Manber, U., Wu, S.: Approximate string matching with arbitrary costs for text and hypertext. In: Advances in Structural and Syntactic Pattern Recognition, pp. 22–33. World Scientific (1992)

    Google Scholar 

  18. Myers, E.W.: An overview of sequence comparison algorithms in molecular biology. University of Arizona, Department of Computer Science (1991)

    Google Scholar 

  19. Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(Suppl\(\_\)2), ii79–ii85 (2005)

    Google Scholar 

  20. Navarro, G.: Improved approximate pattern matching on hypertext. Theoret. Comput. Sci. 237(1–2), 455–463 (2000)

    Article  MathSciNet  MATH  Google Scholar 

  21. Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. (CSUR) 33(1), 31–88 (2001)

    Article  Google Scholar 

  22. Nguyen, N., et al.: Building a pan-genome reference for a population. J. Comput. Biol. 22(5), 387–401 (2015)

    Article  MathSciNet  Google Scholar 

  23. Novak, A.M., et al.: Genome graphs. Preprint at bioRxiv (2017). https://doi.org/10.1101/101378

  24. Park, K., Kim, D.K.: String matching in hypertext. In: Galil, Z., Ukkonen, E. (eds.) CPM 1995. LNCS, vol. 937, pp. 318–329. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-60044-2_51

    Chapter  Google Scholar 

  25. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  26. Rautiainen, M., Marschall, T.: Aligning sequences to general graphs in O(V + mE) time. Preprint at bioRxiv (2017). https://doi.org/10.1101/216127

  27. Rowe, W.P., Winn, M.D.: Indexed variation graphs for efficient and accurate resistome profiling. Bioinformatics 1, 8 (2018)

    Google Scholar 

  28. Salmela, L., Rivals, E.: LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24), 3506–3514 (2014)

    Article  Google Scholar 

  29. Sirén, J., Välimäki, N., Mäkinen, V.: Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 11(2), 375–388 (2014)

    Article  Google Scholar 

  30. Thachuk, C.: Indexing hypertext. J. Discrete Algorithms 18, 113–122 (2013)

    Article  MathSciNet  MATH  Google Scholar 

  31. Vaddadi, K., Tayal, K., Srinivasan, R., Sivadasan, N.: Sequence alignment on directed graphs. J. Comput. Biol. 26(1), 53–67 (2018)

    Google Scholar 

  32. Wang, J.R., Holt, J., McMillan, L., Jones, C.D.: FMLRC: hybrid long read error correction using an FM-index. BMC Bioinform. 19(1), 50 (2018)

    Article  Google Scholar 

  33. Wick, R.R., Judd, L.M., Gorrie, C.L., Holt, K.E.: Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13(6), e1005595 (2017)

    Article  Google Scholar 

  34. Zhang, H., Jain, C., Aluru, S.: A comprehensive evaluation of long read error correction methods. Preprint at bioRxiv (2019). https://doi.org/10.1101/519330

Download references

Acknowledgements

This work is supported in part by US National Science Foundation grant CCF-1816027. Yu Gao was supported by the ACO Program at Georgia Institute of Technology.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Srinivas Aluru .

Editor information

Editors and Affiliations

A Appendix

A Appendix

See Fig. 4.

Fig. 4.
figure 4

An example to illustrate the construction of an alignment graph for sequence to graph alignment using affine gap penalty. The alignment graph now contains three sub-graphs separated by the gray dash lines. The deletion and insertion weighted edges in the alignment graph for linear gap penalty are shifted to deletion sub-graph and insertion sub-graph, respectively. Their weights are also changed to the gap extension penalty. Besides, more edges are added to connect the sub-graphs with each other. For simplicity, we use the highlighted vertex as an example to illustrate how to open a gap and extend it. The weight of magenta colored edges is the sum of gap open penalty and gap extension penalty, and the weight of the black colored edges is 0. (Color figure online)

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Jain, C., Zhang, H., Gao, Y., Aluru, S. (2019). On the Complexity of Sequence to Graph Alignment. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-17083-7_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17082-0

  • Online ISBN: 978-3-030-17083-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics