Mining K-mers of Various Lengths in Biological Sequences

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2017)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10330)

Abstract

Counting the occurrence frequency of each k-mer in a biological sequence is an important step in many bioinformatics applications. However, most k-mer counting algorithms take a single given k and produce k-mers of that one length, which is inefficient when a sequence must be analyzed for several values of k. Moreover, existing k-mer counters focus mainly on DNA sequences and much less on protein sequences, even though the analysis of k-mers in protein sequences can provide substantial biological insight into structure, function, and evolution. To this end, an efficient algorithm called VLmer (Various Length k-mer mining) is proposed. VLmer mines k-mers of various lengths, termed vl-mers, using an inverted-index technique that is orders of magnitude faster than the conventional forward-index method. To the best of our knowledge, VLmer is the first algorithm able to mine k-mers of various lengths in both DNA and protein sequences.
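
The abstract does not spell out how VLmer works, so the following is only a minimal sketch of the general inverted-index idea, not the authors' implementation. It assumes the index maps each k-mer to the list of its start positions, and that (k+1)-mers are obtained by extending those stored occurrences by one symbol instead of rescanning the sequence for every k. The function name `mine_vl_mers` and the `min_count` threshold are hypothetical, introduced here for illustration.

```python
from collections import defaultdict


def mine_vl_mers(seq, k_min, k_max, min_count=1):
    """Count k-mers for every k in [k_min, k_max]; returns {k: {kmer: count}}.

    Illustrative sketch only (assumed design, not the published VLmer code).
    """
    if not 1 <= k_min <= k_max:
        raise ValueError("require 1 <= k_min <= k_max")

    # Inverted index for the shortest length: k-mer -> list of start positions.
    index = defaultdict(list)
    for i in range(len(seq) - k_min + 1):
        index[seq[i:i + k_min]].append(i)
    index = {kmer: pos for kmer, pos in index.items() if len(pos) >= min_count}

    result = {}
    k = k_min
    while index:
        result[k] = {kmer: len(pos) for kmer, pos in index.items()}
        if k == k_max:
            break
        # Build the (k+1)-mer index by extending each stored occurrence by one
        # symbol. Only positions of surviving k-mers are visited, so k-mers
        # pruned by min_count are never extended.
        longer = defaultdict(list)
        for kmer, positions in index.items():
            for i in positions:
                if i + k < len(seq):
                    longer[kmer + seq[i + k]].append(i)
        index = {kmer: pos for kmer, pos in longer.items() if len(pos) >= min_count}
        k += 1
    return result


if __name__ == "__main__":
    # The alphabet is not fixed, so DNA and protein strings are handled alike.
    print(mine_vl_mers("ACGTACGT", 2, 3))
    print(mine_vl_mers("MKVLMKV", 2, 3, min_count=2))
```

Pruning by `min_count` is safe in this sketch because a (k+1)-mer can never occur more often than its length-k prefix, so discarding an infrequent k-mer cannot lose a frequent longer one.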

J. Zhang and J. Guo contributed equally to this work.

Acknowledgements

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB13040700); the National Natural Science Foundation of China (Nos. 91439103; 91529303; 61602460; 31200987); Shanghai Municipal Natural Science Foundation (Nos. 17ZR1406900; 2016M601660); the China Postdoctoral Science Foundation (Nos. 2016M600338; 2016M601660); and the JSPS KAKENHI Grant (No. 15H05707).

Author information

Corresponding authors

Correspondence to Tao Zeng or Luonan Chen.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Zhang, J. et al. (2017). Mining K-mers of Various Lengths in Biological Sequences. In: Cai, Z., Daescu, O., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2017. Lecture Notes in Computer Science, vol 10330. Springer, Cham. https://doi.org/10.1007/978-3-319-59575-7_17

  • DOI: https://doi.org/10.1007/978-3-319-59575-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59574-0

  • Online ISBN: 978-3-319-59575-7

  • eBook Packages: Computer Science, Computer Science (R0)
