Abstract
Counting the occurrence frequency of each k-mer in a biological sequence is an important step in many bioinformatics applications. However, most k-mer counting algorithms take a fixed k and produce k-mers of a single length, which is inefficient when a sequence must be analyzed for several values of k. Moreover, existing k-mer counters focus mainly on DNA sequences and pay less attention to protein sequences, even though the analysis of k-mers in protein sequences can provide substantial biological insight into structure, function and evolution. To this end, an efficient algorithm called VLmer (Various Length k-mer mining) is proposed. It mines k-mers of various lengths, termed vl-mers, using an inverted-index technique that is orders of magnitude faster than the conventional forward-index method. To the best of our knowledge, VLmer is the first algorithm able to mine k-mers of various lengths in both DNA and protein sequences.
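The general idea of counting k-mers of several lengths through an inverted index can be illustrated with a minimal Python sketch. This is not the published VLmer algorithm, whose details are not given in the abstract. It is one plausible reading, in which each k-mer keeps the list of its start positions and longer k-mers are obtained by extending those positions by one symbol. The function name `mine_vl_mers` and the parameters `k_min` and `k_max` are hypothetical.

```python
from collections import defaultdict


def mine_vl_mers(seq, k_min, k_max):
    """Count every k-mer with k_min <= k <= k_max in seq.

    Illustrative sketch only (an assumption, not the authors' method):
    an inverted index maps each k-mer to its start positions, and the
    index for length k + 1 is built from the index for length k.
    """
    if not 1 <= k_min <= k_max:
        raise ValueError("require 1 <= k_min <= k_max")

    # Inverted index for k = 1: symbol -> list of start positions.
    index = defaultdict(list)
    for pos, symbol in enumerate(seq):
        index[symbol].append(pos)

    counts = {}
    for k in range(1, k_max + 1):
        # Record the counts once the requested length range is reached.
        if k >= k_min:
            for kmer, positions in index.items():
                counts[kmer] = len(positions)
        if k == k_max:
            break
        # Extend each k-mer by the symbol that follows each occurrence.
        longer = defaultdict(list)
        for kmer, positions in index.items():
            for pos in positions:
                end = pos + k
                if end < len(seq):
                    longer[kmer + seq[end]].append(pos)
        index = longer
    return counts


if __name__ == "__main__":
    # The same function works on a DNA string and on a protein string.
    print(mine_vl_mers("ACGTACGA", 2, 3))
    print(mine_vl_mers("MKVMKV", 3, 3))
```

Because the index is keyed on whatever symbols occur in the input, the same sketch handles both the four-letter DNA alphabet and the twenty-letter protein alphabet without modification. It makes no claim about the speedups reported for VLmer.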
J. Zhang and J. Guo contributed equally to this work.
Acknowledgements
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB13040700); the National Natural Science Foundation of China (Nos. 91439103; 91529303; 61602460; 31200987); Shanghai Municipal Natural Science Foundation (Nos. 17ZR1406900; 2016M601660); the China Postdoctoral Science Foundation (Nos. 2016M600338; 2016M601660); and the JSPS KAKENHI Grant (No. 15H05707).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhang, J. et al. (2017). Mining K-mers of Various Lengths in Biological Sequences. In: Cai, Z., Daescu, O., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2017. Lecture Notes in Computer Science, vol 10330. Springer, Cham. https://doi.org/10.1007/978-3-319-59575-7_17
DOI: https://doi.org/10.1007/978-3-319-59575-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59574-0
Online ISBN: 978-3-319-59575-7
eBook Packages: Computer Science, Computer Science (R0)