Abstract
Representations of biological sequences facilitating sequence comparison are crucial in several bioinformatics tasks. Recently, the Lyndon factorization has been proved to preserve common factors in overlapping reads [6], thus leading to the idea of using factorizations of sequences to define measures of similarity between reads. In this paper we propose as a signature of sequencing reads the notion of fingerprint, i.e., the sequence of lengths of consecutive factors in Lyndon-based factorizations of the reads. Surprisingly, fingerprints of reads are effective in preserving sequence similarities while providing a compact representation of the read, and so, k-mers extracted from a fingerprint, called k-fingers, can be used to capture sequence similarity between reads.
We first provide a probabilistic framework to estimate the behaviour of fingerprints. Then we experimentally evaluate the effectiveness of this representation for machine learning algorithms for classifying biological sequences. In particular, we considered the problem of assigning RNA-Seq reads to the most likely gene from which they were generated. Our results show that fingerprints can provide an effective machine learning interpretable representation, successfully preserving sequence similarity.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Asgari, E., Mofrad, M.R.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10(11), e0141287 (2015)
Berlin, K., Koren, S., Chin, C.S., Drake, J.P., Landolin, J.M., Phillippy, A.M.: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnol. 33(6), 623–630 (2015)
Berstel, J., Perrin, D.: The origins of combinatorics on words. Eur. J. Comb. 28(3), 996–1022 (2007)
Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: Lyndon words versus inverse Lyndon words: queries on suffixes and bordered words. In: Leporati, A., Martín-Vide, C., Shapira, D., Zandron, C. (eds.) LATA 2020. LNCS, vol. 12038, pp. 385–396. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-40608-0_27
Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. App. Math. 101, 281–319 (2018)
Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties. Theor. Comput. Sci. 862, 24–41 (2021)
Chen, K.T., Fox, R.H., Lyndon, R.C.: Free differential calculus, IV. the quotient groups of the lower central series. Ann. Math. 68(1), 81–95 (1958)
Delgrange, O., Rivals, E.: STAR: an algorithm to search for tandem approximate repeats. Bioinformatics 20(16), 2812–2820 (2004)
Denti, L., et al.: Shark: fishing relevant reads in an RNA-Seq sample. Bioinformatics (2021)
Duval, J.P.: Factorizing words over an ordered alphabet. J. Algorithms 4(4), 363–381 (1983)
Kimothi, D., Soni, A., Biyani, P., Hogan, J.M.: Distributed representations for biological sequence analysis. arXiv preprint arXiv:1608.05949 (2016)
Kumar, P., Krishna, P.R., Raju, S.B.: Pattern Discovery Using Sequence Data Mining: Applications and Studies. IGI Publishing, United States (2011)
Köppl, D., Hashimoto, D., Hendrian, D., Shinohara, A.: In-Place bijective Burrows-Wheeler Transforms. In: Combinatorial Pattern Matching (2020)
Lothaire, M.: Combinatorics on Words. Cambridge University Press, Cambridge (1967)
Lyndon, R.C.: On burnside’s problem. Trans. Am. Math. Soc. 77(2), 202–215 (1954)
Motomura, K., Fujita, T., Tsutsumi, M., Kikuzato, S., Nakamura, M., Otaki, J.M.: Word decoding of protein amino acid sequences with availability analysis: a linguistic approach. PLoS ONE 7(11), e50039 (2012)
Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using minhash. Genome Biol. 17(1), 132 (2016)
Srinivasan, S.M., Vural, S., King, B.R., Guda, C.: Mining for class-specific motifs in protein sequence classification. BMC Bioinform. 14(1), 96 (2013)
Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Education India (2016)
Vries, J.K., Liu, X.: Subfamily specific conservation profiles for proteins based on n-gram patterns. BMC Bioinform. 9(1), 72 (2008)
Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement number [872539].
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Bonizzoni, P. et al. (2021). Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science(), vol 12715. Springer, Cham. https://doi.org/10.1007/978-3-030-74432-8_2
Download citation
DOI: https://doi.org/10.1007/978-3-030-74432-8_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74431-1
Online ISBN: 978-3-030-74432-8
eBook Packages: Computer ScienceComputer Science (R0)