Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning | SpringerLink
Skip to main content

Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning

  • Conference paper
  • First Online:
Algorithms for Computational Biology (AlCoB 2021)

Abstract

Representations of biological sequences facilitating sequence comparison are crucial in several bioinformatics tasks. Recently, the Lyndon factorization has been proved to preserve common factors in overlapping reads [6], thus leading to the idea of using factorizations of sequences to define measures of similarity between reads. In this paper we propose as a signature of sequencing reads the notion of fingerprint, i.e., the sequence of lengths of consecutive factors in Lyndon-based factorizations of the reads. Surprisingly, fingerprints of reads are effective in preserving sequence similarities while providing a compact representation of the read, and so, k-mers extracted from a fingerprint, called k-fingers, can be used to capture sequence similarity between reads.

We first provide a probabilistic framework to estimate the behaviour of fingerprints. Then we experimentally evaluate the effectiveness of this representation for machine learning algorithms for classifying biological sequences. In particular, we considered the problem of assigning RNA-Seq reads to the most likely gene from which they were generated. Our results show that fingerprints can provide an effective machine learning interpretable representation, successfully preserving sequence similarity.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    https://preshing.com/20110504/hash-collision-probabilities/.

  2. 2.

    https://scikit-learn.org/.

  3. 3.

    https://github.com/rzaccagnino/DeepShark.

References

  1. Asgari, E., Mofrad, M.R.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10(11), e0141287 (2015)

    Article  Google Scholar 

  2. Berlin, K., Koren, S., Chin, C.S., Drake, J.P., Landolin, J.M., Phillippy, A.M.: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnol. 33(6), 623–630 (2015)

    Article  Google Scholar 

  3. Berstel, J., Perrin, D.: The origins of combinatorics on words. Eur. J. Comb. 28(3), 996–1022 (2007)

    Article  MathSciNet  Google Scholar 

  4. Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: Lyndon words versus inverse Lyndon words: queries on suffixes and bordered words. In: Leporati, A., Martín-Vide, C., Shapira, D., Zandron, C. (eds.) LATA 2020. LNCS, vol. 12038, pp. 385–396. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-40608-0_27

    Chapter  MATH  Google Scholar 

  5. Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. App. Math. 101, 281–319 (2018)

    Article  MathSciNet  Google Scholar 

  6. Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties. Theor. Comput. Sci. 862, 24–41 (2021)

    Google Scholar 

  7. Chen, K.T., Fox, R.H., Lyndon, R.C.: Free differential calculus, IV. the quotient groups of the lower central series. Ann. Math. 68(1), 81–95 (1958)

    Article  MathSciNet  Google Scholar 

  8. Delgrange, O., Rivals, E.: STAR: an algorithm to search for tandem approximate repeats. Bioinformatics 20(16), 2812–2820 (2004)

    Article  Google Scholar 

  9. Denti, L., et al.: Shark: fishing relevant reads in an RNA-Seq sample. Bioinformatics (2021)

    Google Scholar 

  10. Duval, J.P.: Factorizing words over an ordered alphabet. J. Algorithms 4(4), 363–381 (1983)

    Article  MathSciNet  Google Scholar 

  11. Kimothi, D., Soni, A., Biyani, P., Hogan, J.M.: Distributed representations for biological sequence analysis. arXiv preprint arXiv:1608.05949 (2016)

  12. Kumar, P., Krishna, P.R., Raju, S.B.: Pattern Discovery Using Sequence Data Mining: Applications and Studies. IGI Publishing, United States (2011)

    Google Scholar 

  13. Köppl, D., Hashimoto, D., Hendrian, D., Shinohara, A.: In-Place bijective Burrows-Wheeler Transforms. In: Combinatorial Pattern Matching (2020)

    Google Scholar 

  14. Lothaire, M.: Combinatorics on Words. Cambridge University Press, Cambridge (1967)

    MATH  Google Scholar 

  15. Lyndon, R.C.: On burnside’s problem. Trans. Am. Math. Soc. 77(2), 202–215 (1954)

    MathSciNet  MATH  Google Scholar 

  16. Motomura, K., Fujita, T., Tsutsumi, M., Kikuzato, S., Nakamura, M., Otaki, J.M.: Word decoding of protein amino acid sequences with availability analysis: a linguistic approach. PLoS ONE 7(11), e50039 (2012)

    Article  Google Scholar 

  17. Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using minhash. Genome Biol. 17(1), 132 (2016)

    Article  Google Scholar 

  18. Srinivasan, S.M., Vural, S., King, B.R., Guda, C.: Mining for class-specific motifs in protein sequence classification. BMC Bioinform. 14(1), 96 (2013)

    Article  Google Scholar 

  19. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Education India (2016)

    Google Scholar 

  20. Vries, J.K., Liu, X.: Subfamily specific conservation profiles for proteins based on n-gram patterns. BMC Bioinform. 9(1), 72 (2008)

    Article  Google Scholar 

Download references

Acknowledgments

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement number [872539].

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Paola Bonizzoni .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Bonizzoni, P. et al. (2021). Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science(), vol 12715. Springer, Cham. https://doi.org/10.1007/978-3-030-74432-8_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-74432-8_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-74431-1

  • Online ISBN: 978-3-030-74432-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics