
How Much Do DNA and Protein Deep Embeddings Preserve Biological Information?

  • Conference paper
  • Computational Methods in Systems Biology (CMSB 2024)

Abstract

Models that map DNA and protein sequences into deep embeddings have recently been developed. While their ability to improve prediction in downstream tasks has been demonstrated, a clear picture of the advantages and disadvantages of the different embedding types, and of how best to apply them, is still missing. In this paper we compare five models (one for DNA, four for proteins) and several embedding aggregation methods with respect to how well they preserve evolutionary and functional information, using a hierarchical tree approach. Specifically, we introduce a novel procedure that builds hierarchical clustering trees from the relative positions of sequences in the embedding latent space and compares them to the phylogenetic and functional similarities between the sequences. The methods are benchmarked on five datasets from various organisms. The ESM protein language model and DNABERT emerge as the best performers in different settings.
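To make the general idea concrete, the sketch below shows one plausible instantiation of such a procedure, under assumptions of ours rather than the paper's exact pipeline: per-residue embeddings (as produced by a model such as ESM or DNABERT) are mean-pooled into one fixed-size vector per sequence, a hierarchical clustering tree is built over cosine distances in the latent space, and the resulting partition is scored against known functional labels with the Adjusted Rand Index. All inputs here are synthetic placeholders.

    # Hypothetical sketch (not the authors' exact procedure): aggregate
    # per-token embeddings, cluster hierarchically in latent space, and
    # compare the partition to known functional classes.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)

    # Stand-ins for real data: 20 sequences, each a (length x dim) matrix
    # of per-residue embeddings from a protein or DNA language model.
    token_embeddings = [rng.normal(size=(rng.integers(50, 200), 64))
                        for _ in range(20)]
    functional_labels = rng.integers(0, 4, size=20)  # e.g. GO-derived classes

    # One aggregation method: mean pooling over the sequence axis gives a
    # single fixed-size vector per sequence.
    pooled = np.stack([e.mean(axis=0) for e in token_embeddings])

    # Hierarchical clustering tree over pairwise cosine distances in the
    # embedding latent space.
    tree = linkage(pdist(pooled, metric="cosine"), method="average")

    # Cut the tree into as many clusters as there are functional classes
    # and score the agreement with the known labels.
    clusters = fcluster(tree, t=len(set(functional_labels)),
                        criterion="maxclust")
    print("Adjusted Rand Index:", adjusted_rand_score(functional_labels,
                                                      clusters))

Mean pooling is only one possible aggregation choice; alternatives such as max pooling or a model's summary token slot into the same pipeline unchanged, which is what makes a comparison of aggregation methods straightforward in this setup.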



Acknowledgements

This work was partially financed by the European Union - NextGenerationEU, in the context of the National Recovery and Resilience Plan, Investment 1.5 "Ecosystems of Innovation", Project Tuscany Health Ecosystem (THE), CUP B83C22003920001, Spoke 3. We thank Dr. Margherita Bodini and Dr. Alessandro Brozzi from GSK Vaccines, Siena, Italy, for useful discussions on DNA and protein embeddings and their aggregation.

Author information


Correspondence to Matteo Tolloso or Alina Sîrbu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tolloso, M., Galfrè, S.G., Pavone, A., Podda, M., Sîrbu, A., Priami, C. (2024). How Much Do DNA and Protein Deep Embeddings Preserve Biological Information? In: Gori, R., Milazzo, P., Tribastone, M. (eds.) Computational Methods in Systems Biology. CMSB 2024. Lecture Notes in Computer Science, vol. 14971. Springer, Cham. https://doi.org/10.1007/978-3-031-71671-3_15


  • DOI: https://doi.org/10.1007/978-3-031-71671-3_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71670-6

  • Online ISBN: 978-3-031-71671-3

  • eBook Packages: Computer Science; Computer Science (R0)
