Abstract
Models that map DNA and protein sequences into deep embeddings have recently been developed. While their ability to improve prediction in downstream tasks has been demonstrated, a clear picture of the advantages and disadvantages of the different embedding types, and of the different ways of applying them, is still missing. In this paper we compare five models (one for DNA, four for proteins) and several embedding aggregation methods with respect to how well they preserve evolutionary and functional information, using a hierarchical tree approach. Specifically, we introduce a novel procedure that builds hierarchical clustering trees from the embedding latent space and compares the relative positions of sequences in these trees with the phylogenetic and functional similarities between the sequences. The methods are benchmarked on five datasets from various organisms. The ESM protein language model and DNABERT emerge as the best performers in different settings.
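As a rough illustration of the tree-based comparison described above, the sketch below builds a hierarchical clustering tree from per-sequence embeddings and scores its agreement with reference groupings. This is a minimal sketch, not the authors' exact pipeline: the mean-pooled embeddings, cosine metric, average linkage, and adjusted Rand scoring (Hubert and Arabie, 1985) are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' exact pipeline): build a hierarchical
# clustering tree from per-sequence embeddings and compare the resulting
# partition to reference (e.g., taxonomic or GO-based) labels using the
# adjusted Rand index.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

def tree_agreement(embeddings: np.ndarray, labels: list[str]) -> float:
    """embeddings: (n_sequences, dim) matrix of per-sequence vectors,
    e.g., obtained by mean-pooling per-residue ESM or per-token DNABERT
    embeddings (one of several possible aggregation methods)."""
    dists = pdist(embeddings, metric="cosine")   # pairwise distances
    tree = linkage(dists, method="average")      # agglomerative tree (UPGMA)
    # Cut the tree at the number of reference groups and score the match.
    k = len(set(labels))
    clusters = fcluster(tree, t=k, criterion="maxclust")
    return adjusted_rand_score(labels, clusters)

# Toy usage with random vectors standing in for real model output:
rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 64))
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
print(tree_agreement(emb, labels))
```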
References
Abdi, H., et al.: DISTATIS: the analysis of multiple distance matrices. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)-Workshops. IEEE (2005)
Ashburner, M., et al.: Gene ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000)
Benegas, G., Batra, S.S., Song, Y.S.: DNA language models are powerful predictors of genome-wide variant effects. Proc. Natl. Acad. Sci. 120(44), e2311219120 (2023)
Bepler, T., Berger, B.: Learning protein sequence embeddings using information from structure (2019). arXiv preprint arXiv:1902.08661
Bepler, T., Berger, B.: Learning the protein language: evolution, structure, and function. Cell Syst. 12(6), 654–669 (2021)
Brandes, N., et al.: ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38(8), 2102–2110 (2022)
Detlefsen, N.S., Hauberg, S., Boomsma, W.: Learning meaningful representations of protein sequences. Nat. Commun. 13(1), 1914 (2022)
Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv preprint arXiv:1810.04805
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge (1998)
Elnaggar, A., et al.: ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 7112–7127 (2021)
Evans, R., et al.: Protein complex prediction with AlphaFold-Multimer. bioRxiv (2022). https://doi.org/10.1101/2021.10.04.463034
Fenoy, E., Edera, A.A., Stegmayer, G.: Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks. Briefings Bioinform. 23(4), bbac232 (2022)
Ferruz, N., Schmidt, S., Höcker, B.: ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13(1), 4348 (2022)
Gao, M., et al.: AF2Complex predicts direct physical interactions in multimeric proteins with deep learning. Nat. Commun. 13(1), 1744 (2022)
Heinzinger, M., et al.: Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20(1), 723 (2019)
Hie, B.L., et al.: Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42(2), 275–283 (2024)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
Ji, Y., et al.: DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37(15), 2112–2120 (2021)
Jing, X., Wu, F., Luo, X., Xu, J.: Single-sequence protein structure prediction by integrating protein language models. Proc. Natl. Acad. Sci. 121(13), e2308788121 (2024)
Jumper, J., et al.: Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021)
Khurana, D., et al.: Natural language processing: state of the art, current trends and challenges. Multimedia Tools Appl. 82(3), 3713–3744 (2023)
Kosloff, M., Kolodny, R.: Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins: Structure Funct. Bioinform. 71(2), 891–902 (2008)
Kusner, M., et al.: From word embeddings to document distances. In: International Conference on Machine Learning. PMLR (2015)
Lin, Z., et al.: Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv (2022)
Lin, Z., et al.: Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), 1123–1130 (2023)
Madani, A., et al.: Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41(8), 1099–1106 (2023)
Mardikoraem, M., et al.: Generative models for protein sequence modeling: recent advances and future directions. Briefings Bioinform. 24(6), bbad358 (2023)
Mikolov, T., et al.: Efficient estimation of word representations in vector space (2013). arXiv preprint arXiv:1301.3781
Milligan, G.W., Cooper, M.C.: A study of the comparability of external criteria for hierarchical cluster analysis. Multivar. Behav. Res. 21(4), 441–458 (1986)
Nijkamp, E., Ruffolo, J.A., Weinstein, E.N., Naik, N., Madani, A.: ProGen2: exploring the boundaries of protein language models. Cell Syst. 14(11), 968–978 (2023)
Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, New Orleans, Louisiana (2018)
Rives, A., et al.: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118(15), e2016239118 (2021)
Salant, S., Berant, J.: Contextualized word representations for reading comprehension (2017). arXiv preprint arXiv:1712.03609
Sievers, F., et al.: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7(1), 539 (2011)
Sofi, M.Y., Shafi, A., Masoodi, K.Z.: Chapter 6 - multiple sequence alignment. In: Bioinformatics for Everyone, pp. 47–53. Academic Press (2022)
Su, J., et al.: RoFormer: enhanced transformer with rotary position embedding (2021). arXiv preprint arXiv:2104.09864
Suzek, B.E., et al.: UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31(6), 926–932 (2015)
Tan, X., Yuan, C., Wu, H., Zhao, X.: Comprehensive evaluation of BERT model for DNA-language for prediction of DNA sequence binding specificities in fine-tuning phase. In: Huang, D.S., Jo, K.H., Jing, J., Premaratne, P., Bevilacqua, V., Hussain, A. (eds.) Intelligent Computing Theories and Application. ICIC 2022. LNCS, vol. 13394. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13829-4_8
The Gene Ontology Consortium, et al.: The gene ontology knowledgebase in 2023. Genetics 224(1), iyad031 (2023)
Tsaban, T., et al.: Harnessing protein folding neural networks for peptide–protein docking. Nat. Commun. 13(1), 176 (2022)
Unsal, S., Atas, H., Albayrak, M., Turhan, K., Acar, A.C., Doğan, T.: Learning functional properties of proteins with language models. Nat. Mach. Intell. 4(3), 227–245 (2022)
Vijaymeena, M., Kavitha, K.: A survey on similarity measures in text mining. Mach. Learn. Appl. Int. J. 3(2), 19–28 (2016)
Villegas-Morcillo, A., Gomez, A.M., Sanchez, V.: An analysis of protein language model embeddings for fold prediction. Briefings Bioinform. 23(3), bbac142 (2022)
Väth, P., et al.: PROVAL: a framework for comparison of protein sequence embeddings. J. Comput. Math. Data Sci. 3, 100044 (2022)
Yao, Y., et al.: An integration of deep learning with feature embedding for protein-protein interaction prediction. PeerJ 7, e7126 (2019)
Acknowledgements
This work was partially financed by the European Union - Next Generation EU co-funding, in the context of the National Recovery and Resilience Plan, Investment 1.5 Ecosystems of Innovation, Project Tuscany Health Ecosystem (THE), CUP: B83C22003920001, Spoke 3. We thank Dr. Margherita Bodini and Dr. Alessandro Brozzi from GSK Vaccines, Siena, Italy, for useful discussions on DNA and protein embeddings and their aggregation.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Tolloso, M., Galfrè, S.G., Pavone, A., Podda, M., Sîrbu, A., Priami, C. (2024). How Much Do DNA and Protein Deep Embeddings Preserve Biological Information? In: Gori, R., Milazzo, P., Tribastone, M. (eds.) Computational Methods in Systems Biology. CMSB 2024. Lecture Notes in Computer Science, vol. 14971. Springer, Cham. https://doi.org/10.1007/978-3-031-71671-3_15