{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,2]],"date-time":"2024-08-02T17:54:46Z","timestamp":1722621286125},"reference-count":52,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,6,8]],"date-time":"2023-06-08T00:00:00Z","timestamp":1686182400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,6,8]],"date-time":"2023-06-08T00:00:00Z","timestamp":1686182400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"publisher","award":["DE-AC05-00OR22725"],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Exascale Computing Project","award":["17-SC-20-SC"]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"The vast size of chemical space necessitates computational approaches to automate and accelerate the design of molecular sequences to guide experimental efforts for drug discovery. Genetic algorithms provide a useful framework to incrementally generate molecules by applying mutations to known chemical structures. Recently, masked language models have been applied to automate the mutation process by leveraging large compound libraries to learn commonly occurring chemical sequences (i.e., using tokenization) and predict rearrangements (i.e., using mask prediction). Here, we consider how language models can be adapted to improve molecule generation for different optimization tasks. We use two different generation strategies for comparison, fixed and adaptive. 
The fixed strategy uses a pre-trained model to generate mutations; the adaptive strategy trains the language model on each new generation of molecules selected for target properties during optimization. Our results show that the adaptive strategy allows the language model to more closely fit the distribution of molecules in the population. Therefore, for enhanced fitness optimization, we suggest the use of the fixed strategy during an initial phase followed by the use of the adaptive strategy. We demonstrate the impact of adaptive training by searching for molecules that optimize both heuristic metrics, drug-likeness and synthesizability, as well as predicted protein binding affinity from a surrogate model. Our results show that the adaptive strategy provides a significant improvement in fitness optimization compared to the fixed pre-trained model, empowering the application of language models to molecular design tasks.","DOI":"10.1186\/s13321-023-00719-7","type":"journal-article","created":{"date-parts":[[2023,6,8]],"date-time":"2023-06-08T14:02:41Z","timestamp":1686232961000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Adaptive language model training for molecular design"],"prefix":"10.1186","volume":"15","author":[{"given":"Andrew E.","family":"Blanchard","sequence":"first","affiliation":[]},{"given":"Debsindhu","family":"Bhowmik","sequence":"additional","affiliation":[]},{"given":"Zachary","family":"Fox","sequence":"additional","affiliation":[]},{"given":"John","family":"Gounley","sequence":"additional","affiliation":[]},{"given":"Jens","family":"Glaser","sequence":"additional","affiliation":[]},{"given":"Belinda 
S.","family":"Akpa","sequence":"additional","affiliation":[]},{"given":"Stephan","family":"Irle","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,6,8]]},"reference":[{"issue":"5","key":"719_CR1","doi-asserted-by":"publisher","first-page":"533","DOI":"10.1016\/S1473-3099(20)30120-1","volume":"20","author":"E Dong","year":"2020","unstructured":"Dong E, Du H, Gardner L (2020) An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis 20(5):533\u2013534. https:\/\/doi.org\/10.1016\/S1473-3099(20)30120-1","journal-title":"Lancet Infect Dis"},{"key":"719_CR2","doi-asserted-by":"publisher","first-page":"587","DOI":"10.1177\/10943420221121804","volume":"36","author":"AE Blanchard","year":"2022","unstructured":"Blanchard AE, Gounley J, Bhowmik D, Chandra Shekar M, Lyngaas I, Gao S, Yin J, Tsaris A, Wang F, Glaser J (2022) Language models for the prediction of SARS-CoV-2 inhibitors. Int J High Perform Comput Appl 36:587","journal-title":"Int J High Perform Comput Appl"},{"issue":"4","key":"719_CR3","doi-asserted-by":"publisher","first-page":"1955","DOI":"10.1021\/acs.jcim.9b01053","volume":"60","author":"AJ Minnich","year":"2020","unstructured":"Minnich AJ, McLoughlin K, Tse M, Deng J, Weber A, Murad N, Madej BD, Ramsundar B, Rush T, Calad-Thomson S, Brase J, Allen JE (2020) AMPL: a data-driven modeling pipeline for drug discovery. J Chem Inform Model 60(4):1955\u20131968. https:\/\/doi.org\/10.1021\/acs.jcim.9b01053","journal-title":"J Chem Inform Model"},{"issue":"6","key":"719_CR4","doi-asserted-by":"publisher","first-page":"1241","DOI":"10.1016\/j.drudis.2018.01.039","volume":"23","author":"H Chen","year":"2018","unstructured":"Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T (2018) The rise of deep learning in drug discovery. Drug Discov Today 23(6):1241\u20131250. 
https:\/\/doi.org\/10.1016\/j.drudis.2018.01.039","journal-title":"Drug Discov Today"},{"issue":"12","key":"719_CR5","doi-asserted-by":"publisher","first-page":"5832","DOI":"10.1021\/acs.jcim.0c01010","volume":"60","author":"A Acharya","year":"2020","unstructured":"Acharya A, Agarwal R, Baker MB, Baudry J, Bhowmik D, Boehm S, Byler KG, Chen SY, Coates L, Cooper CJ, Demerdash O, Daidone I, Eblen JD, Ellingson S, Forli S, Glaser J, Gumbart JC, Gunnels J, Hernandez O, Irle S, Kneller DW, Kovalevsky A, Larkin J, Lawrence TJ, LeGrand S, Liu S-H, Mitchell JC, Park G, Parks JM, Pavlova A, Petridis L, Poole D, Pouchard L, Ramanathan A, Rogers DM, Santos-Martins D, Scheinberg A, Sedova A, Shen Y, Smith JC, Smith MD, Soto C, Tsaris A, Thavappiragasam M, Tillack AF, Vermaas JV, Vuong VQ, Yin J, Yoo S, Zahran M, Zanetti-Polzi L (2020) Supercomputer-based ensemble docking drug discovery pipeline with application to Covid-19. J Chem Inf Model 60(12):5832\u20135852. https:\/\/doi.org\/10.1021\/acs.jcim.0c01010","journal-title":"J Chem Inf Model"},{"issue":"6","key":"719_CR6","doi-asserted-by":"publisher","first-page":"3058","DOI":"10.1021\/acs.jcim.1c00449","volume":"61","author":"E Cho","year":"2021","unstructured":"Cho E, Rosa M, Anjum R, Mehmood S, Soban M, Mujtaba M, Bux K, Moin ST, Tanweer M, Dantu S, Pandini A, Yin J, Ma H, Ramanathan A, Islam B, Mey ASJS, Bhowmik D, Haider S (2021) Dynamic profiling of $$\\beta$$-coronavirus 3cl mpro protease ligand-binding sites. J Chem Inf Model 61(6):3058\u20133073. https:\/\/doi.org\/10.1021\/acs.jcim.1c00449","journal-title":"J Chem Inf Model"},{"key":"719_CR7","doi-asserted-by":"publisher","DOI":"10.1109\/BigData52589.2021.9671323","author":"SH Chen","year":"2021","unstructured":"Chen SH, Todd Young M, Gounley J, Stanley C, Bhowmik D (2021) How distinct structural flexibility within sars-cov-2 spike protein reveals potential therapeutic targets. IEEE. 
https:\/\/doi.org\/10.1109\/BigData52589.2021.9671323","journal-title":"IEEE"},{"issue":"S18","key":"719_CR8","doi-asserted-by":"publisher","first-page":"484","DOI":"10.1186\/s12859-018-2507-5","volume":"19","author":"D Bhowmik","year":"2018","unstructured":"Bhowmik D, Gao S, Young MT, Ramanathan A (2018) Deep clustering of protein folding simulations. BMC Bioinf 19(S18):484","journal-title":"BMC Bioinf"},{"issue":"18","key":"719_CR9","doi-asserted-by":"publisher","first-page":"10520","DOI":"10.1021\/acs.chemrev.8b00728","volume":"119","author":"X Yang","year":"2019","unstructured":"Yang X, Wang Y, Byrne R, Schneider G, Yang S (2019) Concepts of artificial intelligence for computer-assisted drug discovery. Chem Rev 119(18):10520\u201310594. https:\/\/doi.org\/10.1021\/acs.chemrev.8b00728","journal-title":"Chem Rev"},{"key":"719_CR10","unstructured":"Enamine REAL Database. https:\/\/enamine.net\/compound-collections\/real-compounds\/real-database. Accessed: 2020-04-01 through https:\/\/virtual-flow.org\/"},{"issue":"6","key":"719_CR11","doi-asserted-by":"publisher","first-page":"1686","DOI":"10.1021\/ci300124c","volume":"52","author":"IF Martins","year":"2012","unstructured":"Martins IF, Teixeira AL, Pinheiro L, Falcao AO (2012) A Bayesian approach to in Silico blood-brain barrier penetration modeling. J Chem Inf Model 52(6):1686\u20131697. https:\/\/doi.org\/10.1021\/ci300124c","journal-title":"J Chem Inf Model"},{"issue":"10","key":"719_CR12","doi-asserted-by":"publisher","first-page":"1936","DOI":"10.1021\/acs.jcim.6b00290","volume":"56","author":"G Subramanian","year":"2016","unstructured":"Subramanian G, Ramsundar B, Pande V, Denny RA (2016) Computational modeling of $$\\beta$$-secretase 1 (BACE-1) inhibitors using ligand based approaches. J Chem Inf Model 56(10):1936\u20131949. https:\/\/doi.org\/10.1021\/acs.jcim.6b00290","journal-title":"J Chem Inf Model"},{"key":"719_CR13","unstructured":"RDKit: Open-source cheminformatics. 
http:\/\/www.rdkit.org"},{"key":"719_CR14","doi-asserted-by":"publisher","DOI":"10.1177\/10943420211010930","author":"SA Jacobs","year":"2021","unstructured":"Jacobs SA, Moon T, McLoughlin K, Jones D, Hysom D, Ahn DH, Gyllenhaal J, Watson P, Lightstone FC, Allen JE, Karlin I, Van Essen B (2021) Enabling rapid COVID-19 small molecule drug design through scalable deep learning of generative models. Int J High Perform Comput Appl. https:\/\/doi.org\/10.1177\/10943420211010930","journal-title":"Int J High Perform Comput Appl"},{"issue":"1","key":"719_CR15","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1186\/s13321-021-00494-3","volume":"13","author":"AE Blanchard","year":"2021","unstructured":"Blanchard AE, Stanley C, Bhowmik D (2021) Using GANs with adaptive training data to search for new molecules. J Cheminform 13(1):4\u201311. https:\/\/doi.org\/10.1186\/s13321-021-00494-3","journal-title":"J Cheminform"},{"key":"719_CR16","unstructured":"De\u00a0Cao N, Kipf T (2018) MolGAN: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models"},{"key":"719_CR17","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-44874-8","volume-title":"Introduction to evolutionary computing","author":"AE Eiben","year":"2015","unstructured":"Eiben AE, Smith JE (2015) Introduction to evolutionary computing, 2nd edn. Springer, Berlin","edition":"2"},{"issue":"19","key":"719_CR18","doi-asserted-by":"publisher","first-page":"7296","DOI":"10.1021\/ja401184g","volume":"135","author":"AM Virshup","year":"2013","unstructured":"Virshup AM, Contreras-Garc\u00eda J, Wipf P, Yang W, Beratan DN (2013) Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J Am Chem Soc 135(19):7296\u20137303. 
https:\/\/doi.org\/10.1021\/ja401184g","journal-title":"J Am Chem Soc"},{"issue":"12","key":"719_CR19","doi-asserted-by":"publisher","first-page":"3567","DOI":"10.1039\/c8sc05372c","volume":"10","author":"JH Jensen","year":"2019","unstructured":"Jensen JH (2019) A graph-based genetic algorithm and generative model\/Monte Carlo tree search for the exploration of chemical space. Chem Sci 10(12):3567\u20133572. https:\/\/doi.org\/10.1039\/c8sc05372c","journal-title":"Chem Sci"},{"issue":"3","key":"719_CR20","doi-asserted-by":"publisher","first-page":"1079","DOI":"10.1021\/ci034290p","volume":"44","author":"N Brown","year":"2004","unstructured":"Brown N, McKay B, Gilardoni F, Gasteiger J (2004) A graph-based genetic algorithm and its application to the multiobjective evolution of median molecules. J Chem Inform Comput Sci 44(3):1079\u20131087. https:\/\/doi.org\/10.1021\/ci034290p","journal-title":"J Chem Inform Comput Sci"},{"issue":"3","key":"719_CR21","doi-asserted-by":"publisher","first-page":"1096","DOI":"10.1021\/acs.jcim.8b00839","volume":"59","author":"N Brown","year":"2019","unstructured":"Brown N, Fiscato M, Segler MHS, Vaucher AC (2019) GuacaMol: benchmarking models for de novo molecular design. J Chem Inform Model 59(3):1096\u20131108. https:\/\/doi.org\/10.1021\/acs.jcim.8b00839","journal-title":"J Chem Inform Model"},{"issue":"2","key":"719_CR22","doi-asserted-by":"publisher","first-page":"545","DOI":"10.1021\/ci050369d","volume":"46","author":"EW Lameijer","year":"2006","unstructured":"Lameijer EW, Kok JN, B\u00e4ck T, Ijzerman AP (2006) The molecule evoluator. An interactive evolutionary algorithm for the design of drug-like molecules. J Chem Inform Model 46(2):545\u2013552. 
https:\/\/doi.org\/10.1021\/ci050369d","journal-title":"J Chem Inform Model"},{"issue":"2","key":"719_CR23","doi-asserted-by":"publisher","first-page":"295","DOI":"10.1021\/ci800308h","volume":"49","author":"CA Nicolaou","year":"2009","unstructured":"Nicolaou CA, Apostolakis J, Pattichis CS (2009) De novo drug design using multiobjective evolutionary graphs. J Chem Inform Model 49(2):295\u2013307. https:\/\/doi.org\/10.1021\/ci800308h","journal-title":"J Chem Inform Model"},{"issue":"2","key":"719_CR24","doi-asserted-by":"publisher","first-page":"553","DOI":"10.1021\/ci050370c","volume":"46","author":"EW Lameijer","year":"2006","unstructured":"Lameijer EW, Kok JN, Back T, Ijzerman AP (2006) Mining a chemical database for fragment co-occurrence: discovery of \u201cchemical clich\u00e9s\u2019\u2019. J Chem Inform Model 46(2):553\u2013562. https:\/\/doi.org\/10.1021\/ci050370c","journal-title":"J Chem Inform Model"},{"issue":"5","key":"719_CR25","doi-asserted-by":"publisher","first-page":"487","DOI":"10.1023\/A:1008184403558","volume":"14","author":"G Schneider","year":"2000","unstructured":"Schneider G, Lee ML, Stahl M, Schneider P (2000) De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. J Comput Aided Mol Design 14(5):487\u2013494. https:\/\/doi.org\/10.1023\/A:1008184403558","journal-title":"J Comput Aided Mol Design"},{"issue":"1","key":"719_CR26","doi-asserted-by":"publisher","first-page":"120","DOI":"10.1021\/acscentsci.7b00512","volume":"4","author":"MHS Segler","year":"2018","unstructured":"Segler MHS, Kogej T, Tyrchan C, Waller MP (2018) Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Sci 4(1):120\u2013131. 
https:\/\/doi.org\/10.1021\/acscentsci.7b00512","journal-title":"ACS Central Sci"},{"key":"719_CR27","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13321-018-0323-6","volume":"11","author":"J Ar\u00fas-Pous","year":"2019","unstructured":"Ar\u00fas-Pous J, Johansson SV, Prykhodko O, Bjerrum EJ, Tyrchan C, Reymond J-L, Chen H, Engkvist O (2019) Randomized smiles strings improve the quality of molecular generative models. J Cheminform 11:1","journal-title":"J Cheminform"},{"issue":"1","key":"719_CR28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-022-30839-","volume":"13","author":"D Flam-Shepherd","year":"2022","unstructured":"Flam-Shepherd D, Zhu K, Aspuru-Guzik A (2022) Language models can learn complex molecular distributions. Nat Commun 13(1):1\u201310. https:\/\/doi.org\/10.1038\/s41467-022-30839-","journal-title":"Nat Commun"},{"issue":"4","key":"719_CR29","doi-asserted-by":"publisher","first-page":"1347","DOI":"10.1021\/acs.jcim.8b00902","volume":"59","author":"M Awale","year":"2019","unstructured":"Awale M, Sirockin F, Stiefl N, Reymond J-L (2019) Drug analogs from fragment-based long short-term memory generative neural networks. J Chem Inform Model 59(4):1347\u20131356. https:\/\/doi.org\/10.1021\/acs.jcim.8b00902","journal-title":"J Chem Inform Model"},{"key":"719_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-019-0328-9","volume":"11","author":"S Zheng","year":"2019","unstructured":"Zheng S, Yan X, Gu Q, Yang Y, Du Y, Lu Y, Xu J (2019) Qbmg: quasi-biogenic molecule generator with deep recurrent neural network. 
J Cheminform 11:1","journal-title":"J Cheminform"},{"key":"719_CR31","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1038\/s41467-019-13807-w","volume":"11","author":"O M\u00e9ndez-Lucio","year":"2018","unstructured":"M\u00e9ndez-Lucio O, Baillif B, Clevert D-A, Rouqui\u00e9 D, Wichard JD (2018) De novo generation of hit-like molecules from gene expression signatures using artificial intelligence. Nat Commun 11:10","journal-title":"Nat Commun"},{"key":"719_CR32","unstructured":"Fabian B, Edlich T, Gaspar H, Segler MHS, Meyers J, Fiscato M, Ahmed M (2020) Molecular representation learning with language models and domain-relevant auxiliary tasks. ArXiv abs\/2011.13230"},{"issue":"12","key":"719_CR33","doi-asserted-by":"publisher","first-page":"5804","DOI":"10.1021\/acs.jcim.1c01289","volume":"61","author":"H Kim","year":"2021","unstructured":"Kim H, Na J, Lee WB (2021) Generative chemical transformer: Neural machine learning of molecular geometric structures from chemical language via attention. J Chem Inf Model 61(12):5804\u20135814. https:\/\/doi.org\/10.1021\/acs.jcim.1c01289","journal-title":"J Chem Inf Model"},{"issue":"9","key":"719_CR34","doi-asserted-by":"publisher","first-page":"2064","DOI":"10.1021\/acs.jcim.1c00600","volume":"62","author":"V Bagal","year":"2022","unstructured":"Bagal V, Aggarwal R, Vinod PK, Priyakumar UD (2022) Molgpt: molecular generation using a transformer-decoder model. J Chem Inf Model 62(9):2064\u20132076. https:\/\/doi.org\/10.1021\/acs.jcim.1c00600","journal-title":"J Chem Inf Model"},{"key":"719_CR35","doi-asserted-by":"publisher","DOI":"10.3389\/fphar.2020.565644","author":"D Polykovskiy","year":"2020","unstructured":"Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, Golovanov S, Tatanov O, Belyaev S, Kurbanov R, Artamonov A, Aladinskiy V, Veselov M, Kadurin A, Johansson S, Chen H, Nikolenko S, Aspuru-Guzik A, Zhavoronkov A (2020) Molecular sets (moses): a benchmarking platform for molecular generation models. 
Front Pharmacol. https:\/\/doi.org\/10.3389\/fphar.2020.565644","journal-title":"Front Pharmacol"},{"key":"719_CR36","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1(Mlm), 4171\u20134186. arXiv:1810.04805"},{"key":"719_CR37","doi-asserted-by":"publisher","DOI":"10.1109\/TEVC.2022.3144045","author":"AE Blanchard","year":"2022","unstructured":"Blanchard AE, Chandra Shekar M, Gao S, Gounley J, Lyngaas I, Glaser J, Bhowmik D (2022) Automating genetic algorithm mutations for molecules using a masked language model. IEEE Trans Evolut Comput. https:\/\/doi.org\/10.1109\/TEVC.2022.3144045","journal-title":"IEEE Trans Evolut Comput"},{"key":"719_CR38","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31\u201336. https:\/\/doi.org\/10.1021\/ci00057a005","journal-title":"J Chem Inf Comput Sci"},{"key":"719_CR39","doi-asserted-by":"publisher","unstructured":"Schuster M, Nakajima K (2012) Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149\u20135152. 
https:\/\/doi.org\/10.1109\/ICASSP.2012.6289079","DOI":"10.1109\/ICASSP.2012.6289079"},{"key":"719_CR40","unstructured":"Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser \u0141, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google\u2019s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 1\u201323. arXiv:1609.08144"},{"issue":"2","key":"719_CR41","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1038\/nchem.1243","volume":"4","author":"GR Bickerton","year":"2012","unstructured":"Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL (2012) Quantifying the chemical beauty of drugs. Nat Chem 4(2):90\u201398. https:\/\/doi.org\/10.1038\/nchem.1243","journal-title":"Nat Chem"},{"issue":"1","key":"719_CR42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1758-2946-1-8","volume":"1","author":"P Ertl","year":"2009","unstructured":"Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminf 1(1):1\u201311. https:\/\/doi.org\/10.1186\/1758-2946-1-8","journal-title":"J Cheminf"},{"key":"719_CR43","unstructured":"jglaser\/protein-ligand-mlp-1. https:\/\/huggingface.co\/jglaser\/protein-ligand-mlp-1"},{"key":"719_CR44","doi-asserted-by":"crossref","unstructured":"Aizman A, Maltby G, Breuel T (2019) High performance I\/O for large scale deep learning. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 5965\u20135967. 
IEEE","DOI":"10.1109\/BigData47090.2019.9005703"},{"key":"719_CR45","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/sdata.2014.22","volume":"1","author":"R Ramakrishnan","year":"2014","unstructured":"Ramakrishnan R, Dral PO, Rupp M, Von Lilienfeld OA (2014) Quantum chemistry structures and properties of 134 kilo molecules. Sci Data 1:1\u20137. https:\/\/doi.org\/10.1038\/sdata.2014.22","journal-title":"Sci Data"},{"key":"719_CR46","unstructured":"gdb9 Dataset. http:\/\/deepchem.io.s3-website-us-west-1.amazonaws.com\/datasets\/gdb9.tar.gz. Accessed 28 May 2021"},{"key":"719_CR47","doi-asserted-by":"crossref","unstructured":"Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Scao TL, Gugger S, Drame M, Lhoest Q, Rush AM (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38\u201345. Association for Computational Linguistics, Online. https:\/\/www.aclweb.org\/anthology\/2020.emnlp-demos.6","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"719_CR48","doi-asserted-by":"publisher","unstructured":"Rajbhandari S, Rasley J, Ruwase O, He Y (2020) Zero: Memory optimizations toward training trillion parameter models. International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020-Novem, 1\u201324. https:\/\/doi.org\/10.1109\/SC41405.2020.00024. arXiv:1910.02054","DOI":"10.1109\/SC41405.2020.00024"},{"key":"719_CR49","doi-asserted-by":"publisher","unstructured":"Wang S, Guo Y, Wang Y, Sun H, Huang J (2019) Smiles-Bert: Large scale unsupervised pre-training for molecular property prediction. ACM-BCB 2019 - Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 429\u2013436. 
https:\/\/doi.org\/10.1145\/3307339.3342186","DOI":"10.1145\/3307339.3342186"},{"key":"719_CR50","doi-asserted-by":"publisher","DOI":"10.1101\/2020.12.23.424259","author":"D Xue","year":"2020","unstructured":"Xue D, Zhang H, Xiao D, Gong Y, Chuai G, Sun Y, Tian H, Wu H, Li Y, Liu Q (2020) X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis. bioRxiv. https:\/\/doi.org\/10.1101\/2020.12.23.424259","journal-title":"bioRxiv"},{"issue":"1","key":"719_CR51","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-021-90259-7","volume":"11","author":"H Kim","year":"2021","unstructured":"Kim H, Lee J, Ahn S, Lee JR (2021) A merged molecular representation learning for molecular properties prediction with a web-based service. Sci Rep 11(1):1\u20139. https:\/\/doi.org\/10.1038\/s41598-021-90259-7","journal-title":"Sci Rep"},{"key":"719_CR52","doi-asserted-by":"crossref","unstructured":"G\u00f3mez-Bombarelli R, Wei JN, Duvenaud D, Hern\u00e1ndez-Lobato JM, S\u00e1nchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A (2018) Automatic chemical design using a data-driven continuous representation of molecules. 
ACS Central Sci 4(2):268\u2013276","DOI":"10.1021\/acscentsci.7b00572"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00719-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00719-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00719-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,8]],"date-time":"2023-06-08T14:06:21Z","timestamp":1686233181000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00719-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,8]]},"references-count":52,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["719"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00719-7","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,8]]},"assertion":[{"value":"9 May 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 April 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 June 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interests"}}],"article-number":"59"}}