{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,19]],"date-time":"2024-09-19T16:29:52Z","timestamp":1726763392775},"reference-count":33,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,5,21]],"date-time":"2023-05-21T00:00:00Z","timestamp":1684627200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,5,21]],"date-time":"2023-05-21T00:00:00Z","timestamp":1684627200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Science Foundation","award":["1420620"]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"Accurate prediction of molecular properties is essential in the screening and development of drug molecules and other functional materials. Traditionally, property-specific molecular descriptors are used in machine learning models. This in turn requires the identification and development of target or problem-specific descriptors. Additionally, an increase in the prediction accuracy of the model is not always feasible from the standpoint of targeted descriptor usage. We explored the accuracy and generalizability issues using a framework of Shannon entropies, based on SMILES, SMARTS and\/or InChiKey strings of respective molecules. Using various public databases of molecules, we showed that the accuracy of the prediction of machine learning models could be significantly enhanced simply by using Shannon entropy-based descriptors evaluated directly from SMILES. 
Analogous to partial pressures and total pressure of gases in a mixture, we used atom-wise fractional Shannon entropy in combination with total Shannon entropy from respective tokens of the string representation to model the molecule efficiently. The proposed descriptor was competitive in performance with standard descriptors such as Morgan fingerprints and SHED in regression models. Additionally, we found that either a hybrid descriptor set containing the Shannon entropy-based descriptors or an optimized, ensemble architecture of multilayer perceptrons and graph neural networks using the Shannon entropies was synergistic to improve the prediction accuracy. This simple approach of coupling the Shannon entropy framework to other standard descriptors and\/or using it in ensemble models could find applications in boosting the performance of molecular property predictions in chemistry and material science.","DOI":"10.1186\/s13321-023-00712-0","type":"journal-article","created":{"date-parts":[[2023,5,21]],"date-time":"2023-05-21T09:02:03Z","timestamp":1684659723000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Harnessing Shannon entropy-based descriptors in machine learning models to enhance the prediction accuracy of molecular properties"],"prefix":"10.1186","volume":"15","author":[{"given":"Rajarshi","family":"Guha","sequence":"first","affiliation":[]},{"given":"Darrell","family":"Velegol","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,5,21]]},"reference":[{"key":"712_CR1","doi-asserted-by":"publisher","first-page":"123836","DOI":"10.1016\/j.fuel.2022.123836","volume":"321","author":"AE Comesana","year":"2022","unstructured":"Comesana AE, Huntington TT, Scown CD, Niemeyer KE, Rapp VH (2022) A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties. 
Fuel 321:123836","journal-title":"Fuel"},{"key":"712_CR2","doi-asserted-by":"publisher","first-page":"e26870","DOI":"10.1002\/qua.26870","volume":"122","author":"S Raghunathan","year":"2022","unstructured":"Raghunathan S, Priyakumar UD (2022) Molecular representations for machine learning applications in chemistry. Int J Quantum Chem 122:e26870","journal-title":"Int J Quantum Chem"},{"key":"712_CR3","doi-asserted-by":"publisher","first-page":"4538","DOI":"10.1016\/j.csbj.2021.08.011","volume":"19","author":"P Carracedo-Reboredo","year":"2021","unstructured":"Carracedo-Reboredo P, Li\u00f1ares-Blanco J, Rodr\u00edguez-Fern\u00e1ndez N, Cedr\u00f3n F, Novoa FJ, Carballal A et al (2021) A review on machine learning approaches and trends in drug discovery. Comput Struct Biotechnol J 19:4538\u20134558","journal-title":"Comput Struct Biotechnol J"},{"key":"712_CR4","first-page":"102713","volume":"108","author":"L Lv","year":"2022","unstructured":"Lv L, Chen T, Dou J, Plaza A (2022) A hybrid ensemble-based deep-learning framework for landslide susceptibility mapping. Int J Appl Earth Obs Geoinf 108:102713","journal-title":"Int J Appl Earth Obs Geoinf"},{"key":"712_CR5","doi-asserted-by":"publisher","first-page":"551","DOI":"10.1007\/s13042-021-01442-1","volume":"13","author":"M Sabzevari","year":"2022","unstructured":"Sabzevari M, Mart\u00ednez-Mu\u00f1oz G, Su\u00e1rez A (2022) Building heterogeneous ensembles by pooling homogeneous ensembles. Int J Mach Learn Cybern 13:551\u2013558","journal-title":"Int J Mach Learn Cybern"},{"key":"712_CR6","doi-asserted-by":"publisher","first-page":"2294","DOI":"10.1021\/ci7004687","volume":"48","author":"RW Homer","year":"2008","unstructured":"Homer RW, Swanson J, Jilek RJ, Hurst T, Clark RD (2008) SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries. 
J Chem Inf Model 48:2294\u20132307","journal-title":"J Chem Inf Model"},{"key":"712_CR7","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1186\/s13321-018-0295-6","volume":"10","author":"N Kochev","year":"2018","unstructured":"Kochev N, Avramova S, Jeliazkova N (2018) Ambit-SMIRKS: a software module for reaction representation, reaction search and structure transformation. J Cheminform 10:42","journal-title":"J Cheminform"},{"key":"712_CR8","doi-asserted-by":"publisher","first-page":"45024","DOI":"10.1088\/2632-2153\/aba947","volume":"1","author":"M Krenn","year":"2020","unstructured":"Krenn M, H\u00e4se F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach Learn Sci Technol 1:45024","journal-title":"Mach Learn Sci Technol"},{"key":"712_CR9","doi-asserted-by":"publisher","first-page":"88","DOI":"10.1186\/s13321-021-00566-4","volume":"13","author":"F Berenger","year":"2021","unstructured":"Berenger F, Tsuda K (2021) Molecular generation by Fast Assembly of (Deep)SMILES fragments. J Cheminform 13:88","journal-title":"J Cheminform"},{"key":"712_CR10","doi-asserted-by":"publisher","first-page":"71","DOI":"10.1186\/s13321-019-0393-0","volume":"11","author":"J Ar\u00fas-Pous","year":"2019","unstructured":"Ar\u00fas-Pous J, Johansson SV, Prykhodko O, Bjerrum EJ, Tyrchan C, Reymond J-L et al (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11:71","journal-title":"J Cheminform"},{"key":"712_CR11","doi-asserted-by":"publisher","first-page":"e1603","DOI":"10.1002\/wcms.1603","volume":"12","author":"DS Wigh","year":"2022","unstructured":"Wigh DS, Goodman JM, Lapkin AA (2022) A review of molecular representation in the age of machine learning. 
WIREs Comput Mol Sci 12:e1603","journal-title":"WIREs Comput Mol Sci"},{"key":"712_CR12","doi-asserted-by":"publisher","first-page":"94","DOI":"10.1016\/j.chemolab.2011.07.008","volume":"109","author":"AA Toropov","year":"2011","unstructured":"Toropov AA, Toropova AP, Martyanov SE, Benfenati E, Gini G, Leszczynska D et al (2011) Comparison of SMILES and molecular graphs as the representation of the molecular structure for QSAR analysis for mutagenic potential of polyaromatic amines. Chemom Intell Lab Syst 109:94\u2013100","journal-title":"Chemom Intell Lab Syst"},{"key":"712_CR13","doi-asserted-by":"publisher","first-page":"143","DOI":"10.1016\/j.aiopen.2021.07.002","volume":"2","author":"R Cartuyvels","year":"2021","unstructured":"Cartuyvels R, Spinks G, Moens M-F (2021) Discrete and continuous representations and processing in deep learning: looking forward. AI Open 2:143\u2013159","journal-title":"AI Open"},{"key":"712_CR14","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbab365","author":"MV Sabando","year":"2021","unstructured":"Sabando MV, Ponzoni I, Milios EE, Soto AJ (2021) Using molecular embeddings in QSAR modeling: does it make a difference? Brief Bioinform. https:\/\/doi.org\/10.1093\/bib\/bbab365","journal-title":"Brief Bioinform"},{"key":"712_CR15","doi-asserted-by":"publisher","first-page":"107533","DOI":"10.1016\/j.compchemeng.2021.107533","volume":"155","author":"V Mann","year":"2021","unstructured":"Mann V, Venkatasubramanian V (2021) Retrosynthesis prediction using grammar-based neural machine translation: an information-theoretic approach. Comput Chem Eng 155:107533","journal-title":"Comput Chem Eng"},{"key":"712_CR16","doi-asserted-by":"publisher","first-page":"1240","DOI":"10.3390\/e23101240","volume":"23","author":"DS Sabirov","year":"2021","unstructured":"Sabirov DS, Shepelevich IS (2021) Information entropy in chemistry: an overview. 
Entropy 23:1240","journal-title":"Entropy"},{"key":"712_CR17","doi-asserted-by":"publisher","first-page":"550","DOI":"10.1021\/ci010243q","volume":"42","author":"FL Stahura","year":"2002","unstructured":"Stahura FL, Godden JW, Bajorath J (2002) Differential Shannon entropy analysis identifies molecular property descriptors that predict aqueous solubility of synthetic compounds with high accuracy in binary QSAR calculations. J Chem Inf Comput Sci 42:550\u2013558","journal-title":"J Chem Inf Comput Sci"},{"key":"712_CR18","doi-asserted-by":"publisher","first-page":"7572","DOI":"10.1063\/1.481366","volume":"112","author":"M H\u1ed3","year":"2000","unstructured":"H\u1ed3 M, Clark BJ, Smith VH, Weaver DF, Gatti C, Sagar RP et al (2000) Shannon information entropies of molecules and functional groups in the self-consistent reaction field. J Chem Phys 112:7572\u20137580","journal-title":"J Chem Phys"},{"key":"712_CR19","doi-asserted-by":"publisher","first-page":"1615","DOI":"10.1021\/ci0600509","volume":"46","author":"E Gregori-Puigjan\u00e9","year":"2006","unstructured":"Gregori-Puigjan\u00e9 E, Mestres J (2006) SHED: Shannon entropy descriptors from topological feature distributions. J Chem Inf Model 46:1615\u20131622","journal-title":"J Chem Inf Model"},{"key":"712_CR20","doi-asserted-by":"publisher","first-page":"1946","DOI":"10.2174\/156802612804910278","volume":"12","author":"R Guha","year":"2012","unstructured":"Guha R, Willighagen E (2012) A survey of quantitative descriptions of molecular structure. Curr Top Med Chem 12:1946\u20131956","journal-title":"Curr Top Med Chem"},{"key":"712_CR21","doi-asserted-by":"publisher","first-page":"1655","DOI":"10.1021\/ci900060x","volume":"49","author":"M Dehmer","year":"2009","unstructured":"Dehmer M, Varmuza K, Borgert S, Emmert-Streib F (2009) On entropy-based molecular descriptors: statistical analysis of real and synthetic chemical structures. 
J Chem Inf Model 49:1655\u20131663","journal-title":"J Chem Inf Model"},{"key":"712_CR22","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1186\/s13321-020-00479-8","volume":"13","author":"D Jiang","year":"2021","unstructured":"Jiang D, Wu Z, Hsieh C-Y, Chen G, Liao B, Wang Z et al (2021) Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminform 13:12","journal-title":"J Cheminform"},{"key":"712_CR23","doi-asserted-by":"publisher","first-page":"1246","DOI":"10.1038\/s42256-022-00581-6","volume":"4","author":"T Janela","year":"2022","unstructured":"Janela T, Bajorath J (2022) Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models. Nat Mach Intell 4:1246\u20131255","journal-title":"Nat Mach Intell"},{"key":"712_CR24","doi-asserted-by":"publisher","first-page":"100588","DOI":"10.1016\/j.patter.2022.100588","volume":"3","author":"M Krenn","year":"2022","unstructured":"Krenn M, Ai Q, Barthel S, Carson N, Frei A, Frey NC et al (2022) SELFIES and the future of molecular string representations. Patterns 3:100588","journal-title":"Patterns"},{"key":"712_CR25","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1021\/c160017a018","volume":"5","author":"HL Morgan","year":"1965","unstructured":"Morgan HL (1965) The generation of a unique machine description for chemical structures\u2014a technique developed at chemical abstracts service. J Chem Doc 5:107\u2013113","journal-title":"J Chem Doc"},{"key":"712_CR26","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1186\/1758-2946-3-3","volume":"3","author":"G Hinselmann","year":"2011","unstructured":"Hinselmann G, Rosenbaum L, Jahn A, Fechner N, Zell A (2011) jCompoundMapper: an open source Java library and command-line tool for chemical fingerprints. 
J Cheminform 3:3","journal-title":"J Cheminform"},{"key":"712_CR27","first-page":"1","volume-title":"Handbook of computational chemistry","author":"A Mauri","year":"2016","unstructured":"Mauri A, Consonni V, Todeschini R (2016) Molecular descriptors. In: Leszczynski J (ed) Handbook of computational chemistry. Springer Netherlands, Dordrecht, pp 1\u201329"},{"key":"712_CR28","doi-asserted-by":"publisher","first-page":"1770","DOI":"10.1021\/acs.jmedchem.1c00613","volume":"65","author":"AK Dilger","year":"2022","unstructured":"Dilger AK, Pabbisetty KB, Corte JR, De Lucca I, Fang T, Yang W et al (2022) Discovery of milvexian, a high-affinity, orally bioavailable inhibitor of factor XIa in clinical studies for antithrombotic therapy. J Med Chem 65:1770\u20131785","journal-title":"J Med Chem"},{"key":"712_CR29","doi-asserted-by":"publisher","first-page":"2077","DOI":"10.1021\/ci900161g","volume":"49","author":"K Hansen","year":"2009","unstructured":"Hansen K, Mika S, Schroeter T, Sutter A, ter Laak A, Steger-Hartmann T et al (2009) Benchmark data set for in silico prediction of ames mutagenicity. J Chem Inf Model 49:2077\u20132081","journal-title":"J Chem Inf Model"},{"key":"712_CR30","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1186\/s13321-018-0321-8","volume":"10","author":"D Probst","year":"2018","unstructured":"Probst D, Reymond J-L (2018) A probabilistic molecular fingerprint for big data settings. J Cheminform 10:66","journal-title":"J Cheminform"},{"key":"712_CR31","unstructured":"Data61 C (2018) StellarGraph machine learning library"},{"key":"712_CR32","doi-asserted-by":"publisher","first-page":"1560","DOI":"10.1021\/acs.jcim.0c01127","volume":"61","author":"X Li","year":"2021","unstructured":"Li X, Fourches D (2021) SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. 
J Chem Inf Model 61:1560\u20131569","journal-title":"J Chem Inf Model"},{"key":"712_CR33","doi-asserted-by":"publisher","first-page":"1523","DOI":"10.1021\/acscentsci.9b00476","volume":"5","author":"T-S Lin","year":"2019","unstructured":"Lin T-S, Coley CW, Mochigase H, Beech HK, Wang W, Wang Z et al (2019) BigSMILES: a structurally-based line notation for describing macromolecules. ACS Cent Sci 5:1523\u20131531","journal-title":"ACS Cent Sci"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00712-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00712-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00712-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,21]],"date-time":"2023-05-21T09:02:10Z","timestamp":1684659730000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00712-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,21]]},"references-count":33,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["712"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00712-0","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,21]]},"assertion":[{"value":"7 December 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 March 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 
May 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"The authors declare no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"54"}}