Abstract
In drug discovery, the octanol/water partition and distribution coefficients, logP and logD, are widely used as metrics of the lipophilicity of molecules, which in turn has a strong influence on the bioactivity and bioavailability of potential drugs. There are a variety of established methods, mostly fragment- or atom-based, to calculate logP, while logD prediction generally relies on calculated logP and pKa to estimate the neutral and ionized populations at a given pH. Algorithms such as ClogP have limitations that generally lead to systematic errors for chemically related molecules, while pKa estimation is generally more difficult because of the interplay of electronic, inductive and conjugation effects for ionizable moieties. We propose an integrated machine learning QSAR modeling approach to predict logD by training the model with experimental data while using ClogP and pKa predicted by commercial software as model descriptors. By optimizing the loss function for the ClogD calculated by the software, we build a correction model that incorporates both descriptors from the software and available experimental logD data. Additionally, we calculate logP from the logD model using the software-predicted pKa values. Here, we have trained models using publicly or commercially available logD data to show that this approach can improve on commercial software predictions of lipophilicity. When applied to other logD data sets, this approach extends the domain of applicability of logD and logP predictions beyond that of the commercial software. The performance of these models compares favorably with that of models built with a larger set of proprietary logD data.
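The dependence of logD on logP and pKa mentioned above can be illustrated with the standard textbook relation for a monoprotic compound. The sketch below assumes that only the neutral species partitions into octanol and that the compound has a single ionizable site; it is not the correction model proposed in this work, and the function names `logd_from_logp` and `logp_from_logd` are hypothetical.

```python
import math


def logd_from_logp(logp, pka, ph=7.4, kind="acid"):
    """Estimate logD for a monoprotic compound from logP and pKa.

    Assumption: only the neutral form partitions into octanol.
    For a base, `pka` is the pKa of the conjugate acid.
    """
    if kind == "acid":
        exponent = ph - pka  # ionized fraction grows as pH rises above pKa
    elif kind == "base":
        exponent = pka - ph  # ionized fraction grows as pH drops below pKa
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    # log10(1 + 10**exponent) is the penalty for the ionized population
    return logp - math.log10(1.0 + 10.0 ** exponent)


def logp_from_logd(logd, pka, ph=7.4, kind="acid"):
    """Invert the same relation to recover logP from logD and pKa."""
    # With logP = 0 the forward function returns only the ionization penalty
    ionization_penalty = logd_from_logp(0.0, pka, ph, kind)
    return logd - ionization_penalty
```

Under these assumptions, an acid with logP 3.0 and pKa 4.4 is almost fully ionized at pH 7.4, so its estimated logD is close to 0, while a neutral-dominated compound keeps logD close to logP. Errors in either the calculated logP or the predicted pKa propagate directly into the logD estimate, which is the motivation for training a correction against experimental logD.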
Data availability
The ChEMBL logD data, as described in the methods section, are included in the supplemental material, along with predictions. We also provide as supplemental material a detailed list of the descriptors used in model training and an analysis of the property space for the three main sets used in this work. Due to licensing limitations, the BioByte logD data sets are not available in this publication. Moka calculations for the various sets are subject to licensing limitations and are shown for only two examples in the results section. Genentech’s internal logD data sets are not available for publication.
Acknowledgements
At Genentech: Huy Nguyen, Fabio Broccatelli, and Hao Zheng.
About this article
Cite this article
Aliagas, I., Gobbi, A., Lee, ML. et al. Comparison of logP and logD correction models trained with public and proprietary data sets. J Comput Aided Mol Des 36, 253–262 (2022). https://doi.org/10.1007/s10822-022-00450-9
DOI: https://doi.org/10.1007/s10822-022-00450-9