Abstract
In recent years, the development of computational techniques that build models to correctly assign chemical compounds to various classes, or to retrieve potential drug-like compounds, has been an active area of research. Many of the best-performing techniques for these tasks use a descriptor-based representation of the compound that captures various aspects of the topology of the underlying molecular graph. In this paper we compare five different sets of descriptors that are currently used for chemical compound classification. We also introduce four different descriptors derived from all connected fragments present in the molecular graphs. Their primary purpose is to allow comparison with the currently used descriptor spaces and to analyze which properties of a descriptor space help it represent molecular graphs effectively. In addition, we introduce an extension to existing vector-based kernel functions that takes into account the length of the fragments present in the descriptors. We experimentally evaluate the performance of the previously introduced and the new descriptors in the context of SVM-based classification and ranked retrieval on 28 classification and retrieval problems derived from 18 datasets. Our experiments show that, for both of these tasks, two of the four descriptors introduced in this paper, along with the extended connectivity fingerprint-based descriptors, consistently and statistically significantly outperform previously developed schemes based on the widely used fingerprint- and MACCS keys-based descriptors, as well as recently introduced descriptors obtained by mining and analyzing the structure of the molecular graphs.
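The abstract does not give the form of the length-aware kernel extension, so the following is only a minimal sketch of one way such an extension could look, not the authors' formulation. It assumes that a descriptor is a sparse count vector keyed by a hypothetical `(length, fragment_id)` pair, that the base vector kernel is the Tanimoto coefficient, and that the extension averages the base kernel computed separately for each fragment length up to an assumed `max_len`. All function and parameter names are illustrative.

```python
from collections import defaultdict


def tanimoto(x, y):
    """Tanimoto similarity between two sparse count vectors stored as dicts."""
    dot = sum(v * y.get(k, 0) for k, v in x.items())
    norm_x = sum(v * v for v in x.values())
    norm_y = sum(v * v for v in y.values())
    denom = norm_x + norm_y - dot
    # Two empty vectors have no overlap to measure; treat them as dissimilar.
    return dot / denom if denom else 0.0


def split_by_length(desc):
    """Group a descriptor by fragment length.

    `desc` maps (length, fragment_id) -> count; the key layout is an
    assumption made for this sketch.
    """
    groups = defaultdict(dict)
    for (length, frag), count in desc.items():
        groups[length][frag] = count
    return groups


def length_aware_kernel(x, y, max_len, base=tanimoto):
    """Average the base kernel over fragment lengths 1..max_len (assumed scheme)."""
    gx, gy = split_by_length(x), split_by_length(y)
    total = sum(base(gx.get(l, {}), gy.get(l, {})) for l in range(1, max_len + 1))
    return total / max_len


# Two toy descriptors: identical length-1 fragments, disjoint length-2 fragments.
a = {(1, "C"): 2, (2, "C-O"): 1}
b = {(1, "C"): 2, (2, "C-N"): 1}
flat = tanimoto(a, b)                           # all fragments pooled together
by_length = length_aware_kernel(a, b, max_len=2)  # each length weighted equally
```

In this toy case the pooled similarity is dominated by the frequent short fragment, while the per-length average gives the disagreement among the longer fragments equal weight. That is the kind of effect a length-aware kernel is meant to expose.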
Cite this article
Wale, N., Watson, I.A. & Karypis, G. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst 14, 347–375 (2008). https://doi.org/10.1007/s10115-007-0103-5
DOI: https://doi.org/10.1007/s10115-007-0103-5