Abstract
Amino acid substitutions in HIV-1 proteins critical to the viral replication cycle can undermine successful inhibition of those targets, with some mutations leading to either reduced susceptibility to certain medications or complete drug resistance. Phenotypic tests are best suited to quantify the effects of complex mutational patterns on drug resistance; however, the relatively high cost and long turnaround time associated with phenotyping have increased the demand for in silico drug-specific models capable of accurately predicting phenotype directly from the target protein sequences. This study focuses on the HIV-1 integrase (IN) enzyme, which mediates integration of reverse-transcribed viral DNA into the host cell genome, and on the development of predictive statistical learning models of resistance to the IN inhibitors Raltegravir (RAL) and Elvitegravir (EVG). Models were trained using datasets of IN protein sequence variants, each having a known phenotype obtained from an experimental assay and quantified as the fold change in susceptibility to the respective inhibitor. A sequence-based approach employing n-gram relative frequencies was implemented to uniquely characterize each IN variant as a feature vector of input attributes. Models for classifying IN variants as susceptible or resistant reach cross-validation balanced accuracy rates of 89% for RAL and 85% for EVG. Additionally, regression models achieve Pearson's correlation coefficients between experimental and predicted log-transformed phenotypic fold change values as high as r = 0.80 for RAL and r = 0.76 for EVG. Our results suggest that, as additional training data are made publicly available, the models may hold promise as supplementary tools for making treatment decisions.
Keywords: Drug resistance, genotype-phenotype correlations, HIV-1 integrase, n-grams, regression, statistical learning models, supervised classification.
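As an illustration of the n-gram representation mentioned in the abstract, the following minimal sketch shows one way a protein sequence could be turned into a fixed-length vector of n-gram relative frequencies. It is not the authors' implementation: the function name `ngram_relative_frequencies`, the choice n = 2, the 20-letter amino acid alphabet, and the toy sequence are all assumptions made for the example, since the abstract does not specify them.

```python
from collections import Counter
from itertools import product

# Assumption: the standard 20 amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_relative_frequencies(sequence, n=2, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of n-gram relative frequencies.

    Each entry is the count of one possible n-gram divided by the total
    number of overlapping n-grams in the sequence. n-grams containing
    characters outside `alphabet` still count toward the total but have
    no entry in the vector.
    """
    total = len(sequence) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than n")
    # Count every overlapping window of length n.
    counts = Counter(sequence[i:i + n] for i in range(total))
    # Enumerate all len(alphabet)**n possible n-grams in a fixed order,
    # so that every sequence maps to a vector with the same layout.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=n)]
    return [counts[gram] / total for gram in vocabulary]


# Toy example (hypothetical sequence, not real integrase data).
features = ngram_relative_frequencies("MKMK", n=2)
print(len(features))  # number of possible 2-grams over 20 amino acids
```

Because the vocabulary is enumerated in a fixed order, vectors computed for different variants are directly comparable and can be passed as input attributes to a classifier or regression model.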