Abstract
Protein homology detection is a key problem in computational biology. This paper presents a novel building block for proteins, the N-nary profile, which captures the evolutionary information contained in protein sequence frequency profiles. The frequency profiles, calculated from the multiple sequence alignments output by PSI-BLAST, are converted into N-nary profiles, which are then filtered by a chi-square feature selection algorithm. Each protein sequence is transformed into a fixed-dimension feature vector by counting the occurrences of each N-nary profile, and the resulting vectors are used as input to a support vector machine (SVM). Latent semantic analysis (LSA), an efficient feature extraction technique, is adopted to further improve performance. When tested on the SCOP 1.53 data set, the N-nary profile method outperforms all compared methods for protein remote homology detection: its ROC50 score is 0.736, nearly 4 percent higher than that of the current best method.
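The feature-construction steps named in the abstract (profile conversion, occurrence counting, chi-square filtering) can be illustrated with a minimal Python sketch. The abstract does not define how a frequency profile becomes an N-nary profile, so the equal-width quantization in `to_nnary` is an assumption made only for illustration, as are all function names, the toy two-letter alphabet, and the presence/absence form of the chi-square statistic; the paper's actual procedure may differ. The SVM and LSA stages are omitted.

```python
from collections import Counter

def to_nnary(column, n_levels=3):
    """Quantize one profile column (frequencies in [0, 1]) into integer levels.
    ASSUMPTION: equal-width bins; the paper's conversion rule may differ."""
    return tuple(min(int(f * n_levels), n_levels - 1) for f in column)

def profile_to_counts(profile, n_levels=3):
    """Count how often each N-nary profile occurs in one sequence's profile."""
    return Counter(to_nnary(col, n_levels) for col in profile)

def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 presence/absence table.
    a: positives containing the feature, b: negatives containing it,
    c: positives lacking it,            d: negatives lacking it."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def select_features(count_vectors, labels, top_k):
    """Return the top_k N-nary profiles ranked by chi-square score."""
    features = set().union(*count_vectors)
    scored = []
    for feat in features:
        a = sum(1 for cv, y in zip(count_vectors, labels) if y and feat in cv)
        b = sum(1 for cv, y in zip(count_vectors, labels) if not y and feat in cv)
        c = sum(1 for y in labels if y) - a
        d = sum(1 for y in labels if not y) - b
        scored.append((chi_square(a, b, c, d), feat))
    # Highest score first; ties broken by the feature tuple for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [feat for _, feat in scored[:top_k]]

def vectorize(counts, vocabulary):
    """Fixed-dimension vector of occurrence counts over the selected features."""
    return [counts[feat] for feat in vocabulary]

# Toy data: each profile is a list of columns over a hypothetical 2-letter alphabet.
profiles = [
    [(0.9, 0.1), (0.9, 0.1), (0.5, 0.5)],  # positive example
    [(0.9, 0.1), (0.2, 0.8)],              # positive example
    [(0.1, 0.9), (0.5, 0.5)],              # negative example
    [(0.1, 0.9), (0.1, 0.9)],              # negative example
]
labels = [1, 1, 0, 0]
counts = [profile_to_counts(p) for p in profiles]
vocabulary = select_features(counts, labels, top_k=2)
vectors = [vectorize(c, vocabulary) for c in counts]
```

The resulting `vectors` all have the same length regardless of sequence length, which is what allows them to be passed to a standard classifier such as an SVM.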
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, B., Lin, L., Wang, X., Dong, Q., Wang, X. (2008). A Discriminative Method for Protein Remote Homology Detection Based on N-nary Profiles. In: Elloumi, M., Küng, J., Linial, M., Murphy, R.F., Schneider, K., Toma, C. (eds) Bioinformatics Research and Development. BIRD 2008. Communications in Computer and Information Science, vol 13. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70600-7_6
DOI: https://doi.org/10.1007/978-3-540-70600-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70598-7
Online ISBN: 978-3-540-70600-7