Abstract
Determining protein sequence similarity is an important task for protein classification and homology detection. This is typically done with sequence alignment algorithms, yet fast and accurate alignment-free, kernel-based classifiers also exist. Viewing sequences as a “bag of words”, we test a simple weighted string kernel, investigating the effects of k-mer length, sequence length, and choice of weighting. We also extend the kernel to operate on the k-mer frequency representation of a sequence rather than the “bag of words” representation.
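To make the two representations concrete, the following Python sketch computes a weighted k-mer kernel on either raw k-mer counts (the “bag of words” view) or k-mer frequencies. It is only an illustration under stated assumptions: the abstract does not specify the weighting scheme, so the per-k-mer `weights` mapping, the function names, and the cosine-style normalization are all hypothetical choices rather than the authors' method.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (the "bag of words" view)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(seq, k):
    """Divide each k-mer count by the total number of k-mers in seq."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def weighted_kernel(x, y, weights=None):
    """Weighted inner product of two k-mer feature maps.

    `weights` is a hypothetical mapping from k-mer to a non-negative
    weight; k-mers missing from it default to a weight of 1.0.
    """
    shared = x.keys() & y.keys()  # only shared k-mers contribute
    if weights is None:
        return sum(x[m] * y[m] for m in shared)
    return sum(weights.get(m, 1.0) * x[m] * y[m] for m in shared)


def normalized_kernel(x, y, weights=None):
    """Cosine-style normalization (an assumption, not from the paper)."""
    kxx = weighted_kernel(x, x, weights)
    kyy = weighted_kernel(y, y, weights)
    if kxx == 0 or kyy == 0:
        return 0.0
    return weighted_kernel(x, y, weights) / sqrt(kxx * kyy)
```

Passing the output of `kmer_counts` gives the count-based kernel, while passing the output of `kmer_frequencies` gives the frequency-based variant; the frequency form removes the direct dependence of the kernel value on sequence length.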
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Spalding, J.D., Hoyle, D.C. (2005). Accuracy of String Kernels for Protein Sequence Classification. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_49
DOI: https://doi.org/10.1007/11551188_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28757-5
Online ISBN: 978-3-540-28758-2
eBook Packages: Computer Science, Computer Science (R0)