Abstract
This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse-data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any 'deep' text preprocessing. We evaluate the proposed representation in combination with support vector machines in experiments on the Ling-Spam corpus. Both the binary and the term-frequency representations achieve high precision while maintaining recall at an equally high level, a crucial factor for anti-spam filters, which are a cost-sensitive application.
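As a rough illustration of the two representations named in the abstract, the following Python sketch builds binary and term-frequency 'bag of character n-grams' vectors for a message. It is not the authors' implementation: the function names, the n-gram length n=3, and the use of raw counts as term frequencies are all assumptions made for the example, since the abstract does not state these details. The resulting dictionaries are the kind of feature vectors that a classifier such as a support vector machine would consume.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order.

    n=3 is an illustrative default, not a value taken from the paper.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_vector(text, n=3):
    """Term-frequency representation: n-gram -> number of occurrences.

    Raw counts are an assumption; a normalized frequency would also fit
    the term 'term-frequency'.
    """
    return dict(Counter(char_ngrams(text, n)))


def binary_vector(text, n=3):
    """Binary representation: n-gram -> 1 if it occurs at least once."""
    return {gram: 1 for gram in set(char_ngrams(text, n))}


if __name__ == "__main__":
    message = "free free"
    print(tf_vector(message))
    print(binary_vector(message))
```

No tokenizer or lemmatizer is involved: whitespace and punctuation are simply characters inside the n-grams, which is what makes the representation language-independent.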
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kanaris, I., Kanaris, K., Stamatatos, E. (2006). Spam Detection Using Character N-Grams. In: Antoniou, G., Potamias, G., Spyropoulos, C., Plexousakis, D. (eds) Advances in Artificial Intelligence. SETN 2006. Lecture Notes in Computer Science, vol 3955. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11752912_12
DOI: https://doi.org/10.1007/11752912_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34117-8
Online ISBN: 978-3-540-34118-5
eBook Packages: Computer Science (R0)