Abstract
The main problem of the Internet e-mail service is the massive delivery of spam messages. Every day, Internet users receive millions of unwanted and unhelpful messages that clutter their mailboxes. Fortunately, there are now different kinds of filters able to automatically identify and delete most of these messages. In order to reduce the amount of information these filters must deal with, only distinctive attributes are selected to represent spam and legitimate e-mails. This work presents a comparative study of the performance of five well-known feature selection techniques when they are applied in conjunction with four different types of Naïve Bayes classifier. The experimental results show the relevance of choosing an appropriate feature selection technique in order to obtain the most accurate results.
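As a rough illustration of the kind of pipeline the abstract describes, the sketch below ranks terms by information gain, keeps the top k, and trains a Bernoulli Naïve Bayes classifier on them. The abstract does not name the five selection techniques or the four classifier variants studied, so the choice of information gain and of the Bernoulli event model is an assumption made here for illustration only. The toy messages and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs."""
    classes = sorted(set(labels))

    def class_counts(subset):
        tally = Counter(subset)
        return [tally[c] for c in classes]

    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    gain = entropy(class_counts(labels))
    for part in (present, absent):
        if part:
            gain -= len(part) / len(labels) * entropy(class_counts(part))
    return gain

def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    vocab.sort(key=lambda t: information_gain(docs, labels, t), reverse=True)
    return vocab[:k]

def train_bernoulli_nb(docs, labels, features):
    """Estimate P(class) and Laplace-smoothed P(term present | class)."""
    model = {}
    for c in set(labels):
        members = [d for d, y in zip(docs, labels) if y == c]
        prior = len(members) / len(docs)
        cond = {t: (sum(t in d for d in members) + 1) / (len(members) + 2)
                for t in features}
        model[c] = (prior, cond)
    return model

def classify(model, doc):
    """Pick the class with the highest log posterior for a set of terms."""
    def log_score(c):
        prior, cond = model[c]
        return math.log(prior) + sum(
            math.log(p if t in doc else 1 - p) for t, p in cond.items())
    return max(model, key=log_score)

# Hypothetical toy corpus: each message is a set of tokens.
docs = [
    {"cheap", "pills", "offer"},
    {"cheap", "offer", "now"},
    {"meeting", "agenda", "now"},
    {"project", "meeting", "notes"},
]
labels = ["spam", "spam", "ham", "ham"]

selected = select_features(docs, labels, k=3)
model = train_bernoulli_nb(docs, labels, selected)
print(selected)
print(classify(model, {"cheap", "offer", "today"}))
print(classify(model, {"meeting", "notes"}))
```

Only the selected terms enter the classifier, so a term such as "now", which appears equally often in both classes, is discarded before training. Swapping `information_gain` for another scoring function changes which attributes survive, and the study compares the effect of exactly that choice.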
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Méndez, J.R., Cid, I., Glez-Peña, D., Rocha, M., Fdez-Riverola, F. (2008). A Comparative Impact Study of Attribute Selection Techniques on Naïve Bayes Spam Filters. In: Perner, P. (ed.) Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects. ICDM 2008. Lecture Notes in Computer Science, vol 5077. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70720-2_17
DOI: https://doi.org/10.1007/978-3-540-70720-2_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70717-2
Online ISBN: 978-3-540-70720-2
eBook Packages: Computer Science (R0)