A Comparative Impact Study of Attribute Selection Techniques on Naïve Bayes Spam Filters

  • Conference paper
Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects (ICDM 2008)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 5077))

Abstract

The main problem of the Internet e-mail service is the massive delivery of spam messages. Every day, Internet users receive millions of unwanted and unhelpful messages that clutter their mailboxes. Fortunately, there are now different kinds of filters able to automatically identify and delete most of these messages. To reduce the bulk of information these filters must deal with, only distinctive attributes are selected to represent spam and legitimate e-mails. This work presents a comparative study of the performance of five well-known feature selection techniques when they are applied in conjunction with four different types of Naïve Bayes classifier. The results of the experiments show the relevance of choosing an appropriate feature selection technique in order to obtain the most accurate results.
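
As a rough illustration of the pipeline the abstract describes (select a few distinctive attributes, then classify with Naïve Bayes), the sketch below ranks terms by information gain and trains a Bernoulli-style Naïve Bayes model on the selected terms only. The abstract does not name the five selection techniques or the four classifier variants, so information gain and the Bernoulli model are assumptions chosen for illustration, and the toy messages and all function names are hypothetical.

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a two-class split given the class counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    """Drop in class entropy from knowing whether `term` occurs in a message."""
    n = len(docs)
    spam_total = sum(labels)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    conditional = 0.0
    for part in (with_term, without_term):
        if part:
            spam = sum(part)
            conditional += len(part) / n * entropy(spam, len(part) - spam)
    return entropy(spam_total, n - spam_total) - conditional

def select_terms(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]

def train_bernoulli_nb(docs, labels, terms):
    """Class priors and Laplace-smoothed P(term present | class)."""
    model = {}
    for cls in (0, 1):
        cls_docs = [d for d, y in zip(docs, labels) if y == cls]
        prior = len(cls_docs) / len(docs)
        probs = {t: (sum(t in d for d in cls_docs) + 1) / (len(cls_docs) + 2)
                 for t in terms}
        model[cls] = (prior, probs)
    return model

def predict(model, doc):
    """Return the class (1 = spam, 0 = legitimate) with the highest log score."""
    scores = {}
    for cls, (prior, probs) in model.items():
        score = math.log(prior)
        for term, p in probs.items():
            score += math.log(p if term in doc else 1 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus: each message is a set of tokens; 1 = spam, 0 = legitimate.
docs = [
    {"free", "money", "now"},
    {"free", "offer", "click"},
    {"meeting", "agenda", "now"},
    {"project", "agenda", "report"},
]
labels = [1, 1, 0, 0]

selected = select_terms(docs, labels, k=2)
model = train_bernoulli_nb(docs, labels, selected)
```

On this toy corpus the classifier sees only two of the nine distinct tokens, which is the kind of reduction the abstract refers to. A different selection criterion would simply replace `information_gain` as the ranking key.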

Editor information

Petra Perner

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Méndez, J.R., Cid, I., Glez-Peña, D., Rocha, M., Fdez-Riverola, F. (2008). A Comparative Impact Study of Attribute Selection Techniques on Naïve Bayes Spam Filters. In: Perner, P. (ed.) Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects. ICDM 2008. Lecture Notes in Computer Science, vol 5077. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70720-2_17

  • DOI: https://doi.org/10.1007/978-3-540-70720-2_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70717-2

  • Online ISBN: 978-3-540-70720-2

  • eBook Packages: Computer Science, Computer Science (R0)
