Abstract
Rapid growth in the volume of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Supervised anti-spam filters using machine-learning methods have been particularly effective in categorizing spam and non-spam messages. These filters automatically integrate pre-processing of spam corpora, selection of appropriate word lists, and calculation of word weights, usually in a bag-of-words fashion. Developing an accurate spam filter is challenging because spammers attempt to decrease the probability of spam detection by using legitimate words. Complex models are therefore needed to solve such a problem. However, existing spam filtering methods usually converge to a poor local minimum, cannot effectively handle high-dimensional data, and suffer from overfitting. To overcome these problems, we propose a novel spam filter integrating N-gram tf.idf feature selection, a modified distribution-based balancing algorithm, and a regularized deep multi-layer perceptron neural network model with rectified linear units (DBB-RDNN-ReL). As demonstrated on four benchmark spam datasets (Enron, SpamAssassin, SMS spam collection and Social networking), the proposed approach captures more complex features from high-dimensional data through additional layers of neurons. Another advantage of this approach is that no additional dimensionality reduction is necessary, and spam dataset imbalance is addressed using a modified distribution-based algorithm. We compare the performance of the approach with that of state-of-the-art spam filters (Minimum Description Length, Factorial Design using SVM and NB, Incremental Learning C4.5, Random Forest, Voting, and Convolutional Neural Network) and with several machine learning algorithms commonly used to classify text. We show that the proposed model outperforms these other methods in terms of classification accuracy, with fewer false negatives and false positives.
Notably, the proposed spam filter classifies both the major (legitimate) and minor (spam) classes well on personalized / non-personalized and balanced / imbalanced spam datasets. In addition, we show that the proposed model achieves higher accuracy than the results reported in previous studies. However, the high computational cost of the additional hidden layers limits its application as an online spam filter and makes it difficult to overcome the problem of concept drift.
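To make the N-gram tf.idf weighting mentioned above more concrete, the following is a minimal sketch in plain Python. It is not the implementation used in the article: the abstract does not specify the tf.idf variant, so the raw-count term frequency, the idf defined as log(N / df), the whitespace tokenizer, the unigram-plus-bigram range, and the toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_max=2):
    """Weight each n-gram of each document by tf * idf.

    Assumed variant: tf is the raw count within the document and
    idf = log(N / df), where df is the number of documents containing
    the n-gram and N is the number of documents.
    """
    # Assumption: lowercase + whitespace split stands in for real pre-processing.
    grams_per_doc = [Counter(ngrams(d.lower().split(), n_max)) for d in docs]

    # Document frequency: count each n-gram once per document.
    df = Counter()
    for grams in grams_per_doc:
        df.update(grams.keys())

    n_docs = len(docs)
    return [
        {g: tf * math.log(n_docs / df[g]) for g, tf in grams.items()}
        for grams in grams_per_doc
    ]


# Hypothetical toy messages, not taken from any of the benchmark datasets.
docs = ["win free prize now", "meeting at noon", "free prize inside"]
weights = tfidf(docs)
```

In this sketch the bigram "free prize" occurs in two of the three toy messages, so it receives a lower weight than a term such as "win" that occurs in only one. An n-gram that appeared in every message would receive a weight of zero, which is one way a tf.idf score can be used to rank and select features.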
Acknowledgements
We gratefully acknowledge the constructive comments of the anonymous referees.
Cite this article
Barushka, A., Hajek, P. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Appl Intell 48, 3538–3556 (2018). https://doi.org/10.1007/s10489-018-1161-y
DOI: https://doi.org/10.1007/s10489-018-1161-y