Abstract
With the rise of the social web, concern has grown about the quality of user-generated content on social media sites (SMSs). Deceptive comments harm users’ trust in online social media and cause financial loss to firms. Previous studies have used various features and classification algorithms to detect and filter social spam on several social media platforms. However, to the best of our knowledge, none has combined probabilistic topic modeling with incremental learning to detect social spam on SMSs. The main contribution of this paper is therefore the design of a novel detection methodology that combines topic- and user-based features to improve the effectiveness of social spam detection. The proposed methodology uses a probabilistic generative model, namely labeled latent Dirichlet allocation (L-LDA), to mine the latent semantics of user-generated comments, and an incremental learning approach to cope with the changing feature space. An experiment on a large dataset extracted from YouTube demonstrates the effectiveness of the proposed methodology, which achieves an average accuracy of 91.17% in social spam detection. Our statistical analysis reveals that topic-based features significantly improve social spam detection, a finding with significant implications for business practice.
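The incremental-learning idea in the abstract, a classifier updated one comment at a time while the set of features keeps changing, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a plain online perceptron in place of the authors' learner, and the feature names (`topic:...`, `user:...`) and their values are hypothetical stand-ins for L-LDA topic weights and user statistics.

```python
class SparsePerceptron:
    """Online perceptron whose weight map grows as new features appear."""

    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight; unseen features count as 0
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(
            self.weights.get(name, 0.0) * value
            for name, value in features.items()
        )

    def predict(self, features):
        # +1 means spam, -1 means legitimate; a score of exactly 0 maps to -1.
        return 1 if self.score(features) > 0 else -1

    def partial_fit(self, features, label):
        """Update on a single example, only when it is misclassified."""
        if self.predict(features) != label:
            for name, value in features.items():
                old = self.weights.get(name, 0.0)
                self.weights[name] = old + self.learning_rate * label * value
            self.bias += self.learning_rate * label


# Hypothetical comment stream: (feature dict, label). The feature
# "topic:giveaway" only appears in the third example, so the feature
# space changes while the model is being trained.
stream = [
    ({"topic:promotion": 0.9, "user:comments_per_day": 0.8}, 1),
    ({"topic:music": 0.7, "user:comments_per_day": 0.1}, -1),
    ({"topic:giveaway": 1.0}, 1),
    ({"topic:music": 0.6, "user:comments_per_day": 0.2}, -1),
]

model = SparsePerceptron()
for features, label in stream:
    model.partial_fit(features, label)

print(sorted(model.weights))
print(model.predict({"topic:giveaway": 1.0}))
```

Storing weights in a dictionary keyed by feature name means a previously unseen topic or user feature needs no retraining or resizing: it simply enters the model the first time an example containing it is misclassified.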
Acknowledgments
This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project: CityU 11502115), and the Shenzhen Municipal Science and Technology R&D Funding - Basic Research Program (Project No. JCYJ20140419115614350).
Cite this article
Song, L., Lau, R.Y.K., Kwok, R.CW. et al. Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection. Electron Commer Res 17, 51–81 (2017). https://doi.org/10.1007/s10660-016-9244-5
DOI: https://doi.org/10.1007/s10660-016-9244-5