HiTACoD: Hierarchical Framework for Textual Abusive Content Detection

  • Original Research
SN Computer Science

Abstract

Abusive textual content on social media can have severe adverse effects. This study investigates the performance of different machine learning and deep learning models for detecting such content, using various feature representation and data augmentation techniques. It also examines the need for a multi-classification framework and for data balancing in the classification of abusive content. The experiments were conducted on a specific dataset and task, and the results indicate that the choice of feature representation and classifier is crucial in textual abusive content detection. The results suggest that the TF-IDF representation is more effective than the bag-of-words representation at capturing the meaning and context of words in a text, which can improve classifier performance in detecting abuse. They also suggest that unigram features may be more effective than bigram features on this dataset. Furthermore, using pre-trained word embeddings such as Word2Vec in deep learning models can improve classification performance. Model performance also improves when data augmentation techniques are applied, namely SMOTE and the contextual word embedding augmenter from the nlpaug library with the BERT-base-uncased model. Overall, the results suggest that pre-trained word embeddings and data augmentation for imbalanced data are promising for abusive content detection.
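
To make the feature representations and the balancing step named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes bag-of-words counts and TF-IDF weights over unigrams or bigrams, and generates one SMOTE-style synthetic sample by interpolating between minority-class vectors. The function names, the toy corpus, and the idf variant ln(N / df) are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(docs, n=1):
    """Raw n-gram counts per document (bag-of-words representation)."""
    return [Counter(ngrams(doc, n)) for doc in docs]


def tf_idf(docs, n=1):
    """TF-IDF weights per document, with idf = ln(N / df) (one common variant)."""
    counts = bag_of_words(docs, n)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    total = len(docs)
    return [
        {term: tf * math.log(total / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority-class
    vector and one of its k nearest minority neighbours (the core SMOTE step)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (o - b) for b, o in zip(base, other)]


# Hypothetical toy corpus, not taken from the paper's dataset.
docs = ["you are awful", "you are kind", "have a kind day"]
bow = bag_of_words(docs)           # unigram counts
weights = tf_idf(docs)             # unigram TF-IDF weights
bigrams = bag_of_words(docs, n=2)  # bigram counts
synthetic = smote_sample([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy corpus, a term that occurs in a single document ("awful") receives a higher TF-IDF weight than one shared by two documents ("you"), whereas raw bag-of-words counts treat both alike. In practice, a maintained library would normally replace these hand-written helpers.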

Data availability

Not applicable. The dataset used in the study is publicly available, and a reference to it is provided in the corresponding section.

Notes

  1. https://developer.twitter.com/en/products/twitter-api.

  2. https://www.github.com/makcedward/nlpaug/blob/master/nlpaug/augmenter/word/context_word_embs.py.

  3. https://huggingface.co/bert-base-uncased.

Author information

Corresponding author

Correspondence to Ovais Bashir Gashroo.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh and S. Karthikeyan.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gashroo, O.B., Mehrotra, M. HiTACoD: Hierarchical Framework for Textual Abusive Content Detection. SN COMPUT. SCI. 4, 727 (2023). https://doi.org/10.1007/s42979-023-02213-1

  • DOI: https://doi.org/10.1007/s42979-023-02213-1
