HiTACoD: Hierarchical Framework for Textual Abusive Content Detection

Gashroo, Ovais Bashir; Mehrotra, Monica

doi:10.1007/s42979-023-02213-1

Original Research
Published: 25 September 2023

Volume 4, article number 727 (2023)
SN Computer Science

Abstract

Abusive text on social media can have severely adverse effects. This study investigates the performance of different machine learning and deep learning models for detecting such content, using various feature representations and data augmentation techniques. It also examines the need for a multi-classification framework and for data balancing in the classification of abusive content. The experiments were conducted on a specific dataset and task, and the results indicate that the choice of feature representation and classifier is crucial in textual abusive content detection. The results suggest that the TF-IDF representation is more effective than the bag-of-words representation in capturing the meaning and context of words in a text, which can improve the performance of the classifiers in detecting abuse. They also suggest that unigram features may be more effective than bigram features on this dataset. Furthermore, using pre-trained word embeddings such as Word2Vec in deep learning models can improve classification performance. Model performance also improves when data augmentation techniques are applied, namely SMOTE and the contextual word embedding augmenter from the nlpaug library with the BERT-base-uncased model. Overall, the results suggest that pre-trained word embeddings and data augmentation for imbalanced data are promising for abusive content detection.
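
The feature representations and the balancing step named in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline, which is not shown on this page. The toy corpus, the function names, the whitespace tokenizer, and the idf variant `ln(N / df) + 1` are all assumptions made for illustration, and `smote` below is a simplified SMOTE-style interpolation rather than a full implementation.

```python
import math
import random
from collections import Counter

def ngrams(text, n=1):
    """Lowercase, split on whitespace, and return the list of word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(docs, n=1):
    """Return (vocabulary, raw count vectors) for a list of documents."""
    counts = [Counter(ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def tf_idf(docs, n=1):
    """Return (vocabulary, TF-IDF vectors); idf = ln(N / df) + 1 (assumed variant)."""
    vocab, bow = bag_of_words(docs, n)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) + 1.0 for d in df]
    return vocab, [[tf * w for tf, w in zip(row, idf)] for row in bow]

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [minority[j] for j in range(len(minority)) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic

# Toy corpus invented for illustration: label 1 = abusive, 0 = not abusive.
docs = ["you are an idiot", "have a nice day", "what a lovely day",
        "you are so kind", "shut up you idiot", "see you at lunch"]
labels = [1, 0, 0, 0, 1, 0]

vocab, X = tf_idf(docs, n=1)               # unigram TF-IDF features
bigram_vocab, _ = bag_of_words(docs, n=2)  # bigram bag-of-words vocabulary

# Oversample the minority (abusive) class until both classes are equal in size.
minority = [x for x, y in zip(X, labels) if y == 1]
n_needed = labels.count(0) - labels.count(1)
X_balanced = X + smote(minority, n_needed, k=1)
y_balanced = labels + [1] * n_needed
```

In this toy corpus, "idiot" occurs in fewer documents than "you", so TF-IDF gives it the larger weight within the same message, whereas raw bag-of-words counts treat the two words equally. The synthetic minority vectors would normally be generated from the training split only, so that no interpolated sample leaks into evaluation data.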

Data availability

Not applicable. The dataset used in the study is publicly available, and a reference to it is provided in the corresponding section.

Author information

Authors and Affiliations

Department of Computer Science, Jamia Millia Islamia, New Delhi, 110025, Delhi, India
Ovais Bashir Gashroo & Monica Mehrotra

Corresponding author

Correspondence to Ovais Bashir Gashroo.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh and S. Karthikeyan.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gashroo, O.B., Mehrotra, M. HiTACoD: Hierarchical Framework for Textual Abusive Content Detection. SN COMPUT. SCI. 4, 727 (2023). https://doi.org/10.1007/s42979-023-02213-1

Received: 28 February 2023
Accepted: 02 August 2023
Published: 25 September 2023
DOI: https://doi.org/10.1007/s42979-023-02213-1

Part of a collection:

Research Trends in Computational Intelligence