

Aggressive and Offensive Language Identification in Hindi, Bangla, and English: A Comparative Study

  • Original Research
  • Published in SN Computer Science

Abstract

In the present paper, we carry out a comparative study of offensive and aggressive language and attempt to understand their inter-relationship. To this end, we develop classifiers for offensive and aggressive language identification in Hindi, Bangla, and English using the datasets released for these languages as part of two shared tasks: hate speech and offensive content identification in Indo-European languages (HASOC) and the aggression and misogyny identification task at TRAC-2. The HASOC dataset is annotated with information about offensive language, and the TRAC-2 dataset is annotated with information about aggressive language. We experiment with SVM as well as BERT and its derivatives, such as ALBERT and DistilBERT, for developing the classifiers. The best classifiers achieve F-scores between 0.70 and 0.80 across the different tasks. We use these classifiers to cross-annotate the two datasets and examine the co-occurrence of the different sub-categories of aggression and offense. The study shows that even though aggression and offense overlap significantly, one does not entail the other.


Notes

  1. Please note that in both tasks, the train, dev/validation, and test sets were provided separately. In this table, the figures for train refer to the combined train and dev sets; in our experiments, too, we use the two sets together for training the system and use the test set for reporting system performance.

  2. In this section, we discuss the experiments related to the development of classifiers for sub-tasks A and B of the HASOC shared task and sub-task A of the TRAC-2 shared task. For experiments related to sub-task C of HASOC and sub-task B of TRAC-2, see [33] and [5, 27], respectively.

  3. https://github.com/kaushaltrivedi/fast-bert.

  4. https://github.com/huggingface/pytorch-transformers.

  5. https://medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models-89ff763ad384.

  6. https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d.

  7. The precision, recall, and F-score reported in this paper are calculated using the scikit-learn classification_report function, which gives an average F-score weighted by the total ‘support’ of each class; as expected, this has resulted in some F-scores that do not lie between the corresponding precision and recall values.
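
To make the arithmetic concrete, the following sketch (with illustrative numbers, not results from the paper) shows why a support-weighted average F-score need not lie between the averaged precision and recall: each class's F-score is the harmonic mean of its own precision and recall, so the weighted average of the F-scores can fall below both the weighted precision and the weighted recall.

```python
# Illustration of support-weighted averaging as done by scikit-learn's
# classification_report; the per-class numbers below are hypothetical.

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall for one class."""
    return 2 * p * r / (p + r)

# Two hypothetical classes with equal support (weight 0.5 each): one has high
# precision but low recall, the other the reverse.
classes = [
    {"precision": 1.0, "recall": 0.2, "weight": 0.5},
    {"precision": 0.2, "recall": 1.0, "weight": 0.5},
]

avg_p = sum(c["weight"] * c["precision"] for c in classes)  # 0.6
avg_r = sum(c["weight"] * c["recall"] for c in classes)     # 0.6
avg_f = sum(c["weight"] * f1(c["precision"], c["recall"]) for c in classes)

# avg_f is about 0.333: below BOTH the averaged precision and averaged recall.
print(avg_p, avg_r, round(avg_f, 3))
```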

  8. We could not find any previous study that directly compares or studies the inter-relation between these phenomena; the assumption that they are independent, or that they are synonymous, is largely implicit in the silence of most researchers working in these areas. However, we have discussed some notable exceptions in the “Introduction”.

  9. Utterances are generally considered to be present only in spoken language. However, treating social media comments as a close approximation of speech, it is more logical to divide these comments into utterances than into sentences. In speech, utterances are delimited by short pauses. We considered the presence of one or more sentence-terminating punctuation marks, viz. the full stop, exclamation mark, and question mark, as marking the end of an utterance; these boundaries were used for calculating the punctuation statistics.
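
As an illustration of this heuristic (a sketch of our own, not the code used for the paper), a comment can be segmented into utterances by treating each run of sentence-terminating punctuation as one boundary:

```python
import re

def count_utterances(comment: str) -> int:
    """Count utterances, treating runs of . ! ? (e.g. "?!", "!!!") as boundaries."""
    # Split on one or more terminators; a trailing fragment without any
    # terminator still counts as an utterance.
    parts = [p for p in re.split(r"[.!?]+", comment) if p.strip()]
    return len(parts)

print(count_utterances("Really?! I can't believe it. Wow"))  # 3
```

Note that "?!" is counted as a single boundary, so the example comment yields three utterances rather than four.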

  10. The 300-dimensional vector of each lexical item in the GloVe model was reduced to two dimensions using principal component analysis (PCA) and then plotted as a scatter diagram to generate this visualisation; this is a standard, well-accepted method for visualising high-dimensional vectors in a lower-dimensional space when studying similarity.
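
The projection step can be sketched as follows (random stand-in vectors replace the actual GloVe model, and the scatter-plot call is only indicated; this is not the script used for the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 300))  # stand-in for 50 GloVe word vectors

# PCA via SVD: centre the data, decompose, keep the top two principal components.
centred = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ vt[:2].T  # shape (50, 2): one 2-D point per word

# The 2-D coordinates would then be scatter-plotted, e.g.:
# import matplotlib.pyplot as plt
# plt.scatter(coords_2d[:, 0], coords_2d[:, 1]); plt.show()
```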

  11. We would like to reiterate here that the mere use of a profane word does not make a comment offensive; profanity is only one of the factors that may lead to offense. Moreover, as this study has shown, neither of the two entails the other.

References

  1. Agarwal S, Sureka A. Using knn and svm based one-class classifier for detecting online radicalization on twitter. In: International conference on distributed computing and internet technology. Springer; 2015. pp. 431–442.

  2. Agarwal S, Sureka A. Characterizing linguistic attributes for automatic classification of intent based racist/radicalized posts on tumblr micro-blogging website. arXiv preprint. 2017;arXiv:1701.04931.

  3. Badjatiya P, Gupta S, Gupta M, Varma V. Deep learning for hate speech detection in tweets. In: Proceedings of the 26th international conference on world wide web companion, International World Wide Web Conferences Steering Committee. 2017. pp. 759–760.

  4. Basile V, Bosco C, Fersini E, Nozza D, Patti V, Rangel Pardo FM, Rosso P, Sanguinetti M. SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in twitter. In: Proceedings of the 13th international workshop on semantic evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA. 2019. pp. 54–63. https://doi.org/10.18653/v1/S19-2007. https://www.aclweb.org/anthology/S19-2007.

  5. Bhattacharya S, Singh S, Kumar R, Bansal A, Bhagat A, Dawer Y, Lahiri B, Ojha AK. Developing a multilingual annotated corpus of misogyny and aggression. In: Proceedings of the second workshop on trolling, aggression and cyberbullying, European Language Resources Association (ELRA), Marseille, France. 2020. pp. 158–168.

  6. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, Niculae V, Prettenhofer P, Gramfort A, Grobler J, Layton R, VanderPlas J, Joly A, Holt B, Varoquaux G. API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD workshop: languages for data mining and machine learning. 2013. pp. 108–122.

  7. Burnap P, Williams ML. Hate speech, machine classification and statistical modelling of information flows on twitter: interpretation and communication for policy decision making. In: Proceedings of internet, policy and politics. 2014. pp. 1–18.

  8. Cambria E, Chandra P, Sharma A, Hussain A. Do not feel the trolls. In: ISWC, Shanghai. 2010.

  9. Chen Y, Zhou Y, Zhu S, Xu H. Detecting offensive language in social media to protect adolescent online safety. In: Proceedings of the 2012 international conference on privacy, security, risk and trust (PASSAT) and international conference on social computing (SocialCom). 2012. pp. 71–80.

  10. Culpeper J. Impoliteness: using language to cause offence. Cambridge: Cambridge University Press; 2011.


  11. Dadvar M, Trieschnigg D, de Jong F. Experts and machines against bullies: a hybrid approach to detect cyberbullies. In: Advances in artificial intelligence. Berlin: Springer; 2014. pp. 275–281.

  12. Dadvar M, Trieschnigg D, Ordelman R, de Jong F. Improving cyberbullying detection with user context. In: Advances in information retrieval. Springer; 2013. pp. 693–696.

  13. Davidson T, Warmsley D, Macy M, Weber I. Automated hate speech detection and the problem of offensive language. In: Proceedings of ICWSM. 2017.

  14. Devlin J, Chang MW, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint. 2018; arXiv:1810.04805.

  15. Díaz-Torres MJ, Morán-Méndez PA, Villasenor-Pineda L, Montes-y Gómez M, Aguilera J, Meneses-Lerín L. Automatic detection of offensive language in social media: defining linguistic criteria to build a Mexican Spanish dataset. In: Proceedings of the second workshop on trolling, aggression and cyberbullying, European Language Resources Association (ELRA), Marseille, France. 2020. pp. 132–136. https://www.aclweb.org/anthology/2020.trac-1.21.

  16. Dinakar K, Jones B, Havasi C, Lieberman H, Picard R. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Trans Interact Intell Syst (TiiS). 2012;2(3):18:1–30.


  17. Djuric N, Zhou J, Morris R, Grbovic M, Radosavljevic V, Bhamidipati N. Hate speech detection with comment embeddings. In: Proceedings of the 24th international conference on world wide web. 2015. pp. 29–30.

  18. Fortuna P. Automatic detection of hate speech in text: an overview of the topic and dataset annotation with hierarchical classes. Master’s thesis, Faculdade de Engenharia da Universidade do Porto. 2017.

  19. Gitari ND, Zuping Z, Damien H, Long J. A lexicon- based approach for hate speech detection. Int J Multimed Ubiquitous Eng. 2015;10(4):215–30.


  20. Greevy E. Automatic text categorisation of racist webpages. Ph.D. thesis, Dublin City University. 2004.

  21. Greevy E, Smeaton AF. Classifying racist texts using a support vector machine. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, ACM. 2004. pp. 468–469.

  22. Hearst MA. Support vector machines. IEEE Intell Syst. 1998;13(4):18–28. https://doi.org/10.1109/5254.708428.


  23. Hee CV, Lefever E, Verhoeven B, Mennes J, Desmet B, Pauw GD, Daelemans W, Hoste V. Detection and fine-grained classification of cyberbullying events. In: Proceedings of international conference recent advances in natural language processing (RANLP). 2015. pp. 672–680.

  24. Kachru BB. The other tongue: English across Cultures. Urbana: University of Illinois Press; 1982.


  25. Kachru BB. The Indianization of English: the English language in India. New Delhi: Oxford University Press; 1983.


  26. Kumar R, Ojha AK. Kmi-panlingua at HASOC 2019: SVM vs BERT for hate speech and offensive content detection. In: Mehta P, Rosso P, Majumder P, Mitra M, editors. Working notes of FIRE 2019 - forum for information retrieval evaluation, Kolkata, India, December 12–15, 2019, CEUR workshop proceedings, vol. 2517, pp. 285–292. CEUR-WS.org. 2019. http://ceur-ws.org/Vol-2517/T3-14.pdf.

  27. Kumar R, Ojha AK, Malmasi S, Zampieri M. Evaluating aggression identification in social media. In: Proceedings of the second workshop on trolling, aggression and cyberbullying, European Language Resources Association (ELRA), Marseille, France. 2020. pp. 1–5.

  28. Kumar R, Reganti AN, Bhatia A, Maheshwari T. Aggression-annotated corpus of hindi-english code-mixed data. In: Chair NCC, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors. Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), European Language Resources Association (ELRA), Paris, France. 2018.

  29. Kumar S, Spezzano F, Subrahmanian V. Accurately detecting trolls in slashdot zoo via decluttering. In: Proceedings of IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM). 2014. pp. 188–195.

  30. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. Albert: a lite bert for self-supervised learning of language representations. arXiv preprint. 2019; arXiv:1909.11942.

  31. Malmasi S, Zampieri M. Challenges in discriminating profanity from hate speech. J Exp Theor Artif Intell. 2018;30:1–16.


  32. Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A. Overview of the hasoc track at fire 2019: hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th forum for information retrieval evaluation, FIRE ’19, Association for Computing Machinery, New York, NY, USA. 2019. pp. 14–17. https://doi.org/10.1145/3368567.3368584.

  33. Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A. Overview of the hasoc track at fire 2019: hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th forum for information retrieval evaluation. 2019. pp. 14–17.

  34. Mihaylov T, Georgiev GD, Ontotext A, Nakov P. Finding opinion manipulation trolls in news community forums. In: Proceedings of the nineteenth conference on computational natural language learning, CoNLL. 2015. pp. 310–314.

  35. Mojica LG. Modeling trolling in social media conversations. 2016. arXiv:1612.05310 [cs.CL]. https://arxiv.org/pdf/1612.05310.pdf.

  36. Montani JP, Schüller P. Tuwienkbs19 at germeval task 2, 2019: ensemble learning for german offensive language detection. In: Proceedings of the 15th conference on natural language processing (KONVENS 2019), German Society for Computational Linguistics and Language Technology, Erlangen, Germany. 2019. pp. 418–422.

  37. Nitin Bansal A, Sharma SM, Kumar K, Aggarwal A, Goyal S, Choudhary K, Chawla K, Jain K, Bhasinar M. Classification of flames in computer mediated communications. arXiv preprint. 2012; arXiv:1202.0617 [cs.SI]. https://arxiv.org/pdf/1202.0617.pdf.

  38. Nitta T, Masui F, Ptaszynski M, Kimura Y, Rzepka R, Araki K. Detecting cyberbullying entries on informal school websites based on category relevance maximization. In: Proceedings of IJCNLP. 2013. pp. 579–586.

  39. Nobata C, Tetreault J, Thomas A, Mehdad Y, Chang Y. Abusive language detection in online user content. In: Proceedings of the 25th international conference on world wide web, International World Wide Web Conferences Steering Committee. 2016. pp. 145–153.

  40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.


  41. Pennington J, Socher R, Manning CD. Glove: global vectors for word representation. In: Empirical methods in natural language processing (EMNLP). 2014. pp. 1532–1543. http://www.aclweb.org/anthology/D14-1162

  42. Sanh V, Debut L, Chaumond J, Wolf T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint. 2019; arXiv:1910.01108.

  43. Sax S. Flame wars: automatic insult detection. Tech. rep., Stanford University. 2016.

  44. Schmid F, Thielemann J, Mantwill A, Xi J, Labudde D, Spranger M. Fosil - offensive language classification of german tweets combining svms and deep learning techniques. In: Proceedings of the 15th conference on natural language processing (KONVENS 2019), German Society for Computational Linguistics and Language Technology, Erlangen, Germany. 2019. pp. 382–386.

  45. Schmidt A, Wiegand M. A survey on hate speech detection using natural language processing. In: Proceedings of the fifth international workshop on natural language processing for social media, Association for Computational Linguistics, Valencia, Spain. 2017. pp. 1–10.

  46. Struß JM, Siegel M, Ruppenhofer J, Wiegand M, Klenner M. Overview of germeval task 2, 2019 shared task on the identification of offensive language. In: Proceedings of the 15th conference on natural language processing (KONVENS 2019), German Society for Computational Linguistics and Language Technology, Erlangen, Germany. 2019. pp. 354–365.

  47. Tedeschi JT, Felson RB. Violence, aggression, and coercive actions. Washington: American Psychological Association; 1994.


  48. Vigna FD, Cimino A, Dell’Orletta F, Petrocchi M, Tesconi M. Hate me, hate me not: hate speech detection on facebook. In: Proceedings of the first Italian conference on cybersecurity, 2017. pp. 86–95.

  49. Waseem Z, Davidson T, Warmsley D, Weber I. Understanding abuse: a typology of abusive language detection subtasks. In: Proceedings of the first workshop on abusive language online, Association for Computational Linguistics. 2017. pp. 78–84. http://aclweb.org/anthology/W17-3012.

  50. Waseem Z, Hovy D. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In: Proceedings of NAACL-HLT. 2016. pp. 88–93.

  51. Wiegand M, Siegel M, Ruppenhofer J. Overview of the GermEval 2018 shared task on the identification of offensive language. In: Proceedings of GermEval. 2018.

  52. Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R. Predicting the type and target of offensive posts in social media. In: Proceedings of the annual conference of the North American chapter of the association for computational linguistics: human language technology (NAACL-HLT). 2019.

  53. Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In: Proceedings of the 13th international workshop on semantic evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA. 2019. pp. 75–86. https://doi.org/10.18653/v1/S19-2010. https://www.aclweb.org/anthology/S19-2010.

  54. Zampieri M, Nakov P, Rosenthal S, Atanasova P, Karadzhov G, Mubarak H, Derczynski L, Pitenis Z, Çöltekin Ç. SemEval-2020 task 12: multilingual offensive language identification in social media (OffensEval 2020). In: Proceedings of SemEval. 2020.


Acknowledgements

We would like to thank the organisers of HASOC and TRAC-2 shared tasks for making the datasets publicly available, which has enabled us to carry out this research.

Author information

Corresponding author

Correspondence to Ritesh Kumar.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Social Media Analytics and its Evaluation” guest edited by Thomas Mandl, Sandip Modha, and Prasenjit Majumder.


About this article


Cite this article

Kumar, R., Lahiri, B. & Ojha, A.K. Aggressive and Offensive Language Identification in Hindi, Bangla, and English: A Comparative Study. SN COMPUT. SCI. 2, 26 (2021). https://doi.org/10.1007/s42979-020-00414-6

