Unraveling COVID-19 Misinformation with Latent Dirichlet Allocation and CatBoost

  • Conference paper
Advances in Computational Collective Intelligence (ICCCI 2022)

Abstract

The COVID-19 pandemic brought about a plethora of misinformation in fake news articles and social media posts. This makes it necessary to determine whether a given piece of information about COVID-19 is legitimate. However, because misinformation spreads rapidly over the internet in large volumes, manual verification of sources is infeasible. Several studies have already explored machine learning for automating COVID-19 misinformation detection. This paper investigates COVID-19 misinformation detection in three parts. First, we identify the common themes in COVID-19 misinformation data using Latent Dirichlet Allocation (LDA). Second, we use CatBoost as a classifier for detecting misinformation and compare its performance against other classifiers such as SVM, XGBoost, and LightGBM. Lastly, we highlight CatBoost’s most important features and decision-making mechanism using Shapley values.
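
To make the last step more concrete, the following is a minimal sketch of what a Shapley value measures. It is not the paper's implementation: the scoring function `toy_score`, the keyword features, and the weights are all hypothetical, chosen only so that the arithmetic is easy to follow. The sketch computes exact Shapley values by enumerating every subset of features, which is only feasible for a handful of features; practical tools for tree models such as CatBoost rely on faster algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        # Average the marginal contribution of f over all subsets of the
        # remaining features, weighted by how often each subset size occurs
        # across feature orderings.
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_score(present):
    """Hypothetical misinformation score based on which keywords are present."""
    score = 0.0
    if "cure" in present:
        score += 0.4
    if "5g" in present:
        score += 0.3
    if "cure" in present and "miracle" in present:
        score += 0.2  # interaction term, split between the two features
    return score


features = ["cure", "5g", "miracle"]
phi = shapley_values(toy_score, features)
for name in features:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy setting the attributions sum to the difference between the score with all features present and the score with none, which is the efficiency property that makes Shapley values attractive for explaining a classifier's output. The 0.2 interaction term is shared equally between "cure" and "miracle".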

Author information

Corresponding author

Correspondence to Joy Nathalie M. Avelino.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Avelino, J.N.M., Felizmenio Jr., E.P., Naval Jr., P.C. (2022). Unraveling COVID-19 Misinformation with Latent Dirichlet Allocation and CatBoost. In: Bădică, C., Treur, J., Benslimane, D., Hnatkowska, B., Krótkiewicz, M. (eds) Advances in Computational Collective Intelligence. ICCCI 2022. Communications in Computer and Information Science, vol 1653. Springer, Cham. https://doi.org/10.1007/978-3-031-16210-7_2

  • DOI: https://doi.org/10.1007/978-3-031-16210-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16209-1

  • Online ISBN: 978-3-031-16210-7

  • eBook Packages: Computer Science, Computer Science (R0)
