Abstract
The COVID-19 pandemic brought about a flood of misinformation in fake news articles and social media posts, making it necessary to determine whether a given piece of information about COVID-19 is legitimate. Because misinformation spreads rapidly over the internet, manual verification of sources is infeasible. Several studies have already explored machine learning for automating COVID-19 misinformation detection. This paper investigates COVID-19 misinformation detection in three parts. First, we identify the common themes in COVID-19 misinformation data using Latent Dirichlet Allocation (LDA). Second, we use CatBoost as a classifier for detecting misinformation and compare its performance against other classifiers such as SVM, XGBoost, and LightGBM. Lastly, we highlight CatBoost’s most important features and decision-making mechanism using Shapley values.
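To make the third step more concrete, the following is a minimal sketch of how Shapley values attribute a prediction to individual features. It is not the method used in the paper, which applies Shapley values to a trained CatBoost model; here the model is replaced by a hypothetical hand-written scoring function (`toy_score`) over three made-up word features, and the values are computed by brute-force enumeration of all feature subsets, which is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when it joins coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_score(present):
    """Hypothetical misinformation score for a set of words present in a post."""
    score = 0.0
    if "cure" in present:
        score += 0.4
    if "hoax" in present:
        score += 0.3
    # Interaction term: "miracle" only matters together with "cure".
    if "cure" in present and "miracle" in present:
        score += 0.2
    return score


words = ["cure", "hoax", "miracle"]
phi = shapley_values(toy_score, words)
for word in words:
    print(f"{word}: {phi[word]:.3f}")
```

In this toy setting the interaction bonus is split evenly between "cure" and "miracle", and the attributions sum to the difference between the score with all words present and the score with none, which is the additivity property that makes Shapley values useful for explaining individual predictions.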
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Avelino, J.N.M., Felizmenio Jr., E.P., Naval Jr., P.C. (2022). Unraveling COVID-19 Misinformation with Latent Dirichlet Allocation and CatBoost. In: Bădică, C., Treur, J., Benslimane, D., Hnatkowska, B., Krótkiewicz, M. (eds) Advances in Computational Collective Intelligence. ICCCI 2022. Communications in Computer and Information Science, vol 1653. Springer, Cham. https://doi.org/10.1007/978-3-031-16210-7_2
DOI: https://doi.org/10.1007/978-3-031-16210-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16209-1
Online ISBN: 978-3-031-16210-7
eBook Packages: Computer Science, Computer Science (R0)