{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T19:11:53Z","timestamp":1721416313381},"reference-count":23,"publisher":"Association for Computing Machinery (ACM)","issue":"6","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"Word embedding is used to represent words for text analysis. It plays an essential role in many Natural Language Processing (NLP) studies and has hugely contributed to the extraordinary developments in the field in the last few years. In Arabic, diacritic marks are a vital feature for the readability and understandability of the language. Current Arabic word embeddings are non-diacritized. In this article, we aim to develop and compare word embedding models based on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We propose evaluating the models in four different ways: clustering of the nearest words; morphological semantic analysis; part-of-speech tagging; and semantic analysis. For a better evaluation, we took the challenge to create three new datasets from scratch for the three downstream tasks. We conducted the downstream tasks with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model exhibits a better ability to capture syntactic and semantic relations and in clustering words of similar categories. Overall, the diacritized model outperforms the non-diacritized model. We obtained some more interesting findings. 
For example, from the morphological semantics analysis, we found that with the increase in the number of target words, the advantages of the diacritized model are also more obvious, and the diacritic marks have more significance in POS tagging than in other tasks.","DOI":"10.1145\/3592603","type":"journal-article","created":{"date-parts":[[2023,4,19]],"date-time":"2023-04-19T12:10:30Z","timestamp":1681906230000},"page":"1-30","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["The Impact of Arabic Diacritization on Word Embeddings"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"http:\/\/orcid.org\/0000-0003-1092-5931","authenticated-orcid":false,"given":"Mohamed","family":"Abbache","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Tianjin University of Technology, Tianjin, China"}]},{"ORCID":"http:\/\/orcid.org\/0009-0008-4183-7620","authenticated-orcid":false,"given":"Ahmed","family":"Abbache","sequence":"additional","affiliation":[{"name":"Mathematics and its Applications Laboratory, Faculty of Exact Sciences and Computing, Hassiba Ben Bouali University of Chlef, Ouled Fares, Chlef Province, Algeria"}]},{"ORCID":"http:\/\/orcid.org\/0009-0001-4120-171X","authenticated-orcid":false,"given":"Jingwen","family":"Xu","sequence":"additional","affiliation":[{"name":"Computer Science, Faculty of Information Engineering, Computer Science and Statistics, Sapienza University of Rome, Rome, Italy"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-9811-6914","authenticated-orcid":false,"given":"Farid","family":"Meziane","sequence":"additional","affiliation":[{"name":"Data Science Research Centre, University of Derby, The United Kingdom"}]},{"ORCID":"http:\/\/orcid.org\/0009-0001-3702-0000","authenticated-orcid":false,"given":"Xianbin","family":"Wen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tianjin University of 
Technology, Tianjin, China"}]}],"member":"320","published-online":{"date-parts":[[2023,6,16]]},"reference":[{"key":"e_1_3_2_2_1","volume-title":"Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS\u201918)","author":"Abid Wael","year":"2018","unstructured":"Wael Abid and Younes Bensouda Mourri. 2018. Improving English to Arabic machine translation. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS\u201918), Montr\u00e9al, Canada."},{"key":"e_1_3_2_3_1","volume-title":"Proceedings of the ML4D Workshop at 34th Conference on Neural Information Processing Systems (NeurIPS) 2020 Workshop on Machine Learning for the Developing World. arXiv preprint","author":"Adewumi Tosin P.","year":"2020","unstructured":"Tosin P. Adewumi, Foteini Liwicki, and Marcus Liwicki. 2020. The challenge of diacritics in Yor\u00f9b\u00e1 embeddings. In Proceedings of the ML4D Workshop at 34th Conference on Neural Information Processing Systems (NeurIPS) 2020 Workshop on Machine Learning for the Developing World. Vancouver, Canada. arXiv preprint arXiv:2011.07605."},{"key":"e_1_3_2_4_1","article-title":"Massive vs. curated embeddings for low-resourced languages: the case of Yor\u00f9b\u00e1 and Twi","author":"Alabi Jesujoba","year":"2020","unstructured":"Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina Espa\u00f1a-Bonet. 2020. Massive vs. curated embeddings for low-resourced languages: the case of Yor\u00f9b\u00e1 and Twi. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC\u201920). Computation and Language. arXiv preprint arXiv:1912.02481. Version 2.","journal-title":"Proceedings of the 12th Language Resources and Evaluation Conference (LREC\u201920). Computation and Language"},{"key":"e_1_3_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSP-SPE.2013.6642556"},{"key":"e_1_3_2_6_1","volume-title":"An Arabic Morphological Analyzer and Part-of-speech tagger. 
A Thesis Presented to the Faculty of Informatics Engineering","author":"Altabba Muhammad","year":"2010","unstructured":"Muhammad Altabba, Ammar Al-Zaraee, and Mohammad Arif Shukairy. 2010. An Arabic Morphological Analyzer and Part-of-speech tagger. A Thesis Presented to the Faculty of Informatics Engineering, Arab International University, Damascus, Syria."},{"key":"e_1_3_2_7_1","volume-title":"Proceedings of the Association for Machine Translation in the Americas: MT Researchers' Track Conferences","author":"Alqahtani Sawsan","year":"2016","unstructured":"Sawsan Alqahtani, Mahmoud Ghoneim, and Mona Diab. 2016. Investigating the impact of various partial diacritization schemes on Arabic-English statistical machine translation. In Proceedings of the Association for Machine Translation in the Americas: MT Researchers' Track Conferences. Austin, TX, USA, 191\u2013204."},{"key":"e_1_3_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3434236"},{"key":"e_1_3_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3465336.3475098"},{"key":"e_1_3_2_10_1","article-title":"The accuracy comparison among word2vec, glove, and fasttext towards convolution neural network text classification","author":"Dharma Eddy Muntina","year":"2022","unstructured":"Eddy Muntina Dharma, Ford Lumban Gaol, Harco Leslie Hendric Spits Warnars, and Benfano Soewito. 2022. The accuracy comparison among word2vec, glove, and fasttext towards convolution neural network text classification. Journal of Theoretical and Applied Information Technology (2022).","journal-title":"Journal of Theoretical and Applied Information Technology"},{"key":"e_1_3_2_11_1","volume-title":"Proceedings of Machine Translation Summit XI","author":"Diab Mona","year":"2007","unstructured":"Mona Diab, Mahmoud Ghoneim, and Nizar Habash. 2007. Arabic diacritization in the context of statistical machine translation. 
In Proceedings of Machine Translation Summit XI, Copenhagen, Denmark."},{"key":"e_1_3_2_12_1","unstructured":"Esther Fleming. 2020. Is Madinah Arabic free? Retrieved May 20 2022 from https:\/\/www.sidmartinbio.org\/is-madinah-arabic-free\/."},{"key":"e_1_3_2_13_1","article-title":"Learning word vectors for 157 languages","author":"Grave Edouard","year":"2018","unstructured":"Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893.","journal-title":"arXiv preprint"},{"key":"e_1_3_2_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-23281-8_29"},{"key":"e_1_3_2_15_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071"},{"key":"e_1_3_2_16_1","first-page":"306","volume-title":"Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14)","author":"Masmoudi Abir","year":"2014","unstructured":"Abir Masmoudi, Mariem Ellouze Khemakhem, Yannick Est\u00e8ve, Lamia Hadrich Belguith, and Nizar Habash. 2014. A corpus and phonetic dictionary for Tunisian Arabic speech recognition. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14), ID L14-1385, Reykjavik, Iceland, 306\u2013310."},{"key":"e_1_3_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297278"},{"key":"e_1_3_2_18_1","unstructured":"Ayman Nadeem. 2020. Arabic trilateral roots. Retrieved May 20 2022 from https:\/\/medium.com\/@aymannadeem\/arabic-trilateral-roots-3186e8319b0."},{"key":"e_1_3_2_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2017.10.117"},{"key":"e_1_3_2_20_1","volume-title":"Document Embedding Models - A Comparison with Bag-of-Words. Master Thesis","author":"Stohler Robin","year":"2018","unstructured":"Robin Stohler. 2018. Document Embedding Models - A Comparison with Bag-of-Words. Master Thesis, Supervisors: Abraham Bernstein. Merlin - OEC Faculty Information System. 
University of Zurich. Zurich ZH, Switzerland."},{"key":"e_1_3_2_21_1","volume-title":"Morphological Indication of Al-Khasa'is Book for Ebn Jini: Descriptive Analytical Study","author":"Qawaqzeh Othman Salem Bakheet","year":"2019","unstructured":"Othman Salem Bakheet Qawaqzeh. 2019. Morphological Indication of Al-Khasa'is Book for Ebn Jini: Descriptive Analytical Study. University of Jordan Deanship of Academic Research (DAR)."},{"key":"e_1_3_2_22_1","volume-title":"Proceedings of the WoLeR 2011 Conference at ESSLLI International Workshop on Lexical Resources at: Ljubliana","author":"Neme Alexis","year":"2011","unstructured":"Alexis Neme. 2011. A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers. In Proceedings of the WoLeR 2011 Conference at ESSLLI International Workshop on Lexical Resources at: Ljubliana."},{"key":"e_1_3_2_23_1","first-page":"139","volume-title":"Proceedings of the 5th Arabic Natural Language Processing Workshop","author":"Younes Ahmed","year":"2020","unstructured":"Ahmed Younes and Julie Weeds. 2020. Embed more ignore less (EMIL): Exploiting enriched representations for Arabic NLP. In Proceedings of the 5th Arabic Natural Language Processing Workshop. 139\u2013154."},{"key":"e_1_3_2_24_1","first-page":"147","volume-title":"Tashkeela: Novel corpus of Arabic vocalized texts, data for autodiacritization systems","author":"Zerrouki Taha","year":"2017","unstructured":"Taha Zerrouki and Amar Balla. 2017. Tashkeela: Novel corpus of Arabic vocalized texts, data for autodiacritization systems. 
Data Brief, 147\u2013151."}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3592603","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,17]],"date-time":"2023-06-17T07:31:18Z","timestamp":1686987078000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3592603"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,16]]},"references-count":23,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3592603"],"URL":"https:\/\/doi.org\/10.1145\/3592603","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,16]]},"assertion":[{"value":"2022-06-07","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-30","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}