Abstract
Preprocessing the input text is the first step in any Natural Language Processing (NLP) application. Word stemming, i.e. extracting the stem or root of an input word, is a vital part of this preprocessing. For example, the words “player”, “playing”, and “played” are all mapped to their stem “play”. For English, several algorithms and approaches can be applied directly to this task. For Arabic, similar algorithms have been attempted, but all of them perform poorly because of the complexity of the language and the approaches used to build them. In this paper, we present Masdar, a novel deep learning-based model for Arabic stemming. The proposed model leverages deep learning, in particular recurrent neural networks, to build an efficient Arabic stemmer that produces very accurate stems for most input words. Experiments compare the proposed model with the most recently cited Arabic stemmers on a dataset of about 6000 Arabic word/stem pairs. The results show that Masdar outperforms the other stemmers: it produces the correct stem with about 95% accuracy on the whole dataset and about 82% accuracy on unseen test words.
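To make the task and the reported metric concrete, the following is a minimal sketch of what stemming and exact-match accuracy over word/stem pairs look like. It is not the Masdar model and not the Porter algorithm: the suffix list, the function names `naive_stem` and `stem_accuracy`, and the toy English pairs are all assumptions introduced here for illustration only.

```python
# Illustrative only: a naive English suffix stripper, NOT the Masdar model
# and NOT the Porter algorithm. All names and data here are hypothetical.
SUFFIXES = ("ing", "ed", "er", "s")  # assumed toy suffix list, tried in this order


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched: return the word unchanged


def stem_accuracy(pairs, stemmer) -> float:
    """Fraction of (word, gold_stem) pairs where the stemmer output equals the gold stem."""
    if not pairs:
        return 0.0
    correct = sum(1 for word, gold in pairs if stemmer(word) == gold)
    return correct / len(pairs)


# Toy word/stem pairs; the last one is irregular and defeats suffix stripping.
pairs = [("player", "play"), ("playing", "play"), ("played", "play"), ("ran", "run")]
predictions = [naive_stem(word) for word, _ in pairs]
accuracy = stem_accuracy(pairs, naive_stem)
print(predictions, accuracy)
```

The irregular pair shows why fixed suffix rules break down. Arabic morphology makes this harder still, which is the motivation the abstract gives for a learned model.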
Notes
1. Snowball Stemmers: http://snowball.tartarus.org/texts/introduction.html.
2. Arabic Language: https://en.wikipedia.org/wiki/Arabic.
3. Varieties of Arabic: https://en.wikipedia.org/wiki/Varieties_of_Arabic.
4. PyTorch Deep Learning Platform: https://pytorch.org/.
Acknowledgments
We thank the administration of the High-Performance Computing Center (HPCC) at King Abdulaziz University for their support and for access to the Aziz Supercomputer, which provided the high computing capability needed to train the Sequence-to-Sequence model in Masdar.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Fouad, M.M., Mahany, A., Katib, I. (2020). Masdar: A Novel Sequence-to-Sequence Deep Learning Model for Arabic Stemming. In: Bi, Y., Bhatia, R., Kapoor, S. (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham. https://doi.org/10.1007/978-3-030-29513-4_26
DOI: https://doi.org/10.1007/978-3-030-29513-4_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29512-7
Online ISBN: 978-3-030-29513-4
eBook Packages: Intelligent Technologies and Robotics (R0)