Abstract
Rule-based systems built on two-level morphology work quite well for tagging the morphological features of Bengali words: they can predict all possible morphological derivations for standard word forms whose roots occur in the dictionary. However, many words have multiple morphological derivations, and the correct one depends on the context of the word. Non-dictionary words are also very frequent. Machine learning based methods that take the context of a word into account have been used to predict the values of its morphological features. Although these systems can, to some extent, disambiguate words with multiple possible values, they need to be improved to make more efficient use of character-level information. Character-level information is particularly important for the analysis of out-of-vocabulary (OOV) words, which are not seen in the training data. We propose a method that uses both the context of a word and its constituent characters in order to develop a high-quality morphological analyzer for Bengali. In this work we show that using character-level information along with contextual information improves the performance of the morphological analyzer, both for OOV words and in predicting the correct analyses for instances of words that can have multiple morphological derivations.
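To make the idea of combining character-level and contextual information concrete, the following is a minimal illustrative sketch, not the model described in the paper. It assumes a simple feature-based view: each token is described by its character n-grams (which remain available for OOV words) together with its neighbouring tokens. The function names `char_ngrams` and `token_features`, the boundary markers, and the `<PAD>` symbol are all hypothetical choices made for this example.

```python
from typing import Dict, List


def char_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return character n-grams of a word wrapped in boundary markers.

    Note: Python slices strings by Unicode code point, so Bengali vowel
    signs are treated as separate characters in this sketch.
    """
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def token_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, object]:
    """Combine character-level features of tokens[i] with its context words."""
    feats: Dict[str, object] = {
        "word": tokens[i],
        "char_ngrams": char_ngrams(tokens[i]),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence are filled with a padding symbol.
        feats[f"ctx{offset:+d}"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats
```

Even when a word never occurred in the training data, its n-grams (for example, a suffix such as `"b>"`) can overlap with those of known words, while the `ctx-1` and `ctx+1` entries supply the context needed to choose among several candidate analyses.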
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Das, A., Sarkar, S. (2023). MorphBen: A Neural Morphological Analyzer for Bengali Language. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_42
DOI: https://doi.org/10.1007/978-3-031-24337-0_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science, Computer Science (R0)