Abstract
Reduplication is a highly productive process in Bengali word formation, with significant implications for natural language processing (NLP) applications such as part-of-speech tagging and sentiment analysis. Despite its importance, it has not been extensively explored in computational linguistics, especially for low-resource languages like Bengali. This study first demonstrates that a two-way finite-state transducer (2-way FST) can capture the generation of complete reduplication in Bengali. Second, it shows that partial reduplication requires a set of 2-way FSTs because of the diverse patterns that Bengali partial reduplication exhibits. Third, it shows that the reduplication generation process can be used to identify reduplication instances in Bengali text, achieving an F1-score of 88.11% and outperforming current state-of-the-art methods for this task. The research thus contributes to the computational representation of reduplication in Bengali and offers potential enhancements for NLP tasks in low-resource language scenarios.
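As an illustration of how a 2-way FST can generate complete reduplication, the following minimal Python sketch simulates one: it copies the input left to right, rewinds the read head to the left boundary without emitting anything, and then copies the input a second time. This is not the authors' implementation. The state names (`copy1`, `rewind`, `copy2`), the boundary sentinels, the function name `total_reduplicate`, and the choice to emit a separator between the two copies are all assumptions made for the example.

```python
from typing import List

# Hypothetical boundary sentinels; objects are used so they can never
# collide with a character of the input word.
LEFT = object()
RIGHT = object()


def total_reduplicate(word: str, sep: str = " ") -> str:
    """Simulate a 2-way FST that maps w to w + sep + w.

    The read head may move right (+1) or left (-1) over a tape that is
    the input word wrapped in boundary markers.
    """
    tape: List[object] = [LEFT, *word, RIGHT]
    state, pos = "copy1", 1
    out: List[str] = []

    while state != "halt":
        sym = tape[pos]
        if state == "copy1":
            if sym is RIGHT:
                # First copy finished: emit the separator and turn around.
                out.append(sep)
                state, pos = "rewind", pos - 1
            else:
                out.append(sym)  # emit the symbol, move right
                pos += 1
        elif state == "rewind":
            if sym is LEFT:
                # Back at the left boundary: start the second pass.
                state, pos = "copy2", pos + 1
            else:
                pos -= 1  # emit nothing, keep moving left
        elif state == "copy2":
            if sym is RIGHT:
                state = "halt"
            else:
                out.append(sym)  # emit the symbol again, move right
                pos += 1

    return "".join(out)


if __name__ == "__main__":
    # Romanized and native-script forms of a Bengali complete reduplication.
    print(total_reduplicate("dhire"))
    print(total_reduplicate("ধীরে"))
```

The leftward `rewind` pass is what distinguishes a 2-way transducer from a 1-way one, which would have to memorize the whole base in its states in order to copy it. Partial reduplication patterns would need different transitions on the second pass and are not covered by this sketch.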
Data availability
(a) Bengali Wikipedia: https://bn.wikipedia.org/wiki/
(b) Bengali Corpus obtained from the Technology Development for Indian Languages Programme (TDIL), MeitY, Govt. of India (https://www.meity.gov.in/).
Acknowledgements
We extend our sincere gratitude to the Technology Development for Indian Languages Programme (TDIL), MeitY, Govt. of India, for generously granting us permission to access their Bengali dataset for the purpose of conducting this study.
Funding
Not Applicable.
Author information
Contributions
The main manuscript text was authored by Abhijit Barman, with Dr. Diganta Saha and Dr. Alok Ranjan Pal providing overall supervision for the project.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to this article.
Ethical approval
Not Applicable.
About this article
Cite this article
Barman, A., Saha, D. & Pal, A.R. Bengali reduplication generation with finite-state transducers (FSTs). Int J Speech Technol 27, 729–737 (2024). https://doi.org/10.1007/s10772-024-10124-6
DOI: https://doi.org/10.1007/s10772-024-10124-6