Bengali reduplication generation with finite-state transducers (FSTs) | International Journal of Speech Technology

Abstract

Reduplication is a highly productive process in Bengali word formation, with significant implications for natural language processing (NLP) applications such as part-of-speech tagging and sentiment analysis. Despite its importance, it has not been extensively explored in computational linguistics, especially for low-resource languages like Bengali. This study first demonstrates that a two-way finite-state transducer (FST) can capture the generation of complete reduplication in Bengali. Second, it shows that generating partial reduplication requires a set of two-way FSTs, because Bengali partial reduplication follows diverse patterns. Third, it shows that the generation process can be used to identify reduplication instances in Bengali text, achieving an F1-score of 88.11%, which outperforms current state-of-the-art methods for this task. These results contribute to the computational representation of reduplication in Bengali and offer potential improvements for NLP tasks in low-resource language settings.
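To make the idea of a two-way FST concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tape of the form `#word$`, romanized input with one character per sound, and a space between the base and its copy. The names `run_2way_fst`, `total_redup`, `echo_redup`, and `identify` are hypothetical, and the echo rule (the copy starts with a fixed `t`) is a toy stand-in for one partial pattern, not a rule taken from the article. The key property is that the read head can move left, so the machine copies the word, rewinds, and copies it again.

```python
# Toy two-way FST simulator. All names and rules are illustrative assumptions.
VOWELS = set("aeiou")


def run_2way_fst(delta, word, start="q0", accept="qf", max_steps=10_000):
    """Run a two-way FST on '#word$'; delta(state, symbol) -> (state, output, move)."""
    tape = "#" + word + "$"          # assumes '#' and '$' do not occur in word
    state, pos, out = start, 0, []
    for _ in range(max_steps):
        state, emitted, move = delta(state, tape[pos])
        out.append(emitted)
        pos += move                  # move is +1 (right), -1 (left), or 0 (stay)
        if state == accept:
            return "".join(out)
    raise RuntimeError("step limit exceeded")


def total_redup(state, sym):
    """Complete reduplication: copy, rewind to the left edge, copy again."""
    if state == "q0":                # sitting on the left boundary '#'
        return ("q1", "", +1)
    if state == "q1":                # first pass: emit each symbol
        return ("q2", " ", -1) if sym == "$" else ("q1", sym, +1)
    if state == "q2":                # rewind silently until '#'
        return ("q3", "", +1) if sym == "#" else ("q2", "", -1)
    if state == "q3":                # second pass: emit each symbol again
        return ("qf", "", 0) if sym == "$" else ("q3", sym, +1)
    raise ValueError(f"unknown state {state!r}")


def echo_redup(state, sym):
    """Toy partial pattern: the copy begins with a fixed 't' (assumed rule)."""
    if state == "q3":                # first symbol of the second pass
        if sym == "$":
            return ("qf", "", 0)
        # replace an initial consonant, or prefix 't' to an initial vowel
        return ("q4", "t" + sym if sym in VOWELS else "t", +1)
    if state == "q4":                # rest of the second pass: plain copy
        return ("qf", "", 0) if sym == "$" else ("q4", sym, +1)
    return total_redup(state, sym)   # q0, q1, q2 behave as in total_redup


# One machine per pattern, mirroring the "set of two-way FSTs" idea.
GENERATORS = {"total": total_redup, "echo": echo_redup}


def identify(w1, w2):
    """Label the bigram (w1, w2) with the first pattern that generates it from w1."""
    for name, delta in GENERATORS.items():
        if run_2way_fst(delta, w1) == f"{w1} {w2}":
            return name
    return None
```

Under these assumptions, `run_2way_fst(total_redup, "bar")` yields `"bar bar"` and `identify("jol", "tol")` returns `"echo"`. Using generation for identification here simply means checking whether a candidate word pair is reproduced by one of the generators; how the article's own identification procedure works in detail is not recoverable from this page.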

Data availability

(a) Bengali Wikipedia: https://bn.wikipedia.org/wiki/ (b) Bengali corpus obtained from the Technology Development for Indian Languages Programme (TDIL), MeitY, Govt. of India (https://www.meity.gov.in/).

Notes

  1. https://www.meity.gov.in/.

  2. https://bn.wikipedia.org/wiki/.

  3. http://www.rabindra-rachanabali.nltr.org.

Acknowledgements

We extend our sincere gratitude to the Technology Development for Indian Languages Programme (TDIL), MeitY, Govt. of India, for generously granting us permission to access their Bengali dataset for the purpose of conducting this study.

Funding

Not Applicable.

Author information

Contributions

The main manuscript text was authored by Abhijit Barman, with Dr. Diganta Saha and Dr. Alok Ranjan Pal providing overall supervision for the project.

Corresponding author

Correspondence to Abhijit Barman.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to this article.

Ethical approval

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Barman, A., Saha, D. & Pal, A.R. Bengali reduplication generation with finite-state transducers (FSTs). Int J Speech Technol 27, 729–737 (2024). https://doi.org/10.1007/s10772-024-10124-6

  • DOI: https://doi.org/10.1007/s10772-024-10124-6
