Abstract
We conduct an empirical study of neural machine translation (NMT) for truly low-resource languages, and present a training curriculum suited to settings where both parallel training data and compute resources are scarce, reflecting the reality of most of the world's languages and of the researchers working on them. Unsupervised NMT, which relies on back-translation (BT) and auto-encoding (AE) objectives, has previously been shown to be largely ineffective for low-resource languages. We demonstrate that leveraging comparable data and code-switching as weak supervision, combined with pre-training on BT and AE objectives, yields substantial improvements for low-resource languages even with only modest compute. The training curriculum proposed in this work achieves BLEU scores that exceed those of supervised NMT trained on the same backbone architecture, showcasing the potential of weakly-supervised NMT for low-resource languages.
G. Kuwanto, A. F. Akyürek, I. C. Tourni, and S. Li contributed equally.
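The abstract describes using code-switching as weak supervision: source sentences are partially rewritten in the target language so the model sees mixed-language input during pre-training. The sketch below is a minimal, hypothetical illustration of this idea (the function name, lexicon, and substitution probability are assumptions, not the paper's actual implementation), where source tokens found in a bilingual lexicon are swapped for their target-language translations with some probability:

```python
import random

def code_switch(tokens, lexicon, p=0.5, seed=0):
    """Produce a code-switched sentence for weak supervision.

    Each source token that appears in the bilingual lexicon is
    replaced by its target-language translation with probability p.
    A fixed seed keeps the augmentation reproducible.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in lexicon and rng.random() < p:
            out.append(lexicon[tok])  # swap in the target-language word
        else:
            out.append(tok)           # keep the source word
    return out

# Hypothetical English-Indonesian example:
sentence = ["the", "cat", "sleeps"]
lexicon = {"cat": "kucing", "sleeps": "tidur"}
mixed = code_switch(sentence, lexicon, p=1.0)
# With p=1.0 every lexicon word is replaced: ["the", "kucing", "tidur"]
```

In practice, the paper pairs such weak supervision with BT and AE pre-training; the sketch only shows how a single code-switched training example could be generated from a word-level lexicon.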
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kuwanto, G., Akyürek, A.F., Tourni, I.C., Li, S., Jones, A., Wijaya, D. (2024). Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages. In: Liu, F., Sadanandan, A.A., Pham, D.N., Mursanto, P., Lukose, D. (eds) PRICAI 2023: Trends in Artificial Intelligence. PRICAI 2023. Lecture Notes in Computer Science, vol 14327. Springer, Singapore. https://doi.org/10.1007/978-981-99-7025-4_39
DOI: https://doi.org/10.1007/978-981-99-7025-4_39
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7024-7
Online ISBN: 978-981-99-7025-4
eBook Packages: Computer Science (R0)