How Much Does Tokenization Affect Neural Machine Translation?

Domingo, Miguel; García-Martínez, Mercedes; Helle, Alexandre; Casacuberta, Francisco; Herranz, Manuel

doi:10.1007/978-3-031-24337-0_38

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13451)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

Tokenization, or segmentation, is a broad concept that covers simple processes such as separating punctuation from words as well as more sophisticated processes such as applying morphological knowledge. Neural Machine Translation (NMT) requires a limited-size vocabulary to keep the computational cost manageable, and enough examples of each word to estimate word embeddings. Separating punctuation and splitting tokens into words or subwords has proven helpful for reducing the vocabulary and increasing the number of examples of each word, which improves translation quality. Tokenization is more challenging for languages that have no separator between words. To assess the impact of tokenization on the quality of the final translation in NMT, we experimented with five tokenizers over ten language pairs. We conclude that tokenization significantly affects the final translation quality and that the best tokenizer differs across language pairs.
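The vocabulary effect described in the abstract can be illustrated with a minimal sketch. It is not one of the five tokenizers evaluated in the paper: the toy corpus, the regular expression, and the helper names (`whitespace_tokenize`, `punctuation_tokenize`, `vocabulary`) are assumptions made purely for illustration. The sketch compares plain whitespace splitting with a tokenizer that separates punctuation from words, and counts how many distinct types each one produces.

```python
import re
from collections import Counter

# Toy corpus, assumed for illustration only (not data from the paper).
corpus = [
    "Hello, world!",
    "Hello world.",
    "Hello! Is this the world?",
    "The world, the whole world.",
]

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()

def punctuation_tokenize(text):
    """Emit runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def vocabulary(tokenizer, sentences):
    """Count how often each token type occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenizer(sentence))
    return counts

ws_vocab = vocabulary(whitespace_tokenize, corpus)
punct_vocab = vocabulary(punctuation_tokenize, corpus)

print("whitespace types: ", len(ws_vocab))
print("punctuation types:", len(punct_vocab))
print("examples of 'world' (whitespace): ", ws_vocab["world"])
print("examples of 'world' (punctuation):", punct_vocab["world"])
```

With whitespace splitting, "world!", "world.", "world?" and "world," are all distinct types, so the bare word "world" has few or no training examples of its own. Once punctuation is separated, these collapse into a single type with more occurrences, which is the effect the abstract attributes to tokenization. On a corpus this small the reduction in vocabulary size is modest; the sketch only shows the direction of the effect.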

Acknowledgments

The research leading to these results has received funding from the Centro para el Desarrollo Tecnológico Industrial (CDTI) and the European Union through Programa Operativo de Crecimiento Inteligente (EXPEDIENT: IDI-20170964). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used for part of this research.

Author information

Authors and Affiliations

Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia, Spain
Miguel Domingo & Francisco Casacuberta
Pangeanic/B.I Europa PangeaMT Technologies Division, Valencia, Spain
Mercedes García-Martínez, Alexandre Helle & Manuel Herranz

Corresponding author

Correspondence to Miguel Domingo.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

Cite this paper

Domingo, M., García-Martínez, M., Helle, A., Casacuberta, F., Herranz, M. (2023). How Much Does Tokenization Affect Neural Machine Translation? In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_38

DOI: https://doi.org/10.1007/978-3-031-24337-0_38
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science, Computer Science (R0)
