Abstract
Recently, many studies have shown the effectiveness of Bidirectional Encoder Representations from Transformers (BERT) in various Natural Language Processing (NLP) tasks. In particular, English spelling correction using an Encoder-Decoder architecture that takes advantage of BERT has achieved state-of-the-art results. However, to our knowledge, there is no such implementation for Vietnamese yet. Therefore, in this study, we propose a combination of the Transformer architecture (state-of-the-art for Encoder-Decoder models) and BERT to deal with Vietnamese spelling correction. The experimental results show that our model outperforms other approaches as well as the Google Docs Spell Checking tool, achieving an 86.24 BLEU score on this task.
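To make the reported metric concrete, the following is a minimal sketch of corpus-level BLEU (modified n-gram precision up to 4-grams with a brevity penalty, following Papineni et al.). It assumes a single reference per sentence and whitespace tokenization; the function names `ngrams` and `corpus_bleu` are illustrative, and the exact tokenization and BLEU configuration used in the paper are not stated here, so this is not necessarily how the 86.24 figure was computed.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption: sentences are tokenized by whitespace (for Vietnamese this
    means syllable-level tokens).
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["hôm nay tôi đi học ở trường mới"]
    corrected = ["hôm nay tôi đi học ở trường mơi"]  # last syllable still misspelled
    print(round(corpus_bleu(reference, reference), 2))
    print(round(corpus_bleu(corrected, reference), 2))
```

Under these assumptions, an output identical to the reference scores 100, while the eight-syllable sentence with one remaining misspelling scores about 84.09, which shows how strongly a single uncorrected error affects the higher-order n-gram precisions.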
Notes
The tool can be found on the Google Docs website (https://docs.google.com/). We collected samples using a Selenium-based web browser behavior simulator that drives the Google spell checking tool and applies all of its suggested corrections.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ngo, T.H., Tran, H.D., Huynh, T., Hoang, K. (2022). A Combination of BERT and Transformer for Vietnamese Spelling Correction. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13757. Springer, Cham. https://doi.org/10.1007/978-3-031-21743-2_43
DOI: https://doi.org/10.1007/978-3-031-21743-2_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21742-5
Online ISBN: 978-3-031-21743-2
eBook Packages: Computer Science, Computer Science (R0)