
Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition

  • Conference paper
  • First Online:
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12824)


Abstract

The attention-based encoder-decoder framework is widely used in scene text recognition. However, current state-of-the-art (SOTA) methods leave room for improvement in two respects: the efficient use of local visual and global context information from the input text image, and a robust correlation between the scene processing module (encoder) and the text processing module (decoder). In this paper, we propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck. In the encoder module, the local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map. In the decoder module, two methods are used to enhance the correlation between the scene and text feature spaces. 1) The decoder initialization is guided by the holistic feature and the global glimpse vector exported from the encoder. 2) The feature-enriched glimpse vector produced by Multi-Head General Attention assists the RNN iteration and the character prediction at each time step. We also design a Layernorm-Dropout LSTM cell to improve the model's generalization to variable texts. Extensive experiments on public benchmarks demonstrate the advantageous performance of RCEED on scene text recognition, especially on irregular text. The source code is available at https://github.com/Mona9955/RCEED-ICDAR2021.
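The abstract does not give the exact formulation of the Layernorm-Dropout LSTM cell. As a minimal NumPy sketch of the general idea, the step below applies layer normalization to the gate pre-activations and dropout to the candidate update; the function name `ln_dropout_lstm_cell` and the placement of both operations are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned affine, for brevity)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ln_dropout_lstm_cell(x, h_prev, c_prev, W, U, b, drop_p=0.1, rng=None):
    """One step of an LSTM cell with layer-normalized gate pre-activations
    and dropout on the candidate update (hypothetical placement)."""
    z = x @ W + h_prev @ U + b                 # stacked pre-activations, shape (4*H,)
    i, f, o, g = np.split(z, 4, axis=-1)       # input, forget, output gates; candidate
    i, f, o = sigmoid(layer_norm(i)), sigmoid(layer_norm(f)), sigmoid(layer_norm(o))
    g = np.tanh(layer_norm(g))
    if rng is not None and drop_p > 0:         # dropout on the candidate as a regularizer
        mask = (rng.random(g.shape) >= drop_p) / (1.0 - drop_p)
        g = g * mask
    c = f * c_prev + i * g
    h = o * np.tanh(layer_norm(c))
    return h, c

# Tiny usage example with random weights
rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.standard_normal((D, 4 * H)) * 0.1
U = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
h, c = ln_dropout_lstm_cell(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b, rng=rng)
print(h.shape, c.shape)  # (16,) (16,)
```

Normalizing the pre-activations keeps gate statistics stable across time steps, which is the usual motivation for combining layer normalization with recurrent cells.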



Acknowledgments

This work is supported by the National Natural Science Foundation of China (61976214, 61721004, 61633021) and the Science and Technology Project of SGCC "Research on feature recognition and prediction of typical ice and wind disasters for transmission lines based on small-sample machine learning methods".

Author information


Corresponding author

Correspondence to Mengmeng Cui.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Cui, M., Wang, W., Zhang, J., Wang, L. (2021). Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol. 12824. Springer, Cham. https://doi.org/10.1007/978-3-030-86337-1_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-86337-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86336-4

  • Online ISBN: 978-3-030-86337-1

  • eBook Packages: Computer Science (R0)
