Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition

Cui, Mengmeng; Wang, Wei; Zhang, Jinjin; Wang, Liang

doi:10.1007/978-3-030-86337-1_11

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12824)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

The attention-based encoder-decoder framework is widely used in scene text recognition. However, current state-of-the-art (SOTA) methods leave room for improvement in how efficiently they use the local visual and global context information of the input text image, and in how robustly the scene processing module (encoder) is correlated with the text processing module (decoder). In this paper, we propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck. In the encoder module, local visual features, global context features, and position information are aligned and fused to generate a small-size comprehensive feature map. In the decoder module, two methods are used to enhance the correlation between the scene and text feature spaces. 1) The decoder initialization is guided by the holistic feature and the global glimpse vector exported from the encoder. 2) The feature-enriched glimpse vector produced by the Multi-Head General Attention is used to assist the RNN iteration and the character prediction at each time step. We also design a Layernorm-Dropout LSTM cell to improve the model's generalization towards changeable texts. Extensive experiments on the benchmarks demonstrate the advantageous performance of RCEED in scene text recognition tasks, especially on irregular text. The source code will be available at https://github.com/Mona9955/RCEED-ICDAR2021.
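The abstract does not spell out how the Multi-Head General Attention computes its glimpse vector, so the following is only a minimal sketch under stated assumptions. It uses the "general" (bilinear) score of Luong et al., score = h^T W f, applied independently to equal slices of the decoder state and of each encoder feature vector, and concatenates the per-head weighted sums. The function name `multi_head_general_attention`, the per-head slicing, and all dimensions are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_head_general_attention(hidden, features, weights):
    """Return (glimpse, per-head attention weights).

    hidden   : decoder state at one time step, length d
    features : encoder feature vectors (flattened feature map), each length d
    weights  : one (d/h x d/h) matrix per head, the W in score = h^T W f
    All names and shapes here are assumptions made for this sketch.
    """
    num_heads = len(weights)
    d_head = len(hidden) // num_heads
    glimpse, all_alphas = [], []
    for k, W in enumerate(weights):
        lo, hi = k * d_head, (k + 1) * d_head
        h_k = hidden[lo:hi]
        f_k = [f[lo:hi] for f in features]
        # General (bilinear) score for every position of the feature map.
        scores = [dot(h_k, matvec(W, f)) for f in f_k]
        alphas = softmax(scores)
        # Attention-weighted sum of this head's feature slices.
        head_glimpse = [
            sum(a * f[j] for a, f in zip(alphas, f_k)) for j in range(d_head)
        ]
        glimpse.extend(head_glimpse)
        all_alphas.append(alphas)
    return glimpse, all_alphas


# Toy inputs: random values stand in for a trained model's tensors.
random.seed(0)
d, num_heads, num_positions = 8, 2, 5
d_head = d // num_heads
hidden = [random.uniform(-1, 1) for _ in range(d)]
features = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(num_positions)]
weights = [
    [[random.uniform(-1, 1) for _ in range(d_head)] for _ in range(d_head)]
    for _ in range(num_heads)
]
glimpse, alphas = multi_head_general_attention(hidden, features, weights)
```

In a decoder of the kind the abstract describes, a glimpse vector like this would be fed, together with the previous character embedding, into the recurrent cell and the character classifier at each time step; how RCEED combines them exactly is not stated here.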

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61976214, 61721004, 61633021) and by the Science and Technology Project of SGCC "Research on feature recognition and prediction of typical ice and wind disaster for transmission lines based on small sample machine learning method".

Author information

Authors and Affiliations

Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China
Mengmeng Cui, Wei Wang, Jinjin Zhang & Liang Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China
Liang Wang

Corresponding author

Correspondence to Mengmeng Cui.

Editor information

Editors and Affiliations

Universitat Autònoma de Barcelona, Barcelona, Spain
Josep Lladós
Lehigh University, Bethlehem, PA, USA
Daniel Lopresti
Kyushu University, Fukuoka-shi, Japan
Seiichi Uchida

About this paper

Cite this paper

Cui, M., Wang, W., Zhang, J., Wang, L. (2021). Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12824. Springer, Cham. https://doi.org/10.1007/978-3-030-86337-1_11

DOI: https://doi.org/10.1007/978-3-030-86337-1_11
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86336-4
Online ISBN: 978-3-030-86337-1
eBook Packages: Computer Science, Computer Science (R0)
