Abstract
Scene text image super-resolution (STISR) is a popular research topic due to its great potential for improving downstream recognition performance. Many recent STISR approaches use recognition feedback to guide the reconstruction process, but their effectiveness is often limited by inaccurate recognition feedback and insufficient use of visual priors. To address these challenges, we propose GARDEN, a novel GenerAtive pRior guiDEd Network that surpasses existing practice by exploiting enriched generative priors for precise and reliable guidance in STISR. GARDEN leverages a pre-trained Vision Transformer (ViT) as a generative style bank, which provides diverse image priors and, in turn, helps generate reliable text priors. The network can therefore draw on prior information from both the visual and semantic domains for the final reconstruction, yielding more effective learning of both texture generation and text recovery. In addition, GARDEN introduces the multi-scale sequential residual block (MS-SRB), a simple, efficient, and flexible structure that maximizes the utilization of generative priors. By combining enriched generative priors with this novel architecture, GARDEN encodes, transfers, and reconstructs super-resolution text images better than the best previous methods in terms of both fidelity and recognition accuracy, as shown in Fig. 1. Code will be publicly available.
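The abstract does not detail the MS-SRB internals, but the idea of processing features at multiple scales with a residual connection can be illustrated with a minimal, purely hypothetical sketch. Here, average pooling and nearest-neighbour upsampling stand in for the paper's actual (unspecified) downsampling and upsampling operators, and the equal-weight fusion is an assumption for illustration only:

```python
import numpy as np

def avg_pool(x, k):
    # Downsample an HxW feature map by factor k via average pooling.
    h, w = x.shape
    return x[:h // k * k, :w // k * k].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def upsample(x, k):
    # Nearest-neighbour upsampling by factor k (restores the original resolution).
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def ms_srb_sketch(x, scales=(1, 2, 4)):
    # Hypothetical multi-scale block: run each scale branch, average the
    # branch outputs, then add the input back as a residual connection.
    fused = np.zeros_like(x)
    for s in scales:
        branch = x if s == 1 else upsample(avg_pool(x, s), s)
        fused += branch / len(scales)
    return x + fused  # residual connection

x = np.arange(64, dtype=float).reshape(8, 8)
y = ms_srb_sketch(x)  # same spatial size as the input
```

This is only a structural sketch: in the actual network the branches would be learned convolutions over multi-channel features, not fixed pooling operators.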
Y. Kong and W. Ma contributed equally to this work.
Acknowledgment
This research is supported in part by the National Natural Science Foundation of China (Grant Nos. 62441604 and 61936003).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Kong, Y., Ma, W., Jin, L., Xue, Y. (2024). GARDEN: Generative Prior Guided Network for Scene Text Image Super-Resolution. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14808. Springer, Cham. https://doi.org/10.1007/978-3-031-70549-6_12
Print ISBN: 978-3-031-70548-9
Online ISBN: 978-3-031-70549-6
eBook Packages: Computer Science, Computer Science (R0)