
GARDEN: Generative Prior Guided Network for Scene Text Image Super-Resolution

  • Conference paper
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Abstract

Scene text image super-resolution (STISR) is a popular research topic due to its great potential for improving downstream recognition performance. Many recent STISR approaches have utilized recognition feedback to guide the reconstruction process. However, their effectiveness is often limited by inaccurate recognition feedback and insufficient use of visual priors. To address these challenges, we propose a novel GenerAtive pRior guiDEd Network, namely GARDEN, which surpasses existing practices by exploiting enriched generative priors for precise and reliable guidance towards STISR. Innovatively, GARDEN leverages a pre-trained Vision Transformer (ViT) as the generative style bank, which provides diverse image priors and further assists in generating reliable text priors. This allows the network to leverage prior information from both visual and semantic domains for the final reconstruction, leading to more efficient learning of both texture generation and text recovery. In addition, GARDEN introduces multi-scale sequential residual block (MS-SRB), a simple, efficient, and flexible structure for achieving the maximal utilization of generative priors. By leveraging enriched generative priors within a novel architecture design, GARDEN is better suited to encode, transfer, and reconstruct super-resolution text images than the best previous methods in terms of both fidelity and recognition accuracy, as shown in Fig. 1. Code will be publicly available.
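The MS-SRB itself is not specified in this preview, so the following is only a rough, dependency-light sketch of what "multi-scale sequential residual fusion of generative prior features" might look like. All names, the pooling scheme, and the identity placeholder for the learned convolution are assumptions for illustration, not the paper's actual design:

```python
import numpy as np

def conv_placeholder(x):
    # Stand-in for a learned convolution; identity keeps the sketch
    # dependency-free. A real block would apply trained weights here.
    return x

def ms_srb(feat, priors, scales=(1, 2, 4)):
    """Hypothetical multi-scale sequential residual block (MS-SRB) sketch.

    feat   : (H, W, C) feature map being super-resolved.
    priors : (H, W, C) generative prior features (e.g. from a ViT style bank).

    At each scale the prior is average-pooled to a coarser grid, upsampled
    back by repetition, passed through a (placeholder) convolution, and
    added to the running feature as a residual -- sequentially, coarse
    context accumulating on top of fine detail.
    """
    out = feat
    h, w, c = feat.shape
    for s in scales:
        # Average-pool the prior over s x s windows (cropping ragged edges).
        hs, ws = h // s, w // s
        pooled = priors[:hs * s, :ws * s].reshape(hs, s, ws, s, c).mean(axis=(1, 3))
        # Nearest-neighbour upsample back toward (H, W).
        up = np.repeat(np.repeat(pooled, s, axis=0), s, axis=1)
        # Edge-pad if the pooling cropped boundary rows/columns.
        up = np.pad(up, ((0, h - up.shape[0]), (0, w - up.shape[1]), (0, 0)),
                    mode="edge")
        out = out + conv_placeholder(up)  # sequential residual fusion
    return out
```

With a zero prior the block reduces to the identity on `feat`, which is the usual sanity check for residual designs: guidance can only add information, never destroy the input features.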

Y. Kong and W. Ma—The authors contributed equally to this work.




Acknowledgment

This research is supported in part by National Natural Science Foundation of China (Grant No.: 62441604, 61936003).

Author information


Corresponding author

Correspondence to Lianwen Jin .


Electronic supplementary material

Supplementary material 1 (PDF, 1864 KB)


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kong, Y., Ma, W., Jin, L., Xue, Y. (2024). GARDEN: Generative Prior Guided Network for Scene Text Image Super-Resolution. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14808. Springer, Cham. https://doi.org/10.1007/978-3-031-70549-6_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-70549-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70548-9

  • Online ISBN: 978-3-031-70549-6

  • eBook Packages: Computer Science, Computer Science (R0)
