Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment | SpringerLink
Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment

  • Conference paper
Advances in Information and Communication (FICC 2023)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 652)

Abstract

Visual (image, video) quality assessment can be modelled by visual features in different domains, e.g., the spatial, frequency, and temporal domains. Perceptual mechanisms in the human visual system (HVS) play a crucial role in the generation of quality perception. This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures. A lightweight module for multi-stage channel attention is integrated into the Swin (shifted window) Transformer. Such a module can represent appropriate perceptual mechanisms in image quality assessment (IQA) to build an accurate IQA model. Meanwhile, representative features for image quality perception in the spatial and frequency domains can also be derived from the IQA model; these are then fed into another windowed transformer architecture for video quality assessment (VQA). The VQA model efficiently reuses attention information across local windows to tackle the high time and memory complexity of the original transformer. Experimental results on both large-scale IQA and VQA databases demonstrate that the proposed quality assessment models outperform other state-of-the-art models by large margins.
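The abstract describes a lightweight multi-stage channel attention module integrated into a Swin-style transformer, but the module's exact design is not given on this page. As a rough illustration only, a squeeze-and-excitation-style channel gate (a common lightweight form of channel attention; the function name, weight shapes, and reduction ratio `R` below are all illustrative assumptions, not the authors' implementation) can be sketched as:

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation-style channel gate (illustrative sketch).
    features: (C, H, W) feature map from one transformer stage."""
    c = features.shape[0]
    squeeze = features.reshape(c, -1).mean(axis=1)      # (C,) global average pool
    hidden = np.maximum(0.0, w1 @ squeeze)              # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))        # sigmoid gates in (0, 1)
    return features * gates[:, None, None]              # rescale each channel

rng = np.random.default_rng(0)
C, H, W, R = 8, 4, 4, 2                                 # R: bottleneck reduction ratio
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // R, C))                   # squeeze projection
w2 = rng.standard_normal((C, C // R))                   # excitation projection
y = channel_attention(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

In a framework like the one described, such a gate would reweight the channel responses at each transformer stage before quality regression; the paper's actual module may differ in structure and placement.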



Author information

Corresponding author

Correspondence to Junyong You.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

You, J., Zhang, Z. (2023). Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 652. Springer, Cham. https://doi.org/10.1007/978-3-031-28073-3_33
