Abstract
Machine visual intelligence has advanced rapidly in recent years. Large-scale, high-quality image and video datasets have greatly empowered learning-based machine vision models, especially deep-learning models. In practice, however, images and videos are usually compressed before being analyzed when transmission bandwidth or storage is limited, causing a noticeable performance loss in vision models. In this work, we broadly investigate how image and video coding affects machine vision performance. Based on this investigation, we propose Just Recognizable Distortion (JRD), the maximum compression-induced distortion beyond which the performance of a machine vision model drops to an unacceptable level. We build a large-scale JRD-annotated dataset containing over 340,000 images for various machine vision tasks and study the factors that lead to different JRDs. Furthermore, we establish an ensemble-learning-based framework, consisting of multiple binary classifiers, to predict JRD for diverse vision tasks under few-reference and no-reference conditions, which improves prediction accuracy. Experiments demonstrate that the proposed JRD-guided image and video coding significantly improves both compression efficiency and machine vision performance: applying the predicted JRD achieves remarkably higher task accuracy while saving a large number of bits.
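The core idea in the abstract can be illustrated with a minimal sketch: for a fixed vision model, the JRD of an image is the most severe compression level at which the model's output is still acceptable (e.g., the predicted label stays correct). The code below is a hedged toy illustration of that search, not the paper's method; all names (`find_jrd`, the stand-in `compress` and `predict` functions) are hypothetical, and a real setting would use an actual codec and a trained model.

```python
# Toy illustration of the JRD concept: scan compression levels from mild to
# severe and return the last level at which recognition still succeeds.
# All functions here are illustrative stand-ins, not the paper's pipeline.

def find_jrd(image, label, compress, predict, levels):
    """Return the last level in `levels` (ordered mild -> severe) at which
    `predict` still yields `label`, or None if even the mildest level
    breaks recognition."""
    jrd = None
    for level in levels:
        if predict(compress(image, level)) == label:
            jrd = level
        else:
            break  # recognition fails; the previous level was the JRD
    return jrd

# Stand-ins: "compression" removes detail; the "model" recognizes the
# image only while enough detail remains (threshold is arbitrary).
compress = lambda img, lvl: max(img - lvl, 0)
predict = lambda img: "cat" if img >= 40 else "unknown"

print(find_jrd(100, "cat", compress, predict, levels=[10, 30, 50, 70]))  # -> 50
```

In the paper's actual framework, this per-image search is replaced by prediction: an ensemble of binary classifiers estimates, for each candidate compression level, whether recognition would still succeed, so the JRD can be obtained without exhaustively encoding and evaluating every level.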












Notes
That is, for some images, lower image quality can lead to more accurate category predictions from the machine classifier. Nevertheless, these images are still ignored as explained in Sect. 4, and they do not affect the conclusions of this section or the paper.
That is, these person objects cannot be detected at any of the selected compression levels.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62072008, 62025101), the PKU-Baidu Fund (2019BD003), and the High-performance Computing Platform of Peking University, all of which are gratefully acknowledged.
Additional information
Communicated by Dong Xu.
Cite this article
Zhang, Q., Wang, S., Zhang, X. et al. Just Recognizable Distortion for Machine Vision Oriented Image and Video Coding. Int J Comput Vis 129, 2889–2906 (2021). https://doi.org/10.1007/s11263-021-01505-4