Abstract
For semantic segmentation, most existing real-time deep models are trained on each frame independently and may therefore produce temporally inconsistent results when applied to a video sequence. A few methods take the correlations within a video sequence into account, e.g., by propagating results to neighbouring frames using optical flow, or by extracting frame representations from multi-frame information, which can lead to inaccurate results or unbalanced latency. In contrast, here we explicitly impose temporal consistency among frames as extra constraints during training, while processing each frame independently in the inference phase; thus no computational overhead is introduced at inference time. Compact models are employed for real-time execution. To narrow the performance gap between compact models and large models, new temporal knowledge distillation methods are designed. Weighing accuracy, temporal smoothness and efficiency, our proposed method outperforms previous keyframe-based methods as well as the corresponding baselines trained on each frame independently, on benchmark datasets including Cityscapes and CamVid. Code is available at: https://git.io/vidseg.
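To make the abstract's key idea concrete, supervising temporal consistency at training time while keeping per-frame inference, below is a minimal PyTorch sketch of a motion-guided consistency loss. This is our own illustration, not the authors' implementation: the function names, the per-pixel KL formulation, and the precomputed optical flow and occlusion/validity mask are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def warp(x, flow):
    """Warp x (N, C, H, W) with a dense optical-flow field flow (N, 2, H, W),
    sampling each output pixel from its flow-displaced source location."""
    _, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=x.dtype),
        torch.arange(w, device=x.device, dtype=x.dtype),
        indexing="ij",
    )
    # Displace the pixel grid by the flow and normalize to [-1, 1]
    # as required by grid_sample.
    grid_x = 2.0 * (xs + flow[:, 0]) / max(w - 1, 1) - 1.0
    grid_y = 2.0 * (ys + flow[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)


def temporal_loss(logits_t, logits_prev, flow, valid_mask):
    """Penalize disagreement between frame t's prediction and the
    flow-warped prediction of frame t-1, ignoring occluded pixels."""
    warped = warp(logits_prev, flow)
    kl = F.kl_div(
        F.log_softmax(logits_t, dim=1),
        F.softmax(warped, dim=1),
        reduction="none",
    ).sum(dim=1)  # per-pixel KL divergence, shape (N, H, W)
    return (kl * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```

In such a setup the flow could come from a pre-trained estimator such as FlowNet2 [34], and since the loss is applied only during training, the per-frame inference cost of the segmentation network is unchanged.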
Notes
1. The details of the ConvLSTM computations are given in [28]; we also include the key equations in Section S1.2 of the supplementary material.
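For convenience, the standard ConvLSTM update equations from [28] are reproduced below, where $*$ denotes convolution, $\circ$ the Hadamard product, and $\sigma$ the sigmoid function:

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right)\\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right)\\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
$$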
References
Bian, J.W., Zhan, H., Wang, N., Chin, T.J., Shen, C., Reid, I.: Unsupervised depth learning in challenging indoor video: weak rectification to rescue. arXiv:2006.02708 (2020). Comp. Res. Repository
Bian, J., et al.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 35–45 (2019)
Brostow, G.J., Shotton, J., Fauqueur, J., Cipolla, R.: Segmentation and recognition using structure from motion point clouds. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5302, pp. 44–57. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88682-2_5
Chandra, S., Couprie, C., Kokkinos, I.: Deep spatio-temporal random fields for efficient video segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8915–8924 (2018)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018)
Cordts, M., et al.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Fayyaz, M., Saffar, M.H., Sabokrou, M., Fathy, M., Huang, F., Klette, R.: STFCN: spatio-temporal fully convolutional neural network for semantic segmentation of street scenes. In: Chen, C.-S., Lu, J., Ma, K.-K. (eds.) ACCV 2016. LNCS, vol. 10116, pp. 493–509. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54407-6_33
Gadde, R., Jampani, V., Gehler, P.V.: Semantic video CNNs through representation warping. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4453–4462 (2017)
Gupta, A., Johnson, J., Alahi, A., Fei-Fei, L.: Characterizing and improving stability in neural style transfer. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4067–4076 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, T., Shen, C., Tian, Z., Gong, D., Sun, C., Yan, Y.: Knowledge adaptation for efficient semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 578–587 (2019)
Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv:1503.02531 (2015). Comp. Res. Repository
Jain, S., Wang, X., Gonzalez, J.E.: Accel: a corrective fusion network for efficient semantic segmentation on video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8866–8875 (2019)
Lai, W.-S., Huang, J.-B., Wang, O., Shechtman, E., Yumer, E., Yang, M.-H.: Learning blind video temporal consistency. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 179–195. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_11
Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. ACM Trans. Graph. 23(3), 689–694 (2004)
Li, Q., Jin, S., Yan, J.: Mimicking very efficient network for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7341–7349 (2017)
Li, Y., Shi, J., Lin, D.: Low-latency video semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5997–6005 (2018)
Liu, S., Wang, C., Qian, R., Yu, H., Bao, R., Sun, Y.: Surveillance video parsing with single frame supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 413–421 (2017)
Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z., Wang, J.: Structured knowledge distillation for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2604–2613 (2019)
Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 561–580. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_34
Miksik, O., Munoz, D., Bagnell, J.A., Hebert, M.: Efficient temporal consistency for streaming video scene analysis. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 133–139. IEEE (2013)
Nilsson, D., Sminchisescu, C.: Semantic video segmentation by gated recurrent flow propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6819–6828 (2018)
Orsic, M., Kreso, I., Bevandic, P., Segvic, S.: In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Reda, F., Pottorff, R., Barker, J., Catanzaro, B.: FlowNet2-PyTorch: PyTorch implementation of FlowNet 2.0: evolution of optical flow estimation with deep networks (2017). https://github.com/NVIDIA/flownet2-pytorch
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. arXiv:1412.6550 (2014). Comp. Res. Repository
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Shelhamer, E., Rakelly, K., Hoffman, J., Darrell, T.: Clockwork convnets for video semantic segmentation. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 852–868. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_69
Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 802–810 (2015)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Sun, K., et al.: High-resolution representations for labeling pixels and regions. arXiv:1904.04514 (2019). Comp. Res. Repository
Tian, Z., He, T., Shen, C., Yan, Y.: Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3126–3135 (2019)
Xu, Y.S., Fu, T.J., Yang, H.K., Lee, C.Y.: Dynamic video segmentation network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6556–6565 (2018)
Yao, C.H., Chang, C.Y., Chien, S.Y.: Occlusion-aware video temporal consistency. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 777–785. ACM (2017)
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: BiSeNet: bilateral segmentation network for real-time semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 334–349. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_20
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: Proceedings of the International Conference on Learning Representations (2017)
Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 418–434. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_25
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890 (2017)
Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218 (2018)
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358 (2017)
Zhu, Y., et al.: Improving semantic segmentation via video propagation and label relaxation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8856–8865 (2019)
Acknowledgements
Correspondence should be addressed to CS. CS was supported in part by the ARC DP project ‘Deep learning that scales’.
Electronic supplementary material
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Shen, C., Yu, C., Wang, J. (2020). Efficient Semantic Video Segmentation with Per-Frame Inference. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12355. Springer, Cham. https://doi.org/10.1007/978-3-030-58607-2_21
DOI: https://doi.org/10.1007/978-3-030-58607-2_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58606-5
Online ISBN: 978-3-030-58607-2
eBook Packages: Computer Science, Computer Science (R0)