Abstract
In recent years, supervised video summarization has made tremendous progress by treating it as a sequence-to-sequence learning task. However, traditional recurrent neural networks (RNNs) struggle to model long sequences, while transformers require a large number of parameters for sequence modeling. To address this issue, we propose an efficient U-shaped transformer for video summarization, which we call "Uformer". Specifically, Uformer consists of three key components: an embedding module, Uformer blocks, and a prediction head. First, the image feature sequence is extracted by a pre-trained deep convolutional network and projected by a linear embedding; the frame-to-frame feature differences are projected by another linear embedding, and the two are concatenated to form a two-stream embedding. Second, we stack multiple transformer layers into a U-shaped block that integrates the representations learned in previous layers; this multi-scale design not only captures longer sequence information but also reduces the number of parameters and computations. Finally, the prediction head regresses the locations of keyframes and learns the corresponding classification scores, and non-maximum suppression (NMS) is applied as post-processing to obtain the final video summary. Uformer improves the F-score from 50.2% to 53.9% (+3.7%) on the SumMe dataset and from 62.1% to 63.0% (+0.9%) on the TVSum dataset. Our proposed model has 0.85M parameters, only 32.32% of DR-DSN's parameter count.
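The NMS post-processing step mentioned above can be sketched as a greedy 1D suppression over scored temporal segments. This is a minimal illustration of the general technique, not the paper's implementation; the function name `nms_1d`, the `(start, end)` frame-interval representation, and the IoU threshold value are assumptions.

```python
def nms_1d(segments, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over temporal segments.

    segments: list of (start, end) frame intervals.
    scores: one confidence score per segment.
    Returns indices of kept segments, highest score first.
    """
    # Sort candidate indices by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        remaining = []
        for i in order:
            s1, e1 = segments[best]
            s2, e2 = segments[i]
            # Temporal intersection-over-union of the two intervals.
            inter = max(0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            iou = inter / union if union > 0 else 0.0
            # Drop candidates overlapping the kept segment too much.
            if iou <= iou_threshold:
                remaining.append(i)
        order = remaining
    return keep
```

In a detect-to-summarize pipeline, the kept intervals would then be assembled into the final summary under a length budget; the threshold trades off redundancy against coverage.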
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61772352; the National Key Research and Development Project under Grant Nos. 2020YFB1711800 and 2020YFB1707900; the Science and Technology Project of Sichuan Province under Grant Nos. 2019YFG0400, 2021YFG0152, 2020YFG0479, 2020YFG0322, and 2020GFW035; and the R&D Project of Chengdu City under Grant No. 2019-YF05-01790-GX.
Cite this article
Chen, Y., Guo, B., Shen, Y. et al. Video summarization with u-shaped transformer. Appl Intell 52, 17864–17880 (2022). https://doi.org/10.1007/s10489-022-03451-1