Abstract
In recent years, supervised video summarization has made tremendous progress by treating it as a sequence-to-sequence learning task. However, traditional recurrent neural networks (RNNs) struggle to model long sequences, while transformers require a large number of parameters for sequence modeling. To address this issue, we propose an efficient U-shaped transformer for video summarization, which we call "Uformer". Specifically, Uformer consists of three key components: an embedding module, Uformer blocks, and a prediction head. First, the image feature sequence is extracted by a pre-trained deep convolutional network and projected by a linear embedding; the frame-to-frame feature differences are projected by another linear embedding, and the two are concatenated to form a two-stream embedding. Second, we stack multiple transformer layers into a U-shaped block that integrates the representations learned in previous layers; this multi-scale design not only captures longer sequence information but also reduces the number of parameters and computations. Finally, the prediction head regresses the locations of keyframes and learns the corresponding classification scores, and non-maximum suppression (NMS) is applied as post-processing to obtain the final video summary. Uformer improves the F-score from 50.2% to 53.9% (+3.7%) on the SumMe dataset and from 62.1% to 63.0% (+0.9%) on the TVSum dataset. Our proposed model has 0.85M parameters, only 32.32% of DR-DSN's parameter count.
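The NMS post-processing step mentioned above can be sketched as a greedy 1D suppression over scored temporal segments. This is a minimal illustration of the general technique, not the paper's implementation; the function name `nms_1d`, the `(start, end)` frame-interval representation, and the IoU threshold value are assumptions.

```python
def nms_1d(segments, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over temporal segments.

    segments: list of (start, end) frame intervals.
    scores: one confidence score per segment.
    Returns indices of kept segments, highest score first.
    """
    # Sort candidate indices by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        remaining = []
        for i in order:
            s1, e1 = segments[best]
            s2, e2 = segments[i]
            # Temporal intersection-over-union of the two intervals.
            inter = max(0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            iou = inter / union if union > 0 else 0.0
            # Drop candidates overlapping the kept segment too much.
            if iou <= iou_threshold:
                remaining.append(i)
        order = remaining
    return keep
```

In a detect-to-summarize pipeline, the kept intervals would then be assembled into the final summary under a length budget; the threshold trades off redundancy against coverage.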
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61772352; the National Key Research and Development Project under Grant Nos. 2020YFB1711800 and 2020YFB1707900; the Science and Technology Project of Sichuan Province under Grant Nos. 2019YFG0400, 2021YFG0152, 2020YFG0479, 2020YFG0322, and 2020GFW035; and the R&D Project of Chengdu City under Grant No. 2019-YF05-01790-GX.
Cite this article
Chen, Y., Guo, B., Shen, Y. et al. Video summarization with u-shaped transformer. Appl Intell 52, 17864–17880 (2022). https://doi.org/10.1007/s10489-022-03451-1