
Video summarization with u-shaped transformer


Abstract

In recent years, supervised video summarization has made tremendous progress by treating it as a sequence-to-sequence learning task. However, traditional recurrent neural networks (RNNs) have limitations in modeling long sequences, and using a transformer for sequence modeling requires a large number of parameters. To address this issue, we propose an efficient U-shaped transformer for video summarization, which we call "Uformer". Specifically, Uformer consists of three key components: an embedding module, a Uformer block, and a prediction head. First, the image feature sequence is extracted by a pre-trained deep convolutional network and projected by a linear embedding; the frame-to-frame feature differences are projected by another linear embedding, and the two streams are concatenated to form a two-stream embedding. Second, we stack multiple transformer layers into a U-shaped block to integrate the representations learned at previous layers; the multi-scale Uformer can not only capture longer sequence information but also reduce the number of parameters and computations. Finally, the prediction head regresses the locations of keyframes and learns the corresponding classification scores. Uformer is combined with non-maximum suppression (NMS) as post-processing to produce the final video summary. We improve the F-score from 50.2% to 53.9% (+3.7%) on the SumMe dataset and from 62.1% to 63.0% (+0.9%) on the TVSum dataset. Our proposed model has only 0.85M parameters, which is 32.32% of DR-DSN's parameter count.
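The abstract walks through a full pipeline: a two-stream linear embedding over frame features and their differences, a U-shaped stack of transformer layers operating at multiple temporal scales, a prediction head that scores and localizes keyframes, and NMS post-processing. The following is a minimal PyTorch sketch of that pipeline. All module names, dimensions, and the pooling/upsampling scheme are illustrative assumptions, not the paper's exact implementation; `temporal_nms` is the standard 1D non-maximum suppression algorithm, not code from the paper.

```python
# Minimal sketch of the Uformer pipeline described in the abstract.
# Dimensions and layer choices are assumptions for illustration.
import torch
import torch.nn as nn

class TwoStreamEmbedding(nn.Module):
    """Linear embeddings of frame features and their first-order
    differences, concatenated into a two-stream representation."""
    def __init__(self, feat_dim=1024, embed_dim=128):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)  # stream 1: features
        self.diff_proj = nn.Linear(feat_dim, embed_dim)  # stream 2: differences

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        diff = x[:, 1:] - x[:, :-1]        # frame-to-frame feature differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)   # pad back to full length
        return torch.cat([self.feat_proj(x), self.diff_proj(diff)], dim=-1)

class UformerBlock(nn.Module):
    """U-shaped stack: transformer layers over progressively shorter
    (pooled) sequences, then upsampled back with skip connections."""
    def __init__(self, dim=256, depth=3, heads=4):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=512, batch_first=True)
        self.down = nn.ModuleList([make() for _ in range(depth)])
        self.up = nn.ModuleList([make() for _ in range(depth)])
        self.pool = nn.AvgPool1d(2)

    def forward(self, x):                  # x: (batch, frames, dim)
        skips = []
        for enc in self.down:              # encoder path: halve length per level
            x = enc(x)
            skips.append(x)
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        for dec in self.up:                # decoder path: restore length, add skips
            skip = skips.pop()
            x = nn.functional.interpolate(
                x.transpose(1, 2), size=skip.size(1)).transpose(1, 2)
            x = dec(x + skip)
        return x

class PredictionHead(nn.Module):
    """Per-position keyframe classification score and boundary regression."""
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Linear(dim, 1)       # importance / keyframe score
        self.reg = nn.Linear(dim, 2)       # offsets to segment start and end

    def forward(self, x):
        return self.cls(x).squeeze(-1), self.reg(x)

def temporal_nms(segments, scores, iou_thresh=0.5):
    """Standard 1D NMS over (start, end) segments, highest score first."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        inter = (torch.minimum(segments[i, 1], segments[rest, 1])
                 - torch.maximum(segments[i, 0], segments[rest, 0])).clamp(min=0)
        union = ((segments[i, 1] - segments[i, 0])
                 + (segments[rest, 1] - segments[rest, 0]) - inter)
        order = rest[(inter / union.clamp(min=1e-6)) <= iou_thresh]
    return keep
```

In this reading, the U-shape trades sequence length for depth: average pooling halves the number of tokens at each level, so the deeper attention layers see a shorter sequence, which is consistent with the abstract's claim of handling long sequences at reduced computational cost compared to a flat transformer stack.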





Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61772352; the National Key Research and Development Project under Grant Nos. 2020YFB1711800 and 2020YFB1707900; the Science and Technology Project of Sichuan Province under Grant Nos. 2019YFG0400, 2021YFG0152, 2020YFG0479, 2020YFG0322, and 2020GFW035; and the R&D Project of Chengdu City under Grant No. 2019-YF05-01790-GX.

Author information


Corresponding author

Correspondence to Bing Guo.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, Y., Guo, B., Shen, Y. et al. Video summarization with u-shaped transformer. Appl Intell 52, 17864–17880 (2022). https://doi.org/10.1007/s10489-022-03451-1
