Fast Video Shot Transition Localization with Deep Structured Models | SpringerLink

Fast Video Shot Transition Localization with Deep Structured Models

  • Conference paper
  • First Online:
Computer Vision – ACCV 2018 (ACCV 2018)

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11361))

Abstract

Detection of video shot transitions is a crucial pre-processing step in video analysis. Previous studies are restricted to detecting sudden content changes between frames through similarity measurement, and multi-scale operations are widely utilized to deal with transitions of various lengths. However, localization of gradual transitions is still under-explored due to the high visual similarity between adjacent frames. Cut transitions are abrupt semantic breaks, while gradual transitions contain low-level spatial-temporal patterns caused by video effects such as dissolves. In this paper, we propose a structured network that detects these two types of shot transitions with separate, targeted models. Considering the speed-performance trade-off, we design the following framework. In the first stage, a light filtering module collects candidate transitions on multiple scales. Then, cut transitions and gradual transitions are selected from those candidates by separate detectors. Specifically, the cut transition detector focuses on measuring image similarity, while the gradual transition detector captures the temporal patterns of consecutive frames and localizes the positions of gradual transitions. The light filtering module rapidly excludes most video frames from further processing while maintaining an almost perfect recall of both cut and gradual transitions. The targeted models in the second stage further process the candidates obtained in the first stage to achieve high precision. With one TITAN GPU, the proposed method runs at 30× real-time speed. Experiments on the public TRECVID07 and RAI databases show that our method outperforms the state-of-the-art. To train a high-performance shot transition detector, we contribute a new database, ClipShots, which contains 128,636 cut transitions and 38,120 gradual transitions from 4,039 online videos. ClipShots intentionally collects short videos to include harder cases caused by hand-held camera vibration, large object motion, and occlusion. The database is available at https://github.com/Tangshitao/ClipShots.
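The two-stage design in the abstract can be illustrated with a minimal sketch. The function names, thresholds, and the frame-difference heuristics below are illustrative assumptions, not the paper's actual networks: stage 1 stands in for the light filtering module (flagging windows with any appreciable change), and stage 2 stands in for the separate cut/gradual detectors (a large single-step difference suggests a hard cut; a sustained but small change suggests a gradual transition).

```python
import numpy as np


def candidate_windows(frames, stride=8, threshold=0.05):
    """Stage 1 (sketch): a cheap filtering pass. Flags temporal windows
    whose largest per-step mean absolute frame difference exceeds a
    threshold, so the expensive stage-2 detectors only see likely
    transition regions. `frames` is a list of equally shaped arrays."""
    candidates = []
    for start in range(0, len(frames) - 1, stride):
        end = min(start + stride, len(frames) - 1)
        diffs = [np.abs(frames[i + 1] - frames[i]).mean()
                 for i in range(start, end)]
        if max(diffs) > threshold:
            candidates.append((start, end))
    return candidates


def classify_candidate(frames, start, end, cut_threshold=0.5):
    """Stage 2 (sketch): a stand-in for the paper's separate detectors.
    One dominant single-step difference is reported as a cut at that
    frame index; otherwise the whole window is reported as gradual."""
    diffs = np.array([np.abs(frames[i + 1] - frames[i]).mean()
                      for i in range(start, end)])
    peak = int(diffs.argmax())
    if diffs[peak] > cut_threshold:
        return ('cut', start + peak)
    return ('gradual', (start, end))
```

For example, on a synthetic sequence of 32 black frames followed by 32 white frames, stage 1 flags only the window spanning the boundary, and stage 2 labels it a cut at frame 31. The real system replaces both heuristics with learned models, but the control flow, filter cheaply first, then classify only the survivors, is the same.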



Author information


Corresponding author

Correspondence to Shitao Tang.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Tang, S., Feng, L., Kuang, Z., Chen, Y., Zhang, W. (2019). Fast Video Shot Transition Localization with Deep Structured Models. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11361. Springer, Cham. https://doi.org/10.1007/978-3-030-20887-5_36

  • DOI: https://doi.org/10.1007/978-3-030-20887-5_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20886-8

  • Online ISBN: 978-3-030-20887-5

  • eBook Packages: Computer Science (R0)
