Evaluation of multiple features for violent scenes detection | Multimedia Tools and Applications

Evaluation of multiple features for violent scenes detection

Abstract

Violent scenes detection (VSD) is a challenging problem because of the heterogeneous content, large variations in video quality, and complex semantic meanings of the concepts involved. In the last few years, combining multiple features from multiple modalities has proven to be an effective strategy for general multimedia event detection (MED), but the detection of specific events such as violent scenes has been comparatively less studied. Here, we evaluate the use of multiple features and their combination in a violent scenes detection system. We rigorously analyze a set of low-level features and a deep learning feature that capture appearance, color, texture, motion and audio in video. We also evaluate the utility of mid-level visual information obtained from detecting concepts related to violence. Experiments were performed on the publicly available MediaEval VSD 2014 dataset. The results show that visual and motion features perform better than audio features. Moreover, the mid-level features perform nearly as well as the low-level visual features. Experiments with a number of fusion methods show that the individual features are complementary and that combining them improves overall performance. This study also provides an empirical foundation for selecting feature sets that are capable of dealing with the heterogeneous content of violent scenes in movies.
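As an illustration of what combining per-feature detectors can look like, the following is a minimal sketch of score-level (late) fusion, assuming each feature already yields one violence score per video segment. The abstract does not say which fusion methods were evaluated, so the min-max normalization, the weighted average, the function names and the example scores are all assumptions made for illustration, not the authors' method or data.

```python
from typing import Dict, List, Optional


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so different features become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant score list carries no ranking information.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(
    per_feature_scores: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of normalized per-feature scores (hypothetical helper)."""
    names = list(per_feature_scores)
    if weights is None:
        # Equal weights by default; tuned weights are an assumption left to the user.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    normalized = {name: min_max_normalize(per_feature_scores[name]) for name in names}
    n_segments = len(normalized[names[0]])
    return [
        sum(weights[name] * normalized[name][i] for name in names) / total
        for i in range(n_segments)
    ]


# Made-up scores for three segments from three hypothetical detectors.
scores = {
    "visual": [0.2, 0.9, 0.4],
    "motion": [1.0, 3.0, 2.0],
    "audio": [0.0, 10.0, 0.0],
}
fused = late_fusion(scores)
```

In this toy input the second segment is ranked highest by every detector, so it also receives the highest fused score. The third segment ends up with an intermediate score because only the visual and motion detectors rank it above the minimum.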

Notes

  1. http://www.technicolor.com/en/innovation/research-innovation/scientific-data-sharing/violent-scenes-dataset

Acknowledgements

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

Correspondence to Vu Lam.

Cite this article

Lam, V., Phan, S., Le, DD. et al. Evaluation of multiple features for violent scenes detection. Multimed Tools Appl 76, 7041–7065 (2017). https://doi.org/10.1007/s11042-016-3331-4

  • DOI: https://doi.org/10.1007/s11042-016-3331-4
