Abstract
Violent scenes detection (VSD) is a challenging problem because of heterogeneous content, large variations in video quality, and the complex semantics of the concepts involved. In the last few years, combining multiple features from multiple modalities has proven to be an effective strategy for general multimedia event detection (MED), but the detection of specific events such as violent scenes has been comparatively less studied. Here, we evaluated the use of multiple features and their combinations in a violent scenes detection system. We rigorously analyzed a set of low-level features and a deep learning feature that together capture appearance, color, texture, motion, and audio in video. We also evaluated the utility of mid-level visual information obtained by detecting concepts related to violence. Experiments were performed on the publicly available MediaEval VSD 2014 dataset. The results showed that visual and motion features outperform audio features. Moreover, the mid-level features performed nearly as well as the low-level visual features. Experiments with a number of fusion methods showed that the individual features are complementary and that combining them improves overall performance. This study also provides an empirical foundation for selecting feature sets capable of handling the heterogeneous content of violent scenes in movies.
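As an illustration of what combining complementary features can look like, the following is a minimal sketch of late fusion by weighted score averaging. The abstract does not specify which fusion methods were evaluated, so this scheme, the min-max normalization step, and all names (`min_max_normalize`, `late_fusion`, the feature labels) are assumptions chosen for illustration, not the authors' implementation.

```python
from typing import Dict, List, Optional


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale one feature's scores to [0, 1] so features are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: the feature carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(
    scores_by_feature: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Fuse per-feature detection scores by a weighted average.

    scores_by_feature maps a (hypothetical) feature name to one score per
    video segment; every feature must score the same segments in the same
    order. With weights=None, all features are weighted equally.
    """
    names = list(scores_by_feature)
    if not names:
        raise ValueError("at least one feature is required")
    n = len(scores_by_feature[names[0]])
    if any(len(scores_by_feature[k]) != n for k in names):
        raise ValueError("all features must score the same number of segments")
    if weights is None:
        weights = {k: 1.0 for k in names}
    total = sum(weights[k] for k in names)
    normalized = {k: min_max_normalize(scores_by_feature[k]) for k in names}
    return [
        sum(weights[k] * normalized[k][i] for k in names) / total
        for i in range(n)
    ]


# Hypothetical scores for three segments from two single-feature classifiers.
example_scores = {
    "visual": [0.0, 1.0, 2.0],
    "audio": [10.0, 30.0, 20.0],
}
fused = late_fusion(example_scores)
```

Under this sketch, a segment that only one modality ranks highly can still receive a high fused score, which is one simple way complementary features can improve on any single feature. In practice the weights would be tuned on held-out data.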
Acknowledgements
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.
Cite this article
Lam, V., Phan, S., Le, DD. et al. Evaluation of multiple features for violent scenes detection. Multimed Tools Appl 76, 7041–7065 (2017). https://doi.org/10.1007/s11042-016-3331-4
DOI: https://doi.org/10.1007/s11042-016-3331-4