Abstract
Automatic detection of human violence in video surveillance receives considerable attention because of its applications in security, monitoring, and prevention systems. Detecting violence in real time could prevent criminal acts and even save lives. Many approaches to violence detection in video surveillance have been proposed; however, most of them focus on effectiveness rather than efficiency, aiming to surpass the accuracy of other proposals rather than to be applicable in real scenarios and in real time. In this work, we propose an efficient deep learning model for recognizing human violence in real time, composed of two modules: a spatial attention module (SA) and a temporal attention module (TA). SA extracts spatial features and regions of interest by computing the difference of two consecutive frames and applying morphological dilation. TA extracts temporal features by averaging the three RGB channels of each frame into a single channel, so that three frames can be stacked as the input to a 2D CNN backbone. The proposal was evaluated for efficiency, accuracy, and real-time performance. The results show that our model achieves the best efficiency among the compared proposals, with accuracy very close to that of the best proposal and latency close to real time. Therefore, our model can be applied in real scenarios and in real time.
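As a rough illustration of the two modules described in the abstract, the sketch below shows how a frame-difference-based spatial attention mask and a channel-averaged temporal input could be built with OpenCV and NumPy. The function names, threshold, and kernel size are illustrative assumptions for a minimal sketch, not the paper's exact implementation or parameters.

```python
# Minimal sketch of the SA and TA ideas described in the abstract.
# Threshold value and dilation kernel size are assumptions, not the paper's settings.
import cv2
import numpy as np

def spatial_attention(prev_frame, curr_frame, kernel_size=5):
    """Region-of-interest mask from the difference of two consecutive frames,
    expanded with morphological dilation."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # assumed threshold
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=1)

def temporal_attention(frames):
    """Average the RGB channels of each of three frames into one channel,
    then stack the three grayscale maps as a 3-channel input for a 2D CNN backbone."""
    assert len(frames) == 3
    gray_maps = [frame.mean(axis=2) for frame in frames]    # (H, W) per frame
    return np.stack(gray_maps, axis=-1).astype(np.float32)  # (H, W, 3)
```

In this reading, the temporal module reuses an ordinary 3-channel 2D CNN input, but with the channel axis carrying three time steps instead of color, which is what keeps the model lightweight compared with 3D CNN or two-stream designs.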
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huillcen Baca, H.A., de Luz Palomino Valdivia, F., Solis, I.S., Cruz, M.A., Caceres, J.C.G. (2023). Human Violence Recognition in Video Surveillance in Real-Time. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 652. Springer, Cham. https://doi.org/10.1007/978-3-031-28073-3_52
DOI: https://doi.org/10.1007/978-3-031-28073-3_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28072-6
Online ISBN: 978-3-031-28073-3
eBook Packages: Intelligent Technologies and Robotics (R0)