{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T13:11:35Z","timestamp":1740143495209,"version":"3.37.3"},"reference-count":75,"publisher":"Association for Computing Machinery (ACM)","issue":"10","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62132001, 61925201, 62272013"],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,10,31]]},"abstract":"Video anomaly detection (VAD) aims to identify events or scenes in videos that deviate from typical patterns. Existing approaches primarily focus on reconstructing or predicting frames to detect anomalies and have shown improved performance in recent years. However, they often depend heavily on local spatio-temporal information and face the challenge of insufficient object feature modeling. To address these issues, this article proposes a video anomaly detection framework with Enhanced Object Information and Global Temporal Dependencies (EOGT). Its main novelties are: (1) A Local Object Anomaly Stream (LOAS) is proposed to extract local multimodal spatio-temporal anomaly features at the object level. 
LOAS integrates two modules: a Diffusion-based Object Reconstruction Network (DORN) with multimodal conditions, which detects anomalies from object RGB information, and an Object Pose Anomaly Refiner (OPA), which discovers anomalies from human pose information. (2) A Global Temporal Strengthening Stream (GTSS) is proposed, which leverages video-level temporal dependencies to identify long-term and video-specific anomalies effectively. Both streams are jointly employed in EOGT to learn multimodal and multi-scale spatio-temporal anomaly features for VAD, and the anomaly features and scores are finally fused to detect anomalies at the frame level. Extensive experiments on three public datasets (ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2) verify the performance of EOGT.","DOI":"10.1145\/3662185","type":"journal-article","created":{"date-parts":[[2024,5,6]],"date-time":"2024-05-06T11:07:14Z","timestamp":1714993634000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["EOGT: Video Anomaly Detection with Enhanced Object Information and Global Temporal Dependency"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6496-7734","authenticated-orcid":false,"given":"Ruoyan","family":"Pi","sequence":"first","affiliation":[{"name":"Wangxuan Institute of Computer Technology, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2938-6798","authenticated-orcid":false,"given":"Peng","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Northwestern Polytechnical University, Xi'an, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8502-5685","authenticated-orcid":false,"given":"Xiangteng","family":"He","sequence":"additional","affiliation":[{"name":"Wangxuan Institute of Computer Technology, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7658-3845","authenticated-orcid":false,"given":"Yuxin","family":"Peng","sequence":"additional","affiliation":[{"name":"Wangxuan Institute of Computer Technology, Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2024,9,12]]},"reference":[{"issue":"5","key":"e_1_3_1_2_2","first-page":"2293","article-title":"A survey of single-scene video anomaly detection","volume":"44","author":"Ramachandra Bharathkumar","year":"2020","unstructured":"Bharathkumar Ramachandra, Michael J. Jones, and Ranga Raju Vatsavai. 2020. A survey of single-scene video anomaly detection. IEEE Trans. Pattern Anal. Mach. Intell. 44, 5 (2020), 2293\u20132312.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00136"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00179"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00684"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01333"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.338"},{"issue":"1","key":"e_1_3_1_8_2","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1109\/TPAMI.2013.111","article-title":"Anomaly detection and localization in crowded scenes","volume":"36","author":"Li Weixin","year":"2013","unstructured":"Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. 2013. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1 (2013), 18\u201332.","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45053-X_48"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2477242"},{"key":"e_1_3_1_11_2","article-title":"Attribute-based representations for accurate and interpretable video anomaly detection","author":"Reiss Tal","year":"2022","unstructured":"Tal Reiss and Yedid Hoshen. 2022. Attribute-based representations for accurate and interpretable video anomaly detection. arXiv preprint arXiv:2212.00789 (2022).","journal-title":"arXiv preprint arXiv:2212.00789"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01246"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_21"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.315"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240615"},{"key":"e_1_3_1_16_2","article-title":"Learning deep representations of appearance and motion for anomalous event detection","author":"Xu Dan","year":"2015","unstructured":"Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. 2015. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553 (2015).","journal-title":"arXiv preprint arXiv:1510.01553"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00356"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33015216"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350899"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16177"},{"issue":"7","key":"e_1_3_1_21_2","first-page":"2609","article-title":"A deep one-class neural network for anomalous event detection in complex scenes","volume":"31","author":"Wu Peng","year":"2019","unstructured":"Peng Wu, Jing Liu, and Fang Shen. 2019. 
A deep one-class neural network for anomalous event detection in complex scenes. IEEE Trans. Neural Netw. Learn. Syst. 31, 7 (2019), 2609\u20132622.","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00493"},{"key":"e_1_3_1_23_2","article-title":"Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection","author":"Cao Congqi","year":"2022","unstructured":"Congqi Cao, Yue Lu, and Yanning Zhang. 2022. Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection. arXiv preprint arXiv:2209.02899 (2022).","journal-title":"arXiv preprint arXiv:2209.02899"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3129349"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2948286"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3148392"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2021.108213"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01402"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2890749"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Taiyi Su Hanli Wang and Lei Wang. 2023. Multi-level content-aware boundary detection for temporal action proposal generation. IEEE Trans. Image Process. 
32 (2023) 6090\u20136101.","DOI":"10.1109\/TIP.2023.3328471"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3052930"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01255"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579998"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3418213"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW59228.2023.00290"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.126561"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.2517-6161.1977.tb01600.x"},{"key":"e_1_3_1_40_2","article-title":"Divide and conquer in video anomaly detection: A comprehensive review and new approach","author":"Xiao Jian","year":"2023","unstructured":"Jian Xiao, Tianyuan Liu, and Genlin Ji. 2023. Divide and conquer in video anomaly detection: A comprehensive review and new approach. arXiv preprint arXiv:2309.14622 (2023).","journal-title":"arXiv preprint arXiv:2309.14622"},{"key":"e_1_3_1_41_2","article-title":"Nice: Non-linear independent components estimation","author":"Dinh Laurent","year":"2014","unstructured":"Laurent Dinh, David Krueger, and Yoshua Bengio. 2014. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014).","journal-title":"arXiv preprint arXiv:1410.8516"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00947"},{"key":"e_1_3_1_44_2","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. 
Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33 (2020), 6840\u20136851.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_45_2","first-page":"8780","article-title":"Diffusion models beat GANs on image synthesis","volume":"34","author":"Dhariwal Prafulla","year":"2021","unstructured":"Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 34 (2021), 8780\u20138794.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_46_2","article-title":"Video diffusion models","author":"Ho Jonathan","year":"2022","unstructured":"Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022. Video diffusion models. arXiv:2204.03458 (2022).","journal-title":"arXiv:2204.03458"},{"key":"e_1_3_1_47_2","unstructured":"Uriel Singer Adam Polyak Thomas Hayes Xi Yin Jie An Songyang Zhang Qiyuan Hu Harry Yang Oron Ashual Oran Gafni Devi Parikh Sonal Gupta and Yaniv Taigman. 2022. Make-a-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)."},{"key":"e_1_3_1_48_2","article-title":"Phenaki: Variable length video generation from open domain textual description","author":"Villegas Ruben","year":"2022","unstructured":"Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. 2022. Phenaki: Variable length video generation from open domain textual description. 
arXiv preprint arXiv:2210.02399 (2022).","journal-title":"arXiv preprint arXiv:2210.02399"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612405"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA48891.2023.10160399"},{"key":"e_1_3_1_51_2","article-title":"Human motion diffusion model","author":"Tevet Guy","year":"2022","unstructured":"Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. 2022. Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022).","journal-title":"arXiv preprint arXiv:2209.14916"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01726"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01080"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00509"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-43153-1_5"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00721"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00037"},{"key":"e_1_3_1_58_2","article-title":"Attention guided graph convolutional networks for relation extraction","author":"Guo Zhijiang","year":"2019","unstructured":"Zhijiang Guo, Yan Zhang, and Wei Lu. 2019. Attention guided graph convolutional networks for relation extraction. arXiv preprint arXiv:1906.07510 (2019).","journal-title":"arXiv preprint arXiv:1906.07510"},{"key":"e_1_3_1_59_2","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. 
PMLR 8748\u20138763."},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_1_61_2","first-page":"6299","article-title":"Quo vadis, action recognition? A new model and the kinetics dataset","author":"Carreira Joao","year":"2017","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 6299\u20136308.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3222784"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.86"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.45"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58555-6_20"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350899"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2019.11.024"},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01517"},{"issue":"9","key":"e_1_3_1_70_2","first-page":"4505","article-title":"A background-agnostic framework with adversarial training for abnormal event detection in video","volume":"44","author":"Georgescu Mariana Iuliana","year":"2021","unstructured":"Mariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. 2021. A background-agnostic framework with adversarial training for abnormal event detection in video. IEEE Trans. Pattern Anal. Mach. Intell. 44, 9 (2021), 4505\u20134523.","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01321"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i1.19898"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20080-9_29"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19772-7_24"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2023.103656"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01795"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3662185","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,12]],"date-time":"2024-09-12T12:24:34Z","timestamp":1726143874000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3662185"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,12]]},"references-count":75,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,10,31]]}},"alternative-id":["10.1145\/3662185"],"URL":"https:\/\/doi.org\/10.1145\/3662185","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,9,12]]},"assertion":[{"value":"2023-12-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}