{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,5]],"date-time":"2025-04-05T06:10:48Z","timestamp":1743833448986,"version":"3.37.3"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","funder":[{"name":"the National Research Foundation, Singapore"},{"name":"the National Natural Science Foundation of China","award":["No.:U1936203"]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,26]]},"DOI":"10.1145\/3581783.3612395","type":"proceedings-article","created":{"date-parts":[[2023,10,27]],"date-time":"2023-10-27T11:26:54Z","timestamp":1698406014000},"page":"5634-5644","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4694-1231","authenticated-orcid":false,"given":"Zhenyang","family":"Li","sequence":"first","affiliation":[{"name":"Shandong University, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8691-5372","authenticated-orcid":false,"given":"Yangyang","family":"Guo","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0595-6856","authenticated-orcid":false,"given":"Kejie","family":"Wang","sequence":"additional","affiliation":[{"name":"Shandong University, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4638-0603","authenticated-orcid":false,"given":"Xiaolin","family":"Chen","sequence":"additional","affiliation":[{"name":"Shandong University, Jinan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1476-0273","authenticated-orcid":false,"given":"Liqiang","family":"Nie","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology,\u00a0Shenzhen, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4846-2015","authenticated-orcid":false,"given":"Mohan","family":"Kankanhalli","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]}],"member":"320","published-online":{"date-parts":[[2023,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson , Xiaodong He , Chris Buehler , Damien Teney , Mark Johnson , Stephen Gould , and Lei Zhang . 2018 . Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086 . Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086."},{"key":"e_1_3_2_1_2_1","volume-title":"VQA: Visual Question Answering. In IEEE International Conference on Computer Vision. 2425--2433","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol , Aishwarya Agrawal , Jiasen Lu , Margaret Mitchell , Dhruv Batra , C. Lawrence Zitnick , and Devi Parikh . 2015 . VQA: Visual Question Answering. In IEEE International Conference on Computer Vision. 2425--2433 . Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In IEEE International Conference on Computer Vision. 2425--2433."},{"key":"e_1_3_2_1_3_1","volume-title":"End-to-End Object Detection with Transformers. In European Conference on Computer Vision","volume":"12346","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion , Francisco Massa , Gabriel Synnaeve , Nicolas Usunier , Alexander Kirillov , and Sergey Zagoruyko . 2020 . End-to-End Object Detection with Transformers. In European Conference on Computer Vision , Vol. 12346 . 213--229. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-End Object Detection with Transformers. In European Conference on Computer Vision, Vol. 12346. 213--229."},{"key":"e_1_3_2_1_4_1","volume-title":"UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision. 104--120","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen , Linjie Li , Licheng Yu , Ahmed El Kholy , Faisal Ahmed , Zhe Gan , Yu Cheng , and Jingjing Liu . 2020 . UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision. 104--120 . Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision. 104--120."},{"key":"e_1_3_2_1_5_1","volume-title":"International Conference on Machine Learning. 1931--1942","author":"Cho Jaemin","year":"2021","unstructured":"Jaemin Cho , Jie Lei , Hao Tan , and Mohit Bansal . 2021 . Unifying Vision-and-Language Tasks via Text Generation . In International Conference on Machine Learning. 1931--1942 . Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021. Unifying Vision-and-Language Tasks via Text Generation. In International Conference on Machine Learning. 1931--1942."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_6_1","DOI":"10.1145\/3406095"},{"key":"e_1_3_2_1_7_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics. 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics. 4171--4186."},{"key":"e_1_3_2_1_8_1","volume-title":"MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. 5079--5088","author":"Ding Yang","year":"2022","unstructured":"Yang Ding , Jing Yu , Bang Liu , Yue Hu , Mingxin Cui , and Qi Wu . 2022 . MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. 5079--5088 . Yang Ding, Jing Yu, Bang Liu, Yue Hu, Mingxin Cui, and Qi Wu. 2022. MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. 5079--5088."},{"key":"e_1_3_2_1_9_1","volume-title":"International Conference on Learning Representations. 1--21","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , Jakob Uszkoreit , and Neil Houlsby . 2021 . An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale . In International Conference on Learning Representations. 1--21 . Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. 1--21."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_10_1","DOI":"10.24963\/ijcai.2022\/762"},{"volume-title":"European Conference on Computer Vision. 15--29","author":"Farhadi Ali","unstructured":"Ali Farhadi , Seyyed Mohammad Mohsen Hejrati , Mohammad Amin Sadeghi , Peter Young , Cyrus Rashtchian , Julia Hockenmaier , and David A. Forsyth . 2010. Every Picture Tells a Story: Generating Sentences from Images . In European Conference on Computer Vision. 15--29 . Ali Farhadi, Seyyed Mohammad Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David A. Forsyth. 2010. Every Picture Tells a Story: Generating Sentences from Images. In European Conference on Computer Vision. 15--29.","key":"e_1_3_2_1_11_1"},{"unstructured":"Zhe Gan Yen-Chun Chen Linjie Li Chen Zhu Yu Cheng and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. In Advances in Neural Information Processing Systems. Zhe Gan Yen-Chun Chen Linjie Li Chen Zhu Yu Cheng and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. In Advances in Neural Information Processing Systems.","key":"e_1_3_2_1_12_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_13_1","DOI":"10.1145\/3565266"},{"key":"e_1_3_2_1_14_1","volume-title":"AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss. In International Joint Conference on Artificial Intelligence. 708--714","author":"Guo Yangyang","year":"2021","unstructured":"Yangyang Guo , Liqiang Nie , Zhiyong Cheng , Feng Ji , Ji Zhang , and Alberto Del Bimbo . 2021 . AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss. In International Joint Conference on Artificial Intelligence. 708--714 . Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Feng Ji, Ji Zhang, and Alberto Del Bimbo. 2021. AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss. In International Joint Conference on Artificial Intelligence. 708--714."},{"key":"e_1_3_2_1_15_1","volume-title":"Scaling Up Vision-Language Pre-training for Image Captioning. In IEEE Conference on Computer Vision and Pattern Recognition. 17980--17989","author":"Hu Xiaowei","year":"2022","unstructured":"Xiaowei Hu , Zhe Gan , Jianfeng Wang , Zhengyuan Yang , Zicheng Liu , Yumao Lu , and Lijuan Wang . 2022 . Scaling Up Vision-Language Pre-training for Image Captioning. In IEEE Conference on Computer Vision and Pattern Recognition. 17980--17989 . Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. 2022. Scaling Up Vision-Language Pre-training for Image Captioning. In IEEE Conference on Computer Vision and Pattern Recognition. 17980--17989."},{"key":"e_1_3_2_1_16_1","volume-title":"Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition. 12976--12985","author":"Huang Zhicheng","year":"2021","unstructured":"Zhicheng Huang , Zhaoyang Zeng , Yupan Huang , Bei Liu , Dongmei Fu , and Jianlong Fu . 2021 . Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition. 12976--12985 . Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. 2021. Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition. 12976--12985."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_17_1","DOI":"10.1609\/aaai.v34i07.6776"},{"key":"e_1_3_2_1_18_1","volume-title":"International Conference on Machine Learning. 5583--5594","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim , Bokyung Son , and Ildoo Kim . 2021 . ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision . In International Conference on Machine Learning. 5583--5594 . Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In International Conference on Machine Learning. 5583--5594."},{"unstructured":"Junnan Li Ramprasaath R. Selvaraju Akhilesh Gotmare Shafiq R. Joty Caiming Xiong and Steven Chu-Hong Hoi. 2021b. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Advances in Neural Information Processing Systems. 9694--9705. Junnan Li Ramprasaath R. Selvaraju Akhilesh Gotmare Shafiq R. Joty Caiming Xiong and Steven Chu-Hong Hoi. 2021b. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Advances in Neural Information Processing Systems. 9694--9705.","key":"e_1_3_2_1_19_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_20_1","DOI":"10.18653\/v1\/2021.acl-long.202"},{"key":"e_1_3_2_1_21_1","volume-title":"2023 a. Learning to Agree on Vision Attention for Visual Commonsense Reasoning. Transactions on Multimedia","author":"Li Zhenyang","year":"2023","unstructured":"Zhenyang Li , Yangyang Guo , Kejie Wang , Fan Liu , Liqiang Nie , and Mohan Kankanhalli . 2023 a. Learning to Agree on Vision Attention for Visual Commonsense Reasoning. Transactions on Multimedia ( 2023 ), 1--11. Zhenyang Li, Yangyang Guo, Kejie Wang, Fan Liu, Liqiang Nie, and Mohan Kankanhalli. 2023 a. Learning to Agree on Vision Attention for Visual Commonsense Reasoning. Transactions on Multimedia (2023), 1--11."},{"key":"e_1_3_2_1_22_1","volume-title":"2023 b. Joint Answering and Explanation for Visual Commonsense Reasoning","author":"Li Zhenyang","year":"2023","unstructured":"Zhenyang Li , Yangyang Guo , Kejie Wang , Yinwei Wei , Liqiang Nie , and Mohan S. Kankanhalli . 2023 b. Joint Answering and Explanation for Visual Commonsense Reasoning . IEEE Transactions on Image Processing ( 2023 ), 3836--3846. Zhenyang Li, Yangyang Guo, Kejie Wang, Yinwei Wei, Liqiang Nie, and Mohan S. Kankanhalli. 2023 b. Joint Answering and Explanation for Visual Commonsense Reasoning. IEEE Transactions on Image Processing (2023), 3836--3846."},{"key":"e_1_3_2_1_23_1","volume-title":"Disentangled Multimodal Representation Learning for Recommendation. Transactions on Multimedia","author":"Liu Fan","year":"2022","unstructured":"Fan Liu , Huilin Chen , Zhiyong Cheng , Anan Liu , Liqiang Nie , and Mohan Kankanhalli . 2022a. Disentangled Multimodal Representation Learning for Recommendation. Transactions on Multimedia ( 2022 ), 1--11. Fan Liu, Huilin Chen, Zhiyong Cheng, Anan Liu, Liqiang Nie, and Mohan Kankanhalli. 2022a. Disentangled Multimodal Representation Learning for Recommendation. Transactions on Multimedia (2022), 1--11."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_24_1","DOI":"10.1145\/3343031.3350953"},{"key":"e_1_3_2_1_25_1","volume-title":"Interest-Aware Message-Passing GCN for Recommendation. In International World Wide Web Conferences. 1296--1305","author":"Liu Fan","year":"2021","unstructured":"Fan Liu , Zhiyong Cheng , Lei Zhu , Zan Gao , and Liqiang Nie . 2021 . Interest-Aware Message-Passing GCN for Recommendation. In International World Wide Web Conferences. 1296--1305 . Fan Liu, Zhiyong Cheng, Lei Zhu, Zan Gao, and Liqiang Nie. 2021. Interest-Aware Message-Passing GCN for Recommendation. In International World Wide Web Conferences. 1296--1305."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_26_1","DOI":"10.1145\/3556537"},{"doi-asserted-by":"crossref","unstructured":"Zhenguang Liu Runyang Feng Haoming Chen Shuang Wu Yixing Gao Yunjun Gao and Xiang Wang. 2022b. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation. In Computer Vision and Pattern Recognition. 10996--11006. Zhenguang Liu Runyang Feng Haoming Chen Shuang Wu Yixing Gao Yunjun Gao and Xiang Wang. 2022b. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation. In Computer Vision and Pattern Recognition. 10996--11006.","key":"e_1_3_2_1_27_1","DOI":"10.1109\/CVPR52688.2022.01073"},{"unstructured":"Jiasen Lu Dhruv Batra Devi Parikh and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems. 13--23. Jiasen Lu Dhruv Batra Devi Parikh and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems. 13--23.","key":"e_1_3_2_1_28_1"},{"key":"e_1_3_2_1_29_1","volume-title":"Video Transformer Network. In IEEE International Conference on Computer Vision Workshops. 3156--3165","author":"Neimark Daniel","year":"2021","unstructured":"Daniel Neimark , Omri Bar , Maya Zohar , and Dotan Asselmann . 2021 . Video Transformer Network. In IEEE International Conference on Computer Vision Workshops. 3156--3165 . Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. 2021. Video Transformer Network. In IEEE International Conference on Computer Vision Workshops. 3156--3165."},{"key":"e_1_3_2_1_30_1","volume-title":"Search-oriented Micro-video Captioning. In International Conference on Multimedia. 3234--3243","author":"Nie Liqiang","year":"2022","unstructured":"Liqiang Nie , Leigang Qu , Dai Meng , Min Zhang , Qi Tian , and Alberto Del Bimbo . 2022 . Search-oriented Micro-video Captioning. In International Conference on Multimedia. 3234--3243 . Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, and Alberto Del Bimbo. 2022. Search-oriented Micro-video Captioning. In International Conference on Multimedia. 3234--3243."},{"key":"e_1_3_2_1_31_1","volume-title":"Dynamic Modality Interaction Modeling for Image-Text Retrieval. In International Conference on Research and Development in Information Retrieval. 1104--1113","author":"Qu Leigang","year":"2021","unstructured":"Leigang Qu , Meng Liu , Jianlong Wu , Zan Gao , and Liqiang Nie . 2021 . Dynamic Modality Interaction Modeling for Image-Text Retrieval. In International Conference on Research and Development in Information Retrieval. 1104--1113 . Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic Modality Interaction Modeling for Image-Text Retrieval. In International Conference on Research and Development in Information Retrieval. 1104--1113."},{"key":"e_1_3_2_1_32_1","volume-title":"Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics. 2556--2565","author":"Sharma Piyush","year":"2018","unstructured":"Piyush Sharma , Nan Ding , Sebastian Goodman , and Radu Soricut . 2018 . Conceptual Captions: A Cleaned, Hypernymed , Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics. 2556--2565 . Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics. 2556--2565."},{"key":"e_1_3_2_1_33_1","volume-title":"FLAVA: A Foundational Language And Vision Alignment Model. In IEEE Conference on Computer Vision and Pattern Recognition. 15617--15629","author":"Singh Amanpreet","year":"2022","unstructured":"Amanpreet Singh , Ronghang Hu , Vedanuj Goswami , Guillaume Couairon , Wojciech Galuba , Marcus Rohrbach , and Douwe Kiela . 2022 . FLAVA: A Foundational Language And Vision Alignment Model. In IEEE Conference on Computer Vision and Pattern Recognition. 15617--15629 . Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2022. FLAVA: A Foundational Language And Vision Alignment Model. In IEEE Conference on Computer Vision and Pattern Recognition. 15617--15629."},{"key":"e_1_3_2_1_34_1","volume-title":"KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning. Knowledge-Based Systems","author":"Song Dandan","year":"2021","unstructured":"Dandan Song , Siyi Ma , Zhanchen Sun , Sicheng Yang , and Lejian Liao . 2021. KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning. Knowledge-Based Systems ( 2021 ), 107408. Dandan Song, Siyi Ma, Zhanchen Sun, Sicheng Yang, and Lejian Liao. 2021. KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning. Knowledge-Based Systems (2021), 107408."},{"key":"e_1_3_2_1_35_1","volume-title":"VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations. 1--16","author":"Su Weijie","year":"2020","unstructured":"Weijie Su , Xizhou Zhu , Yue Cao , Bin Li , Lewei Lu , Furu Wei , and Jifeng Dai . 2020 . VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations. 1--16 . Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations. 1--16."},{"key":"e_1_3_2_1_36_1","volume-title":"Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis. In International Conference on Multimedia. 15--23","author":"Sun Teng","year":"2022","unstructured":"Teng Sun , Wenjie Wang , Liqiang Jing , Yiran Cui , Xuemeng Song , and Liqiang Nie . 2022 . Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis. In International Conference on Multimedia. 15--23 . Teng Sun, Wenjie Wang, Liqiang Jing, Yiran Cui, Xuemeng Song, and Liqiang Nie. 2022. Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis. In International Conference on Multimedia. 15--23."},{"key":"e_1_3_2_1_37_1","volume-title":"LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Empirical Methods in Natural Language Processing. 5099--5110.","author":"Tan Hao","year":"2019","unstructured":"Hao Tan and Mohit Bansal . 2019 . LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Empirical Methods in Natural Language Processing. 5099--5110. Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Empirical Methods in Natural Language Processing. 5099--5110."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_38_1","DOI":"10.1145\/3503161.3548187"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_39_1","DOI":"10.1109\/CVPR46437.2021.01383"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_40_1","DOI":"10.1109\/TMM.2022.3168424"},{"key":"e_1_3_2_1_41_1","volume-title":"Neural multimodal cooperative learning toward micro-video understanding","author":"Wei Yinwei","year":"2019","unstructured":"Yinwei Wei , Xiang Wang , Weili Guan , Liqiang Nie , Zhouchen Lin , and Baoquan Chen . 2019. Neural multimodal cooperative learning toward micro-video understanding . IEEE Transactions on Image Processing ( 2019 ), 1--14. Yinwei Wei, Xiang Wang, Weili Guan, Liqiang Nie, Zhouchen Lin, and Baoquan Chen. 2019. Neural multimodal cooperative learning toward micro-video understanding. IEEE Transactions on Image Processing (2019), 1--14."},{"unstructured":"Aming Wu Linchao Zhu Yahong Han and Yi Yang. 2019. Connective Cognition Network for Directional Visual Commonsense Reasoning. In Advances in Neural Information Processing Systems. 5670--5680. Aming Wu Linchao Zhu Yahong Han and Yi Yang. 2019. Connective Cognition Network for Directional Visual Commonsense Reasoning. In Advances in Neural Information Processing Systems. 5670--5680.","key":"e_1_3_2_1_42_1"},{"unstructured":"Enze Xie Wenhai Wang Zhiding Yu Anima Anandkumar Jose M. Alvarez and Ping Luo. 2021. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Advances in Neural Information Processing Systems. 12077--12090. Enze Xie Wenhai Wang Zhiding Yu Anima Anandkumar Jose M. Alvarez and Ping Luo. 2021. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Advances in Neural Information Processing Systems. 12077--12090.","key":"e_1_3_2_1_43_1"},{"key":"e_1_3_2_1_44_1","volume-title":"Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning. 2048--2057","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu , Jimmy Ba , Ryan Kiros , Kyunghyun Cho , Aaron C. Courville , Ruslan Salakhutdinov , Richard S. Zemel , and Yoshua Bengio . 2015 . Show , Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning. 2048--2057 . Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning. 2048--2057."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_45_1","DOI":"10.1609\/aaai.v35i4.16428"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_46_1","DOI":"10.1109\/CVPR.2019.00688"},{"key":"e_1_3_2_1_47_1","volume-title":"Jize Cao, Ali Farhadi, and Yejin Choi.","author":"Zellers Rowan","year":"2021","unstructured":"Rowan Zellers , Ximing Lu , Jack Hessel , Youngjae Yu , Jae Sung Park , Jize Cao, Ali Farhadi, and Yejin Choi. 2021 . MERLOT : Multimodal Neural Script Knowledge Models. In Advances in Neural Information Processing Systems . 23634--23651. Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. 2021. MERLOT: Multimodal Neural Script Knowledge Models. In Advances in Neural Information Processing Systems. 23634--23651."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_48_1","DOI":"10.1109\/TIP.2021.3138302"},{"key":"e_1_3_2_1_49_1","volume-title":"Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning. In International Conference on Multimedia. 1793--1802","author":"Zhang Xi","year":"2021","unstructured":"Xi Zhang , Feifei Zhang , and Changsheng Xu . 2021 . Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning. In International Conference on Multimedia. 1793--1802 . Xi Zhang, Feifei Zhang, and Changsheng Xu. 2021. Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning. In International Conference on Multimedia. 1793--1802."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_50_1","DOI":"10.1109\/TMM.2021.3091882"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_51_1","DOI":"10.1109\/TCDS.2021.3079278"}],"event":{"sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"acronym":"MM '23","name":"MM '23: The 31st ACM International Conference on Multimedia","location":"Ottawa ON Canada"},"container-title":["Proceedings of the 31st ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3581783.3612395","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T22:53:15Z","timestamp":1730415195000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3581783.3612395"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,26]]},"references-count":51,"alternative-id":["10.1145\/3581783.3612395","10.1145\/3581783"],"URL":"https:\/\/doi.org\/10.1145\/3581783.3612395","relation":{},"subject":[],"published":{"date-parts":[[2023,10,26]]},"assertion":[{"value":"2023-10-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}