[1] Pan B., Cai H., Huang D.A., Lee K.H., Gaidon A., Adeli E., and Niebles J.C.Spatio-Temporal Graph for Video Captioning with Knowledge Distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10870-10879, 2020. [2] Ryu H., Kang S., Kang H., and Yoo C.D.Semantic Grouping Network for Video Captioning. In proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, pp. 2514-2522, 2021. [3] Venugopalan S., Rohrbach M., Donahue J., Mooney R., Darrell T., and Saenko K.Sequence to Sequence-Video to Text. InProceedings of the IEEE international conference on computer vision, pp. 4534-4542, 2015. [4] Lin K., Li L., Lin C.C., Ahmed F., Gan Z., Liu Z., Lu Y., and Wang L.Swinbert: End-to-End Transformers with Sparse Attention for Video Captioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17949-17958, 2022. [5] Kaushik P., Kumar K.V., and Biswas P.Context Bucketed Text Responses using Generative Adversarial Neural Network in and roid Application with Tens or Flow-Lite Framework. In2022 8th International Conference on Signal Processing and Communication (ICSC), IEEE, pp. 324-328, 2022. [6] Zhu F., Hwang J.N., Ma Z., Chen G., and Guo J.Understand ing Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning. IEEE Access, vol. 8, pp. 169146-169159, 2020. [7] Hendria W.F., Velda V., Putra B.H.H., Adzaka, F., and Jeong, C. Action Knowledge for Video Captioning with Graph Neural Networks. Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 4, pp. 50-62, 2023. [8] Yan C., Tu Y., Wang X., Zhang Y., Hao X., Zhang Y., and Dai Q.STAT: Spatial-Temporal Attention Mechanism for Video Captioning. IEEE transactions on multimedia, vol. 22, no. 1, pp. 229-241, 2019. [9] Yan Y., Zhuang N., Ni B., Zhang J., Xu M., Zhang Q., Zhang Z., Cheng S., Tian Q., Xu Y., and Yang X.Fine-Grained Video Captioning via Graph-Based Multi-Granularity Interaction Learning. IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 2, pp. 666-683, 2019. [10] Lin K., Gan Z., and Wang L.Augmented Partial Mutual Learning with Frame Masking for Video Captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, pp. 2047-2055, 2021. [11] Cho K., Courville A., and Bengio Y.Describing Multimedia Content using Attention-Based Encoder-Decoder Networks. IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875-1886, 2015. [12] Moniruzzaman M., Yin Z., He Z., Qin R., and Leu M.C.Human Action Recognition by Discriminative Feature Pooling and Video Segment Attention Model. IEEE Transactions on Multimedia, vol. 24, pp. 689-701, 2021. [13] Yu Q., Song J., Song Y.Z., Xiang T., and Hospedales T.M.Fine-Grained Instance-Level Sketch-Based Image Retrieval. International Journal of Computer Vision, vol. 129, pp. 484-500, 2021. [14] Gao L., Guo Z., Zhang H., Xu X., and Shen H.T.Video Captioning with Attention-Based LSTM and Semantic Consistency. IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045-2055, 2017. [15] Prudviraj J., Reddy M.I., Vishnu C., and Mohan C.K.AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description. IEEE Transactions on Image Processing, vol. 31, pp. 5559-5569, 2022. [16] Liu, Z.Y. and Liu, J.W.Hypergraph Attentional Convolutional Neural Network for Salient Object Detection. The Visual Computer, vol. 39, no. 7, pp. 2881-2907, 2023. [17] Hua X., Wang X., Rui T., Shao F., and Wang D.Adversarial Reinforcement Learning with Object-Scene Relational Graph for Video Captioning. IEEE Transactions on Image Processing, vol. 31, pp. 2004-2016, 2022. [18] Liu F., Ren X., Wu X., Yang B., Ge S., Zou Y., and Sun X.O2na: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning.arXiv preprint arXiv:2108.02359, 2021. [19] Li L., Zhang Y., Tang S., Xie L., Li X., and Tian Q.Adaptive Spatial Location with Balanced Loss for Video Captioning. IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 17-30, 2020. [20] Selvaraju R.R., Cogswell M., Das A., Vedantam R., Parikh D., and Batra D.Grad-Cam: Visual Explanations from Deep Networks via Gradient-Based Localization. InProceedings of the IEEE international conference on computer vision, pp. 618-626, 2017. [21] Donahue J.,Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625-2634, 2015. [22] Xu J., Mei T., Yao T., and Rui Y.Msr-Vtt: A Large Video Description Dataset for Bridging Video and Language. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296, 2016. [23] Krishna R., Hata K., Ren F., Fei-Fei, L., and Carlos Niebles, J. Dense-Captioning Events in Videos. InProceedings of the IEEE international conference on computer vision, pp. 706-715, 2017. [24] Yu H., Wang J., Huang Z., Yang Y., and Xu W.Video Paragraph Captioning using Hierarchical Recurrent Neural Networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 4584-4593, 2016. [25] Yu J., Li J., Yu Z., and Huang Q.Multimodal Transformer with Multi-View Visual Representation for Image Captioning. IEEE transactions on circuits and systems for video technology, vol. 30, no. 12, pp. 4467-4480, 2019. [26] Islam S., Dash A., Seum A., Raj A.H., Hossain T., and Shah F.M.Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods. SN Computer Science, vol. 2, no. 2, pp. 1-28, 2021. [27] Su Y., Li Y., Xu N., and Liu A.A.Hierarchical Deep Neural Network for Image Captioning. Neural Processing Letters, vol. 52, pp. 1057-1067, 2020. [28] Wang T., Zheng H., Yu M., Tian Q., and Hu H.Event-Centric Hierarchical Representation for Dense Video Captioning. IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 5, pp. 1890-1900, 2020. [29] Li J., Tan G., Ke X., Si H., and Peng Y.Object Detection Based on Knowledge Graph Network. Applied Intelligence, vol. 53, no. 12, pp. 15045-15066, 2023. [30] Lu, X. and Gao, Y.Guide and Interact: Scene-Graph Based Generation and Control of Video Captions. Multimedia Systems, vol. 29, no. 2, pp. 797-809, 2023. [31] Li, X. and Jiang, S.Know More Say Less: Image Captioning Based on Scene Graphs. IEEE Transactions on Multimedia, vol. 21, no. 8, pp. 2117-2130, 2019. [32] Zhang Z., Shi Y., Yuan C., Li B., Wang P., Hu W., and Zha Z.J.Object Relational Graph with Teacher-Recommended Learning for Video Captioning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13278-13288, 2020. [33] Jo Y., Lee S., Lee A.S., Lee H., Oh H., and Seo M.Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment.arXiv preprint arXiv:2307.02682, 2023. [34] Zhang Z., Xu D., Ouyang W., and Zhou L.Dense Video Captioning using Graph-Based Sentence Summarization. IEEE Transactions on Multimedia, vol. 23, pp. 1799-1810, 2020. |