Learning Scene Graph for Better Cross-Domain Image Captioning

Jia, Junhua; Xin, Xiaowei; Gao, Xiaoyan; Ding, Xiangqian; Pang, Shunpeng

doi:10.1007/978-981-99-8435-0_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14427)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Current image captioning (IC) methods achieve good results within a single domain, primarily because they are trained on large amounts of annotated data. However, their performance degrades when they are extended to new domains. To address this, we propose a cross-domain image captioning framework, SGCDIC, which generalizes image captioning models across domains by simultaneously optimizing two coupled tasks: image captioning and text-to-image synthesis (TIS). Specifically, we propose a scene-graph-based approach, SGAT, for the image captioning task. The image synthesis task employs a GAN variant (DF-GAN) to synthesize plausible images from the text descriptions generated by SGAT. We compare the generated images with the real images to enhance image captioning performance in new domains. We conduct extensive experiments to evaluate SGCDIC, using MSCOCO as the source domain and Flickr30k and Oxford-102 as the new domains. Comparative experiments and ablation studies demonstrate that SGCDIC achieves substantially better performance than strong competitors on the cross-domain image captioning task.
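The coupling between captioning and text-to-image synthesis can be illustrated with a toy sketch. The abstract does not specify the models' internals or the loss, so everything below is an assumption made for illustration: images are reduced to small feature vectors, `caption_image` and `synthesize_image` are hypothetical stand-ins for SGAT and DF-GAN, and cosine similarity is a placeholder for whatever comparison between generated and real images the paper actually uses.

```python
import math

# Hypothetical toy vocabulary: one word per image-feature dimension.
VOCAB = ["dog", "grass", "ball", "sky"]


def caption_image(features, top_k=2):
    """Stand-in captioner (not SGAT): name the top_k strongest dimensions."""
    ranked = sorted(range(len(features)), key=lambda i: features[i], reverse=True)
    return [VOCAB[i] for i in ranked[:top_k]]


def synthesize_image(caption):
    """Stand-in synthesizer (not DF-GAN): one-hot features for mentioned words."""
    return [1.0 if word in caption else 0.0 for word in VOCAB]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cycle_reward(features, caption):
    """Score a caption by how closely the image rebuilt from it matches the real one."""
    return cosine_similarity(features, synthesize_image(caption))


real_image = [0.9, 0.8, 0.1, 0.0]          # assumed features of a real image
faithful_caption = caption_image(real_image)
unfaithful_caption = ["ball", "sky"]

print(faithful_caption)
print(cycle_reward(real_image, faithful_caption) >
      cycle_reward(real_image, unfaithful_caption))
```

In this toy setting, a caption that describes the image well yields a reconstruction closer to the real image than a caption that does not, so the similarity score can act as a training signal for the captioner in a domain without paired captions.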

Notes

1. https://github.com/karpathy/neuraltalk.

References

Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, H., Wang, Y., Yang, X., Li, J.: Captioning transformer with scene graph guiding. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 2538–2542. IEEE (2021)
Chen, T.H., Liao, Y.H., Chuang, C.Y., Hsu, W.T., Fu, J., Sun, M.: Show, adapt and tell: Adversarial training of cross-domain image captioner. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 521–530 (2017)
Damodaran, V., et al.: Understanding the role of scene graphs in visual question answering. arXiv preprint arXiv:2101.05479 (2021)
Dessì, R., Bevilacqua, M., Gualdoni, E., Rakotonirina, N.C., Franzon, F., Baroni, M.: Cross-domain image captioning with discriminative finetuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6935–6944 (2023)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Herzig, R., Bar, A., Xu, H., Chechik, G., Darrell, T., Globerson, A.: Learning canonical representations for scene graph to image generation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 210–227. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7_13
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Huang, Y., Xue, H., Liu, B., Lu, Y.: Unifying multimodal transformer for bi-directional image and text generation. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1138–1147 (2021)
Jia, J., et al.: Image captioning based on scene graphs: a survey. Expert Syst. Appl. 231, 120698 (2023)
Johnson, J., Gupta, A., Fei-Fei, L.: Image generation from scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1219–1228 (2018)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
Kim, T., et al.: L-verse: bidirectional generation between image and text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16526–16536 (2022)
Liao, Z., Huang, Q., Liang, Y., Fu, M., Cai, Y., Li, Q.: Scene graph with 3D information for change captioning. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 5074–5082 (2021)
Lin, T.Y., et al.: Microsoft COCO: common objects in context (2014). arXiv preprint arXiv:1405.0312 (2019)
Lyu, F., Feng, W., Wang, S.: vtGraphNet: learning weakly-supervised scene graph for complex visual grounding. Neurocomputing 413, 51–60 (2020)
Nguyen, K., Tripathi, S., Du, B., Guha, T., Nguyen, T.Q.: In defense of scene graphs for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1407–1416 (2021)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE (2008)
Tao, M., Tang, H., Wu, F., Jing, X.Y., Bao, B.K., Xu, C.: DF-GAN: a simple and effective baseline for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16515–16525 (2022)
Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
Yang, X., Tang, K., Zhang, H., Cai, J.: Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10685–10694 (2019)
Yao, T., Pan, Y., Li, Y., Mei, T.: Exploring visual relationship for image captioning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 684–699 (2018)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
Yu, J., Chai, Y., Wang, Y., Hu, Y., Wu, Q.: CogTree: cognition tree loss for unbiased scene graph generation. arXiv preprint arXiv:2009.07526 (2020)
Yu, L., Zhang, W., Wang, J., Yu, Y.: SeqGAN: sequence generative adversarial nets with policy gradient. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
Yuan, J., et al.: Discriminative style learning for cross-domain image captioning. IEEE Trans. Image Process. 31, 1723–1736 (2022)
Zhao, W., Wu, X., Luo, J.: Cross-domain image captioning via cross-modal retrieval and model adaptation. IEEE Trans. Image Process. 30, 1180–1192 (2020)
Zhong, Y., Wang, L., Chen, J., Yu, D., Li, Y.: Comprehensive image captioning via scene graph decomposition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 211–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_13

Author information

Authors and Affiliations

Faculty of Information Science and Engineering, Ocean University of China, Shandong, 266000, China
Junhua Jia, Xiaowei Xin, Xiaoyan Gao & Xiangqian Ding
School of Computer Engineering, Weifang University, Shandong, 261061, China
Shunpeng Pang

Corresponding author

Correspondence to Shunpeng Pang.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

About this paper

Cite this paper

Jia, J., Xin, X., Gao, X., Ding, X., Pang, S. (2024). Learning Scene Graph for Better Cross-Domain Image Captioning. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14427. Springer, Singapore. https://doi.org/10.1007/978-981-99-8435-0_10

DOI: https://doi.org/10.1007/978-981-99-8435-0_10
Published: 24 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8434-3
Online ISBN: 978-981-99-8435-0
eBook Packages: Computer Science, Computer Science (R0)
