Abstract
Visual attention has been widely adopted in many tasks, including image captioning, where it not only improves captioning performance but also enhances the rationality of the generated captions. Rationality here means attending to the correct image regions while generating the corresponding words or phrases, which is critical for alleviating object hallucination. Recently, much research has been devoted to improving grounding accuracy by linking generated object words or phrases to the appropriate regions of the image. However, word-region alignment annotations are expensive to collect and limited in scale, and the generated object words may not even appear in the annotated sentences. To address this challenge, we propose a weakly supervised grounded image captioning method. Specifically, we design a region-word matching block that estimates match scores between candidate nouns and all image regions. Compared with manual annotations, these match scores may contain mistakes. To make the captioning model robust to such mistakes, we design a reinforcement loss that takes both the attention weights and the match scores into account, allowing the model to generate more accurate and better-grounded sentences. Experimental results on two commonly used benchmark datasets (MSCOCO and Flickr30k) demonstrate the superiority of the proposed blocks, and extensive ablation studies validate their effectiveness and robustness. Last but not least, our blocks can be plugged into a variety of captioning models and require neither additional labels nor extra computation at inference time.
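To make the idea of combining region-word match scores with attention weights concrete, the following is a minimal sketch, not the authors' released implementation; all names (region_feats, noun_embs, attn_weights, lambda_g) and the cosine-similarity matcher are illustrative assumptions.

import torch
import torch.nn.functional as F

def region_word_match_scores(region_feats, noun_embs):
    # region_feats: (num_regions, dim) visual features of detected regions.
    # noun_embs:    (num_nouns, dim) embeddings of candidate nouns.
    # Returns a (num_nouns, num_regions) score matrix normalized over regions.
    region_feats = F.normalize(region_feats, dim=-1)
    noun_embs = F.normalize(noun_embs, dim=-1)
    sim = noun_embs @ region_feats.t()      # raw cosine similarities
    return F.softmax(sim, dim=-1)           # soft match score per region

def grounding_reward(attn_weights, match_scores):
    # attn_weights: (num_nouns, num_regions) decoder attention at the steps
    #               where each candidate noun is emitted.
    # match_scores: (num_nouns, num_regions) output of region_word_match_scores.
    # Reward is high when attention agrees with the matcher's preferred regions.
    return (attn_weights * match_scores).sum(dim=-1).mean()

# A reinforcement-style grounding term could then be added to the caption loss,
# e.g. total_loss = caption_loss - lambda_g * grounding_reward(attn, scores),
# so noisy match scores only guide, rather than hard-constrain, the attention.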
Data Availability
The data used in this paper are already publicly available.
Acknowledgements
This work was supported by the NSFC (Program no. 61771386), the Key Research and Development Program of Shaanxi (Program no. 2020SF-359), and the Key Lab. of Manufacturing Equipment of Shaanxi Province (Program no. JXZZZB-2022-02).
Author information
Contributions
Sen Du: design, implementation, formal analysis, and writing. Hong Zhu: project administration and supervision. GuangFeng Lin: review and editing. Yuanyuan Liu: review and editing. Dong Wang: visualization. Jing Shi: review and editing. Zhong Wu: review and editing.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Du, S., Zhu, H., Lin, G. et al. Weakly supervised grounded image captioning with semantic matching. Appl Intell 54, 4300–4318 (2024). https://doi.org/10.1007/s10489-024-05389-y