Abstract
Despite considerable progress, state-of-the-art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground truth captions, and that evaluation metrics like SPICE can be ‘topped’ using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model: using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as on the average score over existing metrics (code is available at https://github.com/princetonvisualai/SPICE-U).
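A minimal sketch of the re-ranking step described above (this is an illustration, not the released implementation; the two scoring callables, a captioning model returning log p(c|I) and an unconditional language model returning log p(c), are assumed interfaces):

```python
def rerank_by_mutual_information(candidates, log_p_caption_given_image,
                                 log_p_caption, lam=1.0):
    """Re-rank beam-search candidates by log p(c|I) - lam * log p(c).

    With lam = 1 this is the pointwise mutual information between caption c
    and image I, up to the constant log p(I). Both scoring functions are
    hypothetical interfaces that return log-probabilities.
    """
    scored = [(log_p_caption_given_image(c) - lam * log_p_caption(c), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

In practice the candidates come from beam search, and lam controls how strongly common, generic phrasings are penalized relative to fluency.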
Notes
- 1.
The object classes are: man, person, tree, ground, shirt, wall, sky, window, building, and head.
- 2.
The trained object detectors are taken from the bottom-up part of the captioning model [2].
- 3.
The resulting model is similar to Baby Talk [15], which uses object, attribute, and relationship classifiers to generate image descriptions.
- 4.
For “There is a person”, the uniqueness is 0, since ‘person’ is the most common of the objects, so the SPICE-U score is 0 by definition (an illustrative sketch follows these notes).
- 5.
We calculate the correlation between the mean value of the human votes (+1 if they prefer caption b over caption c, −1 otherwise) and the score difference \(R_m(b) - R_m(c)\), where \(R_m(s)\) is the score of sentence s given by metric m (a computation sketch follows these notes).
- 6.
We also tried linear interpolation, but it does not work as well as the log-linear interpolation (a comparison sketch follows these notes).
- 7.
The TopDown model from https://github.com/poojahira/image-captioning-bottom-up-top-down, the DiscCap from https://github.com/ruotianluo/DiscCaptioning, and AoANet from https://github.com/husthuaan/AoANet.
- 8.
The captioning metrics measure different aspects of the captions and are largely uncorrelated with each other [33]; we use the geometric mean as a simple summary statistic of the overall performance of the models. For CHAIR, lower scores are better, so we use \(\frac{1}{\text{CHAIR}}\) in the geometric mean (a short computation sketch follows these notes).
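As a rough illustration of why the most common concept in note 4 receives uniqueness 0, the sketch below uses an assumed IDF-style formulation, min-max normalized to [0, 1]; the paper itself gives the exact definition of uniqueness, and this is only an approximation of the idea:

```python
import math

def uniqueness_scores(concept_doc_freq, num_captions):
    """Illustrative only: IDF per concept, min-max normalized to [0, 1],
    so the most frequent concept (e.g. 'person') gets uniqueness exactly 0.

    concept_doc_freq maps each concept to the number of reference captions
    it appears in; num_captions is the total number of reference captions.
    """
    idf = {c: math.log(num_captions / df) for c, df in concept_doc_freq.items()}
    lo, hi = min(idf.values()), max(idf.values())
    if hi == lo:  # degenerate case: all concepts equally frequent
        return {c: 0.0 for c in idf}
    return {c: (v - lo) / (hi - lo) for c, v in idf.items()}
```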
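The correlation in note 5 can be computed as in the following sketch, assuming Pearson correlation and assuming each pairwise comparison stores its human votes and the two metric scores (the field names are hypothetical):

```python
import numpy as np

def metric_human_correlation(comparisons):
    """Correlation between the mean human vote and the metric score gap.

    comparisons: list of dicts with keys 'votes' (list of +1/-1 votes for
    caption b over caption c), 'score_b' (R_m(b)) and 'score_c' (R_m(c)).
    """
    human = np.array([np.mean(c['votes']) for c in comparisons])
    metric = np.array([c['score_b'] - c['score_c'] for c in comparisons])
    return float(np.corrcoef(human, metric)[0, 1])
```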
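Note 6 contrasts two ways of combining the captioning-model and language-model scores during re-ranking. The sketch below (weights and function names are assumptions) shows the difference between combining the scores in log space, which is consistent with the mutual-information objective in the abstract, and interpolating the probabilities themselves:

```python
import math

def log_linear_score(log_p_cond, log_p_lm, lam=1.0):
    # Log-linear: log p(c|I) - lam * log p(c).
    return log_p_cond - lam * log_p_lm

def linear_score(log_p_cond, log_p_lm, lam=1.0):
    # Linear: combine the probabilities directly; per note 6 this works
    # less well than the log-linear combination.
    return math.exp(log_p_cond) - lam * math.exp(log_p_lm)
```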
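The summary statistic in note 8 can be computed as follows; the metric names and example values below are placeholders, not results from the paper:

```python
import math

def summary_score(metrics):
    """Geometric mean over captioning metrics; CHAIR is inverted so that
    higher is uniformly better. metrics maps metric name -> score."""
    values = [1.0 / v if name == 'CHAIR' else v for name, v in metrics.items()]
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical example values:
print(summary_score({'BLEU-4': 0.36, 'CIDEr': 1.13, 'SPICE': 0.21, 'CHAIR': 0.09}))
```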
References
Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: ECCV (2016)
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
Bahl, L., Brown, P., de Souza, P., Mercer, R.: Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: ICASSP (1986)
Cui, Y., Yang, G., Veit, A., Huang, X., Belongie, S.: Learning to evaluate image captioning. In: CVPR (2018)
Datta, D., Varma, S., Chowdary, C.R., Singh, S.K.: Multimodal retrieval using mutual information based textual query reformulation. Expert Syst. Appl. 68, 81–92 (2017)
Dognin, P., Melnyk, I., Mroueh, Y., Ross, J., Sercu, T.: Adversarial semantic alignment for improved image captions. In: CVPR (2019)
Henning, C.A., Ewerth, R.: Estimating the information gap between textual and visual representations. In: ICMR (2017)
Huang, L., Wang, W., Chen, J., Wei, X.Y.: Attention on attention for image captioning. In: ICCV (2019)
Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: CVPR (2016)
Johnson, J., et al.: Image retrieval using scene graphs. In: CVPR (2015)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
Kimura, R., Iida, S., Cui, H., Hung, P.H., Utsuro, T., Nagata, M.: Selecting informative context sentence by forced back-translation. In: MT Summit XVII (2019)
Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: CVPR (2017)
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7
Kulkarni, G., et al.: BabyTalk: understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2891–2903 (2013)
Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: StatMT (2007)
Li, J., Galley, M., Brockett, C., Gao, J., Dolan, B.: A diversity-promoting objective function for neural conversation models. In: NAACL HLT (2016)
Li, J., Jurafsky, D.: Mutual information and diverse decoding improve neural machine translation. arXiv:1601.00372 (2016)
Li, W., et al.: Object-driven text-to-image synthesis via adversarial training. In: CVPR (2019)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)
Lindh, A., Ross, R.J., Mahalunkar, A., Salton, G., Kelleher, J.D.: Generating diverse and meaningful captions. In: ICANN (2018)
Liu, L., Tang, J., Wan, X., Guo, Z.: Generating diverse and descriptive image captions using visual paraphrases. In: ICCV (2019)
Liu, S., Zhu, Z., Ye, N., Guadarrama, S., Murphy, K.: Improved image captioning via policy gradient optimization of SPIDEr. In: ICCV (2017)
Liu, X., Li, H., Shao, J., Chen, D., Wang, X.: Show, tell and discriminate: image captioning by self-retrieval with partially labeled data. In: ECCV (2018)
Lu, D., Whitehead, S., Huang, L., Ji, H., Chang, S.F.: Entity-aware image caption generation. In: EMNLP (2018)
Lu, J., Xiong, C., Parikh, D., Socher, R.: Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: CVPR (2017)
Luo, R., Shakhnarovich, G., Cohen, S., Price, B.: Discriminability objective for training descriptive captions. In: CVPR (2018)
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: CVPR (2016)
Melas-Kyriazi, L., Rush, A., Han, G.: Training for diversity in image paragraph captioning. In: EMNLP (2018)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2001)
Povey, D., Woodland, P.: Minimum phone error and I-smoothing for improved discriminative training. In: ICASSP (2002)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. (2017)
Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: EMNLP (2018)
Shetty, R., Rohrbach, M., Hendricks, L.A., Fritz, M., Schiele, B.: Speaking the same language: matching machine to human captions by adversarial training. In: ICCV (2017)
Jones, K.S.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. (1972)
Tu, Z., Liu, Y., Shang, L., Liu, X., Li, H.: Neural machine translation with reconstruction. In: AAAI (2017)
Vedantam, R., Bengio, S., Murphy, K., Parikh, D., Chechik, G.: Context-aware captions from context-agnostic supervision. In: CVPR (2017)
Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: CVPR (2015)
Vijayakumar, A.K., et al.: Diverse beam search for improved description of complex scenes. In: AAAI (2018)
Vijayakumar, A.K., et al.: Diverse beam search: decoding diverse solutions from neural sequence models. arXiv:1610.02424 (2018)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR (2015)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. (2017)
Wang, Q., Chan, A.B.: Describing like humans: on diversity in image captioning. In: CVPR (2019)
Wu, B., Jia, F., Liu, W., Ghanem, B.: Diverse image annotation. In: CVPR (2017)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)
Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: CVPR (2018)
Yao, T., Mei, T., Ngo, C.W.: Co-reranking by mutual reinforcement for image search. In: CVPR (2010)
You, Q., Jin, H., Wang, Z., Fang, C., Luo, J.: Image captioning with semantic attention. In: CVPR (2016)
Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV (2017)
Zhang, Y., et al.: Generating informative and diverse conversational responses via adversarial information maximization. In: NeurIPS (2018)
Acknowledgments
This work is partially supported by KAUST under Award No. OSRCRG2017-3405, by Samsung and by the Princeton CSML DataX award. We would like to thank Arjun Mani, Vikram Ramaswamy and Angelina Wang for their helpful feedback on the paper.