Abstract
Knowledge Distillation (KD) [6], an effective technique for compressing models and improving their performance, has been widely studied and adopted. However, most previous research focuses on image classification, and comparatively little addresses sequence generation tasks such as Neural Machine Translation (NMT). A few works on image captioning have incorporated KD, but they mainly treat it as a training trick. In contrast, in this work we thoroughly investigate KD in the context of image captioning through a series of experiments. First, we apply standard word-level KD to image captioning models and explore both cross-model distillation and self-distillation; we find that self-distillation is a practical choice that achieves competitive performance without the effort of selecting a teacher architecture. Second, inspired by sequence-level distillation for NMT [11], we adapt it to image captioning and observe that competitive performance can be obtained with only one-fifth of the training resources; moreover, inference can be significantly accelerated by dropping beam search, at the cost of a slight degradation in performance. Finally, inspired by work on distilling BERT [19] for NMT, we distill VL-BERT [12], leveraging its bidirectional nature to let the captioning model look ahead.
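As a rough illustration of the word-level KD objective discussed above, the sketch below mixes a soft KL term against the teacher's per-word distributions with the usual cross-entropy against the ground-truth caption. It is a minimal PyTorch-style sketch under assumed tensor shapes; the function name, the temperature T, and the mixing weight alpha are illustrative choices, not the authors' implementation. For self-distillation, teacher_logits would simply come from an earlier snapshot (or a separately trained copy) of the same captioning model.

import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, captions, pad_id, T=2.0, alpha=0.5):
    # Assumed shapes: student_logits, teacher_logits are (batch, seq_len, vocab)
    # scores over the caption vocabulary at each decoding step;
    # captions is (batch, seq_len) ground-truth token ids.

    # Soft targets from the teacher, smoothed by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)

    # Per-position KL divergence between teacher and student word distributions,
    # averaged over non-padding positions and scaled by T^2 as in Hinton et al. [6].
    mask = (captions != pad_id).float()
    kl = F.kl_div(log_probs, soft_targets, reduction="none").sum(-1)
    kd_term = (kl * mask).sum() / mask.sum() * (T * T)

    # Standard cross-entropy against the ground-truth caption.
    ce_term = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        captions.reshape(-1),
        ignore_index=pad_id,
    )
    return alpha * kd_term + (1.0 - alpha) * ce_term

The sequence-level variant of [11] needs no soft targets at all: the teacher decodes each training image with beam search, and the student is trained with ordinary cross-entropy on those generated captions, which is what allows the student to be decoded greedily with only a slight drop in quality.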
References
Vaswani, A., et al.: Attention is all you need. In arXiv (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)
Papineni, K., et al.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
Vinyals, O., et al.: Show and tell: a neural image caption generator. In: CVPR (2015)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)
Hinton, G.E., et al.: Distilling the knowledge in a neural network. In arXiv (2015)
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: consensus-based image description evaluation. In: CVPR (2015)
Zhang, Y., et al.: Deep mutual learning. In: CVPR (2018)
Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In arXiv (2016)
Zhou, L., et al.: Unified vision-language pre-training for image captioning and VQA. In: AAAI (2020)
Zhang, L., et al.: Be your own teacher: improve the performance of convolutional neural networks via self distillation. In: ICCV (2019)
Zhou, Y., et al.: More grounded image captioning by distilling image-text matching model. In: CVPR (2020)
Pan, Y., et al.: X-linear attention networks for image captioning. In: CVPR (2020)
Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
Hahn, S., Choi, H.: Self-knowledge distillation in natural language processing. In arXiv (2019)
Rennie, S.J., et al.: Self-critical sequence training for image captioning. In: CVPR (2017)
Chen, Y.-C., et al.: Distilling knowledge learned in BERT for text generation. In arXiv (2019)
Dognin, P.L., et al.: Alleviating noisy data in image captioning with cooperative distillation. In arXiv (2020)
Guo, L., et al.: Non-autoregressive image captioning with counterfactuals-critical multi-agent learning. In arXiv (2020)
Zhang, Z., et al.: Object relational graph with teacher-recommended learning for video captioning. In: CVPR (2020)
Pan, B., et al.: Spatio-temporal graph for video captioning with knowledge distillation. In: CVPR (2020)
Plummer, B.A., et al.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV (2015)
Anderson, P., et al.: SPICE: semantic propositional image caption evaluation. In: ECCV (2016)
Yuan, L., et al.: Revisiting knowledge distillation via label smoothing regularization. In: CVPR (2020)
Li, J., et al.: Learning to learn from noisy labeled data. In: CVPR (2019)
He, Y.-Y., Wu, J., Wei, X.-S.: Distilling virtual examples for long-tailed recognition. In arXiv (2021)
Furlanello, T., et al.: Born again neural networks. In: ICML (2018)
Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In arXiv (2018)
Kulkarni, G., et al.: Baby talk: understanding and generating simple image descriptions (2013)
Mitchell, M., et al.: Midge: generating image descriptions from computer vision detections. In: EACL (2012)
Huang, L., et al.: Attention on attention for image captioning. In: ICCV (2019)
Cornia, M., et al.: Meshed-memory transformer for image captioning. In: CVPR (2020)
Chen, D., Mei, J.P., Wang, C., Feng, Y., Chen, C.: Online knowledge distillation with diverse peers. In: AAAI (2020)
Yuan, L., Tay, F.E., Li, G., Wang, T., Feng, J.: Revisit knowledge distillation: a teacher-free framework. In: CVPR (2020)
Huang, Z., Wang, N.: Like what you like: knowledge distill via neuron selectivity transfer. In arXiv (2017)
Wei, H.R., Huang, S., Wang, R., Dai, X., Chen, J.: Online distilling from checkpoints for neural machine translation. In: NAACL-HLT (2019)
Freitag, M., Al-Onaizan, Y., Sankaran, B.: Ensemble distillation for neural machine translation. In arXiv (2017)
Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In arXiv (2019)
Lu, J., et al.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In arXiv (2019)
Li, L.H., et al.: VisualBERT: a simple and performant baseline for vision and language. In arXiv (2019)
Kim, J., Park, S., Kwak, N.: Paraphrasing complex network: network compression via factor transfer. In arXiv (2018)
Romero, A., et al.: Fitnets: hints for thin deep nets. In arXiv (2014)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In arXiv (2016)
Park, W., et al.: Relational knowledge distillation. In: CVPR (2019)
Chen, H., et al.: Learning student networks via feature embedding. IEEE TNNLS (2020)
Xie, J., et al.: Training convolutional neural networks with cheap convolutions and online distillation. In arXiv (2019)
Bagherinezhad, H., et al.: Label refinery: improving imagenet classification through label progression. In arXiv (2018)
Karpathy, A., Li, F.-F.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In arXiv (2014)
Sharma, P., et al.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
Szegedy, C., et al.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Dong, J., Hu, Z., Zhou, Y. (2021). Revisiting Knowledge Distillation for Image Captioning. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science, vol. 13069. Springer, Cham. https://doi.org/10.1007/978-3-030-93046-2_52
DOI: https://doi.org/10.1007/978-3-030-93046-2_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93045-5
Online ISBN: 978-3-030-93046-2