Abstract
Image–text retrieval aims to capture the similarity between images and texts by learning a ranking model with a suitable distance metric. Mining informative pairs is central to training such a model, yet the currently dominant ranking model, the Cross-Encoder (CE), processes each image–text pair jointly with cross-attention, which imposes \(\mathcal{O}(N^2)\) encoding complexity. Consequently, with limited computational resources, CE cannot be trained with a large batch size, and only a mini-batch of pairs is accessible at each iteration. In contrast, the efficient but less effective Bi-Encoder (BE) encodes texts and images separately, achieving \(\mathcal{O}(N)\) encoding complexity. To fulfill the potential of CE, we propose an Asymmetric Bi-Encoder (ABE) approach that combines CE and BE. For image-to-text retrieval, we encode images with BE and texts with CE; for text-to-image retrieval, we encode texts with BE and images with CE. Furthermore, in the training phase, we sample large-scale negative pairs with BE to overcome the batch size limitation and mine more informative examples with \(\mathcal{O}(N)\) complexity. The proposed method is conceptually simple and easy to implement, and systematic experiments on public benchmarks validate its effectiveness in improving image–text retrieval.
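The complexity gap described in the abstract, and the idea of using BE scores to pick informative negatives, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `encoder_passes` and `mine_hard_negatives`, the dot-product score, and the assumption that text i is the only positive for image i are all hypothetical choices made for illustration.

```python
def encoder_passes(num_images, num_texts):
    """Count encoder forward passes needed to score every image-text pair."""
    bi_encoder = num_images + num_texts      # each item is encoded once
    cross_encoder = num_images * num_texts   # each pair is encoded jointly
    return bi_encoder, cross_encoder


def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity score)."""
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(image_embs, text_embs, k):
    """For each image, return indices of the k highest-scoring non-matching texts.

    Assumes (hypothetically) that text i is the only positive for image i,
    and that the embeddings were precomputed by a bi-encoder.
    """
    hard = []
    for i, img in enumerate(image_embs):
        scored = [(dot(img, txt), j) for j, txt in enumerate(text_embs) if j != i]
        scored.sort(reverse=True)            # highest similarity first
        hard.append([j for _, j in scored[:k]])
    return hard


print(encoder_passes(1000, 1000))  # (2000, 1000000)
```

Note that comparing precomputed embeddings still touches every pair, but each comparison is a cheap vector operation; the expensive encoder runs only once per item, which is where the \(\mathcal{O}(N)\) encoding cost comes from.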
Data availability
All code and data are available.
Acknowledgements
This work is supported by the National Key R&D Program of China (2018AAA0100104, 2018AAA0100100) and the Natural Science Foundation of Jiangsu Province (BK20211164).
About this article
Cite this article
Xiong, W., Liu, H., Mi, S. et al. Asymmetric bi-encoder for image–text retrieval. Multimedia Systems 29, 3805–3818 (2023). https://doi.org/10.1007/s00530-023-01162-2
DOI: https://doi.org/10.1007/s00530-023-01162-2