Abstract
Cross-modal image-text retrieval is a challenging task due to the inherent ambiguity between modalities. However, most existing methods formulate the problem either with the coarse-grained information of the global image, ignoring the valuable fine-grained information implicit in local instances, or with the local features of image regions and words, failing to provide a global understanding. In this paper, we propose a novel Fine-grained Feature Assisted Cross-modal Image-Text Retrieval (FiACR) model that learns a comprehensive and informative visual representation for cross-modal retrieval. Specifically, to address the absence of local information, we design a Local-Global Visual Features Fusion (LGVFF) module that aggregates global image and local instance information. Through this aggregation, FiACR captures and leverages the intricate details of images, enabling accurate alignment between image and text. To strengthen the global visual representation, we use the instance features to filter the attention of the global image feature, encouraging it to focus on prominent regions of the image. Experimental results on several datasets demonstrate the competitive accuracy of our method compared with prior art.
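To make the fusion idea in the abstract concrete, below is a minimal PyTorch sketch of one way a local-global visual fusion block could be wired up: local instance features act as keys and values that steer the global image feature's attention toward salient regions, and the two streams are then aggregated into a single visual embedding. The class name LocalGlobalFusion, the use of multi-head cross-attention, and all dimensions are illustrative assumptions; this is not the authors' LGVFF implementation, which is described in the full paper.

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    """Illustrative local-global fusion block (not the paper's LGVFF):
    instance features guide attention over the global image feature,
    then the two views are aggregated into one visual representation."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Cross-attention: the global feature (query) attends to
        # the local instance features (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Projects the concatenated global/local views back to dim.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, global_feat, instance_feats):
        # global_feat:    (B, 1, D)  one global embedding per image
        # instance_feats: (B, N, D)  N detected instance embeddings
        attended, _ = self.cross_attn(global_feat, instance_feats, instance_feats)
        attended = self.norm(global_feat + attended)
        # Aggregate: concatenate the attended global view with the
        # mean-pooled instance view, then project to the joint space.
        pooled = instance_feats.mean(dim=1, keepdim=True)
        fused = self.fuse(torch.cat([attended, pooled], dim=-1))
        return fused.squeeze(1)  # (B, D) fused visual representation

# Usage with toy tensors:
block = LocalGlobalFusion(dim=512)
g = torch.randn(4, 1, 512)    # global features for a batch of 4 images
l = torch.randn(4, 36, 512)   # 36 instance features per image
print(block(g, l).shape)      # torch.Size([4, 512])
```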
Acknowledgment
This work was supported by the National Science and Technology Major Project (No. 2022ZD0118201) and the National Natural Science Foundation of China (Nos. 62372151 and 72188101).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bu, C., Liu, X., Huang, Z., Su, Y., Tu, J., Hong, R. (2025). Fine-grained Feature Assisted Cross-modal Image-text Retrieval. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_21
DOI: https://doi.org/10.1007/978-981-97-8795-1_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1
eBook Packages: Computer Science, Computer Science (R0)