Abstract
Cross-modal image-text retrieval is a challenging task due to the inherent ambiguity between modalities. However, most existing methods formulate the problem either with the coarse-grained information of the global image, ignoring the valuable fine-grained information implicit in local instances, or with the local features of image regions and words, failing to provide a global understanding. In this paper, we propose a novel Fine-grained Feature Assisted Cross-modal Image-Text Retrieval (FiACR) model that learns a comprehensive and informative visual representation for cross-modal retrieval. Specifically, to address the absence of local information, we design a Local-Global Visual Features Fusion (LGVFF) module that aggregates global image and local instance information. Through this aggregation, FiACR captures and leverages the intricate details of images, enabling accurate alignment between image and text. To strengthen the global visual representation, we use the instance features to filter the attention of the global image feature, encouraging it to focus on prominent regions of the image. Experimental results on several datasets demonstrate the competitive accuracy of our method compared with prior art.
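To make the fusion idea in the abstract concrete, below is a minimal PyTorch sketch of one way a local-global visual fusion block could be wired up: local instance features act as keys and values that steer the global image feature's attention toward salient regions, and the two streams are then aggregated into a single visual embedding. The class name LocalGlobalFusion, the use of multi-head cross-attention, and all dimensions are illustrative assumptions; this is not the authors' LGVFF implementation, which is described in the full paper.

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    """Illustrative local-global fusion block (not the paper's LGVFF):
    instance features guide attention over the global image feature,
    then the two views are aggregated into one visual representation."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Cross-attention: the global feature (query) attends to
        # the local instance features (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Projects the concatenated global/local views back to dim.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, global_feat, instance_feats):
        # global_feat:    (B, 1, D)  one global embedding per image
        # instance_feats: (B, N, D)  N detected instance embeddings
        attended, _ = self.cross_attn(global_feat, instance_feats, instance_feats)
        attended = self.norm(global_feat + attended)
        # Aggregate: concatenate the attended global view with the
        # mean-pooled instance view, then project to the joint space.
        pooled = instance_feats.mean(dim=1, keepdim=True)
        fused = self.fuse(torch.cat([attended, pooled], dim=-1))
        return fused.squeeze(1)  # (B, D) fused visual representation

# Usage with toy tensors:
block = LocalGlobalFusion(dim=512)
g = torch.randn(4, 1, 512)    # global features for a batch of 4 images
l = torch.randn(4, 36, 512)   # 36 instance features per image
print(block(g, l).shape)      # torch.Size([4, 512])
```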
Acknowledgment
This work was supported by the National Science and Technology Major Project (No. 2022ZD0118201) and the National Natural Science Foundation of China (Nos. 62372151 and 72188101).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bu, C., Liu, X., Huang, Z., Su, Y., Tu, J., Hong, R. (2025). Fine-grained Feature Assisted Cross-modal Image-text Retrieval. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_21
DOI: https://doi.org/10.1007/978-981-97-8795-1_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8794-4
Online ISBN: 978-981-97-8795-1
eBook Packages: Computer Science, Computer Science (R0)