
Fine-grained Feature Assisted Cross-modal Image-text Retrieval

  • Conference paper
  • First Online:
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15041)


Abstract

Cross-modal image-text retrieval is a challenging task due to the inherent ambiguity between modalities. However, most existing methods formulate the problem either with the coarse-grained information of the global image, ignoring the valuable fine-grained information implicit in local instances, or with the local features of image regions and words, failing to provide a global understanding. In this paper, we propose a novel Fine-grained Feature Assisted Cross-modal Image-Text Retrieval (FiACR) model that learns a comprehensive and informative visual representation for cross-modal retrieval. Specifically, to address the absence of local information, we design a Local-Global Visual Features Fusion (LGVFF) module that aggregates global image and local instance information. Through this aggregation, FiACR captures and leverages intricate image details, enabling accurate alignment between image and text. To enhance the global visual representation, we use the instance features to filter the attention of the global image feature, encouraging it to focus on prominent regions of the image. Experimental results on several datasets show the competitive accuracy of our method compared to prior art.
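The abstract describes these two mechanisms only at a high level. The PyTorch snippet below is a minimal illustrative sketch, not the authors' released implementation: the module name LocalGlobalFusion, the cross-attention wiring, the sigmoid gate, and all dimensions are assumptions about how a local-global fusion with instance-guided attention filtering could be put together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalFusion(nn.Module):
    """Hypothetical sketch of an LGVFF-style module: a global image feature
    attends over local instance features, and an instance-derived gate
    "filters" the fused representation toward prominent regions. This is an
    assumption-laden illustration, not the paper's code."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: the global feature queries the instance features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate computed from pooled instance features (our reading of the
        # "attention filtering" described in the abstract).
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, D), one vector per image (e.g. a ViT [CLS] token)
        # local_feats: (B, N, D), N detected-instance features per image
        q = global_feat.unsqueeze(1)                        # (B, 1, D)
        attended, _ = self.cross_attn(q, local_feats, local_feats)
        fused = self.norm(q + attended).squeeze(1)          # residual fusion
        gate = self.gate(local_feats.mean(dim=1))           # instance-driven gate
        return F.normalize(gate * fused, dim=-1)            # final visual embedding

if __name__ == "__main__":
    fuse = LocalGlobalFusion(dim=512)
    g = torch.randn(4, 512)       # batch of 4 global image features
    l = torch.randn(4, 36, 512)   # 36 instance features per image
    print(fuse(g, l).shape)       # torch.Size([4, 512])
```

In this reading, cross-attention injects the fine-grained instance detail that a purely global encoder misses, while the gate steers the embedding toward salient regions; the resulting visual embedding would then be matched against a text embedding with a standard retrieval loss. The paper's actual mechanism may differ.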



Acknowledgment

This work was supported by the National Science and Technology Major Project (No. 2022ZD0118201) and the National Natural Science Foundation of China (Nos. 62372151 and 72188101).

Author information


Corresponding author

Correspondence to Xueliang Liu.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Bu, C., Liu, X., Huang, Z., Su, Y., Tu, J., Hong, R. (2025). Fine-grained Feature Assisted Cross-modal Image-text Retrieval. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15041. Springer, Singapore. https://doi.org/10.1007/978-981-97-8795-1_21


  • DOI: https://doi.org/10.1007/978-981-97-8795-1_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8794-4

  • Online ISBN: 978-981-97-8795-1

  • eBook Packages: Computer Science, Computer Science (R0)
