PLOT: Text-Based Person Search with Part Slot Attention for Corresponding Part Discovery

Park, Jicheol; Kim, Dongwon; Jeong, Boseung; Kwak, Suha

doi:10.1007/978-3-031-72664-4_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15079)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing methods often struggle with part feature extraction and alignment due to the lack of direct part-level supervision and reliance on heuristic features. We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities, enhancing interpretability and retrieval accuracy without explicit part-level correspondence supervision. Additionally, text-based dynamic part attention adjusts the importance of each part, further improving retrieval outcomes. Our method is evaluated on three public benchmarks, significantly outperforming existing methods.
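
To make the two ideas in the abstract concrete, the following is a minimal sketch of a generic slot attention loop followed by a weighted part-level similarity. It is not the authors' implementation. It omits the learned projections, GRU, and MLP of full slot attention. Sharing the initial slots across modalities, the toy feature vectors, and the names `slot_attention`, `weighted_part_similarity`, and `part_logits` are all assumptions made for illustration. In the paper the part weights are derived from the text query; here they are hard-coded.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slot_attention(features, slots, iters=3, eps=1e-8):
    """Simplified slot attention (no learned projections, GRU, or MLP).

    features: list of N vectors (image patches or text tokens).
    slots:    list of K initial slot vectors of the same dimension.
    Returns the updated slots and the N x K attention of the last iteration.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for _ in range(iters):
        # Softmax over slots for each feature, so slots compete for inputs.
        attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
        new_slots = []
        for k in range(len(slots)):
            # Each slot becomes the attention-weighted mean of the features.
            col_sum = sum(row[k] for row in attn) + eps
            new_slots.append([
                sum(attn[n][k] * features[n][d] for n in range(len(features))) / col_sum
                for d in range(dim)
            ])
        slots = new_slots
    return slots, attn

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def weighted_part_similarity(img_parts, txt_parts, part_logits):
    # part_logits is a hypothetical stand-in for text-derived part importance.
    weights = softmax(part_logits)
    return sum(w * cosine(i, t) for w, i, t in zip(weights, img_parts, txt_parts))

# Toy inputs: four "image patch" features and two "text token" features.
image_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_feats = [[1.0, 0.1], [0.0, 1.0]]
init_slots = [[1.0, 0.0], [0.0, 1.0]]  # assumption: same slots for both modalities

img_parts, img_attn = slot_attention(image_feats, init_slots)
txt_parts, _ = slot_attention(text_feats, init_slots)
score = weighted_part_similarity(img_parts, txt_parts, [0.5, -0.5])
print(score)
```

Because slot k is initialised identically for images and text in this sketch, the k-th image part and the k-th text part can be compared directly, which is one simple way to obtain corresponding parts without part-level labels.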

Acknowledgements

This work was supported by the IITP grants and the NRF grants funded by the Ministry of Science and ICT, Korea (RS-2019-II191906; RS-2022-II220926; NRF-2018R1A5A1060031; NRF-2021R1A2C3012728).

Author information

Authors and Affiliations

Pohang University of Science and Technology (POSTECH), Pohang, South Korea
Jicheol Park, Dongwon Kim, Boseung Jeong & Suha Kwak

Authors

Jicheol Park
Dongwon Kim
Boseung Jeong
Suha Kwak

Corresponding author

Correspondence to Jicheol Park.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Electronic supplementary material

Supplementary material 1 (PDF, 2975 KB)

Cite this paper

Park, J., Kim, D., Jeong, B., Kwak, S. (2025). PLOT: Text-Based Person Search with Part Slot Attention for Corresponding Part Discovery. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15079. Springer, Cham. https://doi.org/10.1007/978-3-031-72664-4_27

DOI: https://doi.org/10.1007/978-3-031-72664-4_27
Published: 26 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72663-7
Online ISBN: 978-3-031-72664-4
eBook Packages: Computer Science, Computer Science (R0)
