Abstract
RGB-Infrared multi-modal object detection harnesses diverse and complementary information from RGB and infrared images, offering significant advantages in intelligent transportation. The primary challenge lies in effectively fusing the two image modalities. At present, fusion is hindered by two factors: first, misalignment between RGB and infrared images complicates the fusion process; second, the substantial feature differences between the two modalities impede the learning of complementary features. Existing fusion architectures often overlook these challenges or prioritize the RGB data, neglecting the full potential of infrared data. To address these challenges, we introduce CSFuser, a cascade siamese architecture built around the Multi-scale Attention-based Complementary Fusion (MACF) module, a straightforward yet effective feature fusion module. A lightweight alignment module aligns the two modalities before feature extraction. Through progressive fusion of RGB and infrared features, CSFuser addresses both challenges directly. Extensive experiments on the DENSE dataset under adverse conditions such as heavy snow, rain, and fog demonstrate CSFuser's superiority over the leading method, HRFuser, while running twice as fast. This performance underscores our method's ability to effectively fuse RGB and infrared images. The code will be made publicly available.
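The attention-weighted complementary fusion described above can be illustrated with a toy, pure-Python sketch. This is not the authors' MACF implementation: the function name, the global-average-pooling step, and the softmax weighting are illustrative assumptions, loosely in the spirit of squeeze-and-excitation channel attention, showing how per-channel weights can balance RGB and infrared contributions.

```python
import math

def channel_attention_fusion(rgb_feats, ir_feats):
    """Toy sketch of attention-weighted complementary fusion (illustrative only).

    rgb_feats, ir_feats: lists of per-channel feature maps (2D lists of floats).
    Each channel pair is fused as w_rgb * rgb + w_ir * ir, where the weights
    come from a softmax over global-average-pooled channel responses, so the
    modality with the stronger response in a channel contributes more.
    """
    def gap(channel):
        # Global average pooling over one 2D feature map.
        return sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))

    fused = []
    for c_rgb, c_ir in zip(rgb_feats, ir_feats):
        s_rgb, s_ir = gap(c_rgb), gap(c_ir)
        # Softmax over the two pooled responses yields complementary weights.
        e_rgb, e_ir = math.exp(s_rgb), math.exp(s_ir)
        w_rgb, w_ir = e_rgb / (e_rgb + e_ir), e_ir / (e_rgb + e_ir)
        fused.append([[w_rgb * a + w_ir * b for a, b in zip(r1, r2)]
                      for r1, r2 in zip(c_rgb, c_ir)])
    return fused
```

For identical inputs the weights are 0.5/0.5 and the output equals either input; when one modality's channel response dominates (e.g. a hot pedestrian in infrared at night), the fused map is pulled toward that modality.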
References
Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2015)
Bijelic, M., et al.: Seeing through fog without seeing fog: deep multimodal sensor fusion in unseen adverse weather. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11682–11692 (2020)
Broedermann, T., Sakaridis, C., Dai, D., Van Gool, L.: HRFuser: a multi-resolution sensor fusion architecture for 2D object detection. In: IEEE International Conference on Intelligent Transportation Systems (ITSC) (2023)
Dai, J., et al.: Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773 (2017)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Fang, X., Yang, Y., Fu, Y.: Visible-infrared person re-identification via semantic alignment and affinity inference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11270–11279, October 2023
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: International Conference on Machine Learning, pp. 1989–1998. PMLR (2018)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
Kim, J., Kim, H., Kim, T., Kim, N., Choi, Y.: MLPD: multi-label pedestrian detector in multispectral domain. IEEE Robot. Autom. Lett. 6(4), 7846–7853 (2021). https://doi.org/10.1109/LRA.2021.3099870
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Park, H., Lee, S., Lee, J., Ham, B.: Learning by aligning: visible-infrared person re-identification using cross-modal correspondences. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12046–12055, October 2021
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “Siamese” time delay neural network. Int. J. Pattern Recognit. Artif. Intell. 7(4), 669–688 (1993)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014)
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Zhang, H., Fromont, E., Lefevre, S., Avignon, B.: Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 276–280 (2020). https://doi.org/10.1109/ICIP40778.2020.9191080
Zhang, L., Zhu, X., Chen, X., Yang, X., Lei, Z., Liu, Z.: Weakly aligned cross-modal learning for multispectral pedestrian detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
Zhang, Q., Lai, C., Liu, J., Huang, N., Han, J.: FMCNet: feature-level modality compensation for visible-infrared person re-identification. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7339–7348 (2022). https://doi.org/10.1109/CVPR52688.2022.00720
Zhang, Y., Carballo, A., Yang, H., Takeda, K.: Perception and sensing for autonomous vehicles under adverse weather conditions: a survey. ISPRS J. Photogramm. Remote Sens. 196, 146–177 (2023)
Zhu, Y., Sun, X., Wang, M., Huang, H.: Multi-modal feature pyramid transformer for RGB-infrared object detection. IEEE Trans. Intell. Transp. Syst. 24(9), 9984–9995 (2023). https://doi.org/10.1109/TITS.2023.3266487
Acknowledgement
This work was supported by the National Natural Science Foundation of China (grant U2341228) and the THU-Bosch JCML Center.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Li, Z., Zhang, G., Zeng, Z., Hu, X. (2024). CSFuser: A Cascade Siamese Fusion Architecture for RGB-Infrared Object Detection. In: Le, X., Zhang, Z. (eds) Advances in Neural Networks – ISNN 2024. ISNN 2024. Lecture Notes in Computer Science, vol 14827. Springer, Singapore. https://doi.org/10.1007/978-981-97-4399-5_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-4398-8
Online ISBN: 978-981-97-4399-5