Abstract
RGB-Infrared multi-modal object detection harnesses diverse and complementary information from RGB and infrared images, offering significant advantages in intelligent transportation. The primary challenge lies in effectively fusing the two image modalities. At present, fusion is hindered by two factors: first, misalignment between RGB and infrared images complicates the fusion process; second, the substantial feature differences between the two modalities impede the learning of complementary features. Existing fusion architectures often overlook these challenges or prioritize the RGB data, neglecting the full potential of infrared data. To address these challenges, we introduce CSFuser, a cascade siamese architecture built around the Multi-scale Attention-based Complementary Fusion (MACF) module, a straightforward yet effective feature fusion module. A lightweight alignment module aligns the two modalities before feature extraction. Through progressive fusion of RGB and infrared features, CSFuser addresses both challenges directly. Extensive experiments on the DENSE dataset under adverse conditions such as heavy snow, rain, and fog demonstrate CSFuser's superiority over the leading method, HRFuser, while running twice as fast. This performance underscores our method's ability to effectively fuse RGB and infrared images. The code will be made publicly available.
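The attention-weighted complementary fusion described above can be illustrated with a toy, pure-Python sketch. This is not the authors' MACF implementation: the function name, the global-average-pooling step, and the softmax weighting are illustrative assumptions, loosely in the spirit of squeeze-and-excitation channel attention, showing how per-channel weights can balance RGB and infrared contributions.

```python
import math

def channel_attention_fusion(rgb_feats, ir_feats):
    """Toy sketch of attention-weighted complementary fusion (illustrative only).

    rgb_feats, ir_feats: lists of per-channel feature maps (2D lists of floats).
    Each channel pair is fused as w_rgb * rgb + w_ir * ir, where the weights
    come from a softmax over global-average-pooled channel responses, so the
    modality with the stronger response in a channel contributes more.
    """
    def gap(channel):
        # Global average pooling over one 2D feature map.
        return sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))

    fused = []
    for c_rgb, c_ir in zip(rgb_feats, ir_feats):
        s_rgb, s_ir = gap(c_rgb), gap(c_ir)
        # Softmax over the two pooled responses yields complementary weights.
        e_rgb, e_ir = math.exp(s_rgb), math.exp(s_ir)
        w_rgb, w_ir = e_rgb / (e_rgb + e_ir), e_ir / (e_rgb + e_ir)
        fused.append([[w_rgb * a + w_ir * b for a, b in zip(r1, r2)]
                      for r1, r2 in zip(c_rgb, c_ir)])
    return fused
```

For identical inputs the weights are 0.5/0.5 and the output equals either input; when one modality's channel response dominates (e.g. a hot pedestrian in infrared at night), the fused map is pulled toward that modality.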
References
Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2015)
Bijelic, M., et al.: Seeing through fog without seeing fog: deep multimodal sensor fusion in unseen adverse weather. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11682–11692 (2020)
Broedermann, T., Sakaridis, C., Dai, D., Van Gool, L.: HRFuser: a multi-resolution sensor fusion architecture for 2D object detection. In: IEEE International Conference on Intelligent Transportation Systems (ITSC) (2023)
Dai, J., et al.: Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773 (2017)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Fang, X., Yang, Y., Fu, Y.: Visible-infrared person re-identification via semantic alignment and affinity inference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11270–11279, October 2023
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: International Conference on Machine Learning, pp. 1989–1998. PMLR (2018)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
Kim, J., Kim, H., Kim, T., Kim, N., Choi, Y.: MLPD: multi-label pedestrian detector in multispectral domain. IEEE Robot. Autom. Lett. 6(4), 7846–7853 (2021). https://doi.org/10.1109/LRA.2021.3099870
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Park, H., Lee, S., Lee, J., Ham, B.: Learning by aligning: visible-infrared person re-identification using cross-modal correspondences. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12046–12055, October 2021
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “Siamese” time delay neural network. Int. J. Pattern Recognit. Artif. Intell. 7(4), 669–688 (1993)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014)
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Zhang, H., Fromont, E., Lefevre, S., Avignon, B.: Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 276–280 (2020). https://doi.org/10.1109/ICIP40778.2020.9191080
Zhang, L., Zhu, X., Chen, X., Yang, X., Lei, Z., Liu, Z.: Weakly aligned cross-modal learning for multispectral pedestrian detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
Zhang, Q., Lai, C., Liu, J., Huang, N., Han, J.: FMCNet: feature-level modality compensation for visible-infrared person re-identification. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7339–7348 (2022). https://doi.org/10.1109/CVPR52688.2022.00720
Zhang, Y., Carballo, A., Yang, H., Takeda, K.: Perception and sensing for autonomous vehicles under adverse weather conditions: a survey. ISPRS J. Photogramm. Remote Sens. 196, 146–177 (2023)
Zhu, Y., Sun, X., Wang, M., Huang, H.: Multi-modal feature pyramid transformer for RGB-infrared object detection. IEEE Trans. Intell. Transp. Syst. 24(9), 9984–9995 (2023). https://doi.org/10.1109/TITS.2023.3266487
Acknowledgement
This work was supported by the National Natural Science Foundation of China (grant U2341228) and the THU-Bosch JCML Center.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Li, Z., Zhang, G., Zeng, Z., Hu, X. (2024). CSFuser: A Cascade Siamese Fusion Architecture for RGB-Infrared Object Detection. In: Le, X., Zhang, Z. (eds) Advances in Neural Networks – ISNN 2024. ISNN 2024. Lecture Notes in Computer Science, vol 14827. Springer, Singapore. https://doi.org/10.1007/978-981-97-4399-5_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-4398-8
Online ISBN: 978-981-97-4399-5