Abstract
Scale variation is one of the primary challenges in object detection. Recently, different strategies have been introduced to address this challenge, achieving promising performance. However, limitations still exist in these detectors. On the one hand, as for the large-scale deep layers, the localizing power of the features is relatively low. On the other hand, as for the small-scale shallow layers, the categorizing ability of the features is relatively weak. Actually, the limitations are self-solving, as the above two aspects can be mutually beneficial to each other. Therefore, we propose the Stacked Pyramid Attention Network (SPANet) to bridge the gap between different scales. In SPANet, two lightweight modules, i.e. top-down feature map attention module (TDFAM) and bottom-up feature map attention module (BUFAM), are designed. Via learning the channel attention and spatial attention, each module effectively builds connections between features from adjacent scales. By progressively integrating BUFAM and TDFAM into two encoder–decoder structures, two novel feature aggregating branches are built. In this way, the branches fully complement the localizing power from small-scale features and the categorizing power from large-scale features, therefore enhancing the detection accuracy while keeping lightweight. Extensive experiments on two challenging benchmarks (PASCAL VOC and MS COCO datasets) demonstrate the effectiveness of our SPANet, showing that our model reaches a competitive trade-off between accuracy and speed.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Bell S, Lawrence ZC, Bala K, Girshick R (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2874–2883
Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-nms—improving object detection with one line of code. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 5561–5569
Cai Z, Vasconcelos N (2018) Cascade r-cnn: delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6154–6162
Cao J, Pang Y, Zhao S, Li X (2019) High-level semantic networks for multi-scale object detection. IEEE Trans Circuits Syst Video Technol 30:3372–3386
Cao J, Cholakkal H, Anwer RM, Khan FS, Pang Y, Shao L (2020) D2det: towards high quality object detection and instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11485–11494
Chen Y, Yang T, Zhang X, Meng G, Xiao X, Sun J (2019) Detnas: backbone search for object detection. In: Advances in neural information processing systems (NIPS), pp 6638–6648
Dai J, Li Y, He K, Sun J (2016) R-fcn: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems (NIPS), pp 379–387
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 764–773
Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The Pascal visual object classes (voc) challenge. Int J Comput Vis 88(2):303–338
Fu CY, Liu W, Ranga A, Tyagi A, Berg AC (2017) Dssd: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659
Fu Z, Jin Z, Qi G, Shen C, Jiang R, Chen Y, Hua X (2018) Previewer for multi-scale object detector. In: Proceedings of the 26th ACM international conference on multimedia (MM), pp 265–273
Ghiasi G, Lin TY, Le QV (2019) Nas-fpn: Learning scalable feature pyramid architecture for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 7036–7045
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 580–587
Guo Y, Wu Z, Shen D (2020) Learning longitudinal classification-regression model for infant hippocampus segmentation. Neurocomputing 391:191–198
Hao S, Zhou Y, Guo Y (2020) A brief survey on semantic segmentation with deep learning. Neurocomputing 406:302–321
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2961–2969
Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 7310–7311
Huang L, Yang Y, Deng Y, Yu Y (2015) Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874
Ji Z, Kong Q, Wang H, Pang Y (2019) Small and dense commodity object detection with multi-scale receptive field attention. In: Proceedings of the 27th ACM international conference on multimedia (MM), pp 1349–1357
Law H, Deng J (2018) Cornernet: detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision (ECCV), pp 734–750
Li H, Sun F, Liu L, Wang L (2015) A novel traffic sign detection method via color segmentation and robust shape matching. Neurocomputing 169:77–88
Li S, Yang L, Huang J, Hua XS, Zhang L (2019) Dynamic anchor feature selection for single-shot object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 6609–6618
Li Y, Chen Y, Wang N, Zhang Z (2019) Scale-aware trident networks for object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 6054–6063
Li Y, Pang Y, Shen J, Cao J, Shao L (2020) Netnet: neighbor erasing and transferring network for better single shot object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13349–13358
Li Z, Lang C, Liew J, Hou Q, Li Y, Feng J (2020) Cross-layer feature pyramid network for salient object detection. arXiv preprint arXiv:2002.10864
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: Proceedings of the European conference on computer vision (ECCV). Springer, pp 740–755
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2117–2125
Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2980–2988
Liu S, Huang D, et al (2018) Receptive field block net for accurate and fast object detection. In: Proceedings of the European conference on computer vision (ECCV), pp 385–400
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: single shot multibox detector. In: Proceedings of the European conference on computer vision (ECCV). Springer, pp 21–37
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 3431–3440
Ma X, Wang Z, Li H, Zhang P, Ouyang W, Fan X (2019) Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In: Proceedings of the IEEE international conference on computer vision, pp 6851–6860
Ouyang W, Wang X, Zeng X, Qiu S, Luo P, Tian Y, Li H, Yang S, Wang Z, Loy CC, et al (2015) Deepid-net: deformable deep convolutional neural networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2403–2412
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 7263–7271
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems (NIPS), pp 91–99
Shrivastava A, Sukthankar R, Malik J, Gupta A (2016) Beyond skip connections: top-down modulation for object detection. arXiv preprint arXiv:1612.06851
Tian Z, Shen C, Chen H, He T (2019) Fcos: fully convolutional one-stage object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 9627–9636
Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. Int J Comput Vis 104(2):154–171
Wang J, Chen K, Yang S, Loy CC, Lin D (2019) Region proposal by guided anchoring. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2965–2974
Woo S, Park J, Lee JY, So KI (2018) Cbam: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Yang B, Yang C, Liu Q, Yin X (2019) Joint rotation-invariance face detection and alignment with angle-sensitivity cascaded networks. In: Proceedings of the 27th ACM international conference on multimedia (MM), pp 1473–1480
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Yu J, Jiang Y, Wang Z, Cao Z, Huang T (2016) Unitbox: an advanced object detection network. In: Proceedings of the 24th ACM international conference on multimedia (MM), pp 516–520
Zhang C, Kim J (2019) Object detection with location-aware deformable convolution and backward attention filtering. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 9452–9461
Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Single-shot refinement neural network for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4203–4212
Zhang Z, Qiao S, Xie C, Shen W, Wang B, Yuille AL (2018) Single-shot object detection with enriched semantics. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 5813–5821
Zhao Q, Sheng T, Wang Y, Tang Z, Chen Y, Cai L, Ling H (2019) M2det: a single-shot object detector based on multi-level feature pyramid network. Proc AAAI Conf Artif Intell 33:9259–9266
Zhou X, Zhuo J, Krahenbuhl P (2019) Bottom-up object detection by grouping extreme and center points. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 850–859
Zhu Y, Zhao C, Wang J, Zhao X, Wu Y, Lu H (2017) Couplenet: coupling global structure with local parts for object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 4126–4134
Zitnick CL, Dollár P (2014) Edge boxes: locating object proposals from edges. In: Proceedings of the European conference on computer vision (ECCV). Springer, pp 391–405
Acknowledgements
The research was supported in part by the National Key Research and Development Program under Grant Nos. 2017YFC0820604, and in part by the National Nature Science Foundation of China under Grant Nos. 61772171, 62072152 and the Fundamental Research Funds for the Central Universities under grants PA2020GDKC0023, PA2019GDZC0095.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Hao, S., Wang, Z. & Sun, F. Stacked Pyramid Attention Network for Object Detection. Neural Process Lett 54, 2759–2782 (2022). https://doi.org/10.1007/s11063-021-10505-x
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11063-021-10505-x