Stacked Pyramid Attention Network for Object Detection

Neural Processing Letters

Abstract

Scale variation is one of the primary challenges in object detection. Various strategies have recently been introduced to address it, with promising results, but existing detectors still have limitations. On the one hand, features from the large-scale deep layers have relatively low localizing power. On the other hand, features from the small-scale shallow layers have relatively weak categorizing ability. These two limitations are complementary: each kind of feature can supply what the other lacks. We therefore propose the Stacked Pyramid Attention Network (SPANet) to bridge the gap between different scales. SPANet is built from two lightweight modules, a top-down feature map attention module (TDFAM) and a bottom-up feature map attention module (BUFAM). By learning channel attention and spatial attention, each module builds connections between features from adjacent scales. Progressively integrating BUFAM and TDFAM into two encoder–decoder structures yields two novel feature-aggregating branches. These branches combine the localizing power of small-scale features with the categorizing power of large-scale features, improving detection accuracy while keeping the model lightweight. Extensive experiments on two challenging benchmarks, PASCAL VOC and MS COCO, demonstrate the effectiveness of SPANet and show that it reaches a competitive trade-off between accuracy and speed.
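To make the idea of connecting adjacent scales through channel and spatial attention more concrete, the following is a minimal sketch in plain Python. It is not the TDFAM or BUFAM described above, whose internals are not given here. It assumes two feature maps stored as nested lists of shape [C][H][W], with equal channel counts and a 2x difference in resolution, and it replaces the learned attention weights with fixed averages passed through a sigmoid. All function names (`channel_attention`, `spatial_attention`, `upsample2x`, `top_down_fuse`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """One weight per channel: sigmoid of that channel's global average.

    Assumption: a real module would learn this mapping; the plain average
    is only a stand-in.
    """
    weights = []
    for channel in feat:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values)))
    return weights


def spatial_attention(feat):
    """One weight per location: sigmoid of the mean across channels."""
    num_channels = len(feat)
    height, width = len(feat[0]), len(feat[0][0])
    return [
        [
            sigmoid(sum(feat[c][y][x] for c in range(num_channels)) / num_channels)
            for x in range(width)
        ]
        for y in range(height)
    ]


def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    out = []
    for channel in feat:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))  # repeat each row
        out.append(rows)
    return out


def top_down_fuse(fine, coarse):
    """Reweight the finer map with attention derived from the coarser map.

    The coarser map is upsampled to the finer resolution, turned into
    channel and spatial weights, and applied with a residual connection
    so the original fine features are never suppressed to zero.
    """
    guide = upsample2x(coarse)
    ch = channel_attention(guide)
    sp = spatial_attention(guide)
    return [
        [
            [fine[c][y][x] * (1.0 + ch[c] * sp[y][x]) for x in range(len(fine[c][y]))]
            for y in range(len(fine[c]))
        ]
        for c in range(len(fine))
    ]
```

Calling `top_down_fuse(fine, coarse)` with a 1x2x2 `fine` map and a 1x1x1 `coarse` map returns a map with the same shape as `fine`. A bottom-up counterpart would, under the same assumptions, downsample the finer map and use it to reweight the coarser one.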

Acknowledgements

The research was supported in part by the National Key Research and Development Program under Grant No. 2017YFC0820604, in part by the National Natural Science Foundation of China under Grant Nos. 61772171 and 62072152, and in part by the Fundamental Research Funds for the Central Universities under Grant Nos. PA2020GDKC0023 and PA2019GDZC0095.

Author information

Corresponding author

Correspondence to Shijie Hao.

Cite this article

Hao, S., Wang, Z. & Sun, F. Stacked Pyramid Attention Network for Object Detection. Neural Process Lett 54, 2759–2782 (2022). https://doi.org/10.1007/s11063-021-10505-x
