FE-RetinaNet: Small Target Detection with Parallel Multi-Scale Feature Enhancement
Abstract
1. Introduction
- We propose a simple and effective parallel multi-scale feature enhancement module that enlarges the receptive field without downsampling by using dilated convolution, helping the backbone network extract shallow features rich in multi-scale context information.
- We introduce an improvement to the backbone network tailored to the object detection task, which narrows the gap between feature extraction for detection and for classification. It allows the high-level feature maps of the backbone to retain the texture information of small targets as much as possible while keeping large receptive fields and strong semantic information.
- We combine the auxiliary multi-scale feature enhancement module with the original FPN structure to build a bidirectional feature pyramid network that carries multi-scale shallow information. Unlike most bidirectional structures, which reuse backbone features in their additional branches, we feed the additional branch with the output of the multi-scale feature enhancement module, bringing genuinely new feature information into the network.
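The first contribution can be sketched in code. The following is a hypothetical PyTorch module, not the paper's exact layer layout: it assumes parallel 3 × 3 branches with dilation rates 1–4 (matching the rates ablated in Section 4.3) whose outputs are concatenated and fused by a 1 × 1 convolution. Because each branch uses `padding == dilation`, the spatial size is preserved, i.e., the receptive field grows without downsampling.

```python
import torch
import torch.nn as nn

class MFEM(nn.Module):
    """Illustrative multi-scale feature enhancement module (assumed layout):
    parallel 3x3 convolutions with different dilation rates enlarge the
    receptive field without downsampling; a 1x1 conv fuses the branches."""
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 2, 3, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                # padding == dilation keeps the spatial size unchanged
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        )
        # 1x1 conv fuses the concatenated multi-scale responses
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = MFEM()(torch.randn(1, 256, 64, 64))
print(y.shape)  # torch.Size([1, 256, 64, 64]) - resolution preserved
```

Because no branch strides, the module can sit on shallow, high-resolution feature maps where small-target detail still survives.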
2. Related Work
3. Method
3.1. Overall Architecture
3.2. Improved Backbone Network ResNet-D
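A minimal illustration of the idea behind this section, under an assumed recipe: a common way to let the last backbone stage keep small-target texture while retaining a large receptive field is to trade its stride-2 downsampling for a dilated stride-1 convolution (in the spirit of DetNet/DeepLab). The exact ResNet-D modifications are the paper's; the stage below is a simplified stand-in, not a real ResNet block.

```python
import torch
import torch.nn as nn

def make_stage(in_ch, out_ch, stride=1, dilation=1):
    """Simplified residual-stage stand-in. With stride=1 and dilation=2 it
    keeps the input resolution while matching the receptive-field growth
    of a stride-2 stage (illustrative of the assumed ResNet-D change)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride,
                  padding=dilation, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

x = torch.randn(1, 256, 32, 32)
plain = make_stage(256, 512, stride=2)(x)      # 16x16: fine detail lost
dilated = make_stage(256, 512, dilation=2)(x)  # 32x32: resolution preserved
print(plain.shape, dilated.shape)
```

The dilated variant doubles the resolution of the final feature map relative to the plain stride-2 stage, which is what allows high-level features to still describe objects only a few pixels wide.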
3.3. Multi-Scale Feature Enhancement Module
3.4. Bidirectional Feature Pyramid Network
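One way such a bidirectional pyramid can be realized is sketched below. This is a hedged illustration, not the paper's exact design: a standard top-down FPN pass over backbone features `c_feats`, followed by a bottom-up pass whose extra inputs `e_feats` come from the enhancement branch rather than from reused backbone features. All tensors are assumed to share one channel count; fusion is plain addition for brevity.

```python
import torch
import torch.nn.functional as F

def bidirectional_fuse(c_feats, e_feats):
    """Illustrative bidirectional pyramid. c_feats / e_feats are lists of
    feature maps ordered high-resolution -> low-resolution, same channels.
    e_feats plays the role of the enhancement-branch inputs."""
    # top-down pass: upsample the coarser map and add it to the finer one
    p = [c_feats[-1]]
    for c in reversed(c_feats[:-1]):
        p.append(c + F.interpolate(p[-1], size=c.shape[-2:], mode="nearest"))
    p = p[::-1]  # reorder high-res -> low-res
    # bottom-up pass: downsample the finer map, add the enhancement features
    out = [p[0] + e_feats[0]]
    for q, e in zip(p[1:], e_feats[1:]):
        out.append(q + e + F.max_pool2d(out[-1], kernel_size=2))
    return out

c = [torch.randn(1, 8, s, s) for s in (64, 32, 16)]
e = [torch.randn(1, 8, s, s) for s in (64, 32, 16)]
outs = bidirectional_fuse(c, e)
print([tuple(o.shape[-2:]) for o in outs])  # [(64, 64), (32, 32), (16, 16)]
```

Feeding `e_feats` into the bottom-up path is the point of difference from PANet-style designs, which re-inject features taken from the backbone itself.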
4. Experiments
4.1. Datasets and Evaluation Metrics
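As a reminder of what the metric columns in the tables below measure: COCO AP averages precision over IoU thresholds from 0.5 to 0.95, AP50 and AP75 fix the threshold at 0.5 and 0.75, and APS/APM/APL split objects by pixel area (small objects are those under 32 × 32). The overlap measure underlying all of them is the intersection-over-union of two boxes, a minimal version of which is:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142857...
```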
4.2. Experiments on COCO Object Detection
4.3. Ablation Research
4.4. Small Target Detection Performance Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. arXiv 2016, arXiv:1605.06409.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 630–645.
- Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. DetNet: A backbone network for object detection. arXiv 2018, arXiv:1804.06215.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
- Wang, T.; Anwer, R.M.; Cholakkal, H.; Khan, F.S.; Pang, Y.; Shao, L. Learning rich features at high-speed for single-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1971–1980.
- Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
- Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. arXiv 2017, arXiv:1701.04128.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
- Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400.
| Method | Backbone | Input Size | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| Two-stage methods | | | | | | | | |
| Faster R-CNN | VGG-16 | ~1000 × 600 | 24.2 | 45.3 | 23.5 | 7.7 | 26.4 | 37.1 |
| Faster R-CNN w FPN | ResNet-101 | ~1000 × 600 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Cascade R-CNN | ResNet-101 | ~1280 × 800 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
| CoupleNet | ResNet-101 | ~1280 × 800 | 34.4 | 54.8 | 37.2 | 13.4 | 38.1 | 50.8 |
| R-FCN | ResNet-101 | ~1000 × 600 | 29.9 | 51.9 | - | 10.8 | 32.8 | 45.0 |
| Mask R-CNN | ResNet-101 | ~1280 × 800 | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 |
| One-stage methods | | | | | | | | |
| YOLOv2 | DarkNet-19 | 544 × 544 | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5 |
| SSD513 | ResNet-101 | 513 × 513 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 |
| RetinaNet | ResNet-50 | ~832 × 500 | 32.5 | 50.9 | 34.8 | 13.9 | 35.8 | 46.7 |
| RetinaNet | ResNet-101 | ~832 × 500 | 34.4 | 53.1 | 36.8 | 14.7 | 38.5 | 49.1 |
| YOLOv3 | DarkNet-53 | 608 × 608 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 51.1 |
| RefineDet | VGG-16 | 512 × 512 | 33.0 | 54.5 | 35.5 | 16.3 | 36.3 | 44.3 |
| DSSD513 | ResNet-101 | 513 × 513 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 |
| RFBNet | VGG-16 | 512 × 512 | 33.8 | 54.2 | 35.9 | 16.2 | 37.1 | 47.4 |
| RFBNet-E | VGG-16 | 512 × 512 | 34.4 | 55.7 | 36.4 | 17.6 | 37.0 | 47.6 |
| EfficientDet-D0 | EfficientNet | 512 × 512 | 34.6 | 53.0 | 37.1 | - | - | - |
| EFIP | VGG-16 | 512 × 512 | 34.6 | 55.8 | 36.8 | 18.3 | 38.2 | 47.1 |
| Ours | | | | | | | | |
| FE-RetinaNet | ResNet-50 | 512 × 512 | 34.2 | 52.8 | 37.1 | 16.8 | 37.2 | 47.6 |
| FE-RetinaNet | ResNet-101 | 512 × 512 | 36.2 | 56.4 | 39.3 | 18.0 | 39.7 | 49.9 |
| RetinaNet | ResNet-D | Bidirectional FPN | MFEM | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|---|
| √ | | | | 32.3 | 50.6 | 34.5 | 13.7 | 35.5 | 46.3 |
| √ | √ | | | 32.8 | 51.3 | 35.2 | 14.4 | 35.9 | 46.8 |
| √ | √ | √ | | 33.5 | 51.9 | 36.1 | 15.1 | 36.1 | 47.1 |
| √ | √ | √ | √ | 34.1 | 52.7 | 36.9 | 16.7 | 37.0 | 47.5 |
| FE-RetinaNet | r1 = 1 | r2 = 2 | r3 = 3 | r4 = 4 | AP | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| √ | √ | | | | 33.6 | 15.3 | 36.2 | 47.0 |
| √ | √ | √ | | | 33.8 | 15.8 | 36.5 | 47.2 |
| √ | √ | √ | √ | | 34.1 | 16.4 | 36.8 | 47.3 |
| √ | √ | √ | √ | √ | 34.2 | 16.8 | 37.0 | 47.5 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation
Liang, H.; Yang, J.; Shao, M. FE-RetinaNet: Small Target Detection with Parallel Multi-Scale Feature Enhancement. Symmetry 2021, 13, 950. https://doi.org/10.3390/sym13060950