Composite Backbone Small Object Detection Based on Context and Multi-Scale Information with Attention Mechanism
Abstract
1. Introduction
- Data augmentation [4,5,6,7] has been widely adopted in small object detection. By applying diverse augmentation strategies to the training data, a dataset can be expanded in both size and diversity. However, data augmentation also introduces issues such as increased computational cost, and poorly designed augmentation schemes may add noise that degrades feature extraction. Moreover, data augmentation merely adds more instances of small objects without considering how to better extract their features.
- Certain studies have proposed integrating context information to assist detection, learning background features surrounding the object together with global scene features. Although these explorations have yielded performance gains, devising a well-balanced strategy for extracting context information while preventing small objects from being dominated by medium- and large-sized objects remains a challenge.
- Moreover, multi-scale learning is widely used. The feature pyramid network (FPN) emerged as a multi-scale network for comprehensive feature extraction in object detection [8]. It exploits an extensive range of feature layers, fusing shallow and deep layers so that the fused layers carry both richer positional information and richer semantic information (a minimal fusion sketch is given at the end of this Introduction). Building on this foundation, Liang et al. [9] proposed a deep FPN that incorporates lateral connections and is trained with specifically designed anchor boxes and loss functions. Merugu et al. [10,11] also employed a similar multi-module approach. However, these methods primarily focus on superimposing additional features for detection, ignoring multi-scale learning strategies tailored to small objects.
- This work introduces a composite backbone architecture in which two backbone networks extract features simultaneously and fuse them, providing more usable features and thereby enhancing detection accuracy (a hedged fusion sketch follows this list).
- This work designs a composite dilated convolution and attention module (CDAM). The module applies convolutions with varying dilation rates to shallow feature maps and fuses the results, effectively incorporating context information for better detection performance (see the sketch after this list).
- This work presents the feature elimination module (FEM). The module mainly reduces the impact of medium and large objects on small objects by eliminating those objects' features from the shallow feature layer.
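
To make the multi-scale fusion discussed above concrete, the following minimal PyTorch sketch shows a standard FPN-style top-down pathway in the spirit of [8]: deep, semantically rich maps are upsampled and added to shallow, spatially precise maps through 1 × 1 lateral convolutions. The class name, channel widths, and nearest-neighbour upsampling are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal FPN-style top-down fusion (illustrative sketch)."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project every backbone stage to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convs smooth the fused maps.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: [C3, C4, C5], ordered shallow (high-res) -> deep (low-res).
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the deeper map and add it to the shallower
        # one, so shallow layers gain semantics while keeping positional detail.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [s(p) for s, p in zip(self.smooth, laterals)]

# Example with dummy backbone outputs at three scales.
c3, c4, c5 = (torch.randn(1, 256, 80, 80),
              torch.randn(1, 512, 40, 40),
              torch.randn(1, 1024, 20, 20))
p3, p4, p5 = TinyFPN()([c3, c4, c5])
```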
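
The composite backbone contribution can likewise be pictured with a hedged sketch: same-scale stage outputs from an assistant backbone are projected by 1 × 1 convolutions and added to those of the lead backbone before detection. The projection-plus-addition rule and the name BackboneFusion are assumptions for illustration only; the composition actually used is defined in Section 3.4.

```python
import torch
import torch.nn as nn

class BackboneFusion(nn.Module):
    """Illustrative fusion of same-scale features from two backbones."""

    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        # One 1x1 projection per feature scale for the assistant backbone.
        self.proj = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=1) for c in channels
        )

    def forward(self, lead_feats, assist_feats):
        # Element-wise addition of projected assistant features onto the lead features.
        return [f_lead + p(f_assist)
                for p, f_lead, f_assist in zip(self.proj, lead_feats, assist_feats)]

# Example: two backbones producing features at the same three scales.
lead = [torch.randn(1, c, s, s) for c, s in ((256, 80), (512, 40), (1024, 20))]
assist = [torch.randn(1, c, s, s) for c, s in ((256, 80), (512, 40), (1024, 20))]
fused = BackboneFusion()(lead, assist)
```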
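
Finally, a CDAM-like block can be sketched as parallel 3 × 3 convolutions with dilation rates 1, 2, and 3 (the rates examined in Section 4.2.1) whose outputs are concatenated, fused by a 1 × 1 convolution, and re-weighted by channel attention. The squeeze-and-excitation-style gate below stands in for the attention actually used (CBAM), and the concatenation fusion and residual connection are assumptions of this sketch rather than the module's definitive design.

```python
import torch
import torch.nn as nn

class DilatedContextBlock(nn.Module):
    """Illustrative CDAM-like block: multi-rate dilated context + channel attention."""

    def __init__(self, channels, rates=(1, 2, 3)):
        super().__init__()
        # Parallel dilated 3x3 branches enlarge the receptive field at full resolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)
        # Squeeze-and-excitation style channel gate (stand-in for CBAM).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Fuse the multi-rate context, gate it per channel, and keep a residual path.
        ctx = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return ctx * self.attn(ctx) + x

# Example on a shallow feature map.
out = DilatedContextBlock(256)(torch.randn(1, 256, 80, 80))
```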
2. Related Work
2.1. Object Detection
2.2. Small Object Detection
2.3. Context Information
3. Methodology
3.1. Overall Framework
3.2. Composite Dilated Convolution and Attention Module
3.3. Feature Elimination Module
3.4. Composite Backbone Networks
4. Experiments
4.1. Experimental Setup
4.1.1. Dataset
4.1.2. Performance Indicator
4.1.3. Training Details
4.2. Ablation Study
4.2.1. The Impact of CDAM
4.2.2. The Impact of FEM
4.2.3. The Impact of Composite Backbone
4.3. Comparison with Baseline
4.4. Comparison with Other Models
4.5. Qualitative Results
5. Conclusions and Future Works
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef] [PubMed]
- Zhu, Y.; Zhou, Q.; Liu, N.; Xu, Z.; Ou, Z.; Mou, X.; Tang, J. ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19723–19733. [Google Scholar]
- Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
- Chen, C.; Zhang, Y.; Lv, Q.; Wei, S.; Wang, X.; Sun, X.; Dong, J. Rrnet: A hybrid detector for object detection in drone-captured images. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
- Chen, Y.; Zhang, P.; Li, Z.; Li, Y.; Zhang, X.; Meng, G.; Xiang, S.; Sun, J.; Jia, J. Stitcher: Feedback-driven data provider for object detection. arXiv 2020, arXiv:2004.12432. [Google Scholar]
- Demirel, B.; Baran, O.B.; Cinbis, R.G. Meta-tuning Loss Functions and Data Augmentation for Few-shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7339–7349. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Liang, Z.; Shao, J.; Zhang, D.; Gao, L. Small object detection using deep feature pyramid networks. In Advances in Multimedia Information Processing–PCM 2018, Proceedings of the 19th Pacific-Rim Conference on Multimedia, Hefei, China, 21–22 September 2018; Proceedings, Part III 19; Springer International Publishing: Cham, Switzerland, 2018; pp. 554–564. [Google Scholar]
- Bathula, A.; Muhuri, S.; Gupta, S.K.; Merugu, S. Secure certificate sharing based on Blockchain framework for online education. Multimed. Tools Appl. 2023, 82, 16479–16500. [Google Scholar] [CrossRef]
- Bathula, A.; Merugu, S.; Skandha, S.S. Academic Projects on Certification Management Using Blockchain—A Review. In Proceedings of the 2022 International Conference on Recent Trends in Microelectronics, Automation, Computing and Communications Systems (ICMACC), Hyderabad, India, 28–30 December 2022; pp. 1–6. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. arXiv 2021, arXiv:2108.11539. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning data augmentation strategies for object detection. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16; Springer International Publishing: Cham, Switzerland, 2020; pp. 566–583. [Google Scholar]
- Nayan, A.A.; Saha, J.; Mozumder, A.N.; Mahmud, K.R. Real time detection of small objects. Int. J. Innov. Technol. Explor. Eng. 2020, 9, 837–843. [Google Scholar]
- Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
- Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended feature pyramid network for small object detection. IEEE Trans. Multimed. 2021, 24, 1968–1979. [Google Scholar] [CrossRef]
- Li, J.; Wei, Y.; Liang, X.; Dong, J.; Xu, T.; Feng, J.; Yan, S. Attentive contexts for object detection. IEEE Trans. Multimed. 2016, 19, 944–954. [Google Scholar] [CrossRef]
- Zeng, X.; Ouyang, W.; Yan, J.; Li, H.; Xiao, T.; Wang, K.; Liu, Y.; Zhou, Y.; Yang, B.; Wang, Z.; et al. Crafting gbd-net for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2109–2123. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Wang, R.; Shan, S.; Chen, X. Structure inference net: Object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6985–6994. [Google Scholar]
- Xu, H.; Jiang, C.; Liang, X.; Lin, L.; Li, Z. Reasoning-rcnn: Unifying adaptive global reasoning into large-scale object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6419–6428. [Google Scholar]
- Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6569–6578. [Google Scholar]
- Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9657–9666. [Google Scholar]
- Merugu, S.; Tiwari, A.; Sharma, S.K. Spatial–spectral image classification with edge preserving method. J. Indian Soc. Remote Sens. 2021, 49, 703–711. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
- Liu, Z.; Gao, G.; Sun, L.; Fang, L. IPG-net: Image pyramid guidance network for small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 1026–1027. [Google Scholar]
- Available online: https://cocodataset.org/ (accessed on 1 January 2017).
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
- Lu, X.; Li, B.; Yue, Y.; Li, Q.; Yan, J. Grid r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7363–7372. [Google Scholar]
- Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-head r-cnn: In defense of two-stage object detector. arXiv 2017, arXiv:1711.07264. [Google Scholar]
- Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
- Zhang, Z.; Qiao, S.; Xie, C.; Shen, W.; Wang, B.; Yuille, A.L. Single-shot object detection with enriched semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5813–5821. [Google Scholar]
- Kong, T.; Sun, F.; Tan, C.; Liu, H.; Huang, W. Deep feature pyramid reconfiguration for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 169–185. [Google Scholar]
- Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
- Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677. [Google Scholar]
| Method | Feature Extraction | Context Information | Multi-Scale Information |
|---|---|---|---|
| [12,31] | single backbone | - | - |
| [8] | single backbone | - | FPN |
| [15,16,18] | single backbone | - | based on FPN |
| [32] | single backbone | - | image pyramid |
| [23] | single backbone | by multi-context windows | - |
| [24] | single backbone | by LSTM | - |
| [25] | single backbone | by graph neural network | - |
| [26] | single backbone | by knowledge graph | - |
| Ours | composite backbone | by dilated convolution | based on FPN, with FEM |
| Method | Size | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| Baseline | | 43.6 | 62.2 | 47.0 | 22.9 | 45.2 | 59.9 |
| CDM | | 44.4 | 62.6 | 48.3 | 23.5 | 46.6 | 60.7 |
| CDM + CBAM | | 44.9 | 63.0 | 48.8 | 23.8 | 47.4 | 62.9 |
| Rate = 1 | Rate = 2 | Rate = 3 | Size | APS | APM | APL |
|---|---|---|---|---|---|---|
| ✓ | - | - | | 22.9 | 45.2 | 59.9 |
| ✓ | ✓ | - | | 23.3 | 45.9 | 60.1 |
| ✓ | - | ✓ | | 23.2 | 45.7 | 60.5 |
| ✓ | ✓ | ✓ | | 23.5 | 46.6 | 60.7 |
| Method | Size | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| Baseline | | 43.6 | 62.2 | 47.0 | 22.9 | 45.2 | 59.9 |
| FEM | | 44.7 | 63.0 | 48.5 | 23.7 | 45.1 | 60.3 |
| Method | Size | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| Baseline | | 43.6 | 62.2 | 47.0 | 22.9 | 45.2 | 59.9 |
| Composite Backbone | | 44.7 | 63.1 | 48.4 | 24.0 | 46.2 | 61.3 |
| Method | Size | AP | AP50 | AP75 | APS | APM | APL | AR |
|---|---|---|---|---|---|---|---|---|
| YOLOv7 | | 45.8 | 65.5 | 49.2 | 27.1 | 52.3 | 62.5 | 54.7 |
| Ours | | 46.5 | 65.2 | 51.1 | 29.7 | 51.9 | 62.6 | 55.6 |
| Method | Backbone | Size | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| RetinaNet [34] | ResNet50 | | 35.1 | 54.2 | 37.7 | 18.0 | 39.3 | 46.4 |
| Faster R-CNN + FPN [8] | ResNet101 | - | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Cascade R-CNN [35] | ResNet101 | - | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
| Soft-NMS [36] | Aligned-Inception-ResNet | | 40.9 | 62.8 | - | 23.3 | 43.6 | 53.3 |
| Grid R-CNN + FPN [37] | ResNeXt101 | | 43.2 | 63.0 | 46.6 | 25.1 | 46.5 | 55.2 |
| LH R-CNN [38] | ResNet101 | | 41.5 | - | - | 25.2 | 45.3 | 53.1 |
| IPG R-CNN [32] | IPGNet101 | | 45.7 | 64.3 | 49.9 | 26.6 | 48.6 | 58.3 |
| Ours | CSPDarknet53 | | 47.2 | 65.5 | 51.8 | 27.2 | 51.8 | 61.1 |
| YOLOv2 [13] | Darknet | | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5 |
| SSD [31] | ResNet101 | | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 |
| DSSD [39] | ResNet101 | | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 |
| DES [40] | VGG16 | | 30.1 | 53.2 | 34.6 | 13.9 | 36.0 | 47.6 |
| DFPR [41] | ResNet101 | | 34.6 | 54.3 | 37.3 | 14.7 | 38.1 | 51.9 |
| RefineDet [42] | VGG16 | | 33.0 | 54.5 | 35.5 | 16.3 | 36.3 | 44.3 |
| CenterNet [28] | HRNet-W64 | | 44.0 | 62.6 | 47.1 | 23.0 | 47.3 | 57.8 |
| CornerNet [27] | Hourglass104 | | 42.1 | 57.8 | 45.3 | 20.8 | 44.8 | 56.7 |
| Ours | CSPDarknet53 | | 46.1 | 64.7 | 50.5 | 24.1 | 50.9 | 62.8 |