YOLO-GCRS: A Remote Sensing Image Object Detection Algorithm Incorporating a Global Contextual Attention Mechanism
Abstract
1. Introduction
- We design YOLO-GCRS by integrating a global-context-aware attention mechanism with the C3 module of YOLOv5s (version 6.1). This integration enhances the network's ability to capture the global features of an image. We also analyze how detection performance varies with the position at which the module is inserted in the backbone.
- We propose a new convolutional extraction module, CBM, to replace the CBS module in the original framework. This replacement significantly improves the model's accuracy when detecting objects.
- Lastly, we introduce a detection head, ECAHead, which incorporates the ECA attention mechanism and enables a more complete extraction of high-dimensional channel features.
2. Background
- Group all candidate bounding boxes by category label and sort each group in descending order of confidence score.
- Select the box with the highest confidence score from the sorted group in step 1. Then, working through the remaining boxes in order, compute the IoU between each box and the currently selected highest-scoring box, and discard any box whose IoU exceeds a predefined threshold, so that only the most relevant boxes are retained.
- Repeat step 2 on the boxes that remain until every box has been either selected or discarded (a minimal sketch follows this list).
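The steps above describe standard greedy NMS. A minimal NumPy sketch, applied to one category group at a time, might look as follows; the function name and the default IoU threshold of 0.5 are illustrative, not taken from the paper:

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS for one category group.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]      # step 1: sort by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]                    # step 2: pick the highest-scoring box
        keep.append(int(i))
        # IoU of box i with every other remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # discard boxes overlapping box i too much; step 3: repeat on the rest
        order = order[1:][iou <= iou_thresh]
    return keep
```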
3. Proposed Method
3.1. Global Context Block
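For orientation, a minimal PyTorch sketch of the standard global context block on which the GC-C3 module builds: context modeling by softmax attention pooling over all spatial positions, a bottleneck transform, and broadcast fusion back onto the input feature map. The class name and the reduction ratio r are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Standard GC block: attention pooling + bottleneck transform + fusion.

    The reduction ratio `r` is an assumed value; the paper may fuse the
    block into C3 with a different setting.
    """
    def __init__(self, channels, r=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)    # context modeling
        self.transform = nn.Sequential(                      # bottleneck transform
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.LayerNorm([channels // r, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # softmax over all H*W positions yields a global attention map
        w_attn = self.attn(x).view(b, 1, h * w).softmax(dim=-1)
        # weighted sum of features at every position -> (B, C, 1, 1) context
        context = torch.bmm(x.view(b, c, h * w), w_attn.transpose(1, 2))
        context = context.view(b, c, 1, 1)
        return x + self.transform(context)                   # broadcast fusion
```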
3.2. CBM Model
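The CBR/CBM comparison in Section 4.4.3 suggests the naming convention Conv + BatchNorm + activation, with R for ReLU, S for SiLU, and M for Mish. Under that assumption, a minimal sketch of CBM as a drop-in replacement for CBS:

```python
import torch.nn as nn

class CBM(nn.Module):
    """Conv + BatchNorm + Mish, an assumed reading of the CBM block
    that replaces CBS (Conv + BatchNorm + SiLU) in YOLOv5s."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()  # Mish(x) = x * tanh(softplus(x)); PyTorch >= 1.9

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```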
3.3. ECAHead
- The global average pooling operation is applied to the input feature map.
- Subsequently, a one-dimensional convolution with a kernel of size k is applied. The Sigmoid activation function, as shown in Equation (7), is then used to compute the weight of each channel.
- Each channel of the original input feature map is then multiplied by its corresponding weight, producing the final output feature map. This channel-wise reweighting ensures that every channel contributes to the final representation in proportion to its importance (a sketch follows this list).
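A minimal PyTorch sketch of the three steps above; the class name is illustrative, and the adaptive rule for choosing k (with γ = 2 and b = 1) follows the ECA-Net paper:

```python
import math
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    """Efficient channel attention: global average pooling, a 1-D
    convolution of kernel size k across channels, a Sigmoid to obtain
    per-channel weights, and channel-wise reweighting of the input."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1            # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = self.pool(x)                      # (B, C, 1, 1): global average pooling
        y = y.squeeze(-1).transpose(1, 2)     # (B, 1, C): channels as a 1-D sequence
        y = self.conv(y)                      # local cross-channel interaction
        w = torch.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1) weights
        return x * w                          # reweight each input channel
```

In ECAHead, a module of this form is incorporated into the detection head so that high-dimensional channel features are reweighted before prediction.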
4. Experiments and Results Analysis
4.1. Experimental Environment
4.2. Datasets
4.3. Evaluation Metrics
4.3.1. Precision
4.3.2. Recall
4.3.3. Mean Average Precision
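For reference, the standard definitions underlying Sections 4.3.1–4.3.3, where TP, FP, and FN denote true positives, false positives, and false negatives, P(R) is the precision–recall curve, and N is the number of object classes:

```latex
\begin{align*}
  \mathrm{Precision} &= \frac{TP}{TP + FP}, &
  \mathrm{Recall}    &= \frac{TP}{TP + FN}, \\
  \mathrm{AP}        &= \int_{0}^{1} P(R)\,\mathrm{d}R, &
  \mathrm{mAP}       &= \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i .
\end{align*}
```

mAP@0.5 in the tables below is the mean average precision with the IoU threshold for a correct detection set to 0.5.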
4.3.4. FLOPs
4.3.5. FPS
4.4. Parameter Setting and Network Training
4.4.1. Parameter Setting
4.4.2. Network Training
4.4.3. Ablation Experiments
4.4.4. Visualization Experiments
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Options | Configuration |
---|---|
Operating System | Ubuntu |
CPU | Intel Xeon E5-2680 v4 |
GPU | NVIDIA GeForce RTX 3060 |
Memory | 14 GB |
CUDA | 11.1 |
PyTorch version | 1.10.0 |
Class Label | Number of Images |
---|---|
aircraft | 446 |
playground | 189 |
overpass | 176 |
oil tank | 165 |
Parameters | Value |
---|---|
weights | yolov5s.pt |
division ratio | 7:2:1 (train:val:test) |
optimizer | SGD |
batch size | 16 |
epochs | 100 |
Method | Precision | Recall | mAP@0.5 | FLOPs/G |
---|---|---|---|---|
YOLOv5s | 0.930 | 0.939 | 0.950 | 15.8 |
+front-2 | 0.980 | 0.934 | 0.970 | 15.9 |
+behind-2 | 0.968 | 0.950 | 0.965 | 16.4 |
+backbone-4 | 0.968 | 0.939 | 0.956 | 16.4 |
Method | Precision | Recall | mAP@0.5 | FLOPs/G |
---|---|---|---|---|
YOLOv5s(CBS) | 0.930 | 0.939 | 0.950 | 15.8 |
+CBR | 0.959 | 0.937 | 0.958 | 15.8 |
+CBM | 0.953 | 0.948 | 0.967 | 15.8 |
Method | Precision | Recall | mAP@0.5 | FLOPs/G |
---|---|---|---|---|
YOLOv5s | 0.930 | 0.939 | 0.950 | 15.8 |
+ECAHead-s | 0.964 | 0.924 | 0.941 | 15.8 |
+ECAHead-m | 0.946 | 0.973 | 0.973 | 15.8 |
+ECAHead-l | 0.954 | 0.942 | 0.969 | 15.8 |
+ECAHead-a | 0.944 | 0.938 | 0.957 | 15.8 |
Method | Precision | Recall | mAP@0.5 | FLOPs/G |
---|---|---|---|---|
YOLOv5s | 0.930 | 0.939 | 0.950 | 15.8 |
+ECAHead | 0.946 | 0.973 | 0.973 | 15.8 |
+SEHead | 0.987 | 0.947 | 0.964 | 15.8 |
+CBAMHead | 0.957 | 0.952 | 0.962 | 15.8 |
+SAHead | 0.954 | 0.931 | 0.956 | 15.8 |
Method | Precision | Recall | mAP@0.5 | FLOPs/G | FPS/(frame/s) |
---|---|---|---|---|---|
YOLOv5s | 0.930 | 0.939 | 0.950 | 15.8 | 76.8 |
+GC-C3 | 0.968 | 0.950 | 0.965 | 16.4 | 43.5 |
+CBM | 0.953 | 0.948 | 0.967 | 15.8 | 69.1 |
+ECAHead | 0.946 | 0.973 | 0.973 | 15.8 | 74.7 |
YOLO-GCRS | 0.983 | 0.947 | 0.977 | 16.4 | 42.6 |
Method | Precision | Recall | mAP@0.5 |
---|---|---|---|
YOLOv5s | 0.930 | 0.939 | 0.950 |
YOLOv7-tiny | 0.953 | 0.957 | 0.957 |
YOLOv8s | 0.871 | 0.864 | 0.902 |
YOLO-GCRS | 0.983 | 0.947 | 0.977 |
Method | Precision | Recall | mAP@0.5 |
---|---|---|---|
YOLOv5s | 0.935 | 0.927 | 0.942 |
YOLO-GCRS | 0.950 | 0.933 | 0.955 |