MCG-RTDETR: Multi-Convolution and Context-Guided Network with Cascaded Group Attention for Object Detection in Unmanned Aerial Vehicle Imagery
Abstract
1. Introduction
- We integrated dual convolution and deformable convolution into the backbone of the original RT-DETR. These convolution operations better capture complex feature information and geometric deformations across varied scenes and object sizes.
- We incorporated a cascaded group attention module into the encoder to focus on critical feature regions while suppressing irrelevant background information, and we replaced the conventional downsampling operation with context-guided downsampling to preserve the contextual information of targets.
- To tackle the challenges posed by varying object scales and dense scenes, we optimized the neck structure for better feature fusion and adjusted the output layers of the detection heads to suit small objects.
- We rigorously validated the above enhancements on the VisDrone2019 dataset. The results demonstrate significant improvements in both quantitative and qualitative evaluations, affirming the efficacy of our method and its potential in practical scenarios.
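The dual-convolution design cited above (Zhong et al., DualConv) pairs a 3×3 grouped convolution with a 1×1 pointwise convolution applied to the same input, summing the two branches so that spatial detail and cross-channel mixing are captured at reduced cost. The following is a naive NumPy sketch of that two-branch structure only, not the authors' implementation; all shapes and weights here are illustrative.

```python
import numpy as np

def conv2d(x, w, pad):
    # Naive 2-D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k).
    c_out, c_in, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1], x.shape[2]
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    out[o] += w[o, i, dy, dx] * xp[i, dy:dy + H, dx:dx + W]
    return out

def dual_conv(x, w3, w1, groups):
    # DualConv-style block: a 3x3 GROUPED branch plus a 1x1 pointwise
    # branch over the SAME input, summed.
    # w3: (C_out, C_in // groups, 3, 3), w1: (C_out, C_in, 1, 1).
    c_in, c_out = x.shape[0], w3.shape[0]
    gi, go = c_in // groups, c_out // groups
    out = np.zeros((c_out, x.shape[1], x.shape[2]))
    for g in range(groups):  # grouped 3x3 branch: each group sees only its channel slice
        out[g * go:(g + 1) * go] = conv2d(
            x[g * gi:(g + 1) * gi], w3[g * go:(g + 1) * go], pad=1)
    out += conv2d(x, w1, pad=0)  # pointwise 1x1 branch mixes all channels
    return out
```

With all-ones weights and input, an interior pixel receives 2 × 9 = 18 from the grouped branch (two channels per group, nine taps) plus 4 from the pointwise branch, which makes the channel-sparse/channel-dense split of the two branches easy to verify by hand.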
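Cascaded group attention (from the EfficientViT work cited in the references) splits the channels across heads and feeds each head's output into the next head's input, so later heads refine what earlier heads computed instead of duplicating it. A minimal single-scale NumPy sketch of that cascade, with illustrative weight shapes rather than the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cascaded_group_attention(x, wq, wk, wv, heads):
    # x: (N, C) token features; channels are split across heads,
    # and each head also receives the previous head's output (the cascade).
    # wq, wk, wv: (heads, d, d) per-head projections, d = C // heads.
    N, C = x.shape
    d = C // heads
    outs, carry = [], np.zeros((N, d))
    for h in range(heads):
        feat = x[:, h * d:(h + 1) * d] + carry        # cascade step
        q, k, v = feat @ wq[h], feat @ wk[h], feat @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(d))          # (N, N) attention
        carry = attn @ v                              # this head's output
        outs.append(carry)
    return np.concatenate(outs, axis=1)               # (N, C)
```

Because each head operates on only C/heads channels, the per-head attention is cheap, while the carried-forward output keeps the heads from being independent as in plain multi-head attention.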
2. Related Work
2.1. General Object Detection
2.2. Object Detection in UAV Images
3. Proposed Method
3.1. Overall Framework
3.2. Improvement of Feature Extractor
3.3. Improvement of Efficient Hybrid Encoder
3.4. Predict Head
3.5. IoU-Aware Query Selection
4. Experiments and Results
4.1. Dataset and Implementation Details
4.2. Evaluation Metrics
4.3. Ablation Experiments
4.4. Comparisons of Performance
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 91–124. [Google Scholar] [CrossRef]
- Liu, Z.; Rodriguez-Opazo, C.; Teney, D.; Gould, S. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2125–2134. [Google Scholar]
- Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar]
- Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 341–357. [Google Scholar]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. arXiv 2023, arXiv:2304.00501. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
- Zhong, J.; Chen, J.; Mian, A. DualConv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9528–9535. [Google Scholar] [CrossRef] [PubMed]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
- Wu, T.; Tang, S.; Zhang, R.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2021, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
- Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Jocher, G. YOLOv5 by Ultralytics (Version 7.0). 2020. Available online: https://zenodo.org/records/7347926 (accessed on 18 December 2023).
- Solawetz, J. What is YOLOv8? The Ultimate Guide. 2023. Available online: https://blog.roboflow.com/whats-new-in-yolov8/ (accessed on 18 December 2023).
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
- Yu, C.S.; Shin, Y. SAR ship detection based on improved YOLOv5 and BiFPN. ICT Express 2023, 10, 28–33. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
- Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22. [Google Scholar] [CrossRef]
- Lin, Q.; Ding, Y.; Xu, H.; Lin, W.; Li, J.; Xie, X. Ecascade-RCNN: Enhanced cascade RCNN for multi-scale object detection in UAV images. In Proceedings of the International Conference on Automation, Robotics and Applications, Prague, Czech Republic, 4–6 February 2021; pp. 268–272. [Google Scholar]
- Chen, C.; Gong, W.; Chen, Y.; Li, W. Object detection in remote Sensing images based on a scene-contextual feature pyramid network. Remote Sens. 2019, 11, 339. [Google Scholar] [CrossRef]
- Zhang, X.; Izquierdo, E.; Chandramouli, K. Dense and small object detection in UAV vision based on cascade network. In Proceedings of the IEEE International Conference on Computer Vision Workshop, Seoul, Republic of Korea, 27–28 October 2019; pp. 118–126. [Google Scholar]
- Liu, Y.; Ding, Z.; Cao, Y.; Chang, M. Multi-scale feature fusion UAV image object detection method based on dilated convolution and attention mechanism. In Proceedings of the International Conference on Information Technology: IoT and Smart City, Xi’an, China, 25–27 December 2020; pp. 125–132. [Google Scholar]
- Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1758–1770. [Google Scholar] [CrossRef]
- Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. UAV-YOLO: Small object detection on unmanned aerial vehicle perspective. Sensors 2020, 20, 2238. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z.; Liu, Y.; Liu, T.; Lin, Z.; Wang, S. DAGN: A real-time UAV remote sensing image vehicle detection framework. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1884–1888. [Google Scholar] [CrossRef]
- Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. In Proceedings of the IEEE International Conference on Multimedia and Expo, Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
- Zhang, P.; Zhong, Y.; Li, X. SlimYOLOv3: Narrower, faster and better for real-time UAV applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 37–45. [Google Scholar]
- Wang, F.; Wang, H.; Qin, Z.; Tang, J. UAV target detection algorithm based on improved YOLOv8. IEEE Access 2023, 11, 116534–116544. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14420–14430. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
- Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
- Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
- Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the IEEE International Conference on Image Processing, Bordeaux, France, 16–19 October 2022; pp. 966–970. [Google Scholar]
- Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
Methods | AP | AP50 | AP75 | APS | APM | APL | GFLOPs |
---|---|---|---|---|---|---|---|
RT-DETR-r18 | 25.0 | 52.4 | 20.2 | 21.7 | 46.1 | 66.7 | 57.0 |
RT-DETR-DualConv | 24.8 | 52.4 | 19.7 | 21.8 | 44.8 | 69.5 | 47.3 |
RT-DETR-DualConv-CGD | 24.6 | 51.5 | 19.8 | 21.2 | 46.0 | 70.6 | 52.1 |
RT-DETR-DualConv-P2 | 27.7 | 55.1 | 23.9 | 24.5 | 47.9 | 64.5 | 68.6 |
RT-DETR-DualConv-CGD-P2 | 28.1 | 55.7 | 24.4 | 24.9 | 48.7 | 69.5 | 75.1 |
RT-DETR-DualConv-CGD-P3 | 28.9 | 57.1 | 25.3 | 25.7 | 49.3 | 69.0 | 90.8 |
RT-DETR-DualConv-DeConv-CGD-P2 | 28.0 | 55.5 | 24.4 | 24.7 | 48.0 | 66.2 | 73.9 |
RT-DETR-DualConv-DeConv-CGD-P2-CGAM | 28.8 | 57.1 | 25.0 | 25.3 | 50.4 | 68.3 | 74.0 |
RT-DETR-DualConv-DeConv-CGD-P3 | 29.1 | 57.5 | 25.4 | 25.8 | 49.8 | 70.3 | 89.6 |
RT-DETR-DualConv-DeConv-CGD-P3-CGAM | 29.7 | 58.2 | 26.3 | 26.2 | 51.0 | 73.5 | 89.7 |
Methods | AR1 | AR10 | AR100 | ARS | ARM | ARL | Params (M) |
---|---|---|---|---|---|---|---|
RT-DETR-r18 | 3.5 | 19.4 | 34.4 | 31.5 | 54.5 | 72.6 | 19.88 |
RT-DETR-DualConv | 3.4 | 19.2 | 34.4 | 31.6 | 53.5 | 77.8 | 15.88 |
RT-DETR-DualConv-CGD | 3.5 | 19.3 | 33.9 | 31.0 | 54.4 | 78.3 | 18.33 |
RT-DETR-DualConv-P2 | 3.7 | 20.7 | 37.9 | 35.2 | 56.3 | 72.6 | 14.60 |
RT-DETR-DualConv-CGD-P2 | 3.6 | 20.9 | 38.2 | 35.5 | 56.9 | 77.4 | 17.20 |
RT-DETR-DualConv-CGD-P3 | 3.7 | 21.4 | 38.8 | 36.0 | 57.7 | 75.7 | 19.55 |
RT-DETR-DualConv-DeConv-CGD-P2 | 3.6 | 20.8 | 38.2 | 35.7 | 55.8 | 73.5 | 20.46 |
RT-DETR-DualConv-DeConv-CGD-P2-CGAM | 3.6 | 21.1 | 39.1 | 36.3 | 58.8 | 73.9 | 20.29 |
RT-DETR-DualConv-DeConv-CGD-P3 | 3.7 | 21.5 | 38.6 | 35.8 | 58.0 | 77.0 | 22.81 |
RT-DETR-DualConv-DeConv-CGD-P3-CGAM | 3.7 | 21.7 | 39.2 | 36.3 | 59.4 | 80.4 | 22.64 |
Methods | AP | AP50 | AP75 | APS | APM | APL | GFLOPs | Params (M) | FPS |
---|---|---|---|---|---|---|---|---|---|
Faster R-CNN [5] | 24.3 | 39.6 | 25.9 | 15.4 | 36.4 | 45.0 | 208 | 41.39 | 38.2 |
RetinaNet [6] | 17.3 | 29.1 | 17.9 | 8.1 | 29.4 | 35.2 | 210 | 36.52 | 36.1 |
Cascade R-CNN [15] | 25.1 | 39.8 | 26.7 | 15.7 | 37.6 | 46.3 | 236 | 69.29 | 13.7 |
GFL [40] | 24.7 | 39.8 | 25.6 | 15.0 | 37.1 | 47.4 | 206 | 32.28 | 36.7 |
ATSS-dyhead [41,42] | 26.3 | 41.5 | 27.7 | 16.2 | 40.1 | 55.7 | 110 | 38.91 | 24.7 |
TOOD [43] | 26.3 | 41.9 | 27.5 | 16.8 | 38.5 | 49.0 | 199 | 32.04 | 34.7 |
RTMDET-tiny [44] | 19.9 | 33.2 | 20.2 | 10.0 | 31.7 | 42.9 | 8.03 | 4.88 | 90.5 |
YOLOX-tiny [45] | 18.9 | 34.5 | 18.3 | 11.9 | 27.7 | 29.6 | 7.58 | 5.04 | 235.7 |
YOLOv5 [19] | 19.1 | 32.9 | 19.0 | 9.9 | 30.4 | 38.6 | 7.1 | 2.50 | 235.5 |
YOLOv8 [20] | 19.7 | 33.7 | 19.6 | 10.3 | 31.1 | 38.7 | 8.1 | 3.01 | 223.0 |
RT-DETR-r18 [10] | 25.0 | 52.4 | 20.2 | 21.7 | 46.1 | 66.7 | 57.0 | 19.88 | 125.3 |
RT-DETR-r50 [10] | 25.6 | 54.6 | 19.7 | 22.6 | 45.5 | 67.0 | 129.6 | 41.97 | 75.3 |
MCG-RTDETR | 29.7 | 58.2 | 26.3 | 26.2 | 51.0 | 73.5 | 89.7 | 22.64 | 84.1 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yu, C.; Shin, Y. MCG-RTDETR: Multi-Convolution and Context-Guided Network with Cascaded Group Attention for Object Detection in Unmanned Aerial Vehicle Imagery. Remote Sens. 2024, 16, 3169. https://doi.org/10.3390/rs16173169