AgeDETR: Attention-Guided Efficient DETR for Space Target Detection
Abstract
1. Introduction
- We introduce AgeDETR for space target detection, which significantly improves detection performance by incorporating advanced attention mechanisms into the backbone network and encoder. The model comprises four principal components: the EF-ResNet18 backbone network, the AGFE module, the AGFF module, and a sophisticated decoder that ensures precise target localization and classification. Together, these components effectively tackle the common challenges in space target detection.
- To tackle illumination variability encountered in space target detection, we design the EF-ResNet18 architecture as the backbone network. This architecture combines the FasterNet block [49] and EMA [50] to form the EF-Block module, which is seamlessly integrated into ResNet18 [51] (a simplified sketch of such a block is given after this list). This design significantly boosts the feature extraction capability of the backbone network while keeping computation efficient. With these enhancements, EF-ResNet18 provides stable target detection under varying lighting conditions and improves the precision and robustness of the detection results.
- To overcome the challenges posed by complex space backgrounds, we propose the AGFE module. This module meticulously integrates two complementary attention mechanisms, specifically designed to enhance target feature recognition and optimize the extraction of critical information. The AGFE module also employs a single-layer Transformer encoder to efficiently process high-level features, simplifying computational steps. This strategy significantly improves the accuracy of the model in recognizing and locating targets against complex backgrounds and optimizes computational efficiency.
- To address the issues associated with diverse target sizes, we introduce the AGFF module. Unlike traditional multi-scale fusion methods designed for natural images, AGFF employs an attention-guided strategy: features from adjacent layers are fused under the guidance of high-level attention, which strengthens the fused representation (a generic sketch of attention-guided adjacent-layer fusion also follows this list). Consequently, the module improves the model's ability to detect and classify targets of varying scales accurately, thereby boosting overall detection performance.
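To make the first contribution more concrete, the following is a minimal PyTorch sketch of an EF-Block-like unit: a FasterNet-style partial convolution (a 3×3 convolution applied to only a fraction of the channels, followed by a pointwise MLP) with a lightweight SE-style channel gate standing in for the EMA module [50]. The class names, the partial ratio, and the gate design are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """FasterNet-style partial convolution: a 3x3 conv over only the first
    `dim // ratio` channels; the remaining channels pass through unchanged."""
    def __init__(self, dim: int, ratio: int = 4):
        super().__init__()
        self.dim_conv = dim // ratio
        self.dim_id = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_id], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

class ChannelGate(nn.Module):
    """SE-style channel gate used here only as a simple stand-in for EMA [50]."""
    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class EFBlock(nn.Module):
    """Hypothetical EF-Block: partial conv -> pointwise MLP -> attention gate,
    wrapped in a residual connection as in a ResNet18 basic block."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PartialConv(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )
        self.attn = ChannelGate(dim)

    def forward(self, x):
        return x + self.attn(self.mlp(self.pconv(x)))

if __name__ == "__main__":
    feats = torch.randn(1, 64, 80, 80)       # an early-stage ResNet18 feature map
    print(EFBlock(64)(feats).shape)           # torch.Size([1, 64, 80, 80])
```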
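Likewise, a generic attention-guided fusion of two adjacent feature levels can be sketched as below: the higher-level (coarser) feature is projected and upsampled, and a gate derived from it modulates the finer-level feature before the two are merged. This is only a plausible illustration of the attention-guided idea; the module name, channel projections, and gating choice are assumptions and may differ from the actual AGFF design in Section 3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidedFusion(nn.Module):
    """Illustrative adjacent-level fusion: the high-level map is upsampled,
    a 1x1 conv + sigmoid derived from it gates the low-level map, and the
    two are summed. Channel widths are unified with 1x1 projections."""
    def __init__(self, c_low: int, c_high: int, c_out: int):
        super().__init__()
        self.proj_low = nn.Conv2d(c_low, c_out, 1)
        self.proj_high = nn.Conv2d(c_high, c_out, 1)
        self.gate = nn.Sequential(nn.Conv2d(c_out, c_out, 1), nn.Sigmoid())

    def forward(self, low, high):
        high = self.proj_high(high)
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        low = self.proj_low(low)
        return low * self.gate(high) + high   # attention-weighted merge

if __name__ == "__main__":
    p3 = torch.randn(1, 128, 80, 80)   # finer level
    p4 = torch.randn(1, 256, 40, 40)   # coarser level
    print(AttentionGuidedFusion(128, 256, 256)(p3, p4).shape)  # [1, 256, 80, 80]
```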
2. Related Work
2.1. Current State of Object Detection Algorithms
2.2. Attention Mechanism
- Channel attention. Channel attention boosts performance by dynamically adjusting the importance of each channel and selectively focusing on relevant features. Hu et al. [57] introduced the concept of channel attention with SENet, centered around the Squeeze-and-Excitation (SE) block. This block collects global information, captures channel-wise relationships, and recalibrates channel-wise feature responses to improve feature discriminability. Nevertheless, the SE block captures global information solely through global average pooling, which limits its ability to model complex interactions between channels. To address this issue, Gao et al. [58] introduced a Global Second-order Pooling block into the squeeze module. This block models high-order statistics and synthesizes global information, enhancing the expressive power of the network, although at the cost of higher computational demands. Lee et al. [59] developed the lightweight Style-based Recalibration Module (SRM) to overcome the limitations of existing channel attention methods. SRM recalibrates CNN feature maps by extracting and integrating style information from each channel: style pooling derives style details from the channels, and channel-independent integration assigns recalibration weights. Despite its effectiveness, the focus on style information might not generalize well to tasks requiring broader context. Wang et al. [60] introduced the Efficient Channel Attention (ECA) block, which uses a 1D convolution to model interactions between channels instead of relying on dimensionality reduction, significantly lowering computational cost (a minimal ECA-style sketch is given after this list). Still, while this method improves computational efficiency, it may miss more complex channel interactions that higher-dimensional convolutions or more sophisticated attention mechanisms could capture.
- Spatial attention. Spatial attention is a mechanism that adaptively selects and emphasizes specific spatial regions. By focusing on areas of interest, this approach optimizes the extraction of relevant features and improves overall performance. For instance, Mnih et al. [61] proposed the Recurrent Attention Model (RAM), which sequentially focuses on different regions. This approach enables the model to process one part of the input at a time and decide on the subsequent focus point, mirroring the human method of scanning visual scenes. Although the RAM demonstrates effectiveness in tasks requiring sequential attention and context accumulation, it relies heavily on sequential processing, which might limit its application in real-time tasks. From another perspective, CNNs excel at processing image data due to their translation equivariance. Nonetheless, they lack rotation, scaling, and warping invariance, limiting their robustness in certain scenarios. To address these limitations, Jaderberg et al. [62] proposed Spatial Transformer Networks (STNs), the first attention mechanism explicitly designed to predict relevant regions and provide transformation invariance to deep neural networks. While STNs enhance the ability to focus on important regions and learn these invariances, they also introduce increased model complexity and potential computational overhead. The limitations of both the RAM and STNs suggest that while each addresses specific challenges, neither fully overcomes the trade-offs among efficiency, accuracy, and complexity.
- Hybrid attention. Hybrid attention integrates channel and spatial attention mechanisms for a holistic understanding of image features. Woo et al. [53] proposed a hybrid attention mechanism known as CBAM. By sequentially combining channel and spatial attention, CBAM leverages the spatial and cross-channel relationships of features to guide the network on what and where to focus (a compact CBAM-style sketch appears after this list). Despite its effectiveness in enhancing feature selection, CBAM increases model complexity. The Residual Attention Network [63] highlights the importance of informative features across spatial and channel dimensions. This network uses a bottom-up structure with multi-level convolutional layers to generate a three-dimensional attention map spanning height, width, and channel, but it suffers from high computational cost and limited receptive field expansion. To address these challenges, Park et al. [64] proposed the Bottleneck Attention Module (BAM) to enhance network representational capability. BAM uses dilated convolutions to broaden the receptive field and a bottleneck structure to limit computational cost, adjusting features in both the channel and spatial dimensions to enhance feature representation. Even so, dilated convolutions struggle to capture long-range contextual information and cross-channel relationships. Liu et al. [65] proposed Cross-scale Attention, a mechanism that enables dynamic feature interaction across different scales. By integrating information from multiple scales, this approach makes feature representation more robust to scale variations, though fusing features across scales adds computational complexity and can increase inference time. Ouyang et al. [50] proposed EMA, a multi-scale attention mechanism designed to focus on relevant features across different scales while retaining information within each channel and reducing computational overhead. By dynamically allocating attention to various scales, EMA integrates feature information from different levels, enabling the model to detect local details while perceiving global context. This fusion of multi-scale features has proven crucial to advancing computer vision research, particularly in handling complex visual scenes.
- Self-attention mechanism. The self-attention mechanism, initially introduced by Vaswani et al. [66] in natural language processing, has significantly impacted computer vision, especially object detection. It is highly effective at capturing long-range dependencies and global context, which are essential for understanding complex visual scenes (a minimal self-attention sketch is given after this list). Specifically, the Vision Transformer (ViT) presented by Dosovitskiy et al. [67] demonstrates the power of self-attention in computer vision: by treating image patches as tokens and applying a Transformer to these sequences, ViT achieves performance competitive with traditional CNNs on large-scale image classification. Even so, ViT encounters challenges in processing high-resolution images due to its reliance on a fixed number of tokens, which limits its ability to efficiently handle finer details. Additionally, Wang et al. [68] developed Non-Local Neural Networks, which apply self-attention in video and image tasks to capture long-range dependencies more effectively than convolutional layers. By incorporating global information into the feature representation, this method has demonstrated stronger performance on various tasks, including video classification and object detection. Despite these improvements, Non-Local Neural Networks still face challenges in computational efficiency due to the quadratic complexity of self-attention. Carion et al. pioneered the use of self-attention in object detection with DETR, which processes entire images as sequences with a Transformer architecture, directly modeling relationships between distant regions. Although DETR achieves strong performance, it converges slowly during training and requires large datasets to generalize effectively. Building on DETR, Zhu et al. introduced Deformable DETR, which improves the original with deformable attention modules that dynamically adjust receptive fields based on the input features, making the model more efficient on high-resolution images and improving detection accuracy. Nevertheless, Deformable DETR still faces challenges in balancing complexity with accuracy, especially in scenarios requiring real-time processing. Over the years, further improvements have been made to the efficiency and effectiveness of self-attention mechanisms. For instance, Yin et al. [69] introduced Disentangled Non-Local Neural Networks, which disentangle non-local operations to boost the representational power of self-attention, enabling more accurate capture of long-range dependencies. Even with these advancements, challenges remain in optimizing the trade-off between computational cost and the accuracy of long-range dependency modeling.
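As a concrete reference for the channel-attention paragraph above, the ECA idea [60] reduces to a global average pool followed by a 1D convolution across the channel dimension. The sketch below is a minimal version with a fixed kernel size; the original paper derives the kernel size adaptively from the channel count.

```python
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient Channel Attention: per-channel descriptors from global average
    pooling interact through a small 1D convolution (no dimensionality
    reduction), and the sigmoid output rescales the input channels."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, 1, c)           # (B, 1, C): channels as a sequence
        y = self.sigmoid(self.conv(y)).view(b, c, 1, 1)
        return x * y                              # channel-wise recalibration

if __name__ == "__main__":
    x = torch.randn(2, 256, 20, 20)
    print(ECALayer()(x).shape)                    # torch.Size([2, 256, 20, 20])
```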
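The hybrid-attention paragraph can similarly be grounded with a compact CBAM-style module [53]: channel attention from pooled descriptors passed through a shared MLP, followed by spatial attention from a 7×7 convolution over channel-wise average and max maps. This is a simplified sketch of the general idea, not the configuration used in AgeDETR.

```python
import torch
import torch.nn as nn

class CBAMLite(nn.Module):
    """Sequential channel + spatial attention in the spirit of CBAM [53]."""
    def __init__(self, dim: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: shared MLP applied to avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1, bias=False),
        )
        # Spatial attention: 7x7 conv over stacked channel-avg and channel-max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        x = x * self.sigmoid(self.mlp(avg) + self.mlp(mx))        # what to focus on
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * self.sigmoid(self.spatial(s))                   # where to focus

if __name__ == "__main__":
    x = torch.randn(2, 128, 40, 40)
    print(CBAMLite(128)(x).shape)                 # torch.Size([2, 128, 40, 40])
```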
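Finally, the self-attention mechanism discussed above can be written in a few lines. The sketch below flattens a feature map into a token sequence and applies standard multi-head scaled dot-product attention, the core operation behind ViT [67] and the DETR family; the wrapper class and its residual-plus-norm layout are illustrative, not a specific published encoder.

```python
import torch
import torch.nn as nn

class SelfAttention2D(nn.Module):
    """Flattens a (B, C, H, W) feature map into HW tokens and applies
    multi-head scaled dot-product self-attention over them."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)             # (B, HW, C)
        out, _ = self.attn(tokens, tokens, tokens)        # global pairwise interactions
        out = self.norm(tokens + out)                      # residual + norm, encoder-style
        return out.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    x = torch.randn(1, 256, 20, 20)
    print(SelfAttention2D(256)(x).shape)                   # torch.Size([1, 256, 20, 20])
```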
3. Method
3.1. EF-ResNet18
3.2. AGFE Module
3.2.1. Self-Attention-Guided Feature Enhancement (SaGFE) Module
3.2.2. Channel Attention-Guided Feature Enhancement (CaGFE) Module
3.3. AGFF Module
4. Experiments
4.1. Datasets and Evaluation Measures
4.2. Experimental Settings
4.3. Comparisons with Other Methods
4.4. Ablation Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Su, S.; Niu, W.; Li, Y.; Ren, C.; Peng, X.; Zheng, W.; Yang, Z. Dim and Small Space-Target Detection and Centroid Positioning Based on Motion Feature Learning. Remote Sens. 2023, 15, 2455. [Google Scholar] [CrossRef]
- Wang, S.; Zhang, K.; Chao, L.; Chen, G.; Xia, Y.; Zhang, C. Investigating the Feasibility of Using Satellite Rainfall for the Integrated Prediction of Flood and Landslide Hazards over Shaanxi Province in Northwest China. Remote Sens. 2023, 15, 2457. [Google Scholar] [CrossRef]
- Zhang, H.; Gao, J.; Xu, Q.; Ran, L. Applying Time-Expended Sampling to Ensemble Assimilation of Remote-Sensing Data for Short-Term Predictions of Thunderstorms. Remote Sens. 2023, 15, 2358. [Google Scholar] [CrossRef]
- Jiang, C.; Zhao, D.; Zhang, Q.; Liu, W. A Multi-GNSS/IMU Data Fusion Algorithm Based on the Mixed Norms for Land Vehicle Applications. Remote Sens. 2023, 15, 2439. [Google Scholar] [CrossRef]
- Saynisch, J.; Irrgang, C.; Thomas, M. On the use of satellite altimetry to detect ocean circulation’s magnetic signals. J. Geophys. Res. Ocean. 2018, 123, 2305–2314. [Google Scholar] [CrossRef]
- Kuznetsov, V.D.; Sinelnikov, V.M.; Alpert, S.N. Yakov Alpert: Sputnik-1 and the first satellite ionospheric experiment. Adv. Space Res. 2015, 55, 2833–2839. [Google Scholar] [CrossRef]
- Buchs, R.; Florin, M.V. Collision Risk from Space Debris: Current Status, Challenges and Response Strategies; International Risk Governance Center: Geneva, Switzerland, 2021. [Google Scholar]
- Johnson, N.L. Orbital debris: The growing threat to space operations. In Proceedings of the 33rd Annual Guidance and Control Conference, Breckenridge, CO, USA, 5–10 February 2010. Number AAS 10-011. [Google Scholar]
- Tao, H.; Che, X.; Zhu, Q.; Li, X. Satellite In-Orbit Secondary Collision Risk Assessment. Int. J. Aerosp. Eng. 2022, 2022, 6358188. [Google Scholar] [CrossRef]
- Kennewell, J.A.; Vo, B.N. An overview of space situational awareness. In Proceedings of the 16th International Conference on Information Fusion, Istanbul, Turkey, 9–12 July 2013; pp. 1029–1036. [Google Scholar]
- McCall, G.H.; Darrah, J.H. Space Situational Awareness: Difficult, Expensive-and Necessary. Air Space Power J. 2014, 28, 6. [Google Scholar]
- Meng, W.; Jin, T.; Zhao, X. Adaptive method of dim small object detection with heavy clutter. Appl. Opt. 2013, 52, D64–D74. [Google Scholar] [CrossRef]
- Han, J.; Liu, S.; Qin, G.; Zhao, Q.; Zhang, H.; Li, N. A Local Contrast Method Combined with Adaptive Background Estimation for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1442–1446. [Google Scholar] [CrossRef]
- Duk, V.; Rosenberg, L.; Ng, B.W.H. Target Detection in Sea-Clutter Using Stationary Wavelet Transforms. IEEE Trans. Aerosp. Electron. Syst. 2017, 53, 1136–1146. [Google Scholar] [CrossRef]
- Smith, J.; Doe, J.; Zhang, W. Temporal Filtering for Enhanced Space Target Detection. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 1234–1245. [Google Scholar]
- Liu, J.; Zhang, J.; Chen, W. Dim and Small Target Detection Based on Improved Spatio-Temporal Filtering. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 3456–3467. [Google Scholar]
- Liu, J.; Zhang, J.; Chen, W. Infrared Moving Small Target Detection Based on Space–Time Combination in Complex Scenes. Remote Sens. 2023, 15, 5380. [Google Scholar] [CrossRef]
- Wang, Q.; Gu, Y.; Tuia, D. Discriminative Multiple Kernel Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2016, 54, 3912–3924. [Google Scholar] [CrossRef]
- Wang, Q.; Wang, M.; Huang, J.; Liu, T.; Shen, T.; Gu, Y. Unsupervised Domain Adaptation for Cross-Scene Multispectral Point Cloud Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5705115. [Google Scholar] [CrossRef]
- Wang, Q.; Wang, M.; Zhang, Z.; Song, J.; Zeng, K.; Shen, T.; Gu, Y. Multispectral Point Cloud Superpoint Segmentation. Sci. China Technol. Sci. 2023, 67, 1270–1281. [Google Scholar] [CrossRef]
- Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-Infrared Object Detection by Reducing Cross-Modality Redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
- Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, L.; Wang, G.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1, pp. 326–366. [Google Scholar]
- Xue, D.; Sun, J.; Hu, Y.; Zheng, Y.; Zhu, Y.; Zhang, Y. Dim small target detection based on convolutional neural network in star image. Multimed. Tools Appl. 2020, 79, 4681–4698. [Google Scholar] [CrossRef]
- Xiang, Y.; Xi, J.; Cong, M.; Yang, Y.; Ren, C.; Han, L. Space debris detection with fast grid-based learning. In Proceedings of the 2020 IEEE 3rd International Conference of Safe Production and Informatization (IICSPI), Chongqing City, China, 28–30 November 2020; pp. 205–209. [Google Scholar] [CrossRef]
- Xi, J.; Xiang, Y.; Ersoy, O.K.; Cong, M.; Wei, X.; Gu, J. Space Debris Detection Using Feature Learning of Candidate Regions in Optical Image Sequences. IEEE Access 2020, 8, 150864–150877. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Jocher, G.; Ultralytics. YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 5 September 2023).
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022. [Google Scholar]
- Cao, X.; Yuan, P.; Feng, B.; Niu, K. DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast Convergence of DETR with Spatially Modulated Co-Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
- Sun, Z.; Cao, S.; Yang, Y.; Kitani, K. Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
- Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-end object detection with dynamic attention. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 2968–2977. [Google Scholar]
- Cao, X.; Yuan, P.; Feng, B.; Niu, K. Cf-DETR: Coarse-to-fine transformers for end-to-end object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022. [Google Scholar]
- JustIC03. MFDS-DETR: Multi-level Feature Fusion with Deformable Self-Attention for White Blood Cell Detection. arXiv 2022, arXiv:2212.11659. [Google Scholar]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
- Pauly, L.; Jamrozik, M.L.; Del Castillo, M.O.; Borgue, O.; Singh, I.P.; Makhdoomi, M.R.; Christidi-Loumpasefski, O.O.; Gaudilliere, V.; Martinez, C.; Rathinam, A.; et al. Lessons from a Space Lab—An Image Acquisition Perspective. Int. J. Aerosp. Eng. 2023, 2023, 9944614. [Google Scholar] [CrossRef]
- Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.; Chan, S. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 17–24 June 2023; pp. 12021–12031. [Google Scholar] [CrossRef]
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving End-to-End Object Detector with Dense Prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
- Li, F.; Zeng, A.; Liu, S.; Zhang, H.; Li, H.; Zhang, L.; Ni, L.M. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25 April 2022. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
- Gao, Z.L.; Xie, J.T.; Wang, Q.L.; Li, P.H. Global Second-Order Pooling Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3019–3028. [Google Scholar]
- Lee, H.; Kim, H.E.; Nam, H. SRM: A Style-Based Recalibration Module for Convolutional Neural Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1854–1862. [Google Scholar]
- Wang, Q.L.; Wu, B.G.; Zhu, P.F.; Li, P.H.; Zuo, W.M.; Hu, Q.H. ECA-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
- Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 13–18 December 2014; Volume 2, pp. 2204–2212. [Google Scholar]
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015; Volume 28, pp. 37–45. [Google Scholar]
- Wang, F.; Jiang, M.Q.; Qian, C.; Yang, S.; Li, C.; Zhang, H.G.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6450–6458. [Google Scholar]
- Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck Attention Module. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018. [Google Scholar]
- Liu, S.; Qi, X.; Qin, H.; Shi, J.; Jia, J. CBNet: A Novel Composite Backbone Network Architecture for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10512–10521. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, Online, 3–7 May 2021. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Yin, M.; Yao, Z.; Cao, Y.; Li, X.; Zhang, Z.; Lin, S.; Hu, H. Disentangled non-local neural networks. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2020; Volume 12360, pp. 191–207. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
| Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) | Parameters (M) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv5s | 94.1 | 89.4 | 95.9 | 81.3 | 9.1 | 23.8 |
| YOLOv6s | 93.1 | 87.8 | 94.5 | 81.6 | 16.3 | 44.0 |
| YOLOv8s | 94.9 | 92.2 | 96.9 | 83.9 | 11.1 | 28.5 |
| YOLOv9c | 96.9 | 94.2 | 97.8 | 86.9 | 25.3 | 102.1 |
| RT-DETR | 95.5 | 91.8 | 94.6 | 81.2 | 20.1 | 58.6 |
| AgeDETR | 97.9 | 96.0 | 97.9 | 85.2 | 15.1 | 47.9 |
| EF-ResNet18 | AGFE | AGFF | Precision (%) | Recall (%) | mAP50 (%) | mAP50:95 (%) | Parameters (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| × | × | × | 93.3 | 90.7 | 92.8 | 78.3 | 15.3 | 42.1 |
| ✓ | × | × | 97.2 | 95.2 | 96.6 | 83.4 | 12.3 | 36.3 |
| ✓ | ✓ | × | 97.1 | 95.8 | 96.8 | 83.7 | 13.2 | 36.4 |
| ✓ | ✓ | ✓ | 97.9 | 96.0 | 97.9 | 85.2 | 15.1 | 47.9 |