4.3.1. Results on DOTA
We evaluate the proposed method against other state-of-the-art approaches on the DOTA dataset. The results are presented in Table 1. Our method achieved a mAP of 76.34%, and with multiscale training and testing it achieved a mAP of 79.31%. Comparing the per-category metrics, our detector performed best for small vehicles, ships, and tennis courts, and achieved the second-best results for ground track fields, basketball courts, roundabouts, and helicopters.
Table 1.
Comparison to state-of-the-art methods on the DOTA-v1.0 dataset. R-101 denotes ResNet-101 (likewise for R-50 and R-152), RX-101 denotes ResNeXt-101 and H-104 denotes Hourglass-104. The best result is highlighted in bold, and the second-best result is underlined. * denotes multiscale training and multiscale testing.
Methods | Backbone | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FR-O [12] | R-101 | 79.09 | 69.12 | 17.17 | 63.49 | 34.20 | 37.16 | 36.20 | 89.19 | 69.60 | 58.96 | 49.40 | 52.52 | 46.69 | 44.80 | 46.30 | 52.93 |
RRPN [50] | R-101 | 88.52 | 71.20 | 31.66 | 59.30 | 51.85 | 56.19 | 57.25 | 90.81 | 72.84 | 67.38 | 56.69 | 52.84 | 53.08 | 51.94 | 53.58 | 61.01 |
RetinaNet-R [46] | R-101 | 88.92 | 67.67 | 33.55 | 56.83 | 66.11 | 73.28 | 75.24 | 90.87 | 73.95 | 75.07 | 43.77 | 56.72 | 51.05 | 55.86 | 21.46 | 62.02 |
CADNet [51] | R-101 | 87.80 | 82.40 | 49.40 | 73.50 | 71.10 | 63.50 | 76.60 | 90.90 | 79.20 | 73.30 | 48.40 | 60.90 | 62.00 | 67.00 | 62.20 | 69.90 |
O2-DNet [52] | H-104 | 89.31 | 82.14 | 47.33 | 61.21 | 71.32 | 74.03 | 78.62 | 90.76 | 82.23 | 81.36 | 60.93 | 60.17 | 58.21 | 66.98 | 61.03 | 71.04 |
CenterMap-Net [53] | R-50 | 88.88 | 81.24 | 53.15 | 60.65 | 78.62 | 66.55 | 78.10 | 88.83 | 77.80 | 83.61 | 49.36 | 66.19 | 72.10 | 72.36 | 58.70 | 71.74 |
BBAVector [54] | R-101 | 88.35 | 79.96 | 50.69 | 62.18 | 78.43 | 78.98 | 87.94 | 90.85 | 83.58 | 84.35 | 54.13 | 60.24 | 65.22 | 64.28 | 55.70 | 72.32 |
SCRDet [19] | R-101 | 89.98 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 | 72.61 |
DRN [55] | H-104 | 89.71 | 82.34 | 47.22 | 64.10 | 76.22 | 74.43 | 85.84 | 90.57 | 86.18 | 84.89 | 57.65 | 61.93 | 69.30 | 69.63 | 58.48 | 73.23 |
Gliding Vertex [56] | R-101 | 89.89 | 85.99 | 46.09 | 78.48 | 70.32 | 69.44 | 76.93 | 90.71 | 79.36 | 83.80 | 57.79 | 68.35 | 72.90 | 71.03 | 59.78 | 73.39 |
SRDF [20] | R-101 | 87.55 | 84.12 | 52.33 | 63.46 | 78.21 | 77.02 | 88.13 | 90.88 | 86.68 | 85.58 | 47.55 | 64.88 | 65.17 | 71.42 | 59.51 | 73.50 |
R3Det [24] | R-152 | 89.49 | 81.17 | 50.53 | 66.10 | 70.92 | 78.66 | 78.21 | 90.81 | 85.26 | 84.23 | 61.81 | 63.77 | 68.16 | 69.83 | 67.17 | 73.74 |
FCOSR-S [57] | R-50 | 89.09 | 80.58 | 44.04 | 73.33 | 79.07 | 76.54 | 87.28 | 90.88 | 84.89 | 85.37 | 55.95 | 64.56 | 66.92 | 76.96 | 55.32 | 74.05 |
S2A-Net [37] | R-50 | 89.11 | 82.84 | 48.37 | 71.11 | 78.11 | 78.39 | 87.25 | 90.83 | 84.90 | 85.64 | 60.36 | 62.60 | 65.26 | 69.13 | 57.94 | 74.12 |
SCRDet++ [40] | R-101 | 89.20 | 83.36 | 50.92 | 68.17 | 71.61 | 80.23 | 78.53 | 90.83 | 86.09 | 84.04 | 65.93 | 60.80 | 68.83 | 71.31 | 66.24 | 74.41 |
Oriented R-CNN [23] | R-50 | 88.79 | 82.18 | 52.64 | 72.14 | 78.75 | 82.35 | 87.68 | 90.76 | 85.35 | 84.68 | 61.44 | 64.99 | 67.40 | 69.19 | 57.01 | 75.00 |
MaskOBB [58] | RX-101 | 89.56 | 89.95 | 54.21 | 72.90 | 76.52 | 74.16 | 85.63 | 89.85 | 83.81 | 86.48 | 54.89 | 69.64 | 73.94 | 69.06 | 63.32 | 75.33 |
CBDA-Net [18] | R-101 | 89.17 | 85.92 | 50.28 | 65.02 | 77.72 | 82.32 | 87.89 | 90.48 | 86.47 | 85.90 | 66.85 | 66.48 | 67.41 | 71.33 | 62.89 | 75.74 |
DODet [59] | R-101 | 89.61 | 83.10 | 51.43 | 72.02 | 79.16 | 81.99 | 87.71 | 90.89 | 86.53 | 84.56 | 62.21 | 65.38 | 71.98 | 70.79 | 61.93 | 75.89 |
SREDet(ours) | R-101 | 89.36 | 85.51 | 50.87 | 74.52 | 80.50 | 74.78 | 86.43 | 90.91 | 87.40 | 83.97 | 64.36 | 69.10 | 67.72 | 73.65 | 65.93 | 76.34 |
SREDet(ours) * | R-101 | 90.23 | 86.75 | 54.34 | 80.81 | 80.41 | 79.37 | 87.02 | 90.90 | 88.28 | 86.84 | 70.16 | 70.68 | 74.43 | 76.11 | 73.42 | 79.32 |
This performance can be attributed to the sensitivity of the rotation-invariant features to object orientation and to the feature enhancement provided by semantic information. As exemplified by its detection of swimming pools, helicopters, and planes, our method effectively identifies and regresses objects with irregular shapes. This is primarily because semantic segmentation information is incorporated as supervision, allowing the network to focus precisely on object and contextual features against complex backgrounds and providing more regression cues. Additionally, our method performs well on densely arranged objects such as vehicles, benefiting from the SFEM, which reduces the coupling of intra-class features and thereby highlights crucial features. We also observed that for ground track fields, roundabouts, and baseball diamonds, utilizing semantic segmentation information is more efficient than object bounding-box masks; the primary reason is that masks may include background information or other objects, causing feature confusion or erroneous enhancement. Overall, our method handles the challenges posed by the arbitrary orientations, irregular shapes, dense arrangements, and varying scales of remote sensing objects, achieving precise rotated object detection.
From the visualized detections in Figure 6, it can be observed that our network achieves excellent results for various types of objects. As seen in the first row of images, the network accurately detects harbors of different shapes and sizes, primarily owing to the MRFPN module's ability to extract features of varying scales and shapes. The fourth column of images shows that the network also performs well on dense objects, largely thanks to the SFEM, which alleviates feature overlap among similar objects and enhances the feature maps of small objects.
4.3.2. Ablation Study
We conducted ablation studies on the proposed modules to determine their respective contributions and effectiveness. All experiments employed simple random flipping as an augmentation technique to avoid overfitting. The results of these experiments are reported in Table 2, while the error-type metrics proposed in this paper are presented in Table 3.
First, to assess the effectiveness of the MRFPN and SFEM modules individually, we built variants that add each module separately to the baseline. Integrating the MRFPN module alone increased detection performance by 4.7 points, particularly for objects such as basketball courts, storage tanks, and harbors. This improvement suggests that the multiscale rotation-invariant features extracted by the module help the network detect objects of varying scale and orientation. Incorporating the SFEM module alone yielded a 5.4-point improvement, most visibly for objects such as large vehicles, swimming pools, helicopters, and ships, indicating that the SFEM effectively intensifies object features and mitigates feature overlap among closely spaced objects. Finally, applying both modules together produced an overall improvement of 6.3 points, demonstrating that the two feature enhancement components are complementary: the multiscale rotation-invariant features extracted by MRFPN benefit the semantic segmentation task within the SFEM, while the SFEM suppresses noise in the features extracted by MRFPN.
We compared the responses of the different error types to the various improvement strategies, as seen in Table 3. In general, all improvement strategies reduced classification errors, regression errors, background false positives, and missed detections, validating the effectiveness of our proposed modules. When only MRFPN was introduced, the background false-positive error was 5.84 and the missed-detection error was 6.76; with SFEM alone, they were 5.56 and 7.13, respectively. This comparison reveals that MRFPN provides richer features (most notably reducing classification errors) and reduces missed detections, but it may misidentify some background regions as objects. SFEM, on the other hand, suppresses background noise and enhances object features (most notably reducing regression errors), but can lead to more missed objects. When both modules are applied together, however, they simultaneously reduce background false positives and missed detections, suggesting that the two components work synergistically.
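To make the four error types above concrete, the sketch below shows one simple way such a breakdown could be tallied from matched detections. It is a minimal illustration, not the exact protocol behind Table 3: the greedy matching, the axis-aligned IoU (a rotated-IoU routine would be substituted for oriented boxes), and the thresholds of 0.5 and 0.1 are all assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Axis-aligned IoU; boxes are (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def error_breakdown(dets, gts, fg_thr=0.5, bg_thr=0.1):
    """dets: list of (box, label, score), sorted by score descending.
    gts: list of (box, label). Returns counts for the four error types."""
    errors = {"cls": 0, "reg": 0, "bkg_fp": 0, "missed": 0}
    matched = [False] * len(gts)
    for box, label, _ in dets:
        ious = [iou(box, g[0]) for g in gts]
        best = int(np.argmax(ious)) if gts else -1
        best_iou = ious[best] if gts else 0.0
        if best_iou >= fg_thr and gts[best][1] == label and not matched[best]:
            matched[best] = True                  # true positive, no error
        elif best_iou >= fg_thr and gts[best][1] != label:
            errors["cls"] += 1                    # well localized, wrong class
        elif bg_thr <= best_iou < fg_thr and gts[best][1] == label:
            errors["reg"] += 1                    # right class, poor localization
        elif best_iou < bg_thr:
            errors["bkg_fp"] += 1                 # fired on background
        # duplicates of an already-matched ground truth are ignored in this sketch
    errors["missed"] = matched.count(False)       # ground truths never matched
    return errors
```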
4.3.3. Detailed Evaluation and Performance Testing of Components
In this section, we mainly explore the impact of different styles of semantic labels (SemSty) and feature enhancement methods (Enh-Mtds) on network performance. Expl and Impl denote the explicit and implicit enhancement methods described in Section 3.2 of this article, Mask refers to the semantic mask obtained from object bounding boxes, and Seg indicates semantic segmentation information. Based on the experimental results in Table 4, we observe that under the same Mask annotation, the implicit method outperforms the explicit method by 1.5 points; similarly, under Seg annotation, the implicit method surpasses the explicit method by 1.4 points. Thus, for the same type of semantic label, implicit enhancement is superior to explicit enhancement. This advantage primarily stems from the implicit method's ability to decouple the features of different objects into separate channels, facilitating the classification and regression of each category. In contrast, explicit enhancement depends heavily on the accuracy of the semantic segmentation, so any misclassification or omission in the segmentation directly degrades network performance.
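To illustrate the distinction between the two enhancement styles discussed above, the sketch below contrasts an explicit variant, where the predicted segmentation directly re-weights the feature map, with an implicit variant, where segmentation only supervises an auxiliary head and the enhancement weights are generated from intermediate features. The module structure, layer sizes, and residual form are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ExplicitEnhance(nn.Module):
    """Explicit variant (sketch): the predicted foreground probability directly
    re-weights the feature map, so any segmentation error propagates."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.seg_head = nn.Conv2d(channels, num_classes + 1, 1)  # +1 for background

    def forward(self, feat):
        seg_logits = self.seg_head(feat)                          # (B, C+1, H, W)
        fg_prob = 1.0 - seg_logits.softmax(dim=1)[:, 0:1]         # foreground probability
        return feat * fg_prob, seg_logits                         # logits supervised by Seg/Mask labels

class ImplicitEnhance(nn.Module):
    """Implicit variant (sketch): segmentation only supervises an auxiliary head;
    the enhancement weights are generated from intermediate features, which also
    allows different categories to be decoupled into separate channels."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.embed = nn.Conv2d(channels, channels, 3, padding=1)
        self.seg_head = nn.Conv2d(channels, num_classes + 1, 1)   # auxiliary supervision only
        self.weight_gen = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat):
        x = self.embed(feat)
        seg_logits = self.seg_head(x)                             # trained with segmentation labels
        weights = self.weight_gen(x)                              # per-channel/spatial weights
        return feat * weights + feat, seg_logits                  # residual enhancement
```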
Furthermore, we observe that under the explicit enhancement approach, Seg annotation improves performance by 0.9 points compared to Mask annotation, and under the implicit enhancement method, Seg annotation leads to a 0.8-point improvement over Mask annotation. Therefore, for the same feature enhancement method, supervision with Seg is superior to supervision with Mask, mainly because the more precise semantic information reduces background contamination and inter-class feature overlap. In particular, Seg annotation significantly outperforms Mask annotation for objects like roundabouts. This is primarily because the RA labeling in the original DOTA annotations is not uniform and may enclose objects such as small or large vehicles, leading to inter-class feature overlap when the Mask is used directly as semantic supervision. For objects with irregular shapes, such as swimming pools and helicopters, the Mask may also include part of the background, which affects the network's regression performance.
We compared the responses of different feature enhancement strategies to the various types of errors, as seen in Table 5. All strategies led to reductions in classification errors, regression errors, background false positives, and missed detections, which demonstrates their effectiveness and versatility. Notably, using Mask as supervisory information with explicit feature enhancement improved missed detections the most. However, among the four strategies, this approach showed the smallest improvement in background false positives, primarily because using Mask as semantic supervision reduces the difficulty of semantic segmentation but also increases the risk of incorrect segmentation.
Regarding background false positives, under the same style of semantic annotation, models using implicit enhancement outperform those using the explicit framework. This is mainly because the semantic segmentation information does not act on the network features directly; instead, it indirectly generates weights for spatial feature enhancement and for decoupling the features of different classes, mitigating the direct impact of segmentation errors. Concerning regression errors, under the same feature enhancement method, using Seg is superior to using Mask, chiefly because Seg supervision provides more accurate enhancement regions, which aids the network's regression task.
We provide a detailed visualization of the different feature enhancement strategies in Figure 7. From the visualized object boxes, it is evident that false detections occur when Mask is used as semantic guidance (red circles in the figure). In addition, some detections are suboptimal and fail to completely enclose the objects (green circles in the figure). This is primarily because using Mask as a semantic guide introduces erroneous semantic information; in (e), for example, areas of sea without ships are segmented as harbor, which directly affects the generated feature weights and results in poor detections.
Regarding the feature maps, implicit enhancement effectively decouples the features of different categories into different channels, as demonstrated by (h) and (k) as well as (i) and (l): (h) and (i) enhance features belonging to the ship category, while (k) and (l) enhance features characteristic of harbors. Furthermore, comparing the feature maps shows that for images containing dense objects, using Seg as semantic supervision is more effective, yielding clearer and more responsive feature maps.
In the MRFPN, we tested different numbers of feature layers and compared standard convolutions with deformable convolutions, as seen in Table 6. The experiments revealed that with standard convolutions there was no significant difference between using four and five feature layers. However, after employing DCN for feature extraction, additional feature layers improved the network's performance, which we attribute primarily to the DCN's enhanced capability to extract features from irregular objects.
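For readers who want to reproduce the DCN variant, the sketch below shows how a standard 3x3 convolution on a pyramid level could be replaced with torchvision's DeformConv2d; the layer name and its placement inside MRFPN are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLateral(nn.Module):
    """A 3x3 deformable convolution that can stand in for a standard pyramid-level
    convolution (sketch; not the paper's exact MRFPN layer)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # The offset branch predicts 2 offsets (dx, dy) per position of the 3x3 kernel.
        self.offset = nn.Conv2d(in_channels, 2 * 3 * 3, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset.weight)   # start as a plain convolution
        nn.init.zeros_(self.offset.bias)
        self.dconv = DeformConv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.dconv(x, self.offset(x))

# Usage: swap nn.Conv2d(c, 256, 3, padding=1) for DeformableLateral(c, 256)
# on each pyramid level when testing the DCN variant in Table 6.
```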
We also analyzed different strategies within the SFEM module, as seen in Table 7. When an equal number of dilated convolutions is stacked at each layer of the feature map, enhancing all feature maps yields better outcomes than enhancing only a subset of them. When the same set of feature maps is enhanced, stacking an appropriate number of dilated convolutions further improves detection performance. The primary reason is that multiple layers of dilated convolutions give the SFEM module a larger receptive field, enabling it to capture more comprehensive contextual information.
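The receptive-field argument can be illustrated with a small stack of dilated 3x3 convolutions like the one sketched below; the number of layers and the dilation rates (1, 2, 4) are illustrative assumptions rather than the configuration tuned in Table 7.

```python
import torch.nn as nn

def dilated_context_block(channels, dilations=(1, 2, 4)):
    """Stack 3x3 convolutions with growing dilation so the effective receptive
    field expands without downsampling (sketch; rates are illustrative)."""
    layers = []
    for d in dilations:
        layers += [
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

# e.g. applied to every pyramid level before the weight-generation step:
# context = dilated_context_block(256)
```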
We proposed a method for implicitly generating weights from semantic segmentation information to enhance feature maps, so the accuracy of the semantic segmentation directly affects network performance. In the SFEM module, we therefore tested three different losses, as seen in Table 8. The comparison shows that, without adjusting loss weights, focal loss performs best on the DOTA dataset given the class imbalance in remote sensing images. However, Dice loss has a stronger ability to distinguish object regions, and according to our statistics background pixels account for 96.95% of the dataset. We therefore introduced class weights into the Dice loss, setting the weight of background pixels to 1 and that of foreground pixels to 20. The experimental results show that this approach achieved the best performance.
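The class-weighted Dice loss described above admits a compact implementation; the sketch below uses the stated weights of 1 for background and 20 for foreground pixels, though details such as the smoothing term are assumptions that may differ from the paper's exact formulation.

```python
import torch

def weighted_dice_loss(probs, target, fg_weight=20.0, bg_weight=1.0, eps=1e-6):
    """probs:  (B, H, W) predicted foreground probabilities.
    target: (B, H, W) binary mask (1 = foreground, 0 = background)."""
    target = target.float()
    # Per-pixel weights: 1 for background, 20 for foreground (~3% of pixels).
    w = bg_weight + (fg_weight - bg_weight) * target
    inter = (w * probs * target).sum(dim=(1, 2))
    denom = (w * (probs + target)).sum(dim=(1, 2))
    dice = (2.0 * inter + eps) / (denom + eps)
    return (1.0 - dice).mean()
```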
We conducted comparative experiments to test the SFEM module with different base models, including the two-stage detector Faster R-CNN and the single-stage detector YOLOv8, as seen in Table 9. All models were trained on the training set and tested on the validation set. Our module achieved an improvement of 0.88 points over Faster R-CNN, which is less pronounced than for the single-stage detectors; the main reason is that the RPN in the two-stage algorithm already helps the network focus on the key feature regions of the object rather than detecting over the entire feature map. Our module achieved improvements of 0.61 and 0.76 points over YOLOv8-m and YOLOv8-l, respectively. It is worth noting that, for a fair comparison, no pre-trained models were used during training, and the default data augmentation of YOLOv8 was applied.