4.1. Datasets
Three publicly available object-detection datasets of remote sensing images, namely DIOR [34], HRRSD [35], and AIBD [36], are used to evaluate the proposed methods in the experiments. Some examples of DIOR, HRRSD, and AIBD are shown in Figure 7.
The AIBD dataset is a self-annotated dataset dedicated to building detection tasks, and it contains a single object category: buildings. Each sample is 500 × 500 pixels, and there are 11,571 samples in total, with the same number of annotation files. Following the COCO dataset standard, buildings are classified as large, medium, and small; the numbers of large, medium, and small instances are 16,824, 121,515, and 51,977, respectively. The buildings are distinguished from each other by different colors against vastly different backgrounds, and their pixel counts range from tens to hundreds of thousands. The geometric shapes of the building instances are diverse and include irregular shapes such as U-shapes, T-shapes, and L-shapes. The raw data of AIBD come from the Inria aerial image dataset (accessed on 1 August 2020), which is mainly used for semantic segmentation; the training and test sets were selected from five cities. For each city, about 81 km² of regions and 36 image blocks were selected, so the training and test sets together contain 180 image patches covering 405 km². The resolution of each image block is 5000 × 5000 pixels, with a geographic resolution of 0.3 m.
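The COCO size buckets used above can be sketched with a small helper (illustrative only, not part of the dataset tooling; the 32² and 96² pixel-area thresholds are the standard COCO ones):

```python
def coco_size_category(area: float) -> str:
    """Bucket an instance by its pixel area, following the COCO standard."""
    if area < 32 ** 2:     # fewer than 1024 pixels
        return "small"
    if area < 96 ** 2:     # between 1024 and 9216 pixels
        return "medium"
    return "large"         # 9216 pixels or more
```

For example, a 20 × 20 building instance (400 pixels) falls into the "small" bucket.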
The HRRSD dataset was released by the University of Chinese Academy of Sciences in 2019. It contains 21,761 images sampled from Google Earth and Baidu Maps, with spatial resolutions ranging from 0.15 to 1.2 m. The dataset contains 55,740 instances covering 13 object categories: airplanes, baseball fields, intersections, ground track fields, basketball courts, bridges, ships, storage tanks, ports, parking lots, tennis courts, T-junctions, and vehicles. The biggest highlight of this dataset is its balanced sample distribution across categories, with each category having nearly 4000 samples. The training subset contains 5401 samples, and the validation and test subsets contain 5417 and 10,943 samples, respectively. The "trainval" subset is the combination of the training and validation subsets.
The DIOR dataset is a large-scale benchmark dataset mainly used for object detection in remote sensing images. Northwestern Polytechnical University in China released DIOR through sampling on Google Earth. The dataset contains 23,463 images, 20 object classes, and 192,472 instances. The 20 object categories are airplanes, baseball fields, basketball courts, airports, bridges, chimneys, highway service areas, dams, highway tollbooths, ground track fields, seaports, golf courses, flyovers, stadiums, storage tanks, ships, tennis courts, vehicles, railway stations, and windmills. The image size is 800 × 800, with spatial resolutions from 0.5 to 30 m. This dataset has four salient features: (1) a large number of object instances and images; (2) a variety of object scales; (3) diverse weather, seasons, and imaging conditions; and (4) high intra-class diversity and inter-class similarity.
In this paper, the DIOR dataset is divided into training, validation, and test sets in proportions of 0.25, 0.25, and 0.5, respectively. For the other two datasets, HRRSD and AIBD, the training and validation sets are jointly used for model training.
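As an illustration, the 0.25/0.25/0.5 split of DIOR described above could be implemented as follows (a hypothetical helper, not the authors' code; a fixed random seed is assumed for reproducibility):

```python
import random

def split_dataset(samples, ratios=(0.25, 0.25, 0.5), seed=42):
    """Shuffle sample IDs and split them into train/val/test by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    items = list(samples)
    random.Random(seed).shuffle(items)          # deterministic shuffle
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]              # remainder goes to test
    return train, val, test
```

For DIOR's 23,463 images, this yields 5865 training, 5865 validation, and 11,733 test samples.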
4.2. Evaluation Metrics
In this paper, mAP (mean average precision) and mAP_50 are selected as the main indicators of the experimental results; the evaluation metrics follow the standard COCO metric set. We also use mAP_75, mAP_s, mAP_m, and mAP_l as evaluation indicators. mAP_50 and mAP_75 denote the accuracy at IOU thresholds of 0.5 and 0.75, respectively. mAP_s denotes the average precision for small objects (area smaller than 32²), mAP_m for medium objects (area between 32² and 96²), and mAP_l for large objects (area larger than 96²). mAP is an indicator for measuring recognition accuracy in object detection and is the average of the AP over all categories, as defined in Equation (15):

$$\mathrm{mAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q) \tag{15}$$

where Q represents the category set of object detection and AP(q) is the average precision of the objects in category q. Precision and recall are calculated as shown in Equations (16) and (17):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{16}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{17}$$
where true positive (TP) represents the number of positive samples that are correctly predicted as positive. True negative (TN) represents the number of negative samples that are correctly predicted as negative. False positive (FP) represents the number of negative samples that are incorrectly predicted as positive. False negative (FN) represents the number of positive samples that are incorrectly predicted as negative.
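A minimal sketch of the relations in Equations (15)–(17) (illustrative helpers, not the COCO evaluation code, which additionally interpolates the precision-recall curve when computing AP):

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Equations (16) and (17): precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mean_average_precision(ap_per_category: dict) -> float:
    """Equation (15): mAP is the mean of the per-category AP values over Q."""
    return sum(ap_per_category.values()) / len(ap_per_category)
```

For instance, with 8 TP, 2 FP, and 4 FN, `precision_recall(8, 2, 4)` gives a precision of 0.8 and a recall of about 0.667.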
In addition, the TP, TN, FP, and FN of object detection are derived from the IOU (intersection over union). The IOU measures the overlap rate between two regions, as shown in Equation (18):

$$\mathrm{IOU} = \frac{\mathrm{area}(B_p \cap B_g)}{\mathrm{area}(B_p \cup B_g)} \tag{18}$$

where B_p denotes the predicted bounding box and B_g the ground-truth bounding box; the IOU is their overlapping area divided by the area of their union. A predicted bounding box is classified as true or false by comparing its IOU with a given threshold T: if IOU ≥ T, the detection is considered true; otherwise, it is considered false.
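Equation (18) and the threshold test can be sketched as follows (boxes are assumed here to be given as (x1, y1, x2, y2) corner coordinates; the paper does not fix a box format):

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_detection(pred_box, gt_box, t: float = 0.5) -> bool:
    """A prediction counts as true when IOU >= threshold T."""
    return iou(pred_box, gt_box) >= t
```

For example, two unit-offset 2 × 2 boxes overlap in a 1 × 1 region, so their IOU is 1/7, below the common threshold of 0.5.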
4.4. Experimental Analysis
Firstly, the improved algorithm proposed in this paper is compared with the object detection algorithms mentioned above on the AIBD dataset. The visualized test results are shown in Figure 8.
As shown in Figure 8, although AIBD is a building detection dataset containing a single category, the intra-category differences are very large: the apparent characteristics of the buildings vary greatly, whether for common rectangular buildings or for buildings with irregular shapes. Nevertheless, the overall detection result of MAME-YOLOX is satisfactory. The yellow rectangular boxes mark undetected buildings, i.e., false negatives (FNs); these buildings are missed because their color is very similar to that of the surrounding roads, so their edge features are not clear.
The quantitative comparison experiment results on the AIBD dataset are shown in Table 2 and Table 3. Table 2 presents the algorithm, backbone, mAP, and mAP_50 values, and Table 3 presents the mAP_75, mAP_s, mAP_m, and mAP_l values.
From Table 2, we can clearly see that MAME-YOLOX achieves the best results on the two main indicators, mAP and mAP_50, at 0.479 and 0.848, respectively. Moreover, from Table 3, we can see that, among the different object sizes, MAME-YOLOX's mAP_m indicator is optimal. The second-best mAP and mAP_50 results were obtained by YOLOX, at 0.467 and 0.832, respectively. Compared with the YOLOX algorithm, MAME-YOLOX's overall mAP value increased by 1.8%, while the mAP_50 value increased by 1.7%.
Secondly, the improved algorithm proposed in this paper is compared with the object detection algorithms mentioned above on the HRRSD dataset, which has 13 object categories. The visualized test results are shown in Figure 9.
As shown in Figure 9, most objects in the HRRSD dataset have obvious features, such as distinct textures and colors, and the objects are large. However, these information-rich high-resolution images also suffer from small inter-category differences, large intra-category differences, and semantics that are difficult to unify. The red rectangular boxes in Figure 9 are examples of false detections. In the first, a highway is wrongly detected as a bridge; visually, its foreground and background information is very similar to that of a bridge. In the second red rectangle, a road roundabout is mistakenly detected as a storage tank; if the contextual information of the image were fully considered, this object would not be misdetected. The quantitative comparison experiment results on the HRRSD dataset are shown in Table 4 and Table 5.
Table 4 presents the algorithm, backbone, mAP, and mAP_50 values.
Table 5 presents the mAP_75, mAP_s, mAP_m, and mAP_l values.
From Table 4, it can be seen that Faster R-CNN obtains the best mAP and mAP_50 results, at 0.561 and 0.888, respectively. MAME-YOLOX ranks second, with 0.549 and 0.862, respectively; however, our algorithm's processing speed is better than that of Faster R-CNN. Compared with the YOLOX algorithm, MAME-YOLOX's overall mAP value increased by 1.4%, while the mAP_50 value increased by 1.6%.
Thirdly, the improved algorithm proposed in this paper is compared with the object detection algorithms mentioned above on the DIOR dataset. The visualized test results are shown in Figure 10.
The DIOR dataset contains 20 object categories. In Figure 10, we can see that object instances that are small and densely packed, such as vehicles in ports, may be falsely detected. However, objects with relatively fixed appearance features, such as aircraft and storage tanks, are rarely misdetected or omitted. The quantitative comparison experiment results on the DIOR dataset are shown in Table 6 and Table 7.
Table 6 presents the algorithm, backbone, mAP, and mAP_50 values.
Table 7 presents the mAP_75, mAP_s, mAP_m, and mAP_l values.
From Table 6, we can see that MAME-YOLOX obtains the optimal mAP of 0.501 and the optimal mAP_50 of 0.772. Second place goes to the SSD512 algorithm, with an mAP of 0.497 and an mAP_50 of 0.759. Compared with the YOLOX algorithm, MAME-YOLOX's overall mAP value increased by 3.2%, while the mAP_50 value increased by 3.4%.
The ablation experiments on the AIBD, HRRSD, and DIOR datasets are shown in Table 8, Table 9, and Table 10, respectively.
The experimental results demonstrate that the architecture of the proposed method is effective. For example, the mAP and mAP_50 of MAME-YOLOX are 0.273 and 0.472, respectively, which are better than those of YOLOX, YOLOX + SwinTrans., and YOLOX + CBAM. The SwinTrans. and CBAM modules are both beneficial to performance; however, MAME-YOLOX achieves the best results.
The qualitative visualization examples on the three datasets show that the bounding boxes produced by the proposed framework fit the remote sensing objects more closely. At the same time, the method can detect some small remote sensing objects that are not easy to find, which reduces the missed detection rate of small objects to a certain extent and thus improves the detection accuracy of remote sensing objects. MAME-YOLOX is therefore a promising method with a strong ability for object detection in remote sensing images; it is applicable both to datasets with multiple object categories and to datasets with a single object category.