1. Introduction
Remote sensing images are obtained by observing the earth’s surface from different angles and heights with aerial or spaceborne platforms. The image scenes are therefore usually complex and diverse, and long-distance imaging introduces considerable noise. In addition, the objects in remote sensing images vary widely in scale, orientation, and density. At present, remote sensing images are widely used in many fields, such as traffic monitoring, disaster assessment, and urban management [
1]. With the development of remote sensing techniques, object detection in remote sensing images is gradually attracting attention.
Object detection is a very challenging task in computer vision. Advances in deep learning have driven considerable progress. Many detectors, such as R-CNN [
2], Fast R-CNN [
3], Faster R-CNN [
4], SSD [
5], YOLO [
6], CornerNet [
7], FCOS [
8], CenterNet [
9], and other classic object detection methods, have been proposed. However, when these methods are directly applied to remote sensing images, the detection performance drops markedly. The main reason is that remote sensing images differ clearly from general natural images in the following respects: (1) Objects in remote sensing images vary greatly in scale. There are many object categories, and the scale differences among objects of different categories can be large. As shown in
Figure 1a, a soccer ball field is usually more than ten times larger than a tennis court and hundreds of times larger than a small vehicle, even when they appear in the same image. (2) Small objects are easily submerged by background noises. Remote sensing images have complex backgrounds and usually contain many background noises. As shown in
Figure 1b, we take one point on the sea ripples and one on the edge of the ship, and then zoom in on their neighborhoods, respectively. The two white circles mark the neighborhoods, which show very little contrast. Small objects are easily disturbed by background noises, which leads to missed or false detections. (3) Objects show various orientations and are often densely arranged. For example, there are 16 densely parked vehicles in
Figure 1c. When we use horizontal bounding boxes to label them, five of them are filtered out by NMS (non-maximum suppression), and 31.25% of the vehicles are missed. Therefore, the main problems are summarized as scale, noise, and orientation.
For these challenging problems, researchers have carried out a series of related works for object detection in remote sensing images. For multi-scale variations, many methods obtain multi-scale features from different layers to improve the accuracy of object detection, such as RADet [
10], a refined feature pyramid network with top-down feature maps, SGD (Semantic Guided Decoder) [
11], aggregating multiple high-level features; for background noises, attention mechanisms have been verified to suppress them, such as EFM-Net [
12], which includes a background filtering module based on attention mechanisms to suppress the background interference and enhance the attention of objects simultaneously, AMGM (Attention Map Guidance Mechanism) [
13], which optimizes the network to keep the identification ability, and GABAM (Generating Anchor Boxes based on Attention Mechanism) [
14], which generates attention maps to obtain diverse and adaptable anchor boxes; for arbitrary orientations, many methods use oriented bounding boxes to locate objects, which effectively eliminates the mismatch between bounding boxes and the corresponding objects. However, the existing methods still cannot effectively handle the combination of these complex situations, and the detection accuracy remains low.
Therefore, this paper proposes an object detection method based on feature enhancement and hybrid attention (FEHA-OD) for remote sensing images. To address the detection difficulties of remote sensing images, namely large variations in object scale, small objects with insufficient detail features submerged by noise, and objects with arbitrary orientations, the proposed method builds a feature enhancement fusion network (FEFN), designs a hybrid attention mechanism (HAM) module, and uses box boundary-aware vectors for targeted treatment. First, for large variations in object scale, FEFN applies dilated convolutions with different dilation rates to enhance the original fused features, and thus fuses multi-scale, multi-receptive-field feature maps to make full use of information from different levels. The enhanced and fused features are more discriminative and robust, which is conducive to multi-scale object detection. Second, for small objects with insufficient detail features, the HAM module introduces a supervised pixel attention to reduce background interference while retaining a certain amount of context, which locates object positions more accurately. Moreover, the HAM module also uses a channel attention to focus on learning highly correlated channels, which guides training to pay more attention to informative and relevant regions. Context dependence and channel correlation help supplement the missing features of small objects. Thus, the HAM module can suppress background noises to prevent small objects from being overwhelmed and facilitate their detection. Finally, objects are located by using box boundary-aware vectors, that is, four vectors pointing from the center point to the edges of the object. Box boundary-aware vectors determine the location of the bounding box more precisely and detect objects with arbitrary orientations accurately, even when the objects are densely arranged. By solving the above problems, the proposed FEHA-OD improves the accuracy of object detection in remote sensing images.
3. Methods
3.1. Overall Framework
The overall framework of FEHA-OD shown in
Figure 2 mainly consists of the feature extraction network, prediction module, and loss function. The FEFN and HAM are parts of the feature extraction network for feature enhancement. The key modules are designed as follows:
- (1)
Feature extraction network: Firstly, an input image is sent to ResNet101 Conv1-5, which is the backbone. Then, FEFN and HAM modules perform the feature enhancement. Finally, a feature map that is 4 times smaller (scale s = 4) than the input image is output.
- (2)
Feature enhancement fusion network (FEFN): Firstly, a 1 × 1 convolution is used to adjust the number of channels of the high-level layer, and bilinear interpolation is used to upsample it. The upsampled result is combined with the low-level layer through element-wise addition. Then, the combined feature map is sent into three sub-networks containing different receptive fields to enhance the original features. Finally, the results of these sub-networks are fused to obtain a multi-scale representation.
- (3)
Hybrid attention mechanism (HAM): A supervised pixel attention mechanism and a channel attention mechanism are combined, which makes the network focus on both the pixel-level information regarding an object and valid channel information, effectively reducing the disturbance of background noise and detecting small objects.
- (4)
Prediction module: The image is predicted by four branches, namely heatmap, offset, box parameters, and orientation, and these four branches work together to achieve oriented object detection.
- (5)
Loss function: A multi-task loss function is used to train and combine the above modules better.
The input RGB image is $I \in \mathbb{R}^{W \times H \times 3}$. After the feature extraction network, the feature map $A$ is output as follows:

$$A = \Phi(I), \quad A \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times C}$$

where $\Phi(\cdot)$ denotes the processing of the feature extraction network, $C$ denotes the number of channels, which is set to 256, $S$ denotes the downsampling multiplier, which is set to 4, and $W$ and $H$ are the width and height of the input image.
Then, $A$ is transformed into four branches in the prediction module: the heatmap $\hat{Y} \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times K}$, offset $O \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 2}$, box parameters $B \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 10}$, and orientation $\hat{\alpha} \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 1}$, where $K$ represents the number of object categories. Each branch yields a corresponding loss. Finally, by combining all the losses, the output image with detection results can be obtained.
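For concreteness, the following PyTorch sketch shows how the feature map A could be mapped to the four prediction branches with the channel counts implied above (K heatmap channels, 2 offset channels, 10 box-parameter channels for the four 2-D vectors plus w and h, and 1 orientation channel). The head structure and names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

def make_head(in_ch: int, out_ch: int) -> nn.Sequential:
    # A lightweight prediction head: 3x3 conv + ReLU, then 1x1 conv to the branch channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, out_ch, kernel_size=1),
    )

K = 15                                   # number of object categories (e.g., DOTA has 15)
feat = torch.randn(1, 256, 150, 150)     # A: C = 256, spatial size = input / 4 (e.g., 600 / 4)

heatmap     = torch.sigmoid(make_head(256, K)(feat))   # center-point category confidences
offset      = make_head(256, 2)(feat)                  # sub-pixel center offsets
box_params  = make_head(256, 10)(feat)                 # t, r, b, l vectors (2-D each) + w, h
orientation = torch.sigmoid(make_head(256, 1)(feat))   # HBB vs. RBB class score

print(heatmap.shape, offset.shape, box_params.shape, orientation.shape)
```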
3.2. Feature Enhancement Fusion Network
The existing methods usually just aggregate hierarchical feature maps between high- and low-level output layers, which does not take full advantage of the fused feature information. In comparison, the proposed FEFN considers the current fused layers more deeply and further enlarges the receptive fields of the fused feature maps by using three sub-networks, which contain different dilation rates and different numbers of dilated convolution layers. This operation can reduce the semantic differences between the features of different layers and extract multi-scale features over different receptive fields, which enhances the multi-scale feature representation.
The FEFN module shown in
Figure 3 first uses a 1 × 1 convolutional layer to adjust the number of channels for the high-level features
Convn. Then, the adjusted result is upsampled through bilinear interpolation, and the refined high-level features
Convn1 are obtained. Then, the refined high-level features are added element-wise to the low-level features
Convn−1, yielding the first fused feature map
P. The feature map
P is followed by three sub-networks, each of which applies dilated convolutions with a different dilation rate and a different number of layers. Finally, the outputs of the sub-networks
P1,
P2,
P3 are concatenated along the channel dimension to support the effective fusion of contextual information. The enhanced fused feature map $F$ can be obtained as follows:

$$P_i = D_{n_i}(P), \quad i \in \{1, 2, 3\}, \qquad F = \mathrm{Concat}(P_1, P_2, P_3)$$

where $D_n(\cdot)$ represents the dilated convolution with dilation rate $n$, and $\mathrm{Concat}(\cdot)$ denotes a concatenation operation.
With the same computational complexity as a standard convolution, dilated convolution inserts “holes” between the elements of the convolutional kernel, which enlarges the receptive field while maintaining the resolution of the network [
29]. By changing the dilation rate and the number of convolution layers, we can obtain sub-networks with different receptive fields, which capture multi-scale context information and improve the detection of objects with large-scale variations. The three sub-networks with different groups of dilated convolution layers further exploit the current fused feature
P thoroughly.
Furthermore, through the fusion of the three sub-networks containing different dilated convolutions, the FEFN module enhances the object information under different receptive fields. This approach captures high-level semantic features while retaining more low-level detailed features, and considers the fused feature maps from different layers. Therefore, FEFN exploits different scales and different receptive fields, and thus obtains more discriminative and robust features, which facilitates multi-scale object detection.
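As a reading aid, the following PyTorch sketch shows one way the FEFN idea could be realized: a 1 × 1 convolution and bilinear upsampling align the high-level map with the low-level map, three dilated-convolution sub-networks enhance the fused map, and their outputs are concatenated. The specific dilation rates, sub-network depths, and channel sizes here are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEFN(nn.Module):
    """Sketch of feature enhancement fusion: fuse a high- and a low-level map,
    enhance the fused map with three dilated sub-networks of different
    receptive fields, and concatenate the results."""

    def __init__(self, high_ch: int, low_ch: int, out_ch: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(high_ch, low_ch, kernel_size=1)  # adjust channel number
        # Three sub-networks with different (assumed) dilation rates and depths.
        self.branch1 = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=1, dilation=1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)  # merge concatenated maps

    def forward(self, conv_n: torch.Tensor, conv_n_1: torch.Tensor) -> torch.Tensor:
        high = F.interpolate(self.reduce(conv_n), size=conv_n_1.shape[-2:],
                             mode="bilinear", align_corners=False)
        p = high + conv_n_1                                  # element-wise addition
        p1, p2, p3 = self.branch1(p), self.branch2(p), self.branch3(p)
        return self.fuse(torch.cat([p1, p2, p3], dim=1))     # channel-wise concatenation
```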
3.3. Hybrid Attention Mechanism
Pixel attention and channel attention can both emphasize informative features and prevent small objects from being overwhelmed by noises. Hence, this paper proposes a hybrid attention mechanism (HAM) integrating pixel attention and channel attention. Pixel attention uses a supervised network to enhance object-related pixel features and retain context information, which helps learn more accurate location information. Channel attention uses SKNet [
30], which can consider the relationship between channels, as well as the convolutional kernels, allowing the network to focus on high quality channels.
The HAM module is shown in
Figure 4. In the pixel attention network, the feature map
C first passes through an inception structure with convolutional kernels of different aspect ratios to learn a two-channel saliency map. The saliency map indicates the scores of the foreground and background, respectively. To guide this network, a supervised learning approach is used. The network builds a binary map as the label from the ground-truth, setting the object pixels to 1 and the remaining pixels to the background value of 0. Then, the network uses the cross-entropy loss between the binary map and the saliency map as the attention loss:
$$L_{att} = -\frac{1}{w \times h} \sum_{x=1}^{w} \sum_{y=1}^{h} \left[ G_{xy} \log(S_{xy}) + (1 - G_{xy}) \log(1 - S_{xy}) \right]$$

where $w$ and $h$ are the width and height of the saliency map, while $G_{xy}$ and $S_{xy}$ are the ground-truth values and saliency map values of the pixel points at ($x$, $y$), respectively.
Meanwhile, as the saliency map is continuous, background information will not be completely eliminated, which facilitates the retention of context information and improves robustness. Then, a softmax operation is performed on the saliency map and the weight matrix of pixel attention is obtained.
Channel attention uses SKNet, which contains three parts, namely split, fuse, and select. Split generates two feature maps containing different scale information by using different convolutional kernels. Fuse combines the two feature maps from the different branches through element-wise addition, integrating information from both branches. Then, the global information is embedded through global average pooling (GAP), which squeezes the map into a 1 × 1 × C vector representing channel-wise statistics, and the vector is passed through a fully connected (FC) layer. Select regresses the weight information between the channels by using two softmax functions, and the weight information is multiplied by the feature maps generated in split. Finally, the summed vector is output, representing the channel weight.
HAM multiplies pixel attention weights and channel attention weights, and obtains a new feature map A. Unlike channel attention that only focuses on the importance of different channels, HAM also considers encoding the spatial information. In detail, pixel attention weight can increase the importance of object location, effectively enhancing object location features and weakening background features. Meanwhile, channel attention weight can increase the importance of task-related channels, allowing the network to selectively emphasize object-related features and suppress useless ones. Thus, the HAM module can learn more object features to weaken the background noise and to prevent the misdetection of small objects.
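The sketch below illustrates, in PyTorch, how a supervised pixel attention branch and an SK-style channel attention branch could be combined as described. The kernel shapes, reduction ratio, and the way the two weights are applied to the feature map are simplifying assumptions; the saliency output would be supervised by the attention loss above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAttention(nn.Module):
    """Supervised pixel attention sketch: parallel convolutions with different
    kernel shapes produce a two-channel saliency map (background/foreground);
    softmax over the two channels yields a per-pixel foreground weight."""
    def __init__(self, ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(ch, 2, kernel_size=3, padding=1)
        self.b2 = nn.Conv2d(ch, 2, kernel_size=(1, 5), padding=(0, 2))
        self.b3 = nn.Conv2d(ch, 2, kernel_size=(5, 1), padding=(2, 0))

    def forward(self, x):
        saliency = self.b1(x) + self.b2(x) + self.b3(x)   # 2-channel saliency map
        weight = F.softmax(saliency, dim=1)[:, 1:2]       # foreground probability, (B, 1, H, W)
        return weight, saliency                           # saliency is supervised by L_att

class ChannelAttention(nn.Module):
    """SK-style channel attention sketch: two convolution branches with different
    kernel sizes are fused by GAP + FC, and softmax-normalized branch weights
    are collapsed into one per-channel weight vector."""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        mid = max(ch // reduction, 8)
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.squeeze = nn.Sequential(nn.Linear(ch, mid), nn.ReLU(inplace=True))
        self.fc_a = nn.Linear(mid, ch)
        self.fc_b = nn.Linear(mid, ch)

    def forward(self, x):
        u3, u5 = self.conv3(x), self.conv5(x)                   # split: two receptive fields
        s = (u3 + u5).mean(dim=(2, 3))                          # fuse: GAP over H, W -> (B, C)
        z = self.squeeze(s)                                     # compact channel descriptor
        ab = F.softmax(torch.stack([self.fc_a(z), self.fc_b(z)], dim=1), dim=1)  # (B, 2, C)
        # Per-channel statistics of the selected branch features serve as the channel weight.
        w = ab[:, 0] * u3.mean(dim=(2, 3)) + ab[:, 1] * u5.mean(dim=(2, 3))
        return torch.sigmoid(w)[:, :, None, None]               # (B, C, 1, 1)

class HAM(nn.Module):
    """Hybrid attention sketch: multiply the pixel weight and channel weight onto the input."""
    def __init__(self, ch: int):
        super().__init__()
        self.pixel = PixelAttention(ch)
        self.channel = ChannelAttention(ch)

    def forward(self, x):
        pixel_w, saliency = self.pixel(x)     # (B, 1, H, W)
        channel_w = self.channel(x)           # (B, C, 1, 1)
        return x * pixel_w * channel_w, saliency

x = torch.randn(2, 256, 64, 64)
out, saliency = HAM(256)(x)
print(out.shape, saliency.shape)  # torch.Size([2, 256, 64, 64]) torch.Size([2, 2, 64, 64])
```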
3.4. Prediction
The prediction module shown in
Figure 2 consists of four branches: heatmap, offset, box parameters, and orientation, which share the same feature extraction network for training, and predict the class and bounding box for each object. The location of the object center is inferred from the heatmap and offset branch, and the box boundary-aware vectors are obtained according to the box parameters and orientation branch. Thus, the prediction module achieves oriented object detection.
3.4.1. Heatmap
The heatmap branch predicts a center point heatmap $\hat{Y} \in [0, 1]^{\frac{W}{S} \times \frac{H}{S} \times K}$, which gives the category confidence of the predicted object center point as the key point. When training the heatmap, all pixels except the center points are negative samples, resulting in unbalanced positive and negative samples, so we use the variant focal loss as the loss function to adjust the penalty weights of the positive and negative samples. The loss function, which regresses the position and class of the center point, is defined as:

$$L_h = -\frac{1}{N} \sum_{xyk} \begin{cases} (1 - \hat{Y}_{xyk})^{\alpha} \log(\hat{Y}_{xyk}), & \text{if } Y_{xyk} = 1 \\ (1 - Y_{xyk})^{\beta} \hat{Y}_{xyk}^{\alpha} \log(1 - \hat{Y}_{xyk}), & \text{otherwise} \end{cases}$$

The parameters $\alpha$ and $\beta$ are set to 2 and 4, $N$ is the total number of objects, and $Y$ denotes the ground-truth heatmap.
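A minimal PyTorch sketch of this variant focal loss (the CenterNet-style formulation that the equation above follows), assuming the ground-truth heatmap is Gaussian-splatted so that it equals 1 only at object centers:

```python
import torch

def variant_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                       alpha: float = 2.0, beta: float = 4.0,
                       eps: float = 1e-6) -> torch.Tensor:
    """`pred` and `gt` are (B, K, H, W); gt equals 1 exactly at object centers
    and decays towards 0 elsewhere."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1.0 - eps)

    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg

    num_pos = pos.sum().clamp(min=1.0)      # N: total number of objects
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```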
3.4.2. Offset
Since the resolution of the heatmap is 4 times smaller than the input image, the point generated from the input image to the output heatmap is usually a floating-point number. However, the center point
p extracted from the heatmap is an integer, which results in a position error between the floating-point center point and the quantized integer center point. Therefore, an offset
o is predicted in this offset branch to compensate for the error. The offset of the center point is computed as follows:

$$o = \left( \frac{c_x}{S} - \left\lfloor \frac{c_x}{S} \right\rfloor,\ \frac{c_y}{S} - \left\lfloor \frac{c_y}{S} \right\rfloor \right)$$

where $(c_x, c_y)$ is the object center point in the input image and $S$ is the downsampling multiplier. The Smooth L1 loss function is used to optimize the offset:

$$L_o = \frac{1}{N} \sum_{m=1}^{N} \mathrm{SmoothL1}(o_m - \hat{o}_m)$$

where $N$ is the total number of objects, $o_m$ refers to the ground-truth offsets, $\hat{o}_m$ to the predicted offsets, and $m$ indexes the objects.
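A small sketch of the offset target and its loss (variable names are illustrative; S = 4 follows the downsampling multiplier above):

```python
import torch
import torch.nn.functional as F

S = 4  # downsampling multiplier of the feature map

def offset_target(center_xy: torch.Tensor) -> torch.Tensor:
    """Ground-truth sub-pixel offset: the fractional part lost when the floating-point
    center (in input-image coordinates) is quantized to an integer heatmap location."""
    scaled = center_xy / S
    return scaled - torch.floor(scaled)

# Example: a center at (123.0, 77.0) maps to heatmap cell (30, 19) with offset (0.75, 0.25).
print(offset_target(torch.tensor([123.0, 77.0])))  # tensor([0.7500, 0.2500])

# Smooth L1 loss between predicted and ground-truth offsets gathered at N = 8 centers.
pred_offsets = torch.rand(8, 2)
gt_offsets = torch.rand(8, 2)
loss_o = F.smooth_l1_loss(pred_offsets, gt_offsets, reduction="mean")
```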
3.4.3. Box Parameters
Due to the arbitrary orientations of objects, box boundary-aware vectors are used to describe the oriented bounding box. Box boundary-aware vectors are four vectors from the object center point to the edges of the object, namely
t,
r,
b, and
l, pointing to the top, right, bottom, and left, respectively. Thus, the four vectors are distributed in fixed quadrants of the Cartesian coordinate system. The box parameters are defined as
d = [
t,
r,
b,
l,
w,
h], where
w and
h represent the width and height of the external horizontal bounding box, respectively. The box parameters are regressed by using the Smooth L1 loss function:

$$L_b = \frac{1}{N} \sum_{m=1}^{N} \mathrm{SmoothL1}(d_m - \hat{d}_m)$$

where $\hat{d}_m$ and $d_m$ are the predicted and ground-truth box parameters, respectively.
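To make the geometry concrete, the sketch below recovers the four corners of an oriented box from the center and the four boundary-aware vectors; it relies on the fact that, for a rectangle, each corner equals the center plus the displacements to the two adjacent edge midpoints. This is an illustrative reconstruction, not necessarily the exact decoding used in the paper.

```python
import numpy as np

def decode_obb(center: np.ndarray, t: np.ndarray, r: np.ndarray,
               b: np.ndarray, l: np.ndarray) -> np.ndarray:
    """Return the 4x2 corner array of the oriented box described by the box
    boundary-aware vectors t, r, b, l (each a 2-D vector from the center to
    the midpoint of the corresponding edge)."""
    top_left     = center + t + l   # corner shared by the top and left edges
    top_right    = center + t + r
    bottom_right = center + b + r
    bottom_left  = center + b + l
    return np.stack([top_left, top_right, bottom_right, bottom_left])

# Example: an axis-aligned 4 x 2 box centered at the origin.
corners = decode_obb(np.array([0.0, 0.0]),
                     t=np.array([0.0, -1.0]), r=np.array([2.0, 0.0]),
                     b=np.array([0.0, 1.0]),  l=np.array([-2.0, 0.0]))
print(corners)  # [[-2,-1], [2,-1], [2,1], [-2,1]]
```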
3.4.4. Orientation
Some objects are horizontal or nearly horizontal in remote sensing images, as shown in
Figure 5a. If we use the box boundary-aware vectors to describe the location of these objects, the vectors will be distributed at the boundary of the quadrant. Thus, these vectors are difficult to differentiate, which leads to inaccurate locating. We group the oriented bounding boxes into two classes, horizontal and rotation bounding box, and define the class
α to denote them, respectively:

$$\alpha = \begin{cases} 0\ (\text{HBB}), & \mathrm{IOU}(\text{RBB}, \text{HBB}) > 0.95 \\ 1\ (\text{RBB}), & \text{otherwise} \end{cases}$$

where IOU is the intersection-over-union between the rotation bounding box (RBB) and the horizontal bounding box (HBB), as shown in
Figure 5b,c. In the orientation branch, we use the binary cross-entropy loss to predict the orientation class:
$$L_{\alpha} = -\frac{1}{N} \sum_{m=1}^{N} \left[ \alpha_m \log(\hat{\alpha}_m) + (1 - \alpha_m) \log(1 - \hat{\alpha}_m) \right]$$

where $\hat{\alpha}_m$ and $\alpha_m$ are the predicted and ground-truth orientation classes, respectively.
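A short sketch of how the orientation label could be assigned from the RBB/HBB overlap, using shapely for polygon IoU (since the HBB encloses the RBB, the IoU reduces to the ratio of their areas). The 0.95 threshold follows the reconstruction above and is an assumed value:

```python
import numpy as np
from shapely.geometry import Polygon

def orientation_class(rbb_corners: np.ndarray, iou_thresh: float = 0.95) -> int:
    """Return 0 (HBB class) for near-horizontal boxes and 1 (RBB class) otherwise.
    `rbb_corners` is a 4x2 array of the rotated box's corner points."""
    rbb = Polygon(rbb_corners)
    xmin, ymin = rbb_corners.min(axis=0)
    xmax, ymax = rbb_corners.max(axis=0)
    hbb = Polygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)])
    iou = rbb.intersection(hbb).area / rbb.union(hbb).area
    return 0 if iou > iou_thresh else 1
```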
3.5. Loss
The multi-task loss function is defined as follows:
$$L = L_h + L_o + L_b + L_{\alpha} + L_{att}$$

where $L_h$, $L_o$, $L_b$, and $L_{\alpha}$ are the losses of the four branches in the prediction module. The network learns the position and class of the center point by $L_h$. Combined with $L_o$, the network predicts the offset to reduce the position error of the center point. Then, the network uses $L_b$ to regress the box parameters at the center point. Moreover, $L_{\alpha}$ is used for the orientation class. Importantly, $L_{att}$ is an attention loss in HAM that guides pixel attention to weaken the interference of background noise and learn useful object features. The method is jointly trained by the multi-task loss function to achieve better detection performance.
4. Results and Analysis
4.1. Details
We use the public dataset DOTA-v1.0 to evaluate the proposed method. The dataset contains 2806 aerial images. In these images, there are diverse objects with different scales, orientations, and shapes. The resolutions of these images range from 800 × 800 to 4000 × 4000 pixels.
DOTA-v1.0 has 15 categories: plane (PL), baseball diamond (BD), bridge, ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccerball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC); it contains 188,282 instances.
Due to the large and variable size of DOTA images, the images need to be cropped first. In this paper, we use the same segmentation algorithm as RoI Transformer, and the images are cropped into 600 × 600 patches with a stride of 100. The input images have two scales, 0.5 and 1. The training and test sets contain 69,337 and 35,777 cropped images, respectively. After testing, the results on the cropped images are merged into a final result, which is evaluated via the DOTA online server.
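A simple sliding-window cropper along these lines can produce the patches. The patch size follows the 600 × 600 setting above, while the window step is an assumed parameter, since how the stated stride maps to the window step (or overlap) is not spelled out here:

```python
import numpy as np

def crop_patches(image: np.ndarray, patch: int = 600, step: int = 500):
    """Crop a large aerial image into fixed-size patches with a sliding window.
    `step` (the shift between patch origins) is an assumed value; a production
    cropper would also pad or add a final window to cover the right/bottom edges."""
    h, w = image.shape[:2]
    patches, origins = [], []
    for y in range(0, max(h - patch, 0) + 1, step):
        for x in range(0, max(w - patch, 0) + 1, step):
            patches.append(image[y:y + patch, x:x + patch])
            origins.append((x, y))   # remembered so detections can be mapped back
    return patches, origins
```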
We implement our method with PyTorch. The backbone weights are pre-trained on the ImageNet dataset and the other weights are initialized under the default settings of PyTorch. We use random flipping and cropping for data augmentation. In addition, the experiments are performed on a Titan RTX 24G server with a training batch size of 10, and we use Adam with an initial learning rate of 6.25 × 10−5 to optimize the total loss and train the network for about 80 epochs.
4.2. Evaluation Index
The experiments use mean average precision (mAP) as the evaluation metric for detection accuracy. mAP is the mean of the average precision over all categories. The average precision (
AP) of each category can be obtained by the precision-recall curve, which is obtained by calculating precision (
P) and recall (
R). The
P,
R, and
AP are defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} P(R)\, \mathrm{d}R$$

where
TP stands for the true positive count,
FN stands for the false negative count, and
FP stands for the false positive count.
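As a reference, a generic NumPy sketch of the per-category AP computation (sort detections by confidence, accumulate TP/FP, build the precision-recall curve, and integrate it). This is the standard VOC-style calculation, not necessarily the DOTA server’s exact implementation:

```python
import numpy as np

def average_precision(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """All-point interpolated AP for one category. `scores` are detection
    confidences, `is_tp` marks which detections matched a ground-truth box,
    and `num_gt` is the number of ground-truth objects of this category."""
    order = np.argsort(-scores)
    tp_flags = is_tp[order].astype(float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / max(num_gt, 1)                        # R = TP / (TP + FN)
    precision = tp / np.maximum(tp + fp, 1e-9)          # P = TP / (TP + FP)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):              # make precision monotonically decreasing
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# mAP is then the mean of the per-category AP values.
```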
4.3. Experimental Comparison and Analyses
4.3.1. Experimental Comparison for FEFN
In order to make full use of the features in different layers and enhance the expression capability of multi-scale features, the FEFN is added to the feature extraction network. Experiments are conducted before and after adding the FEFN, and the detection results are shown in
Table 1. The mAP is improved by 1.51% after adding the FEFN.
The visualization comparison results are shown in
Figure 6. Without FEFN, some small vehicles (SV) and the ground track field (GTF) are missed, which is shown in
Figure 6a. For ease of observation, red arrows are used to indicate the missed objects. In contrast, after adding FEFN, these missed objects are detected accurately, as shown in
Figure 6b. At the same time, the soccerball field labelled by the yellow bounding box is located more accurately. FEFN makes full use of the features in different layers to enhance the expression capability of multi-scale features, and thus obtains better results.
4.3.2. Experimental Comparison for HAM
In order to prevent small objects from being submerged by noises, this paper enhances object features and suppresses background noises by introducing HAM. The experimental results using different attention mechanisms are shown in
Table 2.
As shown in
Table 2, by including the channel attention mechanism, the method with SKNet achieves a 0.6% mAP improvement compared with the baseline. The method with pixel attention achieves a 0.81% mAP improvement. They both suppress background noises to a certain extent, but the improvements are not significant. The method with the proposed HAM achieves a 2.48% mAP improvement. By adding a hybrid attention module that simultaneously enhances the learning of positional information and effective channels while suppressing non-object features more effectively, HAM performs better than the other methods, i.e., SKNet alone and pixel attention alone.
The visualization results with or without HAM are shown in
Figure 7. Obviously, the feature map after adding HAM (shown in
Figure 7c) illustrates clearer boundaries of small objects, as well as more distinctive object features than the baseline without HAM (shown in
Figure 7b). For example, the ships in the first-row image and the ships in the second-row image can be seen in
Figure 7c, but they are completely obscured in
Figure 7b and submerged in the background. Actually, HAM suppresses background noises and enhances the object features, making small objects submerged by noises stand out.
4.3.3. Ablation Experiments for Different Modules
To verify the effectiveness of each module, a set of ablation experiments is conducted. The baseline is BBAVectors, and the methods ‘+FEFN’, ‘+HAM’, and ‘+HAM + FEFN’ represent the addition of FEFN, HAM, and both HAM and FEFN to the baseline, respectively. For all methods, training converges within 70 epochs, and the convergence curves of the loss function are shown in
Figure 8. The ‘+HAM’ and ‘+FEFN’ methods converge faster and perform better than the baseline, which illustrates that both HAM and FEFN contribute to the convergence process. Further, in terms of both convergence speed and final loss, ‘+HAM + FEFN’ exceeds the above methods, i.e., it converges faster and performs better than the baseline and the single-module methods. Therefore, ‘+HAM + FEFN’, which integrates HAM and FEFN, jointly improves performance.
The experimental results of various methods of adding different modules are shown in
Table 3. By including FEFN, the method ‘Baseline+FEFN’ achieves a 1.51% mAP improvement. By adding HAM, the method ‘Baseline+HAM’ achieves a 2.48% mAP improvement. These improvements have been discussed earlier. Finally, the method ‘Baseline+HAM+FEFN’ achieves a 2.81% mAP improvement. FEFN enhances multi-scale features, and HAM learns more object features to weaken the interference of background noises. The combination of the two improves the performance of object detection.
Some typical detection results of the baseline (in the first row) and the proposed method (in the second row) are shown in
Figure 9. Baseline tends to produce incorrect detection under different scenarios. In the first sample image shown in
Figure 9a, many small vehicles in the red dashed box, as well as the vehicles indicated by red arrows, are not detected. The aircraft wings in the lower left part of the image are not accurately located. In the second sample image shown in
Figure 9a, of the two ships lying alongside each other, only one ship is detected by the baseline and the other is missed, because the noise of sea ripples blurs the edges of the ships. In the last sample image shown in
Figure 9a, the baseline method mistakenly identifies some parking lines as vehicles, because the vehicles have very low contrast with the background and are thus easily confused with parking lines. In contrast, the proposed method detects the missed objects and reduces false detections under these adverse scenarios, as illustrated in the second row of
Figure 9b. Excellent results are largely attributed to the inclusion of FEFN and HAM within the proposed method.
4.3.4. Comparison with the State-of-the-Arts
To further demonstrate the effectiveness of the proposed method, the AP (%) for each object category and the mAP (%) over all categories are both computed. Our method is compared with state-of-the-art methods on the DOTA dataset. The results are shown in
Table 4.
RRPN [
26] and R2CNN [
27] directly use the high-level features to predict the category and location information of objects. Due to the large size of remote sensing images and the small size of objects, the object feature information is seriously lost after multiple convolution and pooling operations, and their performance is considerably lower than that of the other methods. ICN [
16] proposes image cascade networks to enrich the features and improves mAP to 68.16%. RoI Transformer [
28] transfers the horizontal bounding boxes to oriented bounding boxes by learning the spatial transformation, and thus improves mAP to 69.56%. CADNet [
17] exploits attention-modulated features, as well as global and local contexts, and improves mAP to 69.9%. SCRDet [
18] proposes an inception fusion network and adopts a targeted feature fusion strategy, which fully considers the feature fusion and anchor sampling, and improves mAP to 71.16%. BBAVectors [
19] uses the box boundary-aware vectors based on the center point of objects to capture the oriented bounding boxes, and improves mAP to 72.32%. DRN [
31] proposes a feature selection module (FSM) to adjust receptive fields in accordance with objects and a dynamic refinement head (DRH) to refine the prediction dynamically in an object-aware manner, and improves mAP to 73.23%.
Compared with RRPN, R2CNN, ICN, RoI-Transformer, CADNet, SCRDet, BBAVectors, and DRN, the mAP of FEHA-OD is improved by 14.01%, 14.35%, 6.86%, 5.46%, 5.12%, 3.86%, 2.7%, and 1.79%, respectively. FEHA-OD achieves the highest mAP. In detail, FEHA-OD is effective in detecting some large objects, such as BD, BC, and RA. For other large objects, such as TC and SP, FEHA-OD also maintains a good detection performance. These objects with large scale usually appear together with some small objects, such as SV and LV. FEHA-OD also has good detection performance for multi-scale objects. The main reason for this is that the FEFN effectively uses the fused features from different layers and facilitates multi-scale object detection. Moreover, the detection accuracy of SV, LV, ST, SH, and other small objects with complex backgrounds is very good. The main reason for such effective detection is that the HAM module effectively highlights object features and prevents small objects from being submerged by noises. In particular, objects with arbitrary orientations can also be detected very well. Some visual detection results of FEHA-OD on the DOTA dataset are shown in
Figure 10. The objects belonging to different categories are labeled by rectangles with different colors. For both very large objects, such as the roundabout (RA) and ground track field (GTF), and small objects submerged by background noise, such as the small vehicle (SV), FEHA-OD achieves excellent detection performance. Even when these objects present various scales and orientations and are seriously disturbed by noises, FEHA-OD locates and classifies them accurately.
5. Conclusions
Considering the objects with large-scale variations and arbitrary orientations, as well as the background noises in remote sensing images, this paper proposes an object detection method (FEHA-OD) for remote sensing images that focuses on feature enhancement and hybrid attention. It proposes FEFN with different receptive fields to reorganize and fuse feature maps, which further improves the multi-scale feature representation and addresses large-scale variations. Moreover, to address the problem of background noises, which may overwhelm the features of small objects, the HAM module is designed to increase the positioning accuracy of objects and to identify the informative channels; it suppresses background noises and relatively enhances small object features. Furthermore, this paper regresses the box parameters of objects with arbitrary orientations by using box boundary-aware vectors in the prediction module. Finally, a multi-task loss function is used to combine the above modules, and the final detection result is determined. Experiments on the public DOTA dataset show that the proposed method achieves 75.02% mAP, an improvement of 2.7% mAP compared with BBAVectors. The robustness and adaptability of FEHA-OD are also verified by the experimental results in a variety of complex scenarios. The optimization of computational speed will be the subject of our future work.