1. Introduction
Remote sensing images are obtained by observing the earth’s surface from different angles and heights with aerial or spaceborne platforms. The image scenes are therefore usually complex and diverse, and long-distance imaging introduces considerable noise. In addition, the objects in remote sensing images vary widely in scale, orientation, and density. At present, remote sensing images are widely used in many fields, such as traffic monitoring, disaster assessment, and urban management [
1]. With the development of remote sensing techniques, object detection in remote sensing images is gradually attracting attention.
Object detection is a very challenging task in computer vision. Advances in deep learning have driven considerable progress. Many detectors, such as R-CNN [
2], Fast R-CNN [
3], Faster R-CNN [
4], SSD [
5], YOLO [
6], CornerNet [
7], FCOS [
8], CenterNet [
9], and other classic object detection methods, have been proposed. However, when these methods are directly applied to remote sensing images, the detection performance drops markedly. The main reason is that remote sensing images differ clearly from general natural images in the following respects: (1) Objects in remote sensing images vary greatly in scale. There are many object categories, and the scale differences among objects of different categories can be large. As shown in
Figure 1a, a soccer ball field is usually more than ten times larger than a tennis court and hundreds of times larger than a small vehicle, even when they appear in the same image. (2) Small objects are easily submerged by background noises. Remote sensing images have complex backgrounds and usually contain many background noises. As shown in
Figure 1b, we take one point on the sea ripples and one on the edge of the ship, and then zoom in on their neighborhoods, respectively. The two white circles mark the neighborhoods, which show very little contrast. Small objects are easily disturbed by background noises, which leads to missed or false detections. (3) Objects show various orientations and are often densely arranged. For example, there are 16 densely parked vehicles in
Figure 1c. When we use horizontal bounding boxes to label them, five of them are filtered out by NMS (non-maximum suppression), and 31.25% of the vehicles are missed. Therefore, the main problems are summarized as scale, noise, and orientation.
For these challenging problems, researchers have carried out a series of related works for object detection in remote sensing images. For multi-scale variations, many methods obtain multi-scale features from different layers to improve the accuracy of object detection, such as RADet [
10], a refined feature pyramid network with top-down feature maps, SGD (Semantic Guided Decoder) [
11], aggregating multiple high-level features; for background noises, attention mechanisms have been verified to suppress them, such as EFM-Net [
12], which includes a background filtering module based on attention mechanisms to suppress the background interference and enhance the attention of objects simultaneously, AMGM (Attention Map Guidance Mechanism) [
13], which optimizes the network to keep the identification ability, and GABAM (Generating Anchor Boxes based on Attention Mechanism) [
14], which generates attention maps to obtain diverse and adaptable anchor boxes; for arbitrary orientations, many methods use oriented bounding boxes to locate objects, which effectively eliminates the mismatch between bounding boxes and the corresponding objects. However, the existing methods still cannot effectively handle the combination of these complex situations, and the detection accuracy remains low.
Therefore, this paper proposes an object detection method based on feature enhancement and hybrid attention (FEHA-OD) for remote sensing images. To address the detection difficulties of remote sensing images, namely large variations in object scale, small objects with insufficient detail features submerged by noise, and objects with arbitrary orientations, the proposed method builds a feature enhancement fusion network (FEFN), designs a hybrid attention mechanism (HAM) module, and uses box boundary-aware vectors for targeted treatment. First, for large variations in object scale, FEFN applies dilated convolutions with different dilation rates to enhance the original fused features, and thus fuses multi-scale, multi-receptive-field feature maps to make full use of information from different levels. The enhanced and fused features are more discriminative and robust, which is conducive to multi-scale object detection. Second, for small objects with insufficient detail features, the HAM module introduces a supervised pixel attention to reduce background interference while retaining a certain amount of context, which locates object positions more accurately. Moreover, the HAM module also uses a channel attention to focus on learning highly correlated channels, which guides training to pay more attention to informative and relevant regions. Context dependence and channel correlation help supplement the missing features of small objects. Thus, the HAM module can suppress background noises to prevent small objects from being overwhelmed and facilitate their detection. Finally, objects are located by using box boundary-aware vectors, that is, four vectors pointing from the center point to the edges of the object. Box boundary-aware vectors determine the location of the bounding box more precisely and detect objects with arbitrary orientations accurately, even when the objects are densely arranged. By solving the above problems, the proposed FEHA-OD improves the accuracy of object detection in remote sensing images.
3. Methods
3.1. Overall Framework
The overall framework of FEHA-OD shown in
Figure 2 mainly consists of the feature extraction network, prediction module, and loss function. The FEFN and HAM are parts of the feature extraction network for feature enhancement. The key modules are designed as follows:
- (1)
Feature extraction network: Firstly, an input image is sent to ResNet101 Conv1-5, which is the backbone. Then, FEFN and HAM modules perform the feature enhancement. Finally, a feature map that is 4 times smaller (scale s = 4) than the input image is output.
- (2)
Feature enhancement fusion network (FEFN): Firstly, a 1 × 1 convolution is used to adjust the number of channels of the high-level layer, and bilinear interpolation is used to upsample it. The upsampled result is combined with the low-level layer through element-wise addition. Then, the combined feature map is sent into three sub-networks containing different receptive fields to enhance the original features. Finally, the results of these sub-networks are fused to obtain a multi-scale representation.
- (3)
Hybrid attention mechanism (HAM): A supervised pixel attention mechanism and a channel attention mechanism are combined, which makes the network focus on both the pixel-level information regarding an object and valid channel information, effectively reducing the disturbance of background noise and detecting small objects.
- (4)
Prediction module: The image is predicted by four branches, namely heatmap, offset, box parameters, and orientation, and these four branches work together to achieve oriented object detection.
- (5)
Loss function: A multi-task loss function is used to train and combine the above modules better.
The input RGB image is $I \in \mathbb{R}^{W \times H \times 3}$. After the feature extraction network, the feature map $A$ is output as follows:

$$A = \Phi(I), \quad A \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times C}$$

where $\Phi(\cdot)$ denotes the processing of the feature extraction network, $C$ denotes the number of channels, which is set to 256, $S$ denotes the downsampling multiplier, which is set to 4, and $W$ and $H$ are the width and height of the input image.
Then, $A$ is transformed into four branches in the prediction module: the heatmap $\hat{Y} \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times K}$, offset $O \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 2}$, box parameters $B \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 10}$, and orientation $\hat{\alpha} \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 1}$, where $K$ represents the number of object categories. Each branch yields a corresponding loss. Finally, by combining all the losses, the output image with detection results can be obtained.
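For concreteness, the following PyTorch sketch shows how the feature map A could be mapped to the four prediction branches with the channel counts implied above (K heatmap channels, 2 offset channels, 10 box-parameter channels for the four 2-D vectors plus w and h, and 1 orientation channel). The head structure and names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

def make_head(in_ch: int, out_ch: int) -> nn.Sequential:
    # A lightweight prediction head: 3x3 conv + ReLU, then 1x1 conv to the branch channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, out_ch, kernel_size=1),
    )

K = 15                                   # number of object categories (e.g., DOTA has 15)
feat = torch.randn(1, 256, 150, 150)     # A: C = 256, spatial size = input / 4 (e.g., 600 / 4)

heatmap     = torch.sigmoid(make_head(256, K)(feat))   # center-point category confidences
offset      = make_head(256, 2)(feat)                  # sub-pixel center offsets
box_params  = make_head(256, 10)(feat)                 # t, r, b, l vectors (2-D each) + w, h
orientation = torch.sigmoid(make_head(256, 1)(feat))   # HBB vs. RBB class score

print(heatmap.shape, offset.shape, box_params.shape, orientation.shape)
```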
3.2. Feature Enhancement Fusion Network
The existing methods usually just aggregate hierarchical feature maps between high- and low-level output layers, which does not take full advantage of the fused feature information. In comparison, the proposed FEFN considers the current fused layers more deeply and further enlarges the receptive fields of the fused feature maps by using three sub-networks, which contain different dilation rates and different numbers of dilated convolution layers. This operation can reduce the semantic differences between the features of different layers and extract multi-scale features over different receptive fields, which enhances the multi-scale feature representation.
The FEFN module shown in
Figure 3 first uses a 1 × 1 convolutional layer to adjust the number of channels for the high-level features
Convn. Then, the adjusted result is upsampled through bilinear interpolation, and the refined high-level features
Convn1 are obtained. Then, the refined high-level features are added element-wise to the low-level features
Convn−1, yielding the first fused feature map
P. The feature map
P is followed by three sub-networks, each of which applies dilated convolutions with a different dilation rate and a different number of layers. Finally, the outputs of the sub-networks
P1,
P2,
P3 are concatenated along the channel dimension to support the effective fusion of contextual information. The enhanced fused feature map $F$ can be obtained as follows:

$$P_i = D_{n_i}(P), \quad i \in \{1, 2, 3\}, \qquad F = \mathrm{Concat}(P_1, P_2, P_3)$$

where $D_n(\cdot)$ represents the dilated convolution with dilation rate $n$, and $\mathrm{Concat}(\cdot)$ denotes a concatenation operation.
With the same computational complexity as a standard convolution, dilated convolution inserts “holes” between the elements of the convolutional kernel, which enlarges the receptive field while maintaining the resolution of the network [
29]. By changing the dilation rate and the number of convolution layers, we can obtain sub-networks with different receptive fields, which capture multi-scale context information and improve the detection of objects with large-scale variations. The three sub-networks with different groups of dilated convolution layers further exploit the current fused feature
P thoroughly.
Furthermore, through the fusion of the three sub-networks containing different dilated convolutions, the FEFN module enhances the object information under different receptive fields. This approach captures high-level semantic features while retaining more low-level detailed features, and considers the fused feature maps from different layers. Therefore, FEFN exploits different scales and different receptive fields, and thus obtains more discriminative and robust features, which facilitates multi-scale object detection.
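As a reading aid, the following PyTorch sketch shows one way the FEFN idea could be realized: a 1 × 1 convolution and bilinear upsampling align the high-level map with the low-level map, three dilated-convolution sub-networks enhance the fused map, and their outputs are concatenated. The specific dilation rates, sub-network depths, and channel sizes here are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEFN(nn.Module):
    """Sketch of feature enhancement fusion: fuse a high- and a low-level map,
    enhance the fused map with three dilated sub-networks of different
    receptive fields, and concatenate the results."""

    def __init__(self, high_ch: int, low_ch: int, out_ch: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(high_ch, low_ch, kernel_size=1)  # adjust channel number
        # Three sub-networks with different (assumed) dilation rates and depths.
        self.branch1 = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=1, dilation=1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)  # merge concatenated maps

    def forward(self, conv_n: torch.Tensor, conv_n_1: torch.Tensor) -> torch.Tensor:
        high = F.interpolate(self.reduce(conv_n), size=conv_n_1.shape[-2:],
                             mode="bilinear", align_corners=False)
        p = high + conv_n_1                                  # element-wise addition
        p1, p2, p3 = self.branch1(p), self.branch2(p), self.branch3(p)
        return self.fuse(torch.cat([p1, p2, p3], dim=1))     # channel-wise concatenation
```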
3.3. Hybrid Attention Mechanism
Pixel attention and channel attention can both emphasize informative features and prevent small objects from being overwhelmed by noises. Hence, this paper proposes a hybrid attention mechanism (HAM) integrating pixel attention and channel attention. Pixel attention uses a supervised network to enhance object-related pixel features and retain context information, which helps learn more accurate location information. Channel attention uses SKNet [
30], which can consider the relationship between channels, as well as the convolutional kernels, allowing the network to focus on high quality channels.
The HAM module is shown in
Figure 4. In the pixel attention network, the feature map
C first passes through an inception structure with convolutional kernels of different aspect ratios to learn a two-channel saliency map. The saliency map indicates the scores of the foreground and background, respectively. To guide this network, a supervised learning approach is used. The network builds a binary map as the label from the ground-truth, setting the object pixels to 1 and the remaining pixels to the background value of 0. Then, the network uses the cross-entropy loss between the binary map and the saliency map as the attention loss:
$$L_{att} = -\frac{1}{w \times h} \sum_{x=1}^{w} \sum_{y=1}^{h} \left[ G_{xy} \log(S_{xy}) + (1 - G_{xy}) \log(1 - S_{xy}) \right]$$

where $w$ and $h$ are the width and height of the saliency map, while $G_{xy}$ and $S_{xy}$ are the ground-truth values and saliency map values of the pixel points at ($x$, $y$), respectively.
Meanwhile, as the saliency map is continuous, background information will not be completely eliminated, which facilitates the retention of context information and improves robustness. Then, a softmax operation is performed on the saliency map and the weight matrix of pixel attention is obtained.
Channel attention uses SKNet, which contains three parts, namely split, fuse, and select. Split generates two feature maps containing different scale information by using different convolutional kernels. Fuse combines the two feature maps from the different branches through element-wise addition, integrating information from both branches. Then, the global information is embedded through global average pooling (GAP), which squeezes the map into a 1 × 1 × C vector representing channel-wise statistics, and the vector is passed through a fully connected (FC) layer. Select regresses the weight information between the channels by using two softmax functions, and the weight information is multiplied by the feature maps generated in split. Finally, the summed vector is output, representing the channel weight.
HAM multiplies pixel attention weights and channel attention weights, and obtains a new feature map A. Unlike channel attention that only focuses on the importance of different channels, HAM also considers encoding the spatial information. In detail, pixel attention weight can increase the importance of object location, effectively enhancing object location features and weakening background features. Meanwhile, channel attention weight can increase the importance of task-related channels, allowing the network to selectively emphasize object-related features and suppress useless ones. Thus, the HAM module can learn more object features to weaken the background noise and to prevent the misdetection of small objects.
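The sketch below illustrates, in PyTorch, how a supervised pixel attention branch and an SK-style channel attention branch could be combined as described. The kernel shapes, reduction ratio, and the way the two weights are applied to the feature map are simplifying assumptions; the saliency output would be supervised by the attention loss above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAttention(nn.Module):
    """Supervised pixel attention sketch: parallel convolutions with different
    kernel shapes produce a two-channel saliency map (background/foreground);
    softmax over the two channels yields a per-pixel foreground weight."""
    def __init__(self, ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(ch, 2, kernel_size=3, padding=1)
        self.b2 = nn.Conv2d(ch, 2, kernel_size=(1, 5), padding=(0, 2))
        self.b3 = nn.Conv2d(ch, 2, kernel_size=(5, 1), padding=(2, 0))

    def forward(self, x):
        saliency = self.b1(x) + self.b2(x) + self.b3(x)   # 2-channel saliency map
        weight = F.softmax(saliency, dim=1)[:, 1:2]       # foreground probability, (B, 1, H, W)
        return weight, saliency                           # saliency is supervised by L_att

class ChannelAttention(nn.Module):
    """SK-style channel attention sketch: two convolution branches with different
    kernel sizes are fused by GAP + FC, and softmax-normalized branch weights
    are collapsed into one per-channel weight vector."""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        mid = max(ch // reduction, 8)
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.squeeze = nn.Sequential(nn.Linear(ch, mid), nn.ReLU(inplace=True))
        self.fc_a = nn.Linear(mid, ch)
        self.fc_b = nn.Linear(mid, ch)

    def forward(self, x):
        u3, u5 = self.conv3(x), self.conv5(x)                   # split: two receptive fields
        s = (u3 + u5).mean(dim=(2, 3))                          # fuse: GAP over H, W -> (B, C)
        z = self.squeeze(s)                                     # compact channel descriptor
        ab = F.softmax(torch.stack([self.fc_a(z), self.fc_b(z)], dim=1), dim=1)  # (B, 2, C)
        # Per-channel statistics of the selected branch features serve as the channel weight.
        w = ab[:, 0] * u3.mean(dim=(2, 3)) + ab[:, 1] * u5.mean(dim=(2, 3))
        return torch.sigmoid(w)[:, :, None, None]               # (B, C, 1, 1)

class HAM(nn.Module):
    """Hybrid attention sketch: multiply the pixel weight and channel weight onto the input."""
    def __init__(self, ch: int):
        super().__init__()
        self.pixel = PixelAttention(ch)
        self.channel = ChannelAttention(ch)

    def forward(self, x):
        pixel_w, saliency = self.pixel(x)     # (B, 1, H, W)
        channel_w = self.channel(x)           # (B, C, 1, 1)
        return x * pixel_w * channel_w, saliency

x = torch.randn(2, 256, 64, 64)
out, saliency = HAM(256)(x)
print(out.shape, saliency.shape)  # torch.Size([2, 256, 64, 64]) torch.Size([2, 2, 64, 64])
```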
3.4. Prediction
The prediction module shown in
Figure 2 consists of four branches: heatmap, offset, box parameters, and orientation, which share the same feature extraction network for training, and predict the class and bounding box for each object. The location of the object center is inferred from the heatmap and offset branch, and the box boundary-aware vectors are obtained according to the box parameters and orientation branch. Thus, the prediction module achieves oriented object detection.
3.4.1. Heatmap
The heatmap branch predicts a center point heatmap $\hat{Y} \in [0, 1]^{\frac{W}{S} \times \frac{H}{S} \times K}$, which gives the category confidence of the predicted object center point as the key point. When training the heatmap, all pixels except the center points are negative samples, resulting in unbalanced positive and negative samples, so we use the variant focal loss as the loss function to adjust the penalty weights of the positive and negative samples. The loss function, which regresses the position and class of the center point, is defined as:

$$L_h = -\frac{1}{N} \sum_{xyk} \begin{cases} (1 - \hat{Y}_{xyk})^{\alpha} \log(\hat{Y}_{xyk}), & \text{if } Y_{xyk} = 1 \\ (1 - Y_{xyk})^{\beta} \hat{Y}_{xyk}^{\alpha} \log(1 - \hat{Y}_{xyk}), & \text{otherwise} \end{cases}$$

The parameters $\alpha$ and $\beta$ are set to 2 and 4, $N$ is the total number of objects, and $Y$ denotes the ground-truth heatmap.
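A minimal PyTorch sketch of this variant focal loss (the CenterNet-style formulation that the equation above follows), assuming the ground-truth heatmap is Gaussian-splatted so that it equals 1 only at object centers:

```python
import torch

def variant_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                       alpha: float = 2.0, beta: float = 4.0,
                       eps: float = 1e-6) -> torch.Tensor:
    """`pred` and `gt` are (B, K, H, W); gt equals 1 exactly at object centers
    and decays towards 0 elsewhere."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1.0 - eps)

    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg

    num_pos = pos.sum().clamp(min=1.0)      # N: total number of objects
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```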
3.4.2. Offset
Since the resolution of the heatmap is 4 times smaller than the input image, the point generated from the input image to the output heatmap is usually a floating-point number. However, the center point
p extracted from the heatmap is an integer, which results in a position error between the floating-point center point and the quantized integer center point. Therefore, an offset
o is predicted in this offset branch to compensate for the error. The offset of the center point is computed as follows:

$$o = \left( \frac{c_x}{S} - \left\lfloor \frac{c_x}{S} \right\rfloor,\ \frac{c_y}{S} - \left\lfloor \frac{c_y}{S} \right\rfloor \right)$$

where $(c_x, c_y)$ is the object center point in the input image and $S$ is the downsampling multiplier. The Smooth L1 loss function is used to optimize the offset:

$$L_o = \frac{1}{N} \sum_{m=1}^{N} \mathrm{SmoothL1}(o_m - \hat{o}_m)$$

where $N$ is the total number of objects, $o_m$ refers to the ground-truth offsets, $\hat{o}_m$ to the predicted offsets, and $m$ indexes the objects.
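A small sketch of the offset target and its loss (variable names are illustrative; S = 4 follows the downsampling multiplier above):

```python
import torch
import torch.nn.functional as F

S = 4  # downsampling multiplier of the feature map

def offset_target(center_xy: torch.Tensor) -> torch.Tensor:
    """Ground-truth sub-pixel offset: the fractional part lost when the floating-point
    center (in input-image coordinates) is quantized to an integer heatmap location."""
    scaled = center_xy / S
    return scaled - torch.floor(scaled)

# Example: a center at (123.0, 77.0) maps to heatmap cell (30, 19) with offset (0.75, 0.25).
print(offset_target(torch.tensor([123.0, 77.0])))  # tensor([0.7500, 0.2500])

# Smooth L1 loss between predicted and ground-truth offsets gathered at N = 8 centers.
pred_offsets = torch.rand(8, 2)
gt_offsets = torch.rand(8, 2)
loss_o = F.smooth_l1_loss(pred_offsets, gt_offsets, reduction="mean")
```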
3.4.3. Box Parameters
Due to the arbitrary orientations of objects, box boundary-aware vectors are used to describe the oriented bounding box. Box boundary-aware vectors are four vectors from the object center point to the edges of the object, namely
t,
r,
b, and
l, pointing to the top, right, bottom, and left, respectively. Thus, the four vectors are distributed in fixed quadrants of the Cartesian coordinate system. The box parameters are defined as
d = [
t,
r,
b,
l,
w,
h], where
w and
h represent the width and height of the external horizontal bounding box, respectively. The box parameters are regressed by using the Smooth L1 loss function:

$$L_b = \frac{1}{N} \sum_{m=1}^{N} \mathrm{SmoothL1}(d_m - \hat{d}_m)$$

where $\hat{d}_m$ and $d_m$ are the predicted and ground-truth box parameters, respectively.
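To make the geometry concrete, the sketch below recovers the four corners of an oriented box from the center and the four boundary-aware vectors; it relies on the fact that, for a rectangle, each corner equals the center plus the displacements to the two adjacent edge midpoints. This is an illustrative reconstruction, not necessarily the exact decoding used in the paper.

```python
import numpy as np

def decode_obb(center: np.ndarray, t: np.ndarray, r: np.ndarray,
               b: np.ndarray, l: np.ndarray) -> np.ndarray:
    """Return the 4x2 corner array of the oriented box described by the box
    boundary-aware vectors t, r, b, l (each a 2-D vector from the center to
    the midpoint of the corresponding edge)."""
    top_left     = center + t + l   # corner shared by the top and left edges
    top_right    = center + t + r
    bottom_right = center + b + r
    bottom_left  = center + b + l
    return np.stack([top_left, top_right, bottom_right, bottom_left])

# Example: an axis-aligned 4 x 2 box centered at the origin.
corners = decode_obb(np.array([0.0, 0.0]),
                     t=np.array([0.0, -1.0]), r=np.array([2.0, 0.0]),
                     b=np.array([0.0, 1.0]),  l=np.array([-2.0, 0.0]))
print(corners)  # [[-2,-1], [2,-1], [2,1], [-2,1]]
```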
3.4.4. Orientation
Some objects are horizontal or nearly horizontal in remote sensing images, as shown in
Figure 5a. If we use the box boundary-aware vectors to describe the location of these objects, the vectors will be distributed at the boundary of the quadrant. Thus, these vectors are difficult to differentiate, which leads to inaccurate locating. We group the oriented bounding boxes into two classes, horizontal and rotation bounding box, and define the class
α to denote them, respectively:

$$\alpha = \begin{cases} 0\ (\text{HBB}), & \mathrm{IOU}(\text{RBB}, \text{HBB}) > 0.95 \\ 1\ (\text{RBB}), & \text{otherwise} \end{cases}$$

where IOU is the intersection-over-union between the rotation bounding box (RBB) and the horizontal bounding box (HBB), as shown in
Figure 5b,c. In the orientation branch, we use the binary cross-entropy loss to predict the orientation class:
$$L_{\alpha} = -\frac{1}{N} \sum_{m=1}^{N} \left[ \alpha_m \log(\hat{\alpha}_m) + (1 - \alpha_m) \log(1 - \hat{\alpha}_m) \right]$$

where $\hat{\alpha}_m$ and $\alpha_m$ are the predicted and ground-truth orientation classes, respectively.
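A short sketch of how the orientation label could be assigned from the RBB/HBB overlap, using shapely for polygon IoU (since the HBB encloses the RBB, the IoU reduces to the ratio of their areas). The 0.95 threshold follows the reconstruction above and is an assumed value:

```python
import numpy as np
from shapely.geometry import Polygon

def orientation_class(rbb_corners: np.ndarray, iou_thresh: float = 0.95) -> int:
    """Return 0 (HBB class) for near-horizontal boxes and 1 (RBB class) otherwise.
    `rbb_corners` is a 4x2 array of the rotated box's corner points."""
    rbb = Polygon(rbb_corners)
    xmin, ymin = rbb_corners.min(axis=0)
    xmax, ymax = rbb_corners.max(axis=0)
    hbb = Polygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)])
    iou = rbb.intersection(hbb).area / rbb.union(hbb).area
    return 0 if iou > iou_thresh else 1
```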
3.5. Loss
The multi-task loss function is defined as follows:
$$L = L_h + L_o + L_b + L_{\alpha} + L_{att}$$

where $L_h$, $L_o$, $L_b$, and $L_{\alpha}$ are the losses of the four branches in the prediction module. The network learns the position and class of the center point by $L_h$. Combined with $L_o$, the network predicts the offset to reduce the position error of the center point. Then, the network uses $L_b$ to regress the box parameters at the center point. Moreover, $L_{\alpha}$ is used for the orientation class. Importantly, $L_{att}$ is an attention loss in HAM that guides pixel attention to weaken the interference of background noise and learn useful object features. The method is jointly trained by the multi-task loss function to achieve better detection performance.
4. Results and Analysis
4.1. Details
We use the public dataset DOTA-v1.0 to evaluate the proposed method. The dataset contains 2806 aerial images. In these images, there are diverse objects with different scales, orientations, and shapes. The resolutions of these images range from 800 × 800 to 4000 × 4000 pixels.
DOTA-v1.0 has 15 categories: plane (PL), baseball diamond (BD), bridge, ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccerball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC); it contains 188,282 instances.
Due to the large and variable size of DOTA images, the images need to be cropped first. In this paper, we use the same segmentation algorithm as RoI Transformer, and the images are cropped into 600 × 600 patches with a stride of 100. The input images have two scales, 0.5 and 1. The training and test sets contain 69,337 and 35,777 cropped images, respectively. After testing, the results on the cropped images are merged into a final result, which is evaluated via the DOTA online server.
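A simple sliding-window cropper along these lines can produce the patches. The patch size follows the 600 × 600 setting above, while the window step is an assumed parameter, since how the stated stride maps to the window step (or overlap) is not spelled out here:

```python
import numpy as np

def crop_patches(image: np.ndarray, patch: int = 600, step: int = 500):
    """Crop a large aerial image into fixed-size patches with a sliding window.
    `step` (the shift between patch origins) is an assumed value; a production
    cropper would also pad or add a final window to cover the right/bottom edges."""
    h, w = image.shape[:2]
    patches, origins = [], []
    for y in range(0, max(h - patch, 0) + 1, step):
        for x in range(0, max(w - patch, 0) + 1, step):
            patches.append(image[y:y + patch, x:x + patch])
            origins.append((x, y))   # remembered so detections can be mapped back
    return patches, origins
```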
We implement our method with PyTorch. The backbone weights are pre-trained on the ImageNet dataset and the other weights are initialized under the default settings of PyTorch. We use random flipping and cropping for data augmentation. In addition, the experiments are performed on a Titan RTX 24G server with a training batch size of 10, and we use Adam with an initial learning rate of 6.25 × 10−5 to optimize the total loss and train the network for about 80 epochs.
4.2. Evaluation Index
The experiments use mean average precision (mAP) as the evaluation metric for detection accuracy. mAP is the mean of the average precision over all categories. The average precision (
AP) of each category can be obtained by the precision-recall curve, which is obtained by calculating precision (
P) and recall (
R). The
P,
R, and
AP are defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} P(R)\, \mathrm{d}R$$

where
TP stands for the true positive count,
FN stands for the false negative count, and
FP stands for the false positive count.
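As a reference, a generic NumPy sketch of the per-category AP computation (sort detections by confidence, accumulate TP/FP, build the precision-recall curve, and integrate it). This is the standard VOC-style calculation, not necessarily the DOTA server’s exact implementation:

```python
import numpy as np

def average_precision(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """All-point interpolated AP for one category. `scores` are detection
    confidences, `is_tp` marks which detections matched a ground-truth box,
    and `num_gt` is the number of ground-truth objects of this category."""
    order = np.argsort(-scores)
    tp_flags = is_tp[order].astype(float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / max(num_gt, 1)                        # R = TP / (TP + FN)
    precision = tp / np.maximum(tp + fp, 1e-9)          # P = TP / (TP + FP)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):              # make precision monotonically decreasing
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# mAP is then the mean of the per-category AP values.
```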
4.3. Experimental Comparison and Analyses
4.3.1. Experimental Comparison for FEFN
In order to make full use of the features in different layers and enhance the expression capability of multi-scale features, the FEFN is added to the feature extraction network. Experiments are conducted before and after adding the FEFN, and the detection results are shown in
Table 1. The mAP is improved by 1.51% after adding the FEFN.
The visualization comparison results are shown in
Figure 6. Without FEFN, some small vehicles (SV) and the ground track field (GTF) are missed, which is shown in
Figure 6a. For ease of observation, red arrows are used to indicate the missed objects. In contrast, after adding FEFN, these missed objects are detected accurately, as shown in
Figure 6b. At the same time, the soccerball field labelled by the yellow bounding box is located more accurately. FEFN makes full use of the features in different layers to enhance the expression capability of multi-scale features, and thus obtains better results.
4.3.2. Experimental Comparison for HAM
In order to prevent small objects from being submerged by noises, this paper enhances object features and suppresses background noises by introducing HAM. The experimental results using different attention mechanisms are shown in
Table 2.
As shown in
Table 2, by including the channel attention mechanism, the method with SKNet achieves a 0.6% mAP improvement compared with the baseline. The method with pixel attention achieves a 0.81% mAP improvement. They both suppress background noises to a certain extent, but the improvements are not significant. The method with the proposed HAM achieves a 2.48% mAP improvement. By adding a hybrid attention module that simultaneously enhances the learning of positional information and effective channels while suppressing non-object features more effectively, HAM performs better than the other methods, i.e., SKNet alone and pixel attention alone.
The visualization results with or without HAM are shown in
Figure 7. Obviously, the feature map after adding HAM (shown in
Figure 7c) illustrates clearer boundaries of small objects, as well as more distinctive object features than the baseline without HAM (shown in
Figure 7b). For example, the ships in the first-row image and the ships in the second-row image can be seen in
Figure 7c, but they are completely obscured in
Figure 7b and submerged in the background. Actually, HAM suppresses background noises and enhances the object features, making small objects submerged by noises stand out.
4.3.3. Ablation Experiments for Different Modules
To verify the effectiveness of each module, a set of ablation experiments is conducted. The baseline is BBAVectors, and the methods ‘+FEFN’, ‘+HAM’, and ‘+HAM + FEFN’ represent the addition of FEFN, HAM, and both HAM and FEFN to the baseline, respectively. For all methods, training converges within 70 epochs, and the convergence curves of the loss function are shown in
Figure 8. The ‘+HAM’ and ‘+FEFN’ methods converge faster and perform better than the baseline, which illustrates that both HAM and FEFN contribute to the convergence process. Further, in terms of both convergence speed and final loss, ‘+HAM + FEFN’ exceeds the above methods, i.e., it converges faster and performs better than the baseline and the single-module methods. Therefore, ‘+HAM + FEFN’, which integrates HAM and FEFN, jointly improves performance.
The experimental results of various methods of adding different modules are shown in
Table 3. By including FEFN, the method ‘Baseline+FEFN’ achieves a 1.51% mAP improvement. By adding HAM, the method ‘Baseline+HAM’ achieves a 2.48% mAP improvement. These improvements have been discussed earlier. Finally, the method ‘Baseline+HAM+FEFN’ achieves a 2.81% mAP improvement. FEFN enhances multi-scale features, and HAM learns more object features to weaken the interference of background noises. The combination of the two improves the performance of object detection.
Some typical detection results of the baseline (in the first row) and the proposed method (in the second row) are shown in
Figure 9. Baseline tends to produce incorrect detection under different scenarios. In the first sample image shown in
Figure 9a, many small vehicles in the red dashed box, as well as the vehicles indicated by red arrows, are not detected. The aircraft wings in the lower left part of the image are not accurately located. In the second sample image shown in
Figure 9a, of the two ships lying alongside each other, only one ship is detected by the baseline and the other is missed, because the noise of sea ripples blurs the edges of the ships. In the last sample image shown in
Figure 9a, the baseline method mistakenly identifies some parking lines as vehicles, because the vehicles have very low contrast with the background and are thus easily confused with parking lines. In contrast, the proposed method detects the missed objects and reduces false detections under these adverse scenarios, as illustrated in the second row of
Figure 9b. Excellent results are largely attributed to the inclusion of FEFN and HAM within the proposed method.
4.3.4. Comparison with the State-of-the-Arts
To further demonstrate the effectiveness of the proposed method, the AP (%) for each object category and the mAP (%) over all categories are both computed. Our method is compared with state-of-the-art methods on the DOTA dataset. The results are shown in
Table 4.
RRPN [
26] and R2CNN [
27] directly use the high-level features to predict the category and location information of objects. Due to the large size of remote sensing images and the small size of objects, the object feature information is seriously lost after multiple convolution and pooling operations, and their performance is considerably lower than that of the other methods. ICN [
16] proposes image cascade networks to enrich the features and improves mAP to 68.16%. RoI Transformer [
28] transfers the horizontal bounding boxes to oriented bounding boxes by learning the spatial transformation, and thus improves mAP to 69.56%. CADNet [
17] exploits attention-modulated features, as well as global and local contexts, and improves mAP to 69.9%. SCRDet [
18] proposes an inception fusion network and adopts a targeted feature fusion strategy, which fully considers the feature fusion and anchor sampling, and improves mAP to 71.16%. BBAVectors [
19] uses the box boundary-aware vectors based on the center point of objects to capture the oriented bounding boxes, and improves mAP to 72.32%. DRN [
31] proposes a feature selection module (FSM) to adjust receptive fields in accordance with objects and a dynamic refinement head (DRH) to refine the prediction dynamically in an object-aware manner, and improves mAP to 73.23%.
Compared with RRPN, R2CNN, ICN, RoI-Transformer, CADNet, SCRDet, BBAVectors, and DRN, the mAP of FEHA-OD is improved by 14.01%, 14.35%, 6.86%, 5.46%, 5.12%, 3.86%, 2.7%, and 1.79%, respectively. FEHA-OD achieves the highest mAP. In detail, FEHA-OD is effective in detecting some large objects, such as BD, BC, and RA. For other large objects, such as TC and SP, FEHA-OD also maintains a good detection performance. These objects with large scale usually appear together with some small objects, such as SV and LV. FEHA-OD also has good detection performance for multi-scale objects. The main reason for this is that the FEFN effectively uses the fused features from different layers and facilitates multi-scale object detection. Moreover, the detection accuracy of SV, LV, ST, SH, and other small objects with complex backgrounds is very good. The main reason for such effective detection is that the HAM module effectively highlights object features and prevents small objects from being submerged by noises. In particular, objects with arbitrary orientations can also be detected very well. Some visual detection results of FEHA-OD on the DOTA dataset are shown in
Figure 10. The objects belonging to different categories are labeled by rectangles with different colors. For both very large objects, such as the roundabout (RA) and ground track field (GTF), and small objects submerged by background noise, such as the small vehicle (SV), FEHA-OD achieves excellent detection performance. Even when these objects present various scales and orientations and are seriously disturbed by noises, FEHA-OD locates and classifies them accurately.
5. Conclusions
Considering the objects with large-scale variations and arbitrary orientations, as well as the background noises in remote sensing images, this paper proposes an object detection method (FEHA-OD) for remote sensing images that focuses on feature enhancement and hybrid attention. It proposes FEFN with different receptive fields to reorganize and fuse feature maps, which further improves the multi-scale feature representation and addresses large-scale variations. Moreover, to address the problem of background noises, which may overwhelm the features of small objects, the HAM module is designed to increase the positioning accuracy of objects and to identify the informative channels; it suppresses background noises and relatively enhances small object features. Furthermore, this paper regresses the box parameters of objects with arbitrary orientations by using box boundary-aware vectors in the prediction module. Finally, a multi-task loss function is used to combine the above modules, and the final detection result is determined. Experiments on the public DOTA dataset show that the proposed method achieves 75.02% mAP, an improvement of 2.7% mAP compared with BBAVectors. The robustness and adaptability of FEHA-OD are also verified by the experimental results in a variety of complex scenarios. The optimization of computational speed will be the subject of our future work.