1. Introduction
With the rapid development of spaceborne remote sensing platforms, such as Sentinel-1, TerraSAR-X, and RADARSAT-2 [
1,
2,
3], target detection in remote sensing images has come to play an important role in civil applications and defense security [
4,
5,
6]. However, target detection in remote sensing images remains a great challenge because of complex backgrounds and speckle noise. It is therefore worthwhile to develop a detector with strong feature extraction capabilities to improve target detection performance in remote sensing images.
Over the past decade, remote sensing images have provided abundant shape structure and texture information of landscape targets, and two-dimensional target detection algorithms for remote sensing images have been widely studied. Cheng et al. [
7] developed a discriminatively trained mixture model that extracts feature pyramids from multi-scale layers using the histogram of oriented gradients (HOG) [
8]; a threshold operation was then applied to the model response to judge whether a target is present in the remote sensing image. Bai et al. [
9] used the ranking support vector machine (SVM) [
10] to identify the existence of the target in the remote sensing image. In addition, the methods [
11,
12,
13] extracted semantic information from texture and shape features, and used machine learning algorithms such as the contrast box algorithm [
14] and semi-supervised hierarchical classification [
15] to obtain the final detection results. Although these conventional methods are effective in specific scenes, they rely on hand-crafted features that generalize poorly, which makes target detection on large, complex remote sensing image data sets difficult and time-consuming.
Recently, convolutional neural network (CNN) based methods have achieved encouraging results for general target detection and classification problems [
16,
17,
18,
19]. In particular, CNN-based target detectors have achieved remarkable success. They can be classified into two types: the first is the two-stage detector, such as Region-CNN (R-CNN) [
20], Fast R-CNN [
21], Faster R-CNN [
22], and feature pyramid networks (FPNs) [
23], and the other is the single-stage detector, such as the you only look once (YOLO) series [
24,
25,
26], Single Shot MultiBox Detector (SSD) [
27], and the deconvolutional single shot detector (DSSD) [
28]. The former type first generates a series of candidate region proposals as samples and then classifies the samples with a CNN. The latter type directly estimates the target region without generating candidate region proposals, turning target localization into a regression problem.
The above-mentioned visual detection algorithms have also been widely used in remote sensing image target detection. Zhu et al. [
29] proposed a new airport detection method based on a convolutional neural network combined with a cascaded region proposal network. Based on the SSD [
27] detection framework, Chen et al. [
30] presented an airplane detection method that aims to improve the detection accuracy for multi-scale remote sensing image targets. Yang et al. [
31] proposed a multi-scale remote sensing image target detection method based on the feature pyramid networks (FPN) [
23] detection framework. Although CNN-based target detection in remote sensing images can be relatively accurate, existing approaches usually do not exploit contextual information fully. This makes it difficult to understand the complex scenes in remote sensing images, which may lead to inaccurate detection of small targets and to the so-called “box-in-box” problem [
32], as shown in
Figure 1a. In the figure, SSD detects a single target with two overlapping boxes, and the smaller box covers only part of the target, such as part of the ship. To solve this problem, Wang et al. [
33] took the object proposals generated by SSD and replaced the original visual geometry group 16 (VGG16) [
18] backbone with a densely connected network [
34] to improve the relevance of contextual information. However, this method increases the computational cost of the network, and its detection accuracy for small-scale targets is poor.
To address the above problems, especially for small target detection, this paper presents a single-stage detector named CISPNet for target detection in remote sensing images. CISPNet is based on the Single Shot MultiBox Detector (SSD) [
27] and applies a context information scene perception (CISP) module to obtain context information for targets of different scales. Compared with other detection methods such as YOLO [
24], SSD [
27], DSSD [
28], and RSSD [
32], our framework is more suitable for target detection in remote sensing images and achieves competitive performance. The contributions of our work can be summarized as follows:
- (1)
Unlike previous detection models, we build a remote sensing image target detection framework based on SSD that can improve the relevance of contextual information and handle different complex scenes.
- (2)
To improve the relevance of contextual information, we propose a context information scene perception (CISP) module to obtain the context information for targets of different scales.
- (3)
We show that CISPNet achieves competitive and reliable performance on our remote sensing image dataset and the NWPU VHR-10 dataset, and we verify the effectiveness of the optimizations to its architecture.
The rest of this paper is organized as follows.
Section 2 introduces the basis of the proposed method.
Section 3 introduces the details of the proposed CISPNet framework.
Section 4 presents experiments conducted on our remote sensing dataset and the NWPU VHR-10 dataset to validate the effectiveness of the proposed framework and discusses the results. Finally,
Section 5 concludes this paper.
3. Proposed Method
In this section we will detail the architecture of the proposed CISPNet framework. As shown in
Figure 3a, the CISPNet assembles four context information scene perception (CISP) modules and two feature fusion modules (FFM) into a conventional SSD. The structure of these additional modules is simple and can be easily combined with conventional detection networks.
In the context information scene perception (CISP) module, as shown in
Figure 3b, multiple dilated convolution layers are used in parallel. Each dilated convolution layer has a different dilation rate, and the dilation rate determines the size of the corresponding receptive field. The CISP module uses convolution kernels with different receptive fields to extract features from Conv4_3 and FC7, so that the model can perceive changes of context information at different scales and in different sub-domains. In this way, the loss of semantic information is reduced and the feature map captures more contextual information from different scales. Internally, the module works in three steps. First, a 1 × 1 convolution reduces the number of channels of the input feature map. Then, three dilated convolutions with different dilation rates sample features from the reduced feature map in parallel, producing three feature maps. Finally, these feature maps are concatenated to obtain the final feature map.
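The link between dilation rate and receptive field can be made concrete with a short calculation. For a k × k kernel with dilation rate d, the kernel covers an extent of k + (k − 1)(d − 1) pixels per side. The following minimal sketch evaluates this for three parallel 3 × 3 branches and also computes the padding that keeps the spatial size unchanged at stride 1, which parallel branches need so that their outputs can be concatenated. The dilation rates (1, 3, 5) and the function names are illustrative assumptions, not the values used in CISPNet; the 38 × 38 input corresponds to the Conv4_3 feature map of a standard SSD with a 300 × 300 input.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent (per side) covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding that preserves spatial size at stride 1 (odd kernel sizes)."""
    return (effective_kernel_size(kernel_size, dilation) - 1) // 2


def output_size(in_size: int, kernel_size: int, dilation: int,
                padding: int, stride: int = 1) -> int:
    """Standard convolution output-size formula."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (in_size + 2 * padding - k_eff) // stride + 1


# Hypothetical dilation rates for three parallel 3x3 branches.
ASSUMED_RATES = (1, 3, 5)

# Extent covered by each branch: a larger rate sees more context.
branch_extents = {d: effective_kernel_size(3, d) for d in ASSUMED_RATES}

# With "same" padding every branch keeps the 38x38 resolution,
# so the branch outputs can be concatenated along the channel axis.
branch_outputs = {
    d: output_size(38, 3, d, same_padding(3, d)) for d in ASSUMED_RATES
}

for d in ASSUMED_RATES:
    print(f"rate={d}: extent={branch_extents[d]}, output={branch_outputs[d]}")
```

Under these assumed rates, the three branches look at 3 × 3, 7 × 7, and 11 × 11 neighborhoods of the same feature map while using the same number of weights per kernel.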
Based on the CISP module, we propose a single-stage detector, CISPNet, which can detect small targets more effectively. More specifically, we first insert two CISP modules between the layers Conv4_3 and FC7 and between the layers FC7 and Conv8_2, respectively. In addition, we connect two further CISP modules to the Conv4_3 and FC7 detection branches, respectively, to generate the new Conv4_3 and FC7 layers. Because the Conv4_3 and FC7 layers in the backbone are relatively shallow, they extract less semantic information and might not have enough capability to detect small targets; we therefore use CISP modules to enhance the Conv4_3 and FC7 features. Furthermore, feature fusion strategies generally help to learn better features from the combination of original features [
28,
32]. We also apply this strategy in CISPNet. Specifically, the new Conv4_3 and FC7 layers are generated by feature fusion with the help of two FFMs.
To show the influence of the proposed CISP module on image feature extraction more clearly, we select a remote sensing image containing one large ship target and three small ship targets in a wide sea area as the input image, and extract its features with SSD and with the proposed CISPNet.
Figure 4 shows a qualitative comparison of the feature maps produced by the proposed CISPNet and by SSD. As the network goes deeper, the feature map becomes smaller and more abstract, and small objects hardly produce a response in the deeper layers. As shown in
Figure 4a1,a2,b1,b2, the semantic information of a ship in the feature maps of CISPNet’s Conv4_3 and FC7 layers is clearer than that extracted by the Conv4_3 and FC7 layers of SSD. In particular, the features extracted by CISPNet highlight the smaller ship targets. This comparison also shows that the CISP module can reduce the loss of semantic information during feature extraction and lets the feature map capture more contextual information from different scales.
Figure 4a3,a4 show more highlighted feature information and stronger semantic information than
Figure 4b3,b4. In theory, the prediction results on
Figure 4a1,a2 are better than those on
Figure 4b1,b2, especially for the detection of small targets.
The CISP module could also be placed at other positions, and more CISP modules would bring more context information to the conventional network. Considering the trade-off between the improvement in precision and the increase in inference time, we ran experiments and finally selected the version shown in
Figure 3a.
4. Experiments and Results
In this section, we first introduce the construction of the dataset for target detection in remote sensing images, and then describe the evaluation metrics, training strategies, and implementation details. Next, we compare the proposed CISPNet with state-of-the-art methods to demonstrate its advantages. Finally, ablation studies verify the role of each component, and an extended experiment shows that the proposed CISPNet outperforms SSD in fuzzy target detection.
4.1. Benchmark Dataset
Similar to the related works [
29,
30,
31], the remote sensing images used in this paper were collected from Google Earth with an image size of 500 × 375 pixels and a spatial resolution of 0.5 m to 2 m. Compared with the targets in natural scene image datasets such as ImageNet [
35], MS COCO [
36], Pascal VOC [
37], the targets in remote sensing images (such as aircraft, ship, and oiltanker) usually have complex backgrounds. In addition, the remote sensing images used in this paper differ from existing remote sensing image databases, such as the NWPU VHR-10 dataset [
38] and the Aerial Image Dataset (AID) [
39], in that they include more small targets (area < 32² pixels) and medium targets (32² < area < 96² pixels), which further increases the difficulty of target detection.
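The size thresholds above coincide with those used in MS COCO evaluation. A minimal sketch of how a bounding box could be assigned to a size category is given below. The function name is hypothetical, and the handling of boxes whose area falls exactly on a threshold (here assigned to the larger category) is an assumption, since the text gives strict inequalities only.

```python
SMALL_MAX_AREA = 32 ** 2    # 1024 pixels
MEDIUM_MAX_AREA = 96 ** 2   # 9216 pixels


def size_category(width: float, height: float) -> str:
    """Classify a bounding box as small, medium, or large by pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        # Assumption: an area of exactly 32^2 counts as medium.
        return "medium"
    return "large"


# Illustrative boxes (width, height) in pixels.
for w, h in [(20, 20), (50, 50), (100, 100)]:
    print((w, h), size_category(w, h))
```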
We collected 1500 remote sensing images to construct a dataset for tiny target detection. In each image, the detection targets (aircraft, ship, and oiltanker) are manually annotated pixel-wise in Pascal VOC format, and each detection object is marked with a ground truth box, as shown in
Figure 5.
Table 1 gives more details about this data set and
Figure 6 shows the aspect ratio and area size distribution. It is obvious that the area of bounding boxes is generally small in
Figure 6, which means that the architecture of our model should pay more attention to small detection targets and the set size of the default box can be further reduced. It is noteworthy that the aspect ratio is not symmetrically distributed around 1 and its distribution is also relatively concentrated as shown in
Figure 6.
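Statistics of this kind can be derived directly from the annotation files. The sketch below parses a Pascal VOC style XML annotation with the Python standard library and returns the class name, aspect ratio (width over height), and area of each box. The sample annotation and its coordinates are made up for illustration and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in Pascal VOC layout (values are illustrative only).
SAMPLE_XML = """
<annotation>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>ship</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>140</xmax><ymax>70</ymax></bndbox>
  </object>
  <object>
    <name>aircraft</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>40</xmax><ymax>50</ymax></bndbox>
  </object>
</annotation>
"""


def box_statistics(xml_text: str):
    """Return (class name, aspect ratio, area) for every annotated box."""
    root = ET.fromstring(xml_text.strip())
    stats = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        width, height = xmax - xmin, ymax - ymin
        stats.append((obj.find("name").text, width / height, width * height))
    return stats


stats = box_statistics(SAMPLE_XML)
for name, ratio, area in stats:
    print(f"{name}: aspect ratio={ratio:.2f}, area={area:.0f} px")
```

Collecting these tuples over all annotation files gives the raw values behind an aspect ratio histogram and an area histogram.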
4.2. Evaluation Metrics
The mean average precision (mAP) and the precision-recall curves (PRC) commonly used in the field of target detection are used to compare the detection performance of different methods. The precision-recall curves and the mean average precision are described in detail below.
The precision indicator can be seen as a measure of exactness or fidelity, and the recall indicator is a measure of completeness. The precision and recall indicators are formulated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. The precision-recall curve commonly takes recall as the horizontal axis and precision as the vertical axis. If a detection method keeps a high precision as recall increases, it has good performance.
The average precision (AP) is the average value of the precision over the interval from recall = 0 to recall = 1, which can be formulated as:
$$AP = \int_{0}^{1} P(R)\, dR$$
where $P$ is the value of precision and $R$ denotes the value of recall. Hence, the AP is equal to the area under the precision-recall curve; the higher the AP, the better the performance of the detection method. The mean average precision (mAP) is the average of the AP values over all categories.
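As an illustration of these definitions, the sketch below computes precision and recall from detection counts, approximates AP as the area under a sampled precision-recall curve, and averages per-class AP values into mAP. The trapezoidal rule is an assumption: the text does not state which integration or interpolation scheme was used, and Pascal VOC style evaluations often integrate an interpolated precision envelope instead. All sample numbers are made up.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Trapezoidal area under the precision-recall curve.

    `recalls` must be sorted in ascending order and paired with `precisions`.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class: dict) -> float:
    """Average of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Illustrative values only.
p, r = precision_recall(tp=8, fp=2, fn=8)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.4])
m_ap = mean_average_precision({"aircraft": 0.9, "ship": 0.7, "oiltanker": 0.8})
print(p, r, ap, m_ap)
```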
4.3. Training Strategies and Implementation Details
We implemented the proposed CISPNet with PyTorch v0.4.1 on a PC with an NVIDIA GeForce GTX 1080Ti graphics processing unit (GPU), CUDA 8.0, and cuDNN 6.0.21. All experiments are performed and evaluated on our remote sensing image dataset benchmark. To obtain a model that is more robust to the size and shape of various remote sensing image targets, data augmentation operations, such as random cropping and flipping of the input image, are applied to each training image. Considering the memory limitations of the GPU, the batch size is set to 16 for an input size of 300. Following [
27], we train for a total of 120 k iterations, with a learning rate of 10⁻³ for the first 80 k iterations, 10⁻⁴ for the next 20 k iterations, and 10⁻⁵ for the final 20 k iterations. We fine-tune the entire model with a weight decay of 0.0005 and a momentum of 0.9, using stochastic gradient descent as the optimizer.
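The training schedule described above can be written as a simple step function. The sketch below is framework independent; the function and dictionary names are hypothetical, and iterations are assumed to be counted from zero.

```python
def learning_rate(iteration: int) -> float:
    """Step schedule: 1e-3 for 80k iterations, then 1e-4 for 20k, then 1e-5."""
    if iteration < 80_000:
        return 1e-3
    if iteration < 100_000:
        return 1e-4
    return 1e-5


# Remaining optimizer settings stated in the text, gathered in one place.
SGD_CONFIG = {
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "batch_size": 16,
    "input_size": 300,
    "max_iterations": 120_000,
}

for it in (0, 79_999, 80_000, 100_000, 119_999):
    print(it, learning_rate(it))
```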
The aspect ratios of the default boxes are set according to the aspect ratios observed in the images. As shown in
Figure 4a, we set different aspect ratios for the default boxes according to the length-width ratio distribution of the image. In addition, we use K-means to calculate three cluster centers of the area size and use them as the scales of the default boxes. During training, we need to determine which default boxes correspond to a ground truth box and train the network accordingly. For box matching, we match default boxes to any ground truth box with a Jaccard overlap higher than a threshold of 0.5.
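The matching rule can be illustrated with a short sketch. The Jaccard overlap of two boxes is the area of their intersection divided by the area of their union. The code below covers only the threshold rule stated above; the original SSD procedure additionally assigns each ground truth box to its best-overlapping default box, which is omitted here. The function names, the choice to keep the best-overlapping ground truth when several exceed the threshold, and the sample boxes are assumptions for illustration.

```python
def jaccard_overlap(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """Map each default box index to a ground truth index if overlap > threshold.

    If several ground truth boxes exceed the threshold, the one with the
    highest overlap is kept (an assumption of this sketch).
    """
    matches = {}
    for d_idx, d_box in enumerate(default_boxes):
        best_gt, best_iou = None, threshold
        for g_idx, g_box in enumerate(gt_boxes):
            iou = jaccard_overlap(d_box, g_box)
            if iou > best_iou:
                best_gt, best_iou = g_idx, iou
        if best_gt is not None:
            matches[d_idx] = best_gt
    return matches


# Illustrative boxes: one ground truth box and four default boxes.
gt_boxes = [(0, 0, 10, 10)]
default_boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
matches = match_default_boxes(default_boxes, gt_boxes)
print(matches)
```

Default boxes that end up in `matches` serve as positive samples for the corresponding ground truth box; the others are treated as background during training.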
4.4. Experimental Results and Comparisons
In order to comprehensively evaluate the superiority and effectiveness of the proposed CISPNet model, we compared it with the RCNN based algorithms, YOLO, and SSD methods. For a fair comparison, the same training data set and testing data set are used for the proposed CISPNet method and other comparison methods.
Table 2 and
Figure 7 show the quantitative comparison results of ten different methods, measured by AP values and PRC, respectively. R-CNN [
20] is the first algorithm to use a CNN for target detection. Its obvious disadvantage is that it relies on about 2000 region proposals generated by selective search to hypothesize object locations, which requires a large amount of memory. In addition, normalizing the region proposals makes the algorithm lose much contextual and semantic information, resulting in a detection accuracy of 61.48% mAP and a detection speed of 0.07 frames per second (FPS). To solve the problem of losing context and semantic information during image normalization in R-CNN, Fast R-CNN [
21] inputs an entire image into the network and extracts a fixed-length feature vector from the feature map through the region of interest (RoI) pooling layer, which reduces the running time of the detection networks and exposes the bottleneck of regional proposal computation. Therefore, Faster R-CNN [
22] introduces a region proposal network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and achieves 68.75% mAP at a speed of 10.2 FPS. YOLO [
24] and YOLO v2 [
25] convert the detection problem into a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Owing to its simple network structure, YOLO can identify targets in images more quickly than the R-CNN based algorithms; however, it is limited in precisely localizing certain targets, especially small ones. Its detection speed reaches 64.2 FPS, while its mAP is only 61.61%. In SSD [
27], multiple feature layers are used to improve the detection performance. However, the low-level layers lack sufficient semantic information, and the high-level layers lose much feature information because of downsampling, which may result in inaccurate detection of targets. SSD achieves an mAP of 74.01% on the testing data set. Recently, various methods have attempted to improve the accuracy of SSD, especially for small targets. DSSD [
28] uses ResNet-101 [
19] instead of VGG16 [
18] to achieve higher accuracy. RSSD [
32], DOSD [
40], and ESSD [
41] have improved the detection accuracy by fusing the low-level and high-level features.
In CISPNet, we use the CISP module to let the feature map capture more contextual information from different scales, and use the FFM to fuse the features of the shallower layers and enhance their semantic information. This addresses SSD's lack of context and semantic information. As shown in
Table 2 and
Figure 7, the proposed CISPNet method outperforms the other methods. Specifically, compared with the SSD model, CISPNet obtains AP gains of 6.37%, 6.83%, and 5.78% for aircraft, oiltanker, and ship, respectively. Overall, our network achieves 80.34% mAP at a speed of 50.7 FPS.
4.5. Detection Examples
In order to compare the detection performance of CISPNet and SSD for the remote sensing images in an intuitive manner, as shown in
Figure 8, we visualized some of the images. From the figure, we can clearly see that our proposed CISPNet is beneficial to the detection of targets in the remote sensing images, especially for small targets.
In the upper two rows of images, the SSD detection results show a single object with several overlapping boxes, where the smaller box covers only part of the object, such as part of a ship or part of an aircraft. In contrast, the detection results of the proposed CISPNet on the same images do not show the box-in-box problem. In the lower three rows of images, the aircraft, ships, and oiltankers are small. SSD cannot accurately locate them because it lacks an understanding of the scene. CISPNet captures contextual information more fully through the CISP modules and FFMs, so it can better distinguish the detected targets from the background and determine the locations of the aircraft, ships, and oiltankers through the contextual information.
4.6. Ablation Study
In this section, we set up different models and test them on the remote sensing image test dataset to verify the impact of each module on the detection performance. At the same time, we also discuss the effect of different aspect ratios of default boxes on detection accuracy. The results are shown in
Table 3 and
Table 4.
4.6.1. SSD with Context Information Scene Perception Modules and the Feature Fusion Modules
To estimate the contribution of the different components of CISPNet, we further constructed three variants and tested them on the remote sensing image test set to verify the impact of each module on the detection performance. The results are shown in
Table 3. The first step is to validate the effect of CISP module. We insert four CISPs at positions shown in
Figure 3a, and the first two rows of
Table 3 show that the CISP module brings a large improvement: the mAP on the testing data set reaches 79.38%, which is 5.37 percentage points higher than that of the conventional SSD. Second, we insert two FFMs at the bottom positions shown in
Figure 3a; the mAP is 76.15%, which is also better than that of the conventional SSD. Finally, we insert four CISPs and two FFMs at the positions shown in
Figure 3a; as the last row of
Table 3 shows, the mAP increases to 80.34%, which demonstrates the effectiveness of the added four CISPs and two FFMs in enhancing the overall performance. These experiments confirm the importance of each component of CISPNet.
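The gains reported in this ablation can be checked with a few lines of arithmetic. The sketch below uses only the mAP values quoted in the text (74.01% for the conventional SSD and the three variants above); the variable names and variant labels are hypothetical.

```python
BASELINE_SSD_MAP = 74.01  # mAP (%) of the conventional SSD, as quoted in the text

ABLATION_MAP = {
    "SSD + 4 CISP": 79.38,
    "SSD + 2 FFM": 76.15,
    "SSD + 4 CISP + 2 FFM (CISPNet)": 80.34,
}

# Gain of each variant over the baseline, in percentage points.
gains = {
    name: round(value - BASELINE_SSD_MAP, 2) for name, value in ABLATION_MAP.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The individual gains of the CISP modules (5.37 points) and the FFMs (2.14 points) sum to more than the gain of the full model (6.33 points), which suggests that the two components partly overlap in what they contribute.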
4.6.2. SSD with the Different Aspect Ratios of Default Boxes
In order to handle different target scales, we set default boxes with different aspect ratios to process images with different scales, and combined the results afterwards. We specified different aspect ratios for the models and recorded their evaluated results in
Table 4.
As shown in
Table 4, we gradually increased the number of default box aspect ratios used for detection. We started with a single default box with an aspect ratio of 1 and gradually added default boxes with other aspect ratios. The results indicate that the added aspect ratio pairs have a relatively large impact on performance. This shows that default boxes with different aspect ratios help to improve the detection performance, which in turn supports the design of our model.
4.7. Fuzzy Target Detection
Optical remote sensing images are usually captured from a high angle by satellite and aerial sensors, so they may contain fuzzy targets caused by cloud cover, shadow noise, and cluttered backgrounds. In addition to comparing the algorithms on the training and testing sets of the remote sensing image dataset, we use CISPNet and SSD to detect targets in fuzzy images. The detection results are shown in
Figure 9. CISPNet is clearly beneficial to the detection of fuzzy targets. This is because the context information scene perception (CISP) modules and feature fusion modules (FFMs) added in CISPNet capture feature information from the context more comprehensively, so that the network can better distinguish the fuzzy targets from the background and determine their locations through the contextual information.
4.8. Experiments on the NWPU VHR-10 Dataset
NWPU VHR-10 [
38] is a well-annotated dataset for remote sensing target detection. It contains 2D bounding boxes annotated on 650 remote sensing images for 10 categories: airplane (757 instances), ship (302 instances), storage tank (655 instances), ballpark (390 instances), tennis court (524 instances), basketball court (159 instances), ground track field (163 instances), harbor (224 instances), bridge (124 instances), and vehicle (447 instances). The training and testing sets are split at a ratio of 6:4, and the Intersection over Union (IoU) threshold used for evaluation on the testing set is 0.5.
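A 6:4 split of the 650 annotated images yields 390 training and 260 testing images. The sketch below shows one way such a split could be drawn reproducibly. The random shuffle with a fixed seed is an assumption, since the text does not say how the images were assigned to the two sets, and the function name is hypothetical.

```python
import random


def split_indices(num_images: int, train_fraction: float = 0.6, seed: int = 0):
    """Shuffle image indices with a fixed seed and cut them into train/test."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    cut = round(num_images * train_fraction)
    return indices[:cut], indices[cut:]


train_ids, test_ids = split_indices(650)
print(len(train_ids), len(test_ids))
```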
As shown in
Table 5, we evaluate our proposed method on the NWPU VHR-10 dataset and compare it with the other popular architectures reviewed and re-implemented in [
42]. Our CISPNet model achieves the highest mAP score (0.917) among the compared methods. CISPNet is better than SSD at detecting small and medium targets such as airplane, ship, and storage tank. The detailed average precision of each category is listed in
Table 5, and our proposed CISPNet performs well compared to most of the other methods.
5. Conclusions
In this paper, we have proposed an effective single-stage framework, CISPNet, which is based on SSD and the context information scene perception (CISP) module, to improve the detection accuracy of small and dense targets in optical remote sensing images. On the one hand, the experiments show that CISPNet achieves better performance on the remote sensing image benchmark dataset than SSD and other advanced detectors, and that the best performance is achieved with the most compact structure. On the other hand, they show that the CISP module and the feature fusion module (FFM) introduced in CISPNet effectively improve performance. Moreover, experimental results on the NWPU VHR-10 dataset reveal that CISPNet significantly outperforms the original SSD, especially in detecting small targets. In addition, in extended experiments, CISPNet performs better than the conventional SSD in fuzzy target detection.
In the future, we expect research to demonstrate that the proposed method is not restricted to SSD-based methods but is also applicable to other structures that use multi-scale features.