1. Introduction
Recently, deep learning (DL) technology has attracted increasing attention in a variety of fields and has delivered satisfying results. The convolutional neural network (CNN), as one of the deep learning models, has greatly promoted the progress of remote sensing technology. Remote sensing image analysis has been a hot topic and has been widely applied in many fields such as urban planning, land-use management, and environmental surveillance. Many conventional approaches employed hand-crafted features to achieve object segmentation and tracking. However, these methods struggle to cope with the complicated appearance changes of ground objects in very high resolution (VHR) aerial imagery. In the past decade, deep learning-based semantic segmentation has played an important part in remote sensing applications. The block diagram of the remote sensing system framework is shown in
Figure 1.
Deep learning-based remote sensing methods have many practical applications; the remote sensing data collected from sensors should be processed, and only the informative content should be recorded for future use, such as event detection, human–computer interaction, video abstraction, object tracking [
1], scene segmentation [
2], urban management and planning [
3], etc. Although much work has been done in recent years, effective object segmentation and tracking in complex scenarios remains challenging.
Object segmentation and tracking are important components in the field of computer vision. Deep learning (DL) and correlation filter (CF) technologies have shown great potential for real-time segmentation and tracking tasks. The original CF method achieves excellent inference speed by performing element-wise multiplications in the Fourier domain via the fast Fourier transform. However, the robustness of baseline approaches often drops considerably in complex scenarios. In the last five years, many effective deep network architectures have been widely applied to remote sensing data processing and have become integral parts of our everyday lives. Currently, most existing DL methods rely heavily on abundant training data. Additionally, deep learning models are trained offline on large datasets via an end-to-end framework and then aggressively learn the network parameters online. These methods have shown promising results on several challenging benchmarks.
However, two weaknesses lead to inaccurate object position prediction. One is the inadequate feature fusion of multilayer response maps. The other is the limitation of single-modal data, which is not always sufficient to reach adequate spectral and spatial resolutions. Thus, multi-source data acquired by sensors onboard different platforms should be combined.
The growth of multimodal data poses a challenge for efficient data processing technology. Existing deep object detection architectures are designed for general vision tasks. However, targets in remote sensing images often occupy only a few pixels and exhibit arbitrary perspective transformations, so many technical challenges remain open.
In this paper, we design a real-time convolutional framework for remote sensing data detection and segmentation. The proposed framework is capable of conducting mid-level fusion of multiple sources of data. Experiments were carried out on some common datasets. The contributions of the paper can be summarized as follows.
(1) We propose an effective feature information–interaction visual attention model for multimodal data fusion and enhancement, which utilizes channel information to weight self-attentive feature maps of multi-source data, completing the extraction, fusion, and enhancement of global semantic features together with local contextual information of the object.
(2) To improve the effectiveness of multi-source feature fusion, we further develop an adaptively cyclic feature information–interaction model, which adopts branch prediction to decide the number of visual perceptions, accomplishing adaptive fusion of global semantic features and local fine-grained information.
(3) Our experiments reveal that the proposed approach provides competitive advantages with respect to baseline methods.
The rest of the paper is organized as follows. In
Section 2, we review related work.
Section 3 introduces our approach for object segmentation tasks in detail. In
Section 4, we report results on some common datasets.
Section 5 discusses the advantages of our approach.
Section 6 summarizes this paper.
2. Related Works
Recently, object segmentation and tracking methods have been studied extensively, and deep learning-based methods can produce satisfying results. In the following, we mainly review the key methods related to our approach.
Deep learning approaches: Deep learning technology is utilized to enhance the robustness of visual tasks (e.g., segmentation, classification, and tracking). Some well-known methods combine DL models with CF to perform accurate segmentation and tracking, such as HCF (hierarchical convolutional features) [
4], MOS (multiscale optimized segmentation) [
5], DeepSRDCF (convolutional features for correlation filter) [
6], ECO (efficient convolution operators) [
7], benchmarking [
8], SSCF (spatial semantic convolutional features) [
9], RNT (residual network tracker) [
10], GAN-RI (GAN re-identification) [
11], and DRDN (deep residual dense network) [
12]. Another approach utilizes classification and regression networks to formulate object segmentation and tracking tasks, such as SMIS (supervised methods image segmentation) [
13], FCNT (fully convolutional networks) [
14], DeepTrack [
15], and CNN-SVM [
16]. An obvious benefit of the above methods is that the high-level semantic features of deep network models are utilized to match objects. However, the online update mechanism of the object template increases the computational complexity.
In the last five years, several effective deep network models were trained on large classification datasets offline and employed to segment and track objects online, including MDNet (multi-domain network) [
17], CFNet (correlation filter network) [
18], ACFN (attentional correlation filter network) [
19], etc. Recently, the Siamese network model [
20,
21,
22,
23,
24] has successfully addressed these inaccuracy issues. SINT (Siamese instance network tracker) [
20] regards the tracking problem as a verification task and learns a similarity measure for object matching in each frame. The representative approaches contain SiamRPN++ [
21], deeper and wider Siamese tracker [
22], DaSiamRPN [
23], and so on. In the VITAL method [
25], hard samples are generated via adversarial learning, and an effective loss function is leveraged to address the class imbalance problem. These kinds of methods promote the development of deep learning models and obtain satisfying evaluations on several challenging datasets. However, most deep learning models suffer from under-fitting problems because of a lack of training samples.
Correlation filter approaches: Approaches based on the correlation filter framework achieve a promising trade-off between speed and accuracy [
26,
27,
28,
29,
30,
31,
32,
33,
34]. They are classified into two categories: baseline methods and improved regularization methods.
Several baseline methods have been presented to improve speed and accuracy by utilizing scale prediction [
27], spatial regularization [
28,
35], and long-term tracking [
29]. The MOSSE method first introduced correlation filters into object tracking using a single feature channel. Then, Henriques et al. [
26] proposed an effective kernelized tracking method (KCF) using the circular correlation solution scheme for ridge regression. Danelljan et al. [
27] trained a DCF using a scale pyramid representation to handle the scale variations of the object. However, baseline methods are constrained to a limited detection region because the filter size must equal the patch size.
To solve the above problems, some improved regularization methods have been developed, including SRDCF [
28], DeepSRDCF [
6], ECO [
7], STRCF [
30], C-COT [
31], CSR-DCF [
32], ATOM [
33], MCPF [
36], etc. In ACFN [
19], a subset of the associated CFs is chosen as an attention scheme to improve performance. To alleviate unwanted boundary effects, SRDCF further introduces a spatial constraint [
28] to penalize the correlation filter coefficients during the training process. Several approaches combine correlation filters with high-level semantic features, which produces a remarkable advance in performance. CSR-DCF [
32] exploits color histograms as features to obtain a saliency response map in the Fourier domain, which trains the attention network model in an end-to-end way.
Transfer learning approaches: There have been many efforts to utilize the transfer learning for processing remote sensing imagery [
37]. The first successful application of these models to object tracking was presented by Wang et al. [
38]. They pre-trained a stacked denoising auto-encoder (SDAE) on the ILSVRC dataset and then transferred it to an object tracking task. Since then, several supervised transfer learning-based approaches have been presented to segment and track objects. They train deep models offline using other datasets as source domains and then use the learned models online to obtain satisfying accuracy in the target domain. However, the high computational cost is an obvious deficiency. Some transfer learning-based methods utilize features from different deep network layers to improve tracking performance. Gao et al. [
39] exploited the extracted prior knowledge from the Gaussian processes learning to improve the robustness. In [
40], an effective offline-trained meta-updater was presented to achieve robust tracking performance, which consisted of an online local tracker, a meta-updater, a re-detector, and an online verifier in the long-term tracking framework.
Other approaches: Benedek et al. [
41] proposed a novel object-change modeling approach based on multitemporal marked point processes, which simultaneously exploits low-level change information between the time layers and the object-level building description to recognize and separate changed and unaltered buildings. Xu et al. [
42] developed an image segmentation neural network based on the deep residual networks and used a guided filter to extract buildings in remote sensing imagery. Grinias et al. [
43] proposed a novel segmentation algorithm based on a Markov random field model and obtained good classification performance. Shi et al. [
44] constructed a convolutional network based on a generative adversarial network to discriminate between ground truth maps and generated maps by the segmentation model.
3. Methodology
In this section, we first introduce the problem and the motivation and then give a detailed description of deep network architecture and channel attention. Finally, we apply our deep network architecture to object segmentation and tracking.
3.1. Problems and Motivations
In the remote sensing field, the Single Shot MultiBox Detector (SSD) [
45] is one of the most representative detection methods with respect to the speed–accuracy trade-off. Nevertheless, some drawbacks limit the accuracy of the algorithm. First, the semantic information of shallow layers is weak, and these layers fail to capture globally dependent information needed to predict small and dense clusters of objects in RS images. Second, the feature maps from medium layers suffer from feature confusion, which makes it difficult to accurately regress bounding boxes. Finally, the deep layers carry little object contextual information, so they fail to predict large objects confidently.
Inspired by information guidance between self-attentive models [
37], we propose a feature information–interaction model, which introduces feature map channel weights into the self-attentive module and uses a weighting mechanism to focus on regional blocks. On this basis, an adaptively cyclic information–interaction visual model is developed to solve the problem of insufficient feature fusion, which concentrates on the feature map more than once to distinguish the object from background clutter.
3.2. Feature Information–Interaction Model
As mentioned above, the existing self-attentive models associate the internal information of feature maps and concentrate on the local information of the object, ignoring the inter-channel feature information association, i.e., the global semantic feature information of the object. To tackle the problem, we propose a feature information–interaction model (FIM), where weighted channels of feature maps are proposed to perceive the global semantic and the local fine-grained features of the object. The overall structure of FIM is shown in
Figure 2.
Given the feature map U ∈ R^{H×W×C}, the maps U1 ∈ R^{H×W×C}, U2 ∈ R^{H×W×C/8}, and U3 ∈ R^{H×W×C/8} are obtained through convolution operators. The channel attention and the self-attention modules are built from these feature maps. For the channel attention, global average pooling and a sigmoid are utilized to obtain the weighted-channel feature map U12 ∈ R^{1×1×C}. For the self-attention module, we utilize dimension transformation to obtain the intermediate response map, and the weighted-channel supervision information is utilized to assist the intermediate self-attention feature map. U22 ∈ R^{H×W×C}, U23 ∈ R^{1×1×C}, and U32 ∈ R^{H×W×C} are obtained to represent the enhanced feature maps with global semantic features and local fine-grained information. Additionally, we merge the original feature map U into the enhanced feature maps to enrich the semantic feature information. Therefore, the adaptively weighted attention information can be obtained through threshold multiply and add operations in Equation (1),
where U14 is the enhanced feature map; U12, U22, and U are the intermediate layer information; α denotes the predicted threshold obtained through convolution; and ⊕ and ⊗ stand for element-wise addition and multiplication, respectively. Then, self-attentive feature maps can be obtained in the spatial dimension by using Equations (2) and (3),
where U24 and U34 are the enhanced feature maps; U, U22, U23, and U32 are the intermediate layer information; and α and β are the predicted thresholds, with α + β = 1 set empirically. Finally, the threshold weighting scheme is utilized to generate the resulting feature map through Equation (4),
where Uf is the resulting feature map; U14, U24, and U34 are the intermediate enhanced features; and α, β, and γ are adaptive thresholds.
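For illustration, a minimal PyTorch sketch of the feature information–interaction idea is given below. The reduction ratio C/8 and the channel-attention branch (global average pooling plus sigmoid) follow the description above, while the threshold-prediction head and the exact fusion order are our assumptions rather than the published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FIMSketch(nn.Module):
    """Illustrative feature information-interaction block (assumptions, not the exact paper code)."""
    def __init__(self, channels):
        super().__init__()
        reduced = max(channels // 8, 1)
        # query/key projections reduced to C/8, value kept at C (self-attention branch)
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # 1x1 conv predicting the three fusion thresholds (alpha, beta, gamma)
        self.threshold = nn.Conv2d(channels, 3, kernel_size=1)

    def forward(self, u):
        b, c, h, w = u.shape
        # channel attention: global average pooling + sigmoid -> per-channel weights
        chn = torch.sigmoid(F.adaptive_avg_pool2d(u, 1))              # (b, c, 1, 1)
        # spatial self-attention over all positions
        q = self.query(u).flatten(2).transpose(1, 2)                  # (b, hw, c/8)
        k = self.key(u).flatten(2)                                    # (b, c/8, hw)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)                 # (b, hw, hw)
        v = self.value(u).flatten(2)                                  # (b, c, hw)
        sa = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)      # self-attentive map
        # weight the self-attentive map with the channel weights, merge the original map back
        enhanced = sa * chn + u
        # adaptive thresholds predicted by convolution, normalized to sum to 1
        t = torch.softmax(self.threshold(F.adaptive_avg_pool2d(u, 1)), dim=1)
        alpha, beta, gamma = t[:, 0:1], t[:, 1:2], t[:, 2:3]
        # threshold-weighted fusion of the channel-weighted, enhanced, and self-attentive maps
        return alpha * (u * chn) + beta * enhanced + gamma * sa

# usage: fim = FIMSketch(256); y = fim(torch.randn(1, 256, 38, 38))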
3.3. Adaptively Cyclic Feature Information–Interaction Model
To enrich the global semantic feature and the local contextual information of the object, we propose an adaptively cyclic information–interaction model (ACFIM) to strengthen the ability of feature extraction. Concretely speaking, the convolutional prediction module is developed to control the location and the number of visual perceptions and adaptively concentrates on the feature map more than once, better distinguishing the similarities and the differences between objects.
Based on our knowledge and experience, the feature maps of the shallow layer have enriched local fine-grained information of the object, and the feature maps of the deep layer contain abundant global semantic features of the object.
Therefore, the cycle counts and the fusion locations differ between the shallow, medium, and deep layers. For the feature maps of the shallow layer, the default number of cycles is set to three to better enrich and represent the global feature flow of the object. For the feature maps of the medium and deep layers, the default number of cycles is set to two to fine-tune the information flow between low-level fine-grained features and high-level semantic features. In addition, we compute an intermediate threshold δ for deciding the cyclic location so as to better adapt the enhanced feature map. If δ is less than a predefined threshold (0.5 in this paper), we merge the enhanced feature map into the original feature map for loop initialization; otherwise, we merge it into the intermediate feature map for information fusion and initialization.
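A minimal PyTorch-style sketch of this cyclic control flow is given below. The default cycle counts and the 0.5 threshold follow the description above, while the branch-prediction head and the merge operations are illustrative assumptions.

import torch
import torch.nn as nn

class ACFIMSketch(nn.Module):
    """Illustrative adaptively cyclic feature information-interaction wrapper (assumptions)."""
    def __init__(self, fim, channels, max_cycles, tau=0.5):
        super().__init__()
        self.fim = fim                    # a feature information-interaction block, e.g., FIMSketch
        self.max_cycles = max_cycles      # 3 for shallow layers, 2 for medium/deep layers
        self.tau = tau                    # predefined threshold deciding the cyclic location
        # branch-prediction head producing the intermediate threshold delta in (0, 1)
        self.delta_head = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(channels, 1, kernel_size=1),
                                        nn.Sigmoid())

    def forward(self, u):
        original, intermediate = u, u
        for _ in range(self.max_cycles):
            enhanced = self.fim(intermediate)
            delta = self.delta_head(enhanced).mean()
            if delta < self.tau:
                # re-initialize the loop from the original feature map
                intermediate = original + enhanced
            else:
                # fuse into the intermediate feature map and continue the cycle
                intermediate = intermediate + enhanced
        return intermediate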
3.4. Objective Loss Function
The overall loss function of SSD is defined as a weighted sum of the confidence loss and the localization loss; more detailed information can be found in [
15]. In our model, we adopt the focal loss for the confidence term to address the class imbalance problem. In addition, we slightly adjust the parameters between the default anchors and the ground-truth boxes, as shown in Equations (5) and (6).
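For reference, a minimal focal-loss sketch for the confidence branch is shown below; the focusing parameter γ = 2 and the weighting factor α = 0.25 are common defaults, not values confirmed by this paper.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Multi-class focal loss for the confidence branch (sketch with common defaults)."""
    log_p = F.log_softmax(logits, dim=-1)                       # (N, num_classes)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log-prob of the true class
    pt = log_pt.exp()
    # down-weight well-classified examples to focus training on hard, imbalanced cases
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()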
3.5. Online Segmentation and Tracking
Given the annotated object location in the first frame, we construct an initial training set of 20 positive samples by using a data augmentation scheme. Then, the backbone model is fine-tuned with the training samples of the first frame.
The appearance of the object may change during processing. To capture such appearance variations, we update the object template using the previous video observations. First, we define a fixed-length unit L that stores the object state at every frame and update the object template once the unit reaches a fixed number of elements. The element with the maximum saliency score in the unit is utilized to update the template of the object.
Thus, the updated template is expressed as in Equation (7), where η denotes an empirical learning parameter, and the new object template in R^{kn×1} consists of the initial object template cf ∈ R^{kn×1} and the last updated template cp ∈ R^{kn×1}. The object features are concatenated into a column vector as the initial template at the first frame. Then, the initial template of the object is combined with the newly updated template to alleviate object drift.
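A plausible form of this update, consistent with the definitions above and assuming the standard linear-interpolation scheme used by many correlation filter trackers (our assumption for illustration, not the paper's verified equation), is

c = (1 − η) · cf + η · cp,

where c denotes the updated object template.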
In the test stage, for each frame, we crop several search regions centered at the object location of the last frame by using a multiple-scale scheme. Then, these search regions are fed into the ResNet-50 network to extract the object features. The fine-grained features of the object are passed to the CF layer, and the high-level object features are obtained through the channel attention module. Furthermore, we randomly draw the object candidates x = {x1, x2, …, xN} based on the object location in the previous frames. Finally, the candidate states in the search region are computed by Equation (8).
The object state with the highest response value is regarded as the final tracking result. Then, we update the object template every 20 frames using training samples obtained from previous frames.
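A simplified sketch of this test-time procedure is given below. The helper functions, scale factors, and the way the best state is selected are illustrative assumptions rather than the exact implementation.

import torch

def track_frame(extract_feats, compute_response, prev_center, prev_size,
                scales=(0.96, 1.0, 1.04)):
    """Pick the best candidate state over several scaled search regions (sketch).

    extract_feats: assumed helper that crops a search region and returns backbone features
    compute_response: assumed helper that applies the CF/attention head to the features
    """
    best_score, best_state = float("-inf"), None
    for s in scales:
        feats = extract_feats(prev_center, prev_size, s)   # ResNet-50 features of the crop
        response = compute_response(feats)                 # response map for this scale
        score, idx = torch.max(response.view(-1), dim=0)
        if score.item() > best_score:
            h, w = response.shape[-2:]
            best_score = score.item()
            best_state = (prev_center, s, (idx.item() // w, idx.item() % w))
    return best_state, best_score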
4. Results
4.1. Implementation Details
In this study, we implemented our approach in the PyTorch framework with an NVIDIA RTX 2080 Ti GPU and an Intel i7 CPU (32 GB RAM) and utilized SSD as our baseline model. The proposed visual attention model was embedded into the four extra prediction modules. During the training stage, without bells and whistles, we followed the original SSD strategies, including the data augmentation, the backbone network, and the scales and aspect ratios of the predefined anchors. The learning rate schedule was slightly changed to obtain better performance.
4.2. PASCAL VOC2007
The PASCAL VOC dataset [
46] contains 20 object categories. The image size varies and is usually 500 × 375 pixels (landscape) or 375 × 500 pixels (portrait). The mean average precision (mAP) was used to measure the performance of the object detection network (
http://host.robots.ox.ac.uk/pascal/VOC/) (accessed on 14 May 2021).
We trained our model on the PASCAL VOC2007 and VOC2012 trainval sets and tested it on the VOC2007 test set. We utilized a learning rate of 10^−3 for the first 80,000 iterations, then decreased it to 10^−4 for the next 20,000 iterations and 10^−5 for the remaining 20,000 iterations. In addition, we adopted a “warmup” strategy that gradually ramped up the learning rate, which contributed to stabilizing the training process. The momentum and the weight decay were set to 0.9 and 0.0005, respectively.
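As an illustration, this schedule can be expressed as a PyTorch LambdaLR; the step points and base hyperparameters follow the text, while the 500-iteration warmup length is an assumption.

import torch

def make_scheduler(optimizer, warmup_iters=500):
    """Warmup + step decay: 10^-3 for 80k iters, 10^-4 for the next 20k, 10^-5 afterwards."""
    def lr_lambda(it):
        if it < warmup_iters:            # linear warmup from a small factor
            return (it + 1) / warmup_iters
        if it < 80_000:
            return 1.0
        if it < 100_000:
            return 0.1
        return 0.01
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# usage (assuming a base learning rate of 1e-3):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=0.0005)
# scheduler = make_scheduler(optimizer)  # call scheduler.step() once per iteration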
Table 1 shows the experimental results on the PASCAL VOC2007 test set for models trained with the VOC07 trainval and VOC12 trainval sets. The proposed approach obtained 79.7% mAP with 300 × 300 input images and 82.1% mAP with 512 × 512 input images, exceeding the latest SSD300* by 2.2 points and SSD512* by 2.6 points.
In
Table 2, we compare the proposed method with other methods under the same baseline model. For fairness and simplicity, we simply replaced our module with other visual attention models. Our method considers object feature information–interaction under visual perception and obtains the best accuracy among all models. Without bells and whistles, the baseline SSD with FIM achieved 79.48% mAP on the VOC2007 test set, which demonstrates the effectiveness of concentrating on the interaction of semantic features and contextual information.
4.3. MS COCO
The COCO dataset [
50] is a large, rich object detection, segmentation, and captioning dataset. It is mainly composed of complex everyday scenes in which targets are annotated with accurate segmentation masks. The dataset covers 91 object categories, 328,000 images, and 2,500,000 labels. It is by far the largest dataset with semantic segmentation, providing 80 categories and more than 330,000 images, of which about 200,000 are annotated (
http://cocodataset.org/#home) (accessed on 15 May 2021).
To further verify the effectiveness of the proposed method, we trained our model on MS COCO. We utilized the trainval35k split (118,287 images) for training and evaluated the results on minival. The batch size was set to 32 for 300 × 300 input and 16 for 512 × 512 input. We trained the model with a learning rate of 10^−3 for the first 280,000 iterations, then 10^−4 and 10^−5 for the remaining 120,000 and 40,000 iterations. In Table 3, we observe that our method achieved 27.6% AP@[0.5:0.95], 46.8% AP@0.5, and 28.7% AP@0.75, which improved the baseline model SSD300* by 2.5, 3.7, and 2.9 points, respectively. Our model with 512 × 512 input images also outperformed the baseline SSD512*.
It is noticeable that our model with 300 × 300 and 512 × 512 input images achieved 8.9% AP and 13.0% AP for small objects, respectively. The proposed method is thus more powerful in detecting small objects. For medium and large objects, the results also validate the effectiveness of the feature information–interaction scheme.
4.4. HRSC2016
To further verify the validity of our approach, we conducted experiments on a remote sensing dataset collected from Google Earth, which contains harbors with complex scenarios and is called High Resolution Ship Collections 2016 (HRSC2016). In this dataset, image sizes range from 300 × 300 to 1500 × 900 pixels, and image resolutions range from 0.4 to 2 m. In addition, an inclined bounding box and a horizontal bounding box are provided as ground truth for each ship.
Figure 3 shows some representative detection results when the threshold was set to 0.6. The experimental results show the advantages of our approach in dense ship detection. Multi-source data were fused, which gave the fused features strong discriminative ability and overcame the limitation of a single modality. The proposed model achieved higher accuracy and better generalization. In other words, by using multi-source fusion information, our approach is more robust to severe background clutter and fragmentary ships.
4.5. LaSOT
The LaSOT dataset [
51] is a single target tracking dataset with 1400 video sequences; each video has an average of 2512 frames, where the shortest video has 1000 frames, and the longest contains 11,397 frames. It is divided into 70 categories, each consisting of 20 video sequences (
https://cis.temple.edu/lasot/) (accessed on 24 February 2021).
Our approach was also evaluated on the LaSOT test benchmark of 280 videos, which have an average of 2500 frames per sequence, making the appearance change of an object an important challenge.
Figure 4 reports the evaluation results.
The ATOM tracking method uses the ResNet-18 network model to separate the object from the image. The evaluation results show that our method achieved an AUC score of 51.6% and a lower failure rate of 15.1% while obtaining comparable robustness.
Figure 5 gives the qualitative segmentation results of our method and other competing methods.
4.6. Visualization Analysis
Comprehensive experiments were conducted on the ISPRS dataset. The dataset contains 38 images, each a true orthophoto (TOP) tile extracted from a larger TOP mosaic, and is divided into six land cover classes. For this dataset, 24 labeled classification images are provided; the ground truth of the remaining scenes is unreleased and reserved by the benchmark for test verification. We utilized 15 images for training and nine images for testing. The network was trained with a data augmentation strategy based mainly on rotation and scale variations of the images.
Figure 6 reports the visualization results. The first column is the original and the resized true orthoimages for fair evaluation. The second column is the segmented output of our proposed approach. The last column indicates wrongly classified pixels via red/green image.
Figure 7 illustrates the results of the proposed method on some challenging videos. In the first row of sequences, the object experienced illumination variation and scale changes. We can see that the proposed approach was able to cope with these challenging factors and kept track of the object successfully, which is attributed to our use of both the transfer learning model updating strategy and the attention mechanism. However, other methods failed to match the object during the tracking process due to illumination and scale changes. MDNet suffered from the illumination changes and gradually missed the object. ECO did not perform well at the 78th frame.
In the second and the third rows, these sequences encountered the challenges of scale variation, occlusion, and low contrast. These challenging factors greatly increased the difficulty for robust tracking. We can see that the proposed approach performed more robustly compared with other trackers, which drifted away from the object due to scale variations and occlusion. The proposed method performed well when the object was occluded in a complex scenario.
In the remaining video sequences, most of the trackers missed the object and drifted. Our algorithm performed with better accuracy in this scenario. This was mainly due to the proposed transfer learning model updating strategy and the attention module, which made the learned network concentrate on robust object features and reduced the influence of background clutter within the image area. Overall, the proposed approach could track the object well in these challenging sequences.
4.7. Quantitative Evaluation
In order to further verify the effectiveness and the feasibility of the proposed method, we carried out the quantitative evaluation by calculating the accuracy evaluation indicators on the test set. The accuracy evaluation indicators included the precision ratio (PR), the recall ratio (RR), and the F1-score.
The PR is the ratio of true positives to the sum of true positives and false positives. The RR is the ratio of true positives to the sum of true positives and false negatives. The F1 score integrates PR and RR; the higher the F1 score, the better the model prediction. Here, TP (true positive) and TN (true negative) denote the total numbers of object pixels and non-object pixels correctly predicted, respectively, while FP (false positive) and FN (false negative) denote the total numbers of pixels incorrectly predicted in the object and non-object regions, respectively; Total denotes the total number of pixels. The precision and the recall both range from 0 to 1, and the F1 score reaches its best value at one and its worst at zero.
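For reference, the standard formulas consistent with these definitions are

PR = TP / (TP + FP),
RR = TP / (TP + FN),
F1 = 2 · PR · RR / (PR + RR).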
We compared our method with state-of-the-art methods, including GAN [
44], FCN [
52], and SegNet [
53] on the ISPRS dataset. The evaluation results are reported in
Table 4. Our method outperformed all compared methods on the dataset. We can see that all evaluation indicators of the proposed method improved compared to state-of-the-art methods. The main reason is that the proposed method benefits from the feature information–interaction model (FIM). By introducing FIM, our method weighted channels of feature maps to perceive the global semantic and the local fine-grained features of the object. Moreover, these deep learning methods can make decisions at multiple layers to improve the accuracy.
The proposed method achieved state-of-the-art results on the Vaihingen and the Potsdam datasets in
Table 5. It can be clearly observed that the results support the idea that it is beneficial to use the cyclic feature information–interaction model.
5. Discussion
We further carried out the evaluation experiments to explain the contributions of different modules and different layer features. The AUC scores are reported using different backbone networks in
Table 6.
Feature selection: Features from different layers play a significant part in tracking tasks. We found that the type of network layers and the number of parameters directly affected the tracking performance. First, ResNet-50 and AlexNet were considered as backbone networks to evaluate the accuracy of the proposed approach on two popular benchmarks. The proposed approach and SiamRPN++ exhibited stronger performance, benefiting from the deeper model. In other words, our approach achieved an obvious improvement by fine-tuning the network parameters. In addition, the evaluation shows that conv4 alone achieved satisfying performance with an EAO of 0.347, whereas using only lower-level or higher-level features led to drops of about 5%. Unsurprisingly, a significant improvement could be obtained by combining conv4 and conv5.
Effectiveness of different components: Our method includes the CF, S, and A modules, which denote the correlation filter layer, SiamRPN, and the channel attention component, respectively. To verify the effectiveness of these modules, the following variants of the proposed approach were implemented: (1) ours (S) denotes the tracking method utilizing only SiamRPN to predict the location of the object in each frame; (2) ours (CF) combines the shallow-layer CF and the deep feature representation to predict the object state in each frame; (3) ours (A) stands for our method with the attention scheme; and (4) ours (S + CF + A) denotes the full approach proposed in this paper. The contribution of the different modules is reported in
Table 7.
Table 8 reports the evaluation performance of the variants and verifies that all components improved the tracking accuracy. Removal of the channel attention module from our approach led to a 3.1% precision drop; the precision of the variant dropped by 7.8% without the correlation filter layer. We can see that the performances of both variants were comparable to the S module, but the failure examples increased during tracking and segmentation processing. This is because the channel attention module could focus on the important part of the image. Thus, the attention scheme was very important to achieve reliable tracking and segmentation. Our method resulted in a 9.5% EAO and an 8.7% accuracy improvement due to the rotated bounding box estimation.
Impact of different loss function terms: We compared different loss function terms, and their impacts are shown in
Figure 8. It is quite clear that every loss term made its contribution to the performance of our approach. Meanwhile,
Table 8 illustrates the contribution of every loss term by showing that our approach outperformed every variant, which indicates the importance of end-to-end training of the loss function.
Figure 9 reports several building segmentation results of the UNet [
54] and the ResNet-50. The yellow, the green, and the red pixels denote “false negative”, “true positive”, and “false positive”, respectively. As shown in
Figure 10, the flyover was wrongly labeled as a building by the ResNet-50 and UNet network models. We can see that the proposed method could remove most false alarms thanks to the introduction of the attention mechanism into the deep network model, which resulted in more precise segmentation results; this indicates that integrating the channel attention module into the remote sensing image processing pipeline helped to improve performance. Our network architecture yielded satisfying segmentation results. In addition, the generalization ability was improved by the data augmentation strategy.
Failure cases: Although the proposed method could obtain good performance on public datasets, our method did not achieve the desired results for some farmlands. As shown in
Figure 10, our method had difficulty in some cases; the areas bounded by green boxes were wrongly detected as urban construction changes. This might be due to the ground surface changes that usually happen in farmlands. Moreover, labeling noise might further degrade the performance.
6. Conclusions
In this paper, we proposed a feature information–interaction model for multi-source data fusion under visual perception, which adopts the channel information of feature maps to weight the self-attention feature maps of multi-source data, completing the extraction, fusion, and enhancement of global semantic features together with object contextual information. We then presented an adaptively cyclic feature information–interaction model, which adopts a branch prediction mechanism to decide the number of visual perceptions, accomplishing repeated adaptive fusion of global semantic features and local detailed information. Experimental results demonstrate that the proposed approach significantly improves the accuracy of the baseline model.
However, our method still needs to be improved in terms of speed and real-time performance. How to balance computational complexity and accuracy remains a big challenge. In the future, we would like to explore models with lower computational complexity. Additionally, better pre-trained models will be applied to this research as deep networks continue to develop.