1. Introduction
Countries worldwide pay close attention to urban land planning, agricultural management, and marine environmental monitoring, all of which bear on economic and military development. These activities also have a significant impact on the ecological environment of the earth [
1,
2,
3]. Earth observation technology is therefore of vital importance. Remote sensing (RS) greatly extends the scope and quality of earth observation and allows researchers to observe the characteristics of objects on the earth at high resolution. In the field of RS, extracting objects with high precision from optical and synthetic aperture radar (SAR) images remains a major challenge and has attracted widespread attention.
Benefiting from the advancement of neural networks, increasing numbers of researchers are applying deep learning methods to RS image processing [
4,
5]. In general, RS image processing can be divided into several tasks. The first is an image-level task, where different objects are categorized into their respective classes based on their attributes and features [
6,
7]. The second task is an object-level task, which locates the specific position of each object in the picture and separates the region of interest (RoI) from the background region using bounding boxes [
8,
9,
10]. Finally, in the pixel-level task, semantic-segmentation-based methods separate the background from predicted objects on a pixel-by-pixel basis, but they are unable to distinguish different objects within the same category [
11,
12]. Object-detection and semantic-segmentation methods are also utilized to interpret images in RS image processing [
13,
14,
15]. By performing instance-level localization and pixel-level classification of objects of interest simultaneously, instance segmentation (IS) methods combine the benefits of object-level methods and semantic-segmentation methods and thus provide a more comprehensive interpretation. Because objects in SAR and optical images are densely and compactly distributed, we adopt IS to interpret objects at the pixel level and to distinguish individual objects within the same category.
IS can be divided into one-stage-segmentation methods [
16,
17,
18] and two-stage-segmentation methods [
19,
20,
21]. One-stage IS algorithms directly extract features from the input image and output results in a single forward pass, which typically yields faster inference. Although one-stage methods perform well in real-time settings, their accuracy still falls somewhat short of the high-precision segmentation required for high-resolution RS images. Two-stage IS algorithms, in contrast, suit scenarios that demand high segmentation accuracy because they locate and segment objects more accurately, although their processing speed tends to be slower. The mask region-based convolutional neural network (Mask-RCNN) [
21] is the most representative algorithm among two-stage-segmentation paradigms. On the basis of the Faster-RCNN [
9], a fully convolutional network (FCN) [
11] is added in parallel as a segmentation head (in deep learning, a head usually refers to the network layer that predicts the output) to predict the instance masks. Mask-RCNN often serves as a baseline in numerous IS comparative experiments. Owing to its excellent segmentation performance, many IS algorithms in the field of RS build on it [
22,
23,
24] and have achieved good segmentation results. The choice of the method depends on the specific application scenario and the desired level of accuracy and speed for IS.
Recent years have seen numerous innovations in IS using RS images. The cascade network was first proposed by Cai et al. [
25] and is extensively used by researchers in detection and segmentation tasks [
19,
20]. The structure consists of multiple stages; the output of each stage is passed to the next stage as input, so that information is transferred and refined stage by stage. Inspired by the cascade design pattern in object-level tasks, Su et al. [
26] proposed a high-quality IS network (HQ-ISNet). This method combines a high-resolution feature pyramid network (FPN) [
27] with a cascade network, making full use of multilevel feature maps to interpret complex backgrounds in RS images. Subsequently, Zeng et al. [
28] proposed the consistent proposals of IS network (CPISNet) to improve the feature-extraction method and integrated a cascade network with residual connections into their network to further improve the accuracy of aerial image mask prediction. The development of the attention mechanism [
29,
30] has had a great impact on RS image IS. The attention mechanism is an important approach in image processing which can automatically select the most relevant or important parts of the input data by learning weights. It allows the model to focus on specific regions or features, emphasizing areas that contain crucial information, thereby improving recognition accuracy. Zhao et al. [
31] proposed a SAR ship IS method based on collaborative attention mechanisms which combines the advantages of multiple attention mechanisms and can effectively extract the instance masks of objects in the image. Fang et al. [
32] introduced a contour-refinement network that works from coarse to fine. They utilized an attention-based feature pyramid subnetwork to enhance the prediction capability for small objects. The contour extracted from the coarse branch was further refined through the fine branch by learning edge features to achieve fine-grained building contours. Sun et al. [
33] proposed a multiscale feature pyramid network which applied the attention mechanism to multiscale features to improve the performance of multiscale ship detection in high-resolution SAR images. Unfortunately, when some attention mechanisms overemphasize certain local features, other critical global features may be neglected, leading to the loss of important details, and thus negatively affecting the RS image-segmentation outcome.
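To make the weighting idea behind the attention mechanism concrete, the following is a minimal sketch in plain Python of scaled dot-product attention over a few region features. It is not the mechanism used in any of the cited works: the function names `softmax` and `attend` and all numbers are illustrative assumptions, and in a trained network the query, key, and value vectors would come from learned projections rather than fixed lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by the normalized similarity of its key to the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy example: three image regions with 2-D features (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([2.0, 0.0], keys, values)
```

In this toy case the second region receives the smallest weight because its key is orthogonal to the query, which also hints at the drawback noted above: features that receive little weight contribute little to the output, even if they matter globally.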
However, the aforementioned methods are two-stage approaches based on the region proposal network (RPN) [
9]. Although this design yields good detection accuracy, it has limitations: the aspect ratios and scales of the anchor boxes must be predefined, which requires prior knowledge or experience and lacks flexibility and versatility. In addition, anchor boxes designed for objects of different sizes often require manual adjustment and may not adapt well to RS images with complex backgrounds and objects of varying sizes. To address these constraints, several empirical studies have shown that the algorithm can be improved through a sparse detection paradigm [
8,
34].
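The dependence of anchor-based detectors on hand-chosen settings can be illustrated with a short sketch of anchor generation at a single feature-map location. The function name `make_anchors` and the scale and aspect-ratio values are assumptions chosen for illustration, not settings taken from any cited method.

```python
import math

def make_anchors(center_x, center_y, scales, aspect_ratios):
    """Return (x1, y1, x2, y2) anchor boxes centred on one location.

    Each anchor has area scale**2 and a width/height ratio equal to
    the given aspect ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ))
    return anchors

# Hand-picked scales and ratios (illustrative assumptions): 3 x 3 = 9 anchors.
anchors = make_anchors(64.0, 64.0, scales=[32, 64, 128],
                       aspect_ratios=[0.5, 1.0, 2.0])
```

Every location on the feature map receives the same fixed set of boxes, so objects whose size or shape falls outside the chosen scales and ratios are matched poorly unless the lists are retuned by hand.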
Following the successive proposals of many effective RS image-segmentation methods, some researchers have found that the quality of the segmentation depends not only on the algorithms themselves but also on the quality of the dataset’s annotation. Cheng et al. [
35] produced the Northwestern Polytechnical University Very-High-Resolution dataset (NWPU-VHR-10 dataset), which consists of very-high-resolution images. Su et al. [
26] expanded the NWPU-VHR-10 dataset by providing pixel-level IS annotations. Building on this work, the SAR ship detection dataset (SSDD) was also extended with closed polygon annotations for pixel-level ship interpretation, overcoming the limitations of existing SAR datasets. Wei et al. [
36] created a high-resolution SAR image dataset (HRSID), which is particularly useful for ship detection and the IS of SAR images. Although RS image datasets are now plentiful, obtaining a large amount of high-quality annotated data remains difficult and often incurs significant manual costs. The application of transfer-learning techniques [
37,
38] substantially alleviates the training difficulties caused by a lack of training data.
As mentioned above, compared with natural images, optical and SAR images are characterized by cluttered backgrounds, large differences in the size of objects, and complex shapes of instances caused by factors such as the capturing distance and orientation (as shown in
Figure 1), and thus segmentation performance on them is limited. To address this problem, this paper proposes a query-based cascade instance segmentation network (QCIS-Net), an RS image cascade instance segmentation network based on an efficient query mechanism. The network uniformly represents the location and visual information of instances in the RS image through queries. Its key components are the efficient feature extraction (EFE) module, the multistage cascade task (MSCT) head, and the joint loss function. The EFE module draws on the global information modeling capability of the Transformer [
39] to solve the long-range dependency problem in visual space. The dynamic detection and segmentation head in the MSCT head uses a dynamic convolution kernel based on the query representation to concentrate on the part of the image that is of interest. Its multistage structural design links the detection and segmentation tasks, allowing both tasks to benefit from each other and gradually improving the quality of the mask. The carefully designed joint loss function guides the training of QCIS-Net so that it generates the final instance masks.
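As a rough illustration of what a dynamic convolution kernel based on the query representation can mean, the sketch below generates a 1x1 kernel from a query vector and applies it to the features of a region of interest. This is only one plausible reading under stated assumptions, not the MSCT head itself: the names `linear`, `dynamic_filter`, and `kernel_generator`, the tensor shapes, and all numbers are hypothetical.

```python
def linear(vector, weight):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]

def dynamic_filter(query, kernel_generator, roi_features):
    """Apply a query-generated 1x1 kernel at each spatial location of an RoI.

    query: list of D floats.
    kernel_generator: C rows of D floats (would be learned in practice).
    roi_features: list of locations, each a list of C channel values.
    """
    kernel = linear(query, kernel_generator)  # one weight per channel
    return [sum(k * f for k, f in zip(kernel, location))
            for location in roi_features]

# Toy RoI with two locations and two channels (made-up numbers).
generator = [[1.0, 0.0], [0.0, 1.0]]
roi = [[3.0, 7.0], [2.0, 9.0]]
response_a = dynamic_filter([1.0, 0.0], generator, roi)  # kernel picks channel 0
response_b = dynamic_filter([0.0, 1.0], generator, roi)  # kernel picks channel 1
```

The point of the sketch is that the filter weights are not fixed after training but depend on the query, so different queries respond to different parts of the same features.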
4. Discussion
Because the QCIS-Net model is relatively large, it converges slowly on the two datasets without transfer learning, which leads to poor results. With transfer learning, however, the accuracy of the model improves significantly, and the proposed method outperforms the compared methods. Nevertheless, the APL results (AP for large objects) fluctuate considerably in the experiments, possibly because the proportion of small objects in the SAR and optical experimental datasets is significantly higher than that of large objects.
Although our approach has achieved good results in the IS of SAR images, the dataset for the SAR images is limited, and the annotations in the original images are not detailed enough. As a result, the final training results may not always perfectly fit the instance objects. We therefore plan to expand the SAR image dataset in the future and to annotate it more precisely. In addition, initializing the object query to zero during training can lead to relatively slow convergence of the network. Hence, we plan to use the pyramid layer to output feature maps at different levels and to guide the initialization of the query with instance activations from these feature maps. The query would then carry semantic information and instance cues from the image at initialization, which should accelerate the convergence of the model. The cascade architecture yields good results, but it slows inference, which makes the network unsuitable for mobile devices. Therefore, building real-time inference models through lightweight model techniques, such as knowledge distillation, is one of the priority tasks in our future work.
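For reference, the following is a minimal sketch of the classic knowledge-distillation objective on classification logits, in which a small student model is trained to match the temperature-softened output distribution of a large teacher. It is a generic formulation rather than a planned design for QCIS-Net: the function names, the temperature value, and the logits are illustrative assumptions, and distilling a detection and segmentation network would require additional terms.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]  # made-up teacher logits
loss_same = distillation_loss(teacher, teacher)          # student matches teacher
loss_diff = distillation_loss([1.0, 1.0, 1.0], teacher)  # uninformed student
```

The loss vanishes when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets a lightweight student inherit behaviour from a heavier cascade model.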