Article

A New Instance Segmentation Model for High-Resolution Remote Sensing Images Based on Edge Processing

1 School of Computer Science and Technology, Hainan University, Haikou 570228, China
2 Haikou Key Laboratory of Deep Learning and Big Data Application Technology, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(18), 2905; https://doi.org/10.3390/math12182905
Submission received: 5 August 2024 / Revised: 9 September 2024 / Accepted: 16 September 2024 / Published: 18 September 2024
(This article belongs to the Special Issue Advances in Computer Vision and Machine Learning, 2nd Edition)

Abstract

To address the challenges posed by small, densely packed targets in remote sensing images, we propose a high-resolution instance segmentation model named QuadTransPointRend Net (QTPR-Net). This model significantly enhances instance segmentation performance in remote sensing images. The model consists of two main modules: preliminary edge feature extraction (PEFE) and edge point feature refinement (EPFR). We also design TransQTA, an approach for edge uncertainty point selection and feature processing in high-resolution remote sensing images. QTPR-Net uses multi-scale feature fusion and transformer techniques to refine rough masks and fine-grained features for the selected edge uncertainty points while balancing model size and accuracy. Based on experiments performed on three public datasets, NWPU VHR-10, SSDD, and iSAID, we demonstrate the superiority of QTPR-Net over existing approaches.

1. Introduction

High-resolution remote sensing image instance segmentation holds significant importance in image processing and remote sensing applications. Instance segmentation provides more detailed and comprehensive information about land feature boundaries, enabling the more accurate recognition and differentiation of individual targets in remote sensing images; applications such as land classification [1], target detection [2], and environmental monitoring [3] depend on this capability. High-resolution remote sensing images possess richer spectral, shape, and texture features than natural images, along with more structural targets and abundant scene information, making the instance segmentation of high-resolution remote sensing images more challenging.
Traditional image instance segmentation methods have matured considerably [4,5,6,7,8], aiming to segment every independent object within an image at the pixel level. Compared with natural images, image segmentation in remote sensing images has unique characteristics, such as many small targets, small inter-class variances, large intra-class variances, significant scale differences among categories, high learning difficulty, and complex backgrounds. Therefore, instance segmentation methods designed for natural images will not yield optimal results in remote sensing applications. To achieve optimal segmentation results for remote sensing datasets, researchers have developed instance segmentation models tailored to remote sensing image characteristics [9,10,11,12,13]. Cao et al. [14] proposed a remote sensing image instance segmentation method based on BoxInst from the perspective of weak supervision, which fully utilizes the existing rich OBB annotations and reduces the annotation burden; to address the issue of edge similarity in remote sensing images, this framework incorporates Canny edge supervision in a data-driven manner. The DCTC model [15] transforms classification problems into regression problems, iteratively regressing contours in remote sensing images to extract more accurate contour information and improve edge segmentation accuracy. QCIS-Net [16] is an end-to-end instance segmentation method that combines a transformer architecture with query-based methods to efficiently extract features and facilitate the correlation between the multi-level tasks of detection and segmentation, solving the long-term dependency problem in the visual space during remote sensing image instance segmentation.
We analyze the impact of the characteristics of remote sensing images on instance segmentation in existing research and divide the reasons into three different aspects: complex backgrounds, multi-scale targets, and interclass similarity with intraclass variability. High-resolution remote sensing images can provide more details thanks to advancements in remote sensing imaging technology and improved image resolution, which leads to better edge recognition accuracy. Despite this, researchers have confirmed that the feature information of individual target instances in remote sensing images is insufficient for segmenting them using existing natural image instance segmentation methods. HQ-ISNet [9] fully utilized multi-level feature maps to improve the mask branch and alleviate spatial resolution loss in a feature pyramid network (FPN) [17], effectively overcoming the effects of complex backgrounds on remote sensing image segmentation. Li et al. [10] used a region proposal network (RPN) [5] and key points to enhance mask precision and boundary accuracy, resulting in the more accurate extraction of buildings from complex backgrounds. As a means of mitigating the issue of rough edge segmentation, Chen et al. [12] developed a supervised edge attention module that suppressed irrelevant features and highlighted edge feature details.
We propose a method for high-resolution remote sensing image instance segmentation based on PointRend [13] that uses an improved quadtree attention mechanism [18] to compute attention from coarse to fine, first roughly segmenting coarse masks and then extracting fine-grained features from the relevant mask regions.
The research contributions of this study are as follows:
  • We propose a model for segmenting high-resolution remote sensing images based on QuadtreeAttention and a transformer called QTPR-Net. This method comprises two main parts: a preliminary edge feature extraction (PEFE) module and a refinement module for the edge point feature (EPFR), achieving high accuracy in remote sensing image instance segmentation;
  • In the PEFE module, we propose an edge point detection strategy suitable for high-resolution remote sensing images, recursively adding coarse-grained features layer by layer. Through multi-level feature fusion, uncertain points are selected in areas of high uncertainty;
  • As part of the EPFR module, we propose a transformer structure based on QuadtreeAttention (TransQTA), which utilizes a quadtree attention mechanism of the token pyramid structure to select the highest scoring areas and add positional encoding. It captures different contextual information to produce precise mask predictions for edge pixels through a multi-level structured design.
The effectiveness of QTPR-Net has been validated using three public remote sensing image datasets: NWPU VHR-10 [19], SSDD [20], and iSAID [21]. This paper is organized as follows: Section 2 discusses related work, Section 3 presents a detailed introduction to our proposed model, and Section 4 discusses and analyzes the datasets used for the experiments, experimental details, evaluation criteria, and experimental results. Finally, Section 5 offers a review and summary.

2. Related Work

2.1. Instance Segmentation of Remote Sensing Images

Remote sensing images have traditionally been interpreted primarily for automatic target detection. Traditional interpretation relies on human-defined features and is heavily dependent on expert knowledge, which limits the expressive power of the features and the effectiveness of detection. In response to increasingly complex remote sensing applications and demands, deep learning-based instance segmentation methods are being explored. Instance segmentation is an advanced computer vision task that combines object detection and semantic segmentation. An increasing amount of research has been conducted on the instance segmentation of remote sensing images, especially high-resolution remote sensing images.
Remote sensing image instance segmentation research has primarily focused on deep learning technologies, such as multi-scale feature fusion, dilated convolution, and attention mechanisms, in recent years. In multi-scale prediction, signals are sampled at varying granularities, and features are observed at various scales. Combining different levels of semantic information and spatial geometric information can produce more comprehensive and complete predictions. Gao et al. [22] introduced the CBAM module into the feature fusion process of the FPN, extracting significant features at different scales, enhancing the capability to represent features, and reducing interference from irrelevant information. This method can improve segmentation performance by applying different weights to input features, but it was only tested on the SSDD dataset, and its generalizability needs to be examined further. To mitigate the issue of an FPN not fully utilizing shallow feature maps, which are very useful for the detection and segmentation of small ships, Sun et al. developed a multi-scale feature pyramid network (MS-FPN [23]) using an atrous convolutional pyramid (ACP) [24]. The ACP module does have some limitations, however: it may reject some micro-ships as background noise when relying on shallow, high-resolution features.
The self-attention mechanism handles inputs consisting of multiple vectors of varying sizes that may have certain relationships among themselves; failing to exploit these relationships during training may result in poor model performance. AFL-Net [25] was designed by Qiu et al., incorporating a self-attention module into the attention multi-scale feature fusion (AMFF) module, adaptively adjusting the weights of multi-scale features, enhancing global awareness, and alleviating the false positives and missed detections caused by complex building backgrounds. A multiple attending path neural network (MAP-Net [26]) developed by Zhu et al., which addresses the problem of inaccurate edges in remote sensing image segmentation using convolutional neural networks, incorporates a spatial pooling enhancement module to capture global dependencies and continuously extract building entities, especially for large, low-texture buildings. With the aim of improving the perception of spatial information in remote sensing images, Wang et al. [27] developed a building extraction network, B-FGC-Net, based on the convolutional block attention module (CBAM), introducing a spatial attention unit, simplifying deep convolutional neural network training, automatically learning feature expressions, adaptively obtaining spatial weights for features, and emphasizing the spatial information representation of features. LFO-Net [28] is a lightweight feature optimization network that utilizes channel and spatial attention mechanisms in the feature layers to capture salient features and suppress less useful ones.

2.2. Vision Transformer

Vision Transformer [29] applies the transformer architecture to the field of computer vision, building on the substantial success of the transformer [30] in natural language processing. Essentially, a transformer is a novel encoder–decoder structure based on an attention mechanism. On this basis, researchers have proposed models such as MaskFormer [31], Mask2Former [32], and OneFormer [33]. In the field of computer vision, Vision Transformer has demonstrated excellent performance. The segmentation model presented by Yuan et al. [34] combines CNNs and transformers, integrating their features and decoding them using Swin Transformers to handle contextual and remote dependencies. Roy et al. [35] updated the standard ConvNet in MedNeXt by using mirrored transformer blocks. With limited image data, they employed a new technique for iteratively increasing the kernel size via upsampling in small-kernel networks to prevent performance saturation; compound scaling was also applied at multiple levels (depth, width, and kernel size) to improve image segmentation. Ke et al. proposed Mask Transfiner [36] for high-quality instance segmentation: by decomposing image regions and representing them as quadtrees, Mask Transfiner can predict highly accurate instance masks at a lower computational cost by processing only error-prone tree nodes and correcting them in parallel with a transformer. MPViT [37] uses a unique approach for creating multi-scale patch embeddings and multi-path structures, and Chen et al. [38] built on the idea of MPViT to enhance the segmentation effect in different scenes. These studies demonstrate that the transformer is effective for image instance segmentation, not only because of its unique characteristics for processing natural and other types of images but also because it plays an important role in segmenting remote sensing images. To enhance the performance of remote sensing image segmentation, QTPR-Net also employs the Vision Transformer approach.

3. Proposed Method

3.1. Overview

In this paper, we propose a high-resolution remote sensing image instance segmentation model based on PointRend, called QuadTransPointRend (QTPR-Net). According to Figure 1, the QTPR-Net framework is composed of two submodules: the preliminary edge feature extraction (PEFE) module and the refinement module for the edge point feature (EPFR).

3.2. Preliminary Edge Feature Extraction Module

Drawing on previous research, we selected the ResNet101 [39] network, FPN [17], and RPN [5] as our feature extraction networks (as shown in Figure 2). ResNet101 serves as the backbone that extracts basic image features, with each stage producing feature maps (C2–C5) with 256, 512, 1024, and 2048 channels and downsampling rates of 4×, 8×, 16×, and 32×, respectively, each stage halving the resolution of the previous one. Through a series of convolutional operations, the FPN combines the feature outputs of each stage via upsampling and lateral connections, standardizing each feature map to 256 channels. The resulting multi-scale maps P2–P5 correspond to 4×, 8×, 16×, and 32× downsampling, and pooling at the P6 level yields a 256 × H/64 × W/64 map to enhance feature diversity. The feature layers generated by the FPN are used for classification and regression operations and are then aligned in the ROIAlign module, ultimately providing fixed-size region proposals of 256 × 7 × 7. At this stage, we perform target detection by outputting categories and boundary regression values for the previous stage's region proposals, calculating each category's probability using Softmax, and later generating fine masks for the uncertain edge points. After extracting and transforming the features of the previous stage to obtain a feature $Z$, the category scores $s$ and the per-category probabilities are calculated as follows:
$s = W_{cls} Z + b_{cls}$
$P(c_i \mid Z) = \frac{\exp(s_i)}{\sum_{j=1}^{C} \exp(s_j)}$
where $W_{cls}$ is the weight of the classification layer, $b_{cls}$ is the bias of the classification layer, and $P(c_i \mid Z)$ represents the probability that the candidate region belongs to category $i$.
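As a concrete illustration of this classification step, a minimal PyTorch-style sketch is given below; the tensor shapes and the function name are illustrative assumptions rather than details of our implementation.

import torch

def classify_regions(z: torch.Tensor, w_cls: torch.Tensor, b_cls: torch.Tensor) -> torch.Tensor:
    """Illustrative classification head: raw category scores followed by Softmax.
    z: (N, D) pooled region features; w_cls: (C, D) weights; b_cls: (C,) bias.
    Returns (N, C) per-category probabilities."""
    s = z @ w_cls.T + b_cls          # category scores s = W_cls Z + b_cls
    return torch.softmax(s, dim=1)   # probabilities P(c_i | Z)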
Afterward, in the regression layer, linear regression is used to predict the bounding box regression function. The boundary regression prediction parameters $t$ (comprising $\Delta x$, $\Delta y$, $\Delta w$, and $\Delta h$, the offsets of the box center co-ordinates and of its width and height) are used to adjust the bounding boxes of the candidate regions, where $W_{reg}$ and $b_{reg}$ are the weights and biases of the regression layer, respectively. The calculation method is as follows:
$t = W_{reg} Z + b_{reg}$
We take the candidate regions $(x, y, w, h)$ generated in the RPN stage and combine them with the predicted regression parameters $t$ to obtain the final accurate bounding box $(x', y', w', h')$:
$(x', y') = (x + w \Delta x,\ y + h \Delta y)$
$(w', h') = (w \cdot \exp(\Delta w),\ h \cdot \exp(\Delta h))$
Finally, we obtain the center co-ordinates $(x', y')$, width $w'$, and height $h'$ of the refined bounding box. Overall, in the preliminary edge feature processing module, we achieve the efficient detection and segmentation of targets through multi-level and multi-scale feature extraction and fusion.
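For completeness, the box decoding above can be written as a short PyTorch-style sketch; the function name and tensor layout are illustrative and not taken from our code.

import torch

def decode_boxes(proposals: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Apply the predicted regression offsets t = (dx, dy, dw, dh) to RPN proposals.
    proposals: (N, 4) tensor of (x, y, w, h) candidate regions.
    deltas:    (N, 4) tensor of predicted offsets.
    Returns the (N, 4) refined boxes (x', y', w', h')."""
    x, y, w, h = proposals.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    x_new = x + w * dx                 # shift the center proportionally to box size
    y_new = y + h * dy
    w_new = w * torch.exp(dw)          # scale width/height in log-space (stays positive)
    h_new = h * torch.exp(dh)
    return torch.stack([x_new, y_new, w_new, h_new], dim=1)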

3.3. Uncertain Edge Point Selection Strategy

Since remote sensing image instances are small and dense, the quality of edge segmentation has a significant impact on the effectiveness of target instance segmentation in high-resolution remote sensing image analysis. Many solutions have been proposed for edge segmentation, such as multi-scale feature fusion, feature decoupling, and box strategies that improve object detection and feature fusion; the core focus is on predicting the features of edge points. We therefore adopt PointRend's rendering approach to flexibly select points on the 2D plane of remote sensing images and predict their segmentation labels in an iterative, coarse-to-fine process.
QTPR-Net represents the output of image instance segmentation as a regular grid of labels, with target instances encoded on a gridded feature map. One or more C-channel feature maps produced by the CNN are further processed to output predictions for K class labels on regular grids of different resolutions. The selection of uncertain edge points is crucial for subsequent mask refinement: as many points as possible are taken from high-frequency areas, with the initial granularity $M_0$ determining the coarse-grained prediction. After obtaining a preliminary coarse-grained mask prediction, the mask's resolution is gradually enhanced through recursive refinement steps, with the model progressively focusing on the areas of highest uncertainty and processing $M_i$ uncertain points at each step $i$. In QTPR-Net, we tested different values to find the most suitable $M_0$ for high-resolution remote sensing images, considering computational resources and processing speed. We chose $M_0 = 8$; with bilinear upsampling, the mask resolution reaches $M \times M = 512 \times 512$ after six refinement steps.
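The selection strategy can be sketched as follows, using the PointRend convention that uncertainty is highest where the foreground probability is closest to 0.5 (i.e., where the logit magnitude is smallest); the helper name and the number of points per step are illustrative assumptions rather than the exact values used in QTPR-Net.

import torch
import torch.nn.functional as F

def select_uncertain_points(mask_logits: torch.Tensor, num_points: int) -> torch.Tensor:
    """Return the flat indices of the num_points most uncertain grid positions.
    mask_logits: (N, 1, H, W) coarse per-pixel mask logits; a small |logit| means
    the prediction is close to 0.5 and therefore uncertain."""
    n, _, h, w = mask_logits.shape
    uncertainty = -mask_logits.abs().view(n, h * w)   # larger value = less certain
    return uncertainty.topk(num_points, dim=1).indices

# Schematic refinement schedule: starting from an M0 x M0 = 8 x 8 coarse mask,
# six bilinear upsampling steps reach 512 x 512; at each step a fresh set of
# uncertain points is selected and re-predicted by the point head.
logits = torch.randn(2, 1, 8, 8)                      # dummy coarse prediction
for step in range(6):
    logits = F.interpolate(logits, scale_factor=2, mode="bilinear", align_corners=False)
    idx = select_uncertain_points(logits, num_points=256)   # 256 points is illustrative
    # ... point-wise re-prediction at idx would go here ...
print(logits.shape)                                   # torch.Size([2, 1, 512, 512])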

3.4. Edge Point Feature Refinement Module

Small targets of the same category are often densely packed in high-resolution remote sensing images. By assigning different weights to coarse-level features, we can amplify the influence of contextual features on instance segmentation, achieving a higher level of precision in target segmentation. In the EPFR module (as shown in Figure 3), a rough mask prediction $M_{coarse}$ is produced as well as a fine-grained segmentation mask prediction $M_{fine}$ for the uncertain edges. By combining both, we obtain the final segmentation mask $M$:
$M = M_{coarse} + M_{fine}$
We designed an encoder–decoder structure with a multi-level quadtree attention [18] mechanism, inspired by Vision Transformer, to process image regions of different resolutions, further refining the extracted coarse-level feature maps. We call this structure TransQTA. Each encoder layer in TransQTA consists of a self-attention layer, a feedforward neural network, a normalization layer, and positional encoding fusion. In the self-attention layer, quadtree attention projects the image feature tensor $X$ through the following equations to obtain the queries, keys, and values $Q$, $K$, and $V$:
$Q = W_q X$
$K = W_k X$
$V = W_v X$
where $W_q$, $W_k$, and $W_v$ are learnable parameters. The self-attention scores are then obtained by applying the Softmax function to the dot product of the queries and keys, computing the dot product between $Q$ and $K$ and adjusting it by a scaling factor:
$\mathrm{AttentionScore}(A) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)$
where $d_k$ is the dimension of each head, i.e., the embedding channel dimension, and $\sqrt{d_k}$ is the scaling factor, which helps to avoid overly large values before the Softmax operation. Then, based on the obtained attention scores $A$, the value vectors are weighted to produce the final result:
QuadtreeAttention = einsum("nlsh, nshd → nlhd", A, V)
where n represents batch size, l denotes the number of query positions, s denotes the number of key positions, h represents the number of heads, and d represents the dimension of each head. Here, the tensor operation einsum is used to perform batch matrix multiplication, combining the attention scores and values.
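A minimal sketch of this computation, written with einsum in the same index convention, is shown below; it is a plain multi-head formulation, and the top-k sparsification that distinguishes quadtree attention (discussed next) is omitted for brevity. The shapes and helper name are assumptions for illustration.

import torch

def attention_with_einsum(x: torch.Tensor, wq: torch.Tensor, wk: torch.Tensor,
                          wv: torch.Tensor, num_heads: int) -> torch.Tensor:
    """x: (n, l, c) token features; wq/wk/wv: (c, c) learnable projections."""
    n, l, c = x.shape
    d = c // num_heads
    # Projections, written as right-multiplications for the (tokens, channels) layout.
    q = (x @ wq).view(n, l, num_heads, d)
    k = (x @ wk).view(n, l, num_heads, d)
    v = (x @ wv).view(n, l, num_heads, d)
    # Attention scores: Softmax of the scaled dot product over the key positions s.
    scores = torch.einsum("nlhd,nshd->nlsh", q, k) / d ** 0.5
    attn = scores.softmax(dim=2)
    # Weighted sum of the values: einsum("nlsh, nshd -> nlhd", A, V).
    out = torch.einsum("nlsh,nshd->nlhd", attn, v)
    return out.reshape(n, l, c)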
TransQTA also performs multi-scale processing, calculating attention and passing messages at different scales and using different top-k values for downsampling and local enhancement. With top-k selection, only the k highest-scoring regions at each level are considered, reducing image noise by focusing on the most relevant features. Setting appropriate top-k values also reduces computational costs, and suitable top-k values at different levels provide more extensive information and refine the most important features of high-resolution remote sensing images.
In the feature processing flow of the EPFR module, feature maps are first flattened into one-dimensional vectors to facilitate subsequent grid rendering, and a ReLU activation adds nonlinear representation capability, resulting in a feature map $F$ that captures contextual information. The TransQTA structure then captures further contextual information through layer-by-layer feature transformations, extracting more advanced feature information. A feature $F_{qta}$ is obtained in the self-attention part and is processed by the feed-forward neural network into $F_{trans}$, which has the same size and shape as $F_{qta}$. Convolutional fusion layers are then used to integrate the feature information and create the coarse segmentation mask $M_{coarse}$:
$F_{qta} = \mathrm{LayerNorm}(F + \mathrm{QuadtreeAttention}(Q, K, V))$
$F_{trans} = \mathrm{LayerNorm}(F_{qta} + (\mathrm{ReLU}(W_1 F_{qta} + b_1) W_2 + b_2))$
$M_{coarse} = \mathrm{Sigmoid}(\mathrm{Conv}(F_{trans}, W_{mask}, b_{mask}))$
where $W$ and $b$ are the weight matrix and bias parameters of each layer, respectively.
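A minimal sketch of one such encoder layer is given below; the channel and hidden sizes are illustrative, and standard multi-head attention stands in for the quadtree attention module.

import torch
import torch.nn as nn

class TransQTALayerSketch(nn.Module):
    """Attention with a residual connection and LayerNorm, a two-layer ReLU
    feed-forward block with a second residual and LayerNorm, and a 1x1
    convolution + Sigmoid producing the coarse mask."""

    def __init__(self, dim: int = 256, heads: int = 8, hidden: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.mask_head = nn.Conv1d(dim, 1, kernel_size=1)

    def forward(self, f):
        # f: (n, tokens, dim) flattened feature map F.
        f_qta = self.norm1(f + self.attn(f, f, f, need_weights=False)[0])
        f_trans = self.norm2(f_qta + self.ffn(f_qta))
        m_coarse = torch.sigmoid(self.mask_head(f_trans.transpose(1, 2)))
        return f_trans, m_coarse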
When processing fine-grained features, we use bilinear interpolation to combine the basic features obtained in the first stage, the coarse masks from the previous stage, and the uncertain edge points $G$, obtaining fine-grained point features through a series of feature transformations and mask predictions. The fine-grained segmentation mask $M_{fine}$ is then obtained by the prediction layer:
$M_{fine} = W_{pred} \cdot G + b_{pred}$
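In its simplest reading, this prediction layer is a point-wise linear map over the gathered point features; the sketch below makes this concrete, with the feature dimension chosen only for illustration.

import torch
import torch.nn as nn

# Point-wise prediction layer applied to the per-point features G gathered at
# the uncertain edge points (feature dimension 256 is an assumption).
point_pred = nn.Linear(256, 1)
g = torch.randn(2, 1024, 256)        # (batch, sampled points, channels)
m_fine = point_pred(g)               # per-point fine-grained mask prediction
print(m_fine.shape)                  # torch.Size([2, 1024, 1])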

4. Experiments

The following sections provide an overview of the specific experimental details, the datasets used, the evaluation metrics used, and the results obtained.

4.1. Dataset

Considering the diversity and representativeness of remote sensing image data sources and the challenge and complexity of remote sensing scenes, we selected the NWPU VHR-10 dataset [19], the SSDD dataset [20], and the iSAID dataset [21] as the experimental datasets. These three public datasets allow us to comprehensively validate the performance of the remote sensing image instance segmentation model, ensuring the model's effectiveness and generalizability. We also converted the three datasets into the COCO format for ease of experimental verification, including training and validation sets with original images and JSON label files.
NWPU VHR-10 Dataset: Northwestern Polytechnical University released a 10-class geographical remote sensing dataset called NWPU VHR-10 for spatial object detection. This dataset contains 800 high-resolution satellite images derived from Google Earth and the Vaihingen dataset, annotated by experts. The dataset covers the following categories: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track and field, harbor, bridge, and vehicle. For our experiments, we set 30 k iterations and a learning rate of 0.02.
SSDD Dataset: The SSDD dataset is the first publicly available dataset widely used for research on deep learning-based synthetic aperture radar (SAR) image ship-detection technologies, produced by the Department of Electronics and Information Engineering at the Naval Aeronautical and Astronautical University. It contains 1160 images and 2456 ships. Although it has fewer images, the only category is ship, so this is sufficient to train detection models. For this experiment, we set 20 k iterations and a learning rate of 0.02.
iSAID Dataset: The iSAID dataset is a new open benchmark dataset for multi-class instance segmentation in remote sensing images. iSAID includes 15 categories with 655,451 individual instances marked separately, with up to 8000 instances in a single image and image resolutions ranging from 800 to 13,000, making it the first large-scale instance segmentation dataset in the remote sensing field. Considering the experimental environment and model training speed, based on prior public work, each image in this dataset was segmented into blocks of 800 × 800 pixels with a stride of 100 for fair benchmarking against existing methods. Detailed comparative experiments and ablation studies were conducted with 200 k iterations at a learning rate of 0.01.

4.2. Experimental Setup

Due to significant differences in resolution and quantity across the three datasets, we set different numbers of iterations and training batch sizes to obtain the best segmentation results. In the initial 5% of training iterations, we started at 0.1% of the base learning rate and gradually reached the base rate via linear growth. Furthermore, in the QTPR-Net tuning strategy, we kept the initial learning rate for the first half of the iteration count and decreased it at the halfway point and the three-fifths mark. To prevent overfitting, we also set a weight decay coefficient of 0.0001 and normalization parameters. Our experiments were conducted on an NVIDIA GeForce RTX 4090 using PyTorch 2.0.0 and CUDA 11.8, with ResNet101-FPN as the baseline and the number of GPU data-loading workers set to 4.
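The warmup and decay schedule described above can be sketched as a simple function of the iteration index; the decay factor of 0.1 is an assumption, since the exact factor is not stated here.

def learning_rate(step: int, max_steps: int, base_lr: float) -> float:
    """Linear warmup from 0.1% of the base rate over the first 5% of iterations,
    then step decay at the 1/2 and 3/5 points of training (factor 0.1 assumed)."""
    warmup_steps = int(0.05 * max_steps)
    if step < warmup_steps:
        start = 0.001 * base_lr
        return start + (base_lr - start) * step / warmup_steps
    lr = base_lr
    if step >= int(0.5 * max_steps):
        lr *= 0.1
    if step >= int(0.6 * max_steps):
        lr *= 0.1
    return lr

# Example with the NWPU VHR-10 settings (30 k iterations, base learning rate 0.02):
# learning_rate(0, 30000, 0.02) -> 2e-05, learning_rate(20000, 30000, 0.02) -> 0.0002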

4.3. Evaluation Metrics

In order to assess the performance of deep learning instance segmentation methods, the following evaluation metrics are typically used:
Recall: This reflects the proportion of samples that are correctly predicted to be positive among all samples that are actually positive.
$\mathrm{Recall} = \frac{TP}{TP + FN}$
Precision: This reflects the proportion of samples that are actually positive among all samples predicted to be positive.
$\mathrm{Precision} = \frac{TP}{TP + FP}$
Accuracy: The proportion of correctly classified samples to the total number of samples.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Here, $TP$ (true positive) is the number of positive samples predicted as positive; $FN$ (false negative) is the number of positive samples predicted as negative; $FP$ (false positive) is the number of negative samples predicted as positive; and $TN$ (true negative) is the number of negative samples predicted as negative. In this experiment, we used average precision ($AP$) as our primary evaluation criterion:
$AP = \int_0^1 P(R)\, dR$
The $AP$ values in the MS COCO format are obtained by averaging across 10 IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. In addition to $AP$, the MS COCO evaluation metrics also include the average precision at a single threshold, such as $AP_{50}$ (IoU threshold of 0.5) and $AP_{75}$ (IoU threshold of 0.75). $AP_S$, $AP_M$, and $AP_L$ represent the mean average precision for instances with areas smaller than 32 × 32, between 32 × 32 and 96 × 96, and larger than 96 × 96, respectively. We report the $AP_{bbox}$ values for object detection and the detailed $AP_{segm}$ values for instance segmentation across the different datasets. In addition to precision metrics, we also evaluate the model's efficiency using memory usage (Memory), measuring the model's complexity and storage requirements.
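As a concrete illustration of the AP integral, the metric can be approximated numerically from a precision-recall curve as shown below; in practice we rely on the standard COCO evaluation tools, so this sketch is only meant to make the definition tangible.

import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Approximate AP as the area under the precision-recall curve.
    Inputs are assumed sorted by increasing recall, e.g. obtained by sweeping
    the detection score threshold; COCO additionally averages this quantity
    over ten IoU thresholds from 0.5 to 0.95."""
    # Make precision monotonically non-increasing before integrating,
    # as is standard for PR-curve interpolation.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum(np.diff(recall) * precision))

# Example: a detector with precision 1.0 at every recall level gives AP = 1.0.
print(average_precision(np.ones(5), np.linspace(0.2, 1.0, 5)))   # -> 1.0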

4.4. Comparative Experiments

4.4.1. Results on NWPU VHR-10

A comparison of QTPR-Net with other advanced methods is given in Table 1. QTPR-Net achieved the highest AP_segm value of 69.1944 on the NWPU VHR-10 dataset, outperforming the other models. QTPR-Net performed best in edge recognition, and the uncertain edge point handling module clearly improved performance relative to the base model, PointRend. Table 2 shows the per-class segmentation results of QTPR-Net on the NWPU VHR-10 dataset. QTPR-Net outperformed the other models on small targets such as ships, harbors, and vehicles, showing that it is effective at instance-segmenting small targets. Our model also achieved the best results without occupying the most memory on this dataset.

4.4.2. Results Using SSDD

Table 3 presents the object detection and instance segmentation results of QTPR-Net on the SSDD dataset. Since the SSDD dataset has only one category, ship, with many small targets, QTPR-Net achieved the best results in target edge segmentation and small-target segmentation in terms of AP_segm, AP_50, and AP_M. With QTPR-Net, the handling of uncertain edge points made a significant difference relative to the base model. We also found that QTPR-Net took up less memory than Mask Transfiner but performed better, indicating a good balance. Taking Figure 4 as an example, we visualize the segmentation performance of the QTPR-Net model.

4.4.3. Results Using iSAID

For the iSAID dataset, Table 4 shows the object detection and instance segmentation results of QTPR-Net. Unlike the first two public datasets, iSAID requires more memory, and all models require more computation to handle it. As shown in Table 4, QTPR-Net did not occupy the highest amount of memory, far less than Mask R-CNN and CondInst, and its memory usage did not greatly exceed that of PointRend and Mask Transfiner, yet it achieved the best segmentation results, with an instance segmentation AP_segm of 37.07. These results indicate that QTPR-Net has advantages both in memory usage and in segmentation. In Table 5, we provide the instance segmentation results for the 15 categories in the iSAID dataset, showing that QTPR-Net performed well on small targets such as ships, storage tanks, swimming pools, and harbors.
As an example, the left side of Figure 5 illustrates the changes in the loss_box_reg, loss_cls, loss_mask, and loss_mask_point metrics during the instance segmentation process of our model on the NWPU VHR-10 dataset. These, respectively, represent the bounding box regression loss, classification loss, mask loss, and point loss. The right side of Figure 5 displays the model's recall rate and mask accuracy. The model's recall rate reached as high as 97.83%, and the accuracy of mask segmentation reached 97.27%. The curves indicate that the model's convergence is very stable, with excellent results in category prediction and very close agreement between the overall boundary and mask predictions, indicating an effective way of handling uncertain edge points.

4.5. Ablation Experiments

4.5.1. Ablation Experiments on TransQTA Module

In the TransQTA module, we designed ablation studies with different attention mechanisms. To test the effectiveness of our TransQTA module, we compared the base model (without the TransQTA module), the QuadtreeAttention mechanism without the transformer structure (abbreviated as QTA), the multi-head attention mechanism with the transformer structure, and the QuadtreeAttention mechanism with the transformer structure (i.e., TransQTA).
Based on the ablation data across all three datasets, our TransQTA module achieved the best segmentation results on the NWPU VHR-10 dataset, as shown in Table 6, despite occupying the largest amount of memory; this demonstrates that the module significantly improves segmentation performance on small targets. On SSDD, TransQTA also achieved the best results, as shown in Table 7, without consuming the most memory, and it performed well across all metrics, which confirms the importance of QuadtreeAttention for small-target edge segmentation. On the iSAID dataset, TransQTA successfully balanced memory usage and segmentation quality, as shown in Table 8, with the best results again achieved by using QuadtreeAttention.
Taking the SSDD dataset's mask point loss and accuracy curves as an example, as shown in Figure 6, our TransQTA module achieved the lowest loss value and a relatively high accuracy rate (95.70%) among the compared modules. Compared with the base model, the compared modules all reached a lower mask loss and a higher final accuracy, which further proves the effectiveness of the QTPR-Net design.

4.5.2. Ablation Experiments of TransQTA Cascaded Structure

The transformer’s cascaded structure is shown in Figure 7. Across the three datasets, the cascaded ablation study results (as shown in Table 9, Table 10 and Table 11) for the transformer structure indicate that the best results are achieved with two layers. As a result, we set the transformer’s cascaded structure in the feature processing part of uncertain edge points to two layers, which satisfies the balance between memory consumption and segmentation quality.
Taking the SSDD dataset as an example, the table data and Figure 8 indicate that, in this part of the ablation study, the TransQTA two-layer cascaded structure achieved the lowest point loss and the highest mask and point accuracy, resulting in the highest segmentation precision and demonstrating the effectiveness of the two-layer EncoderLayer structure.
In general, QTPR-Net met its design intentions and achieved satisfactory results on three public datasets, but some shortcomings remain. Because our model is implemented on the Detectron2 framework, we used default parameters for some basic settings (such as normalization), which are tuned for instance segmentation on natural images; more experiments are needed to determine whether these settings are optimal for high-resolution remote sensing images. Additionally, in selecting the datasets, we chose the three most widely used public remote sensing image datasets, but we have not yet verified other scene datasets, such as the frequently used WHU Building dataset, the Potsdam dataset, and other architectural-scene remote sensing image datasets; we will therefore also conduct instance segmentation research on these types of datasets in the future. Last but not least, although QTPR-Net achieved the best overall results in target detection and instance segmentation, the segmentation results for some large target categories, such as basketball courts, baseball fields, and ground track fields, were suboptimal. This is partly because of the inter-class similarity among these categories and may also be due to large targets being split apart after the images are tiled; other models also do not segment these few categories particularly well, and more edge feature information is required to improve their segmentation. Future research will incorporate more contextual information as well as target edges to improve the segmentation of such categories.

5. Conclusions

In this paper, we proposed a high-resolution remote sensing image instance segmentation model, QTPR-Net, and tested its effectiveness on the NWPU VHR-10, SSDD, and iSAID datasets, achieving instance segmentation accuracies of 69.1944, 71.5251, and 37.0704, respectively. The network consists mainly of a preliminary edge feature extraction module and an edge point feature refinement module, for which we designed an uncertain point selection strategy that selects higher-quality edge points in high-resolution remote sensing images and improves the segmentation of edge features. The edge point feature refinement module in QTPR-Net verified the effectiveness of the quadtree attention mechanism in edge segmentation, employing a coarse-to-fine pyramid approach to enhance the attention paid to uncertain points and incorporating multi-scale positional encoding to improve efficiency and reduce loss. Additionally, QTPR-Net does not occupy much memory in a non-distributed environment, which enables us to balance the complexity of the model with its segmentation ability. In future research, we will study whether some basic parameter settings are optimal, improve the generalizability of the model to other scenes from high-resolution remote sensing image datasets, and take into account the global contextual information of target instances to fill in any gaps in the model.

Author Contributions

This work was conducted in collaboration with all authors. Conceptualization, H.Y. and X.Z.; methodology, X.Z.; validation, X.Z.; formal analysis, X.Z.; investigation, J.S.; resources, H.H.; data curation, J.S. and H.H.; writing—original draft preparation, X.Z.; writing—review and editing, H.Y. and X.Z.; visualization, J.S. and H.H.; supervision, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hainan Province Science and Technology Special Fund under Grant ZDYF2022GXJS228, in part by Haikou Science and Technology Plan Project under Grant 2022-007 and Grant 2022-015.

Data Availability Statement

We used the publicly available datasets NWPU VHR-10, SSDD, and iSAID. The NWPU VHR-10 dataset can be accessed at https://gcheng-nwpu.github.io/##Datasets on 18 September 2024, the SSDD dataset can be accessed at https://github.com/TianwenZhang0825/Official-SSDD/blob/main/README.md on 18 September 2024, and the iSAID dataset can be accessed at https://captain-whu.github.io/iSAID on 18 September 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, X.; Chen, X.; Lu, X.; Sun, B. Unsupervised Change Detection by Cross-Resolution Difference Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5606616. [Google Scholar] [CrossRef]
  2. Liu, J.; Yang, D.; Hu, F. Multiscale Object Detection in Remote Sensing Images Combined with Multi-Receptive-Field Features and Relation-Connected Attention. Remote Sens. 2022, 14, 427. [Google Scholar] [CrossRef]
  3. Chen, D.; Ma, A.; Zheng, Z.; Zhong, Y. Large-Scale Agricultural Greenhouse Extraction for Remote Sensing Imagery Based on Layout Attention Network: A Case Study of China. ISPRS J. Photogramm. Remote Sens. 2023, 200, 73–88. [Google Scholar] [CrossRef]
  4. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  6. Peng, S.; Jiang, W.; Pi, H.; Li, X.; Bao, H.; Zhou, X. Deep Snake for Real-Time Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8530–8539. [Google Scholar] [CrossRef]
  7. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. SOLO: A Simple Framework for Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8587–8601. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152. [Google Scholar]
  9. Su, H.; Wei, S.; Liu, S.; Liang, J.; Wang, C.; Shi, J.; Zhang, X. HQ-ISNet: High-Quality Instance Segmentation for Remote Sensing Imagery. Remote Sens. 2020, 12, 989. [Google Scholar] [CrossRef]
  10. Li, Q.; Mou, L.; Hua, Y.; Sun, Y.; Jin, P.; Shi, Y.; Zhu, X.X. Instance Segmentation of Buildings Using Keypoints. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 1452–1455. [Google Scholar] [CrossRef]
  11. Chen, S.; Ogawa, Y.; Zhao, C.; Sekimoto, Y. Enhanced Large-Scale Building Extraction Evaluation: Developing a Two-Level Framework Using Proxy Data and Building Matching. Eur. J. Remote Sens. 2024, 57, 2374844. [Google Scholar] [CrossRef]
  12. Chen, X.; Lian, Y.; Jiao, L.; Wang, H.; Gao, Y.; Lingling, S. Supervised Edge Attention Network for Accurate Image Instance Segmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 617–631. [Google Scholar] [CrossRef]
  13. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. PointRend: Image Segmentation As Rendering. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9796–9805. [Google Scholar] [CrossRef]
  14. Cao, X.; Zou, H.; Li, J.; Ying, X.; He, S. OBBInst: Remote Sensing Instance Segmentation with Oriented Bounding Box Supervision. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103717. [Google Scholar] [CrossRef]
  15. Chen, Z.; Liu, T.; Xu, X.; Leng, J.; Chen, Z. DCTC: Fast and Accurate Contour-Based Instance Segmentation With DCT Encoding for High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8697–8709. Available online: https://ieeexplore.ieee.org/document/10495157 (accessed on 1 April 2024). [CrossRef]
  16. Chen, E.; Li, M.; Zhang, Q.; Chen, M. Query-Based Cascade Instance Segmentation Network for Remote Sensing Image Processing. Appl. Sci. 2023, 13, 9704. [Google Scholar] [CrossRef]
  17. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  18. Tang, S.; Zhang, J.; Zhu, S.; Tan, P. QuadTree Attention for Vision Transformers. arXiv 2022, arXiv:2201.02767. [Google Scholar]
  19. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-Class Geospatial Object Detection and Geographic Image Classification Based on Collection of Part Detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  20. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR Ship Detection Dataset (SSDD): Official Release and Comprehensive Data Analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  21. Zamir, S.W.; Arora, A.; Gupta, A.; Khan, S.H.; Sun, G.; Khan, F.S.; Zhu, F.; Shao, L.; Xia, G.; Bai, X. iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. arXiv 2019, arXiv:1905.12886. [Google Scholar]
  22. Gao, F.; Huo, Y.; Wang, J.; Hussain, A.; Zhou, H. Anchor-Free SAR Ship Instance Segmentation with Centroid-Distance Based Loss. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11352–11371. [Google Scholar] [CrossRef]
  23. Sun, Z.; Meng, C.; Cheng, J.; Zhang, Z.; Chang, S. A Multi-Scale Feature Pyramid Network for Detection and Instance Segmentation of Marine Ships in SAR Images. Remote Sens. 2022, 14, 6312. [Google Scholar] [CrossRef]
  24. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  25. Qiu, Y.; Wu, F.; Qian, H.; Zhai, R.; Gong, X.; Yin, J.; Liu, C.; Wang, A. AFL-Net: Attentional Feature Learning Network for Building Extraction from Remote Sensing Images. Remote Sens. 2023, 15, 95. [Google Scholar] [CrossRef]
  26. Zhu, Q.; Liao, C.; Hu, H.; Mei, X.; Li, H. MAP-Net: Multiple Attending Path Neural Network for Building Footprint Extraction From Remote Sensed Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6169–6181. [Google Scholar] [CrossRef]
  27. Wang, Y.; Zeng, X.; Liao, X.; Zhuang, D. B-FGC-Net: A Building Extraction Network from High Resolution Remote Sensing Imagery. Remote Sens. 2022, 14, 269. [Google Scholar] [CrossRef]
  28. Zhang, X.; Wang, H.; Xu, C.; Lv, Y.; Fu, C.; Xiao, H.; He, Y. A Lightweight Feature Optimizing Network for Ship Detection in SAR Image. IEEE Access 2019, 7, 141662–141678. [Google Scholar] [CrossRef]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  31. Cheng, B.; Schwing, A.; Kirillov, A. Per-Pixel Classification Is Not All You Need for Semantic Segmentation. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 17864–17875. [Google Scholar]
  32. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. [Google Scholar] [CrossRef]
  33. Jain, J.; Li, J.; Chiu, M.; Hassani, A.; Orlov, N.; Shi, H. OneFormer: One Transformer to Rule Universal Image Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2989–2998. [Google Scholar] [CrossRef]
  34. Yuan, F.; Zhang, Z.; Fang, Z. An Effective CNN and Transformer Complementary Network for Medical Image Segmentation. Pattern Recognit. 2023, 136, 109228. [Google Scholar] [CrossRef]
  35. Roy, S.; Koehler, G.; Ulrich, C.; Baumgartner, M.; Petersen, J.; Isensee, F.; Jäger, P.F.; Maier-Hein, K.H. MedNeXt: Transformer-Driven Scaling of ConvNets for Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2023: 26th International Conference, Vancouver, BC, Canada, 8–12 October 2023; pp. 405–415. [Google Scholar] [CrossRef]
  36. Ke, L.; Danelljan, M.; Li, X.; Tai, Y.W.; Tang, C.K.; Yu, F. Mask Transfiner for High-Quality Instance Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4402–4411. [Google Scholar] [CrossRef]
  37. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. MPViT: Multi-Path Vision Transformer for Dense Prediction. arXiv 2021, arXiv:2112.11010. [Google Scholar]
  38. Chen, S.; Ogawa, Y.; Zhao, C.; Sekimoto, Y. Large-Scale Individual Building Extraction from Open-Source Satellite Imagery via Super-Resolution-Based Instance Segmentation Approach. ISPRS J. Photogramm. Remote Sens. 2023, 195, 129–152. [Google Scholar] [CrossRef]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  40. Tian, Z.; Shen, C.; Chen, H. Conditional Convolutions for Instance Segmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 282–298. [Google Scholar] [CrossRef]
  41. Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8570–8578. [Google Scholar] [CrossRef]
  42. Tian, Z.; Shen, C.; Wang, X.; Chen, H. BoxInst: High-Performance Instance Segmentation with Box Annotations. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 5439–5448. [Google Scholar] [CrossRef]
Figure 1. QTPR-Net framework.
Figure 2. The structure of the preliminary edge feature extraction module. (a) Shows the backbone network of the model; (b) Shows the RPN structure of the model; (c) Shows the BoxHead structure of the model.
Figure 3. The structure of the edge point feature refinement module.
Figure 4. Visualization of SSDD dataset.
Figure 5. Comparison chart of various indicators of the model for the NWPU VHR-10 dataset (on the left is a line graph of loss indicators, and on the right is a line graph of recall and accuracy indicators).
Figure 6. Comparison chart of various indicators of different structures for the SSDD dataset (on the left is a line graph of mask loss values, and on the right is a line graph of accuracy values).
Figure 7. The structure of TransQTA.
Figure 8. Comparison chart of the various indicators of different layers for the SSDD dataset (on the left is a line graph of the point loss values, and on the right is a line graph of the accuracy values).
Table 1. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the NWPU VHR-10 dataset.
Method | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Mask R-CNN [4] | 8509 | 70.6091 | 67.4047 | 93.2067 | 75.5896 | 57.7106 | 65.3746 | 75.0193
CondInst [40] | 8258 | 68.8562 | 64.3168 | 91.1648 | 66.6747 | 53.7024 | 62.4261 | 71.5836
BlendMask [41] | 3509 | 63.9361 | 60.8243 | 88.8132 | 63.5941 | 48.0288 | 59.5018 | 64.2823
PointRend [13] | 8922 | 68.7919 | 67.7898 | 91.8704 | 72.6399 | 60.0279 | 65.9937 | 75.1733
BoxInst [42] | 7320 | 66.4603 | 50.2457 | 81.0751 | 52.2650 | 38.6690 | 46.5827 | 57.2665
Mask Transfiner [36] | 15,868 | 69.7329 | 67.3434 | 91.7960 | 75.1191 | 55.8363 | 65.8000 | 75.0204
QTPR-Net | 10,376 | 70.5119 | 69.1944 | 93.0919 | 75.7410 | 53.9180 | 67.5924 | 76.4796
Table 2. Comparison results of category instance segmentation in the NWPU VHR-10 dataset. The abbreviations for the classes are AI: airplane, SH: ship, ST: tank, BD: baseball field, TC: tennis court, BC: basketball court, GT: ground track and field, HA: port, BR: bridge, and VE: vehicle.
Method | AI | SH | ST | BD | TC | BC | GT | HA | BR | VE
Mask R-CNN [4] | 51.523 | 59.181 | 84.496 | 82.353 | 72.799 | 76.122 | 95.502 | 54.244 | 41.62 | 56.207
CondInst [40] | 42.917 | 53.106 | 84.74 | 82.641 | 66.744 | 79.746 | 95.915 | 52.058 | 31.163 | 54.137
BlendMask [41] | 48.197 | 54.025 | 82.959 | 80.785 | 62.859 | 71.316 | 90.814 | 41.366 | 28.428 | 47.494
PointRend [13] | 51.898 | 60.735 | 88.436 | 84.808 | 71.152 | 35.623 | 97.386 | 55.063 | 5.623 | 57.206
BoxInst [42] | 17.029 | 48.981 | 81.02 | 78.874 | 66.322 | 65.318 | 92.951 | 5.591 | 6.694 | 39.677
Mask Transfiner [36] | 53.178 | 61.218 | 85.857 | 85.953 | 68.461 | 78.373 | 91.689 | 53.325 | 39.107 | 56.271
QTPR-Net | 52.389 | 61.561 | 80.879 | 83.850 | 72.714 | 82.950 | 97.904 | 56.331 | 40.960 | 57.604
Table 3. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the SSDD dataset.
Method | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Mask R-CNN [4] | 6916 | 72.1844 | 69.9374 | 95.5500 | 87.0571 | 67.9297 | 76.2323 | 46.1304
CondInst [40] | 7637 | 72.3112 | 69.3922 | 95.7354 | 85.8785 | 67.4935 | 75.8190 | 53.3663
BlendMask [41] | 2138 | 69.4160 | 67.4937 | 95.4400 | 84.7919 | 67.1617 | 70.0460 | 48.5545
PointRend [13] | 8340 | 71.2352 | 70.4962 | 96.2620 | 87.6452 | 69.1656 | 75.6742 | 48.0363
Mask Transfiner [36] | 13,127 | 72.3341 | 70.1995 | 95.5667 | 85.5286 | 68.9570 | 74.9495 | 41.0891
QTPR-Net | 8509 | 72.5246 | 71.5251 | 96.5326 | 89.6856 | 69.6230 | 78.6076 | 55.5446
Table 4. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the iSAID dataset.
Method | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Mask R-CNN [4] | 24,444 | 41.3018 | 34.2557 | 58.5570 | 34.8988 | 18.7573 | 41.8256 | 52.8351
CondInst [40] | 27,298 | 40.9067 | 32.9986 | 58.8544 | 32.6306 | 16.8397 | 41.6669 | 51.8629
BlendMask [41] | 12,528 | 41.1042 | 33.7449 | 59.0631 | 33.9347 | 18.7465 | 41.7782 | 49.8384
PointRend [13] | 14,673 | 38.9323 | 34.3458 | 57.9447 | 35.7689 | 18.8701 | 41.8448 | 49.2744
Mask Transfiner [36] | 16,318 | 41.2341 | 34.9860 | 59.0648 | 36.0690 | 19.1850 | 41.6859 | 52.2770
QTPR-Net | 17,993 | 42.4565 | 37.0704 | 60.9745 | 38.9419 | 22.4383 | 44.6925 | 54.7648
Table 5. Comparison results of category instance segmentation using the iSAID dataset. The abbreviations for the classes are: SH: Ship, ST: Storage Tank, BD: Baseball Diamond, TC: Tennis Court, BC: Basketball Court, GT: Ground Track Field, BR: Bridge, LV: Large Vehicle, SV: Small Vehicle, HE: Helicopter, SP: Swimming Pool, RO: Roundabout, SB: Soccerball Field, PL: Plane, and HA: Harbor.
Method | SH | ST | BD | TC | BC | GT | BR | LV | SV | HE | SP | RO | SB | PL | HA
Mask R-CNN [4] | 37.22 | 37.22 | 51.85 | 77.283 | 77.283 | 29.076 | 19.223 | 32.641 | 11.422 | 5.837 | 32.567 | 29.936 | 43.841 | 46.427 | 25.544
CondInst [40] | 35.672 | 33.45 | 52.891 | 76.27 | 38.966 | 19.575 | 17.984 | 32.736 | 9.569 | 6.922 | 30.794 | 33.166 | 40.161 | 39.004 | 27.818
BlendMask [41] | 36.994 | 34.104 | 51.746 | 77.669 | 37.041 | 18.773 | 18.885 | 34.141 | 11.574 | 7.098 | 33.224 | 34.672 | 38.267 | 45.473 | 26.513
PointRend [13] | 38.219 | 34.334 | 50.913 | 77.488 | 36.489 | 26.823 | 18.075 | 35.615 | 12.375 | 6.642 | 33.483 | 27.222 | 40.155 | 49.573 | 27.782
Mask Transfiner [36] | 38.177 | 34.723 | 53.458 | 77.21 | 39.663 | 26.95 | 20.053 | 33.894 | 12.338 | 5.97 | 33.664 | 29.567 | 44.053 | 48.602 | 26.469
QTPR-Net | 40.161 | 36.47 | 53.081 | 78.599 | 36.713 | 33.336 | 22.706 | 37.529 | 13.247 | 6.929 | 34.975 | 35.568 | 43.802 | 52.553 | 30.388
Table 6. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the NWPU VHR-10 dataset.
NET | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Baseline | 8922 | 68.7919 | 67.7898 | 91.8704 | 72.6399 | 60.0279 | 65.9937 | 75.1733
QTA | 9157 | 69.0619 | 66.947 | 91.6985 | 72.7587 | 58.4274 | 64.7118 | 75.8612
MultiHead | 9713 | 68.1484 | 66.9607 | 89.9188 | 72.9379 | 54.734 | 64.508 | 76.8688
TransQTA | 10,571 | 69.8459 | 68.7143 | 93.0919 | 75.741 | 53.918 | 67.5924 | 76.4796
Table 7. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the SSDD dataset.
NET | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Baseline | 8340 | 71.2352 | 70.4962 | 96.2620 | 87.6452 | 69.1656 | 75.6742 | 48.0363
QTA | 8526 | 71.0313 | 69.9524 | 94.6564 | 88.1010 | 67.6106 | 78.2828 | 62.0198
MultiHead | 8044 | 71.1569 | 70.3533 | 95.5304 | 88.2287 | 68.2050 | 77.8009 | 57.5495
TransQTA | 8509 | 72.5246 | 71.5251 | 96.5326 | 89.6856 | 69.6230 | 78.6076 | 55.5446
Table 8. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the iSAID dataset.
NET | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Baseline | 14,673 | 38.9323 | 34.3458 | 57.9447 | 35.7689 | 18.8701 | 41.8448 | 49.2744
QTA | 18,993 | 41.9985 | 36.9622 | 60.7912 | 38.9254 | 21.7348 | 44.0669 | 53.9221
MultiHead | 17,224 | 39.5658 | 34.8422 | 58.3206 | 36.4288 | 18.7572 | 41.9450 | 51.3563
TransQTA | 17,993 | 42.4565 | 37.0704 | 60.9745 | 38.9419 | 22.4383 | 44.6925 | 54.7648
Table 9. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the NWPU VHR-10 dataset.
Layer | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
No cascade | 8922 | 68.7919 | 67.7898 | 91.8704 | 72.6399 | 60.0279 | 65.9937 | 75.1733
1 | 9684 | 68.9075 | 67.5039 | 92.3354 | 73.2900 | 56.7301 | 66.6510 | 76.3986
2 | 10,376 | 69.8459 | 68.7143 | 93.0919 | 75.7410 | 53.9180 | 67.5924 | 76.4796
3 | 11,242 | 68.4582 | 67.4572 | 90.7146 | 74.2923 | 59.2013 | 65.6678 | 76.0052
Table 10. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the SSDD dataset.
Layer | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
No cascade | 8340 | 71.2352 | 70.4962 | 96.2620 | 87.6452 | 69.1656 | 75.6742 | 48.0363
1 | 9287 | 71.9147 | 70.3033 | 96.3966 | 87.3504 | 69.7567 | 74.2963 | 40.4752
2 | 8509 | 72.5246 | 71.5251 | 96.5326 | 89.6856 | 69.6230 | 78.6076 | 55.5446
3 | 10,767 | 71.7811 | 70.7983 | 96.3289 | 96.3289 | 69.2145 | 76.6380 | 60.0000
Table 11. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the iSAID dataset.
Layer | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
No cascade | 14,673 | 38.9323 | 34.3458 | 57.9447 | 35.7689 | 18.8701 | 41.8448 | 49.2744
1 | 18,111 | 42.3277 | 36.7066 | 60.3668 | 60.3668 | 60.3668 | 44.4097 | 44.4097
2 | 17,993 | 42.4565 | 37.0704 | 60.9745 | 38.9419 | 22.4383 | 44.6925 | 54.7648
3 | 19,024 | 42.0979 | 36.9179 | 36.9179 | 39.0054 | 21.5603 | 21.5603 | 55.5409

Share and Cite

MDPI and ACS Style

Zhang, X.; Shen, J.; Hu, H.; Yang, H. A New Instance Segmentation Model for High-Resolution Remote Sensing Images Based on Edge Processing. Mathematics 2024, 12, 2905. https://doi.org/10.3390/math12182905

