Article

A New Instance Segmentation Model for High-Resolution Remote Sensing Images Based on Edge Processing

1 School of Computer Science and Technology, Hainan University, Haikou 570228, China
2 Haikou Key Laboratory of Deep Learning and Big Data Application Technology, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(18), 2905; https://doi.org/10.3390/math12182905
Submission received: 5 August 2024 / Revised: 9 September 2024 / Accepted: 16 September 2024 / Published: 18 September 2024
(This article belongs to the Special Issue Advances in Computer Vision and Machine Learning, 2nd Edition)

Abstract

To address the challenges posed by small, densely packed targets in remote sensing images, we propose a high-resolution instance segmentation model named QuadTransPointRend Net (QTPR-Net). This model significantly enhances instance segmentation performance in remote sensing images. The model consists of two main modules: preliminary edge feature extraction (PEFE) and edge point feature refinement (EPFR). We also design TransQTA, an approach for edge uncertainty point selection and feature processing in high-resolution remote sensing images. QTPR-Net uses multi-scale feature fusion and transformer techniques to refine rough masks and fine-grained features for the selected edge uncertainty points while balancing model size and accuracy. Based on experiments performed on three public datasets, NWPU VHR-10, SSDD, and iSAID, we demonstrate the superiority of QTPR-Net over existing approaches.

1. Introduction

High-resolution remote sensing image instance segmentation holds significant importance in image processing and remote sensing applications. Instance segmentation provides more detailed and comprehensive information about land feature boundaries, enabling the more accurate recognition and differentiation of individual targets in remote sensing images; applications such as land classification [1], target detection [2], and environmental monitoring [3] depend on this capability. High-resolution remote sensing images possess richer spectral, shape, and texture features than natural images, along with more structural targets and abundant scene information, making the instance segmentation of high-resolution remote sensing images more challenging.
Traditional image instance segmentation methods have matured considerably [4,5,6,7,8], aiming to segment every independent object within an image at the pixel level. Compared with natural images, image segmentation in remote sensing images has unique characteristics, such as many small targets, small inter-class variances, large intra-class variances, significant scale differences among categories, high learning difficulty, and complex backgrounds. Therefore, instance segmentation methods designed for natural images will not yield optimal results in remote sensing applications. To achieve optimal segmentation results for remote sensing datasets, researchers have developed instance segmentation models tailored to remote sensing image characteristics [9,10,11,12,13]. Cao et al. [14] proposed a remote sensing image instance segmentation method based on BoxInst from the perspective of weak supervision, which fully utilizes the existing rich OBB annotations and reduces the annotation burden; to address the issue of edge similarity in remote sensing images, this framework incorporates Canny edge supervision in a data-driven manner. The DCTC model [15] transforms classification problems into regression problems, iteratively regressing contours in remote sensing images to extract more accurate contour information and improve edge segmentation accuracy. QCIS-Net [16] is an end-to-end instance segmentation method that combines a transformer architecture with query-based methods to efficiently extract features and facilitate the correlation between the multi-level tasks of detection and segmentation, solving the long-term dependency problem in the visual space during remote sensing image instance segmentation.
We analyze the impact of the characteristics of remote sensing images on instance segmentation in existing research and divide the reasons into three different aspects: complex backgrounds, multi-scale targets, and interclass similarity with intraclass variability. High-resolution remote sensing images can provide more details thanks to advancements in remote sensing imaging technology and improved image resolution, which leads to better edge recognition accuracy. Despite this, researchers have confirmed that the feature information of individual target instances in remote sensing images is insufficient for segmenting them using existing natural image instance segmentation methods. HQ-ISNet [9] fully utilized multi-level feature maps to improve the mask branch and alleviate spatial resolution loss in a feature pyramid network (FPN) [17], effectively overcoming the effects of complex backgrounds on remote sensing image segmentation. Li et al. [10] used a region proposal network (RPN) [5] and key points to enhance mask precision and boundary accuracy, resulting in the more accurate extraction of buildings from complex backgrounds. As a means of mitigating the issue of rough edge segmentation, Chen et al. [12] developed a supervised edge attention module that suppressed irrelevant features and highlighted edge feature details.
We propose a method for high-resolution remote sensing image instance segmentation based on PointRend [13] that uses an improved quadtree attention mechanism [18] to compute attention from coarse to fine, first roughly segmenting coarse masks and then extracting fine-grained features from the relevant mask regions.
The research contributions of this study are as follows:
  • We propose a model for segmenting high-resolution remote sensing images based on QuadtreeAttention and a transformer called QTPR-Net. This method comprises two main parts: a preliminary edge feature extraction (PEFE) module and a refinement module for the edge point feature (EPFR), achieving high accuracy in remote sensing image instance segmentation;
  • In the PEFE module, we propose an edge point detection strategy suitable for high-resolution remote sensing images, recursively adding coarse-grained features layer by layer. Through multi-level feature fusion, uncertain points are selected in areas of high uncertainty;
  • As part of the EPFR module, we propose a transformer structure based on QuadtreeAttention (TransQTA), which utilizes a quadtree attention mechanism of the token pyramid structure to select the highest scoring areas and add positional encoding. It captures different contextual information to produce precise mask predictions for edge pixels through a multi-level structured design.
The effectiveness of QTPR-Net has been validated using three public remote sensing image datasets: NWPU VHR-10 [19], SSDD [20], and iSAID [21]. This paper is organized as follows: Section 2 discusses related work, Section 3 presents a detailed introduction to our proposed model, and Section 4 discusses and analyzes the datasets used for the experiments, experimental details, evaluation criteria, and experimental results. Finally, Section 5 offers a review and summary.

2. Related Work

2.1. Instance Segmentation of Remote Sensing Images

Remote sensing images have traditionally been interpreted primarily for automatic target detection. Traditional interpretation relies on human-defined features and is heavily dependent on expert knowledge, which limits the expressive power of the features and the effectiveness of detection. In response to increasingly complex remote sensing applications and demands, deep learning-based instance segmentation methods are being explored. Instance segmentation is an advanced computer vision task that combines object detection and semantic segmentation. An increasing amount of research has been conducted on the instance segmentation of remote sensing images, especially high-resolution remote sensing images.
Remote sensing image instance segmentation research has primarily focused on deep learning technologies, such as multi-scale feature fusion, dilated convolution, and attention mechanisms, in recent years. In multi-scale prediction, signals are sampled at varying granularities, and features are observed at various scales. Combining different levels of semantic information and spatial geometric information can produce more comprehensive and complete predictions. Gao et al. [22] introduced the CBAM module into the feature fusion process of the FPN, extracting significant features at different scales, enhancing the capability to represent features, and reducing interference from irrelevant information. This method can improve segmentation performance by applying different weights to input features, but it was only tested on the SSDD dataset, and its generalizability needs to be examined further. To mitigate the issue of an FPN not fully utilizing shallow feature maps, which are very useful for the detection and segmentation of small ships, Sun et al. developed a multi-scale feature pyramid network (MS-FPN [23]) using an atrous convolutional pyramid (ACP) [24]. The ACP module does have some limitations, however: it may reject some micro-ships as background noise when relying on shallow, high-resolution features.
The self-attention mechanism handles inputs consisting of multiple vectors of varying sizes that may have certain relationships among themselves; failing to exploit these relationships during training may result in poor model performance. AFL-Net [25] was designed by Qiu et al., incorporating a self-attention module into the attention multi-scale feature fusion (AMFF) module, adaptively adjusting the weights of multi-scale features, enhancing global awareness, and alleviating the false positives and missed detections caused by complex building backgrounds. A multiple attending path neural network (MAP-Net [26]) developed by Zhu et al., which addresses the problem of inaccurate edges in remote sensing image segmentation using convolutional neural networks, incorporates a spatial pooling enhancement module to capture global dependencies and continuously extract building entities, especially for large, low-texture buildings. With the aim of improving the perception of spatial information in remote sensing images, Wang et al. [27] developed a building extraction network, B-FGC-Net, based on the convolutional block attention module (CBAM), introducing a spatial attention unit, simplifying deep convolutional neural network training, automatically learning feature expressions, adaptively obtaining spatial weights for features, and emphasizing the spatial information representation of features. LFO-Net [28] is a lightweight feature optimization network that utilizes channel and spatial attention mechanisms in the feature layers to capture salient features and suppress less useful ones.

2.2. Vision Transformer

Vision Transformer [29] applies the transformer architecture to the field of computer vision, building on the substantial success of the transformer [30] in natural language processing. Essentially, a transformer is a novel encoder–decoder structure based on an attention mechanism. On this basis, researchers have proposed models such as MaskFormer [31], Mask2Former [32], and OneFormer [33]. In the field of computer vision, Vision Transformer has demonstrated excellent performance. The segmentation model presented by Yuan et al. [34] combines CNNs and transformers, integrating their features and decoding them using Swin Transformers to handle contextual and remote dependencies. Roy et al. [35] updated the standard ConvNet in MedNeXt by using mirrored transformer blocks. With limited image data, they employed a new technique for iteratively increasing the kernel size via upsampling in small-kernel networks to prevent performance saturation; compound scaling was also applied at multiple levels (depth, width, and kernel size) to improve image segmentation. Ke et al. proposed Mask Transfiner [36] for high-quality instance segmentation: by decomposing image regions and representing them as quadtrees, Mask Transfiner can predict highly accurate instance masks at a lower computational cost by processing only error-prone tree nodes and correcting them in parallel with a transformer. MPViT [37] uses a unique approach for creating multi-scale patch embeddings and multi-path structures, and Chen et al. [38] built on the idea of MPViT to enhance the segmentation effect in different scenes. These studies demonstrate that the transformer is effective for image instance segmentation, not only because of its unique characteristics for processing natural and other types of images but also because it plays an important role in segmenting remote sensing images. To enhance the performance of remote sensing image segmentation, QTPR-Net also employs the Vision Transformer approach.

3. Proposed Method

3.1. Overview

In this paper, we propose a high-resolution remote sensing image instance segmentation model based on PointRend, called QuadTransPointRend (QTPR-Net). According to Figure 1, the QTPR-Net framework is composed of two submodules: the preliminary edge feature extraction (PEFE) module and the refinement module for the edge point feature (EPFR).

3.2. Preliminary Edge Feature Extraction Module

Drawing on previous research, we selected the ResNet101 [39] network, FPN [17], and RPN [5] as our feature extraction networks (as shown in Figure 2). ResNet101 serves as the backbone that extracts basic image features, with each stage producing feature maps (C2–C5) with 256, 512, 1024, and 2048 channels and downsampling rates of 4×, 8×, 16×, and 32×, respectively, each stage halving the resolution of the previous one. Through a series of convolutional operations, the FPN combines the feature outputs of each stage via upsampling and lateral connections, standardizing each feature map to 256 channels. The resulting multi-scale maps P2–P5 correspond to 4×, 8×, 16×, and 32× downsampling, and pooling at the P6 level yields a 256 × H/64 × W/64 map to enhance feature diversity. The feature layers generated by the FPN are used for classification and regression operations and are then aligned in the ROIAlign module, ultimately providing fixed-size region proposals of 256 × 7 × 7. At this stage, we perform target detection by outputting categories and boundary regression values for the previous stage's region proposals, calculating each category's probability using Softmax, and later generating fine masks for the uncertain edge points. After extracting and transforming the features of the previous stage to obtain a feature $Z$, the category scores $s$ and the per-category probabilities are calculated as follows:
$s = W_{cls} Z + b_{cls}$
$P(c_i \mid Z) = \frac{\exp(s_i)}{\sum_{j=1}^{C} \exp(s_j)}$
where $W_{cls}$ is the weight of the classification layer, $b_{cls}$ is the bias of the classification layer, and $P(c_i \mid Z)$ represents the probability that the candidate region belongs to category $i$.
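As a concrete illustration of this classification step, a minimal PyTorch-style sketch is given below; the tensor shapes and the function name are illustrative assumptions rather than details of our implementation.

import torch

def classify_regions(z: torch.Tensor, w_cls: torch.Tensor, b_cls: torch.Tensor) -> torch.Tensor:
    """Illustrative classification head: raw category scores followed by Softmax.
    z: (N, D) pooled region features; w_cls: (C, D) weights; b_cls: (C,) bias.
    Returns (N, C) per-category probabilities."""
    s = z @ w_cls.T + b_cls          # category scores s = W_cls Z + b_cls
    return torch.softmax(s, dim=1)   # probabilities P(c_i | Z)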
Afterward, in the regression layer, linear regression is used to predict the bounding box regression function. The boundary regression prediction parameters $t$ (comprising $\Delta x$, $\Delta y$, $\Delta w$, and $\Delta h$, the offsets of the box center co-ordinates and of its width and height) are used to adjust the bounding boxes of the candidate regions, where $W_{reg}$ and $b_{reg}$ are the weights and biases of the regression layer, respectively. The calculation method is as follows:
$t = W_{reg} Z + b_{reg}$
We take the candidate regions $(x, y, w, h)$ generated in the RPN stage and combine them with the predicted regression parameters $t$ to obtain the final accurate bounding box $(x', y', w', h')$:
$(x', y') = (x + w \Delta x,\ y + h \Delta y)$
$(w', h') = (w \cdot \exp(\Delta w),\ h \cdot \exp(\Delta h))$
Finally, we obtain the center co-ordinates $(x', y')$, width $w'$, and height $h'$ of the refined bounding box. Overall, in the preliminary edge feature processing module, we achieve the efficient detection and segmentation of targets through multi-level and multi-scale feature extraction and fusion.
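For completeness, the box decoding above can be written as a short PyTorch-style sketch; the function name and tensor layout are illustrative and not taken from our code.

import torch

def decode_boxes(proposals: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Apply the predicted regression offsets t = (dx, dy, dw, dh) to RPN proposals.
    proposals: (N, 4) tensor of (x, y, w, h) candidate regions.
    deltas:    (N, 4) tensor of predicted offsets.
    Returns the (N, 4) refined boxes (x', y', w', h')."""
    x, y, w, h = proposals.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    x_new = x + w * dx                 # shift the center proportionally to box size
    y_new = y + h * dy
    w_new = w * torch.exp(dw)          # scale width/height in log-space (stays positive)
    h_new = h * torch.exp(dh)
    return torch.stack([x_new, y_new, w_new, h_new], dim=1)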

3.3. Uncertain Edge Point Selection Strategy

Since remote sensing image instances are small and dense, the quality of edge segmentation has a significant impact on the effectiveness of target instance segmentation in high-resolution remote sensing image analysis. Many solutions have been proposed for edge segmentation, such as multi-scale feature fusion, feature decoupling, and box strategies that improve object detection and feature fusion; the core focus is on predicting the features of edge points. We therefore adopt PointRend's rendering approach to flexibly select points on the 2D plane of remote sensing images and predict their segmentation labels in an iterative, coarse-to-fine process.
QTPR-Net represents the output of image instance segmentation as a regular grid of labels, with target instances encoded on a gridded feature map. One or more C-channel feature maps produced by the CNN are further processed to output predictions for K class labels on regular grids of different resolutions. The selection of uncertain edge points is crucial for subsequent mask refinement: as many points as possible are taken from high-frequency areas, with the initial granularity $M_0$ determining the coarse-grained prediction. After obtaining a preliminary coarse-grained mask prediction, the mask's resolution is gradually enhanced through recursive refinement steps, with the model progressively focusing on the areas of highest uncertainty and processing $M_i$ uncertain points at each step $i$. In QTPR-Net, we tested different values to find the most suitable $M_0$ for high-resolution remote sensing images, considering computational resources and processing speed. We chose $M_0 = 8$; with bilinear upsampling, the mask resolution reaches $M \times M = 512 \times 512$ after six refinement steps.
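The selection strategy can be sketched as follows, using the PointRend convention that uncertainty is highest where the foreground probability is closest to 0.5 (i.e., where the logit magnitude is smallest); the helper name and the number of points per step are illustrative assumptions rather than the exact values used in QTPR-Net.

import torch
import torch.nn.functional as F

def select_uncertain_points(mask_logits: torch.Tensor, num_points: int) -> torch.Tensor:
    """Return the flat indices of the num_points most uncertain grid positions.
    mask_logits: (N, 1, H, W) coarse per-pixel mask logits; a small |logit| means
    the prediction is close to 0.5 and therefore uncertain."""
    n, _, h, w = mask_logits.shape
    uncertainty = -mask_logits.abs().view(n, h * w)   # larger value = less certain
    return uncertainty.topk(num_points, dim=1).indices

# Schematic refinement schedule: starting from an M0 x M0 = 8 x 8 coarse mask,
# six bilinear upsampling steps reach 512 x 512; at each step a fresh set of
# uncertain points is selected and re-predicted by the point head.
logits = torch.randn(2, 1, 8, 8)                      # dummy coarse prediction
for step in range(6):
    logits = F.interpolate(logits, scale_factor=2, mode="bilinear", align_corners=False)
    idx = select_uncertain_points(logits, num_points=256)   # 256 points is illustrative
    # ... point-wise re-prediction at idx would go here ...
print(logits.shape)                                   # torch.Size([2, 1, 512, 512])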

3.4. Edge Point Feature Refinement Module

Small targets of the same category are often densely packed in high-resolution remote sensing images. By assigning different weights to coarse-level features, we can amplify the influence of contextual features on instance segmentation, achieving a higher level of precision in target segmentation. In the EPFR module (as shown in Figure 3), a rough mask prediction $M_{coarse}$ is produced as well as a fine-grained segmentation mask prediction $M_{fine}$ for the uncertain edges. By combining both, we obtain the final segmentation mask $M$:
$M = M_{coarse} + M_{fine}$
We designed an encoder–decoder structure with a multi-level quadtree attention [18] mechanism, inspired by Vision Transformer, to process image regions of different resolutions, further refining the extracted coarse-level feature maps. We call this structure TransQTA. Each encoder layer in TransQTA consists of a self-attention layer, a feedforward neural network, a normalization layer, and positional encoding fusion. In the self-attention layer, quadtree attention projects the image feature tensor $X$ through the following equations to obtain the queries, keys, and values $Q$, $K$, and $V$:
$Q = W_q X$
$K = W_k X$
$V = W_v X$
where $W_q$, $W_k$, and $W_v$ are learnable parameters. The self-attention scores are then obtained by applying the Softmax function to the dot product of the queries and keys, computing the dot product between $Q$ and $K$ and adjusting it by a scaling factor:
$\mathrm{AttentionScore}(A) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)$
where $d_k$ is the dimension of each head, i.e., the embedding channel dimension, and $\sqrt{d_k}$ is the scaling factor, which helps to avoid overly large values before the Softmax operation. Then, based on the obtained attention scores $A$, the value vectors are weighted to produce the final result:
QuadtreeAttention = einsum("nlsh, nshd → nlhd", A, V)
where n represents batch size, l denotes the number of query positions, s denotes the number of key positions, h represents the number of heads, and d represents the dimension of each head. Here, the tensor operation einsum is used to perform batch matrix multiplication, combining the attention scores and values.
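A minimal sketch of this computation, written with einsum in the same index convention, is shown below; it is a plain multi-head formulation, and the top-k sparsification that distinguishes quadtree attention (discussed next) is omitted for brevity. The shapes and helper name are assumptions for illustration.

import torch

def attention_with_einsum(x: torch.Tensor, wq: torch.Tensor, wk: torch.Tensor,
                          wv: torch.Tensor, num_heads: int) -> torch.Tensor:
    """x: (n, l, c) token features; wq/wk/wv: (c, c) learnable projections."""
    n, l, c = x.shape
    d = c // num_heads
    # Projections, written as right-multiplications for the (tokens, channels) layout.
    q = (x @ wq).view(n, l, num_heads, d)
    k = (x @ wk).view(n, l, num_heads, d)
    v = (x @ wv).view(n, l, num_heads, d)
    # Attention scores: Softmax of the scaled dot product over the key positions s.
    scores = torch.einsum("nlhd,nshd->nlsh", q, k) / d ** 0.5
    attn = scores.softmax(dim=2)
    # Weighted sum of the values: einsum("nlsh, nshd -> nlhd", A, V).
    out = torch.einsum("nlsh,nshd->nlhd", attn, v)
    return out.reshape(n, l, c)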
TransQTA also performs multi-scale processing, calculating attention and passing messages at different scales and using different top-k values for downsampling and local enhancement. With top-k selection, only the k highest-scoring regions at each level are considered, reducing image noise by focusing on the most relevant features. Setting appropriate top-k values also reduces computational costs, and suitable top-k values at different levels provide more extensive information and refine the most important features of high-resolution remote sensing images.
In the feature processing flow of the EPFR module, feature maps are first flattened into one-dimensional vectors to facilitate subsequent grid rendering, and a ReLU activation adds nonlinear representation capability, resulting in a feature map $F$ that captures contextual information. The TransQTA structure then captures further contextual information through layer-by-layer feature transformations, extracting more advanced feature information. A feature $F_{qta}$ is obtained in the self-attention part and is processed by the feed-forward neural network into $F_{trans}$, which has the same size and shape as $F_{qta}$. Convolutional fusion layers are then used to integrate the feature information and create the coarse segmentation mask $M_{coarse}$:
$F_{qta} = \mathrm{LayerNorm}(F + \mathrm{QuadtreeAttention}(Q, K, V))$
$F_{trans} = \mathrm{LayerNorm}(F_{qta} + (\mathrm{ReLU}(W_1 F_{qta} + b_1) W_2 + b_2))$
$M_{coarse} = \mathrm{Sigmoid}(\mathrm{Conv}(F_{trans}, W_{mask}, b_{mask}))$
where $W$ and $b$ are the weight matrix and bias parameters of each layer, respectively.
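A minimal sketch of one such encoder layer is given below; the channel and hidden sizes are illustrative, and standard multi-head attention stands in for the quadtree attention module.

import torch
import torch.nn as nn

class TransQTALayerSketch(nn.Module):
    """Attention with a residual connection and LayerNorm, a two-layer ReLU
    feed-forward block with a second residual and LayerNorm, and a 1x1
    convolution + Sigmoid producing the coarse mask."""

    def __init__(self, dim: int = 256, heads: int = 8, hidden: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.mask_head = nn.Conv1d(dim, 1, kernel_size=1)

    def forward(self, f):
        # f: (n, tokens, dim) flattened feature map F.
        f_qta = self.norm1(f + self.attn(f, f, f, need_weights=False)[0])
        f_trans = self.norm2(f_qta + self.ffn(f_qta))
        m_coarse = torch.sigmoid(self.mask_head(f_trans.transpose(1, 2)))
        return f_trans, m_coarse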
When processing fine-grained features, we use bilinear interpolation to combine the basic features obtained in the first stage, the coarse masks from the previous stage, and the uncertain edge points $G$, obtaining fine-grained point features through a series of feature transformations and mask predictions. The fine-grained segmentation mask $M_{fine}$ is then obtained by the prediction layer:
$M_{fine} = W_{pred} \cdot G + b_{pred}$
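In its simplest reading, this prediction layer is a point-wise linear map over the gathered point features; the sketch below makes this concrete, with the feature dimension chosen only for illustration.

import torch
import torch.nn as nn

# Point-wise prediction layer applied to the per-point features G gathered at
# the uncertain edge points (feature dimension 256 is an assumption).
point_pred = nn.Linear(256, 1)
g = torch.randn(2, 1024, 256)        # (batch, sampled points, channels)
m_fine = point_pred(g)               # per-point fine-grained mask prediction
print(m_fine.shape)                  # torch.Size([2, 1024, 1])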

4. Experiments

The following sections provide an overview of the specific experimental details, the datasets used, the evaluation metrics used, and the results obtained.

4.1. Dataset

Considering the diversity and representativeness of remote sensing image data sources and the challenge and complexity of remote sensing scenes, we selected the NWPU VHR-10 dataset [19], the SSDD dataset [20], and the iSAID dataset [21] as the experimental datasets. These three public datasets allow us to comprehensively validate the performance of the remote sensing image instance segmentation model, ensuring the model's effectiveness and generalizability. We also converted the three datasets into the COCO format for ease of experimental verification, including training and validation sets with original images and JSON label files.
NWPU VHR-10 Dataset: Northwestern Polytechnical University released a 10-class geographical remote sensing dataset called NWPU VHR-10 for spatial object detection. This dataset contains 800 high-resolution satellite images derived from Google Earth and the Vaihingen dataset, annotated by experts. The dataset covers the following categories: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track and field, harbor, bridge, and vehicle. For our experiments, we set 30 k iterations and a learning rate of 0.02.
SSDD Dataset: The SSDD dataset is the first publicly available dataset widely used for research on deep learning-based synthetic aperture radar (SAR) image ship-detection technologies, produced by the Department of Electronics and Information Engineering at the Naval Aeronautical and Astronautical University. It contains 1160 images and 2456 ships. Although it has fewer images, the only category is ship, so this is sufficient to train detection models. For this experiment, we set 20 k iterations and a learning rate of 0.02.
iSAID Dataset: The iSAID dataset is a new open benchmark dataset for multi-class instance segmentation in remote sensing images. iSAID includes 15 categories with 655,451 individual instances marked separately, with up to 8000 instances in a single image and image resolutions ranging from 800 to 13,000, making it the first large-scale instance segmentation dataset in the remote sensing field. Considering the experimental environment and model training speed, based on prior public work, each image in this dataset was segmented into blocks of 800 × 800 pixels with a stride of 100 for fair benchmarking against existing methods. Detailed comparative experiments and ablation studies were conducted with 200 k iterations at a learning rate of 0.01.

4.2. Experimental Setup

Due to significant differences in resolution and quantity across the three datasets, we set different numbers of iterations and training batch sizes to obtain the best segmentation results. In the initial 5% of training iterations, we started at 0.1% of the base learning rate and gradually reached the base rate via linear growth. Furthermore, in the QTPR-Net tuning strategy, we kept the initial learning rate for the first half of the iteration count and decreased it at the halfway point and the three-fifths mark. To prevent overfitting, we also set a weight decay coefficient of 0.0001 and normalization parameters. Our experiments were conducted on an NVIDIA GeForce RTX 4090 using PyTorch 2.0.0 and CUDA 11.8, with ResNet101-FPN as the baseline and the number of GPU data-loading workers set to 4.
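The warmup and decay schedule described above can be sketched as a simple function of the iteration index; the decay factor of 0.1 is an assumption, since the exact factor is not stated here.

def learning_rate(step: int, max_steps: int, base_lr: float) -> float:
    """Linear warmup from 0.1% of the base rate over the first 5% of iterations,
    then step decay at the 1/2 and 3/5 points of training (factor 0.1 assumed)."""
    warmup_steps = int(0.05 * max_steps)
    if step < warmup_steps:
        start = 0.001 * base_lr
        return start + (base_lr - start) * step / warmup_steps
    lr = base_lr
    if step >= int(0.5 * max_steps):
        lr *= 0.1
    if step >= int(0.6 * max_steps):
        lr *= 0.1
    return lr

# Example with the NWPU VHR-10 settings (30 k iterations, base learning rate 0.02):
# learning_rate(0, 30000, 0.02) -> 2e-05, learning_rate(20000, 30000, 0.02) -> 0.0002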

4.3. Evaluation Metrics

In order to assess the performance of deep learning instance segmentation methods, the following evaluation metrics are typically used:
Recall: This reflects the proportion of samples that are correctly predicted to be positive among all samples that are actually positive.
$\mathrm{Recall} = \frac{TP}{TP + FN}$
Precision: This reflects the proportion of samples that are actually positive among all samples predicted to be positive.
$\mathrm{Precision} = \frac{TP}{TP + FP}$
Accuracy: The proportion of correctly classified samples to the total number of samples.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Here, $TP$ (true positive) is the number of positive samples predicted as positive; $FN$ (false negative) is the number of positive samples predicted as negative; $FP$ (false positive) is the number of negative samples predicted as positive; and $TN$ (true negative) is the number of negative samples predicted as negative. In this experiment, we used average precision ($AP$) as our primary evaluation criterion:
$AP = \int_0^1 P(R)\, dR$
The $AP$ values in the MS COCO format are obtained by averaging across 10 IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. In addition to $AP$, the MS COCO evaluation metrics also include the average precision at a single threshold, such as $AP_{50}$ (IoU threshold of 0.5) and $AP_{75}$ (IoU threshold of 0.75). $AP_S$, $AP_M$, and $AP_L$ represent the mean average precision for instances with areas smaller than 32 × 32, between 32 × 32 and 96 × 96, and larger than 96 × 96, respectively. We report the $AP_{bbox}$ values for object detection and the detailed $AP_{segm}$ values for instance segmentation across the different datasets. In addition to precision metrics, we also evaluate the model's efficiency using memory usage (Memory), measuring the model's complexity and storage requirements.
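As a concrete illustration of the AP integral, the metric can be approximated numerically from a precision-recall curve as shown below; in practice we rely on the standard COCO evaluation tools, so this sketch is only meant to make the definition tangible.

import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Approximate AP as the area under the precision-recall curve.
    Inputs are assumed sorted by increasing recall, e.g. obtained by sweeping
    the detection score threshold; COCO additionally averages this quantity
    over ten IoU thresholds from 0.5 to 0.95."""
    # Make precision monotonically non-increasing before integrating,
    # as is standard for PR-curve interpolation.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum(np.diff(recall) * precision))

# Example: a detector with precision 1.0 at every recall level gives AP = 1.0.
print(average_precision(np.ones(5), np.linspace(0.2, 1.0, 5)))   # -> 1.0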

4.4. Comparative Experiments

4.4.1. Results on NWPU VHR-10

A comparison of QTPR-Net with other advanced methods is given in Table 1. QTPR-Net achieved the highest AP_segm value of 69.1944 on the NWPU VHR-10 dataset, outperforming the other models. QTPR-Net performed best in edge recognition, and the uncertain edge point handling module clearly improved performance relative to the base model, PointRend. Table 2 shows the per-class segmentation results of QTPR-Net on the NWPU VHR-10 dataset. QTPR-Net outperformed the other models on small targets such as ships, harbors, and vehicles, showing that it is effective at instance-segmenting small targets. Our model also achieved the best results without occupying the most memory on this dataset.

4.4.2. Results Using SSDD

Table 3 presents the object detection and instance segmentation results of QTPR-Net on the SSDD dataset. Since the SSDD dataset has only one category, ship, with many small targets, QTPR-Net achieved the best results in target edge segmentation and small-target segmentation in terms of AP_segm, AP_50, and AP_M. With QTPR-Net, the handling of uncertain edge points made a significant difference relative to the base model. We also found that QTPR-Net took up less memory than Mask Transfiner but performed better, indicating a good balance. Taking Figure 4 as an example, we visualize the segmentation performance of the QTPR-Net model.

4.4.3. Results Using iSAID

For the iSAID dataset, Table 4 shows the object detection and instance segmentation results of QTPR-Net. Unlike the first two public datasets, iSAID requires more memory, and all models require more computation to handle it. As shown in Table 4, QTPR-Net did not occupy the highest amount of memory, far less than Mask R-CNN and CondInst, and its memory usage did not greatly exceed that of PointRend and Mask Transfiner, yet it achieved the best segmentation results, with an instance segmentation AP_segm of 37.07. These results indicate that QTPR-Net has advantages both in memory usage and in segmentation. In Table 5, we provide the instance segmentation results for the 15 categories in the iSAID dataset, showing that QTPR-Net performed well on small targets such as ships, storage tanks, swimming pools, and harbors.
As an example, the left side of Figure 5 illustrates the changes in the loss_box_reg, loss_cls, loss_mask, and loss_mask_point metrics during the instance segmentation process of our model on the NWPU VHR-10 dataset. These, respectively, represent the bounding box regression loss, classification loss, mask loss, and point loss. The right side of Figure 5 displays the model's recall rate and mask accuracy. The model's recall rate reached as high as 97.83%, and the accuracy of mask segmentation reached 97.27%. The curves indicate that the model's convergence is very stable, with excellent results in category prediction and very close agreement between the overall boundary and mask predictions, indicating an effective way of handling uncertain edge points.

4.5. Ablation Experiments

4.5.1. Ablation Experiments on TransQTA Module

In the TransQTA module, we designed ablation studies with different attention mechanisms. To test the effectiveness of our TransQTA module, we compared the base model (without the TransQTA module), the QuadtreeAttention mechanism without the transformer structure (abbreviated as QTA), the multi-head attention mechanism with the transformer structure, and the QuadtreeAttention mechanism with the transformer structure (i.e., TransQTA).
Based on the ablation data across all three datasets, our TransQTA module achieved the best segmentation results on the NWPU VHR-10 dataset, as shown in Table 6, despite occupying the largest amount of memory; this demonstrates that the module significantly improves segmentation performance on small targets. On SSDD, TransQTA also achieved the best results, as shown in Table 7, without consuming the most memory, and it performed well across all metrics, which confirms the importance of QuadtreeAttention for small-target edge segmentation. On the iSAID dataset, TransQTA successfully balanced memory usage and segmentation quality, as shown in Table 8, with the best results again achieved by using QuadtreeAttention.
Taking the SSDD dataset's mask point loss and accuracy curves as an example, as shown in Figure 6, our TransQTA module achieved the lowest loss value and a relatively high accuracy rate (95.70%) among the compared modules. Compared with the base model, the compared modules all reached a lower mask loss and a higher final accuracy, which further proves the effectiveness of the QTPR-Net design.

4.5.2. Ablation Experiments of TransQTA Cascaded Structure

The transformer’s cascaded structure is shown in Figure 7. Across the three datasets, the cascaded ablation study results (as shown in Table 9, Table 10 and Table 11) for the transformer structure indicate that the best results are achieved with two layers. As a result, we set the transformer’s cascaded structure in the feature processing part of uncertain edge points to two layers, which satisfies the balance between memory consumption and segmentation quality.
Taking the SSDD dataset as an example, the table data and Figure 8 indicate that, in this part of the ablation study, the TransQTA two-layer cascaded structure achieved the lowest point loss and the highest mask and point accuracy, resulting in the highest segmentation precision and demonstrating the effectiveness of the two-layer EncoderLayer structure.
In general, QTPR-Net met its design intentions and achieved satisfactory results on three public datasets, but some shortcomings remain. Because our model is implemented on the Detectron2 framework, we used default parameters for some basic settings (such as normalization), which are tuned for instance segmentation on natural images; more experiments are needed to determine whether these settings are optimal for high-resolution remote sensing images. Additionally, in selecting the datasets, we chose the three most widely used public remote sensing image datasets, but we have not yet verified other scene datasets, such as the frequently used WHU Building dataset, the Potsdam dataset, and other architectural-scene remote sensing image datasets; we will therefore also conduct instance segmentation research on these types of datasets in the future. Last but not least, although QTPR-Net achieved the best overall results in target detection and instance segmentation, the segmentation results for some large target categories, such as basketball courts, baseball fields, and ground track fields, were suboptimal. This is partly because of the inter-class similarity among these categories and may also be due to large targets being split apart after the images are tiled; other models also do not segment these few categories particularly well, and more edge feature information is required to improve their segmentation. Future research will incorporate more contextual information as well as target edges to improve the segmentation of such categories.

5. Conclusions

In this paper, we proposed a high-resolution remote sensing image instance segmentation model, QTPR-Net, and tested its effectiveness on the NWPU VHR-10, SSDD, and iSAID datasets, achieving instance segmentation accuracies of 69.1944, 71.5251, and 37.0704, respectively. The network consists mainly of a preliminary edge feature extraction module and an edge point feature refinement module, for which we designed an uncertain point selection strategy that selects higher-quality edge points in high-resolution remote sensing images and improves the segmentation of edge features. The edge point feature refinement module in QTPR-Net verified the effectiveness of the quadtree attention mechanism in edge segmentation, employing a coarse-to-fine pyramid approach to enhance the attention paid to uncertain points and incorporating multi-scale positional encoding to improve efficiency and reduce loss. Additionally, QTPR-Net does not occupy much memory in a non-distributed environment, which enables us to balance the complexity of the model with its segmentation ability. In future research, we will study whether some basic parameter settings are optimal, improve the generalizability of the model to other scenes from high-resolution remote sensing image datasets, and take into account the global contextual information of target instances to fill in any gaps in the model.

Author Contributions

This work was conducted in collaboration with all authors. Conceptualization, H.Y. and X.Z.; methodology, X.Z.; validation, X.Z.; formal analysis, X.Z.; investigation, J.S.; resources, H.H.; data curation, J.S. and H.H.; writing—original draft preparation, X.Z.; writing—review and editing, H.Y. and X.Z.; visualization, J.S. and H.H.; supervision, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hainan Province Science and Technology Special Fund under Grant ZDYF2022GXJS228, in part by Haikou Science and Technology Plan Project under Grant 2022-007 and Grant 2022-015.

Data Availability Statement

We used the publicly available datasets NWPU VHR-10, SSDD, and iSAID. The NWPU VHR-10 dataset can be accessed at https://gcheng-nwpu.github.io/##Datasets on 18 September 2024, the SSDD dataset can be accessed at https://github.com/TianwenZhang0825/Official-SSDD/blob/main/README.md on 18 September 2024, and the iSAID dataset can be accessed at https://captain-whu.github.io/iSAID on 18 September 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, X.; Chen, X.; Lu, X.; Sun, B. Unsupervised Change Detection by Cross-Resolution Difference Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5606616. [Google Scholar] [CrossRef]
  2. Liu, J.; Yang, D.; Hu, F. Multiscale Object Detection in Remote Sensing Images Combined with Multi-Receptive-Field Features and Relation-Connected Attention. Remote Sens. 2022, 14, 427. [Google Scholar] [CrossRef]
  3. Chen, D.; Ma, A.; Zheng, Z.; Zhong, Y. Large-Scale Agricultural Greenhouse Extraction for Remote Sensing Imagery Based on Layout Attention Network: A Case Study of China. ISPRS J. Photogramm. Remote Sens. 2023, 200, 73–88. [Google Scholar] [CrossRef]
  4. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  6. Peng, S.; Jiang, W.; Pi, H.; Li, X.; Bao, H.; Zhou, X. Deep Snake for Real-Time Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8530–8539. [Google Scholar] [CrossRef]
  7. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. SOLO: A Simple Framework for Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8587–8601. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152. [Google Scholar]
  9. Su, H.; Wei, S.; Liu, S.; Liang, J.; Wang, C.; Shi, J.; Zhang, X. HQ-ISNet: High-Quality Instance Segmentation for Remote Sensing Imagery. Remote Sens. 2020, 12, 989. [Google Scholar] [CrossRef]
  10. Li, Q.; Mou, L.; Hua, Y.; Sun, Y.; Jin, P.; Shi, Y.; Zhu, X.X. Instance Segmentation of Buildings Using Keypoints. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 1452–1455. [Google Scholar] [CrossRef]
  11. Chen, S.; Ogawa, Y.; Zhao, C.; Sekimoto, Y. Enhanced Large-Scale Building Extraction Evaluation: Developing a Two-Level Framework Using Proxy Data and Building Matching. Eur. J. Remote Sens. 2024, 57, 2374844. [Google Scholar] [CrossRef]
  12. Chen, X.; Lian, Y.; Jiao, L.; Wang, H.; Gao, Y.; Lingling, S. Supervised Edge Attention Network for Accurate Image Instance Segmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 617–631. [Google Scholar] [CrossRef]
  13. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. PointRend: Image Segmentation As Rendering. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9796–9805. [Google Scholar] [CrossRef]
  14. Cao, X.; Zou, H.; Li, J.; Ying, X.; He, S. OBBInst: Remote Sensing Instance Segmentation with Oriented Bounding Box Supervision. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103717. [Google Scholar] [CrossRef]
  15. Chen, Z.; Liu, T.; Xu, X.; Leng, J.; Chen, Z. DCTC: Fast and Accurate Contour-Based Instance Segmentation With DCT Encoding for High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8697–8709. Available online: https://ieeexplore.ieee.org/document/10495157 (accessed on 1 April 2024). [CrossRef]
  16. Chen, E.; Li, M.; Zhang, Q.; Chen, M. Query-Based Cascade Instance Segmentation Network for Remote Sensing Image Processing. Appl. Sci. 2023, 13, 9704. [Google Scholar] [CrossRef]
  17. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  18. Tang, S.; Zhang, J.; Zhu, S.; Tan, P. QuadTree Attention for Vision Transformers. arXiv 2022, arXiv:2201.02767. [Google Scholar]
  19. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-Class Geospatial Object Detection and Geographic Image Classification Based on Collection of Part Detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  20. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR Ship Detection Dataset (SSDD): Official Release and Comprehensive Data Analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  21. Zamir, S.W.; Arora, A.; Gupta, A.; Khan, S.H.; Sun, G.; Khan, F.S.; Zhu, F.; Shao, L.; Xia, G.; Bai, X. iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. arXiv 2019, arXiv:1905.12886. [Google Scholar]
  22. Gao, F.; Huo, Y.; Wang, J.; Hussain, A.; Zhou, H. Anchor-Free SAR Ship Instance Segmentation with Centroid-Distance Based Loss. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11352–11371. [Google Scholar] [CrossRef]
  23. Sun, Z.; Meng, C.; Cheng, J.; Zhang, Z.; Chang, S. A Multi-Scale Feature Pyramid Network for Detection and Instance Segmentation of Marine Ships in SAR Images. Remote Sens. 2022, 14, 6312. [Google Scholar] [CrossRef]
  24. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  25. Qiu, Y.; Wu, F.; Qian, H.; Zhai, R.; Gong, X.; Yin, J.; Liu, C.; Wang, A. AFL-Net: Attentional Feature Learning Network for Building Extraction from Remote Sensing Images. Remote Sens. 2023, 15, 95. [Google Scholar] [CrossRef]
  26. Zhu, Q.; Liao, C.; Hu, H.; Mei, X.; Li, H. MAP-Net: Multiple Attending Path Neural Network for Building Footprint Extraction From Remote Sensed Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6169–6181. [Google Scholar] [CrossRef]
  27. Wang, Y.; Zeng, X.; Liao, X.; Zhuang, D. B-FGC-Net: A Building Extraction Network from High Resolution Remote Sensing Imagery. Remote Sens. 2022, 14, 269. [Google Scholar] [CrossRef]
  28. Zhang, X.; Wang, H.; Xu, C.; Lv, Y.; Fu, C.; Xiao, H.; He, Y. A Lightweight Feature Optimizing Network for Ship Detection in SAR Image. IEEE Access 2019, 7, 141662–141678. [Google Scholar] [CrossRef]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  31. Cheng, B.; Schwing, A.; Kirillov, A. Per-Pixel Classification Is Not All You Need for Semantic Segmentation. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 17864–17875. [Google Scholar]
  32. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. [Google Scholar] [CrossRef]
  33. Jain, J.; Li, J.; Chiu, M.; Hassani, A.; Orlov, N.; Shi, H. OneFormer: One Transformer to Rule Universal Image Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2989–2998. [Google Scholar] [CrossRef]
  34. Yuan, F.; Zhang, Z.; Fang, Z. An Effective CNN and Transformer Complementary Network for Medical Image Segmentation. Pattern Recognit. 2023, 136, 109228. [Google Scholar] [CrossRef]
  35. Roy, S.; Koehler, G.; Ulrich, C.; Baumgartner, M.; Petersen, J.; Isensee, F.; Jäger, P.F.; Maier-Hein, K.H. MedNeXt: Transformer-Driven Scaling of ConvNets for Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2023: 26th International Conference, Vancouver, BC, Canada, 8–12 October 2023; pp. 405–415. [Google Scholar] [CrossRef]
  36. Ke, L.; Danelljan, M.; Li, X.; Tai, Y.W.; Tang, C.K.; Yu, F. Mask Transfiner for High-Quality Instance Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4402–4411. [Google Scholar] [CrossRef]
  37. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. MPViT: Multi-Path Vision Transformer for Dense Prediction. arXiv 2021, arXiv:2112.11010. [Google Scholar]
  38. Chen, S.; Ogawa, Y.; Zhao, C.; Sekimoto, Y. Large-Scale Individual Building Extraction from Open-Source Satellite Imagery via Super-Resolution-Based Instance Segmentation Approach. ISPRS J. Photogramm. Remote Sens. 2023, 195, 129–152. [Google Scholar] [CrossRef]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  40. Tian, Z.; Shen, C.; Chen, H. Conditional Convolutions for Instance Segmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 282–298. [Google Scholar] [CrossRef]
  41. Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8570–8578. [Google Scholar] [CrossRef]
  42. Tian, Z.; Shen, C.; Wang, X.; Chen, H. BoxInst: High-Performance Instance Segmentation with Box Annotations. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 5439–5448. [Google Scholar] [CrossRef]
Figure 1. QTPR-Net framework.
Figure 2. The structure of the preliminary edge feature extraction module. (a) Shows the backbone network of the model; (b) Shows the RPN structure of the model; (c) Shows the BoxHead structure of the model.
Figure 3. The structure of the edge point feature refinement module.
Figure 4. Visualization of SSDD dataset.
Figure 5. Comparison chart of various indicators of the model for the NWPU VHR-10 dataset (on the left is a line graph of loss indicators, and on the right is a line graph of recall and accuracy indicators).
Figure 6. Comparison chart of various indicators of different structures for the SSDD dataset (on the left is a line graph of mask loss values, and on the right is a line graph of accuracy values).
Figure 7. The structure of TransQTA.
Figure 8. Comparison chart of the various indicators of different layers for the SSDD dataset (on the left is a line graph of the point loss values, and on the right is a line graph of the accuracy values).
Table 1. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the NWPU VHR-10 dataset.
Method | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Mask R-CNN [4] | 8509 | 70.6091 | 67.4047 | 93.2067 | 75.5896 | 57.7106 | 65.3746 | 75.0193
CondInst [40] | 8258 | 68.8562 | 64.3168 | 91.1648 | 66.6747 | 53.7024 | 62.4261 | 71.5836
BlendMask [41] | 3509 | 63.9361 | 60.8243 | 88.8132 | 63.5941 | 48.0288 | 59.5018 | 64.2823
PointRend [13] | 8922 | 68.7919 | 67.7898 | 91.8704 | 72.6399 | 60.0279 | 65.9937 | 75.1733
BoxInst [42] | 7320 | 66.4603 | 50.2457 | 81.0751 | 52.2650 | 38.6690 | 46.5827 | 57.2665
Mask Transfiner [36] | 15,868 | 69.7329 | 67.3434 | 91.7960 | 75.1191 | 55.8363 | 65.8000 | 75.0204
QTPR-Net | 10,376 | 70.5119 | 69.1944 | 93.0919 | 75.7410 | 53.9180 | 67.5924 | 76.4796
Table 2. Comparison results of category instance segmentation in the NWPU VHR-10 dataset. The abbreviations for the classes are AI: airplane, SH: ship, ST: tank, BD: baseball field, TC: tennis court, BC: basketball court, GT: ground track and field, HA: port, BR: bridge, and VE: vehicle.
Method | AI | SH | ST | BD | TC | BC | GT | HA | BR | VE
Mask R-CNN [4] | 51.523 | 59.181 | 84.496 | 82.353 | 72.799 | 76.122 | 95.502 | 54.244 | 41.62 | 56.207
CondInst [40] | 42.917 | 53.106 | 84.74 | 82.641 | 66.744 | 79.746 | 95.915 | 52.058 | 31.163 | 54.137
BlendMask [41] | 48.197 | 54.025 | 82.959 | 80.785 | 62.859 | 71.316 | 90.814 | 41.366 | 28.428 | 47.494
PointRend [13] | 51.898 | 60.735 | 88.436 | 84.808 | 71.152 | 35.623 | 97.386 | 55.063 | 5.623 | 57.206
BoxInst [42] | 17.029 | 48.981 | 81.02 | 78.874 | 66.322 | 65.318 | 92.951 | 5.591 | 6.694 | 39.677
Mask Transfiner [36] | 53.178 | 61.218 | 85.857 | 85.953 | 68.461 | 78.373 | 91.689 | 53.325 | 39.107 | 56.271
QTPR-Net | 52.389 | 61.561 | 80.879 | 83.850 | 72.714 | 82.950 | 97.904 | 56.331 | 40.960 | 57.604
Table 3. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the SSDD dataset.
Method | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Mask R-CNN [4] | 6916 | 72.1844 | 69.9374 | 95.5500 | 87.0571 | 67.9297 | 76.2323 | 46.1304
CondInst [40] | 7637 | 72.3112 | 69.3922 | 95.7354 | 85.8785 | 67.4935 | 75.8190 | 53.3663
BlendMask [41] | 2138 | 69.4160 | 67.4937 | 95.4400 | 84.7919 | 67.1617 | 70.0460 | 48.5545
PointRend [13] | 8340 | 71.2352 | 70.4962 | 96.2620 | 87.6452 | 69.1656 | 75.6742 | 48.0363
Mask Transfiner [36] | 13,127 | 72.3341 | 70.1995 | 95.5667 | 85.5286 | 68.9570 | 74.9495 | 41.0891
QTPR-Net | 8509 | 72.5246 | 71.5251 | 96.5326 | 89.6856 | 69.6230 | 78.6076 | 55.5446
Table 4. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the iSAID dataset.
Method | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Mask R-CNN [4] | 24,444 | 41.3018 | 34.2557 | 58.5570 | 34.8988 | 18.7573 | 41.8256 | 52.8351
CondInst [40] | 27,298 | 40.9067 | 32.9986 | 58.8544 | 32.6306 | 16.8397 | 41.6669 | 51.8629
BlendMask [41] | 12,528 | 41.1042 | 33.7449 | 59.0631 | 33.9347 | 18.7465 | 41.7782 | 49.8384
PointRend [13] | 14,673 | 38.9323 | 34.3458 | 57.9447 | 35.7689 | 18.8701 | 41.8448 | 49.2744
Mask Transfiner [36] | 16,318 | 41.2341 | 34.9860 | 59.0648 | 36.0690 | 19.1850 | 41.6859 | 52.2770
QTPR-Net | 17,993 | 42.4565 | 37.0704 | 60.9745 | 38.9419 | 22.4383 | 44.6925 | 54.7648
Table 5. Comparison results of category instance segmentation using the iSAID dataset. The abbreviations for the classes are: SH: Ship, ST: Storage Tank, BD: Baseball Diamond, TC: Tennis Court, BC: Basketball Court, GT: Ground Track Field, BR: Bridge, LV: Large Vehicle, SV: Small Vehicle, HE: Helicopter, SP: Swimming Pool, RO: Roundabout, SB: Soccerball Field, PL: Plane, and HA: Harbor.
Method | SH | ST | BD | TC | BC | GT | BR | LV | SV | HE | SP | RO | SB | PL | HA
Mask R-CNN [4] | 37.22 | 37.22 | 51.85 | 77.283 | 77.283 | 29.076 | 19.223 | 32.641 | 11.422 | 5.837 | 32.567 | 29.936 | 43.841 | 46.427 | 25.544
CondInst [40] | 35.672 | 33.45 | 52.891 | 76.27 | 38.966 | 19.575 | 17.984 | 32.736 | 9.569 | 6.922 | 30.794 | 33.166 | 40.161 | 39.004 | 27.818
BlendMask [41] | 36.994 | 34.104 | 51.746 | 77.669 | 37.041 | 18.773 | 18.885 | 34.141 | 11.574 | 7.098 | 33.224 | 34.672 | 38.267 | 45.473 | 26.513
PointRend [13] | 38.219 | 34.334 | 50.913 | 77.488 | 36.489 | 26.823 | 18.075 | 35.615 | 12.375 | 6.642 | 33.483 | 27.222 | 40.155 | 49.573 | 27.782
Mask Transfiner [36] | 38.177 | 34.723 | 53.458 | 77.21 | 39.663 | 26.95 | 20.053 | 33.894 | 12.338 | 5.97 | 33.664 | 29.567 | 44.053 | 48.602 | 26.469
QTPR-Net | 40.161 | 36.47 | 53.081 | 78.599 | 36.713 | 33.336 | 22.706 | 37.529 | 13.247 | 6.929 | 34.975 | 35.568 | 43.802 | 52.553 | 30.388
Table 6. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the NWPU VHR-10 dataset.
NET | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Baseline | 8922 | 68.7919 | 67.7898 | 91.8704 | 72.6399 | 60.0279 | 65.9937 | 75.1733
QTA | 9157 | 69.0619 | 66.947 | 91.6985 | 72.7587 | 58.4274 | 64.7118 | 75.8612
MultiHead | 9713 | 68.1484 | 66.9607 | 89.9188 | 72.9379 | 54.734 | 64.508 | 76.8688
TransQTA | 10,571 | 69.8459 | 68.7143 | 93.0919 | 75.741 | 53.918 | 67.5924 | 76.4796
Table 7. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the SSDD dataset.
NET | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Baseline | 8340 | 71.2352 | 70.4962 | 96.2620 | 87.6452 | 69.1656 | 75.6742 | 48.0363
QTA | 8526 | 71.0313 | 69.9524 | 94.6564 | 88.1010 | 67.6106 | 78.2828 | 62.0198
MultiHead | 8044 | 71.1569 | 70.3533 | 95.5304 | 88.2287 | 68.2050 | 77.8009 | 57.5495
TransQTA | 8509 | 72.5246 | 71.5251 | 96.5326 | 89.6856 | 69.6230 | 78.6076 | 55.5446
Table 8. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the iSAID dataset.
NET | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
Baseline | 14,673 | 38.9323 | 34.3458 | 57.9447 | 35.7689 | 18.8701 | 41.8448 | 49.2744
QTA | 18,993 | 41.9985 | 36.9622 | 60.7912 | 38.9254 | 21.7348 | 44.0669 | 53.9221
MultiHead | 17,224 | 39.5658 | 34.8422 | 58.3206 | 36.4288 | 18.7572 | 41.9450 | 51.3563
TransQTA | 17,993 | 42.4565 | 37.0704 | 60.9745 | 38.9419 | 22.4383 | 44.6925 | 54.7648
Table 9. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the NWPU VHR-10 dataset.
Layer | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
No cascade | 8922 | 68.7919 | 67.7898 | 91.8704 | 72.6399 | 60.0279 | 65.9937 | 75.1733
1 | 9684 | 68.9075 | 67.5039 | 92.3354 | 73.2900 | 56.7301 | 66.6510 | 76.3986
2 | 10,376 | 69.8459 | 68.7143 | 93.0919 | 75.7410 | 53.9180 | 67.5924 | 76.4796
3 | 11,242 | 68.4582 | 67.4572 | 90.7146 | 74.2923 | 59.2013 | 65.6678 | 76.0052
Table 10. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the SSDD dataset.
Layer | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
No cascade | 8340 | 71.2352 | 70.4962 | 96.2620 | 87.6452 | 69.1656 | 75.6742 | 48.0363
1 | 9287 | 71.9147 | 70.3033 | 96.3966 | 87.3504 | 69.7567 | 74.2963 | 40.4752
2 | 8509 | 72.5246 | 71.5251 | 96.5326 | 89.6856 | 69.6230 | 78.6076 | 55.5446
3 | 10,767 | 71.7811 | 70.7983 | 96.3289 | 96.3289 | 69.2145 | 76.6380 | 60.0000
Table 11. Comparison results of memory usage, object segmentation, and instance segmentation AP values for the iSAID dataset.
Layer | Memory (MB) | AP_bbox | AP_segm | AP_50 | AP_75 | AP_S | AP_M | AP_L
No cascade | 14,673 | 38.9323 | 34.3458 | 57.9447 | 35.7689 | 18.8701 | 41.8448 | 49.2744
1 | 18,111 | 42.3277 | 36.7066 | 60.3668 | 60.3668 | 60.3668 | 44.4097 | 44.4097
2 | 17,993 | 42.4565 | 37.0704 | 60.9745 | 38.9419 | 22.4383 | 44.6925 | 54.7648
3 | 19,024 | 42.0979 | 36.9179 | 36.9179 | 39.0054 | 21.5603 | 21.5603 | 55.5409

Share and Cite

MDPI and ACS Style

Zhang, X.; Shen, J.; Hu, H.; Yang, H. A New Instance Segmentation Model for High-Resolution Remote Sensing Images Based on Edge Processing. Mathematics 2024, 12, 2905. https://doi.org/10.3390/math12182905

