Weakly Supervised Instance Segmentation in Aerial Images via Comprehensive Spatial Adaptation

Xu, Jingting; Luo, Peng; Mu, Dejun

doi:10.3390/rs16244757

Open AccessArticle

Weakly Supervised Instance Segmentation in Aerial Images via Comprehensive Spatial Adaptation

by

Jingting Xu

¹

,

Peng Luo

¹

and

Dejun Mu

^1,2,*

¹

School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710072, China

²

Shenzhen Research Institute, Northwestern Polytechnical University, Xi’an 710072, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(24), 4757; https://doi.org/10.3390/rs16244757

Submission received: 22 October 2024 / Revised: 10 December 2024 / Accepted: 18 December 2024 / Published: 20 December 2024

Download

Browse Figures

Versions Notes

Abstract

:

Weakly supervised instance segmentation (WSIS) only employs image-level supervision to identify instance class labels and create segmentation masks, drawing increasing attention. Currently, existing WSIS methods primarily focus on activating the most discriminative regions and then inferring the entire instance by analyzing inter-pixel relationships within those regions. However, these identification regions are typically concentrated in limited but critical regions or are mistakenly activated in the background region, making it challenging to address scale variations among instances. Furthermore, different aerial instances often appear in close proximity, resulting in the merging of multiple instances of the same class. To tackle these challenges, we propose a comprehensive approach called Comprehensive Spatial Adaptation Segmentation (CSASeg). Specifically, the self-adaptive spatial-aware enhancement network (SSE) identifies extensive regions by analyzing spatial consistency within the class semantic map. Then, we develop a multi-level projection field (MPF) module to significantly enhance instance-level discrimination through deep-to-shallow residual estimation. Additionally, a foreground enhancement module is incorporated into SSE to reduce background noise while enhancing foreground details, significantly increasing the effectiveness of instance analysis. Extensive experiments conduct on three challenging datasets, iSAID, NWPU VHR-10.v2, and SSDD, demonstrate the competitiveness of our proposed approach.

Keywords:

instance segmentation; image-level supervision; computer vision

1. Introduction

With the rapid advancement of remote sensing technology, obtaining high-resolution aerial images has become much easier, presenting new challenges and opportunities for content interpretation [1,2,3,4]. To process these images effectively, it is essential to develop efficient methods. Instance segmentation, as an advanced image-analysis technology, has attracted widespread attention. It can not only accurately locate and classify objects in images, but also distinguish different object instances within the same category. This technology offers significant support for information extraction and scene understanding, particularly in urban planning, ecological monitoring, and agricultural management [5,6,7,8]. Therefore, it is crucial to develop effective instance-segmentation technology.

Instance segmentation has achieved remarkable advancements within fully supervised frameworks, primarily driven by the availability of pixel-level annotated data [9,10,11,12]. As shown in Figure 1, this approach typically requires precise manual contour annotations for numerous instances in the training dataset [13,14,15]. However, this poses significant challenges due to the widely distributed and close-range nature of aerial instances, making contour annotation a laborious, expensive and sometimes unreliable task. Recent research has aimed to reduce supervision needs using cost-effective bounding boxes or points for instance segmentation [16,17,18,19,20,21]. Following these trends in weakly supervised learning, we build on the work of [22,23,24] by further relaxing the supervision requirements. This allows the instance-segmentation network to be trained using only image-level annotations, specifically classification labels. Importantly, these image-level annotations can be obtained directly from existing aerial classification datasets. This approach not only simplifies the process but also significantly reduces the reliance on manual detailed contour annotation. However, the asymmetry between simple image-level annotations and complex instance segmentation poses a significant challenge in developing robust instance-segmentation models.

Current research on weakly supervised instance segmentation (WSIS) primarily adopts a two-stage learning paradigm [25,26,27]. Initially, class activation maps (CAMs) are identified, followed by the inference of specific instance masks through the exploitation of inter-pixel relationships within these regions. The first end-to-end WSIS method, known as IRNet [28], generates CAM (semantic segmentation masks) using an image classifier and derives instance masks based on them by inferring offsets and boundaries. Since IRNet, many subsequent studies have been devoted to further enhancing the performance of WSIS by either refining class activation maps [29,30,31,32] or adopting more robust learning strategies [33,34,35].

Despite the significant advancements reported in the aforementioned methods, WSIS remains a challenging issue characterized by two primary difficulties. As shown in Figure 2, most WSIS approaches are non-convex optimization procedures due to weak supervision, resulting in the model focusing on only a small crucial part rather than the comprehensive class activation. Moreover, class activation also faces ambiguity between the foreground and background. To address these challenges, much of the existing research concentrates on employing segment anything model (SAM) [36,37,38], linguistic cues [34,39], or adversarial erasing techniques [40,41,42,43] to propagate image-level labels from the most critical areas to adjacent areas, thereby enhancing the overall understanding of the class activation. Although these methods are effective, they often depend on prior knowledge or complicated mechanisms.

Most importantly, the lack of pixel-level ground truth creates significant challenges for WSIS when it comes to solving the instance-separation problem. In particular, aerial images often display multiple instances of the same category in close proximity. To tackle this issue, IRNet has developed a boundary-aware algorithm. However, it still struggles to obtain precise instance-level information, as the boundaries are derived from inter-class relations rather than inter-instance relations.

To address the first limitation, a novel self-adaptive spatial-aware enhancement network (SSE) is proposed, designed to infer the complete class activation region through spatial consistency learning. Initially, we apply augmentations or perturbations to the same image, subsequently introducing a consistency loss to ensure that the activation maps for the same class remain invariant under different augmentation or perturbation conditions. As this loss function is optimized, the elements of the entire semantic range will complement each other, leading to global activation. Concurrently, a foreground-enhancement module is integrated into SSE, utilizing two techniques—mean subtraction and binary multiplication—to effectively mitigate background noise. This process clarifies the boundary between foreground and background, thereby enhancing the effectiveness of instance analysis.

To address the second challenge, a multi-level projection field (MPF) module has been developed. MPF improves instance-segmentation performance by combining deep and shallow features to generate a more accurate projection field. Deep features provide an initial projection field estimate, while shallow features refine this estimate by calculating the residual. Cross-layer feature fusion helps to better combine the advantages of both. The integration of SSE and MPF forms an end-to-end comprehensive spatial adaptation segmentation (CSASeg) system for generating instance mask predictions. The evaluation conducted with the large-scale remote sensing image dataset iSAID [14], NWPU VHR-10.v2 [44], and SSDD [45] demonstrates the effectiveness of our method. In summary, this paper presents three key contributions:

(1): Preliminary research on WSIS for aerial images: We propose CSASeg, a groundbreaking weakly supervised method specifically designed for aerial instance segmentation.
(2): Self-adaptive spatial-aware enhancement network: We infer a comprehensive and well-refined class activation map by spatial consistency constraints and foreground-enhancement techniques.
(3): Multi-level projection field estimation: We propose the MPF, which is capable of deriving instance masks through multi-level projection field estimation.

2. Related Work

This section provides a detailed overview of fully supervised instance-segmentation techniques, including both proposal-based and proposal-free methods. Next, we review weakly supervised instance segmentation.

2.1. Fully Supervised Instance Segmentation

Instance segmentation has primarily concentrated on fully supervised settings, where all instances in the training dataset are manually annotated with contours. Currently, the majority of leading methods are based on a proposal-based pipeline. Among these, Mask R-CNN [46] and its variations demonstrate state-of-the-art performance. Cascade R-CNN [47] further effectively addresses model overfitting by applying multiple detection inference steps under different intersection over union (IoU) thresholds. HTC [48] proposes a multi-task, multi-stage hybrid cascade structure that enhances spatial context information. MS R-CNN [49] assesses mask quality by comparing the IoU between instance segmentation and ground truth, linking this quality to classification confidence.

In remote sensing, RMSIN [50] is a novel method for remote sensing image segmentation that significantly enhances segmentation accuracy using multi-scale interaction and adaptive rotational convolution technology. Some studies [51] propose a squeeze-and-excitation region extractor that selects a subset of relevant features from a specific layer. MAISE-Net [52] identifies that current frameworks either lack adequate support for mask interaction or offer minimal effectiveness regarding interaction, which leads to the proposal of a mask attention integration and scale-enhancement system. Furthermore, researchers have developed a series of ROI-based enhancement algorithms [53,54], which explore the interaction mechanism with surrounding context features, thereby advancing the development of aerial instance-segmentation technology.

However, due to the complicated configuration of proposal hyperparameters, they are less effective than proposal-free algorithms in terms of generality and real-time capability. Proposal-free techniques usually use an additional network to determine the center points for locating instances. FCOS [55] is a widely used technique for center point parsing. Several instance-segmentation algorithms have been developed based on the FCOS framework. Polar Mask [56] creates object masks by linking sampled points on the contour according to their polar coordinate distances from the center point. EmbedMask [57] integrates semantic segmentation with pixel embedding, allowing for the aggregation of pixels belonging to the same instance. It utilizes pixel-embedding information to accurately define the boundaries and shapes of each instance.

In addition to the central point, the SOLO series [58,59] effectively utilizes a grid to enhance instance localization. However, the granularity of the grid directly impacts its ability to identify smaller objects. Recently, query-based strategies [60] have proven more effective than both proposal-based and proposal-free algorithms. DiffusionInst [61] represents instances as vectors, treating instance segmentation as a denoising process that transforms noise into vectors. RSPrompter [5] creates collections of prompts related to specific categories to improve SAM for instance segmentation. SAM-RSIS [6] aims to utilize all the SAM structures for the remote sensing instance-segmentation task, enhancing performance through fine-tuning both the object-detection head and the segmentation head. Although fully supervised techniques achieve higher accuracy, they depend heavily on extensive training data that requires pixel-level annotations, making them impractical for aerial applications.

2.2. Weakly Supervised Instance Segmentation

Weakly supervised learning has received considerable attention as a way to minimize the need for pixel-level annotations. Recent studies are focused on efficient image-level annotations for instance-segmentation tasks. Among proposal-based approaches, PRM [62] proposes a class peak response mechanism to highlight the key localization clues in each instance and generate the corresponding instance mask through backpropagation. IAM [63] introduces a more comprehensive instance-filling strategy using random walks to obtain accurate instance masks. CountIns [64] accurately counts instances by creating an object density map, effectively mitigating the misclassification of occluded instances. These methods described above first generate class activation maps (CAMs) to identify the local maximum as the initial localization point of the instance. Next, back-propagation is applied to determine the regions in the feature map that contribute most to these localization points. These identified regions are then used to achieve instance segmentation.

Another approach in proposal-based methods involves using weakly supervised object-detection networks. Notable works include PDSL [65], LIID [22], and Label-PEnet [66]. As the visual foundation model evolves, WeakSAM [23] utilizes classification cues as SAM prompts to automatically generate proposals. It uses proposals as input to train a weakly supervised object-detection network and addresses the issues of predicted mask incompleteness and noise through adaptive mask generation and region of interest (RoI) drop regularization. However, WeakSAM emphasizes the performance improvement brought by incorporating an additional SAM model, which deviates from the initial goal of weak supervision.

Additionally, the proposal-free IRNet [28] utilizes class activation maps (CAMs) to establish intra-class relationships and derive instance masks by predicting offsets and boundaries. According to the deterministic principle, the BESTIE [33] and CL4WSIS [24] methods select a non-overlapping instance mask as a pseudo-instance mask. BESTIE first identifies candidate regions of interest in the image by analyzing connectivity. It introduces a peak attention module to sample regions that contain only a single peak response from the candidate instance regions, marking them as pseudo ground truth. CL4WSIS extends the BESTIE method for continuous learning of new categories and addresses the category drift issue through instance-aware guidance. TAPNet employs second-order pixel relations to clarify uncertain instance interactions within the BESTIE framework, effectively addressing the instance adhesion issue.

Both proposal-based and proposal-free methods significantly benefit from the refinement step to enhance segmentation quality. CIM [34] optimizes the coherence and consistency of the segmentation results by measuring the overlap between the predicted mask and the surrounding masks, adjusting and diffusing the predicted mask. Incorporating additional refinement branches will enhance the model’s sophistication. Our method aims to enhance performance while strictly following the weak supervision principle and does not depend on additional segmentation proposals or extra refinement networks.

3. Our Approach

In this section, we provide a detailed description of the proposed CSASeg method. Section 3.1 outlines our approach, and Section 3.2 elaborates on the proposed spatial-aware class activation method. Finally, Section 3.3 presents the MPF to generate instance-segmentation predictions.

3.1. An Overview of CSASeg

We have a set of aerial training images with image-level supervision, denoted as

D = {\{I_{n}, Y_{n}\}}_{n = 1}^{N}

.

N

represents the total number of images,

I_{n}

denotes the n-th image, and

Y_{n}

indicates the corresponding class label. The class label

Y_{n}

is depicted as a

C

-dimensional vector, where the length of the vector corresponds to the total number of classes.

Figure 3 illustrates the overall architecture of our CSASeg approach. This architecture integrates the proposed Spatial Segmentation-Enhancement (SSE, see Section 3.2) module and the Multi-Level Projection Field (MPF, see Section 3.3) block into a backbone network, such as ResNet50. Specifically, the SSE module first identifies broad class activation maps to locate the spatial semantic regions of instances. Unlike BESTIE, which only highlights the most discriminative activation regions, we implement a spatial-aware adaptive learning mechanism that facilitates complementary matching between image samples and their affine transformations, providing more comprehensive class activation regions. At the same time, we improve these areas using the proposed foreground enhancement (FE), which aims to create clearer boundaries while reducing background noise. Subsequently, MPF enhances the projection field by mapping the learned features to new spaces at various levels. This approach captures more intricate projection relationships, enhancing the ability to differentiate between instances effectively. To effectively optimize the CSASeg network, we propose the powerful loss function detailed below.

L = L_{c l s} + L_{s r} + L_{s a r} + L_{m p f}

(1)

where

L_{c l s}

,

L_{s r}

, and

L_{s a r}

refer to the image classification loss, spatial-aware regularization, and spatial-aware refinement loss in the first SSE stage, respectively, while

L_{m p f}

indicates the projection field loss in the MPF block.

3.2. Self-Adaptive Spatial-Aware Enhancement Network

SSE module uses a Siamese Network to extract features, which consists of two sub-networks with the same structure and shared weights. We employ ResNet-50 [67] network architecture for both branches. To enhance its suitability for our task, we remove the last fully connected layer of the network and replace it with a new convolutional layer with

C

channels. Here,

C

represents the number of dimensions corresponding to the image category labels. By employing global average pooling on the

C

-dimensional feature maps, we infer the classification results from

\hat{Y}

. The classification loss is defined as follows:

L_{c l s} = - \frac{1}{C} \sum_{c = 1}^{C} [Y_{i} log {\hat{Y}}_{i} + (1 - Y_{i}) log (1 - {\hat{Y}}_{i})]

(2)

We initially train a classification network and subsequently eliminate the pooling function to adapt it for a segmentation task. To this end, we normalize the feature maps of

C

channels to obtain the segmentation region for each category

M_{c} (x)

, which identifies the parts of the input image that significantly affect the model-classification decisions.

M_{c} (x) = \frac{ReLU f_{c} (x)}{\max f_{c} (x)}

(3)

Here,

f_{c}

represents the

C

-dimensional feature maps, and x denotes a 2D spatial coordinate within

f_{c}

. While these class activation maps effectively highlight significant regions, they may only indicate areas activated by part discrimination, failing to accurately delineate object boundaries. To overcome these limitations, we propose a module that enhances foreground while ensuring spatial consistent learning. We initially design a spatially aware consistent learning method to adaptively mine semantically complementary parts between paired images.

Consequently, a spatial affine transformation

A (\cdot)

applies to the training images, including re-scaling, flipping, and rotation. As illustrated in Figure 3, we can also derive class activation map

M_{c} (A I_{i})

. Subsequently, we calculate the spatial consistency constraint across diverse spatial transformations of the identical input image.

L_{s r} = {∥M_{c} (A I_{i}) - A M_{c} (I_{i})∥}_{1}

(4)

Here, for each image,

A I_{i}

denotes any spatial affine transformation of the image, and

M_{c} (x)

represents the corresponding class activation maps. In the early stages of training, each branch may learn different specific discriminant regions, which may overlap or differ. By proposing spatial consistency regularization

L_{s r}

, the model gradually adjusts the parameters of each branch during training so that the specific discriminant regions of different branches gradually complement each other. As training progresses, the model gradually reaches a stable state, and the prediction results of different branches for different affine transformations of the same image show a high degree of consistency, thereby achieving overall performance optimization and training convergence.

Additionally, the segmentation of various objects is influenced not only by their appearance but also by the surrounding background. We further develop a foreground-enhancement module to further optimize the preliminary class activation maps, which is able to enhance foreground and boundary features in class activation maps. Figure 4 gives an illustration of the structure. Specifically, the pixel

x_{i}

from initial class activation map

M_{c} (x)

can be revised by pixel

x_{j}

:

{\tilde{M}}_{c} (x_{i}) = \sum_{j \in Ω} r (x_{i}, x_{j}) M_{c} (x_{j})

(5)

where

{\tilde{M}}_{c}

represents the revised class activation map, and

Ω

means the set of all pixels on a feature map of size

H * W

.

r (x_{i}, x_{j})

calculates the correlation between the pixel vectors

x_{i}

and

x_{j}

.

r (x_{i}, x_{j}) = {(x_{i} - μ)}^{T} (x_{j} - σ) + μ^{T} x_{j}, μ = \frac{1}{Ω} \sum_{i \in Ω} x_{i}, σ = \frac{1}{Ω} \sum_{j \in Ω} x_{j}

(6)

Since background noise usually shows global or local consistency, the de-meaning operation can help suppress this consistency. Specifically,

μ

and

σ

are the mean values of all feature vectors in the feature map, which reflects the global trend in the feature space. By subtracting the mean

μ

and

σ

from each feature vector

x_{i}

and

x_{j}

, it helps to remove the global offset of the background and make the foreground features more prominent. The

{(x_{i} - μ)}^{T} (x_{j} - σ)

term helps capture the cosine similarity between foreground features, especially within same-category regions.

Based on the de-meaning operation, the

μ^{T} x_{j}

term can further adjust the distribution of the feature vectors so that

x_{j}

has the same impact on all

x_{i}

. This term tends to focus on boundary pixels among all pixels because the boundary pixels provide the most valuable activations when looking at the overall impact of all pixels. It enhances boundaries, especially in complex backgrounds.

r_{1} (x_{i}, x_{j})

not only considers the similarity of foreground features but also boundary constraints.

Foreground enhancement allows us to further refine the original class activation map and obtain a refined result. Here, given the revised class areas set

\tilde{M_{c}}

, the two branches of the Siamese network are optimized using the spatial-aware refinement loss functions

L_{s a r}

:

L_{sar} = ∥M_{c} (A I) - A {\tilde{M}}_{c} (I)∥ + {∥A M_{c} (I) - {\tilde{M}}_{c} (A I)∥}_{1}

(7)

During the test phase, we maintain one branch as the inference model to generate the final class activation maps.

3.3. Multilevel Projection Field Module

After obtaining the category-activation representation (instance-agnostic), we propose a powerful Multi-level Projection Field (MPF) estimation method for instance segmentation, which aims to exploit the coarse-to-fine multi-scale information in the network feature pyramid to effectively distinguish different instances within the same category and alleviate the challenges brought by instance variations. The projection field provides a vector representing the offset of that pixel relative to the central reference position of the instance. This field is composed of two channels, representing the horizontal and vertical offsets, respectively. By predicting the projection field, each pixel can be associated with its corresponding instance center point, thereby facilitating instance-level segmentation.

MPF improves the performance of instance segmentation by combining deep and shallow features to generate a more accurate projection field. Despite deep features with low resolution, these are more robust to instance variations. These features are better able to capture the overall structure and contours of the instance, although the details may not be clear enough. They thus provide an initial projection field estimate. This initial projected field provides the approximate position and orientation of the instance. Shallow features have higher resolution and can provide more detailed information. Based on the shallow features, the projection field can be further refined by calculating the projection field residual, which is equivalent to adding more details to the initial rough estimate to make the projection field more accurate. When using shallow features to predict projection residuals, deep features are integrated into shallow features, and a cross-layer feature-fusion method is proposed. This fusion method can reduce the distance between features at different levels, reduce the amplitude of the residual, and make it easier to infer.

We employ the ResNet-50 network to serve as the backbone for our MPF. As delineated in Figure 5, we extract an L-level feature pyramid

{\{F_{l} \in R^{W_{l} \times H_{l} \times C_{l}}\}}_{l = 1}^{L}

. Given

{\{F_{l}\}}_{l = 1}^{L}

, our network learns multi-level projection field

{\{P_{l} \in R^{W_{l} \times H_{l} \times 2}\}}_{l = 1}^{L}

. We adopt the weakly supervised instance-segmentation method BESTIE [33] to obtain the ground truth of the projection field

{\{{\hat{P}}_{l} \in R^{W_{l} \times H_{l} \times 2}\}}_{l = 1}^{L}

. Specifically, for each layer l of features, the network outputs a two-channel projection field

P_{l}

, where each pixel position

(x, y)

has a corresponding two-dimensional vector

P_{l_x} (x, y)

and

P_{l_y} (x, y)

. For each pixel position

(x, y)

, calculate its shifted coordinates

(x^{'}, y^{'})

to achieve instance segmentation:

\begin{matrix} x^{'} = x + P_{l_x} (x, y) \\ y^{'} = y + P_{l_y} (x, y) \end{matrix}

(8)

It is difficult to directly infer the projection field due to the diversity of object scales and shapes commonly displayed in remote sensing images, as well as the ambiguity caused by close instances. Therefore, we adopt a deep-to-shallow multi-level feature for estimation.

Given the deepest feature map

F_{L}

, which has the largest receptive field and strong long-distance relationship modelling ability, an initial projection field estimation

P_{L} \in R^{W_{L} \times H_{L} \times 2}

can be predicted. Start with an initial low-resolution projection field estimate

P_{L}

, the most straightforward strategy is to progressively refine

P_{L}

by gradually updating at each level of the feature pyramid to finally achieve a high-resolution projection field estimation

P_{1} \in R^{W_{1} \times H_{1} \times 2}

:

P_{l} = Conv (F_{l}, F_{l + 1}^{↑}, 2 P_{l + 1}^{↑}) \in R^{W_{l} \times H_{l} \times 2}

(9)

Here,

P_{l + 1}

and

F_{l + 1}

from the previous layer need to be upsampled to increase the spatial resolution (indicated by “↑”), and

P_{l + 1}

needs to be scaled by a factor of 2 to align with the resolution of the pyramid feature at level l. In this way, the estimation for MPF can be progressively refined by integrating the higher-level prediction

P_{l + 1}

with the lower-layer feature

F_{l}

. Since the size of the projected field is enlarged in each estimation step, the output has an inconsistent range of values, which makes it difficult for regression models to handle. Therefore, we propose the residual optimization mechanism to achieve more precise MPF estimation, in which the new estimation result is obtained by adding the residual of the current step to the estimation result of the previous step.

Instead of estimating the complete projection field directly

P_{l}

, the network estimates the difference (i.e., residual

▵ P_{l} \in R^{W_{l} \times H_{l} \times 2}

) between the current layer and the previous layer estimate

P_{l + 1}

. Equation (9) is consequently enhanced as:

P_{l} = \underset{Δ P_{l}}{\underset{︸}{Conv (F_{l}, F_{l + 1}^{↑})}} + 2 P_{l + 1}^{↑}

(10)

Residual magnitudes are usually smaller than complete projections, which means the range of values is similar at different pyramid levels and regression training is more stable.

For

Conv (F_{l}, F_{l + 1}^{↑})

operation, convolution operation is performed between the l-th layer and the upsampled l + 1-th layer feature map. We propose multi-level context fusion convolution (MCF). In detail, we use 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolution kernels to perform multi-scale feature extraction on

F_{l + 1}

. Among them, 1x1 convolution realizes inter-channel feature fusion and optimizes dimensions, 3 × 3 convolution captures local textures and edges to help accurate segmentation of small targets, while 5 × 5 and 7 × 7 convolution captures a wider context and deepens the understanding of object spatial relationships and backgrounds. Understand and adapt to the large-scale structure of remote sensing images. By integrating these convolution operations, the model learns rich multi-level features and significantly enhances expression capabilities. More importantly, this multi-scale feature-extraction strategy also helps to improve the alignment between

F_{l + 1}

and

F_{l}

, ensuring that feature information from different levels can be more accurately aligned during the upsampling process.

Finally, we use the regression loss function to measure the difference between the predicted value and the true value of the projection field and specifically choose the mean absolute error (MAE) as the loss function

L_{mpf}

. The formula is as follows:

L_{mpf} = \frac{1}{L \times N} \sum_{l = 1}^{L} \sum_{i = 1}^{N} |P_{l}^{i} - {\hat{P}}_{l}^{i}|

(11)

where

{\hat{P}}_{l}^{i}

is the ground truth value,

P_{l}^{i}

is the predicted value, and

N = H * W * 2

is the number of samples. By minimizing

L_{mpf}

, our model can learn the trend of the projection field more accurately and improve the accuracy of instance segmentation.

4. Experiments

To demonstrate the effectiveness of the CSASeg method, we first introduce the iSAID, NWPU VHR-10.v2, SSDD dataset, evaluation metrics, and implementation details. Subsequently, we conduct ablation studies to assess the performance of each component. Lastly, we compare our results with both fully supervised and state-of-the-art weakly supervised techniques.

4.1. Implementation Details

A comprehensive evaluation of the method is conducted using the Aerial Images Dataset (iSAID) [14], the NWPU VHR-10.v2 [44] dataset, and SAR Ship Detection Dataset (SSDD) [45]. iSAID serves as a benchmark, featuring 2806 high-resolution images with dimensions ranging from 800 to 13,000 pixels and encompassing a total of 655,451 annotated instances. The dataset contains 15 semantic categories that align with DOTA [69], namely, plane (PL), ship (SH), storage tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground track field (GT), harbor (HA), bridge (BR), large vehicle (LV), small vehicle (SV), helicopter (HE), roundabout (RO), swimming pool (SP) and soccer ball field (SB). During the dataset-partitioning process, the original image is divided into 1/2 for the training set, 1/6 for the validation set, and 1/3 for the test set. Following the common practice, images are clipped to size

256 \times 256

. After cropping, the training set contains 260,688 images.

The NWPU VHR-10.v2 dataset contains 650 high-resolution images ranging from

533 \times 579

to

1728 \times 1028

pixels. After cropping and adjusting all images to

400 \times 400

pixels, the resulting training dataset contains 3075 images. The dataset contains 10 object classes and 3775 instances [70]. The ten categories are aeroplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. We follow the official setting for dataset splits using 70% and 30% for training and testing, respectively.

The SSDD is the first public dataset at home and abroad dedicated to ship target detection in SAR images. It contains 1160 SAR images and annotation information of 2456 ships, with an average of 2.12 ships per image. We randomly use 80% of the images for training and the remaining 20% for testing.

To ensure a fair comparison, we employ the standard COCO-style average precision (AP) metric to evaluate the performance of our instance-segmentation task [25]. Specifically, we report the COCO-style metrics mAP25 (AP at an IoU threshold of 0.25), mAP50 (AP at an IoU threshold of 0.5), and mAP75 (AP at an IoU threshold of 0.75). We also calculate the average of the mean Average Precisions (mAPs) across IoU values ranging from 0.05 to 0.95. Furthermore, the mean Intersection over Union (mIoU) is utilized as a metric for assessing the class activation results. It is essential to emphasize that the pixel-level ground truth is employed exclusively for evaluation purposes and does not impact the model-training process.

Our proposed framework utilizes RTX 3090, PyTorch 1.8.1, cuda 11.1 for inferring class activation maps with SSE and projection fields with MPF. As shown in Table 1, different hyperparameters are set for the SSE and MPF branches. For the iSAID dataset, the batch size of SSE is 32, the number of training epochs is 6, the initial learning rate is 0.005, and the weight decay is

5 \times 10^{- 4}

. The batch size of MPF is 64, the number of training epochs is increased to 50, the initial learning rate is

1 \times 10^{- 4}

, and the weight decay is

1 \times 10^{- 4}

. For the NWPU VHR-10.v2 dataset, the SSE training epochs are 12, the learning rate is set to 0.05, and the weight decay is set to

5 \times 10^{- 4}

. The MPF training epochs are 100, the learning rate is set to

1 \times 10^{- 5}

, and the weight decay is also adjusted to

1 \times 10^{- 4}

. On the SSDD dataset, the batch size of SSE is 4, the training epochs are 20, the learning rate is set to 0.001, and the weight decay is

1 \times 10^{- 3}

. The batch size of MPF is 8, the number of training epochs is increased to 100, the learning rate is reduced to

1 \times 10^{- 5}

, and the weight decay is set to a smaller

1 \times 10^{- 4}

. The adjustment of these hyperparameters is aimed at optimizing the respective training processes to improve the performance of the model.

4.2. Detailed Comparisons and Ablation Study

Before comparing our model with other competitors, we conducted a series of ablation studies to assess the effectiveness of different design choices and hyper-parameter settings. These include the proposed affine transformation, foreground enhancement, and multi-layer fusion. All ablation studies are conducted on the iSAID validation set.

We evaluate the model by diving into its performance (mAP) and runtime efficiency (FPS) to gain valuable insights. Table 2 shows the results in detail, where FPS is frames per second, which indicates the number of image frames a model can process each second when handling image sequences.

All three proposed modules show significant improvements in model performance. In evaluating individual components, the MPF (mAPs 24.7%) showed the greatest performance improvement, followed by the FE (mAPs 24.4%) and SSE (mAPs 24.2%). Remote sensing images contain complex scenes and multi-scale targets, and it is difficult for single-scale features to capture all details. The MPF module effectively utilizes multi-scale information by combining deep-level and shallow-level features to improve the recognition ability of targets at different scales. Deep features capture the overall structure and high-level semantic information of the object, such as building outlines and road directions, providing initial projection field estimates for instance segmentation. Shallow-level features contain local details and texture information. By calculating the residual (the difference between the deep-level feature estimate and the real situation), the initial estimate is further corrected and the segmentation accuracy is improved. In addition, targets are densely arranged in remote sensing images, and MPF can more accurately distinguish each instance through residual estimation from depth to shallow layer, significantly improving instance-level recognition capabilities.

The FE module significantly improves the performance of semantic segmentation of remote sensing images through two branches: foreground perception and boundary perception. The foreground-perception branch focuses on the overall information and main features of the target area, effectively reducing the impact of background noise on the segmentation results. On the other hand, the boundary-aware branch focuses on the object boundary and edge information. Especially in scenes where objects are densely arranged, the boundary-aware branch can better distinguish adjacent objects and avoid segmentation errors. Through consistency loss, SSE is able to learn more robust feature representations, thereby capturing a wider range of category-activation regions. This is especially important for remote sensing images, where objects may have multi-scale and multi-pose characteristics.

The performance improvement of each module demonstrates that optimizing spatial perception and the projection field is essential for better results. Using the model in combination yields better results than in isolation. In particular, the combination of FE and MPF shows the most significant complementary characteristics. The best results, 26.1%, come from integrating all three modules, which create synergistic effects when combined.

To evaluate whether the model can meet the needs of real-time processing, we use the frames per second (FPS) as the evaluation criterion. According to the performance of each independent module, the SSE module shows high real-time performance, followed by the MPF and FE modules. During the SSE model training, a Siamese Network structure consisting of two identical sub-networks with shared weights was used. However, during the testing phase, this structure did not affect the FPS because the shared weights only needed to be calculated once during the forward-propagation process. Compared with the baseline network, MPF adds two convolution operations and introduces addition calculations. For an input feature map with dimensions

H \times W \times C

, the FE module involves several calculations: embedding feature calculation

O (C \times C^{'} \times H \times W)

, where

C^{'}

is the dimension of the embedded features; similarity matrix calculation

O (H \times W \times H \times W)

; normalization

O (H \times W \times H \times W)

; and weighted summation

O (H \times W \times H \times W)

. The overall computational complexity is

O (C \times C^{'} \times H \times W + H^{2} \times W^{2})

. Since

H \times W

is usually large, the

H^{2} \times W^{2}

term dominates the overall computational complexity, resulting in a significant increase in the computational complexity of the FE module. After all modules are combined, the overall frame rate of the system is 12.97 FPS, which is 3.38 FPS lower than the baseline model, but still able to meet the real-time needs of image processing.

We present additional experiments to further explore the design of the SSE module. To encourage complementary visual patterns, we apply an affine transformation of the data to one of the branches of the Siamese network, and then use the results from the two branches to construct a spatial-aware constraint. Theoretically, the transformation in Equation (4) can be any transformation. Following the principle of image processing, the four candidate data affine transformations include horizontal flipping, random rotation of [−20, 20] degrees, random brightness fluctuation in the range of −10 to 10, and rescaling with a down-sampling rate of 0.6. As shown in Table 3, rescaling can bring more significant performance improvements than rotation. This is because instance-segmentation models require certain scale invariance to handle objects of different sizes. In contrast, although rotation is also an effective data-enhancement technique, its impact on feature extraction is more complex, so the performance improvement is not as significant as rescaling.

We explore the embedding strategies of the FE module in different network levels, including integrating it at any feature layer of the network and after the class activation maps (CAMs). Table 4 shows that after embedding the FE module into the CAM, the mean intersection-over-union ratio (mIoU) of the model is significantly increased to 43.6%. Compared with embedding it into the feature map, the performance is improved by 3.1%. This finding shows that FE at the level of CAM can more effectively capture and strengthen key features, thereby significantly improving the segmentation accuracy and robustness of the model. Since the CAM can highlight the important areas in the input image that are related to a specific category, it directly reflects the model’s understanding of high-level semantic information, making these areas closer to the final decision-making process. Embedding the FE module in these important areas can further strengthen these key areas, making it easier for the model to capture and utilize this high-level semantic information.

We also employ the GABlock [71], CCBlock [72], NLBlock [68], GCBlock [73], and SCPBlock [9] for the foreground attention. In Table 5, we evaluate these five types of attention with the mIoU metric. The GABlock, CCBlock, NLBlock, and GCBlock methods scored relatively low. SCPBlock and FE (ours) methods have higher scores on mIoU, reaching 42.4 and 43.6, respectively, indicating that they have better performance in foreground perception. The FE method is optimized in terms of attention mechanism. The FE method improves the segmentation effect through two branches. The foreground perception branch specifically calculates the attention of the foreground object, helping the model focus on the key foreground area, more accurately identifying and segmenting the foreground object, and reducing background interference. The boundary-perception branch focuses on capturing the object boundary information and improving the accuracy of the segmentation boundary so that the model can better identify the outline of the foreground object and present a clearer boundary. This dual-branch design enhances the feature-expression ability and segmentation accuracy by processing the foreground and boundary information separately.

4.3. Experimental Results and Comparisons

In Figure 6, we perform visualization, where the class activation maps obtained through the non-local method have the disadvantage of boundary blurriness. This is because the non-local operation considers the feature correlations across the entire image, and when processing pixels near the image boundaries, they may be affected by irrelevant features outside the boundaries, leading to a loss of clarity in the boundary information. In contrast, our approach successfully generates class activation maps with tightly bounded edges through foreground-perception (subtract the

μ

and

σ

) and boundary-perception mechanisms.

As shown in Figure 7, projection fields clearly demonstrate the effectiveness of the proposed MPF in accurately capturing complex, slender, and concentrated object instances. The background region is represented in white, indicating that the projection offset of the pixels in this area to the corresponding instance center is zero. The projection offset essentially remains zero for these pixels, as they are not assigned to any instance. Meanwhile, regions with darker colors in the projection field represent larger projection values, which are commonly situated at the peripheries of instances.

To assess the effectiveness of our weak instance-segmentation framework, Figure 8 provides a more fine-grained representation of instance edges. We provide a visualization analysis of the iSAID dataset and the NWPU VHR-10.v2 dataset. Figure 9 and Figure 10 presents qualitative results. It can be seen that although there are significant differences between the two datasets in the spatial distribution and direction of instances, the segmentation results achieve more comprehensive coverage and can accurately approximate the target contour. However, as observed in the bottom row of Figure 10, our weakly supervised model fails to achieve satisfactory segmentation performance for the bridge and harbor. The failure is that bridge instances are less numerous in the training data, resulting in the model being undertrained. Bridges often co-occur with instances of the same texture feature, such as roads. Since road categories are not included in the dataset, these roads are classified as backgrounds, resulting in bridges being incorrectly identified as backgrounds as well. At the same time, the shape of the harbor is crisscrossed, and some areas may be treated as isolated components, further exacerbating the difficulty of segmentation.

The bridge category exhibits a long-tail characteristic in data distribution; that is, the minority category is dominant while the majority category samples are scarce, which leads to poor performance of the model on the minority category. To alleviate this problem, we adopt a data-augmentation strategy to expand the sample size of the minority category through geometric transformations (such as rotation, scaling, translation, and flipping) and color transformations (adjusting brightness, contrast, hue, and saturation).

We choose convolutional conditional random fields (ConvCRF) as a post-processing step to further refinement. ConvCRF can better model the spatial relationship between pixels, thereby improving the accuracy of segmentation boundaries. When applying ConvCRF for post-processing, we first generate a probability map based on the initial segmentation mask, where each pixel has a probability of belonging to a certain category. Then, ConvCRF adjusts these probabilities based on the similarity and spatial continuity between pixels, making the final segmentation result smoother and more coherent. For areas where bridges and roads coexist, ConvCRF can more accurately distinguish between bridges and roads by analyzing the texture and structural similarities between pixels, thereby reducing misclassification. For objects such as harbors that have complex shapes and are easily treated as isolated components, ConvCRF can consider a wider range of contextual information through its potential high-order function, help connect the segmented parts, and form a more complete segmentation mask.

Table 6 presents the results of our comparisons with fully supervised and weakly supervised instance-segmentation approaches on the iSAID test dataset. Here, the fully supervised methods include Mask R-CNN [46], SCNet [74], and CATNet [9], all of which are sourced from publicly available publications. Additionally, to provide a comprehensive comparison, weakly supervised instance-segmentation methods are evaluated which include IR Net [28], BESTIE [33], TAPNet [75], and CIM [34]. It is worth noting that the results marked with an asterisk “*” are refined pseudo-labels through an additional network, which leads to better performance. “Our*” result represents that we additionally trained a fully supervised instance-segmentation Mask R-CNN using the images and their associated pseudo masks. Among the existing approaches, our proposed CSASeg attains state-of-the-art performance, significantly outperforming the previous benchmarks. Meanwhile, the proposed method bridges the divide between weak supervision and full supervision.

To validate the effectiveness of our approach, Table 7 provides class-wise instance-segmentation results. Our model achieves superior performance across numerous categories, specifically in areas such as “soccer ball field”, “baseball diamond”, and “basketball court”. In other categories, our method remains competitive with the majority of current approaches.

The Table 8 compares and analyzes the performance of different models in weakly supervised instance segmentation. Among them, Cassette performed well, reaching 43.8% mAP, particularly in the “basketball court”, “port”, and “ship” categories. However, in the “aircraft” and “vehicle” categories, CSASeg’s performance falls short.

The Table 9 compares the performance of fully supervised and weakly supervised instance-segmentation methods. Among them, IR Net 2019 (ResNet-50) started with 36.4% mAPs and 57.0% mAP50, showing its early contribution in this field. Subsequent BESTIE 2022 gradually improved the performance of weakly supervised methods, reaching 37.3% mAPs and 58.1% mAP50, respectively. Our CSASeg 2024 (ResNet-50) outperforms methods without further refinement, with mAPs of 39.1%, mAP50 of 60.3%, and mAP75 of 51.8%, demonstrating our significant progress in the field of weakly supervised instance segmentation. After further refinement (marked with asterisks *), our CSASeg* 2024 further improves the mAPs to 40.9%.

One limitation of this model is that when processing remote sensing images, since these images are usually large, we usually crop the images into smaller blocks (e.g., 400 × 400 pixels) during the preprocessing stage. However, this may result in missing instance objects in some small blocks. When segmenting small blocks without instance objects, the model may produce noisy or wrong segmentation results because there are no actual target instances in these areas for the model to learn. This not only affects the accuracy of the segmentation but may also introduce unnecessary information and increase the complexity of post-processing.

5. Conclusions

In this study, we propose a comprehensive spatial adaptive segmentation (CSASeg) method to solve the local discrimination and instance-adhesion problems in weakly supervised instance segmentation (WSIS) of remote sensing images. This method improves the instance-level discrimination through the adaptive spatial perception-enhancement network (SSE) and the multi-level projection field (MPF) module and introduces a foreground-enhancement module to reduce background noise and enhance foreground information. Extensive experiments were comprehensively implemented on the challenging iSAID, NWPU-VHR-10.v2, and SSDD datasets, and we achieved state-of-the-art performances in comparison with the baseline.

Author Contributions

Methodology, J.X.; Validation, P.L.; Supervision, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China under grant 2021YFB3100901, the National Science Foundation of China under Grant 62074131, 62272389, 62372069, and Shaanxi Provincial Key R&D Program 2023-ZDLGY-32. This work was also supported in part by the Shenzhen Fundamental Research Program under grant number JCYJ20210324131611032.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Geng, X.; Jiao, L.; Li, L.; Liu, F.; Liu, X.; Yang, S.; Zhang, X. Multisource joint representation learning fusion classification for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
Du, X.; Zheng, X.; Lu, X.; Doudkin, A.A. Multisource remote sensing data classification with graph fusion network. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10062–10072. [Google Scholar] [CrossRef]
Han, W.; Li, J.; Wang, S.; Zhang, X.; Dong, Y.; Fan, R.; Zhang, X.; Wang, L. Geological remote sensing interpretation using deep learning feature and an adaptive multisource data fusion network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
Zhang, F.; Liu, K.; Liu, Y.; Wang, C.; Zhou, W.; Zhang, H.; Wang, L. Multi-target Domain Adaptation Building Instance Extraction of Remote Sensing Imagery with Domain-common Approximation learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar]
Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
Luo, M.; Zhang, T.; Wei, S.; Ji, S. SAM-RSIS: Progressively adapting SAM with box prompting to remote sensing image instance segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
Wang, J.; Ji, S.; Zhang, T. From image transfer to object transfer: Cross-domain instance segmentation based on center point feature alignment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
Liu, S.; Sun, H.; Zhang, Z.; Li, Y.; Zhong, R.; Li, J.; Chen, S. A multiscale deep feature for the instance segmentation of water leakages in tunnel using MLS point cloud intensity images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
Liu, Y.; Li, H.; Hu, C.; Luo, S.; Luo, Y.; Chen, C.W. Learning to aggregate multi-scale context for instance segmentation in remote sensing images. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–5, Early Access. [Google Scholar] [CrossRef] [PubMed]
Hafiz, A.M.; Bhat, G.M. A survey on instance segmentation: State of the art. Int. J. Multimed. Inf. Retr. 2020, 9, 171–189. [Google Scholar] [CrossRef]
Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef]
Sultana, F.; Sufian, A.; Dutta, P. Evolution of image segmentation using deep convolutional neural network: A survey. Knowl.-Based Syst. 2020, 201, 106062. [Google Scholar] [CrossRef]
Mahbod, A.; Polak, C.; Feldmann, K.; Khan, R.; Gelles, K.; Dorffner, G.; Woitek, R.; Hatamikia, S.; Ellinger, I. NuInsSeg: A fully annotated dataset for nuclei instance segmentation in H&E-stained histological images. Sci. Data 2024, 11, 295. [Google Scholar] [PubMed]
Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 28–37. [Google Scholar]
Graham, S.; Jahanifar, M.; Azam, A.; Nimir, M.; Tsang, Y.W.; Dodd, K.; Hero, E.; Sahota, H.; Tank, A.; Benes, K.; et al. Lizard: A large-scale dataset for colonic nuclear instance segmentation and classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 684–693. [Google Scholar]
Li, W.; Liu, W.; Zhu, J.; Cui, M.; Hua, R.Y.X.; Zhang, L. Box2mask: Box-supervised instance segmentation via level-set evolution. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5157–5173. [Google Scholar] [CrossRef] [PubMed]
Kim, B.; Jeong, J.; Han, D.; Hwang, S.J. The devil is in the points: Weakly semi-supervised instance segmentation via point-guided mask representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11360–11370. [Google Scholar]
Lee, J.; Yi, J.; Shin, C.; Yoon, S. Bbam: Bounding box attribution map for weakly supervised semantic and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2643–2652. [Google Scholar]
Li, R.; He, C.; Zhang, Y.; Li, S.; Chen, L.; Zhang, L. Sim: Semantic-aware instance mask generation for box-supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7193–7203. [Google Scholar]
Lee, H.; Hwang, S.; Kwak, S. Extreme Point Supervised Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17212–17222. [Google Scholar]
Wei, Z.; Chen, P.; Yu, X.; Li, G.; Jiao, J.; Han, Z. Semantic-aware SAM for Point-Prompted Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3585–3594. [Google Scholar]
Liu, Y.; Wu, Y.H.; Wen, P.; Shi, Y.; Qiu, Y.; Cheng, M.M. Leveraging instance-, image-and dataset-level information for weakly supervised instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1415–1428. [Google Scholar] [CrossRef] [PubMed]
Zhu, L.; Zhou, J.; Liu, Y.; Hao, X.; Liu, W.; Wang, X. Weaksam: Segment anything meets weakly-supervised instance-level recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 7947–7956. [Google Scholar]
Hsieh, Y.H.; Chen, G.S.; Cai, S.X.; Wei, T.Y.; Yang, H.F.; Chen, C.S. Class-incremental continual learning for instance segmentation with image-level weak supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1250–1261. [Google Scholar]
Sun, Y.; Liao, S.; Gao, C.; Xie, C.; Yang, F.; Zhao, Y.; Sagata, A. Weakly supervised instance segmentation based on two-stage transfer learning. IEEE Access 2020, 8, 24135–24144. [Google Scholar] [CrossRef]
Zhang, K.; Yuan, C.; Zhu, Y.; Jiang, Y.; Luo, L. Weakly supervised instance segmentation by exploring entire object regions. IEEE Trans. Multimed. 2021, 25, 352–363. [Google Scholar] [CrossRef]
Peng, J.; Wang, Y.; Pan, Z. Weakly supervised instance segmentation via class double-activation maps and boundary localization. Signal Process. Image Commun. 2024, 127, 117150. [Google Scholar] [CrossRef]
Ahn, J.; Cho, S.; Kwak, S. Weakly supervised learning of instance segmentation with inter-pixel relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2209–2218. [Google Scholar]
Zhang, J.; Su, H.; He, Y.; Zou, W. Weakly supervised instance segmentation via category-aware centerness learning with localization supervision. Pattern Recognit. 2023, 136, 109165. [Google Scholar] [CrossRef]
Kweon, H.; Yoon, S.H.; Yoon, K.J. Weakly supervised semantic segmentation via adversarial learning of classifier and reconstructor. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11329–11339. [Google Scholar]
Ru, L.; Zheng, H.; Zhan, Y.; Du, B. Token contrast for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3093–3102. [Google Scholar]
Yang, Z.; Fu, K.; Duan, M.; Qu, L.; Wang, S.; Song, Z. Separate and conquer: Decoupling co-occurrence via decomposition and representation for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3606–3615. [Google Scholar]
Kim, B.; Yoo, Y.; Rhee, C.E.; Kim, J. Beyond semantic to instance segmentation: Weakly-supervised instance segmentation via semantic knowledge transfer and self-refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4278–4287. [Google Scholar]
Li, Z.; Zeng, Z.; Liang, Y.; Yu, J.G. Complete instances mining for weakly supervised instance segmentation. arXiv 2024, arXiv:2402.07633. [Google Scholar]
Ye-Bin, M.; Choi, D.; Kwon, Y.; Kim, J.; Oh, T.H. ENInst: Enhancing weakly-supervised low-shot instance segmentation. Pattern Recognit. 2024, 145, 109888. [Google Scholar] [CrossRef]
He, Y.; Wang, J.; Zhang, Y.; Liao, C. An efficient urban flood mapping framework towards disaster response driven by weakly supervised semantic segmentation with decoupled training samples. ISPRS J. Photogramm. Remote Sens. 2024, 207, 338–358. [Google Scholar] [CrossRef]
Kweon, H.; Yoon, K.J. From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19499–19509. [Google Scholar]
Yin, X.; Im, W.; Min, D.; Huo, Y.; Pan, F.; Yoon, S.E. Fine-grained Background Representation for Weakly Supervised Semantic Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11739–11750. [Google Scholar] [CrossRef]
Zhu, L.; Wang, X.; Feng, J.; Cheng, T.; Li, Y.; Jiang, B.; Zhang, D.; Han, J. WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation. Int. J. Comput. Vis. 2024, 2024, 1–21. [Google Scholar] [CrossRef]
Chen, T.; Yao, Y.; Huang, X.; Li, Z.; Nie, L.; Tang, J. Spatial Structure Constraints for Weakly Supervised Semantic Segmentation. IEEE Trans. Image Process. 2024, 33, 1136–1148. [Google Scholar] [CrossRef]
Yoon, S.H.; Kweon, H.; Cho, J.; Kim, S.; Yoon, K.J. Adversarial erasing framework via triplet with gated pyramid pooling layer for weakly supervised semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 326–344. [Google Scholar]
Lee, J.; Kim, E.; Mok, J.; Yoon, S. Anti-adversarially manipulated attributions for weakly supervised semantic segmentation and object localization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 1618–1634. [Google Scholar] [CrossRef]
Kweon, H.; Yoon, S.H.; Kim, H.; Park, D.; Yoon, K.J. Unlocking the potential of ordinary classifier: Class-specific adversarial erasing framework for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6994–7003. [Google Scholar]
Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4974–4983. [Google Scholar]
Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask scoring r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6409–6418. [Google Scholar]
Liu, S.; Ma, Y.; Zhang, X.; Wang, H.; Ji, J.; Sun, X.; Ji, R. Rotated multi-scale interaction network for referring remote sensing image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26658–26668. [Google Scholar]
Zhang, T.; Zhang, X. A full-level context squeeze-and-excitation ROI extractor for SAR ship instance segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4506705. [Google Scholar] [CrossRef]
Zhang, T.; Zhang, X. A mask attention interaction and scale enhancement network for SAR ship instance segmentation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4511005. [Google Scholar] [CrossRef]
Ke, X.; Zhang, X.; Zhang, T. GCBANET: A global context boundary-aware network for SAR ship instance segmentation. Remote Sens. 2022, 14, 2165. [Google Scholar] [CrossRef]
Shao, Z.; Zhang, X.; Wei, S.; Shi, J.; Ke, X.; Xu, X.; Zhan, X.; Zhang, T.; Zeng, T. Scale in scale for SAR ship instance segmentation. Remote Sens. 2023, 15, 629. [Google Scholar] [CrossRef]
Zhang, X.; Guo, W.; Xing, Y.; Wang, W.; Yin, H.; Zhang, Y. AugFCOS: Augmented fully convolutional one-stage object detection network. Pattern Recognit. 2023, 134, 109098. [Google Scholar] [CrossRef]
Xie, E.; Sun, P.; Song, X.; Wang, W.; Liu, X.; Liang, D.; Shen, C.; Luo, P. Polarmask: Single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12193–12202. [Google Scholar]
Ying, H.; Huang, Z.; Liu, S.; Shao, T.; Zhou, K. Embedmask: Embedding coupling for one-stage instance segmentation. arXiv 2019, arXiv:1912.01954. [Google Scholar]
Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. Solo: Segmenting objects by locations. In Computer Vision—ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16; Springer: Cham, Switzerland, 2020; pp. 649–665. [Google Scholar]
Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. Solov2: Dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 17721–17732. [Google Scholar]
Wang, C.; Wang, G.; Zhang, Q.; Guo, P.; Liu, W.; Wang, X. Openinst: A simple query-based method for open-world instance segmentation. Pattern Recognit. 2024, 153, 110570. [Google Scholar] [CrossRef]
Gu, Z.; Chen, H.; Xu, Z. Diffusioninst: Diffusion model for instance segmentation. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 2730–2734. [Google Scholar]
Zhou, Y.; Zhu, Y.; Ye, Q.; Qiu, Q.; Jiao, J. Weakly supervised instance segmentation using class peak response. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 8–23 June 2018; pp. 3791–3800. [Google Scholar]
Zhu, Y.; Zhou, Y.; Xu, H.; Ye, Q.; Doermann, D.; Jiao, J. Learning instance activation maps for weakly supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3116–3125. [Google Scholar]
Cholakkal, H.; Sun, G.; Khan, F.S.; Shao, L. Object counting and instance segmentation with image-level supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12397–12405. [Google Scholar]
Shen, Y.; Cao, L.; Chen, Z.; Zhang, B.; Su, C.; Wu, Y.; Huang, F.; Ji, R. Parallel detection-and-segmentation learning for weakly supervised instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 8198–8208. [Google Scholar]
Ge, W.; Guo, S.; Huang, W.; Scott, M.R. Label-penet: Sequential label propagation and enhancement networks for weakly supervised instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3345–3354. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
Su, H.; Wei, S.; Liu, S.; Liang, J.; Wang, C.; Shi, J.; Zhang, X. HQ-ISNet: High-quality instance segmentation for remote sensing imagery. Remote Sens. 2020, 12, 989. [Google Scholar] [CrossRef]
Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6688–6697. [Google Scholar]
Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Global context networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 6881–6895. [Google Scholar] [CrossRef]
Vu, T.; Kang, H.; Yoo, C.D. Scnet: Training inference sample consistency for instance segmentation. AAAI Conf. Artif. Intell. 2021, 35, 2701–2709. [Google Scholar] [CrossRef]
Kim, B.; Yoo, Y.; Rhee, C.E.; Kim, J. Break Adhesion: Triple Adaptive-parsing for Weakly Supervised Instance Segmentation. arXiv 2024, arXiv:2406.18558. [Google Scholar]

Figure 1. By employing bounding boxes and image-level annotations, the need for fine pixel-level annotations can be greatly reduced. In particular, image-level annotations can be easily obtained from existing classification datasets, simplifying the annotation process.

Figure 2. Class activation maps (CAMs) face two main challenges when it comes to identifying regions specific to each class. (1) Partial recognition, wherein the model concentrates specifically on a small segment of the target; (2) Foreground–background confusion, where the model struggles to discern the boundary between the target object and its background. Red represents a high level of class activation and blue represents a lower level of class activation.

Figure 3. Overview of CSASeg. To overcome the challenges of part discrimination and instance-separation ambiguity, we propose a method of consistent learning and foreground enhancement to generate class activation maps. Furthermore, we propose a multi-level projection field network designed to enhance the learning of instance-aware masks.

Figure 4. Non-local attention [68] and foreground-enhancement attention. Our foreground-enhancement technology involves two key branches: removing uniform background information (

μ

and

σ

) and achieving class-agnostic foreground perception.

Figure 4. Non-local attention [68] and foreground-enhancement attention. Our foreground-enhancement technology involves two key branches: removing uniform background information (

μ

and

σ

) and achieving class-agnostic foreground perception.

Figure 5. A multi-level projection field-prediction model is designed to improve instance-separation capability. (a) The model utilizes features at different levels. Deep features provide an initial projection field estimate, while shallow features refine this estimate by calculating the residual. (b) A concrete example of

Conv (F_{l}, F_{l + 1}^{↑})

in the final stage, showing how information is transferred across layers.

Figure 5. A multi-level projection field-prediction model is designed to improve instance-separation capability. (a) The model utilizes features at different levels. Deep features provide an initial projection field estimate, while shallow features refine this estimate by calculating the residual. (b) A concrete example of

Conv (F_{l}, F_{l + 1}^{↑})

in the final stage, showing how information is transferred across layers.

Figure 6. (a) Input images; (b) non-local attention model; (c) our foreground-enhancement model.

Figure 7. Visual of the projection fields clearly demonstrates the effectiveness of the proposed MPF in accurately capturing complex, slender, and concentrated object instances.

Figure 8. The visualization of instance-segmentation results demonstrates the model’s capability to accurately differentiate the precise contours of objects within an image.

Figure 9. This set of images shows the results of the weak instance-segmentation technique applied to the test set of the iSAID dataset. Different colors represent different instances.

Figure 10. This set of images shows the results of the weak instance-segmentation technique applied to the test set of the NWPU VHR-10.v2 dataset. Different colors represent different instances. The final images showcase some failed recognitions, such as bridges.

Table 1. The parameter settings for the three datasets.

	iSAID		NWPU VHR-10.v2		SSDD
	SSE	MPF	SSE	MPF	SSE	MPF
Model architecture	ResNet-50	ResNet-50	ResNet-50	ResNet-50	ResNet-50	ResNet-50
Input size	$256 \times 256$	$256 \times 256$	$400 \times 400$	$400 \times 400$	$432 \times 432$	$432 \times 432$
Optimizer	SGD	SGD	SGD	SGD	SGD	SGD
Batch size	32	64	8	8	4	8
Epoch	6	50	12	100	20	100
Learning rate	0.005	$1 \times 10^{- 4}$	0.05	$1 \times 10^{- 5}$	0.001	$1 \times 10^{- 5}$
Weight decay	$5 \times 10^{- 4}$	$1 \times 10^{- 4}$	$5 \times 10^{- 4}$	$1 \times 10^{- 4}$	$1 \times 10^{- 3}$	$1 \times 10^{- 4}$
Loss function	$L_{c l s} L_{s r} L_{s c r}$	$L_{m p f}$	$L_{c l s} L_{s r} L_{s c r}$	$L_{m p f}$	$L_{c l s} L_{s r} L_{s c r}$	$L_{m p f}$

Table 2. Evaluation of each component. All models are evaluated on the iSAID val set.

SSE	FE	MPF	mAPs	FPS
			23.8	16.35
✓			$24.2 (↑ 0.4)$	$16.35 (↓ 0.00)$
	✓		$24.4 (↑ 0.6)$	$14.42 (↓ 1.93)$
		✓	$24.7 (↑ 0.9)$	$14.90 (↓ 1.45)$
✓	✓		$25.0 (↑ 1.2)$	$14.42 (↓ 1.93)$
✓		✓	$25.5 (↑ 1.7)$	$14.90 (↓ 1.45)$
	✓	✓	$25.6 (↑ 1.8)$	$12.97 (↓ 3.38)$
✓	✓	✓	$26.1 (↑ 2.3)$	$12.97 (↓ 3.38)$

Table 3. Evaluation of different types of affine transformation

A (\cdot)

in SSE. All models are evaluated on the iSAID val set. In our experimental methodology, we employed an affine transformation denoted by *.

Table 3. Evaluation of different types of affine transformation

A (\cdot)

in SSE. All models are evaluated on the iSAID val set. In our experimental methodology, we employed an affine transformation denoted by *.

Rescale	Flip	Rotation	Brightness	mIoU
✓				43.6 *
	✓			43.4
		✓		43.9
			✓	43.3
Rotation	Rescale	Flip	Brightness	mIoU
✓				42.2
	✓			43.9
		✓		43.0
			✓	42.7

Table 4. Evaluation of the module order. All models are evaluated on the iSAID val set. Our results are shown in bold.

Module Sequence	mIoU
Backbone $\to CAM \to NLBlock$	41.6
Backbone $\to FE \to CAM$	40.5
Backbone $\to CAM \to FE$	43.6

Table 5. Evaluation of the foreground enhancement in the SSE module. Our results are shown in bold.

Method	mIoU
GABlock	41.1
CCBlock	40.7
NLBlock	41.5
GCBlock	41.8
SCPBlock	42.4
FE (ours)	43.6

Table 6. A comprehensive evaluation of diverse methodologies conducted on iSAID test dataset. The fully supervised results are derived from the literature [9].

Method	Backbone	mAPs	${mAP}_{50}$	${mAP}_{75}$
Fully supervised instance segmentation
Mask R-CNN 2017	ResNet-50	36.7	59.7	39.7
SCNet 2021	ResNet-101	38.1	60.4	41.2
CATNet 2024	ResNet-50	39.9	62.8	43.5
Weakly supervised instance segmentation
IR Net 2019	ResNet-50	22.6	35.6	24.6
BESTIE 2022	ResNet-50	23.6	34.0	26.1
TAPNet 2024	ResNet-50	25.4	36.4	27.6
CSASeg (Ours)	ResNet-50	25.9	37.3	28.0
CIM * 2024	ResNet-50	26.4	37.9	28.3
CSASeg * (Ours)	ResNet-50	27.2	39.2	28.9

* represents the model is refined through the Mask R-CNN framework.

Table 7. A class-wise instance evaluation of diverse methodologies conducted on iSAID test dataset.

Method	Backbone	mAPs	SH	ST	BD	TC	BC	GT	BR	LV	SV	HE	SP	RO	SB	PL	HA
Fully supervised instance segmentation
MaskR-CNN	RNet-50	36.7	46.8	35.4	48.0	75.7	48.2	28.1	17.4	30.4	15.6	13.4	38.1	43.2	30.9	42.9	31.0
SCNet	RNet-101	38.1	48.0	35.9	56.6	77.0	51.5	30.2	18.7	31.7	14.2	9.7	39.2	46.7	36.3	45.1	31.1
CATNet	RNet-50	39.9	49.4	36.7	55.9	77.7	55.9	29.7	20.0	31.5	15.1	14.8	40.1	43.5	39.0	47.0	35.1
Weakly supervised instance segmentation
IR Net	RNet-50	22.6	28.9	21.0	33.6	46.2	30.9	18.4	11.7	18.6	9.7	7.3	23.4	22.5	23.1	26.1	17.6
BESTIE	RNet-50	23.6	30.0	20.2	34.4	47.4	31.6	13.2	9.9	20.6	12.6	6.8	26.8	26.1	25.9	25.2	22.7
TAPNet	RNet-50	25.4	33.2	23.3	33.3	51.4	34.3	18.7	9.2	22.7	10.7	8.6	26.4	29.4	28.5	30.0	22.0
CSASeg	RNet-50	25.9	31.8	24.2	36.3	50.8	36.3	19.1	13.2	20.3	10.5	9.6	26.0	29.6	26.4	31.6	22.8
CIM *	RNet-50	26.4	32.6	25.8	37.6	51.5	35.9	17.4	14.4	21.8	11.7	8.1	28.8	31.1	27.3	28.1	23.9
CSASeg *	RNet-50	27.2	33.3	25.6	38.6	51.9	36.7	20.1	14.8	20.5	12.1	11.1	26.8	31.6	29.8	30.9	24.7

* represents the model is refined through the Mask R-CNN framework.

Table 8. A comprehensive evaluation of diverse methodologies conducted on the NWPU VHR 10.v2 test dataset. The fully supervised results are derived from the literature [9].

Method	Backbone	mAPs	AR	SP	ST	BD	TC	BC	GF	HB	BD	VH
Fully supervised instance segmentation
RDSNet	RNet-101	43.2	5.0	41.5	50.5	79.9	30.1	42.1	87.8	27.4	26.2	41.5
ARE-Net	RNet-101	64.8	39.9	58.5	72.7	85.2	69.5	77.3	88.5	65.0	36.2	55.2
CATNet	RNet-50	69.1	46.0	60.7	84.2	87.9	70.3	73.2	91.5	62.9	46.9	67.0
Weakly supervised instance segmentation
IR Net	RNet-50	38.4	18.7	34.8	45.9	53.7	39.3	45.9	55.6	40.1	17.0	33.4
BESTIE	RNet-50	40.5	23.9	32.2	49.7	56.2	38.8	48.4	58.8	43.6	15.5	37.9
TAPNet	RNet-50	41.4	23.5	35.1	$50.6$	$58.6$	43.8	46.7	60.1	42.7	13.8	39.3
CSASeg	RNet-50	42.1	26.6	38.3	46.9	56.1	47.6	48.8	57.7	42.1	16.8	40.3
CIM *	RNet-50	42.5	$28.9$	38.7	46.2	56.6	46.9	48.1	58.8	42.5	17.3	$41.4$
CSASeg *	RNet-50	$43.8$	27.4	$39.3$	48.9	56.9	$48.4$	$50.0$	$60.5$	$44.6$	$20.7$	40.9

* represents the model is refined through the Mask R-CNN framework.

Table 9. A comprehensive evaluation of diverse methodologies conducted on the SSDD test dataset.

Method	Backbone	mAPs	${mAP}_{50}$	${mAP}_{75}$
Fully supervised instance segmentation
Mask R-CNN 2017	ResNet-50	64.3	92.6	80.9
SCNet 2021	ResNet-101	64.9	92.6	80.1
CATNet 2024	ResNet-50	63.9	93.7	80.1
Weakly supervised instance segmentation
IR Net 2019	ResNet-50	36.4	57.0	49.3
BESTIE 2022	ResNet-50	37.3	58.1	50.2
TAPNet 2024	ResNet-50	37.9	59.0	50.7
CSASeg (Ours)	ResNet-50	39.1	60.3	51.8
CIM * 2024	ResNet-50	40.5	62.1	53.4
CSASeg * (Ours)	ResNet-50	40.9	62.6	53.8

* represents the model is refined through the Mask R-CNN framework.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, J.; Luo, P.; Mu, D. Weakly Supervised Instance Segmentation in Aerial Images via Comprehensive Spatial Adaptation. Remote Sens. 2024, 16, 4757. https://doi.org/10.3390/rs16244757

AMA Style

Xu J, Luo P, Mu D. Weakly Supervised Instance Segmentation in Aerial Images via Comprehensive Spatial Adaptation. Remote Sensing. 2024; 16(24):4757. https://doi.org/10.3390/rs16244757

Chicago/Turabian Style

Xu, Jingting, Peng Luo, and Dejun Mu. 2024. "Weakly Supervised Instance Segmentation in Aerial Images via Comprehensive Spatial Adaptation" Remote Sensing 16, no. 24: 4757. https://doi.org/10.3390/rs16244757

APA Style

Xu, J., Luo, P., & Mu, D. (2024). Weakly Supervised Instance Segmentation in Aerial Images via Comprehensive Spatial Adaptation. Remote Sensing, 16(24), 4757. https://doi.org/10.3390/rs16244757

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Weakly Supervised Instance Segmentation in Aerial Images via Comprehensive Spatial Adaptation

Abstract

1. Introduction

2. Related Work

2.1. Fully Supervised Instance Segmentation

2.2. Weakly Supervised Instance Segmentation

3. Our Approach

3.1. An Overview of CSASeg

3.2. Self-Adaptive Spatial-Aware Enhancement Network

3.3. Multilevel Projection Field Module

4. Experiments

4.1. Implementation Details

4.2. Detailed Comparisons and Ablation Study

4.3. Experimental Results and Comparisons

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI