3.1. An Overview of CSASeg
We have a set of aerial training images with image-level supervision, denoted as $\mathcal{D} = \{(I_{n}, y_{n})\}_{n=1}^{N}$, where $N$ represents the total number of images, $I_{n}$ denotes the $n$-th image, and $y_{n}$ indicates the corresponding class label. The class label is depicted as a $C$-dimensional vector, whose length corresponds to the total number of classes.
Figure 3 illustrates the overall architecture of our CSASeg approach. This architecture integrates the proposed Spatial Segmentation-Enhancement (SSE, see
Section 3.2) module and the Multi-Level Projection Field (MPF, see
Section 3.3) block into a backbone network, such as ResNet-50. Specifically, the SSE module first identifies broad class activation maps to locate the spatial semantic regions of instances. Unlike BESTIE, which only highlights the most discriminative activation regions, we implement a spatial-aware adaptive learning mechanism that facilitates complementary matching between image samples and their affine transformations, providing more comprehensive class activation regions. At the same time, we refine these regions with the proposed foreground enhancement (FE), which creates clearer boundaries while suppressing background noise. Subsequently, the MPF block enhances the projection field by mapping the learned features into new spaces at multiple levels, capturing more intricate projection relationships and improving the ability to differentiate between instances. To optimize the CSASeg network, we use the overall loss function detailed below:
$$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{scr} + \mathcal{L}_{sar} + \mathcal{L}_{pf}, \tag{1}$$

where $\mathcal{L}_{cls}$, $\mathcal{L}_{scr}$, and $\mathcal{L}_{sar}$ refer to the image classification loss, spatial-aware regularization, and spatial-aware refinement loss in the first SSE stage, respectively, while $\mathcal{L}_{pf}$ indicates the projection field loss in the MPF block.
3.2. Self-Adaptive Spatial-Aware Enhancement Network
The SSE module uses a Siamese network to extract features; it consists of two sub-networks with the same structure and shared weights. We employ the ResNet-50 [67] architecture for both branches. To better suit our task, we remove the last fully connected layer of the network and replace it with a new convolutional layer with $C$ channels, where $C$ equals the number of image category labels. By applying global average pooling to the $C$-channel feature maps $F$, we infer the classification scores $z = \{z_{c}\}_{c=1}^{C}$. The classification loss is defined as follows:

$$\mathcal{L}_{cls} = -\frac{1}{C}\sum_{c=1}^{C}\Big[y_{c}\log\sigma(z_{c}) + (1 - y_{c})\log\big(1 - \sigma(z_{c})\big)\Big], \tag{2}$$

where $\sigma(\cdot)$ denotes the sigmoid function and $y_{c}$ is the image-level label of class $c$.
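For illustration, a minimal PyTorch-style sketch of one classification branch could look as follows. It assumes a standard torchvision ResNet-50 and a multi-label binary cross-entropy as in Equation (2); the module and function names are our own and not part of the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class ClassificationBranch(nn.Module):
    """One branch of the SSE Siamese network: ResNet-50 without its FC head,
    followed by a C-channel 1x1 convolution and global average pooling."""

    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to the last residual stage; drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # New convolutional layer with C output channels (one per category).
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x):
        feat = self.classifier(self.features(x))   # (B, C, H, W) class feature maps
        logits = feat.mean(dim=(2, 3))              # global average pooling
        return logits, feat

def classification_loss(logits, labels):
    """Multi-label binary cross-entropy against image-level labels (Equation (2))."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```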
We initially train a classification network and subsequently eliminate the pooling function to adapt it for a segmentation task. To this end, we normalize the feature maps of
channels to obtain the segmentation region for each category
, which identifies the parts of the input image that significantly affect the model-classification decisions.
Here, represents the -dimensional feature maps, and x denotes a 2D spatial coordinate within . While these class activation maps effectively highlight significant regions, they may only indicate areas activated by part discrimination, failing to accurately delineate object boundaries. To overcome these limitations, we propose a module that enhances foreground while ensuring spatial consistent learning. We initially design a spatially aware consistent learning method to adaptively mine semantically complementary parts between paired images.
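A minimal sketch of the CAM normalization in Equation (3), operating on the C-channel class maps produced by the branch above; the ReLU-and-max normalization is our reading of the "normalize the feature maps" step.

```python
import torch

def class_activation_maps(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize C-channel class feature maps to per-category activation maps in [0, 1].

    feat: (B, C, H, W) output of the classification branch before pooling.
    """
    cam = torch.relu(feat)
    # Per-image, per-class maximum over the spatial dimensions.
    cam_max = cam.flatten(2).max(dim=2).values.clamp(min=eps)   # (B, C)
    return cam / cam_max[:, :, None, None]
```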
Consequently, a spatial affine transformation $T(\cdot)$ is applied to the training images, including re-scaling, flipping, and rotation. As illustrated in Figure 3, we can also derive the class activation map $A^{T}$ of the transformed image. Subsequently, we calculate the spatial consistency constraint across different spatial transformations of the same input image:

$$\mathcal{L}_{scr} = \big\| A^{T} - T(A) \big\|_{1}. \tag{4}$$

Here, for each image, $T(I)$ denotes any spatial affine transformation of the image, $A^{T}$ represents the corresponding class activation maps, and $T(A)$ denotes the same transformation applied to the activation maps of the original image. In the early stages of training, each branch may learn different specific discriminative regions, which may overlap or differ. Through the spatial consistency regularization $\mathcal{L}_{scr}$, the model gradually adjusts the parameters of each branch during training so that the branch-specific discriminative regions gradually complement each other. As training progresses, the model reaches a stable state in which the predictions of the branches for different affine transformations of the same image are highly consistent, thereby achieving overall performance optimization and training convergence.
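The consistency term in Equation (4) can be sketched as follows, using a horizontal flip as one concrete affine transformation $T$; the choice of transformation, the L1 distance, and the masking to classes present in the image label are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def spatial_consistency_loss(branch_a, branch_b, images, labels):
    """Spatial-aware consistency regularization between the two Siamese branches.

    branch_a / branch_b: the two weight-sharing classification branches.
    A horizontal flip plays the role of the affine transformation T here.
    """
    _, cam = branch_a(images)                          # CAMs of the original image
    _, cam_t = branch_b(torch.flip(images, dims=[3]))  # CAMs of the transformed image

    # T(A): apply the same transformation to the CAMs of the original image.
    cam_aligned = torch.flip(cam, dims=[3])

    # Only constrain the categories present in the image-level label (assumption).
    mask = labels[:, :, None, None].float()
    return F.l1_loss(cam_t * mask, cam_aligned * mask)
```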
Additionally, the segmentation of objects is influenced not only by their appearance but also by the surrounding background. We therefore develop a foreground-enhancement module to further optimize the preliminary class activation maps, strengthening foreground and boundary features in the class activation maps.
Figure 4 gives an illustration of the structure. Specifically, each pixel $\hat{A}_{c}(i)$ of the revised class activation map is obtained by aggregating the pixels $A_{c}(j)$ of the initial class activation map $A_{c}$:

$$\hat{A}_{c}(i) = \sum_{j \in \Omega} \frac{S(i,j)}{\sum_{k \in \Omega} S(k,j)}\, A_{c}(j), \tag{5}$$

$$S(i,j) = \frac{(f_{i} - \bar{f})^{\top}(f_{j} - \bar{f})}{\big\|f_{i} - \bar{f}\big\|\;\big\|f_{j} - \bar{f}\big\|}, \tag{6}$$

where $\hat{A}_{c}$ represents the revised class activation map, $\Omega$ denotes the set of all pixels on a feature map of size $H \times W$, and $S(i,j)$ calculates the correlation between the pixel feature vectors $f_{i}$ and $f_{j}$.

Since background noise usually shows global or local consistency, the de-meaning operation helps suppress this consistency. Specifically, $\bar{f}$ is the mean of all feature vectors in the feature map, which reflects the global trend of the feature space. Subtracting $\bar{f}$ from each feature vector $f_{i}$ and $f_{j}$ removes the global offset caused by the background and makes the foreground features more prominent. The term $(f_{i} - \bar{f})^{\top}(f_{j} - \bar{f})$ thus captures the cosine similarity between foreground features, especially within same-category regions.

Based on the de-meaning operation, the normalization term $1/\sum_{k \in \Omega} S(k,j)$ further adjusts the distribution of the similarities so that each pixel $j$ has the same total impact on all pixels $i$. This term tends to emphasize boundary pixels, because boundary pixels provide the most valuable activations when the overall impact of all pixels is considered; it therefore enhances boundaries, especially in complex backgrounds. Consequently, $\hat{A}_{c}$ considers not only the similarity of foreground features but also boundary constraints.
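Under the reconstruction in Equations (5) and (6), the foreground-enhancement step could be implemented roughly as below. The de-meaned cosine similarity, the ReLU clipping of negative correlations, and the column normalization are our assumptions based on the description above, not a verified reproduction of the released code.

```python
import torch

def foreground_enhance(cam: torch.Tensor, feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Revise class activation maps with de-meaned pixel-wise feature similarity.

    cam:  (B, C, H, W) initial class activation maps.
    feat: (B, D, H, W) feature maps used to compute pixel correlations.
    """
    b, c, h, w = cam.shape
    f = feat.flatten(2).transpose(1, 2)                 # (B, N, D), N = H*W pixels
    f = f - f.mean(dim=1, keepdim=True)                 # de-meaning: remove the global offset
    f = f / f.norm(dim=2, keepdim=True).clamp(min=eps)  # unit-length vectors for cosine similarity

    sim = torch.relu(f @ f.transpose(1, 2))             # (B, N, N) similarity S(i, j), Eq. (6)
    sim = sim / sim.sum(dim=1, keepdim=True).clamp(min=eps)  # each pixel j gets equal total impact

    cam_flat = cam.flatten(2)                            # (B, C, N)
    refined = cam_flat @ sim.transpose(1, 2)             # aggregate A_c(j) with weights S(i, j), Eq. (5)
    return refined.reshape(b, c, h, w)
```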
Foreground enhancement thus allows us to further refine the original class activation maps and obtain refined results. Given the set of revised class activation maps $\{\hat{A}_{c}\}_{c=1}^{C}$, the two branches of the Siamese network are optimized using the spatial-aware refinement loss $\mathcal{L}_{sar}$:

$$\mathcal{L}_{sar} = \frac{1}{C}\sum_{c=1}^{C}\Big(\big\|A_{c} - \hat{A}_{c}\big\|_{1} + \big\|A_{c}^{T} - \hat{A}_{c}^{T}\big\|_{1}\Big). \tag{7}$$
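A short sketch of how the refined maps could supervise the two branches, treating the foreground-enhanced CAMs as fixed pseudo-targets via detach(); this and the L1 distance follow the reconstruction in Equation (7) and are assumptions about unspecified details.

```python
import torch.nn.functional as F

def spatial_aware_refinement_loss(cam, cam_t, refined, refined_t):
    """Pull each branch's raw CAMs toward the foreground-enhanced ones (Equation (7)).

    The refined maps act as pseudo-targets, so gradients are not propagated through them.
    """
    return F.l1_loss(cam, refined.detach()) + F.l1_loss(cam_t, refined_t.detach())
```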
During the test phase, we maintain one branch as the inference model to generate the final class activation maps.
3.3. Multi-Level Projection Field Module
After obtaining the (instance-agnostic) category-activation representation, we propose a Multi-Level Projection Field (MPF) estimation method for instance segmentation, which exploits the coarse-to-fine multi-scale information in the network feature pyramid to effectively distinguish different instances of the same category and to alleviate the challenges posed by instance variations. The projection field assigns to each pixel a vector representing the offset of that pixel relative to the central reference position of its instance. The field is composed of two channels, representing the horizontal and vertical offsets, respectively. By predicting the projection field, each pixel can be associated with its corresponding instance center, thereby enabling instance-level segmentation.
MPF improves instance-segmentation performance by combining deep and shallow features to generate a more accurate projection field. Although deep features have low resolution, they are more robust to instance variations: they capture the overall structure and contours of an instance even when details are unclear, and therefore provide an initial projection field estimate that gives the approximate position and orientation of each instance. Shallow features have higher resolution and provide more detailed information. Based on the shallow features, the projection field can be further refined by predicting a projection field residual, which is equivalent to adding detail to the initial rough estimate. When shallow features are used to predict these residuals, the deep features are first integrated into the shallow features through a proposed cross-layer feature-fusion method. This fusion reduces the distance between features at different levels, which in turn reduces the magnitude of the residuals and makes them easier to infer.
We employ the ResNet-50 network as the backbone of our MPF. As delineated in Figure 5, we extract an $L$-level feature pyramid $\{X_{l}\}_{l=1}^{L}$. Given $\{X_{l}\}_{l=1}^{L}$, our network learns a multi-level projection field $\{P_{l}\}_{l=1}^{L}$. We adopt the weakly supervised instance-segmentation method BESTIE [33] to obtain the ground truth of the projection field, $P^{gt}$. Specifically, for each layer $l$ of features, the network outputs a two-channel projection field $P_{l}$, where each pixel position $(u, v)$ has a corresponding two-dimensional offset vector with components $\Delta u$ and $\Delta v$. For each pixel position $(u, v)$, we calculate its shifted coordinates to achieve instance segmentation:

$$(u', v') = (u + \Delta u,\; v + \Delta v). \tag{8}$$
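The following sketch illustrates how a predicted projection field could be turned into instance assignments (Equation (8)): each pixel is shifted by its offset and snapped to the nearest instance center. The centers themselves (e.g., peaks of the class activation maps) and the nearest-center assignment rule are assumptions, since the paper only defines the shifted coordinates.

```python
import torch

def assign_pixels_to_instances(proj_field: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Group pixels into instances using a 2-channel projection field.

    proj_field: (2, H, W) per-pixel offsets (du, dv) toward the instance center.
    centers:    (K, 2) instance center coordinates in (u, v) order.
    Returns an (H, W) map of instance indices in [0, K).
    """
    _, h, w = proj_field.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([u, v], dim=0).float()          # (2, H, W) pixel coordinates

    shifted = coords + proj_field                        # (u + du, v + dv), Eq. (8)
    shifted = shifted.flatten(1).transpose(0, 1)         # (H*W, 2)

    # Snap every shifted pixel to its nearest instance center.
    dist = torch.cdist(shifted, centers.float())         # (H*W, K)
    return dist.argmin(dim=1).view(h, w)
```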
It is difficult to infer the projection field directly because of the diversity of object scales and shapes commonly displayed in remote sensing images, as well as the ambiguity caused by closely spaced instances. Therefore, we adopt a deep-to-shallow, multi-level estimation scheme.
Given the deepest feature map $X_{L}$, which has the largest receptive field and strong long-distance relationship modelling ability, an initial projection field estimate $P_{L}$ can be predicted. Starting from this low-resolution estimate $P_{L}$, the most straightforward strategy is to progressively refine it at each level of the feature pyramid, finally reaching a high-resolution projection field estimate $P_{1}$:

$$P_{l} = \mathcal{G}_{l}\big(\mathcal{C}(X_{l},\, X_{l+1}\!\uparrow),\; 2\,P_{l+1}\!\uparrow\big), \qquad l = L-1, \dots, 1, \tag{9}$$

where $\mathcal{G}_{l}$ is the projection head at level $l$ and $\mathcal{C}(\cdot,\cdot)$ is the cross-layer fusion operation described below. Here, $P_{l+1}$ and $X_{l+1}$ from the previous layer need to be upsampled to increase the spatial resolution (indicated by "$\uparrow$"), and $P_{l+1}$ needs to be scaled by a factor of 2 to align with the resolution of the pyramid feature at level $l$. In this way, the MPF estimate can be progressively refined by integrating the higher-level prediction $P_{l+1}$ with the lower-layer feature $X_{l}$. However, since the size of the projection field is enlarged at each estimation step, the outputs at different levels span inconsistent value ranges, which makes the regression difficult to handle. Therefore, we propose a residual optimization mechanism to achieve more precise MPF estimation, in which the new estimate is obtained by adding the residual of the current step to the estimate of the previous step.
Instead of estimating the complete projection field $P_{l}$ directly, the network estimates the difference (i.e., the residual $R_{l}$) between the current layer and the previous-layer estimate $2\,P_{l+1}\!\uparrow$. Equation (9) is consequently enhanced as:

$$P_{l} = 2\,P_{l+1}\!\uparrow + R_{l}, \qquad R_{l} = \mathcal{G}_{l}\big(\mathcal{C}(X_{l},\, X_{l+1}\!\uparrow),\; 2\,P_{l+1}\!\uparrow\big). \tag{10}$$

Residual magnitudes are usually smaller than complete projections, which means the range of values is similar at different pyramid levels and regression training is more stable.
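A structural sketch of the deep-to-shallow residual refinement in Equation (10): the input features are assumed to be already cross-layer fused (e.g., by the MCF block sketched after the next paragraph), and the residual head is a plain 3 × 3 convolution; both are assumptions about details the paper does not specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualProjectionRefiner(nn.Module):
    """Refine a coarse projection field level by level: P_l = 2 * up(P_{l+1}) + R_l."""

    def __init__(self, channels_per_level):
        super().__init__()
        # One residual head per pyramid level (a 3x3 conv predicting 2-channel residuals).
        self.heads = nn.ModuleList(
            [nn.Conv2d(ch, 2, kernel_size=3, padding=1) for ch in channels_per_level]
        )

    def forward(self, feats, p_init):
        """feats: fused pyramid features from the second-deepest to the shallowest level;
        p_init: (B, 2, h, w) initial estimate predicted from the deepest feature map."""
        p = p_init
        for feat, head in zip(feats, self.heads):
            # Upsample the previous estimate and double it to match the finer resolution.
            p_up = 2.0 * F.interpolate(p, size=feat.shape[-2:], mode="bilinear", align_corners=False)
            r = head(feat)                # residual R_l predicted from the fused feature
            p = p_up + r                  # Equation (10)
        return p
```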
For the fusion operation $\mathcal{C}(\cdot,\cdot)$, a convolution is performed between the $l$-th layer feature map and the upsampled $(l+1)$-th layer feature map; to this end, we propose the multi-level context fusion (MCF) convolution. In detail, we use 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolution kernels to perform multi-scale feature extraction on the fused feature map. Among them, the 1 × 1 convolution realizes inter-channel feature fusion and adjusts the channel dimension, the 3 × 3 convolution captures local textures and edges to support accurate segmentation of small targets, while the 5 × 5 and 7 × 7 convolutions capture a wider context, deepening the understanding of object spatial relationships and backgrounds and adapting to the large-scale structures of remote sensing images. By integrating these convolution operations, the model learns rich multi-level features and significantly enhances its representational capability. More importantly, this multi-scale feature-extraction strategy also improves the alignment between $X_{l}$ and $X_{l+1}\!\uparrow$, ensuring that feature information from different levels is more accurately aligned during the upsampling process.
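A sketch of how the MCF convolution could look; the way the four branches are combined (summation after channel-aligned 1 × 1/3 × 3/5 × 5/7 × 7 convolutions over the concatenated features) is an assumption, since the paper lists the kernel sizes but not the exact fusion rule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCFBlock(nn.Module):
    """Multi-level context fusion: fuse X_l with the upsampled X_{l+1} through
    parallel 1x1, 3x3, 5x5 and 7x7 convolutions."""

    def __init__(self, low_channels: int, high_channels: int, out_channels: int):
        super().__init__()
        in_ch = low_channels + high_channels
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_channels, kernel_size=k, padding=k // 2) for k in (1, 3, 5, 7)]
        )

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        # Upsample the deeper (coarser) feature map to the resolution of level l.
        x_high = F.interpolate(x_high, size=x_low.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([x_low, x_high], dim=1)
        # Sum the multi-scale responses: 1x1 mixes channels, 3x3 keeps local detail,
        # 5x5 and 7x7 capture wider spatial context.
        return sum(branch(fused) for branch in self.branches)
```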
Finally, we use a regression loss to measure the difference between the predicted and ground-truth projection fields, and we specifically choose the mean absolute error (MAE) as the loss function $\mathcal{L}_{pf}$. The formula is as follows:

$$\mathcal{L}_{pf} = \frac{1}{M}\sum_{i=1}^{M}\big|P^{gt}_{i} - \hat{P}_{i}\big|, \tag{11}$$

where $P^{gt}_{i}$ is the ground-truth value, $\hat{P}_{i}$ is the predicted value, and $M$ is the number of samples. By minimizing $\mathcal{L}_{pf}$, our model can learn the trend of the projection field more accurately and improve the accuracy of instance segmentation.
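A minimal sketch of the projection field loss in Equation (11), summed over pyramid levels with the ground truth resized to each level; the per-level averaging and the rescaling of offsets with resolution are assumptions consistent with the ×2 scaling used in Equation (9).

```python
import torch
import torch.nn.functional as F

def projection_field_loss(preds, target):
    """Mean absolute error between predicted and ground-truth projection fields.

    preds:  list of (B, 2, H_l, W_l) predictions, one per pyramid level.
    target: (B, 2, H, W) ground-truth projection field (e.g., from BESTIE pseudo-labels).
    """
    loss = 0.0
    for p in preds:
        # Offsets shrink with spatial resolution, so scale the resized ground truth accordingly.
        scale = p.shape[-1] / target.shape[-1]
        t = F.interpolate(target, size=p.shape[-2:], mode="bilinear", align_corners=False) * scale
        loss = loss + F.l1_loss(p, t)   # MAE at this level
    return loss / len(preds)
```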