3.1. Overall Architecture
The local–global framework (LGNet) consists of the dual-source fusion network (DFNet), the local–global transformer modules (LG-Trans), and the multiple upsampling network (MUNet), as shown in Figure 2.
The IRRG and DSM images to be segmented, each of size 512 × 512, are first processed by the two EfficientNet models to generate multi-level features with sizes 16 × 256 × 256, 24 × 128 × 128, 48 × 64 × 64, and 120 × 32 × 32. Then, the DFNet fuses the multi-level features through selective fusion modules (SFM) and generates deep semantic features with sizes of 16 × 256 × 256, 24 × 128 × 128, 48 × 64 × 64, and 120 × 32 × 32. After that, the deep semantic features fused by the SFM modules are fed into the LG-Trans modules, which generate multiscale features with sizes of 128 × 256 × 256, 128 × 128 × 128, 128 × 64 × 64, and 128 × 32 × 32. The MUNet first generates features u1 by upsampling the features extracted by the fourth LG-Trans module with bilinear interpolation to the spatial size of the second LG-Trans module's features (128 × 128) and concatenating the two. Then, it generates features u2 by upsampling the features extracted by the third LG-Trans module to size 256 × 256 with bilinear interpolation and concatenating them with the features of the first LG-Trans module. After that, it restores u1 to size 256 × 256 and fuses it with u2 by element-wise addition. By restoring the original image size gradually, the MUNet effectively alleviates the loss of detail caused by an excessive upsampling stride and improves semantic segmentation accuracy.
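For concreteness, the MUNet fusion step can be sketched as follows in PyTorch. This is a minimal illustration of the bilinear upsampling, concatenation, and addition described above; the function name and the assumption that every LG-Trans output has 128 channels are ours, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def munet_fuse(t1, t2, t3, t4):
    """Illustrative MUNet fusion (names are ours).

    t1..t4: outputs of the four LG-Trans modules with assumed shapes
    (B, 128, 256, 256), (B, 128, 128, 128), (B, 128, 64, 64), (B, 128, 32, 32).
    """
    # u1: upsample the 4th-level features to 128 x 128 and concatenate
    # them with the 2nd-level features.
    u1 = torch.cat([F.interpolate(t4, size=(128, 128), mode="bilinear",
                                  align_corners=False), t2], dim=1)
    # u2: upsample the 3rd-level features to 256 x 256 and concatenate
    # them with the 1st-level features.
    u2 = torch.cat([F.interpolate(t3, size=(256, 256), mode="bilinear",
                                  align_corners=False), t1], dim=1)
    # Restore u1 to 256 x 256 and fuse it with u2 by element-wise addition.
    u1 = F.interpolate(u1, size=(256, 256), mode="bilinear", align_corners=False)
    return u1 + u2
```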
Furthermore, the LGNet is optimized by minimizing the joint loss defined in Equation (17), and we utilize backpropagation to obtain the optimal parameters of the LGNet. Specifically, the learning process of the LGNet is summarized in Algorithm 1.
Algorithm 1: Optimization of the LGNet.
Input: IRRG image sets; DSM image sets; label sets y; category number n
Model: the prediction of the LGNet
1: Initialize the EfficientNet feature extractors with pre-trained parameters
2: for e in Epochs do
3:   Extract IRRG features via the IRRG EfficientNet
4:   Extract DSM features via the DSM EfficientNet
5:   Use the DFNet to fuse the IRRG and DSM features of each level
6:   for l in Levels do
7:     Use the corresponding LG-Trans module to fuse the global and local features extracted from level l
8:   end for
9:   Use the MUNet to restore the size of the prediction
10:  Calculate the joint loss via Equation (17)
11:  Optimize the parameters of the LGNet by minimizing the joint loss
12: end for
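A minimal PyTorch training loop mirroring Algorithm 1 might look as follows; `lgnet`, `joint_loss`, and `loader` are placeholder names, and the optimizer settings are illustrative assumptions rather than the authors' configuration.

```python
import torch

def train(lgnet, loader, joint_loss, epochs, lr=1e-3, device="cuda"):
    """Sketch of the optimization procedure of Algorithm 1 (placeholder names)."""
    lgnet.to(device)
    optimizer = torch.optim.Adam(lgnet.parameters(), lr=lr)  # Adam, as in Section 3.4
    for epoch in range(epochs):
        for irrg, dsm, y in loader:                 # IRRG/DSM image pairs and labels
            irrg, dsm, y = irrg.to(device), dsm.to(device), y.to(device)
            pred = lgnet(irrg, dsm)                 # DFNet -> LG-Trans modules -> MUNet
            loss = joint_loss(pred, y)              # joint loss of Equation (17)
            optimizer.zero_grad()
            loss.backward()                         # backpropagation
            optimizer.step()                        # update the LGNet parameters
```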
3.2. Dual-Source Fusion Network
The DFNet is a novel dual-source feature fusion structure that uses two EfficientNet B2 models to extract multi-level features from the IRRG and NDSM/DSM images and fuses the features of each level with an SFM module. The SFM module effectively exploits the spatial information among neighborhoods reflected by the NDSM/DSM images to reduce the inaccurate predictions caused by shadows and to fully capture texture details and spatial features from the multisource images.
As shown in Figure 3, the SFM module concatenates the features extracted from the IRRG and DSM images. Then, it applies adaptive global average pooling to extract global features and acquires a weight parameter for each channel by compressing the global features into a one-dimensional vector with two fully connected layers. Afterward, features of the same size are obtained by multiplying the per-channel weight parameters with the features extracted from the IRRG image at the same level. Finally, the selective fusion features are obtained by adding the features extracted from the IRRG image to the previous features.
The following equations explain the SFM module. First, we express a convolution operation as Equation (1), where x represents the image features, the subscript indicates the kernel size, ∗ denotes the convolution operator, and b is the bias vector.
Then, we use the two EfficientNet B2 models to extract features from the IRRG image and the NDSM/DSM image, respectively, and we use a concatenation operation to connect the IRRG features and the NDSM/DSM features of the same level.
Finally, the selective fusion module (SFM) fuses each level of features extracted from the IRRG and NDSM/DSM images as in the following equations, where ⊗ indicates element-wise multiplication, + denotes element-wise addition, and the pooling operator represents the adaptive average pooling operation.
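A minimal PyTorch sketch of the SFM, as we read the description above; the channel-reduction ratio and the sigmoid gating in the two fully connected layers are our assumptions and may differ from the paper's exact design.

```python
import torch
import torch.nn as nn

class SFM(nn.Module):
    """Selective fusion module sketch (our naming and hyper-parameters)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # adaptive global average pooling
        self.fc = nn.Sequential(                       # two fully connected layers
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # assumed gating for channel weights
        )

    def forward(self, f_irrg, f_dsm):
        f_cat = torch.cat([f_irrg, f_dsm], dim=1)      # concatenate dual-source features
        w = self.fc(self.pool(f_cat).flatten(1))       # per-channel weight parameters
        w = w.view(w.size(0), -1, 1, 1)
        return f_irrg * w + f_irrg                     # re-weight (⊗) then add (+) the IRRG features
```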
3.3. Local–Global Transformer Module
To fully extract local texture details and improve the global understanding of the entire image, we propose an LG-Trans module consisting of submodule A and submodule B. As shown in Figure 4, submodule A focuses on the global long-range visual dependencies of images through a multihead attention module, which helps the architecture extract robust semantic features. Submodule B pays more attention to the texture details of small objects in local regions and compensates for the transformer-based global fusion module's neglect of local texture details. The LG-Trans module thus captures rich spatial information while also obtaining contextual features of images.
In submodule A, the input features x have a spatial size of H × W and C channels. The traditional transformer first performs tokenization by reshaping the input features x into a sequence of flattened 2D patches, where each patch is of size P × P and N = HW/P² is the number of image patches. Then, the transformer maps the vectorized patches into a D-dimensional embedding space using a trainable linear projection. The transformer learns specific position embeddings for encoding the patch spatial information, and these embeddings are added to the patch embeddings to retain positional details, as in Equation (6), where E denotes the patch embedding projection and E_pos represents the position embedding. The transformer consists of L layers of the multihead self-attention (MSA) module and the multilayer perceptron (MLP). Therefore, the output of the ℓ-th layer can be written as Equations (7) and (8), where LN(·) denotes a layer normalization operation and z_ℓ is the encoded feature.
However, directly using transformers in segmentation is not optimal, since the features fused through the SFM modules are usually much smaller than the original image size, which inevitably results in a loss of low-level details. Submodule A therefore uses the features extracted by the dual-source fusion network as its input and applies the patch embedding layer to patches extracted from these feature maps instead of from raw images.
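The following PyTorch sketch illustrates submodule A under stated assumptions: the patch size, embedding dimension, depth, and head count are placeholders, and PyTorch's built-in transformer encoder stands in for the MSA/MLP layers of Equations (7) and (8).

```python
import torch
import torch.nn as nn

class SubmoduleA(nn.Module):
    """Transformer branch sketch; hyper-parameters are illustrative only."""

    def __init__(self, in_ch, feat_size, embed_dim=128, patch=2, depth=4, heads=8):
        super().__init__()
        n_patches = (feat_size // patch) ** 2
        # Patch embedding applied to the fused feature maps, not to raw images.
        self.embed = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, embed_dim))  # position embedding
        layer = nn.TransformerEncoderLayer(embed_dim, heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)  # LN -> MSA / LN -> MLP
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)          # L encoder layers

    def forward(self, x):
        z = self.embed(x)                             # (B, D, H/P, W/P)
        b, d, h, w = z.shape
        z = z.flatten(2).transpose(1, 2) + self.pos   # flatten to patch tokens, add positions
        z = self.encoder(z)                           # L layers of MSA + MLP
        return z.transpose(1, 2).reshape(b, d, h, w)  # reshape back to a feature map
```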
Moreover, the LG-Trans module also uses submodule B to compensate for the loss of local information, which helps the architecture fully capture local texture details. Submodule B consists of a convolution operation, two receptive field blocks (RFB), and an adaptive average pooling operation. To avoid degrading the model's feature extraction ability with an excessive atrous rate, we apply atrous convolutions with rates of 3, 6, and 9 in the RFB blocks. First, we define the features extracted by the atrous convolution, where x represents the input, w represents a convolution operation with a given kernel size and atrous rate, and ⊙ represents the dot-product operator; the two bias vectors belong to the atrous convolution and the convolution, respectively.
The first branch extracts features through a convolution whose advantage is that it increases the network's non-linearity without changing the spatial structure of the images. The second branch extracts features through a cascaded atrous pyramid, which consists of a convolution followed by atrous convolutions with rates of 3, 6, and 9. The third branch extracts features through a cascaded atrous pyramid that consists of a convolution, an adaptive pooling operation, and atrous convolutions with rates of 3, 6, and 9. The fourth branch consists of an adaptive pooling operation; by fusing image features under different receptive fields, it effectively addresses the architecture's insensitivity to changes in object scale.
In the end, the features extracted from the multiple branches are fused by a concatenation operation, which captures multiscale image features from the different receptive fields and enhances the semantic information of the local space. Furthermore, submodule B applies a batch normalization operation and a ReLU activation function after each branch's convolution or pooling operation to reduce the internal covariate shift of the features.
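A sketch of submodule B under our reading of the branch layout; the kernel sizes, the placement of the adaptive pooling in the third branch, and the final fusion convolution are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k=3, dilation=1):
    """Convolution followed by BatchNorm and ReLU, as used after each branch."""
    pad = dilation * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SubmoduleB(nn.Module):
    """Local branch sketch with atrous rates 3, 6, and 9 (layout is our assumption)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Branch 1: plain convolution (adds non-linearity, keeps the spatial size).
        self.b1 = conv_bn_relu(in_ch, out_ch, k=1)
        # Branch 2: cascaded atrous pyramid with rates 3, 6 and 9.
        self.b2 = nn.Sequential(conv_bn_relu(in_ch, out_ch, k=1),
                                conv_bn_relu(out_ch, out_ch, 3, dilation=3),
                                conv_bn_relu(out_ch, out_ch, 3, dilation=6),
                                conv_bn_relu(out_ch, out_ch, 3, dilation=9))
        # Branch 3: adaptive pooling followed by the same atrous cascade (assumed order).
        self.b3_pool = nn.AdaptiveAvgPool2d(1)
        self.b3 = nn.Sequential(conv_bn_relu(in_ch, out_ch, k=1),
                                conv_bn_relu(out_ch, out_ch, 3, dilation=3),
                                conv_bn_relu(out_ch, out_ch, 3, dilation=6),
                                conv_bn_relu(out_ch, out_ch, 3, dilation=9))
        # Branch 4: global context via adaptive average pooling.
        self.b4 = nn.Sequential(nn.AdaptiveAvgPool2d(1), conv_bn_relu(in_ch, out_ch, k=1))
        self.fuse = conv_bn_relu(4 * out_ch, out_ch, k=1)   # fuse the concatenated branches

    def forward(self, x):
        h, w = x.shape[-2:]
        up = lambda t: F.interpolate(t, size=(h, w), mode="bilinear", align_corners=False)
        y1, y2 = self.b1(x), self.b2(x)
        y3 = self.b3(up(self.b3_pool(x)))       # pool, restore size, then atrous cascade
        y4 = up(self.b4(x))                     # global pooled context, restored to (h, w)
        return self.fuse(torch.cat([y1, y2, y3, y4], dim=1))
```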
3.4. Loss Function
Contrastive learning is widely used in self-supervised representation learning [31,32]. He et al. [31] constructed a large and consistent dictionary on-the-fly through a dynamic dictionary with a queue and a moving-averaged encoder, which facilitates contrastive unsupervised visual representation learning. For a given anchor point, contrastive learning aims to pull the anchor close to positive points and push it far away from negative points in the representation space [33]. Previous works [34,35] often apply contrastive learning to high-level vision tasks, since these tasks are inherently suited to modelling the contrast between positive and negative samples. However, few works have constructed contrastive samples and incorporated a contrastive loss into semantic segmentation.
Inspired by these works, we propose a pixelwise contrastive loss (CT), which exploits the information of the prediction and the ground truth as negative and positive samples and ensures that the prediction is pulled closer to the ground truth. It satisfies the following basic criteria: the smaller the distance between similar samples, the better, and the larger the distance between dissimilar samples, the better. In this paper, a hyperparameter m is added to the second criterion to make the training target bounded. The second criterion has a figurative interpretation: like a spring of length m, if it is compressed, it recovers to length m because of the repulsive force. The contrastive loss is therefore defined as follows:
where W is the framework weight, Y is the pairwise label indicating whether the two samples of a pair belong to the same class or to different classes, and D is the distance between the samples of the pair.
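As a sketch, the criteria above match the classic margin-based contrastive loss; the pairwise-label convention used here (y = 1 for same-class pairs) and the averaging over pairs are our assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(d, y, margin=1.0):
    """Margin-based contrastive loss sketch for the CT criterion.

    d: pairwise distances between prediction and ground-truth samples
    y: pairwise labels, assumed 1 for same-class pairs and 0 otherwise
    """
    # Similar pairs are pulled together; dissimilar pairs are pushed apart
    # until their distance exceeds the margin m (the "spring" of length m).
    return 0.5 * (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()
```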
Moreover, we add a cross-entropy loss (CE) to the total loss so that the model can accurately measure the differences between the prediction and the ground truth, as in Equation (16), where y is the label, ŷ is the corresponding prediction, and n is the number of object categories.
Thus, the total loss function can be reformulated as Equation (17), and the LGNet can be trained with the Adam optimizer in an end-to-end manner.
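A hedged sketch of how the two terms might be combined and optimized end-to-end; Equation (17) may include a weighting between the terms that is not reproduced here, and `contrastive_loss` refers to the sketch given above.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()  # the CE term of Equation (16)

def joint_loss(logits, labels, d, y_pair, margin=1.0):
    """Joint objective sketch: CE plus the pixelwise contrastive term (CT).

    Any weighting factor between the two terms in Equation (17) is omitted here.
    """
    return ce(logits, labels) + contrastive_loss(d, y_pair, margin)

# End-to-end training with Adam, e.g.:
# optimizer = torch.optim.Adam(lgnet.parameters(), lr=1e-4)
```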