1. Introduction
Semantic segmentation of urban scene images plays an important role in many real-life applications, such as autonomous driving and robot sensing. Deep learning has greatly advanced the performance of semantic segmentation, and new methods and models continue to emerge: Ref. [1] proposes MCIBI++, a novel soft mining contextual information beyond-image paradigm that boosts pixel-level representations; Ref. [2] proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation; Ref. [3] develops the Knowledge Distillation Bridge (KDB) framework for few-sample unsupervised semantic segmentation; and Ref. [4] proposes PiCo, a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting. However, in applications such as autonomous driving, where the safety response time is under 0.1 s, the high computational cost of deep learning models limits the deployment of semantic segmentation on resource-constrained mobile devices.
Several compression techniques have been proposed to lower computational cost, including pruning [5,6], knowledge distillation [7,8], quantization [9,10], and compact module design [11,12,13,14]. These methods mainly alleviate the computing burden through network parameter compression. Intuitively, the high resolution of the input image is another critical source of computational cost. Super-resolution semantic segmentation (SRSS) is therefore an alternative way to reduce the overhead: it feeds a down-sampled version of the original image into a given semantic segmentation network and up-samples the resulting low-resolution semantic predictions back to the original resolution. The main challenges facing SRSS are: (1) down-sampling a high-resolution image loses much detailed information, especially for small objects, which makes it difficult to maintain strong pixel-wise prediction performance; and (2) adding complex computational or up-sampling modules drastically increases inference complexity, which contradicts the original goal of keeping computational cost low. In this paper, we therefore propose a novel SRSS method that simultaneously yields accurate high-resolution segmentation results and maintains a low computational cost.
In recent years, a few scholars have begun to study this problem [15,16,17,18]. Zhang et al. [15] proposed HRFR, a knowledge distillation-based method that uses an SRR module to up-sample the features learned in a super-resolution network, recovering high-resolution features that are then decoded into high-resolution segmentation results. HRFR was the first attempt at the SRSS problem, but it requires additional processing modules and leaves considerable room for performance improvement. Wang et al. [16] proposed dual super-resolution learning (DSRL), which keeps high-resolution representations with low-resolution input by utilizing super-resolution and feature affinity. DSRL adopts an image reconstruction task to guide the training of the SRSS task, but the two tasks are not intrinsically related, so the guidance is of limited effectiveness; although DSRL does not change the parsing network structure, its performance degradation is significant. Jiang et al. [17] subsequently proposed a relation calibrating network (RCNet) that restores high-resolution representations by capturing the propagation of relations from low to high resolution and using the relation map to calibrate the features. RCNet achieves state-of-the-art prediction performance, but its extra high-complexity module increases inference cost, and it requires modifications to the given segmentation network structure, which limits flexibility in practical applications. Liu et al. [18] proposed an SRSS method for remote sensing images in which a super-resolution guided branch supplements rich structural information and guides the semantic segmentation. Although this method [18] has proven effective on remote sensing datasets, its effectiveness on urban scenes remains to be evaluated, as urban scene images are more complex than remote sensing images. In addition, existing methods overlook semantic edges, where false positives persist due to the loss of detailed information in reduced-resolution input images.
To resolve this dilemma, we propose MS-SRSS, a simple but effective framework that neither changes the original structure of the given semantic segmentation network nor increases computational cost. Specifically, a multi-resolution learning mechanism (MRL) is proposed to capture multi-resolution representations and strengthen the representation ability of the model. During training, we construct a dual-branch structure, with one branch taking the reduced-resolution image and the other the high-resolution image as input; multi-resolution representations are captured through the shared encoder weights during the joint training of the two branches. In addition, to further improve segmentation accuracy at semantic boundaries, a semantic edge enhancement (SEE) loss is proposed that imposes soft constraints on semantic edges through local structural information. After training, only the super-resolution branch is used at inference, applying a simple up-sampling operation to the low-resolution segmentation results to obtain high-resolution predictions. In summary, our main contributions include:
We propose a novel super-resolution semantic segmentation method called MS-SRSS, which can achieve improved semantic segmentation results without modifying the structure of the original given semantic segmentation network.
We propose a multi-resolution learning mechanism (MRL) and a semantic edge enhancement loss (SEE) to boost the representation ability of the model and alleviate the problem of semantic edge confusion due to the low resolution of the input image.
Extensive experiments demonstrate that our proposed method outperforms the baseline method and several state-of-the-art super-resolution semantic segmentation methods across different input image resolutions.
2. Methodology
In this section, we first give an overview of our proposed framework for super-resolution semantic segmentation (SRSS) in Section 2.1. We then describe the two main strategies of our method, the multi-resolution learning mechanism (MRL) and the semantic edge enhancement loss (SEE), in Section 2.2 and Section 2.3, respectively. Finally, we briefly present the optimization objective in Section 2.4.
2.1. Overview
The goal of the SRSS task is to take a down-sampled version of the raw image as the input of the semantic segmentation network and to output semantic segmentation predictions at the same resolution as the raw image. In this way, inference saves a large amount of computational cost, which is important for applying semantic segmentation in real-world scenarios.
In this paper, we propose MS-SRSS, a multi-resolution learning- and semantic edge enhancement-based method for the super-resolution semantic segmentation task. Specifically, to alleviate the information loss caused by low-resolution input images, we present a novel multi-resolution learning mechanism (MRL) that motivates the semantic segmentation network to capture multi-resolution representational information for more precise segmentation results. In addition, low-resolution inputs make segmentation harder at object edges, especially for small objects or nearby objects with similar appearance. We therefore propose a semantic edge enhancement (SEE) loss and integrate it into the overall objective function to make the network focus on edge regions. Our framework can directly use existing encoder–decoder networks (e.g., DeepLabv3+, PSPNet, OCNet), a classic and effective structure for semantic segmentation, to generate results without changing the network structure or adding extra modules. During training, the two branches are trained in a multi-task learning fashion, while at inference only the super-resolution branch performs super-resolution semantic segmentation.
Figure 1 provides an overview of the proposed framework. Assuming that a raw RGB image is of size H × W × 3, the input to the super-resolution branch is a low-resolution image of size (H/K) × (W/K) × 3, obtained by reducing the raw image resolution through a down-sampling operation. Here, H and W denote the raw image height and width, respectively, and K is the down-sampling coefficient. After the semantic segmentation network produces low-resolution segmentation results, the high-resolution semantic predictions are obtained by a simple up-sampling operation, for which we choose bilinear interpolation. Our framework aims to produce precise super-resolution segmentation results of size H × W × C, where C is the number of semantic categories.
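To make this pipeline concrete, below is a minimal PyTorch sketch of the super-resolution branch at inference time. It is illustrative only: `seg_net` stands for any encoder–decoder segmentation network, and the names and defaults here are our own, not part of our implementation's specification.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def srss_inference(seg_net, image, K=2):
    """Super-resolution branch at inference time (sketch).

    image: raw RGB tensor of shape (B, 3, H, W).
    Returns per-pixel class predictions of shape (B, H, W).
    """
    _, _, H, W = image.shape
    # Down-sample the raw image to (H/K) x (W/K) x 3.
    low_res = F.interpolate(image, size=(H // K, W // K),
                            mode='bilinear', align_corners=False)
    # Low-resolution semantic probability maps: (B, C, H/K, W/K).
    logits = seg_net(low_res)
    # Simple bilinear up-sampling back to the raw resolution.
    logits_hr = F.interpolate(logits, size=(H, W),
                              mode='bilinear', align_corners=False)
    return logits_hr.argmax(dim=1)
```

Note that the only operations added around `seg_net` are interpolations, so the inference graph of the original network is untouched.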
2.2. Multi-Resolution Learning Mechanism
The core of the MRL mechanism is the introduction of a high-resolution semantic segmentation prediction branch to support the training process. Our motivations for adopting multi-resolution learning are twofold: (1) the high-resolution images in the dataset contain richer color and texture information and clearer object boundaries, especially for small objects, and MRL can make full use of this information; (2) learning representations at different resolutions simultaneously improves the feature representation ability of the network, ultimately improving semantic segmentation performance.
As shown in Figure 1, our framework contains two branches during training: the high-resolution branch and the super-resolution branch. The super-resolution branch is the main task, and the high-resolution branch is the auxiliary task. The two branches use the same semantic segmentation network, with shared encoder weights and branch-specific decoder weights; in other words, they share an encoder but use different decoders. In the super-resolution branch, the raw image is first reduced by a down-sampling operation and then fed into the given segmentation network to generate low-resolution semantic probability maps, from which the final high-resolution segmentation results are obtained through up-sampling (again, simple bilinear interpolation). In the high-resolution branch, the input is the raw image, and the generated segmentation results match the raw image size.
The two branches are trained jointly in a multi-task learning fashion. In each training iteration, we feed the high-resolution image and its corresponding low-resolution image into the segmentation network in succession. This scheme integrates the advantages of contrastive learning and multi-task learning, so that the feature encoder extracts affinity features applicable to both high- and low-resolution images, improving super-resolution segmentation performance. The losses of the two branches are added with a trade-off coefficient, and gradients are computed with respect to all parameters. MRL thus resembles an anti-forgetting configuration: the feature extractor acquires representations at different resolutions simultaneously, boosting representational capacity and semantic segmentation performance.
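The dual-branch structure can be sketched as follows; the `encoder`, `decoder_sr`, and `decoder_hr` modules are placeholders, since MRL prescribes only a shared encoder with branch-specific decoders, not any particular architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class MRLSegmenter(nn.Module):
    """Dual-branch training structure for MRL (sketch).

    The encoder is shared by both branches; each branch owns a decoder.
    """
    def __init__(self, encoder, decoder_sr, decoder_hr):
        super().__init__()
        self.encoder = encoder        # shared weights
        self.decoder_sr = decoder_sr  # super-resolution (main) branch
        self.decoder_hr = decoder_hr  # high-resolution (auxiliary) branch

    def forward(self, image_hr, K=2):
        H, W = image_hr.shape[-2:]
        image_lr = F.interpolate(image_hr, size=(H // K, W // K),
                                 mode='bilinear', align_corners=False)
        # Both resolutions pass through the same encoder.
        logits_sr = self.decoder_sr(self.encoder(image_lr))
        logits_hr = self.decoder_hr(self.encoder(image_hr))
        return logits_sr, logits_hr
```

At inference, only `encoder` and `decoder_sr` are kept, so no extra parameters or computation remain from the auxiliary branch.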
2.3. Semantic Edge Enhancement Loss
The Structural Similarity Index Measure (SSIM) [19] imposes constraints through the similarity of images' structural information. Besides its central role in assessing image quality [20], it has shown potential in optimizing saliency map prediction [21], rate control [22], rate-distortion optimization [23], object detection [24], image retrieval [25], and hyperspectral band selection [26]. We extend SSIM by applying this structural-information constraint to multi-class semantic segmentation, proposing a semantic edge enhancement (SEE) loss.
Specifically, patches of size M × M are cropped from each predicted semantic category probability map and its corresponding ground truth map. The SEE loss is then defined as:

$$\mathcal{L}_{see} = 1 - \frac{1}{C\,U}\sum_{i=1}^{C}\sum_{j=1}^{U}\frac{\left(2\mu_{ij}\hat{\mu}_{ij}+\epsilon_1\right)\left(2\sigma_{ij}\hat{\sigma}_{ij}+\epsilon_2\right)}{\left(\mu_{ij}^2+\hat{\mu}_{ij}^2+\epsilon_1\right)\left(\sigma_{ij}^2+\hat{\sigma}_{ij}^2+\epsilon_2\right)} \quad (1)$$

where $C$ denotes the number of semantic categories; $U$ denotes the number of patches; $\mu_{ij}$, $\hat{\mu}_{ij}$ are the mean values of the $j$-th patch in the $i$-th semantic segmentation probability map and the corresponding ground truth, respectively; $\sigma_{ij}$, $\hat{\sigma}_{ij}$ are the corresponding standard deviations; and $\epsilon_1$ and $\epsilon_2$ are small constants that avoid division by zero.
We further elaborate on how the SEE loss improves edge accuracy. When the $j$-th patch lies entirely within the $i$-th semantic category, its mean value $\mu_{ij}$ is large and its standard deviation $\sigma_{ij}$ is small; conversely, when the $j$-th patch sits on a semantic edge, $\mu_{ij}$ is smaller and $\sigma_{ij}$ is larger. We therefore constrain semantic edges by narrowing the gap between the predicted mean and standard deviation of each image patch and their corresponding ground truth values. This soft supervision, based on the structural statistics within image patches, generalizes better than hard supervision that applies an L1 or L2 loss to the pixels at semantic edges.
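Under the formulation above, the SEE loss can be sketched in PyTorch as follows. The non-overlapping patch extraction via `unfold`, the default patch size, and the epsilon values are our own illustrative choices, not prescribed constants.

```python
import torch
import torch.nn.functional as F

def see_loss(prob, gt_onehot, M=8, eps1=1e-4, eps2=1e-4):
    """Semantic edge enhancement loss (sketch of Equation (1)).

    prob:      predicted probabilities, shape (B, C, H, W)
    gt_onehot: one-hot ground truth,    shape (B, C, H, W)
    Assumes H and W are divisible by the patch size M.
    """
    B, C, H, W = prob.shape

    def patch_stats(x):
        # Non-overlapping M x M patches: (B, C, U, M*M).
        p = F.unfold(x.reshape(B * C, 1, H, W), kernel_size=M, stride=M)
        p = p.reshape(B, C, M * M, -1).transpose(2, 3)
        return p.mean(dim=-1), p.std(dim=-1)  # per-patch mean and std

    mu_p, sd_p = patch_stats(prob)
    mu_g, sd_g = patch_stats(gt_onehot)

    # SSIM-style agreement between predicted and ground truth patch stats.
    mean_term = (2 * mu_p * mu_g + eps1) / (mu_p ** 2 + mu_g ** 2 + eps1)
    std_term = (2 * sd_p * sd_g + eps2) / (sd_p ** 2 + sd_g ** 2 + eps2)
    return 1.0 - (mean_term * std_term).mean()
```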
2.4. Optimization
As shown in Figure 1, the overall objective function $\mathcal{L}$ of our training framework is composed of two parts, the loss of the super-resolution branch $\mathcal{L}_{sr}$ and the loss of the high-resolution branch $\mathcal{L}_{hr}$:

$$\mathcal{L} = \mathcal{L}_{sr} + \lambda\,\mathcal{L}_{hr} \quad (2)$$

where $\lambda$ is the trade-off coefficient controlling the influence of the high-resolution branch.
Specifically, for the super-resolution branch, we use the multi-class cross-entropy loss (CE) and the semantic edge enhancement loss (SEE) to guide the training simultaneously:

$$\mathcal{L}_{sr} = \mathcal{L}_{ce}^{sr} + \beta\,\mathcal{L}_{see} \quad (3)$$

where $\beta$ is the coefficient balancing the two loss terms and $\mathcal{L}_{see}$ is the SEE loss term (refer to Equation (1)). $\mathcal{L}_{ce}^{sr}$ is the multi-class cross-entropy loss for the super-resolution branch, which can be presented as:

$$\mathcal{L}_{ce}^{sr} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(p_i) \quad (4)$$

where $p_i$ and $y_i$ refer to the predicted semantic probability and the corresponding ground truth category for pixel $i$, and $N$ is the total number of pixels in the raw image.
For the high-resolution branch, we only use the multi-class cross-entropy loss $\mathcal{L}_{ce}^{hr}$ to guide the training:

$$\mathcal{L}_{hr} = \mathcal{L}_{ce}^{hr} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(p_i) \quad (5)$$

where $p_i$ and $y_i$ again refer to the predicted semantic probability and the corresponding category for pixel $i$, and $N$ is the total number of pixels in the raw image.
In summary, the overall optimization objective can be expressed as follows:

$$\mathcal{L} = \mathcal{L}_{ce}^{sr} + \beta\,\mathcal{L}_{see} + \lambda\,\mathcal{L}_{ce}^{hr} \quad (6)$$

In our experiments, we set both trade-off parameters $\lambda$ and $\beta$ to 1.
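Putting the pieces together, one training step under this objective could look like the sketch below; `MRLSegmenter` and `see_loss` refer to the earlier sketches, and supervising the up-sampled super-resolution logits with full-size labels is an assumed detail of this illustration.

```python
import torch.nn.functional as F

def training_step(model, image_hr, target_hr, beta=1.0, lam=1.0, K=2):
    """One step of the overall objective in Equation (6) (sketch).

    target_hr: ground truth class indices, shape (B, H, W).
    """
    logits_sr, logits_hr = model(image_hr, K=K)
    # Up-sample low-resolution logits so the super-resolution branch
    # is supervised at the raw resolution.
    logits_sr = F.interpolate(logits_sr, size=target_hr.shape[-2:],
                              mode='bilinear', align_corners=False)

    num_classes = logits_sr.shape[1]
    onehot = F.one_hot(target_hr, num_classes).permute(0, 3, 1, 2).float()

    loss_sr = F.cross_entropy(logits_sr, target_hr) \
        + beta * see_loss(logits_sr.softmax(dim=1), onehot)
    loss_hr = F.cross_entropy(logits_hr, target_hr)
    return loss_sr + lam * loss_hr
```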
5. Conclusions
In this paper, we propose a novel super-resolution semantic segmentation (SRSS) method called MS-SRSS. By using the multi-resolution learning (MRL) mechanism and the semantic edge enhancement (SEE) loss function, MS-SRSS effectively improves semantic prediction accuracy for SRSS without adding extra computational overhead. Specifically, the MRL mechanism motivates the network to acquire multi-resolution semantic representations and improves segmentation precision without changing the original network structure, while the SEE loss further improves segmentation precision at semantic edges.
To evaluate the specific contributions of MRL and SEE, we performed targeted ablation experiments; the results confirm the important role of both components and their contributions to the performance gains. To further validate the effectiveness and efficiency of our method, we conducted extensive comparison experiments on the Cityscapes, Pascal Context, and Pascal VOC 2012 datasets, which are representative semantic segmentation benchmarks. The results show that our method achieves higher mIoU than state-of-the-art super-resolution semantic segmentation methods at different down-sampling coefficients, with lower FLOPs.