1. Introduction
Nowadays, autonomous driving has become a research hotspot in the field of intelligent transportation. As a key application of computer vision, deep learning in autonomous driving systems mainly assists the vehicle in perceiving its surrounding environment in real time during driving. Through the detection, recognition, and tracking of target objects, obstacles such as pedestrians and vehicles can be avoided, thereby improving driving safety. In this regard, researchers have conducted extensive research on image classification and detection using deep convolutional neural networks, such as scene parsing [1], pose estimation [2], object detection [3], and collision avoidance [4]. Semantic segmentation [5] is a major branch of deep learning. Its main goal is to predict a label for each pixel in the image, so as to mine deep feature information and obtain accurate detection results, which is of great significance for autonomous driving.
Recently, semantic segmentation has been very popular in a range of environment perception tasks. Wong [6] proposed a feedback-based deep semantic segmentation network that incorporates spatial context by appending an output feedback mechanism, without a post-processing step such as conditional random field (CRF) refinement. Junaid [7] proposed a multi-feature view-based shallow convolutional neural network (MVS-CNN) that utilizes abstract features extracted from the gradient information of the image, improving the semantic segmentation of the road region. In order to capture and transmit road-specific contextual information, research has focused on spatial information inference structures (SIISs), which can learn both the local visual characteristics of the road and its global spatial structure [8]. In order to accurately extract linear features, a novel dilated convolution containing vertical and horizontal kernels (DVH) was introduced into the feature extraction stage of a semantic segmentation network [9]. Mobile laser scanning (MLS) technology has also been widely used in road image segmentation and recognition [10]. Balado et al. [11] applied PointNet to scene segmentation using point clouds acquired by the sensor and performed semantic segmentation of the main targets of the road environment to understand real-time road conditions. Semantic segmentation can also be enhanced by the wavelet transform, where a symmetric fully convolutional neural network is designed to carry out lane-marking segmentation [12]. However, in actual application scenarios, autonomous driving imposes extremely strict real-time requirements on road and obstacle detection methods. It is therefore desirable to develop novel semantic segmentation models with fast inference speed for autonomous vehicles.
In order to solve this problem, many scholars have constructed real-time semantic segmentation models. Sun [13] proposed a real-time fusion semantic segmentation network called RFNet, which can effectively use complementary cross-modal information to conduct real-time RGB-D fusion semantic segmentation, enriching the identification of unforeseen hazards in real scenes. Zhao [14] proposed an image cascade network (ICNet) based on the pyramid scene parsing network, which integrates medium- and high-resolution features while taking segmentation accuracy into account and uses a cascade strategy to achieve real-time image semantic segmentation. By directly connecting the shallow feature maps in the encoding module to the decoding module of the corresponding size, LinkNet exploits the accurate position information of the shallow layers without adding redundant parameters and calculations, so the computation speed is improved while accuracy is maintained [15]. The ENet network is designed with an asymmetrical encoder-decoder structure. Its convolution operations are decomposed by low-rank approximation, which preserves segmentation accuracy while significantly reducing the amount of computation; it is a real-time segmentation network that can complete tasks such as pixel labeling and scene parsing [16]. In addition, the lightweight network LEDNet proposed by Wang et al. [17] develops a residual module in the encoder based on ResNet and introduces an attention mechanism in the decoder to predict the semantic label of each pixel, thereby enhancing the feature expression ability while reducing the amount of network computation. We can therefore conclude that, in order to improve the inference speed of a semantic segmentation network, it is common to design novel segmentation models, compress models, or devise advanced modules to meet real-time requirements. The key to real-time semantic segmentation is to increase segmentation speed while ensuring segmentation accuracy. Hence, balancing the segmentation accuracy and inference speed of a model will remain one of the crucial research directions in the future.
Although semantic segmentation networks perform excellently, especially in the parsing of outdoor road scenes, the problem of rough segmentation of target edges cannot be ignored. The authors of [18] used a new upsampling method to optimize the segmentation of object edges, which more comprehensively retains the edge information of the detection target and yields a better boundary mask. Mostajabi et al. [19] used superpixels as the basic unit for extracting image features, which were then input into the VGG16 network [20]. This method transforms the pixel-level semantic segmentation problem into a superpixel-based classification problem. By combining the spatial context information of superpixels, the extracted image features take both local and global information into account. Similarly, when CRF is used as a subsequent optimization step, a superpixel-based higher order potential (HOP) can be embedded in the CNN for end-to-end training to improve the accuracy of image semantic segmentation [21]. Feng et al. [22] employed a boundary-enhanced loss (BEL) to learn exquisite boundaries in salient object detection.
In this paper, considering both performance and speed in environment perception and road obstacle detection, an ICNet is employed for our image semantic segmentation work. Based on multi-scale features, the model combines effective strategies to speed up inference without sacrificing performance [23]. It first performs downsampling operations on input images of various sizes. After a low-resolution image passes through the entire network, a rough prediction map is obtained; compared with high-resolution predictions, it lacks many small but valuable details, and the boundaries of objects become blurry [24,25]. Accordingly, the low-resolution boundary prediction can be optimized, and an additional loss can then be applied to the segmentation boundary of the network output, which is a simple and effective way for the network to learn the edge and regional features of the target object.
Therefore, we recognize that network complexity, large datasets, and real-time performance are all problems that urgently need to be solved in the application of autonomous driving technology. The development and popularization of semantic segmentation methods still face plenty of difficulties and challenges. First, autonomous driving has very strict requirements for computer vision and needs to operate in real time. Second, because its application involves driving safety, a certain accuracy of the semantic segmentation model must be guaranteed. To address these problems, we propose an improved lightweight real-time semantic segmentation network, using multi-scale branches and a cascaded feature fusion unit to extract rich multi-level features. In this paper, a spatial information network is designed to transmit more prior knowledge of spatial location and edge information. During the training phase, we also append an additional loss function to enhance the learning process of the deep network. The proposed model can detect the drivable road area in images and avoid obstacles such as pedestrians and vehicles, which not only increases the safety and comfort of car driving, but also meets the needs of assisted driving.
In summary, the main contributions of our work are threefold:
For road detection tasks, we propose a real-time semantic segmentation architecture that enhances the image cascade network (ICNet) for real-time image semantic segmentation to deliver more prior knowledge of spatial position and edge information.
We take the spatial information inference module as a sub-network and effectively integrate it with the semantic segmentation network. Furthermore, an additional loss function is introduced for the SP-ICNet architecture to enhance the learning process of the deep learning network system.
In contrast to the pursuit of ever higher-quality semantic segmentation, we focus on a more lightweight network for semantic segmentation tasks and apply it to a public dataset to evaluate its real performance, including the essential implementation details of road obstacle detection.
First, we introduce the model and method in detail in Section 2. Then, we present the results and corresponding analysis in Section 3. Finally, conclusions are drawn in Section 4.
2. Methodology
In this paper, a lightweight semantic segmentation algorithm for road obstacle detection, SP-ICNet, is proposed, which aims to accurately extract the edge features in road images and preserve image boundary details by adding a spatial information sub-network to the original ICNet. In addition, an external loss is appended to the output of the spatial information sub-network during the training stage to improve the learning ability of the model.
Figure 1 shows an overview of the workflow of this study.
2.1. Semantic Segmentation Model
In order to satisfy the real-time requirements, we adopt the ICNet model as the backbone semantic segmentation network to detect road obstacles. ICNet is a lightweight semantic segmentation network with fast detection speed and low memory consumption, which matches the strict real-time requirements and limited hardware conditions of road obstacle detection [26]. As a state-of-the-art method, it introduces a cascaded feature fusion module on the basis of PSPNet, which combines the processing efficiency of low-resolution images with the detection accuracy of high-resolution images, maintaining a good balance between detection accuracy and detection speed. Its structure is shown in Figure 2, where the operations are indicated in brackets; the final ×4 upsampling is only used during testing.
In order to accelerate network segmentation, ICNet converts the input image to different scales and feeds them into three branches: low-, medium-, and high-resolution. The pyramid pooling module (PPM), which captures global information, is retained, and the fused features after pyramid pooling are upsampled as output features. Furthermore, ICNet innovatively proposes the cascaded feature fusion (CFF) unit and a cascaded label guidance training method, using these different levels of feature fusion and cascaded label guidance to produce a better prediction output and obtain the final segmentation.
2.1.1. Branches of Different Scales
The input resolution of the low-resolution branch is only 1/4 of the original input image; this branch aims to extract the semantic information of the entire image, so it adopts the heavy CNN structure. As shown in Figure 2, after multiple downsampling convolutional layers, the resulting feature map size is 1/32 of the original input image. Dilated convolution is then used to expand the receptive field of the feature map without changing its size. Similar to the low-resolution branch, after the medium-resolution (1/2) branch is downsampled and convolved, the resulting feature map is 1/16 the size of the original image. These two branches share the same parameters to increase the calculation speed. The high-resolution branch outputs a feature map of 1/8 the size of the original image after passing through three convolutional layers. Although the resolution of its input image is higher, it is fast because it has fewer convolutional layers. The light CNN structure is adopted for the medium- and high-resolution branches.
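As a minimal sketch of how the three branch inputs can be prepared (bilinear resizing and the function name are our illustrative assumptions, not details from the original implementation):

```python
import torch
import torch.nn.functional as F

def make_branch_inputs(image: torch.Tensor):
    """Build the three ICNet branch inputs from one image tensor (N, 3, H, W)."""
    half = F.interpolate(image, scale_factor=0.5, mode="bilinear", align_corners=False)
    quarter = F.interpolate(image, scale_factor=0.25, mode="bilinear", align_corners=False)
    # The 1/4-scale input feeds the heavy CNN; the 1/2- and full-scale
    # inputs feed the lighter branches.
    return image, half, quarter
```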
2.1.2. Pyramid Pooling Module
This model uses the pyramid pooling module proposed in PSPNet in the heavy CNN network, which can aggregate the context information of different regions, thereby improving the ability to obtain global information. Experiments suggest that the module yields excellent results on multiple datasets. First, it divides the feature map into different sub-regions using adaptive average pooling. Then, the low-dimensional feature maps are upsampled through bilinear interpolation to the same size as the original feature map. Finally, the features of different scales are aggregated into the final global feature of the pyramid pooling.
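The following is a minimal PyTorch sketch of such a pyramid pooling module, assuming the commonly used bin sizes of 1, 2, 3, and 6 and concatenation of the pooled features as in PSPNet (both details are assumptions; the paper does not state them):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool at several grid sizes, reduce
    channels, upsample back, and concatenate with the input features."""

    def __init__(self, in_channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),  # pool a bin_size x bin_size grid
                nn.Conv2d(in_channels, reduced, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        # Upsample each pooled map to the input size via bilinear interpolation.
        pyramids = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(pyramids, dim=1)
```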
2.1.3. Cascaded Feature Fusion Unit
In order to combine the feature maps of different resolutions output by the three branches, ICNet proposes the cascaded feature fusion unit. Since the ratio between the output feature map sizes of adjacent branches is fixed at 2, the CFF upsamples the smaller of its two input feature maps by a factor of 2 and then applies a 3 × 3 dilated convolution layer to expand the receptive field while keeping the resolution of the feature map unchanged. This makes the resolutions of the two input feature maps of the CFF unit the same; the two feature maps are then summed and passed through a ReLU layer. At the same time, in order to enhance the learning of F1, auxiliary labels are applied to the upsampled features of F1. Its structure is shown in Figure 3.
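A rough PyTorch sketch of a CFF unit as described above follows; the 1 × 1 projection on the higher-resolution input follows the original ICNet design, and the channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    """Upsample the low-resolution map 2x, refine it with a 3x3 dilated
    convolution, project the high-resolution map with a 1x1 convolution,
    then sum and apply ReLU. An auxiliary head supports label guidance."""

    def __init__(self, low_ch: int, high_ch: int, out_ch: int, num_classes: int):
        super().__init__()
        self.conv_low = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, kernel_size=3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.conv_high = nn.Sequential(
            nn.Conv2d(high_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Auxiliary classifier on the upsampled low-resolution features (F1).
        self.aux_classifier = nn.Conv2d(low_ch, num_classes, kernel_size=1)

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor):
        f_low_up = F.interpolate(f_low, scale_factor=2, mode="bilinear", align_corners=False)
        fused = F.relu(self.conv_low(f_low_up) + self.conv_high(f_high))
        aux_logits = self.aux_classifier(f_low_up)  # guided by cascaded labels
        return fused, aux_logits
```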
2.1.4. Cascading Label Guidance Strategy
To reinforce the learning of features in the three branches, the network adopts the loss function optimization strategy described in PSPNet and adds a cascaded label guidance strategy to the CFF. Specifically, the smaller input of the CFF unit is upsampled by a factor of 2, dilated convolution is used to broaden the receptive field, and ground truth labels at different scales (such as 1/16, 1/8, and 1/4) guide the learning of the low-, medium-, and high-resolution inputs. Among them, the high-resolution branch does not need to be cascaded with a higher-resolution feature map, so when calculating the loss, the 1/4-size label map of the original image corresponds to the feature map in the decoding stage after the three branches are fused.
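For illustration, the multi-scale guidance labels can be produced by resampling the ground truth map; nearest-neighbor interpolation is our assumption here, chosen so that class indices remain valid:

```python
import torch
import torch.nn.functional as F

def downsample_labels(labels: torch.Tensor, scale: float) -> torch.Tensor:
    """Downsample an integer label map (N, H, W) by `scale` (e.g., 1/16,
    1/8, 1/4) using nearest-neighbor interpolation, which preserves the
    discrete class indices for cascaded label guidance."""
    labels = labels.unsqueeze(1).float()          # (N, 1, H, W) for interpolate
    small = F.interpolate(labels, scale_factor=scale, mode="nearest")
    return small.squeeze(1).long()
```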
On the basis of the ICNet network, we construct a parallel spatial information sub-network to extract and preserve road edge details and then merge it with the features extracted by the original network. Meanwhile, SLIC superpixels are generated in the other sub-network as an auxiliary training branch, to which an external loss is also appended. The richer spatial representation of the feature maps improves the ability of the proposed model to learn detailed features and obtain precise final semantic segmentation results.
2.2. Spatial Information Sub-Network
The ICNet network adopted in this work first passes the low-resolution image through the complete heavy network and then, according to a cascade strategy, merges the medium- and high-resolution features, so that the network combines accuracy and speed, which is challenging for segmentation networks. However, in the traditional approach, all pixels are involved in the calculation and classified into specific categories for scene parsing. Constrained by pixel information and spatial unity, and with a computational complexity that cannot satisfy expectations, this affects the high-quality segmentation of road images.
Instead, inspired by image edge information construction networks [27,28,29], while using pixels for semantic segmentation, we construct a spatial information sub-network. The superpixels generated by the SLIC method participate in this model; they combine the spatial context information of the superpixels and fuse high-level abstract semantic features to obtain semantic segmentation results with optimized edges. This addresses a problem of existing deep learning-based image semantic segmentation algorithms, in which a large amount of image detail, including important object edge information, is lost due to repeated pooling and downsampling in the feature extraction stage. Meanwhile, superpixels are widely used in traditional energy minimization frameworks, as they can greatly improve the performance of an algorithm without increasing the computational complexity.
The detailed structure of the spatial information sub-network is shown in Figure 4. In this sub-network, the input image first generates features at each pixel. The entire network is divided into five stages. First, there are three convolutional layers in the spatial information sub-network. The input image is downsampled by factors of 2 and 4, outputting a feature map of 1/8 the size of the original image after convolution, and dilated convolution is used to expand the receptive field of the feature map. This avoids the loss of high-level feature information in operations such as pooling and better captures the edge information of the target. Then, through deconvolution, the feature map is upsampled to the resolution of the original image. In the next stage, combining these network features, SLIC iterative clustering is performed to obtain the superpixel segmentation results. The algorithm restricts the image search space to an area proportional to the size of the superpixel and then calculates the superpixel cluster centers, using a weighted combination of color and spatial metrics as the distance measure between each pixel and the cluster centers, generating superpixel blocks that are regular and compact in size. Finally, the segmentation result of the spatial information sub-network has clear boundary discrimination, which helps to obtain more edge information of the target and maintain the integrity and clarity of the boundary.
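A rough sketch of such a conv-deconv sub-network is given below; the channel widths and the single transposed convolution are illustrative assumptions rather than the exact architecture:

```python
import torch
import torch.nn as nn

class SpatialInfoSubNet(nn.Module):
    """Sketch of the spatial information sub-network described above:
    stride-2 convolutions down to 1/8 resolution, a dilated convolution
    to enlarge the receptive field, and a transposed convolution back
    to the input resolution (H and W assumed divisible by 8)."""

    def __init__(self, out_ch: int = 32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        )
        # Deconvolution (transposed convolution) restores the 1/8 map to full size.
        self.decode = nn.ConvTranspose2d(64, out_ch, kernel_size=8, stride=8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))
```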
The SLIC algorithm is essentially a clustering algorithm: it performs local clustering of image pixels, using color distance and spatial distance to associate pixels with cluster centers. The generated superpixel blocks are relatively compact, similar in size, and share similar texture, brightness, and other pixel information, so that neighborhood features are well preserved and the algorithm can better capture edge information. In the spatial information sub-network, the road image yields rich local feature information after the convolution and deconvolution operations, which helps SLIC generate better superpixel blocks; the different feature information is then fused in cascade, the channel information merged, and the segmentation result of the spatial information sub-network obtained. The image boundary information contained in this segmentation result helps to compensate for the missing boundary information of the base ICNet network and improves the accuracy of the overall framework.
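For reference, a minimal superpixel extraction step with the scikit-image SLIC implementation might look as follows (the input path, superpixel count, and compactness are hypothetical values, not settings from the paper):

```python
import numpy as np
from skimage.io import imread
from skimage.segmentation import slic

# Run SLIC on an RGB road image; `compactness` weights spatial distance
# against color distance when assigning pixels to cluster centers.
image = imread("road_scene.png")  # hypothetical input path
segments = slic(image, n_segments=600, compactness=10, start_label=0)

# `segments` assigns a superpixel index to every pixel; per-superpixel
# features (here, mean color) can then be pooled for the sub-network.
mean_colors = np.stack([
    image[segments == s].mean(axis=0) for s in np.unique(segments)
])
```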
After the spatial information sub-network obtains the edge information and the semantic segmentation sub-network obtains the semantic information, concat feature fusion is used to fuse the features of the two branches. The two sub-networks produce clearly different feature maps: the former represents more image edge and detail features, while the result obtained by the latter is more inclined toward the global regional features of the image. Our framework therefore provides a richer feature expression that transmits details exactly.
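A minimal sketch of this concat fusion, with resizing included as an assumption for the case where the two feature maps differ in resolution:

```python
import torch
import torch.nn.functional as F

def fuse_branches(edge_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
    """Channel-wise concatenation of edge and semantic feature maps."""
    if edge_feat.shape[2:] != sem_feat.shape[2:]:
        edge_feat = F.interpolate(edge_feat, size=sem_feat.shape[2:],
                                  mode="bilinear", align_corners=False)
    return torch.cat([edge_feat, sem_feat], dim=1)
```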
2.3. External Auxiliary Loss Function
The improved road obstacle detection model SP-ICNet includes two branches: a semantic segmentation network and a spatial information sub-network. The semantic segmentation branch is based on the ICNet model. During processing, this branch carries out the prediction, while the spatial information sub-network simultaneously detects road obstacles as an auxiliary training technique. Therefore, this paper adds an external loss function in an auxiliary manner to optimize the learning process in the training stage of the model. It is noteworthy that the auxiliary training of the sub-network is only used in the training phase to adjust the network parameters and optimize the model, thus improving the detection of road edges in images.
In the semantic segmentation network of road scenes, due to the uneven distribution of pixels, this paper uses the cross-entropy loss function to obtain the gradients for updating the network parameters. Since the ICNet model has three branches with different resolutions, a weighted SoftMax cross-entropy loss with a loss weight $\lambda_t$ is added to each branch. Therefore, the loss function of the semantic segmentation network is defined as

$\mathcal{L}_{seg} = -\sum_{t=1}^{T} \lambda_t \frac{1}{Y_t X_t} \sum_{y=1}^{Y_t} \sum_{x=1}^{X_t} \log \frac{e^{F^{t}_{\hat{n},y,x}}}{\sum_{n=1}^{N} e^{F^{t}_{n,y,x}}}$

where $T$ represents the branches and $N$ represents the categories. The size of the feature map $F^{t}$ in each branch is $Y_t \times X_t$, and the corresponding ground truth label of the feature map at position $(y, x)$ is $\hat{n}$. Moreover, the loss weights $\lambda_t$ are set for each branch to 0.16, 0.4, and 1, respectively.
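In code, this weighted multi-branch loss might be computed as follows (a sketch assuming the branch labels have already been resized to each branch's resolution):

```python
import torch
import torch.nn.functional as F

BRANCH_WEIGHTS = (0.16, 0.4, 1.0)  # lambda_t for the 1/16, 1/8, 1/4 branches

def segmentation_loss(branch_logits, branch_labels):
    """Weighted sum of per-branch SoftMax cross-entropy losses.

    branch_logits[t]: (N, C, H_t, W_t) raw scores for branch t;
    branch_labels[t]: (N, H_t, W_t) integer label maps at that resolution.
    """
    total = 0.0
    for weight, logits, labels in zip(BRANCH_WEIGHTS, branch_logits, branch_labels):
        total = total + weight * F.cross_entropy(logits, labels)
    return total
```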
The loss function of the spatial information sub-network can be defined as

$\mathcal{L}_{sp} = -\frac{1}{M} \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{n=1}^{N} \mathbb{1}\left[\hat{n}_{s} = n\right] \log p^{m}_{s,n}$

where $M$ represents the samples, $S$ represents the superpixels, and $N$ represents the categories; $p^{m}_{s,n}$ is the predicted probability that superpixel $s$ of sample $m$ belongs to category $n$, and $\hat{n}_{s}$ is the ground truth label of superpixel $s$, determined from the labels of its pixels.
Specifically, the total loss of the improved model is obtained by the weighted summation of the semantic segmentation network loss and the spatial information sub-network loss. The semantic segmentation network is applied to detect road obstacles in road scene images; in order to collect more potential semantic classification information and regional features precisely, we assign it a weight of 1 in the total loss. The spatial information sub-network branch mainly assists the target classification on the basis of the semantic segmentation network and adjusts the network parameters, which alleviates the coarse segmentation of road edges by the semantic segmentation network to a certain extent. Through several sets of experiments, we found that setting its weight to 1/10 of that of the semantic segmentation network speeds up model convergence.
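Written out, the total training objective described above is

$\mathcal{L}_{total} = \mathcal{L}_{seg} + 0.1\,\mathcal{L}_{sp}$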
4. Conclusions
In this paper, an SP-ICNet model has been proposed, which uses an additional sub-network to acquire richer feature information from road images to achieve drivable region and obstacle detection. A semantic segmentation network based on feature fusion, ICNet, is adopted in our work. By capturing contextual information, we avoid the heavy computation and memory consumption caused by probabilistic graphical models. The pyramid pooling module, multi-scale convolution, and the cascade model are used to fuse feature information of different scales and gradually refine the segmentation results. As a state-of-the-art lightweight network, it has clear advantages in prediction speed. However, along with these advantages, it also has disadvantages, such as incompletely detected boundary information of the segmentation target. Therefore, this paper uses the superpixel method in a spatial information sub-network, which thoroughly utilizes the local and global feature information of the image to merge edge features and semantic segmentation features, alleviating the problems of fuzzy edges and inaccurate segmentation in semantic segmentation. The experimental results demonstrate the effectiveness of this model, and we conducted multiple sets of experiments to assess and verify the visual segmentation effect and detection performance of the proposed model. Although there are some unavoidable errors in the displayed road detection results, in general, the model obtains satisfactory semantic segmentation results and recovers the drivable area and obstacles in road images of popular datasets, with significant improvement compared to other models. Moreover, it meets the real-time requirements of image semantic segmentation in autonomous driving, and the entire network architecture converges well. The model is relatively stable and reliable, which indicates that the model proposed in this paper is a beneficial attempt and contributes to research on image semantic segmentation.
Real-time image semantic segmentation is often used in environment perception tasks such as scene analysis and multi-target detection, where it has great application value. Furthermore, we plan to enhance the ability of our semantic segmentation network along more dimensions and to study the detection of different targets in road scenes without sacrificing high-quality performance. Similarly, it will be interesting to explore ways to achieve safer driving in road corner cases or to seek compressed models that meet the real-time requirements of autonomous driving more efficiently. These are interesting research directions.