1. Introduction
Artificial intelligence-based image processing has become popular in recent years and has accelerated the development of solutions in multiple fields, most notably those that rely on computer vision. Deep learning systems feature in various real-life applications, making use of large publicly available datasets and of the increased processing power now available. In particular, deep learning-based computer vision has been used to predict relevant information in fields such as medical imaging, the piloting of autonomous vehicles, robots, or platforms, and surveillance. For the prediction task, much of the work has focused on developing convolutional neural networks (CNNs), which usually outperform traditional image-processing algorithms.
Of all the possible uses for CNN-based computer vision, the field of automotive applications, which not only includes autonomous vehicles but also other advanced driving assistance systems, has benefited from special attention in recent years. This field is significant from the computer vision point of view because the environment to be perceived is dynamic, unpredictable, and complex, and the image perception must be performed in real time using limited computer architectures.
While there is a wide range of sensor systems available for autonomous vehicles, including laser scanners, radar, sonar, and stereo camera setups, a system that uses a single camera is much easier to set up and calibrate, while still being able to extract a vast amount of information.
By analyzing the image provided by a single camera, a computer system can extract generic area information, such as the free-space areas in front of the vehicle that can be traversed or the obstacle areas that must be avoided, or it can extract information about individual objects. This generic area information can be used for basic collision avoidance or even for limited autonomy, such as lane-keeping or platooning, and the identification of individual objects helps in tracking them and predicting their intentions, or in classifying them, to better estimate the risks and vulnerabilities.
Both area perception and individual object perception functions can be achieved by CNNs. Identifying free space and obstacle areas can be achieved by using CNNs for semantic segmentation, which labels each pixel of the image with a class-based label; identifying individual objects can be achieved using instance segmentation CNNs.
In the past few years, much research has been devoted to improving semantic segmentation network architectures and to achieving accurate instance segmentation. Instance segmentation methods are a step ahead of the semantic ones, whereas semantic segmentation networks are faster and less complex. Semantic segmentation produces a label image with pixel-level information regarding the object’s class, whereas instance segmentation takes the semantic segmentation one step further to provide identifiers for individual object instances. To produce individual object instances, instance segmentation methods usually predict the object’s bounding box and then perform segmentation for each predicted bounding box. This makes the segmentation dependent on the quality of the bounding box prediction. In addition, instance segmentation methods have proven to be time-consuming, and most CNN-based solutions focus on accuracy rather than prediction speed. Real-time capability is usually limited to object detection neural networks, which mostly predict two-dimensional (2D) bounding boxes.
The work presented in this paper addresses the need for real-time instance segmentation with a reduced demand for resources. The core of our solution is a multiple output semantic segmentation network, which, while not able to identify individual objects, will produce object part information that is easily exploited by a low-complexity geometry-based algorithm. This combination of semantic segmentation and the geometric post-analysis of the semantic segmentation results is able to achieve the real-time identification of individual objects at the pixel level, with a significant reduction in the required resources when compared with instance segmentation networks.
In this paper, we propose a convolutional neural network that is trained to perform multiple prediction tasks, along with a geometry-based post-processing step to extract additional relevant data. The multi-output CNN prediction is achieved using a single encoder module and multiple decoders. The encoder part will extract the relevant features from the input images that are fed into the CNN and is usually referred to as the “backbone” network. The decoders will make use of these features to generate the predictions (outputs) of the network and are usually called network “heads”.
The presented CNN architecture features three output heads related to the road traffic scene: the obstacle parts for instance segmentation (the top-left, top-right, bottom-left, and bottom-right quarters, which will be used later by the clustering algorithm), the semantic segmentation of the scene for detecting the drivable road area, and the vanishing point of the image. The backbone network (the encoder) provides the convolutional features that the semantic segmentation heads use to reconstruct images of the scene, through various upsampling layers and direct connections with the encoder.
The first head outputs the semantic segmentation of the object parts, which will be clustered using the algorithm presented in Section 3.4 to produce the instance segmentation results.
The second head produces the obstacle vs. free area semantic segmentation of the scene; this result is used as a mask to refine the object regions after they are clustered. The free space segmentation result can also be used as a standalone result for simpler driving assistance applications, such as collision avoidance.
The third head produces the gradient voting maps for vanishing point computation, a result that can be used for automatic camera orientation calibration and, therefore, for estimating the three-dimensional (3D) position of the objects detected by the previous two heads. This step is beyond the scope of this paper.
Thus, our proposed solution achieves obstacle reconstruction and individual instance extraction. We were able to perform semantic segmentation of the road traffic scene, detect obstacles and extract instances, and determine the vanishing point in an efficient manner, with results that were comparable to the ones produced by instance segmentation networks without the added memory requirements, high computational requirements, and network complexity.
The multi-output neural network is trained using existing publicly available datasets. The training process is fully automated and does not require any additional annotation work, as the object quarters are extracted from the labeled instances by our training pre-processing tool.
The main contributions of this work are the following:
A multiple-head network architecture that is able to produce the geometric parts of the objects, which can then be grouped into individual instances, free and occupied pixel classification for result refinement, and voting maps for vanishing point computation;
A lightweight object-clustering algorithm, based on the results of the multiple-head semantic segmentation network;
An automated training solution for the multiple-head network, which uses publicly available databases without the need for manual annotation of the object parts.
3. Solution Description
The core of the solution is the multiple-head artificial neural network, based on an encoder–decoder structure that uses a color image as input, whereas the output consists of multiple prediction modules: an obstacle detection module, a semantic segmentation module, and a vanishing-point detection module. The encoder-based input part extracts the significant features from the input image, whereas the decoders, based on semantic segmentation solutions, provide the multiple predictions (outputs). Each module was trained independently, using the same loss function. The CNN architecture is presented in Figure 1.
The detection module identifies individual objects or obstacles from the road scene, including vehicles, trucks, buses, cyclists, pedestrians, etc. It is based on semantic segmentation approaches; more specifically, on a modified U-Net CNN decoder module. The method of extracting the individual object instances is novel: we use the semantic segmentation CNN to label the image pixels with the corresponding object quarters (parts), instead of using it to label free space or specific obstacles. The labeled object parts are then grouped into individual objects, using post-processing that takes into account the proximity and position of the quarters as parts of a full object. This approach simplifies the overall network architecture by making use of the direct connections between the encoder and the decoder layers. Another advantage of a multiple output network is that the encoder part is shared between the modules, and the model can easily be extended to predict other information. The detection module output features four layers (channels) that encode the binary status of the image pixels as being part of an object quarter (top left, top right, bottom left, bottom right). Using these parts, the algorithm described in Section 3.4 is able to extract the object instances as bounding boxes and as labeled regions of pixels.
3.1. Feature Extraction
The extraction of the relevant pixel-based features from the input images is achieved by the first part of the network; its structure is based on the ResNet architecture [1], which has proved to be very effective and is widely used. ResNet gained recognition after winning the ImageNet (ILSVRC) 2015 competition; it also introduced the concept of skip connections between the layers of the neural network. These connections were the novel part of the architecture and helped to improve the performance of the CNN. In this work, we make use of the 50-layer variant, ResNet-50 (see Figure 2).
Each “conv block” is characterized by the following three operations: a 2D convolution, a batch normalization operation, and ReLU activation. These are performed three times, with different numbers of filters and kernel sizes. The result is then combined (added) with the output of an additional 2D convolution applied to the block input. The “identity block” is similar, containing the same three operations performed three times, but its skip connection uses the input tensor directly, without the extra convolution.
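To make the block structure concrete, the two block types can be sketched in Keras-style code as follows (a minimal sketch assuming a standard ResNet-50 bottleneck design; the filter counts, strides, and the framework itself are illustrative assumptions, since the paper does not fix them):

from tensorflow.keras import layers

def conv_block(x, filters, stride=2):
    # "conv block": three Conv2D + BatchNorm + ReLU stages, added to a
    # projection shortcut (an extra convolution applied to the block input)
    f1, f2, f3 = filters
    shortcut = layers.Conv2D(f3, 1, strides=stride)(x)
    shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Conv2D(f1, 1, strides=stride)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(f2, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(f3, 1)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def identity_block(x, filters):
    # "identity block": same three stages, but the skip connection uses the
    # input tensor directly, without the extra convolution
    f1, f2, f3 = filters
    y = layers.Conv2D(f1, 1)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(f2, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(f3, 1)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([y, x]))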
3.2. Decoder Structure
The decoder is the part responsible for reconstructing dense predictions from the extracted features. It is based on the well-known U-Net architecture [4], which is able to perform well even with small training datasets. The structure of the decoder is illustrated in Figure 3. The concatenation operations are paired with the corresponding layers from the encoder part, whereas the output of the network is given by the final convolution operation.
The decoder performs multiple operations in order to obtain the desired output. Our module features a central convolutional layer, followed by three upsampling layers. The central layer contains a 2D convolution operation and a batch normalization, followed by ReLU activation. The three upsampling layers have the following structure: 2D upsampling, concatenation (with the corresponding encoder layer), zero padding, 2D convolution, and a batch normalization operation. The final part of the semantic segmentation module has an additional convolution that produces the final segmentation output (the segmentation map), with the same number of channels as the desired number of predicted object classes.
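A minimal Keras-style sketch of this decoder structure is given below; the filter counts, the sigmoid activation, and the helper names are our own assumptions for illustration, since the paper only fixes the sequence of operations:

from tensorflow.keras import layers

def up_block(x, skip, filters):
    # One upsampling stage: 2D upsampling, concatenation with the
    # corresponding encoder layer, zero padding, 2D convolution, batch norm
    x = layers.UpSampling2D(size=2)(x)
    x = layers.Concatenate()([x, skip])
    x = layers.ZeroPadding2D(padding=1)(x)
    x = layers.Conv2D(filters, 3)(x)
    return layers.BatchNormalization()(x)

def segmentation_head(features, skips, num_classes):
    # Central layer: 2D convolution, batch normalization, ReLU activation
    x = layers.Conv2D(512, 3, padding='same')(features)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Three upsampling stages paired with encoder skips (deepest first)
    for skip, filters in zip(skips, (256, 128, 64)):
        x = up_block(x, skip, filters)
    # Final convolution: one output channel per predicted class
    return layers.Conv2D(num_classes, 1, activation='sigmoid')(x)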
The encoder–decoder structure of the artificial neural network is presented in Figure 4, where the concatenation connections are more clearly illustrated.
3.3. Global Semantic Segmentation
The neural network outputs (heads) are produced by the decoder part, which reconstructs the semantic segmentation and is based on the U-Net CNN. The proposed network features a central layer and three upsampling layers that are concatenated with their corresponding layers from the ResNet-based encoder, as described in the previous section.
The relevant features, extracted using the ResNet encoder, are used in the reconstruction layers of the U-Net decoder to produce an image with three channels, each channel representing one of the desired segmentation classes. In this work, we predict three different classes: the road (free space), dynamic objects, and static objects. Therefore, the first output channel of the final convolutional layer depicts the drivable road area, while the second channel represents the dynamic (moving) objects from the road, including pedestrians, cyclists, vehicles, buses, and trucks. The final channel of the output contains the static objects; more specifically, we chose sidewalks and the lane delimiters (road barriers or fences). An example is presented in Figure 5.
3.4. Obstacle Reconstruction Using Part-Based Semantic Segmentation
The detected obstacles from the decoder module will be reconstructed using the methodology presented in this section. The CNN will provide an output wherein the obstacles are split into four individual quarters (parts), each represented by a binary channel of the output image (see Figure 6).
Based on the semantic segmentation results of the neural network, which labels each obstacle pixel with a “quarter” label, we have designed an algorithm to extract the individual objects. First, each pixel of the whole image space is labeled with a 4-bit code, each bit corresponding to a quarter that overlaps the pixel. Some pixels can be overlapped by more than one quarter, as the four semantic segmentation images produced by the CNN are not mutually exclusive (see Figure 7). The meaning of the four bits is as follows:
Bit 0 (the least significant): the pixel belongs to a top-left quarter;
Bit 1: the pixel belongs to a top-right quarter;
Bit 2: the pixel belongs to a bottom-left quarter;
Bit 3 (the most significant): the pixel belongs to a bottom-right quarter.
The coded pixels will have a value of 0 if they belong to the free space (they are not obstacle points), or a value between 1 and 15 if they belong to an object. For each of the 15 obstacle pixel values, a binary image is generated and labeled using the connected component labeling algorithm. The labels for each of the 15 images are unique (the labels for image 2 start from the maximum label value of image 1, plus 1, and so on), meaning that at the end, they can be joined together into a common label image, as shown in Figure 8. The resulting regions are similar to the superpixels described in [11], but they have the added advantage that they are the result of semantic segmentation, with a high probability that each region belongs to a single object—they are grouped by meaning and not simply by color or texture properties, as in the case of superpixels.
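A minimal NumPy/SciPy sketch of this composite coding and per-code labeling step is given below (the function and variable names are illustrative, not the authors' exact implementation):

import numpy as np
from scipy import ndimage

def label_composite(tl, tr, bl, br):
    # tl, tr, bl, br: HxW binary quarter masks predicted by the CNN.
    # Returns an HxW label image where each connected region of pixels
    # sharing the same 4-bit quarter code gets a unique label.
    composite = (tl.astype(np.uint8)
                 | (tr.astype(np.uint8) << 1)
                 | (bl.astype(np.uint8) << 2)
                 | (br.astype(np.uint8) << 3))
    labels = np.zeros(composite.shape, dtype=np.int32)
    next_label = 1
    for code in range(1, 16):                       # the 15 obstacle codes
        mask = composite == code
        if not mask.any():
            continue
        cc, n = ndimage.label(mask)                 # connected components
        labels[mask] = cc[mask] + (next_label - 1)  # offset keeps labels unique
        next_label += n
    return labels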
The task is now re-stated as assigning a unique object identifier to each region. Essentially, it becomes a region-grouping problem but, again, we can exploit the semantics, because the quarter codes restrict the possible position of each region inside the whole object.
The grouping of the regions is based on generating multiple object hypotheses from the individual quarters. Because we have four types of quarters, we can generate a maximum of four hypotheses for each real-world object. Each quarter generates a complete object by extending its size to cover the other three missing quarters, assuming that the quarters are of a similar size. For example, a bottom-left quarter can be extended upwards and to the right to generate the possible complete object to which the quarter belongs. If the object is completely visible (not covered by another object and not at the edge of the image), four complete rectangular hypotheses will be generated, as seen in Figure 9.
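A possible implementation of this expansion step (the ExpandRectangle routine referenced later in Algorithm 1) is sketched below; the coordinate convention and function name are assumptions for illustration:

def expand_rectangle(x0, y0, x1, y1, quarter):
    # Expand a quarter's bounding box (x0, y0)-(x1, y1) into a full-object
    # hypothesis, assuming the four quarters have roughly equal size.
    # quarter is one of 'TL', 'TR', 'BL', 'BR'.
    w, h = x1 - x0, y1 - y0
    if quarter == 'TL':              # extend right and down
        return x0, y0, x1 + w, y1 + h
    if quarter == 'TR':              # extend left and down
        return x0 - w, y0, x1, y1 + h
    if quarter == 'BL':              # extend right and up
        return x0, y0 - h, x1 + w, y1
    if quarter == 'BR':              # extend left and up
        return x0 - w, y0 - h, x1, y1
    raise ValueError(quarter)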
Because the hypotheses outnumber the objects by a factor of up to four to one, the algorithm computes a pixel fitness score for all hypotheses, so that the best-fitting one can be selected. The pixel score S(R) for each hypothesis rectangle R is computed by counting the pixels overlapping the rectangle that fit their corresponding quarter: the quarter defined by the rectangle geometry must correspond to the quarter label assigned to the pixel by the CNN-based semantic segmentation. When counting matched pixels, we consider the best-suited segmentation quarter label: if a pixel has two labels (for example, a top-left label and a bottom-left label) and falls in the top-left quarter of the hypothesis rectangle, it is considered a match. This process is depicted in Figure 10.
The pixel score is then normalized by the area of the hypothesis rectangle, so that all pixel scores S(R) lie within the interval (0…1).
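The score computation could be sketched as follows; this is a sketch under the assumption that the matching is evaluated on the composite code image from the earlier snippet and that the normalization uses the full rectangle area, details the paper does not fix:

import numpy as np

# Quarter bit values matching the composite coding above
TL_BIT, TR_BIT, BL_BIT, BR_BIT = 1, 2, 4, 8

def pixel_score(composite, rect):
    # Fraction of pixels inside the hypothesis rectangle whose CNN quarter
    # label matches the quarter they occupy within that rectangle.
    x0, y0, x1, y1 = rect
    h, w = composite.shape
    x0c, y0c = max(x0, 0), max(y0, 0)
    x1c, y1c = min(x1, w), min(y1, h)
    if x1c <= x0c or y1c <= y0c:
        return 0.0
    window = composite[y0c:y1c, x0c:x1c]
    ys, xs = np.mgrid[y0c:y1c, x0c:x1c]
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Expected quarter bit for each pixel position inside the rectangle
    expected = np.where(ys < cy,
                        np.where(xs < cx, TL_BIT, TR_BIT),
                        np.where(xs < cx, BL_BIT, BR_BIT))
    matches = (window & expected) > 0          # any matching label counts
    return matches.sum() / float((x1 - x0) * (y1 - y0))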
The next step of the algorithm is to establish the dependency relationships between the hypotheses. An overlapping score is computed between any two rectangles Ri and Rj. If the overlapping score exceeds a threshold of 0.5, and the pixel score of Rj is higher than the pixel score of Ri, the rectangle Ri is labeled as being dependent on Rj. Essentially, we assume that the rectangle Ri depicts the same object as Rj but is a less adequate fit.
This process is depicted in Figure 11. The four rectangles are generated by the pixel-derived quarters, but the best fit is the rectangle generated by the top-right quarter. Therefore, all the other rectangles are labeled as being dependent on the rectangle hypothesis generated by the top-right quarter.
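The dependency labeling (step 5 of Algorithm 1 below) could be implemented as in the following sketch; the overlap measure is assumed here to be the intersection-over-union of the two rectangles, which the paper does not state explicitly:

def overlap(r1, r2):
    # Intersection-over-union overlap between rectangles (x0, y0, x1, y1)
    ix0, iy0 = max(r1[0], r2[0]), max(r1[1], r2[1])
    ix1, iy1 = min(r1[2], r2[2]), min(r1[3], r2[3])
    if ix1 <= ix0 or iy1 <= iy0:
        return 0.0
    inter = (ix1 - ix0) * (iy1 - iy0)
    a1 = (r1[2] - r1[0]) * (r1[3] - r1[1])
    a2 = (r2[2] - r2[0]) * (r2[3] - r2[1])
    return inter / float(a1 + a2 - inter)

def label_rectangles(rects, scores, threshold=0.5):
    # Propagate labels so that each rectangle ends up with the label of the
    # best-scoring hypothesis it strongly overlaps
    labels = list(range(len(rects)))
    changed = True
    while changed:
        changed = False
        for i, ri in enumerate(rects):
            for j, rj in enumerate(rects):
                if i != j and overlap(ri, rj) > threshold and scores[i] > scores[j]:
                    if labels[j] != labels[i]:
                        labels[j] = labels[i]
                        changed = True
    return labels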
Based on the rectangle hypotheses and their dependency relations, the individual regions presented in Figure 8 will receive their final label. This new label is the identity of the individual object of which the region is a part.
For each region, the number of pixels overlapping every rectangle hypothesis is counted. All rectangles are considered in this step, even those that are in a dependency relationship. The process has two stages: first, the rectangle hypothesis with the largest pixel overlap is selected for the region; second, the region receives the final label of that rectangle which, after the dependency resolution, identifies the best-fitting hypothesis of the object.
The final step of the algorithm is to label the pixels themselves with the identity of the object to which they belong. For this step, we simply transfer to the individual pixel the label of the region to which the pixel belongs, as this region has already been labeled with the identity of the object.
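The region-to-object assignment and the final pixel labeling can be sketched by reusing the regions, rectangles, and rectangle labels produced by the previous snippets (again, all names are illustrative):

import numpy as np

def label_regions(region_labels, rects, rect_labels):
    # Assign each part region to the rectangle hypothesis it overlaps the
    # most, then paint that object's identity onto the region's pixels
    out = np.zeros_like(region_labels)
    for region_id in np.unique(region_labels):
        if region_id == 0:
            continue
        ys, xs = np.nonzero(region_labels == region_id)
        overlaps = [np.sum((xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1))
                    for (x0, y0, x1, y1) in rects]
        best = int(np.argmax(overlaps))
        out[region_labels == region_id] = rect_labels[best] + 1
    return out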
The entire process is illustrated in Figure 12, where two partially overlapping objects are presented, together with their final label. The complete processing steps are described in Algorithm 1.
Algorithm 1 Instance segmentation
Input: TL, TR, BL, BR—top-left, top-right, bottom-left, and bottom-right binary images
Output: L—image labeled with the individual object identifiers
1. Identification of the connected regions from each of the four binary images:
   TL_Regions = ConnectedLabeling(TL)
   TR_Regions = ConnectedLabeling(TR)
   BL_Regions = ConnectedLabeling(BL)
   BR_Regions = ConnectedLabeling(BR)
2. Creation of the composite type image:
   For each (x, y):
      Composite(x, y) = BR(x, y) << 3 | BL(x, y) << 2 | TR(x, y) << 1 | TL(x, y)
3. Labeling of the composite image:
   Regions = []
   For type = 1 to 15:
      Regions = Regions ∪ ConnectedLabeling(Composite == type)
4. Generation of the hypothetical rectangles:
   TL_Rectangles = ExpandRectangle(TL_Regions)
   TR_Rectangles = ExpandRectangle(TR_Regions)
   BL_Rectangles = ExpandRectangle(BL_Regions)
   BR_Rectangles = ExpandRectangle(BR_Regions)
   R = TL_Rectangles ∪ TR_Rectangles ∪ BL_Rectangles ∪ BR_Rectangles
   S(R) = ComputeRectangleScores(R)
5. Labeling of the rectangles:
   For each rectangle Ri:
      LR(i) = i
   Repeat
      For each pair (Ri, Rj):
         If Overlap(Ri, Rj) > 0.5 and S(Ri) > S(Rj):
            LR(j) = LR(i)
   Until no change in LR
6. Labeling of the regions:
   For each Reg in Regions:
      R = argmax_R Overlap(R, Reg)
      L(Reg) = LR(R)
3.5. Object Refinement
We provide an additional post-processing step in order to improve the quality of the detected objects. The dynamic objects from the global segmentation head (described in Section 3.3) are used as a mask to refine the edges of the four quarters produced by the object detection head. The process is depicted in Figure 13, where the global semantic segmentation dynamic objects (column 2) are combined with the quarters output (column 3).
3.6. Vanishing-Point Computation
Numerous vanishing-point computation methods have been presented in the literature. Most methods are based either on line intersections or feature analysis, or use artificial neural networks to predict the 2D coordinates of the vanishing point [33]. In this work, we make use of the results of our previously published work [34] to produce the required training images for the neural network. Our previous paper presented an algorithm for detecting the vanishing point by computing the orientation and magnitude of the gradient, which are used to generate three vote-map images. The first image contains the vote map of the features from the left side of the input image, the second image contains the vote map of the features from the right side of the input image, and the third image is the multiplication of the first two (the left and right vote maps).
Figure 14 presents an example of the three vote-map images. The voting maps can be computed using classic image processing techniques (by computing the orientation of the gradients) or can be extracted directly from a fully convolutional CNN, such as the one proposed in this paper.
The coordinates of the vanishing point can be computed using a sliding window on the third image, or by simply extracting the maximum from the third vote-map image. Extracting the maximum adds only 0.09 ms of computation on average; therefore, we used this version. Extracting the VP coordinates could also be achieved directly, in an end-to-end manner, via a CNN with a different architecture; however, in our case, we preferred to leverage the shared layers of the encoder, producing the three vote-map images using semantic segmentation and then extracting the coordinates from the third output layer. The other two predicted layers (the left and right vote maps) can be used as input for a lane detection system on marked roads, or as an additional step to validate the extracted vanishing-point coordinates.
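Extracting the coordinates from the predicted combined vote map then amounts to a single argmax, as in the sketch below (function name is illustrative):

import numpy as np

def vanishing_point_from_vote_map(combined_votes):
    # Vanishing-point pixel coordinates: position of the maximum in the
    # combined (left x right) vote map predicted by the CNN
    y, x = np.unravel_index(np.argmax(combined_votes), combined_votes.shape)
    return int(x), int(y)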
Computing the vanishing point is relevant because it can be used to compute the extrinsic camera parameters if we assume some geometric constraints (a flat road and small lens distortion). The pitch and yaw angles of the camera with respect to the world can be computed if the focal distance is also known (along with the image size), as we presented in [33].
4. Training the Multi-Output CNN
Multi-task learning requires special datasets and databases for training. For the proposed solution, we used three well-known datasets: CityScapes [35], Berkeley Deep Drive (BDD) [36], and Mapillary [37]. All the images were processed to have the same size and aspect ratio; we also filtered the images to select only those featuring a large number of road pixels. After this selection, the numbers of images used for training were as follows: 2759 images from BDD, 2975 images from CityScapes, and 17,109 images from Mapillary. These databases contain the information relevant for semantic segmentation, as well as the bounding boxes of the obstacle objects, from which we could extract the four obstacle quarters. Each object bounding box was split into four quarters, and each quarter was then masked with the semantic segmentation maps in order to generate the top-left, top-right, bottom-left, and bottom-right binary images of the individual object instances. The process is presented in Figure 15.
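A minimal sketch of this pre-processing step, assuming each training instance is given as a binary mask plus its bounding box (the function name and coordinate convention are ours):

import numpy as np

def object_quarters(instance_mask, bbox):
    # Generate the four quarter masks (TL, TR, BL, BR) for one labeled
    # object instance, given its binary mask and bounding box (x0, y0, x1, y1)
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    tl = np.zeros_like(instance_mask); tr = np.zeros_like(instance_mask)
    bl = np.zeros_like(instance_mask); br = np.zeros_like(instance_mask)
    tl[y0:cy, x0:cx] = instance_mask[y0:cy, x0:cx]   # top-left quarter
    tr[y0:cy, cx:x1] = instance_mask[y0:cy, cx:x1]   # top-right quarter
    bl[cy:y1, x0:cx] = instance_mask[cy:y1, x0:cx]   # bottom-left quarter
    br[cy:y1, cx:x1] = instance_mask[cy:y1, cx:x1]   # bottom-right quarter
    return tl, tr, bl, br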
The prepared images from the datasets were also augmented during training, using random intensity and saturation adjustments in the HSV color space, as well as random image scaling and translation.
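The augmentation could look like the following OpenCV/NumPy sketch; the jitter ranges and the shared affine warp are illustrative choices, not the exact values used for training:

import cv2
import numpy as np

def augment(image, masks, max_shift=0.1, max_scale=0.2):
    # Random saturation/intensity jitter in HSV space
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= np.random.uniform(0.7, 1.3)      # saturation
    hsv[..., 2] *= np.random.uniform(0.7, 1.3)      # intensity (value)
    image = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    # Shared random scale-and-translate warp for the image and its labels
    h, w = image.shape[:2]
    s = 1.0 + np.random.uniform(-max_scale, max_scale)
    tx = np.random.uniform(-max_shift, max_shift) * w
    ty = np.random.uniform(-max_shift, max_shift) * h
    m = np.float32([[s, 0, tx], [0, s, ty]])
    image = cv2.warpAffine(image, m, (w, h))
    masks = [cv2.warpAffine(mk, m, (w, h), flags=cv2.INTER_NEAREST) for mk in masks]
    return image, masks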
The semantic segmentation module used binary cross-entropy together with the Sorensen–Dice [38] loss function, which is a modified version of the intersection-over-union loss. The loss function for the obstacle detection head was the same, and the vanishing point module also used the same loss function as the semantic segmentation; the generation of its training data is described in Section 3.6. During the training process, each loss function can be configured with a different weight; after experimenting with various settings, we used the same weight for all three loss functions.
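A minimal sketch of the combined loss, assuming a TensorFlow/Keras implementation (the paper does not specify the framework or the exact weighting between the two terms):

import tensorflow as tf

def bce_dice_loss(y_true, y_pred, smooth=1.0):
    # Binary cross-entropy term, averaged over all pixels and channels
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    # Sorensen-Dice term computed over the whole prediction tensor
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return bce + (1.0 - dice)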
Training the neural network was performed for a total of 500 epochs; it was executed on a system equipped with two Nvidia 1080 Ti GPUs featuring a total of 22 GB of memory. The patience parameter was set to 50 epochs; therefore, if the loss function did not improve for 50 epochs, the training process was stopped early. With this hardware, training one epoch took around 300 s, which means that a fully working multi-output CNN model can be obtained in less than 10 h (with early stopping). The ResNet-50 encoder was initialized with ResNet-50 weights pre-trained on ImageNet, which helped speed up the training process. We also tried freezing some layers during training, but this did not improve the final results.
5. Results and Evaluation
The same hardware setup was used for evaluating as well as training the model. The experimental setup is based on the Python programming language, with the project implemented using open-source deep learning software. The proposed multiple output network was evaluated using not only the existing popular datasets (mentioned in Section 4) but also our own datasets [17,33].
For the semantic segmentation output head, the evaluation was performed using the CityScapes validation dataset; the results were published in our previous work [16], where we used the free road space in order to detect on-road obstacles, and the segmented road surface was integrated into our monocular perception system, which is able to track the detected obstacles using a particle filter. For the road class, we obtained an IoU score of 0.90.
The obstacle detection output head that predicts the obstacle quarters provides the main information used for detecting the obstacles in the road traffic scene. Therefore, the obstacle detection module was also evaluated using the dataset presented in [17], which features road-traffic images captured with different camera systems from those used in the training datasets. For this dataset, the ground truth is considered to be the information obtained from a stereo-vision camera setup. We compared our results with our previously proposed system and the previously published results in [39,40]. On our own dataset, which features over 1000 images with object bounding boxes extracted from the stereo-vision data, we obtained an IoU score of 0.74. On the CityScapes validation dataset, we obtained an IoU score of 0.83 and, on the KITTI [41] dataset, a 0.70 IoU score. The results were also compared with a Yolo V3 [5] detector featuring a DarkNet backbone, which was trained on KITTI using the same loss as the initial Yolo paper, and with a Yolo V3 detector with the same ResNet-50 backbone as ours, trained on the same datasets as our proposed model. We also compared our results with the bounding boxes from Mask R-CNN and Yolact; their results are similar to ours, with minor differences (1–2% in favor of Mask R-CNN or Yolact). The bounding box obstacle detection results are presented in Table 1.
The stereo-vision database from Table 1 uses images acquired with a pair of cameras, from which we used only the left image of the setup. The ground truth data was considered to be the result of a stereo-tracking algorithm [42]. The main advantage of a stereo-vision setup is that overlapping objects can be easily separated based on their depth information. Still, our proposed system, using a monocular camera setup, obtained a high accuracy. The evaluation on complex city scenarios proved to be highly accurate, especially on the CityScapes dataset. When testing on the KITTI database, our results were poorer due to the different aspect ratio of the dataset images. To address this issue, we had two possibilities: either resize the KITTI image to fit the 256 × 256 pixel input of the CNN, or crop the KITTI image and then resize it. Both variants affect the evaluation results; the first choice produces a severe deformation of the objects in the traffic scene, which highly impacts the detection process, whereas the second choice (cropping the input image) would most likely exclude various objects in the scene from detection. For this evaluation, we used the first option.
For the 2D bounding box (object detection) evaluation, we obtained a marginal improvement (1%) over our previously published work [40], due to the fusion of the output heads (as described in Section 3.5).
The pixel-wise evaluation of the detection was also performed. We evaluated the instance segmentation on the CityScapes dataset and compared the results with Mask R-CNN and Yolact, tested on the same test data from the dataset. The results are presented in Table 2.
We also present some of the qualitative results of our proposed network, versus Yolact and Mask R-CNN, in Figure 16. Mask R-CNN extracts well-defined, more refined obstacle edges but, in some cases, it misses objects (as can be seen in the second row of Figure 16). The Yolact predictions sometimes miss objects from the scene, as seen in the second and third rows of Figure 16; the Yolact solution also seems to predict false positives (fourth row of Figure 16).
The popular Mask R-CNN network has a total of 64 million parameters, while Yolact has 50 million parameters; Yolo V3, with a Darknet backbone, features 41 million parameters, while our U-Net- and ResNet-based model features 32 million parameters in total for the three output heads (semantic segmentation, vanishing point, and obstacle quarters). The total prediction time for all three outputs was 0.043 s on average, whereas for the Mask R-CNN, the prediction time was 0.23 s and, for Yolo V3, it was 0.052 s. The Yolact network predicted the output in 0.041 s. If we removed the vanishing point output head, the network was reduced to 24 million parameters and the prediction time was then 0.034 s. If we further removed the semantic segmentation output head, we ended up with a 16-million-parameter network with a prediction time of 0.026 s.
All the results are presented in Table 3; the tests were performed on a desktop system equipped with an Intel i7 CPU and two Nvidia 1080 Ti GPUs, which were used for both training and prediction (evaluation).
The prediction time for all three output heads was similar to the one from Yolo V3, which only features obstacle detection (0.057 vs. 0.052 s, on average, on the same input test images). The total computational time for our system was higher, due to the extra post-processing required for the obstacle quarter grouping, labeling, and, finally, bounding box extraction (0.014 s); however, these steps were executed on a single CPU core and were not optimized. Nevertheless, the processing times were comparable.
The prediction and post-processing results of the proposed system, in various scenarios, are illustrated in Figure 17. A video with the results on the stereo-vision dataset is available at: https://vimeo.com/694007992 (accessed on 20 April 2022).
The third output head represents the vanishing-point prediction. The vanishing-point coordinates are extracted from the CNN-predicted mask, which takes an additional 0.09 ms on average. The results are presented in Table 4, as already published in our previous work [39].
The evaluation presented in Table 4 was performed using the CityScapes test set and the dataset from [33]. The metric used was NormDist, which represents the RMSE pixel error divided by the image diagonal.
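Following this definition, the metric can be computed as in the sketch below (assuming the RMSE is taken over the per-image Euclidean pixel errors):

import numpy as np

def norm_dist(predicted_vps, ground_truth_vps, image_width, image_height):
    # NormDist: RMSE of the vanishing-point pixel error, divided by the
    # image diagonal
    errors = np.linalg.norm(np.asarray(predicted_vps, dtype=float)
                            - np.asarray(ground_truth_vps, dtype=float), axis=1)
    rmse = np.sqrt(np.mean(errors ** 2))
    return rmse / np.hypot(image_width, image_height)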