EnRDeA U-Net Deep Learning of Semantic Segmentation on Intricate Noise Roads
Abstract
1. Introduction
2. EnRDeA U-Net Framework
2.1. Embedded Residual U-Net with Attention Module
2.2. Encoder and Decoder
2.2.1. Encoder
2.2.2. Decoder
3. Dataset Analysis
3.1. Intricate Road Characteristics
3.2. Three Categories of Validated Datasets
- In fully cloudy conditions, the pixels at the edge between road and grass showed lower grayscale values than in full sunshine, making the status of the distant road ambiguous and causing segmentation to fail.
- In full-sunshine conditions, road pixels exhibited, on average, higher grayscale contrast than non-road areas, particularly where the road ran beneath large trees and the sunlight was partially shadowed into bright spots. In such cases, the grayscale values of sunlit pixels were markedly higher than those in shadowed regions. Moreover, intense sunlight blurred the texture features of the road surface, and the distant road appeared as bright, high-intensity regions, posing a further challenge for segmentation (see the grayscale-contrast sketch after this list).
- Many complex symbols and markings further increased the difficulty of road segmentation. For example, long-term intense sunshine produced numerous cracks in the road surface, and traffic markings such as white arrows and lines, black manhole covers, zebra crossings, and speed bumps added further challenges.
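The sunlit/shadow grayscale gap described above can be made concrete with a few lines of image statistics. The following is a minimal sketch, not part of the paper's pipeline: the sample file name is hypothetical, and Otsu thresholding is our illustrative choice, not the authors' method, for splitting road pixels into bright and dark sets and reporting their contrast gap.

```python
# Minimal sketch (not from the paper): quantifying the sunlit/shadow grayscale
# gap on a road image. The image path and the Otsu split are assumptions made
# for illustration only.
import cv2

img = cv2.imread("campus_road.jpg")  # hypothetical sample frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu's method picks a threshold separating bright (sunlit) from dark (shadowed) pixels.
thresh, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

sunlit = gray[mask == 255]
shadow = gray[mask == 0]
print(f"Otsu threshold: {thresh:.0f}")
print(f"mean sunlit gray: {sunlit.mean():.1f}, mean shadow gray: {shadow.mean():.1f}")
print(f"contrast gap: {sunlit.mean() - shadow.mean():.1f}")
```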
3.3. U-Net Extensions Comparison
4. Experimental Setup and Results
4.1. Training Dataset and Data Enhancement
4.2. Segmentation Results and Analysis
4.3. Index Evaluation
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| Net Version | U-Net | Residual U-Net | Attention U-Net | EnRDeA U-Net |
|---|---|---|---|---|
| Net Structure | Encoder-Decoder | Encoder with Residual Block | Decoder with Attention Gate | Pairwise Encoder/Decoder with Residual Block and Attention Gate |
| Number of Parameters | 34.53 M | 48.53 M | 34.88 M | 52.02 M |
| Characteristic Performance | Original method; lower training cost | Better segmentation efficiency; higher training cost | Noise reduction; lower training cost | Superior segmentation efficiency; moderate training cost |
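For readers unfamiliar with the two building blocks the table names, below is a minimal PyTorch sketch of a residual convolution block and an additive attention gate in the style of Oktay et al.; the channel sizes and layer layout are illustrative assumptions, not the paper's exact EnRDeA configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with a 1x1 shortcut, as used in residual U-Net encoders."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # match channels for the residual sum
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class AttentionGate(nn.Module):
    """Additive attention gate (Oktay et al.) that reweights encoder skip features."""
    def __init__(self, gate_ch, skip_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv2d(gate_ch, inter_ch, 1)
        self.w_x = nn.Conv2d(skip_ch, inter_ch, 1)
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, g, x):
        # g: decoder (gating) features; x: encoder skip features at the same spatial size.
        alpha = self.psi(torch.relu(self.w_g(g) + self.w_x(x)))  # attention map in [0, 1]
        return x * alpha

# Shape check with illustrative channel counts.
res = ResidualBlock(3, 64)
gate = AttentionGate(gate_ch=64, skip_ch=64, inter_ch=32)
feat = res(torch.randn(1, 3, 128, 128))
print(gate(feat, feat).shape)  # torch.Size([1, 64, 128, 128])
```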
| U-Net Model | PA (Pixel Accuracy) | CPA (Class Pixel Accuracy) | MIoU (Mean IoU) |
|---|---|---|---|
| U-Net | 95.45% | 94.56% | 89.56% |
| Residual U-Net | 96.04% | 95.14% | 90.60% |
| EnRDeA U-Net | 96.68% | 95.48% | 91.77% |
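PA, CPA, and MIoU are conventionally computed from the class confusion matrix; the sketch below shows one standard formulation. The confusion-matrix entries are toy values for a two-class road/non-road case, not the paper's data.

```python
import numpy as np

def segmentation_metrics(conf):
    """PA, mean CPA, and MIoU from a KxK confusion matrix,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf)
    pa = tp.sum() / conf.sum()                          # pixel accuracy
    cpa = (tp / conf.sum(axis=1)).mean()                # mean per-class pixel accuracy
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return pa, cpa, iou.mean()                          # MIoU

# Toy 2-class (road / non-road) confusion matrix; numbers are illustrative only.
conf = np.array([[9500,  300],
                 [ 400, 8800]])
pa, cpa, miou = segmentation_metrics(conf)
print(f"PA={pa:.2%}  CPA={cpa:.2%}  MIoU={miou:.2%}")
```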