1. Introduction
Road safety is a crucial task for transportation [1]. With the increase in the number of vehicles and the continuous growth in road usage, urban roads and highways are experiencing varying degrees of deterioration. The hot summer season exacerbates the situation, as the high volume of vehicles generates significant heat, leading to road cracking and collapse. Once a road becomes damaged, the deterioration process accelerates, giving rise to more severe road defects [2]. These road defects not only pose risks to traffic safety but also detract from the urban landscape and reduce the service life of the roads. Studies by Teschke and Ahlin [3,4] have shown that driving on roads in poor condition results in increased vibration, adversely affecting vehicle occupants' health. Excluding regular traffic facilities such as manhole covers and construction sites, potholes and irregular road cracks should be treated as abnormal features requiring prompt identification [5]. Timely road detection holds immense potential for reducing maintenance costs, prolonging the service life of roadways, and enhancing ride comfort for road users.
The road detection process comprises two main steps. The first step involves inspectors collecting road image information using various carriers. The second step entails employing different detection methods to process and analyze the acquired road images.
In the first step, various methods are employed to capture road defect images. They have evolved from manual and semi-automatic to fully automatic detection. Traditional manual road detection has gradually become obsolete due to its inefficiency and time-consuming nature. Semi-automatic inspection stores the collected defect images on a hard disk, after which an operator manually traces the cracks to mark and analyze them. Both approaches suffer from subjectivity and low accuracy. To address these limitations, an increasing number of studies have turned to mobile filming equipment for road image acquisition [6,7]. This approach promises to improve the efficiency and accuracy of the road detection process. Mei et al. [7] used inexpensive devices, such as smartphones, to acquire images of road surfaces and created a dataset containing 600 images of road cracks. In current studies, road inspection vehicles with laser sensors are more accurate [8,9,10,11]; however, they are expensive and difficult to popularize. Nowadays, UAVs, with their small size, low cost, flexibility, mobility, and ability to perform parallel inspections of multiple roads, are gaining importance in structural health inspections of civil engineering infrastructure [12,13,14,15]. Zhu et al. [16] proposed a dataset of 3151 road defects collected by UAVs. Owing to their low price, UAVs are gradually becoming a road and bridge inspection tool for small maintenance enterprises and funding-constrained local governments. However, that dataset was collected in a limited location and context, covering only Dongji Avenue in Nanjing, Jiangsu Province, China.
In the second step, road detection algorithms fall into two categories: traditional computer image processing (including machine learning) and computer vision techniques in the framework of deep learning. Traditional computer image processing techniques are currently more mature, such as filter-based detection methods [17], the road defect spectral index [18], local binary pattern methods [19], and machine learning methods like AdaBoost [20]. These methods commonly suffer from limited generalization capability and low detection accuracy when applied to diverse road conditions. With the rapid development of deep learning, object detection and segmentation are extensively used in the field of defect detection [21,22]. Singh et al. [23] employed a Mask R-CNN on the Road Damage Dataset 2020 [24], demonstrating that Mask R-CNN is as effective on this task as on common object categories. Road defect detection algorithms based on YOLO (You Only Look Once) or VGG neural network models have achieved better results in several road detection tasks [6,16,25,26]. Wu et al. [27] proposed YOLOv5-LWNet for road damage detection on mobile terminals. However, there remain few studies combining UAV-based road defect object detection with deep learning, and several problems persist: (1) current UAV-based detection datasets are limited in their number of samples, lacking representativeness and generalization capability; (2) existing studies have primarily focused on utilizing existing models [16] or employing computer image processing methods to enhance detection accuracy; (3) general-purpose object detection algorithms may excel on datasets such as MS COCO and Pascal VOC, but their performance can degrade on road defect images due to the substantial variation in defect shapes. There remains a scarcity of research on modifying algorithms according to dataset characteristics and image features.
This research aims to address the challenges of high cost and low efficiency in road detection tasks by utilizing unmanned aerial vehicles (UAVs) for road inspections. To overcome the issue of limited dataset universality, we meticulously collected diverse data to facilitate detection in various scenarios. Through an in-depth analysis of the dataset characteristics and image features, an object detection network specifically designed for road defect images was devised, resulting in a more efficient and effective road defect detector.
This research collected 2517 images containing road defects as the Road Defect Dataset, covering a diverse range of road scenarios and possessing good universality. In addition, a data augmentation scheme is designed according to the road materials and image features to enhance robustness, including photometric transformation, geometric transformation, and other methods. To obtain an object detection algorithm better suited to road detection from the UAV perspective, a network based on YOLOv5 for road defect images is designed. Yang et al. [6] argued that crack-like road defects have characteristics similar to image edges in terms of shape and structure. The Explicit Vision Center Block (EVC Block) [28] is therefore adopted as an improvement strategy: it not only extracts the features that distinguish crack-type defects from loose defects at edges but also captures long-distance dependencies in large crack-like defects, improving classification. Guo et al. [29] noted that images whose detection objects are defects often contain distracting factors that demand a more flexible and adaptable detector: (1) more background interference; (2) difficulty in separating objects from the background; (3) large variation in the size of defects within the same class; (4) cluttered defect locations; and (5) objects obscuring each other. The Swin Transformer Block [30] is applied to help the model exploit the high correlations in visual signals, giving it a more dynamic perceptual field and more flexible access to road defect contextual information. Finally, considering the complex representation of road defect images, Gaussian Error Linear Units (GELUs) [31] are used as the activation function of the model in this study. The experiments demonstrate that the nonlinear fitting capability of the GELU activation function exceeds that of the ReLU [32] or ELU [33]. Furthermore, experimental tricks and UAV flight settings are proposed to boost detection performance and stability in real-world flight scenarios.
The main contributions of this paper are as follows:
A Road Defect Dataset is built. It includes the common categories of road defects and covers multiple road backgrounds and traffic conditions with precise manual annotations. To ensure its validity and universality, several data augmentation tricks suitable for UAV-perspective road defect images are applied to the dataset.
An improved YOLOv5 algorithm named RDD-YOLOv5 (Road Defect Detection YOLOv5) is proposed. Considering the complexity and diversity of road defects, the BottleneckCSP (C3) module is replaced with a self-attention model called the SW Block, and a spatially explicit vision center module called the EVC Block is added to the neck to capture long-range dependencies and aggregate local critical regions. Finally, the activation function is replaced with GELU to boost the model's fitting ability.
The experiment establishes a UAV flight platform, including an accurate mathematical model of flight altitude and flight speed, to improve the efficiency and image quality of UAV-based road defect image collection. A few experimental tricks are provided to further enhance performance: the anchors are recalculated to improve positioning accuracy for the unusually shaped bounding boxes in the dataset developed in this study, and label smoothing is applied to mitigate the impact of manual annotation errors during training, enhancing precision.
The remainder of this paper is organized as follows: Section 2 reviews related studies on pavement crack detection. Section 3 provides a detailed description of the structure and features of the dataset, along with the specific data augmentation procedures employed. Section 4 introduces the network architecture of RDD-YOLOv5. Section 5 provides a comprehensive overview of the experimental process, comparison, and ablation experiments, whose results explore the performance of the proposed method. Section 6 presents a detailed discussion of the experimental results, analyzing their validity and practical applications. Section 7 summarizes the research findings and potential future work.
3. Dataset
Existing road defect datasets suffer from limitations such as scenario homogeneity and sample insufficiency. For instance, the public dataset CRACK500 [54] covers only 500 images with pixel-level annotated binary maps. Zhu [16] collected a dataset containing only 300 full-size images from a single road, which were subsequently processed and segmented into 3191 images without other backgrounds or transportation facilities. To address these limitations, a comprehensive collection of representative road defect images was conducted across multiple regions of China. The dataset comprises 2517 high-resolution images of different sizes with diverse scenes.
This section introduces detailed information about the dataset's content and statistical characteristics. Additionally, a data augmentation scheme is proposed to overcome the statistical limitations and improve the dataset's quality. The Road Defect Dataset is open access and can be found at https://github.com/keaidesansan/Roadcrack_Dataset_2517.git (accessed on 26 September 2023).
3.1. Dataset Collection
The Road Defect Dataset covers multiple provinces across China, including Hebei, Jilin, Henan, and Shandong. To increase the universality of the training results, it is beneficial to collect images from various road scenarios, helping the model learn the variations and patterns of road defects that occur in natural environments. The dataset contains six scenarios, as shown in Figure 1: (1) country roads; (2) campus roads; (3) urban feeder roads; (4) urban arterial roads; (5) suburban roads; and (6) urban expressways. Images were captured at different times of day to emulate diverse lighting conditions. As Figure 1 shows, road surfaces undergo various degrees of damage through long-term use. The images encompass a comprehensive range of road types, which contributes to the model's transferability.
In order to align with the actual distribution of road defect occurrences and morphological characteristics, this paper categorizes road defects into four main classes: transverse cracks, longitudinal cracks, alligator cracks, and potholes.
3.2. Dataset Characteristics
Statistical analysis shows that the Road Defect Dataset contains 2517 images with a total of 14,398 instances, a volume substantial enough to warrant systematic characterization. The dataset has the following characteristics:
(1) Realistic distribution of defects: Figure 2a shows that the number of instances varies greatly across defect classes, with longitudinal cracks being the most prevalent. This distribution closely aligns with real-world road damage scenarios, lending authenticity to the dataset.
(2) Diverse defect shapes: As shown in Figure 2b, which compares the aspect ratios of bounding boxes across multiple datasets (the Road Defect Dataset, CrackTree200, the CrackForest Dataset, and China Drone from RDD2022), the Road Defect Dataset constructed in this study exhibits a relatively uniform distribution of aspect ratios, from below 0.1 to above 10. This indicates the diverse range of object shapes in the dataset, presenting a significant challenge for models in regressing target bounding boxes.
(3) Complexity of image features: As shown in Figure 2c, over 80% of the images in the Road Defect Dataset contain two or more instances, offering more intricate image features than the Pascal VOC dataset and showing more similarity to MS COCO. The complex nature of these images necessitates robust algorithms capable of effectively detecting and distinguishing multiple instances within a single image.
(4) Diverse instance sizes: As shown in Figure 2d, in CrackTree200, the CrackForest Dataset, and China Drone from RDD2022, the majority of images contain only a single instance, which places relatively low demands on model performance. By contrast, road defects in our dataset exhibit varying sizes, typically ranging from 0.2% to 16% of the image area at high resolution; statistical analysis determined that the largest road defects occupy approximately 69% of the image area, further emphasizing the diversity in instance sizes. Consequently, the model faces a substantial challenge in extracting robust features at different scales to accurately identify and analyze road defects.
Summarizing the aforementioned features, the Road Defect Dataset is a comprehensive and dependable resource comparable to datasets of a similar nature. However, these characteristics also present three notable challenges: (1) the uneven distribution of defect categories; (2) the precision of road defect identification; and (3) the accuracy of defect localization. The following work devises strategies to address these challenges.
3.3. Dataset Augmentation
To address the first challenge identified in Section 3.2, the uneven distribution of defect categories, this paper proposes a data augmentation strategy. Unlike simplistic approaches such as mirroring or adding noise, the data augmentation in this research is organized into three schemes: photometric distortion, geometric distortion [60], and intelligent transformation. These schemes effectively expand the dataset by simulating various shooting scenes. Photometric transformations involve adjustments to brightness, hue, and other parameters, enhancing the dataset's diversity; caution is needed to avoid washing out the crucial details of road defects. Geometric transformations encompass changes in image orientation and shape. However, unsuitable geometric transformations must be avoided: crack-like defects depend heavily on intrinsic features such as the direction of extension, and excessive geometric transformations may invalidate the annotations, producing misclassified images. Intelligent transformations, primarily implemented through Mix-class algorithms such as MixUp [61], CutMix [62], and FMix [63], play a crucial role in sample enhancement by facilitating feature migration. MixUp generates new training samples by linearly interpolating between pairs of samples and their labels. CutMix is a variation of MixUp that cuts out a random rectangular region of the input image and replaces it with a patch from a different training image. FMix is a further variation that cuts out an arbitrarily shaped region and replaces it with a patch from another image.
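To make the Mix-class operations concrete, the following is a minimal NumPy sketch of MixUp and CutMix. It is an illustration under stated assumptions, not the exact configuration used in this study: the Beta parameter alpha and the classification-style label vectors are illustrative (detection box labels would be merged analogously).

```python
import numpy as np

def mixup(img1, img2, label1, label2, alpha=1.0):
    """MixUp: linearly interpolate two images and their label vectors."""
    lam = np.random.beta(alpha, alpha)
    mixed_img = lam * img1.astype(np.float32) + (1 - lam) * img2.astype(np.float32)
    mixed_label = lam * label1 + (1 - lam) * label2
    return mixed_img, mixed_label

def cutmix(img1, img2, label1, label2, alpha=1.0):
    """CutMix: paste a random rectangle from img2 into img1; mix labels by area."""
    lam = np.random.beta(alpha, alpha)
    h, w = img1.shape[:2]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    out = img1.copy()
    out[y1:y2, x1:x2] = img2[y1:y2, x1:x2]
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (h * w)  # actual area ratio kept from img1
    return out, lam_adj * label1 + (1 - lam_adj) * label2
```

FMix follows the same pattern but samples an arbitrarily shaped binary mask from low-frequency noise instead of a rectangle.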
In summary, this paper proposes a set of 14 data augmentation methods that serve as a "bag of tricks" strategy, filtering out ineffective techniques in order to enhance the model's robustness on road defect images. The photometric transformations comprise Gaussian blur, Gaussian noise, Poisson noise, brightness adjustment, and hue adjustment. The geometric transformations comprise random crop, random shift, random rotation, random flipping, mirroring, and random cut-out. The Mix transformations comprise MixUp, CutMix, and FMix. Notably, an image's brightness, contrast, and blur affect detection accuracy: for the same image, road defects become severely distorted when the photometric scale factor exceeds a certain range, which can increase the false detection rate (background false positives). Thus, the photometric transformations employed in our methodology incorporate a careful determination of the scale factors, as follows:
1. Gaussian blur
Gaussian blur smooths the image and simulates the motion blur caused by drone shots in flight. This effect reproduces the inherent motion and vibration encountered during image capture, mitigating unwanted jaggedness or pixelation, and can also reduce the disturbances caused by Gaussian noise. An effective range of kernel parameters is used to simulate the blurring of images captured by the UAV in flight:
$I_{\mathrm{blur}}(x,y) = G(x,y) * I(x,y)$ (1)

where $I_{\mathrm{blur}}$ is the image after Gaussian blur is applied, $G$ is the Gaussian kernel, $I$ is the original image, and $*$ is the 2D convolution operation.
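As a concrete illustration, a minimal sketch of this transformation with OpenCV follows; the kernel-size range is an assumed example, not the exact factor range determined in this study.

```python
import cv2
import numpy as np

def random_gaussian_blur(image: np.ndarray) -> np.ndarray:
    """Apply Gaussian blur with a randomly chosen odd kernel size.

    The kernel range (3-7) is an illustrative bound meant to mimic mild
    in-flight motion blur without destroying thin crack features.
    """
    k = int(np.random.choice([3, 5, 7]))  # kernel must be odd and positive
    return cv2.GaussianBlur(image, (k, k), sigmaX=0)  # sigma derived from kernel size
```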
2. Gaussian noise
Gaussian noise is a prevalent technique for bolstering a model's robustness in deep learning. Excessive noise, however, results in the loss of image detail and quality, so appropriate parameters, namely the mean and variance of the Gaussian noise, must be found:

$I_{\mathrm{noise}}(x,y) = I(x,y) + N(\mu, \sigma^{2})$ (2)

where $I_{\mathrm{noise}}$ is the image after Gaussian noise is added and $N(\mu, \sigma^{2})$ denotes Gaussian noise with mean $\mu$ and variance $\sigma^{2}$.
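A minimal sketch of this step is shown below; the default mean and sigma are illustrative values, not the parameters selected in this study.

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, mean: float = 0.0, sigma: float = 8.0) -> np.ndarray:
    """Add zero-mean Gaussian noise and clip back to the valid 8-bit range."""
    noise = np.random.normal(mean, sigma, image.shape)
    return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```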
3. Poisson noise
The Poisson distribution models the number of photons detected by each pixel of an image, simulating the statistical characteristics of photon noise under low-light conditions. Incorporating Poisson noise into the road defect detection task simulates outdoor lighting and weather changes:

$I_{\mathrm{poisson}}(x,y) = \mathrm{Poisson}\big(\lambda \cdot I(x,y)\big) / \lambda$ (3)

where $I_{\mathrm{poisson}}$ is the image after Poisson noise and $\mathrm{Poisson}(\cdot)$ denotes sampling from the Poisson distribution. Notably, the scaling factor $\lambda$ for the Poisson distribution should be added to reduce the effect of photon noise.
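A minimal sketch of the Poisson resampling follows; the default lambda is an illustrative assumption.

```python
import numpy as np

def add_poisson_noise(image: np.ndarray, lam: float = 30.0) -> np.ndarray:
    """Resample each pixel from a scaled Poisson distribution.

    lam is the scaling factor: larger values weaken the photon-noise effect.
    """
    scaled = image.astype(np.float32) * lam / 255.0
    noisy = np.random.poisson(scaled) * 255.0 / lam
    return np.clip(noisy, 0, 255).astype(np.uint8)
```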
4. Brightness adjustment
Adjusting the brightness of an image simulates the range of lighting conditions encountered in testing, enhancing the model's adaptability to lighting variation. Effective brightness adjustment requires limiting the lighting parameters to an appropriate range:

$I_{\mathrm{bright}}(x,y) = \alpha \cdot I(x,y) + \beta$ (4)

where $I_{\mathrm{bright}}$ is the image after brightness adjustment, $\alpha$ is a scaling factor that adjusts the overall brightness of the image, and $\beta$ is an offset value that controls the brightness level of the image.
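The linear form of Equation (4) maps directly to code; the bounds on alpha and beta below are illustrative, since overly aggressive factors distort defects.

```python
import numpy as np

def adjust_brightness(image: np.ndarray, alpha: float = 1.0, beta: float = 0.0) -> np.ndarray:
    """Linear brightness adjustment I' = alpha * I + beta, clipped to 8-bit range."""
    out = alpha * image.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)
```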
5. Hue adjustment
Different road materials exhibit distinct colors; among commonly used road materials, asphalt and concrete stand out prominently. In the HSV color space, the hue value of an asphalt road is approximately 30–40 (yellow-orange), while that of a concrete road is around 0–10 (gray). Slight variations in hue occur under different lighting conditions, adding to the natural complexity of the environment. In this research, the hue adjustment is confined to 180–230 to ensure consistency:

$H_{\mathrm{new}} = H + \Delta H$ (5)

where $H_{\mathrm{new}}$ is the hue of the image after adjustment, $H$ is the original hue value of a pixel, and $\Delta H$ is the hue shift value.
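A minimal OpenCV sketch of the hue shift follows; note that OpenCV stores 8-bit hue in [0, 179], so this assumed implementation wraps the shift modulo 180, and the default shift value is illustrative.

```python
import cv2
import numpy as np

def adjust_hue(image_bgr: np.ndarray, delta_h: int = 10) -> np.ndarray:
    """Shift the hue channel in HSV space and convert back to BGR."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    h = ((h.astype(np.int32) + delta_h) % 180).astype(np.uint8)  # hue wraps at 180
    return cv2.cvtColor(cv2.merge([h, s, v]), cv2.COLOR_HSV2BGR)
```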
4. Methods
This section presents a concise overview of the architecture of RDD-YOLOv5, emphasizing the location, mechanism, and advantages of three key modules integrated into RDD-YOLOv5 to enhance its performance: the SW Block (Swin Transformer Block [30]), the EVC Block (Explicit Vision Center Block [28]), and CBG (Convolution Batch Normalization GELU [31]). In summary, the SW Block elevates the model's capacity to capture long-range dependencies, the EVC Block improves the extraction of road defect features, and the replacement of the activation with GELU contributes to the overall performance enhancement.
4.1. Introduction of YOLOv5
YOLO-series algorithms divide the image into equally sized grid cells and infer class probabilities and bounding boxes for each cell with a convolutional neural network (CNN). The network comprises four parts: input, backbone, neck, and prediction head.
Input: YOLOv5 adaptively computes the initial anchor sizes. During the training stage, the model predicts bounding boxes based on the anchors and compares them with the ground truth to calculate the discrepancies; the network parameters are then updated iteratively during backward propagation to minimize these differences. Additionally, multiple techniques, such as Mosaic data augmentation and adaptive image scaling, are incorporated in YOLOv5 to enhance its inference speed.
Backbone: The Focus module performs a slice operation on the input; compared to conventional downsampling, it reduces information loss. In YOLOv5-6.0, the Focus module has been replaced with a convolutional layer (Conv) with kernel size 6, stride 2, and padding 2. Based on the CSP (Cross Stage Partial) architecture, the C3 module is the primary module for learning residual features. It consists of two branches: one utilizes multiple stacked Bottleneck blocks and three standard convolutional layers, while the other passes through a basic convolutional module; the outputs of both branches are then concatenated. SPPF replaces SPP in YOLOv5 for its faster performance: it passes the input through a series of MaxPool layers and fuses the successive outputs, which is equivalent to pooling with kernels of different sizes at lower cost. This approach helps address the multi-scale nature of the detection task.
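To illustrate the slice operation that Focus performs (and that the stride-2 6×6 convolution in YOLOv5-6.0 replaces), the following is a minimal PyTorch sketch; the module and its 1×1 fusion convolution are illustrative simplifications, not the exact YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into 4 spatial sub-grids, stack on channels, then convolve.

    A (B, C, H, W) input becomes (B, 4C, H/2, W/2) before the convolution,
    so the 2x downsampling discards no pixels. YOLOv5-6.0 replaces this whole
    module with an equivalent Conv2d(C, out, kernel_size=6, stride=2, padding=2).
    """
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))
```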
Neck: A combination of the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) structures is employed in YOLOv5. FPN constructs a classical feature pyramid through a top-down pathway with lateral connections, building high-level semantic feature maps at all scales. After many network layers, the localization information from the bottom layers has become ambiguous, so the PAN structure adds bottom-up paths to compensate for and strengthen it.
Prediction head: YOLOv5 produces predictions at three scales: small, medium, and large. The model's total loss consists of three parts: classification loss (BCE loss), confidence loss (BCE loss), and bounding box loss (CIoU loss). Notably, the confidence loss is assigned weights on the three prediction scales, as shown in Equation (6):

$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{box}} + \sum_{i=1}^{3} \lambda_{i}\, \mathcal{L}_{\mathrm{obj}}^{(i)}$ (6)

where $\mathcal{L}_{\mathrm{cls}}$, $\mathcal{L}_{\mathrm{box}}$, and $\mathcal{L}_{\mathrm{obj}}^{(i)}$ are the classification, bounding box, and per-scale confidence losses, respectively, and $\lambda_{i}$ is the weight assigned to prediction scale $i$.
4.2. Overview of RDD-YOLOv5
This work centers on optimizing the backbone and neck to meet the requirements of road defect detection. (1) Backbone: exploiting the shifted-window self-attention that the Swin Transformer Block applies to image patches, the SW Block is introduced to alter the generation of the smallest-scale feature map of the otherwise CNN-based YOLOv5. This strategy strengthens the model's ability to capture long-range dependencies within images. (2) Neck: the EVC Block is embedded in this part. This strategy takes advantage of the multi-scale feature fusion of the FPN+PAN structure to extract more complex features and fuse them with shallow-layer features, capturing and representing complex information across different scales. (3) Whole structure: YOLOv5-6.0 uses the SiLU (Sigmoid-Weighted Linear Unit [64]) as its activation. This research employs GELU (Gaussian Error Linear Unit) to promote the model's nonlinear fitting capability, at the cost of slightly slower convergence, leading to a boost in overall performance. The structure of RDD-YOLOv5 is shown in Figure 3; the arguments of the network are listed in Table 1.
4.3. SW Block (Swin Transformer)
Figure 2d shows that the largest instance in the dataset covers a ratio of about 0.70 of the image size, typically an alligator crack. Interference from the road surface background can significantly alter the semantic information of road defects, leading to misclassification and mis-segmentation. To enhance discrimination of surface defects with larger extents or complex backgrounds, contextual information must be collected from a broader neighborhood. However, traditional convolutional networks are limited in capturing global context because of their focus on local features, which hampers their ability to learn global relationships and to model object-to-object relationships. In computer vision, the relationships of interest include pixel-to-pixel, pixel-to-object, and object-to-object. While the first two can be modeled using convolutional layers and region-of-interest alignment (RoIAlign) [65], modeling the last type lacks a well-established method.
Convolutional neural networks (CNNs), composed primarily of convolution and pooling layers, extract features through localized perception of the image matrix. However, they capture only local information, losing inter-data relationships. Dosovitskiy et al. [66] proposed the Vision Transformer (ViT), which uses a self-attention mechanism to model object-to-object relationships and exhibits solid global modeling capabilities. However, ViT poses a challenge for practical applications because of its prohibitively high computational complexity. Liu et al. [30] proposed the Swin Transformer, which reduces the computational complexity of self-attention through window partitioning and shifted-window operations. The shifted-window operation allows information exchange between non-overlapping windows, facilitating communication across the feature map. Window-based self-attention thus enables a genuinely global focus on dependencies between image feature blocks through window interaction, overcoming the computational challenge while capturing long-range relationships. Therefore, the Swin Transformer Block (SW Block) is introduced as an improvement strategy, leading to improved performance in defect classification tasks.
SW Block is an abbreviation for Swin Transformer Block. Its architecture comprises two successive blocks with nearly the same structure as in the Swin Transformer proposed by Liu et al. [30], as shown in Figure 4. The first block uses the standard window-based multi-head self-attention mechanism; the second replaces it with shifted-window multi-head self-attention. The SW Block is added after the smallest-scale feature map output at the end of the backbone, which contains more semantic information at a smaller image size, decreasing computational complexity. The self-attention mechanism enables effective modeling of context. Equation (7) offers a means to estimate the computational complexity of the self-attention modules in RDD-YOLOv5:
$\Omega(\mathrm{W\text{-}MSA}) = 4hwC^{2} + 2M^{2}hwC$
$\Omega(\mathrm{SW\text{-}MSA}) = 4hwC^{2} + 2M^{2}hwC$ (7)

where $\Omega(\mathrm{W\text{-}MSA})$ is the computational complexity of the window-based multi-head self-attention module and $\Omega(\mathrm{SW\text{-}MSA})$ is that of the shifted-window-based module; $h$ and $w$ represent the size of the feature map, $C$ is the number of channels, and $M$ is the fixed window size, set to 7 by default in this paper. With the shifted-window operation, the computational complexity of window-based self-attention grows linearly, rather than quadratically, with the size of the feature map, reducing the model's computational cost.
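As a quick numeric check of Equation (7), the following sketch compares the cost of global self-attention with window-based self-attention; the feature map size and channel count are assumed example values, not the actual dimensions of RDD-YOLOv5.

```python
def msa_flops(h: int, w: int, c: int) -> float:
    """Global multi-head self-attention: 4*h*w*C^2 + 2*(h*w)^2*C."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c

def wmsa_flops(h: int, w: int, c: int, m: int = 7) -> float:
    """(Shifted) window-based self-attention: 4*h*w*C^2 + 2*M^2*h*w*C."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c

# Example: a hypothetical 20x20 feature map with 512 channels.
h, w, c = 20, 20, 512
print(f"global MSA : {msa_flops(h, w, c):.3e}")   # quadratic in h*w
print(f"window MSA : {wmsa_flops(h, w, c):.3e}")  # linear in h*w
```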
The incorporation of the SW Block enhances the model's ability to center on object-to-object relationships, enabling more effective capture of long-range dependencies. The ablation experiment in Section 5.5 validates the rationale behind its incorporation.
4.4. EVC Block (Explicit Vision Center Block)
Some issues remain to be solved, such as erroneous image segmentation that loses important features, especially for defects with extensive areas. For instance, an elongated crack may have part of its extent unrecognized or assigned to a different bounding box, so that multiple defects are labeled on the same image; this misrepresents the degree of damage to the road surface and ultimately affects highway maintenance decisions. Furthermore, identifying the various types of road defects relies on recognizing local key features such as curvature. Consequently, better capture of both global and local features is important to resolve this problem effectively.
The YOLOv5 neck adopts an FPN + PAN structure, which enables the lower-level feature maps to carry stronger semantic information and improves the model's ability to detect features at different scales through upsampling and feature fusion. Quan et al. [28] proposed a Centralized Feature Pyramid network with a spatially explicit visual center (EVC) scheme. The EVC Block contains a lightweight MLP and a learnable visual center (LVC): the lightweight MLP captures global long-range dependencies, and the LVC aggregates local key regions. Longitudinal cracks and potholes often share similar backgrounds, which can be resolved by letting the lower-level feature maps, rich in positional and detail information, obtain stronger semantic features. Therefore, the EVC Block is introduced into the neck network. Its addition not only solves the problem of incorrectly labeled large-scale road defects but also helps identify the subtle features of road defects, such as curvature, in deep features.
EVC Block is an abbreviation for Explicit Visual Center Block. It consists of two processes, namely a lightweight MLP and a learnable visual center (LVC), as shown in Figure 5.
In the lightweight MLP, the input features $X_{\mathrm{in}}$ are normalized by group normalization twice: they first pass through a depthwise-convolution-based module in the first block and then through a channel-MLP-based module in the second block. The output features pass through channel scaling and DropPath, and residual connections are implemented. This process can be summarized by Equation (8):

$\tilde{X}_{\mathrm{in}} = \mathrm{DConv}\big(\mathrm{GN}(X_{\mathrm{in}})\big) + X_{\mathrm{in}}$
$X_{\mathrm{out}} = \mathrm{CMLP}\big(\mathrm{GN}(\tilde{X}_{\mathrm{in}})\big) + \tilde{X}_{\mathrm{in}}$ (8)

where $\mathrm{CMLP}$ represents the channel MLP, $\mathrm{GN}$ group normalization, and $\mathrm{DConv}$ the depthwise convolution; $X_{\mathrm{in}}$ represents the input feature, $\tilde{X}_{\mathrm{in}}$ the output of the first block, and $X_{\mathrm{out}}$ the output of the second and last block (channel scaling and DropPath are omitted for brevity).
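A minimal PyTorch sketch of this residual structure follows; it is an assumed simplification in which channel scaling and DropPath are dropped, 1×1 convolutions stand in for the channel MLP, and all module names are illustrative.

```python
import torch
import torch.nn as nn

class LightweightMLP(nn.Module):
    """Sketch of the EVC lightweight MLP: GN -> depthwise conv (+residual),
    then GN -> channel MLP (+residual). Channel scaling / DropPath omitted."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        self.gn1 = nn.GroupNorm(1, channels)
        self.dconv = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=channels)  # depthwise conv
        self.gn2 = nn.GroupNorm(1, channels)
        self.cmlp = nn.Sequential(                          # 1x1 convs as channel MLP
            nn.Conv2d(channels, channels * expansion, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels * expansion, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.dconv(self.gn1(x))    # first block: depthwise conv branch
        return x + self.cmlp(self.gn2(x))  # second block: channel MLP branch
```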
The LVC consists of multiple convolutional layers and an encoding operation. After passing through four convolutional layers, the features are fed into a codebook, where a set of scaling factors ensures that the input features and the codebook factors correspond to their respective positional information. After the entire image is mapped onto the codewords, the outputs are fed into a layer with ReLU and batch normalization, then into a fully connected layer and a convolutional layer, which predict channel-wise coefficients. This process is described in Equation (9):

$Z = \tilde{X} \oplus \big(\tilde{X} \otimes \delta\big)$ (9)

where $\tilde{X}$ represents the output from the last convolution layer, $\delta$ is the channel-wise coefficient vector predicted from the codebook output, $\otimes$ denotes channel-wise multiplication, $\oplus$ denotes element-wise addition, and $Z$ represents the output from the LVC block.
Combining the outputs of the lightweight MLP and the LVC yields the deeper features of Equation (10):

$X_{\mathrm{EVC}} = \mathrm{cat}\big(X_{\mathrm{out}};\, Z\big)$ (10)

where $\mathrm{cat}(\cdot;\cdot)$ denotes concatenation along the channel dimension.
In RDD-YOLOv5, the EVC Block is added within the neck: after upsampling generates a larger deep feature map, the EVC operation is conducted. The deep features are then fused with shallow features, enabling the latter to benefit from semantic information. This process is repeated twice to ensure a comprehensive fusion of shallow and deep features.
4.5. Convolution Batch Normalization GELU (CBG)
The essence of an activation function is to add non-linearity to a neural network by applying a nonlinear transformation to the linear transformation $Wx + b$. Besides non-linearity, an important property of a model is its ability to generalize, which calls for stochastic regularization; the input to a neuron is thus shaped by both the nonlinear activation and stochastic regularization. Hendrycks et al. [31] proposed the GELU function, which fits nonlinearities better than ReLUs or ELUs. GELU stands for Gaussian Error Linear Unit, a formulation combining properties of dropout, zone-out, and ReLUs; the authors introduced the idea of stochastic regularization into the activation function itself, which intuitively matches our understanding of natural phenomena. The approximate calculation of GELU is given in Equation (11); as can be seen, the GELU activation is more complex than the others:

$\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,\big(x + 0.044715x^{3}\big)\right]\right)$ (11)
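The following short sketch evaluates the tanh approximation of Equation (11) against the exact erf-based definition; it is a standalone illustration, not code from the paper.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation from Equation (11)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
```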
YOLOv5 uses SiLU as the activation function. As shown in Figure 6a, SiLU has a smooth, continuously differentiable shape, which helps reduce the risk of vanishing gradients. GELU has a similar shape in terms of smoothness and continuity; in contrast to SiLU, its slight S-curve is advantageous for capturing more intricate relationships within the data. Figure 6b shows that both GELU and SiLU have non-zero gradients for all input values, but the derivative of GELU is more complex, which can help the model capture more subtle patterns in the data. In this study, SiLU is replaced with GELU, and mAP and other metrics are compared in the experiments of Section 5. The results show that replacing the activation function of YOLOv5 with GELU is a wise choice for the Road Defect Dataset.
6. Discussion
Aiming at the challenges conventional object detection algorithms face in detecting road defects in complex traffic scenarios, and at the need for lightweight deployment in UAV inspection, this research proposes a novel approach: a specialized dataset named the Road Defect Dataset was collected, and an object detection algorithm was devised for road defect detection.
This research builds upon the YOLOv5 model, employing three strategies. Firstly, the SW Block, with its shifted-window-based multi-head self-attention mechanism, is embedded as an efficient feature extraction module. Secondly, the explicit vision center is integrated into the multi-scale feature fusion network to facilitate feature fusion and precise object localization. Thirdly, the SiLU activation is replaced by GELU at a slight cost in training speed. To address the irregular label aspect ratios compared with standard datasets, an enhanced K-means algorithm is employed to compute optimal initial anchor boxes. The experimental part introduces further techniques, such as label smoothing and models of UAV flight altitude and speed, which round out the proposed approach.
Comparison and ablation experiments were conducted to validate the effectiveness of the proposed detection algorithm and the dataset. The ablation studies demonstrated that the algorithmic improvements proposed in this research contribute cumulatively to performance: K-means++, the SW Block, the EVC Block, and CBG provide improvements of 1.71%, 0.93%, 1.99%, and 0.74%, respectively. The comparison experiments consist of two parts. The first demonstrates the superiority of RDD-YOLOv5 against several mainstream models (Mask R-CNN, Faster R-CNN, SSD, YOLOv3, YOLOv4, YOLOv5, etc.); the proposed model achieves an mAP of 91.5%, improving the mean average precision by 2.5% over the original model with a model volume of 24 MB, a relatively small size for lightweight UAV-based road detection. While its mAP is not as high as YOLOv7's, it offers the advantage of lightweight detection suitable for real-time deployment on drones. The second part compared the Road Defect Dataset (RDD) with several open-source road datasets, including CrackTree200, the CrackForest Dataset, and China Drone in RDD2022. Multiple experiments demonstrated that these open-source datasets generally suffer from accuracy issues and are difficult to use for practical road detection, whereas RDD-YOLOv5 consistently exhibited excellent performance across multiple datasets, with improvements in detection accuracy ranging from 1% to 4%.
7. Conclusions
This paper presents a comprehensive approach to road defect detection, addressing various challenges through the construction of a diverse dataset, data augmentation methods, and improvements to the YOLOv5 algorithm. The dataset's richness and specificity make it a valuable resource for drone-based road detection tasks, while the proposed data augmentation techniques tailor it to unmanned aerial vehicle (UAV) road defect imagery. In UAV road defect inspection, RDD-YOLOv5 achieves an accuracy of 91.5% while meeting the lightweight model target with a size of only 24 MB. Through the network optimizations, it accurately predicts the specific types and locations of road defects, ensuring practical and efficient detection. The established UAV flight height and speed equations further enhance the quality and effectiveness of the detection process without increasing computational demand, making the algorithm more suitable for real-world applications. This research provides an integrated and complete road detection method, serving as a technological reference for drone-based road inspections.
While the proposed method achieved favorable results in practical road defect detection, there are drawbacks and opportunities for further improvement. (1) The proposed model still has a relatively large number of parameters, which imposes certain requirements on the devices it runs on; it is therefore necessary to reduce the volume of RDD-YOLOv5 through pruning and optimization to make it more lightweight. (2) The inherent drawback of drone inspection is limited battery life, making long-distance continuous road inspection challenging; drones are better suited to short-distance inspections in complex urban scenarios. (3) The dataset contains relatively few types of road defects, which may be insufficient for the diverse range of road surface inspection tasks in urban environments; for example, it does not include types such as asphalt oil spills and loose road surfaces, which poses challenges for the detection task. Expanding the dataset to cover a broader range of defect types would increase its universality and applicability.
Looking ahead, future road defect detection tasks will demand even higher precision in localizing defects, considering complex traffic environments and potential occlusion by vehicles. Addressing the detection of multiple and small objects in complex scenarios will be critical to advancing the field. Exploring object segmentation algorithms to calculate the sizes of road defects and evaluate the extent of road damage will further contribute to the algorithm's practicality and accuracy.