1. Introduction
Because the core of an aero-engine usually works in a harsh environment of high temperature, high pressure, high load, high speed, and strong corrosion, the serviceability and reliability of its core components decline over time, affecting the aircraft's long-term capability [1]. High-speed rotor parts, including compressor and turbine blades, often suffer deformation, dents, tearing, cracking, high-temperature discoloration, and other structural damage, as shown in Figure 1. The main sources of damage are the centrifugal inertia force generated by high-speed rotation, the aerodynamic force generated by the gas flow, fatigue caused by long operating times, and impact by foreign objects [2]. Therefore, timely and accurate detection and repair of blade structural damage during maintenance not only extends the service life of the aircraft but also ensures personnel safety [3].
At present, non-destructive testing is widely applied to detect structural damage in engine blades [4]. Wang et al. [5] used ultrasonic technology to detect small cracks in blades. They first produced blades with different degrees of damage on a vibration test bench and then used non-linear ultrasonic testing to measure the non-linear coefficients of the damaged blades. Through a large number of experiments and statistical analyses, they derived the empirical non-linear coefficients and an equivalent crack size formula, which can be used for blade crack evaluation. Eddy current testing, which is based on the principle of electromagnetic induction, can also be used to detect faulty blades. Xie et al. [6] designed a new type of flexible eddy current array sensor that generates eddy currents through excitation and sensing coils. When the eddy currents pass through a damaged part of the blade, they change significantly, and the type, location, and size of the damage can be derived from this change. Yang et al. [7] designed an automatic eddy current detection system with six degrees of freedom, and their experiments show that its detection sensitivity is very high. Liu et al. [8] used digital radiography for non-destructive testing of gas turbine blades, with remarkable results. Karatay et al. [9] investigated fluorescein isothiocyanate-conjugated Escherichia coli as a penetrant that can be used to detect blade cracks.
Because an engine has many types of components and a complex structure, dismantling it is difficult and costly. Borescope (hole probe) inspection inserts a camera into the engine through a small hole to inspect the blades in real time, which avoids disassembling the engine. By changing the probe insertion depth, the probe direction, and the rotor angle, maintenance personnel can collect video and images of different blade stages at different rotor angles and then analyze and judge the state of the blades [10]. However, this method still relies on visual inspection and is therefore subjective: different mechanics inspecting the same damaged blade may reach different conclusions, which carries a certain risk. In addition, the interior of the engine is dark and the contrast is low, so faults are easily hidden; the engine also contains a large number of blade stages, so the inspection workload is heavy. Both factors place very high demands on maintenance personnel [11].
In recent years, artificial intelligence technology has continued to develop and is widely used in various industries. Target detection technology has a strong feature learning ability [12] and can be combined with borescope inspection to help maintenance personnel identify and process image information, make faster and more accurate decisions, and improve maintenance capability. Currently popular deep learning models for detection and segmentation include ResNet [13], the R-CNN series [14], the YOLO series [15], the FCN series [16], and Mask R-CNN [17].
A growing number of researchers have applied target detection techniques to engine borescope inspection. Anurag et al. [18] designed a U-Net architecture to detect defects on high-pressure compressor blades; the detection precision and recall exceeded 90%, but the sensitivity to small defects was low. Li et al. [19] fed high-resolution engine blade images into a deep convolutional neural network (DCNN) and proposed a coarse classifier to filter out most of the background; the defects were then detected by a fine detector module, which showed better detection performance. Zhang et al. [20] applied YOLOv3 to damage detection on aero-engine blades and achieved a balance between detection accuracy and detection speed. Li et al. [21] proposed an improved YOLOv4 detection model that fuses shallow and deep features, improves the PANet structure, and employs focal loss; the improved model achieves an average accuracy of 90.1% at 24.82 FPS, which falls short of real-time detection. Li et al. [22] introduced deformable convolution and depthwise separable convolution into YOLOv5s and used k-means clustering to optimize the anchor frames; the detection precision reached 93.3% and the recall reached 76.2%, but the model had 7,928,117 parameters, a weight file of 15.3 MB, and an average detection time of 28.8 ms. The model has high precision, but its parameter count is too large and its detection speed is slow, so further improvement is needed. Cai et al. [23] restructured YOLOv5, changing the backbone network to FasterNet and introducing depthwise separable convolution and GSConv in the neck. The improved model reduces the number of parameters by 52.5% compared with YOLOv5, reaches 61 FPS, and achieves an average accuracy of 89.6%, which still leaves much room for improvement in both accuracy and speed. Shang et al. [3] constructed an enhanced shape Mask R-CNN network with three functions: damage pattern separation, damage localization, and damage region segmentation. The model pays more attention to texture information; although its detection accuracy is higher, it is a two-stage detector, which may bring high computational cost and slow detection speed, so it is still not well suited to borescope inspection. Li et al. [24] embedded the CBAM attention module into YOLOv7 and used Alpha-GIOU as the coordinate loss function. The average accuracy of the improved model was 96.1% and the FPS reached 85.92, but the parameter size was 36.52 MB; although the accuracy is higher, the model is larger and places high demands on mobile devices. In summary, although many researchers have applied target detection to engine borescope equipment, they have not been able to balance detection accuracy, detection speed, and model size, and there is still considerable room for development. Therefore, this paper reorganizes the YOLOv5s model with current mainstream improvement modules to try to solve the above problems.
The YOLO network has been updated from version 1 to the current version 10; considering model size, detection accuracy, and detection speed, YOLOv5 still has strong application value. Compared with other YOLO versions, YOLOv5 balances detection performance against model size. It consists of three main parts: the backbone network, the feature fusion (neck) network, and the detection head [15], and it is well adapted to small target detection, with high speed and detection accuracy [25]. According to the number of channels and convolutions, the YOLOv5 family is divided into YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which are similar in structure but different in size. YOLOv5s, which has the fewest channels and convolutions, is the lightest of these; its small amount of computation makes it faster and more adaptable to mobile terminal devices [2]. YOLOv5s processes an image in four steps. First, it applies Mosaic data augmentation, size scaling, and other operations to the input images: four input images are stitched together with random scaling, random cropping, and random arrangement, which enriches the dataset, reduces the memory requirements of the device, and accelerates training. Second, in the backbone network, the Focus module preprocesses the data to reduce computation and increase speed; three Convolutional Block Layers (CBL) and Cross Stage Partial networks (CSP) are then used alternately for feature extraction, which deepens the network and prevents overfitting and forms the core of the backbone; Spatial Pyramid Pooling (SPP) fuses local and global features, and the features extracted at different stages are passed to the neck network. Third, in the neck network, the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN) perform enhanced feature fusion and extraction, fusing the features extracted at different stages of the backbone and those obtained from the detection network, to improve the robustness and generalization of the detection. Finally, the feature maps output by the neck network are convolved to predict the bounding boxes, categories, and confidence levels of the detection targets at different scales [26].
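The Focus step described above can be illustrated with a minimal pure-Python sketch. It assumes the slicing used in the public YOLOv5 implementation, in which every second pixel is taken to form four half-resolution sub-images that are then stacked along the channel axis; the function name `focus_slice` and the nested-list image are illustrative only and are not taken from this paper.

```python
def focus_slice(channel):
    """Split one H x W channel (nested lists, H and W even) into four
    half-resolution sub-images by taking every second pixel."""
    top_left = [row[0::2] for row in channel[0::2]]
    bottom_left = [row[0::2] for row in channel[1::2]]
    top_right = [row[1::2] for row in channel[0::2]]
    bottom_right = [row[1::2] for row in channel[1::2]]
    # In the network these four slices become four channels of the output.
    return [top_left, bottom_left, top_right, bottom_right]


# A 4 x 4 toy channel whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
slices = focus_slice(image)
```

Each slice keeps one quarter of the pixels, so spatial resolution is halved while no pixel information is discarded before the first convolution.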
Compared with other algorithms, YOLOv5s balances detection accuracy and detection speed, but it still has too many parameters to be embedded in mobile device terminals. To address the shortcomings of the above research, we improve the YOLOv5s algorithm.
2. Algorithms Overview
To better adapt the model to aero-engine blade crack detection, this paper proposes a detection algorithm based on MobileNetv3, YOLOv5, and GSConv (hereinafter referred to as the MobGSim-YOLO algorithm), the structure of which is shown in Figure 2. The K-means++ algorithm is introduced in place of K-means, and the anchor frame sizes are re-selected to improve detection accuracy. The backbone network is replaced with the lightweight MobileNetv3 module to perform the initial feature extraction. In the neck, the lightweight GSConv convolution replaces the ordinary convolution, and the activation function is replaced with Hard Sigmoid to improve multi-scale fusion and enhance small target feature extraction. The interior of the engine is dark, structurally complex, and noisy, and blade cracks are usually small; because deep feature fusion is often not conducive to extracting small targets, the SimAM attention mechanism is introduced in the feature fusion part to make the model pay more attention to small target features. Compared with the YOLOv5s algorithm, the proposed model updates almost the whole framework and retains only a few original modules, which substantially improves detection speed and compresses model size while maintaining detection accuracy.
Target detection involves two tasks: determining the target location and identifying the target category. YOLOv5s generates its rectangular candidate (anchor) frames by K-means clustering, which randomly selects K samples as the initial cluster centers. This cannot prevent the initial centers from being similar to one another, the resulting positional error of the anchor frames affects model accuracy, and the value of K is difficult to determine. Therefore, in this paper, the K-means++ algorithm is used to optimize the selection of cluster centers so as to obtain the most suitable anchor frames. The steps are as follows:
- (1)
Randomly select a sample from the dataset as the initial cluster center, as close to the edge as possible;
- (2)
For each sample $x$, calculate the distance to each current cluster center and take the minimum distance $D(x)$;
- (3)
Calculate the probability $P(x)$ that the sample in (2) will be the next cluster center. The formula is given as follows:
$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$
where $X$ is the set of all samples.
Repeat steps (2) and (3) until K cluster centers are selected, and then iterate the clustering from these centers to obtain the final result. Table 1 shows the results of anchor frame generation. The comparison shows that the anchor frames obtained by K-means clustering have a center deviation and are small, while those obtained by K-means++ clustering have a small center deviation and an appropriate size and can contain most of the cracks. In Table 1, w denotes the length of the anchor frame and h denotes its width. Therefore, the K-means++ clustering algorithm is more reliable in target detection.
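The seeding steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance between two anchor frames is taken to be 1 − IoU of their (w, h) pairs, a common choice for YOLO anchors that the text does not specify; the first center is drawn uniformly at random rather than "close to the edge"; and the function names and sample boxes are hypothetical.

```python
import random


def iou_wh(a, b):
    """IoU of two boxes given as (w, h), both anchored at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeanspp_seeds(boxes, k, seed=0):
    """Pick k initial anchor centers with K-means++ seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(boxes)]  # step (1): first center chosen at random
    while len(centers) < k:
        # step (2): squared distance of each box to its nearest chosen center
        d2 = [min(1.0 - iou_wh(b, c) for c in centers) ** 2 for b in boxes]
        # step (3): sample the next center with probability D(x)^2 / sum D(x')^2
        centers.append(rng.choices(boxes, weights=d2, k=1)[0])
    return centers


# Hypothetical (w, h) pairs of labeled crack boxes, in pixels.
boxes = [(10, 40), (12, 45), (50, 20), (55, 25), (100, 100), (110, 90)]
seeds = kmeanspp_seeds(boxes, k=3)
```

Because a box that is already a center has distance 0, it receives zero sampling weight and cannot be drawn again, which is what spreads the initial centers apart. The seeds would then be refined by ordinary K-means iterations.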
The backbone network is an important part of a deep learning model and is mainly responsible for extracting and learning features. The backbone network of YOLOv5s uses the complex C3 structure, which is slow at detection and has many parameters and a large amount of computation; this places high demands on the hardware, so it cannot be used directly in borescope equipment. Intelligent borescope devices need to embed deep learning models into mobile terminals with limited memory and performance, which strictly limits the computation and memory that a model may use. Several factors need to be considered when choosing a suitable backbone network, including the characteristics and size of the image dataset and the task-specific requirements. After comprehensive consideration, we choose the MobileNetv3 model as the backbone network and improve it.
MobileNet is a family of lightweight neural network models proposed by Google, which has now been updated to its fourth generation. It relies on a lightweight model design and is widely used in projects with limited hardware resources and computing power. Compared with the other versions, MobileNetv3 gives better results on this dataset; it uses the inverted residual structure, depthwise separable convolution, the SE attention mechanism, and the h-swish activation function [1].
The principle of traditional multi-channel convolution is shown in Figure 3. Let the size of the input feature map be $D_F \times D_F \times M$, the size of each convolution kernel be $D_K \times D_K \times M$, and the number of kernels be $N$. The input feature map is convolved channel by channel with each kernel, and the per-channel results are summed to obtain one channel of the output feature map. The number of parameters is:
$$P_{\mathrm{std}} = D_K \times D_K \times M \times N.$$
Depthwise separable convolution can be understood as a channel-by-channel (depthwise) convolution followed by a point-by-point (pointwise) convolution; its working principle is shown in Figure 4. The size of the input feature map is again $D_F \times D_F \times M$, and each channel is convolved with its own $D_K \times D_K \times 1$ kernel to obtain the intermediate feature map. The intermediate feature map is then convolved point by point with $N$ kernels of size $1 \times 1 \times M$, and the results are summed to form the output feature map; the number of pointwise kernels equals the number of channels of the output feature map. The number of parameters is:
$$P_{\mathrm{dw}} = D_K \times D_K \times M + M \times N.$$
The ratio of the number of parameters of depthwise separable convolution to that of conventional multi-channel convolution is:
$$\frac{P_{\mathrm{dw}}}{P_{\mathrm{std}}} = \frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^2}.$$
It can be seen that depthwise separable convolution makes the model simpler and reduces its parameters, i.e., with the same number of parameters, depthwise separable convolution can extract deeper features. Further details can be found in the literature [27].
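As a worked example of the parameter comparison above, the following sketch counts the weights of a 3 × 3 convolution that maps 32 input channels to 64 output channels. Biases are ignored, and the function names and layer sizes are illustrative assumptions rather than values from this paper.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights of an ordinary convolution: one k x k x C_in kernel per output channel."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel, in_channels, out_channels):
    """Weights of a depthwise (k x k per channel) plus pointwise (1 x 1) convolution."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


std = standard_conv_params(3, 32, 64)        # 3 * 3 * 32 * 64
dws = depthwise_separable_params(3, 32, 64)  # 3 * 3 * 32 + 32 * 64
ratio = dws / std                            # equals 1/64 + 1/9
```

For this layer the depthwise separable version needs 2336 weights instead of 18432, roughly one eighth, which matches the closed-form ratio of one over the number of output channels plus one over the squared kernel size.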
Because depthwise separable convolution has fewer parameters, the model may lose many key features; to maintain accuracy, MobileNetv3 also introduces the SE attention mechanism. The SE attention mechanism works from the channel perspective: it learns a weight for each channel of the feature map so that channel features are learned more accurately. First, the feature map is compressed into a vector by global average pooling, i.e., each channel is represented by a single number; then the weight of each channel is generated by two fully connected layers; finally, the generated channel weights are used to rescale the original feature map to obtain the final feature map. Details can be found in the literature [28], and the schematic diagram is shown in Figure 5.
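The three SE steps (squeeze, excitation, scale) can be traced on a toy feature map with the pure-Python sketch below. It assumes the original SE design with a ReLU between the two fully connected layers and a sigmoid gate, omits biases, and uses hand-picked weights; the function name and all numbers are illustrative, not taken from this paper.

```python
import math


def se_block(feature_map, w1, w2):
    """feature_map: list of C channels, each an H x W nested list.
    w1: (C/r) x C weights, w2: C x (C/r) weights (biases omitted)."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Scale: multiply every value of a channel by that channel's weight.
    scaled = [[[v * s for v in row] for row in ch]
              for ch, s in zip(feature_map, scales)]
    return scales, scaled


fmap = [[[1.0, 1.0], [1.0, 1.0]], [[4.0, 4.0], [4.0, 4.0]]]  # 2 channels, 2 x 2 each
w1 = [[0.5, 0.5]]      # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channel weights
scales, out = se_block(fmap, w1, w2)
```

With these toy weights the first channel is emphasized and the second is suppressed; in a trained network the two fully connected layers learn which channels to emphasize.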
An inverted residual structure replaces the traditional ResNet [13] residual structure. The inverted residual structure first uses pointwise convolution to increase the number of channels of the feature map, then uses depthwise convolution to extract features in the high-dimensional space, which reduces the parameters, and finally uses pointwise convolution to reduce the number of channels. It also introduces the ReLU6 activation function, which can be expressed as:
$$\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$$
In addition, to ensure the nonlinearity, lightness, and accuracy of the activation function, MobileNetv3 introduces the h-swish function, which can be expressed as:
$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$$
Traditional activation functions such as Sigmoid are exponential in nature, so their derivatives are complex and computationally intensive in the gradient calculation, whereas the h-swish function is built from piecewise linear ReLU6 terms and is at most quadratic, which makes it simple to compute and significantly reduces the computational cost.
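The two activation functions can be written directly from their standard MobileNetv3 definitions, ReLU6(x) = min(max(0, x), 6) and h-swish(x) = x · ReLU6(x + 3) / 6. The short sketch below evaluates them at a few points; the sample inputs are chosen only for illustration.

```python
def relu6(x):
    """ReLU clipped at 6."""
    return min(max(0.0, x), 6.0)


def h_swish(x):
    """Hard swish: x scaled by a piecewise linear gate in [0, 1]."""
    return x * relu6(x + 3.0) / 6.0


values = [h_swish(x) for x in (-4.0, -3.0, 0.0, 3.0, 5.0)]
```

The function is zero for inputs at or below −3, equal to the input at or above 3, and quadratic in between, so it needs only comparisons, one addition, and one multiplication.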
The smart borescope device has to run on a mobile terminal, which places strict requirements on model size, while ordinary convolution has limited feature extraction ability and many parameters. We therefore use the lightweight GSConv module in place of ordinary convolution, together with the VOVGSCSPC module [29], i.e., the slim-neck. The slim-neck was first proposed for autonomous driving systems, which require not only high detection accuracy but also good real-time performance; these requirements are essentially the same as those of the smart borescope in this paper. GSConv is more effective in lightweight networks than other convolutional modules; it combines the ideas of GhostNet and ShuffleNetv2 and applies depthwise separable convolution more skillfully. GhostConv first uses a small number of ordinary convolutions to compress the feature channels, which reduces computation, and then applies an identity mapping to obtain the intrinsic features; at the same time, it extracts the ghost features through low-cost depthwise separable convolution and finally concatenates the two types of features to obtain the final features. GSConv uses a smaller ordinary convolution to generate a set of basic features and then generates a second set of features from them with depthwise separable convolution. The two groups of features are then concatenated and shuffled across channels so that information circulates between the groups, which enhances the generalization ability and forms the final convolution result. GSConv has more outputs than the ordinary convolution, but its computational cost is still kept low. Its schematic diagram is shown in Figure 6.
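The concatenate-and-shuffle step of GSConv can be illustrated with channel labels alone. The sketch below tracks only which branch each output channel comes from and performs no actual convolution; it assumes a two-group shuffle in the style of ShuffleNet, and the function names and labels are hypothetical.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels taken from `groups` equal-sized groups."""
    per_group = len(channels) // groups
    grouped = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    return [grouped[g][i] for i in range(per_group) for g in range(groups)]


def gsconv_channel_layout(out_channels):
    """Channel bookkeeping only: half dense, half depthwise, then shuffle."""
    half = out_channels // 2
    dense = [f"conv{i}" for i in range(half)]  # from the ordinary convolution
    cheap = [f"dw{i}" for i in range(half)]    # from the depthwise convolution
    return channel_shuffle(dense + cheap, groups=2)


layout = gsconv_channel_layout(4)
```

After the shuffle, channels from the ordinary convolution and from the depthwise convolution alternate, so any following layer sees both kinds of features in every local group of channels.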
Two GSConv modules and one ordinary convolution are combined in parallel to form the GS bottleneck, the structure of which is shown in Figure 7. This module is then combined with ordinary convolution to form the VOVGSCSPC module, which is obtained by ablation experiments and the one-shot aggregation method; its structure is shown in Figure 8. For details, please refer to the literature [29].
The activation function for the shuffle operation in GSConv is the Sigmoid function. The Sigmoid function and its derivative are shown in Figure 9. Both ends of the function are saturated regions in which the derivative is close to 0, which causes serious gradient vanishing problems: when the network is deep enough, the gradient gradually disappears, reducing the convergence speed of the network. In addition, the Sigmoid function is exponential, so its computational cost is high. Although the slim-neck structure greatly improves performance, the limited size of the aero-engine blade dataset makes the model prone to overfitting; for this reason, this paper redesigns the slim-neck and optimizes its structure again.
In this paper, the Hard Sigmoid function is used to replace the original activation function. Its core idea is to approximate the Sigmoid function with a piecewise linear function.
A comparison of the Sigmoid function and the Hard Sigmoid function is shown in Figure 10. Compared with the original activation function, the Hard Sigmoid function has a more stable gradient in the saturation region and is less prone to gradient vanishing. To verify whether replacing the activation function improves the performance of the original model, ablation experiments are also carried out in this paper.
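To make the piecewise linear approximation concrete, the sketch below compares Sigmoid with one common form of Hard Sigmoid, clip((x + 3) / 6, 0, 1). This particular form is an assumption: it is the definition used by several deep learning frameworks, but other slopes are also in use and the exact expression adopted in this paper is not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    # Assumed form: clip((x + 3) / 6, 0, 1); other slopes are also in use.
    return min(max((x + 3.0) / 6.0, 0.0), 1.0)


# Largest absolute difference between the two functions on [-8, 8], step 0.1.
max_gap = max(abs(sigmoid(i / 10) - hard_sigmoid(i / 10)) for i in range(-80, 81))
```

Under this assumed form the two functions agree at x = 0 and stay within about 0.07 of each other over the sampled range, while Hard Sigmoid needs no exponential.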
In the dark and noisy interior of an aero-engine, a deeper model can in theory learn a more complex feature representation, which helps to distinguish between noise and small targets. At the same time, however, the repeated down-sampling operations that come with increasing network depth enlarge the receptive field, so the representation of small targets on the feature map becomes sparse or is even lost. To overcome this difficulty, we introduce an attention mechanism to emphasize important regions. Attention modules are widely used in computer vision because they allow the model to focus on more important information and ignore unimportant information. Aero-engine blade crack detection differs from traditional target detection in two respects: first, blade cracks tend to be long, varied in shape, and inconspicuous; second, the engine interior has low brightness, a complex structure, a small visible area, and heavy background noise. To overcome these difficulties while keeping the design lightweight, we adopt the SimAM attention module shown in Figure 11. It assigns 3D weights to the feature map and increases the attention paid to the target, so that the model can compute the local similarity between the target region and neighboring regions, capture the texture features of the image, and improve crack recognition accuracy. Compared with other popular attention mechanisms such as ECA, CBAM, and SE, the biggest advantage of SimAM is that it adds no parameters to the original network, so it remains lightweight while maintaining considerable accuracy [11]. This attention module takes inspiration from human neurons and introduces an energy function to assign weights: the larger the energy difference between a neuron and its surrounding neurons, the more important the neuron is and the more attention it deserves. For details, please refer to the literature [11].
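The parameter-free weighting can be sketched for a single channel in pure Python. The sketch assumes the closed-form inverse energy used in the publicly available SimAM reference code, with regularization constant λ = 1e-4; the function name and the toy channel are illustrative only.

```python
import math


def simam_channel(channel, lam=1e-4):
    """Apply SimAM to one H x W channel (nested lists); returns (weights, output)."""
    values = [v for row in channel for v in row]
    n = len(values) - 1
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    variance = sum(map(sum, sq_dev)) / n
    # Inverse energy: larger for neurons that stand out from the rest of the channel.
    inv_energy = [[d / (4.0 * (variance + lam)) + 0.5 for d in row] for row in sq_dev]
    weights = [[1.0 / (1.0 + math.exp(-e)) for e in row] for row in inv_energy]
    output = [[v * w for v, w in zip(r, wr)] for r, wr in zip(channel, weights)]
    return weights, output


# One bright pixel on a dark background, loosely mimicking a small crack response.
weights, output = simam_channel([[0.0, 0.0], [0.0, 4.0]])
```

The pixel that differs most from the channel mean receives the largest weight, and the computation involves no learnable parameters, which is why the module adds nothing to the model size.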
3. Experiments and Analysis of Results
The intelligent detection process for aero-engine blade cracks is shown in Figure 12. First, images of aero-engine blades with cracks are manually screened, and the dataset is expanded by cropping and rotating the images and divided into training, validation, and test sets. Second, all images are resized to 640 × 640 × 3 to obtain the usable aero-engine blade crack dataset. Third, the training parameters are set, and the model is pre-trained on the COCO dataset [30] to obtain the initial weights and biases. Then the training parameters are adjusted, the aero-engine blade crack dataset is used for freeze training, and the model weights and biases are updated. Finally, the model is validated and tested.
3.1. Data Set Production
The dataset used in this study was derived exclusively from photographs taken with a borescope inside an aero-engine. Initially, approximately 200 aero-engine blade images exhibiting cracks were carefully selected. All cracks in these images were labeled one by one with the identifier “crack” using image annotation software in accordance with the VOC data format, and the crack location and the size of the rectangular box were automatically recorded in a text file. The images were then expanded, as shown in Figure 13, giving a total of 3000 aero-engine blade images. These were randomly divided into a training set, a validation set, and a test set in the ratio 7:2:1, and the resolution of all images was unified to 640 × 640.
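The 7:2:1 random split described above can be sketched as follows. The file names, the fixed random seed, and the function name are hypothetical; only the ratio and the total of 3000 images come from the text.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle the items and cut them into train / validation / test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


names = [f"blade_{i:04d}.jpg" for i in range(3000)]  # hypothetical file names
train, val, test = split_dataset(names)
```

For 3000 images this gives 2100 training, 600 validation, and 300 test images, and fixing the seed keeps the split reproducible across ablation runs.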
3.2. Model Evaluation Indicators
To accurately measure the performance of the model, this paper selects standard evaluation metrics for quantitative evaluation: Precision, Recall, mAP@0.5, mAP@0.95, the number of parameters, FLOPs, and Frames Per Second (FPS). The binary confusion matrix is widely used in evaluating target detection models; it collects the number of model predictions and the number of true labels for each category in a single matrix.
Precision is the ratio of the number of samples predicted positive that are actually positive to the number of all samples predicted positive, and Recall is the ratio of the number of samples predicted positive that are actually positive to the number of all actual positive samples. The formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where
P denotes Precision;
R denotes Recall;
TP denotes the number of samples predicted to be positive and actually also positive;
FP denotes the number of samples predicted to be positive and actually negative; and
FN denotes the number of samples predicted to be negative and actually positive.
The curve with Recall on the horizontal axis and Precision on the vertical axis is called the P-R curve. The area enclosed by the curve and the coordinate axes is the average precision (AP), and averaging the AP over all categories gives the mAP:
$$AP = \int_0^1 P(R)\,dR$$
$$mAP = \frac{1}{K}\sum_{i=1}^{K} AP_i$$
K is the number of detection categories, which is 1 in this paper.
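The metrics above can be computed with a few small functions. The sketch below is illustrative only: the counts are made up, and the AP routine uses the all-point interpolation (monotone precision envelope) common in detection toolkits, which is an assumption about how the area under the P-R curve is evaluated.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def average_precision(recalls, precisions):
    """Area under the P-R curve; recalls must be sorted in increasing order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (precision envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


p_example = precision(tp=8, fp=2)               # 8 of 10 predicted cracks are real
r_example = recall(tp=8, fn=8)                  # 8 of 16 real cracks are found
ap = average_precision([0.5, 1.0], [1.0, 0.5])  # toy two-point P-R curve
map_value = sum([ap]) / 1                       # K = 1 class, so mAP equals AP
```

With a single class, as in this paper, mAP and AP coincide.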
To better explain mAP@0.5 and mAP@0.95, the concept of IOU must be introduced. IOU measures the degree of overlap between the predicted bounding box and the true bounding box: the intersection is the area where the predicted box and the true box overlap, the union is the total area covered by the predicted box and the true box together, and the IOU is the ratio of the intersection to the union. The schematic diagram is shown in Figure 14, where S1 denotes the area of the region where the predicted box and the true box overlap and S2 denotes the total area covered by the predicted box and the true box. The formula for IOU is:
$$IOU = \frac{S_1}{S_2}$$
mAP@0.5 then denotes the mAP at an IOU threshold of 0.5, and mAP@0.95, which is stricter, denotes the mAP at an IOU threshold of 0.95.
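The IOU computation and its use as a matching threshold can be sketched as follows. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2); the function names and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    s1 = ix * iy                                   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    s2 = area_a + area_b - s1                      # union area
    return s1 / s2


def is_true_positive(pred, truth, threshold=0.5):
    """A prediction counts as correct when its IOU reaches the threshold."""
    return iou(pred, truth) >= threshold


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
loose = is_true_positive((0, 0, 2, 2), (0, 0, 2, 1.5), threshold=0.5)
strict = is_true_positive((0, 0, 2, 2), (0, 0, 2, 1.5), threshold=0.95)
```

The same prediction, with an IOU of 0.75 against the true box, is accepted at the 0.5 threshold and rejected at the 0.95 threshold, which is why the stricter metric is always the lower of the two.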
All of the above are measures of detection accuracy; the smart borescope device also requires the model to be small enough and the detection speed to be fast enough. Therefore, it is also necessary to introduce the model size (parameters), FLOPs, and FPS. The parameters include the convolution kernel weights, neuron biases, fully connected layer weights, anchor frame parameters, etc. FLOPs (floating-point operations) characterize the amount of computation the model requires; FPS reflects the inference speed of the model, i.e., how many frames can be processed per second, which is extremely important for real-time monitoring with borescope equipment.
3.3. Experimental Environment
The experiments in this paper were run on the Windows 10 operating system with an Intel(R) Core(TM) i5-10200H CPU and an NVIDIA GeForce GTX 1650 Ti GPU. The YOLOv5s model was built on the PyTorch deep learning framework using the Python 3.6 programming language. The training parameters were as follows: 300 epochs, a batch size of 2, an image size of 640, an initial learning rate of 0.01, a momentum of 0.937, and a weight decay coefficient of 0.0005; the SGD optimizer was used to update the parameters.
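For reference, the training settings listed above can be collected in a single configuration dictionary. The key names are illustrative and are not taken from the authors' code; the iteration count assumes that the training set holds 70% of the 3000 images, i.e., 2100 images, as implied by the 7:2:1 split.

```python
train_config = {
    "epochs": 300,
    "batch_size": 2,
    "img_size": 640,
    "optimizer": "SGD",
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
}

# Assumed training-set size: 70% of 3000 images.
train_images = 2100
iterations_per_epoch = -(-train_images // train_config["batch_size"])  # ceiling division
total_iterations = train_config["epochs"] * iterations_per_epoch
```

Under these assumptions each epoch consists of 1050 parameter updates.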
3.4. Ablation Experiment
To demonstrate the effectiveness of the proposed algorithm, we perform ablation experiments on the same engine blade crack dataset with an identical training strategy and identical hyperparameters. The design of the ablation experiment is as follows:
Module A1: Introduce the K-means++ clustering algorithm;
Module A2: Replace the backbone network with MobileNetv3 as a lightweight design to downsize the model and reduce parameters;
Module A3: Replace the ordinary convolution in the neck with GSConv and add the VOVGSCSPC module after it to form the slim-neck module, which increases the depth of the network and distinguishes between noise and valid features;
Module A4: Replace the activation function in the slim-neck module with Hard Sigmoid;
Module A5: Incorporate the SimAM attention mechanism to improve the learning of small features.
Analysis of the experimental results leads to the following conclusions:
The introduction of K-means++ clustering made the generated anchor frames better adapted to the present data, which significantly improved the detection accuracy, in line with expectations;
After replacing the backbone network with the lightweight MobileNetv3 module, the model parameters are reduced by 49.54% compared with YOLOv5s, and the accuracy is also slightly improved. This is mainly because depthwise separable convolution has fewer parameters than ordinary convolution, the h-swish activation function is cheaper to compute than an exponential activation function, the SE attention mechanism improves the attention paid to the features, and the inverted residual structure enhances the expressive ability of the model; the results are in line with expectations;
After replacing the ordinary convolution in the neck with GSConv and adding the VOVGSCSPC module, i.e., the slim-neck module, the parameters are reduced by 52.71% and the accuracy is improved by 20.83% compared with YOLOv5s. The improvement is significant because the model learns deeper features after the introduction of the slim-neck, which enhances feature fusion and improves the target learning ability, in line with expectations;
After replacing the activation function in GSConv with Hard Sigmoid, the parameters are reduced by 52.96% and the accuracy is improved by 29.31% compared with YOLOv5s, an obvious improvement that is in line with expectations;
Adding the SimAM attention mechanism in front of the head introduces no additional parameters but improves the accuracy; it alleviates the tendency of the deeper network brought about by GSConv to ignore small target features, in line with expectations;
After the lightweight design, the FPS of all the models is greater than 95, which meets the detection requirements of the borescope.
There are many popular lightweight backbone networks, such as PP-LCNet and the MobileNet series. To make the comparison more convincing, this paper also conducts a second set of ablation experiments on the backbone network, designed as follows:
Module A6: Replace the backbone network with PP-LCNet [31];
Module A7: Replace the backbone network with MobileNetv4 [32].
The results of the second group of ablation experiments show that the models whose backbone network is replaced with PP-LCNet or MobileNetv4, although excellent in model size and detection speed, suffer an obvious decline in detection accuracy and cannot meet the accuracy requirements. The experimental data therefore indicate that MobileNetv3 is the best-performing backbone network for the application scenario of this paper.
The above ablation experiments show that the model obtained by improving YOLOv5s with the K-means++ algorithm, the lightweight MobileNetv3 backbone network, the slim-neck structure in the neck, and the SimAM attention mechanism performs best when detection accuracy, speed, and model size are weighed together. The training results are shown in Figure 15. The loss curves show that the total loss of the model decreases as the number of training epochs increases, indicating that the model is gradually learning and reducing the prediction error, in line with expectations. Because the model detects only one class (crack), the classification loss is 0. The precision, recall, and mAP curves show that the performance metrics rise significantly as training proceeds and gradually converge by 300 epochs, so the training results are credible.
The test set is evaluated with the MobGSim-YOLO algorithm proposed in this paper, and some of the results are shown in Figure 16; the detection results of the YOLOv5s algorithm are shown in Figure 17. The test set consists of 300 photos of compressor blades and turbine blades; for ease of comparison, only the detection results for 7 turbine blade cracks and 5 compressor blade cracks are displayed. The comparison shows that the confidence of the MobGSim-YOLO detections is significantly higher than that of the YOLOv5s detections and that the bounding boxes of the former are visibly more accurate, while the latter misses boundary cracks that are darker or at particular angles and misjudges the boundaries of occluded and darker parts as cracks. The MobGSim-YOLO model is therefore superior in detection.