1. Introduction
Wheels play a pivotal role as integral components of trains, directly influencing the safety and reliability of train operations [1]. With the rapid development and increasing speed of high-speed trains, the corresponding safety risks are also escalating. The quality and structural condition of high-speed rail wheels have a direct impact on the safety and overall performance of an entire train. Operating at high speeds exposes the wheels to various external factors, including wear, impacts, and fatigue cracks, which can lead to deformation, damage, or even detachment, particularly under high-load and high-frequency conditions. Wheel defects can lead to severe traffic accidents.
Non-destructive testing is a method that preserves the integrity of the object being examined [2], primarily used for detecting defects in materials, components, and equipment. Traditional non-destructive testing (NDT) methods encompass ultrasonic testing (UT) [3], magnetic particle testing (MT) [4], and penetrant testing (PT) [5]. In the railway industry, owing to the unique environment, nearly all inspections rely on non-destructive testing. X-ray computed tomography has been employed to address the reconstruction of rolling contact fatigue cracks in rails [6], and various ultrasonic testing methods have been applied in axle defect detection [7,8]. Given the complex structures of wheels and the multitude of components, there are a variety of detection methods [9,10,11].
Traditional wheel defect detection methods predominantly hinge on visual inspection and ultrasonic testing. Visual inspection, however, is inefficient and prone to inaccuracies. Ultrasonic testing, relying on the principles of ultrasonic wave propagation in materials, involves emitting ultrasonic waves and capturing reflected signals. This approach enables the detection of various defects within wheels, such as cracks, inclusions, and porosities. Characterized by high sensitivity and resolution, this technology proves effective in identifying minute defects, offering early warnings of potential issues and contributing to the prevention of safety hazards arising from wheel failures.
The application of ultrasonic testing to train wheels goes beyond routine periodic inspections; it is also applicable in emergencies, such as post-collision or other external impact scenarios. Continuous monitoring of wheel health empowers operators to implement timely maintenance and repair measures, ensuring the safety and stability of trains during high-speed operations. However, given the sheer quantity of data, relying solely on manual detection is too inefficient. Traditional detection methods depend on fixed procedures, so neither matching accuracy nor efficiency can be guaranteed; moreover, constrained by the limits of expert knowledge, the detection process is relatively complicated. A novel set of methods that can meet these needs therefore remains to be developed and promoted.
Deep learning technology represents a significant advancement in enhancing the efficiency and accuracy of computer vision as well as NDT. Its application extends to the automated analysis of various NDT data, including ultrasonic signals, magnetic particle images, and penetrant images, thereby substantially improving defect detection accuracy. Specifically, in ultrasonic wheel defect detection, deep learning technology offers several advantages. Firstly, it facilitates the automatic analysis of ultrasonic signals, significantly enhancing detection efficiency. Secondly, it can extract intricate features from ultrasonic signals that may be challenging for humans to recognize, leading to a substantial improvement in detection accuracy. Additionally, deep learning technology contributes to cost reduction and minimizes manual intervention in the testing process. One notable strength of deep learning models is their ability to classify defects into different types, accurately locate defect positions, and measure defect sizes. These capabilities not only streamline the analysis process but also provide a comprehensive understanding of the detected defects. By annotating defects in ultrasound B-scan images, a UT dataset can be constructed; by learning from a large quantity of effective data, a model can then detect defects efficiently.
Despite these advantages, the application of deep learning technology in ultrasonic wheel defect detection still encounters certain challenges. One of the primary challenges is the creation of high-quality datasets. The performance of deep learning models heavily relies on the quality and diversity of the training data. Ensuring that the datasets used for training are representative of real-world scenarios is essential for achieving robust and reliable defect detection. Another challenge lies in the high complexity of deep learning models. While these models exhibit remarkable capabilities, their complexity can lead to challenges in interpretability and the need for significant computational resources. Striking a balance between model complexity and interpretability is crucial for practical implementation and widespread adoption. While deep learning technology has immense potential for revolutionizing ultrasonic wheel defect detection by significantly improving efficiency and accuracy, addressing challenges related to dataset creation and model complexity is essential for realizing its full benefits in real-world applications. To this end, we need to explain the categories of defects. In this work, the defect data are divided into four different forms according to the depth of scanning, namely, surface, near-surface, internal, and wheel plate cracks.
Given that natural images seldom exhibit exaggerated aspect ratios, standard deep-learning approaches may fall short when applied directly to ultrasonic images. Hence, this article explores specialized preprocessing techniques aimed at enhancing the compatibility between deep learning models and the intricate features inherent in ultrasonic wheel defect B-scan images. By addressing these specific challenges through dedicated preprocessing methods, the goal is to optimize the learning capacity of deep neural networks, fostering improved accuracy in detecting defects. This approach is crucial in bridging the gap between conventional image-processing methodologies and the distinctive characteristics of ultrasonic datasets, ultimately enhancing the efficacy of defect detection in high-speed railway train wheels. The design and training of deep learning models need to be tailored to accommodate these distinctive characteristics, ensuring optimal performance in detecting ultrasound defects. The main contributions of this paper are as follows:
An ultrasonic defect detection data set based on real wheels was created, and the defects were subdivided according to different categories;
An advanced UT-YOLO network was proposed to increase the network's ability to perceive and detect small defects;
This research has been applied in actual scenarios and has helped to greatly improve manual detection efficiency.
3. Methodology
The basic architecture of YOLOv5 is shown in Figure 1. The whole operation process is as follows: firstly, the images are passed to the backbone network, CSPDarkNet53 [28], for feature extraction; then, the features in the neck section are separated into different sizes for feature fusion to obtain more feature information. After that, different detection heads are designed, which allows the algorithm to fulfill the specific requirements of object detection.
3.1. Backbone
The architecture of the backbone is designed for feature extraction, but its efficiency deserves particular attention; for this reason, some novel backbone structures were applied in this task. The defects in ultrasonic testing B-scan images appear in different regions, so a capable feature extraction backbone must be designed to handle them. In railway train wheels, defects are typically small and noise-like, making them hard to identify.
Figure 2 shows a typical residual connection, which is widely used in the design of backbone improvements. By introducing residual connections, the neural network is allowed to skip certain levels of learning, thereby alleviating the vanishing and exploding gradient problems and making training easier. Residual connections also enable deeper networks: traditional deep neural networks become harder to train as their depth increases, whereas the residual structure allows deeper networks to be built by adding more layers without degradation issues.
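As a concrete illustration, the skip connection can be sketched in a few lines of NumPy; the two-layer branch F and its weights here are illustrative, not the paper's actual backbone:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """A minimal residual block: y = ReLU(F(x) + x).

    F is two linear transforms with a ReLU in between; the identity
    shortcut adds the input back so gradients can bypass F.
    """
    f = relu(x @ w1) @ w2   # the residual branch F(x)
    return relu(f + x)      # shortcut: output = F(x) + x

# With zero weights the branch F(x) vanishes, so (for non-negative
# inputs) the block reduces to the identity -- the property that lets
# deeper networks be stacked without degradation.
x = np.array([[1.0, 2.0, 3.0]])
w_zero = np.zeros((3, 3))
y = residual_block(x, w_zero, w_zero)
```

When F learns nothing useful, the block simply passes its input through, which is why adding such blocks does not make optimization harder.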
Figure 3 shows the structure of the Swin-Transformer block. It allows a model to better capture global relationships among pixels in an image and avoids full connection to reduce calculation costs, affordances that are of vital importance in object detection, especially in ultrasonic defect testing. At the same time, by stacking multiple transformer blocks, the network is able to learn more complex and abstract feature representations.
Here, x_l represents the output of layer l (equivalently, the input of layer l + 1). The information passes from one layer to the next sequentially.
Firstly, the input is linearly transformed and normalized; then, it is passed through Window-based Multi-Head Self-Attention (W-MSA), which attends within local windows of the input rather than over a global average. Secondly, the output is added to the input through a residual connection. The new output passes through another Layer Normalization (LN) step and then a Multi-Layer Perceptron (MLP) to fit the complicated input data. Finally, the residual connection is used again to improve model stability. The block is usually repeated, which helps alleviate vanishing or exploding gradient problems during training, making the network easier to train and optimize. In wheel defect feature extraction, after convolution and residual connection, the Swin-Transformer feature extraction module can be stacked multiple times to obtain more semantic information, which plays an important role in distinguishing background noise from defects.
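The forward pass described above (LN, attention, residual add, LN, MLP, residual add) can be sketched as follows. This is a simplified stand-in: single-head global attention replaces W-MSA, ReLU replaces the usual GELU, and all names and shapes are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    # single-head attention standing in for windowed multi-head (W-MSA)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def transformer_block(x, attn_w, mlp_w1, mlp_w2):
    # 1) LN -> attention -> residual add
    x = x + self_attention(layer_norm(x), *attn_w)
    # 2) LN -> MLP -> residual add
    h = np.maximum(0.0, layer_norm(x) @ mlp_w1)  # ReLU in place of GELU
    return x + h @ mlp_w2

# With zero weights both branches vanish and the block is the identity,
# mirroring the stability property of the residual connections.
x = np.random.RandomState(0).randn(4, 8)
z = np.zeros((8, 8))
out = transformer_block(x, (z, z, z), z, z)
```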
3.2. Neck
BiFPN, a bidirectional feature propagation network, facilitates the bidirectional flow of information, allowing features to traverse from high-level layers to low-level layers, as well as enabling the propagation of low-level features to high-level layers. This dual-directional information exchange enhances the model’s ability to capture semantic information across various feature levels, thereby augmenting the model’s perceptual capabilities towards the target. In contrast, PANet adopts unidirectional propagation, limiting the flow of information to a single direction. The corresponding structure is shown in Figure 4.
Another noteworthy aspect is that BiFPN was meticulously designed to mitigate computational complexity without compromising model efficacy. It achieves this by judiciously fusing features at distinct levels, thereby sidestepping the computational overhead associated with full connections. In contrast, PANet exhibits a more intricate design that could entail a higher computational burden. Nevertheless, a novel enhancement of BiFPN, incorporating the Simple, Parameter-Free Attention Module (SimAM) [36], was proposed. This addition enables the network to selectively concentrate on varying levels of features, thus refining its capacity for nuanced feature extraction. A diagram of SimAM is shown in Figure 5.
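SimAM's parameter-free weighting can be sketched from the energy formulation in [36]: each activation is weighted by the sigmoid of its inverse energy, emphasizing neurons that deviate most from their channel mean. The details below (per-channel statistics, the regularizer lam) follow our reading of that work, not an implementation from this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simam(x, lam=1e-4):
    """Parameter-free attention in the spirit of SimAM.

    x: feature map of shape (C, H, W). Inverse energy is computed per
    channel from the spatial mean and variance; activations that stand
    out from the channel mean receive larger attention weights.
    """
    c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n
    e_inv = d / (4.0 * (var + lam)) + 0.5
    return x * sigmoid(e_inv)  # weights lie in (0, 1): a soft re-scaling

# The module adds no learnable parameters; it only re-weights features.
feat = np.random.RandomState(1).randn(2, 4, 4)
attended = simam(feat)
```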
3.3. Head
Given that ultrasonic testing typically involves small targets, the existing detection heads may struggle to meet the required standards. Consequently, this study enhances the number of detection heads and simultaneously augments the depth of the detection heads within the target detection model. The objective is to enhance the model’s discernment of intricate scenes and nuanced features in pursuit of more effective performance. The new structure is shown in Figure 6.
The added detection head is shown in Figure 6. The structure in the dotted box in the lower half, rendered in gray, is consistent with that in Figure 1. On this basis, a small-target detection layer was added in the upper block diagram of the network; marked in color, this structure is used for the detection of small targets.
3.4. Loss Function
The original YOLOv5 loss function generally consists of three components, as represented by Equation (1). The first term quantifies localization loss, evaluating the accuracy of the model’s predictions for the position of the target bounding box. The second term corresponds to confidence loss, measuring how accurately the model predicts the presence of a target. The final term assesses the accuracy of the classification,
where S, G, and A are the number of output features, the number of grids, and the number of anchors, respectively; λ is the weight for the different sections, with λ_box, λ_obj, and λ_cls set as 0.05, 0.3, and 0.7, respectively; I_ijk^obj represents whether the i-th cell’s j-th anchor box in the k-th feature map is a positive sample (defined as 1 if positive and 0 otherwise); p and t are the prediction vector and ground truth (GT); and w_i is the weight for the different sizes of the output.
Equation (2) is the new loss function designed for UT defect detection, in which the localization loss is replaced by the EIoU loss. Intersection over Union (IoU) is used to evaluate the quality of object detection models. A schematic diagram of IoU is shown in Figure 7.
Here, the value of IoU is given by Equation (3). Simply put, the overlapping region of A and B is C, denoted A ∩ B, and the union of A and B is denoted A ∪ B; the ratio of the area of A ∩ B to the area of A ∪ B is the IoU, an indicator of prediction accuracy. In the original loss function, complete IoU (CIoU) was used to handle the special case of boxes that share a central point but have different height-to-width ratios. Equations (4)–(6) define the CIoU loss, where α is a weight and v is the parameter measuring the consistency of the height-to-width ratio. A schematic diagram is shown in Figure 8: the green line (AB) is the parameter c, which signifies the diagonal distance of the minimum bounding box (the gray, dotted one) capable of encompassing both the predicted box and the GT box. Meanwhile, the parameter ρ is the Euclidean distance between the two center points b and b^gt, indicated by the red line.
At the same time, the definition of v is shown in Equation (7).
In Equation (2), the penalty term of EIoU separates the aspect-ratio influence factor of the CIoU penalty term, penalizing the width and height of the target box and the anchor box directly; the localization loss is therefore changed according to Equation (8).
EIoU can thus provide a more accurate assessment when an object’s position is imprecise or slightly offset. However, to use EIoU as the loss function, the dataset should be scrutinized more closely, especially regarding the balance between the various types of data.
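Under standard (x1, y1, x2, y2) box conventions, IoU and an EIoU-style loss can be sketched in plain Python. This reflects our reading of Equations (3) and (8): 1 − IoU plus a normalized center-distance term plus separate width and height penalty terms over the enclosing box; it is not the paper's implementation:

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2): area(A ∩ B) / area(A ∪ B)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def eiou_loss(a, b):
    """EIoU-style loss: 1 - IoU + center term + separate w/h terms."""
    acx, acy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bcx, bcy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    aw, ah = a[2] - a[0], a[3] - a[1]
    bw, bh = b[2] - b[0], b[3] - b[1]
    # smallest enclosing box: width cw, height ch, squared diagonal c2
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2
    rho2 = (acx - bcx) ** 2 + (acy - bcy) ** 2  # squared center distance
    return (1 - iou(a, b) + rho2 / c2
            + (aw - bw) ** 2 / cw ** 2 + (ah - bh) ** 2 / ch ** 2)

# A perfectly matched prediction has IoU = 1 and zero loss.
box = (0.0, 0.0, 2.0, 2.0)
perfect = eiou_loss(box, box)
```

Penalizing width and height directly (rather than their ratio, as in CIoU) is what gives EIoU its sharper gradient for small, offset boxes.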
To strike a balance between real-time requirements and computational complexity, the original YOLOv5 algorithm was chosen as the baseline model. This work enhanced the baseline model by (1) utilizing dataset-preprocessing methods that incorporate the characteristics of ultrasonic B-scan data, enhancing the model’s compatibility with neural networks; (2) employing a combination of ResNet, Swin-Transformer, and BiFPN structures for feature extraction to capture richer feature information, thereby increasing the capability of feature extraction; (3) introducing SimAM and a dedicated small-object detection head to further optimize the detection accuracy for small objects; and (4) enhancing the accuracy of bounding box regression through the refinement of the loss function’s design.
4. Experiments and Results
Diverging from natural images, ultrasonic detection images primarily convey intensity information, rendering them particularly susceptible to noise interference. Moreover, as the focus of this study is on detecting defects in wheels, the acquired data hold exceptional value. This section describes the dataset and the various experiments conducted on it.
4.1. Dataset
This work was evaluated with reference to a real dataset, collected with phased array UT (PAUT) and transmit–receive (TR) probes to attain B-scan images of four different defect categories according to depth, namely, surface, near-surface, internal, and rim cracks, in different service depots using the LU system [37]. The entire dataset of UT defects consisted of around 15,000 B-scan images, in which peeling, scratches, and cracks in different positions and at different depths were recorded. It is hard to find real images that show all the different types of defects, especially some inner defects, so a diagram of typical wheel defects is shown in Figure 9.
UT is a very useful solution for detecting inner defects, but analyzing surface and sub-surface areas is not its strength because of the large amount of initial-wave interference and noise at the contact surface. Moreover, owing to the particularity of the wheel structure, the collection is generally completed in one revolution, so the aspect ratio of the collected images is very large. Examples of the original collected data are shown in Figure 10. Image (a) depicts a surface defect, and the periodic signal represents the wheel plate holes; there are six holes in a wheel.
The environment of the experimental data acquisition is shown in Figure 11. In the service depots, the wheel runs in the designated area, and a robotic arm carries a phased array probe to collect the testing information. The collected signals are stored in the industrial computer for preliminary analysis, such as applying different gains, to provide a quick judgement.
To illustrate the practicability of the proposed method, this work briefly displays typical real defects, including wheel rim cracks and some surface defects like scratches, with their B-scan images given below in Figure 12.
Another consideration is the variation in scan depth, with its different width-to-height ratios, which made a data-preprocessing procedure necessary. First, the images were cropped and reorganized into rectangles with an aspect ratio as close to 1 as possible, following the data-preprocessing procedure shown below (Algorithm 1).
Algorithm 1: Data Preprocessing
Input: Original Images. Output: Processed Image after Data Preprocessing.
1: // Obtain the width (w) and height (h) of the image
2: w, h = getImageDimensions(originalImage)
3: // Check if the width is greater than 3 times the height
4: if w > 3 * h:
5:     // Crop the image into two halves and stack them vertically
6:     processedImage = cropAndStackImage(originalImage)
7: else:
8:     // If width is not greater than 3 times height, no processing is done
9:     processedImage = originalImage
10: end if
11: // Return the processed image
12: return processedImage
Then, the processed images needed to be labeled with different classes. Finally, the images were divided into three sets: a training set, a validation set, and a testing set with a ratio of 7:2:1, containing 10,395, 2970, and 1485 images, respectively.
4.2. Evaluated Index
In this work, two sets of metrics from academia and industry, respectively, are used to evaluate the model. Generally, the academic indexes that should be used are average precision (AP), recall, mAP, and frames per second (FPS); the industrial indexes always focus on the true alarm rate (TAR) and the false alarm rate (FAR) of detection.
4.2.1. Academic Indexes
A confusion matrix is often used to summarize the prediction effect of classification models, as shown in Table 1.
The calculations are shown in Equations (9)–(13):
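Assuming Equations (9)–(13) follow the standard confusion-matrix definitions (which the surrounding text implies but the extracted equations do not show), the core academic indexes can be sketched as:

```python
def precision(tp, fp):
    """Fraction of predicted defects that are real: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of real defects that are found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def mean_ap(ap_per_class):
    """mAP: the mean of the per-class average precision (AP) values."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative counts: 8 defects found, 2 false alarms, 2 misses.
p = precision(8, 2)
r = recall(8, 2)
score = f1(8, 2, 2)
```

AP itself is the area under a class's precision–recall curve; mAP@0.5 and mAP@0.5:0.95 differ only in the IoU threshold(s) at which a detection counts as a true positive.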
4.2.2. Industrial Indicators
An example describing the true alarm rate (TAR) and false alarm rate (FAR) is provided here. Assume there are T samples in total, of which D are defect samples and the remaining N are normal. Using a detection algorithm, D − 1 of the D defect samples are found to be defects, and 2 of the N normal samples are falsely identified as defects; an explanatory diagram is shown in Figure 13, and the calculation formulae are given in Equations (14) and (15).
In this example, there is one more definition to provide, that is, the miss alarm rate (MAR), which in this case is 1 out of D, as represented in Equation (16).
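The worked example above can be reproduced directly; the counts D and N below are hypothetical values chosen for illustration:

```python
def tar(detected_defects, total_defects):
    """True alarm rate: defect samples correctly flagged / all defect samples."""
    return detected_defects / total_defects

def far(false_alarms, total_normals):
    """False alarm rate: normal samples flagged as defects / all normal samples."""
    return false_alarms / total_normals

def mar(missed_defects, total_defects):
    """Miss alarm rate: defect samples not flagged / all defect samples."""
    return missed_defects / total_defects

# Example from the text: D - 1 of D defects found, 2 of N normals
# falsely flagged, 1 defect missed. D = N = 100 is hypothetical.
D, N = 100, 100
t, f, m = tar(D - 1, D), far(2, N), mar(1, D)
```

Note that TAR and MAR are complementary over the defect samples: TAR + MAR = 1.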
In practice, there is a big gap between academia and industry. The requirements for statistical results are often stringent in the industrial setting, making the utilization of industrial indexes more challenging. In this work, academic metrics were adopted primarily to validate the feasibility of algorithmic improvements, serving as an assessment tool for evaluating the effectiveness of the algorithm.
4.3. Implementation Details
All the experiments were carried out on a computer running Windows 10 with two Nvidia GeForce RTX 3090 GPUs and an Intel Core i9-10980XE CPU. The deep learning framework was PyTorch 1.7.1, and the Python version was 3.7.11. The input images were resized to 640 × 640, the batch size was 128, the optimizer was SGD, the initial learning rate was set to 0.001, and the total number of iterations was set to 500 epochs with a patience of 50, signifying that if there are 50 consecutive epochs without improvement, training is considered to have converged.
4.4. Experiment Results
To evaluate the performance of the algorithm in terms of accuracy and speed for both academic and industrial applications, UT-YOLO was compared with different algorithms. Table 2 presents the best recall, mAP@0.5, and mAP@0.5:0.95 results for each algorithm, derived from the experimental observations of the proposed model with respect to the test set. Figure 14 and Figure 15 depict the mAP@0.5 and mAP@0.5:0.95 curves, respectively.
The results depicted in Figure 14 and Figure 15 show the effectiveness of UT-YOLO: compared with the original baseline, both the mAP@0.5 and mAP@0.5:0.95 indicators improved greatly. In Figure 16, a comparative analysis of the detection results across different models is presented, highlighting the performance variations for a common object. Notably, the examination distinctly underscores the superior efficacy of the UT-YOLO model.
It can be observed in Figure 17 that the UT-YOLO model demonstrated precise localization of the defects, especially the small-sized ones. Furthermore, for defects near the wheel plate hole, the model exhibited commendable feature extraction capabilities, as is also evident in Figure 17. Despite the presence of some minor defects associated with background blur, the model displays robustness and can effectively detect defects even in challenging conditions like strong background noise and electrical interference. The left side of each image is the original detection result, and the blue box on the right corresponds to the enlarged local features.
For real-world industrial inspection, additional sets of 50 defective samples and 50 non-defective samples were carefully chosen to assess the detection rate and false alarm rate. Among the 50 defective samples, consisting of a total of 6000 images, 210 were identified as defects by the UT inspectors. Using UT-YOLO, a total of 206 defect instances were successfully detected, resulting in a TAR of 98.10% (206 out of 210). Notably, the four undetected defects existed in three distinct wheelsets. In practical applications, the achieved TAR was 94% (47 out of 50), underscoring the heightened challenges faced in real-world scenarios.
Furthermore, a thorough examination of 6000 defect-free images from 50 wheelsets revealed 83 instances of false alarms, originating from 6 different wheelsets. According to statistical analysis based on the images, the calculated FAR was 1.38% (83 out of 6000). However, in actual application scenarios, the observed FAR was 12% (6 out of 50). These results emphasize the complexity and intricacies encountered in practical industrial applications, where achieving optimal detection indicators proves to be a more demanding task.
Last but not least, the efficiency of this method aligns with the demands of real-time collection and detection. In typical scenarios, a wheel completes one full rotation in about one minute. With a phased array probe utilizing 120 channels for data collection, the algorithm achieves a detection speed of approximately 2 s per wheel. Considering factors such as software response and data loading time, on-site application can accomplish the detection of a wheel and provide defect location information within 10 s, significantly diminishing the expenses associated with manual inspection.
5. Discussion
In this study, the application of the UT-YOLO method to ultrasonic inspection data of railway wheels was examined, with a focus on detecting various types of defects. The evaluation of the results underscores the superiority of this method, emphasizing its reliability both in field applications and on site. Among the models employed for object detection, the proposed UT-YOLO model exhibits significant advancements compared with several benchmark models. The outcomes indicate that UT-YOLO achieved a best recall of 0.94, an mAP@0.5 of 0.89, and an mAP@0.5:0.95 of 0.64, surpassing the baseline YOLOv5s by 37%, 36%, and 42%, respectively. Particularly noteworthy is UT-YOLO’s substantial superiority over the other comparative algorithms in terms of mAP@0.5:0.95. Moreover, UT-YOLO demonstrated the fastest speed among the evaluated models. This is attributed to its unique features, such as an added small-object detection layer, an attention mechanism module, data enhancement, and data preprocessing specially designed for ultrasonic wheel B-scan image defect detection.
While the current study has effectively addressed challenges in ultrasonic wheel detection, especially in regard to speed and accuracy, achieving promising outcomes in terms of efficiency and practical applicability, it is crucial to acknowledge the diverse nature of real-world defects. Despite this success, certain limitations may arise in addressing specific prevalent issues. In particular, some false alarms caused by external factors such as electrical interference remain relatively serious. The data collection process is therefore highly dependent on the stability of the equipment: because the surface conditions of the wheels vary, the data can easily be invalidated, and a large amount of noise, especially electrical noise, is generated at the same time. To avoid excessive hardware investment, more targeted distinctions will instead be made on the algorithm side to eliminate some of these special interferences; this will be reflected in future work. Future endeavors will therefore prioritize an in-depth exploration of actual wheel inspection equipment and the quality of inspection data. These tasks will involve a meticulous classification of various defects, aiming to enhance this methodology’s performance across a broader spectrum of real-life scenarios.