1. Introduction
As an important guiding, moving, and load-bearing device of a train, the wheelset tread inevitably suffers from different types of defects, such as wear, cracks, and scratches, due to long-term rolling-contact service [1,2,3]. If such defects are allowed to worsen, they not only cause additional shock and vibration that degrade ride comfort but can also, in extreme cases, develop into serious failures such as wheelset breakage. If not handled in time, these defects may lead to catastrophic accidents such as train derailment. Effective detection of wheelset tread defects is therefore of great significance for the running safety of trains [4,5,6].
At present, most wheelset tread defect detection methods are traditional: the tread is scanned with light or laser beams when the train enters the depot at low speed or stops for maintenance, and defects are identified by comparing the scanned profile with the standard tread profile [7,8,9,10,11]. Although the identification accuracy of such methods is high, they require high-intensity labor and long recognition cycles.
With the rapid development of machine learning, deep learning has shown clear advantages and potential in defect detection, and various deep learning methods have been adopted in research on wheelset tread defect detection [12,13,14,15,16,17,18], including the Faster RCNN, YOLO, and SSD algorithms, deep convolutional neural networks, and so on [19,20,21,22,23,24]. In Reference [19], a convolutional neural network recognition model was established to detect wheelset tread defects from two-dimensional time-series wheel–rail force data acquired by an instrumented wheelset. Nevertheless, most existing wheelset tread defect detection methods require a relatively large amount of data to achieve high accuracy. This requirement is difficult to fulfill because trains run in complex and diversified environments, and tread defect samples are not easily collected. When deep learning is carried out directly on small samples, the insufficient defect feature space degrades model accuracy and causes over-fitting [25,26]. So far, few studies have been devoted to detecting wheelset tread defects with small samples.
This paper addresses the issue of deep learning with small samples in the context of wheelset tread defect detection. Through three aspects, namely sample expansion, feature enhancement, and network decision making, a local inference constraint network is constructed to detect wheelset tread defects. For sample expansion, the data are expanded with a generative adversarial network, yielding a dataset with semantic consistency and sample distribution diversity. For feature enhancement, the importance of defect features is increased by introducing an attention mechanism module into the network. For network decision making, a spinal network with a residual structure is constructed to obtain more accurate results with fewer operations through the precise input of local information. Finally, the proposed method is verified and analyzed with experimental data.
2. Proposed Method
Wear, cracks, peeling, scratches, and severe scratches on the wheelset tread are common defects caused by wheel–rail contact [27,28,29]. Considering that small samples of wheelset tread defects yield an insufficient feature space, the importance of defect features in model identification was improved by addressing sample expansion, feature enhancement, and network decision making. A local inference constraint network-based detection method was proposed to predict the type of wheelset tread defect with small samples, as shown in Figure 1.
In Figure 1, the diverse sample expansion module (Module B), the generative adversarial network indicated by the blue block, is employed to obtain a dataset with semantic consistency and sample distribution diversity. The feature extraction module (Module C) increases the importance of defect features by incorporating the attention mechanism module, represented by the orange block, into the network. The residual spinal fully connected layer (Module D) improves the distinguishability of the feature vectors: it re-sorts local features and assigns values based on their importance, enabling the network to achieve more accurate results with fewer calculations. Finally, the defect type of the wheelset tread is determined in Module E.
2.1. Sample Expansion Based on Generative Adversarial Network
The purpose of data sample expansion is to generate enough data for deep network learning. In a basic generative adversarial network, the generator receives a random sample space and noise data conforming to a Gaussian distribution, while the discriminator receives samples produced by the generator together with real data samples and updates its parameters through gradient back-propagation of a loss function. Because it uses only random samples and Gaussian noise, the basic generative adversarial network ignores the information distribution characteristics of the actual object, so the quality of the generated samples is relatively low. To address this, noise data were injected through an AdaIN mechanism during sample expansion, ensuring that the generator creates wheelset tread images at the target image size, while the independent Gaussian noise input affects only subtle variations in visual features. AdaIN is defined as follows [30]:

$$\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y), \tag{1}$$

where $\mu(x)$ and $\sigma(x)$ represent the mean and variance of the real data sample, respectively, and $\mu(y)$ and $\sigma(y)$ are the mean and variance of the latent data sample space, respectively.

The network built on Equation (1) can better fit the distribution of wheelset tread defect data, and the constructed image information is more in line with real samples. For the specific network structure, please refer to Reference [30].
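As an illustration, the following is a minimal AdaIN sketch in PyTorch (the language of the pseudocode in Section 2.4). The function name and tensor shapes are assumptions for illustration, not the exact implementation of Reference [30].

import torch

# Minimal AdaIN sketch (Equation (1)), assuming 4-D feature maps (N, C, H, W).
def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Per-channel statistics over the spatial dimensions
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_y = y.std(dim=(2, 3), keepdim=True) + eps
    # Normalize x with its own statistics, then rescale with those of y
    return sigma_y * (x - mu_x) / sigma_x + mu_y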
2.2. Feature Enhancement with Attention Mechanism
Although a conventional network can dynamically describe and weight the input data in a learnable manner, it struggles to distinguish the defect feature weights of different data subsets, especially given the weakened features, random positions, and high background noise of wheelset tread defects. To overcome these weaknesses, a multi-dimensional attention mechanism was embedded in the feature extraction module: an attention mechanism was added at the channel level to highlight feature channel information, and another was introduced in the spatial dimension to strengthen spatial information. The module is shown in Figure 2.
As can be seen from Figure 2, the input data were the wheelset tread defect data expanded by the generative adversarial network. A convolutional neural network was used to obtain a feature vector of high-dimensional semantic information, and weighted attention parameters were then obtained through the channel and spatial attention mechanisms indicated by the red outline in the figure. The attention-weighted output was obtained by multiplying the feature mapping by the attention weights. The channel attention, spatial attention, and weighted output are

$$M_c = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(\mathrm{Conv5})) + \mathrm{MLP}(\mathrm{MaxPool}(\mathrm{Conv5}))\big), \tag{2}$$

$$M_s = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(M_c \otimes \mathrm{Conv5});\ \mathrm{MaxPool}(M_c \otimes \mathrm{Conv5})])\big), \tag{3}$$

$$A_t = M_s \otimes (M_c \otimes \mathrm{Conv5}), \tag{4}$$

where Conv5 is the feature space after five convolution layers; $M_c$ and $M_s$ are the channel attention weight and spatial attention weight, respectively; $A_t$ is the weighted feature space; $\sigma$ is the sigmoid function; $f^{7\times7}$ is a 7 × 7 convolution; and $\otimes$ denotes element-wise multiplication. By compressing the spatial dimension with the channel attention mechanism, a greater weight was applied to the wheelset tread defect features. By compressing the channel dimension with the spatial attention mechanism, spatial positional information was provided to improve the utilization of the feature vectors.
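A minimal PyTorch sketch of this channel-plus-spatial attention (Equations (2)–(4)) is given below. The class name, reduction ratio, and 7 × 7 spatial kernel follow the common CBAM design and are assumptions rather than the paper's exact settings.

import torch
import torch.nn as nn

# Channel attention followed by spatial attention over a Conv5 feature map.
class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int = 2048, reduction: int = 16):
        super().__init__()
        # Shared MLP for channel attention (Equation (2))
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution for spatial attention (Equation (3))
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        # M_c: sigmoid(MLP(AvgPool(x)) + MLP(MaxPool(x))), squeezing spatial dims
        m_c = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                            self.mlp(x.amax(dim=(2, 3)))).view(n, c, 1, 1)
        x = x * m_c
        # M_s: sigmoid(conv7x7([AvgPool_c; MaxPool_c])), squeezing the channel dim
        m_s = torch.sigmoid(self.conv(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * m_s  # A_t, Equation (4)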
2.3. Residual Spinal Fully Connected Layer
The residual is a measure of the difference between a statistical sample and a true sample, and it is an observable estimate of an unobservable statistical error [30]. In neural networks, the residual (also known as a skip connection) directly connects an input to an output through another channel; it is used to amplify errors and highlight changes in the model parameters to be optimized. In the decision-making layer of a neural network, the fully connected layer is generally combined with other modules to reduce the number of parameters, because of its inherently high parameter count and training difficulty. Inspired by the unique way the vertebrate nervous system processes information, the idea of the residual was introduced into the spinal fully connected layer, resulting in a residual spinal fully connected layer. The module innovatively improves the input–output connection, and it alleviates the over-fitting problem of the spinal fully connected layer by adding a residual channel. Furthermore, the activation function was reset based on the dataset's size characteristics, making the module more suitable for detecting wheelset tread defects with small samples. The model structure is shown in Figure 3.
As demonstrated in Figure 3, the weighted feature space $A_t(x)$ obtained from Equation (4) is passed through a max pooling layer, Maxpool, to prune redundant information from shallow features, resulting in a one-dimensional feature $M_x$ of size 1 × 1 × 2048. Subsequently, $M_x$ is decomposed into two equal parts, $A_1$ and $A_2$, defined as

$$M_x = \mathrm{Maxpool}(A_t(x)), \tag{5}$$

$$A_1 = \{x_{M_1}, \ldots, x_{M_{1024}}\}, \quad A_2 = \{x_{M_{1025}}, \ldots, x_{M_{2048}}\}, \tag{6}$$

where Maxpool refers to the max pooling operation, and the elements $x_{M_i}$ within the sets $A_1$ and $A_2$ denote the $i$-th feature value of the feature vector $M_x$.
The spinal neural network consists of $Sp_1$, $Sp_2$, $Sp_3$, and $Sp_4$. As indicated by the green outline in Figure 3, each spinal module is composed of a dropout layer, a linear layer, and an activation function. The dropout layer and activation function were set with the same parameters in every module, and the output size of each linear layer was set to 512. The input size of the linear layer in $Sp_1$ was 1024, while that of the linear layers in $Sp_2$, $Sp_3$, and $Sp_4$ was 1536. The purpose of these settings is to unify the feature dimension of the final input to the fully connected layer. The corresponding equations are

$$Sp_1 = \delta(\mathrm{Linear}(\mathrm{Dropout}(A_1))), \tag{7}$$

$$Sp_2 = \delta(\mathrm{Linear}(\mathrm{Dropout}([A_2, Sp_1]))), \tag{8}$$

$$Sp_3 = \delta(\mathrm{Linear}(\mathrm{Dropout}([A_1, Sp_2]))), \tag{9}$$

$$Sp_4 = \delta(\mathrm{Linear}(\mathrm{Dropout}([A_2, Sp_3]))), \tag{10}$$

where Dropout is a random zeroing operation, Linear is a linear layer, and $\delta$ is the activation function of the spinal neural network. The output of each spinal block, combined with the max-pooled feature vector, yields a refined feature vector FC of dimensions 1 × 1 × 2048. This feature vector FC is then mapped to a specific class label through the final fully connected layer. Subsequently, the Softmax function is applied to transform the class scores into a probability distribution, thereby accomplishing the pattern recognition task for images of the wheelset tread.
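A minimal PyTorch sketch of the residual spinal fully connected layer (Equations (5)–(10) and Figure 3) is shown below. The 2048-dimensional input, 1024-dimensional halves, and 512-dimensional spinal outputs follow the text; the dropout rate, ReLU activation, and six-class output are assumptions.

import torch
import torch.nn as nn

class ResidualSpinalFC(nn.Module):
    def __init__(self, in_features: int = 2048, spinal_width: int = 512,
                 num_classes: int = 6, p_drop: float = 0.5):
        super().__init__()
        self.half = in_features // 2  # half_width = 1024

        def spinal_block(in_dim: int) -> nn.Module:
            # Dropout -> Linear -> activation, as in each Sp module
            return nn.Sequential(nn.Dropout(p_drop),
                                 nn.Linear(in_dim, spinal_width),
                                 nn.ReLU(inplace=True))

        self.sp1 = spinal_block(self.half)                 # input: A1 (1024)
        self.sp2 = spinal_block(self.half + spinal_width)  # input: [A2, Sp1] (1536)
        self.sp3 = spinal_block(self.half + spinal_width)  # input: [A1, Sp2] (1536)
        self.sp4 = spinal_block(self.half + spinal_width)  # input: [A2, Sp3] (1536)
        self.fc = nn.Linear(4 * spinal_width, num_classes)  # FC input: 2048

    def forward(self, mx: torch.Tensor) -> torch.Tensor:
        a1, a2 = mx[:, :self.half], mx[:, self.half:]  # Equation (6)
        x1 = self.sp1(a1)                               # Equation (7)
        x2 = self.sp2(torch.cat([a2, x1], dim=1))       # Equation (8)
        x3 = self.sp3(torch.cat([a1, x2], dim=1))       # Equation (9)
        x4 = self.sp4(torch.cat([a2, x3], dim=1))       # Equation (10)
        fc_in = torch.cat([x1, x2, x3, x4], dim=1)      # refined 2048-d vector FC
        return self.fc(fc_in)                           # class logits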
2.4. Model Pseudocode
Based on the above analysis, the pseudocode of the detection method of train wheelset tread defects is as follows.
Model pseudocode (PyTorch)
Input: x, half_width
Output: y

# features: feature extraction network
# attention: attention mechanism
# avgpool: average pooling
# rsf_1 ... rsf_4: residual spinal fully connected blocks

# Define the loss function and the Adam optimization network
loss_fn = CrossEntropyLoss()
optimizer = optim.Adam()

# Perform training iterations
for epoch in range(50):
    x = features(x)
    x = attention(x)
    x = avgpool(x)
    x1 = rsf_1(x[:, 0:half_width])
    x2 = rsf_2(cat([x[:, half_width:2 * half_width], x1], dim=1))
    x3 = rsf_3(cat([x[:, 0:half_width], x2], dim=1))
    x4 = rsf_4(cat([x[:, half_width:2 * half_width], x3], dim=1))
    x = cat([x1, x2, x3, x4], dim=1)
    y = fc(x)

# Load the model for testing
model = load()
model.test()
3. Experiment and Analysis
3.1. Experiment Environment
Dataset: The dataset used in this paper was collected in a maintenance depot workshop. A total of 210 wheelset tread defect images were selected and labeled by defect type: 45 normal, 56 with cracks, 52 with scratches, 42 with peeling, 40 with abrasions, and 20 with severe scratches.
Experimental setup: The experimental platform was PyTorch. The running environment was configured as follows. Processor: Intel Core i9-9900K; memory: 11 GB; graphics card: NVIDIA GeForce RTX 2080 Ti; code environment: Torch 1.9.0, Python 3.9; CUDA version: CUDA 11.1. SGD was used as the optimizer, with a batch size of 32 and 50 training iterations.
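For reference, a minimal sketch of this training configuration is shown below; model and train_set are placeholders, and the learning rate and momentum are assumptions, since the text does not specify them.

import torch
from torch.utils.data import DataLoader

# SGD optimizer, batch size 32, 50 training iterations, as stated above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
for epoch in range(50):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()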
Evaluation indicators: The accuracy, recall, precision, and F1 value were chosen as the evaluation indicators. Accuracy is the proportion of correctly predicted samples among all samples. Recall (R) is the proportion of samples correctly predicted to be positive among all positive samples. Precision (P) is the proportion of samples correctly predicted to be positive among all samples predicted to be positive. The F1 value is the harmonic mean of recall and precision.
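Written in terms of true/false positives and negatives (TP, FP, TN, FN), these indicators take their standard forms:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad R = \frac{TP}{TP + FN},$$

$$P = \frac{TP}{TP + FP}, \qquad F1 = \frac{2 \times P \times R}{P + R}.$$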
3.2. Ablation Experiments
3.2.1. Ablation Experiment of Generative Adversarial Network
In order to objectively evaluate the performance of the generative adversarial network, wheelset tread defect datasets were established with various data generation methods: a dataset without generative processing (Non), a dataset generated by geometric transformation (GT), a dataset generated by pixel transformation (PT), and a dataset generated by a generative adversarial network (Gan). The specific distributions are shown in Table 1.
In the geometric transformation, horizontal flips, vertical flips, and rotations by 90 and 180 degrees were carried out. In the pixel transformation, Gaussian noise, salt-and-pepper noise, brightening, and blurring were applied to the real data. In the generative adversarial network, the generative model variant StyleGan3 was used to generate the data. Samples generated by Non, GT, PT, and Gan are shown in Figure 4.
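For illustration, a minimal torchvision sketch of the GT and PT operations described above is given below; the noise levels, brightness range, and blur kernel size are assumptions, as the text does not specify them.

import torch
from torchvision import transforms

# Geometric transformations (GT): flips and fixed rotations
geometric = [
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.RandomVerticalFlip(p=1.0),
    transforms.RandomRotation((90, 90)),
    transforms.RandomRotation((180, 180)),
]

def add_gaussian_noise(img: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    # Additive Gaussian noise on an image tensor in [0, 1]
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

def add_salt_pepper(img: torch.Tensor, amount: float = 0.02) -> torch.Tensor:
    # Randomly set pixels to black (pepper) or white (salt)
    mask = torch.rand_like(img)
    img = img.clone()
    img[mask < amount / 2] = 0.0
    img[mask > 1 - amount / 2] = 1.0
    return img

# Pixel transformations (PT): noise, brightening, and blurring
pixel = [
    add_gaussian_noise,
    add_salt_pepper,
    transforms.ColorJitter(brightness=(1.2, 1.5)),  # brightening
    transforms.GaussianBlur(kernel_size=5),         # blurring
]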
The training and test datasets were divided in a 4:1 ratio. Resnet was used as the feature extraction module on the above four datasets to verify the effectiveness of each data generation method, and the results are shown in Table 2.
As can be seen from Table 2, each generation method effectively increases the number of training samples, improves sample diversity, and extends the spatial distribution of the samples. Therefore, regardless of whether the tread defect dataset is generated by geometric transformation, pixel transformation, or a generative adversarial network, the performance of the model is effectively improved and its error rate reduced. Comparing the two transformations, tread defects have a scattered, crumb-like distribution and are slightly darker than the background, which resembles the noise added in the pixel transformation and may therefore degrade prediction performance; thus, the dataset generated by geometric transformation is slightly better than that generated by pixel transformation. Overall, the generative adversarial network performs better than either transformation because it can effectively avoid the problems associated with small samples and improve the model's resistance to over-fitting.
3.2.2. Ablation Experiment of Residual Spinal Fully Connected Layer
In order to verify the influence of the improved fully connected layer on model performance, the fully connected layer in Resnet50 was replaced with the improved structure proposed in this paper. Meanwhile, in order to examine the performance difference between the spinal fully connected layer and the residual spinal fully connected layer, comparative experiments were performed on the two models with Resnet [20] as the feature network. In these experiments, all settings were identical except for the fully connected layer. The results are shown in Table 3.
Although the spinal neural network can improve the performance of some network models, it leads to a decrease in performance and an increase in parameters when applied to Resnet50. Spinal-res denotes the model in which the fully connected layer of Resnet50 is replaced with the residual spinal fully connected layer. As seen in Table 3, the detection accuracy of the improved spinal fully connected layer is higher by 0.42%, and it has a stronger ability to identify defect features. This is because the dual-branch decision network provides both local and global views, effectively reducing local information disturbance and improving model robustness.
In addition, the parameter count of a deep convolutional neural network is also an evaluation indicator, used to trade off model complexity against prediction accuracy. A smaller parameter count usually leads to faster training and a smaller storage footprint but may reduce prediction accuracy; conversely, a larger parameter count can improve prediction accuracy but may require longer training time and more storage. As shown in Table 3, although the prediction accuracies of the three models are similar, Spinal-res improves accuracy and reduces the error rate without a significant increase in parameter count, which verifies the effectiveness of the Spinal-res module.
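A minimal sketch for obtaining parameter counts such as those compared in Table 3 is shown below; the helper name is an assumption, and torchvision's Resnet-50 is used only as an example model.

import torch.nn as nn
from torchvision.models import resnet50

# Count trainable parameters, the complexity indicator discussed above.
def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(resnet50()))  # about 25.6 million for a standard Resnet-50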
3.2.3. Ablation Experiment of Attention Mechanism
The dimension and weighting of the feature layer affect the training results of the network. Considering the different impacts of the various feature information dimensions, an attention mechanism module was added after feature extraction in the Resnet training stage. In this experiment, the feature network was set as Resnet-20 [31], and the attention mechanism module was set as a single-channel attention mechanism (SE) or a multi-channel attention mechanism (CBAM). The results are shown in Table 4.
As can be seen from Table 4, compared with the model without an attention mechanism, the accuracy, precision, recall, and specificity of the model are all improved by inserting SE or CBAM: the accuracies improve by 1.25% and 1.25%, the precisions by 1.26% and 1.26%, the recalls by 1.29% and 1.27%, and the specificities by 1.27% and 1.26%, respectively. This indicates that the attention mechanism can assign different weights to different feature information and thus yields better feature expression and model prediction.
3.2.4. Results Analysis
In order to verify the effectiveness of the proposed model, different traditional deep neural networks and lightweight network models were applied for comparison. The traditional deep neural networks included Resnet-50, Resnet-101, and VGG16, and the lightweight network models were Densenet-161, Densenet-169, and ConvNext. The data for model training and testing are the generated data in Section 3.2.1. The results are shown in Table 5.
As can be seen from Table 5, the prediction performance of the deep neural networks (Resnet-50, Resnet-101, VGG16) is superior to that of the lightweight networks (Densenet-161, Densenet-169, ConvNext) for detecting wheelset tread defects. Compared with VGG16, the Resnet-50 and Resnet-101 networks generally achieve better performance, which results from their double-branched residual channels. Furthermore, the improved model based on the Resnet feature extraction network outperforms the others in accuracy, precision, recall, and specificity; specifically, the accuracy of the model proposed in this paper is improved by 7.23% compared with VGG16 and by 1.667% compared with Resnet-50. The improvement comes from the constraining action of the residual fully connected layer and the attention mechanism on the model, which provides greater capability in feature expression and decision making.
4. Conclusions
In order to solve the issue of the insufficient defect feature space caused by small samples, a wheelset tread defect detection method based on a local inference constraint network is proposed. The main innovations are as follows: (1) Wheelset tread defect data generated by a generative adversarial network are used to expand sample distribution diversity while preserving semantic consistency, after which high-dimensional semantic feature vectors are obtained. (2) The limited feature vector weights are increased through the attention mechanism to improve the importance of defect features. (3) In the decision layer, a residual spinal network based on local input is innovatively proposed to obtain better prediction results with few parameters. The experimental results demonstrate that the proposed method achieves higher prediction accuracy in detecting wheelset tread defects than advanced methods such as Resnet-50, VGG16, and ConvNext.
Although this method achieves high accuracy in detecting wheelset tread defects, it still faces practical challenges. Issues such as image blurring caused by vibration and unstable image frames at high speed remain key areas for future research [32].