1. Introduction
As an important part of the global public transportation infrastructure, the railroad network is essential in driving economic progress, enhancing regional interconnectivity, and supporting the movement of goods and people. As operating time accumulates, the rail surface suffers irreversible damage from factors such as the manufacturing process, wheel-rail contact stresses, and natural weathering, which leads to surface defects including seams, scars, abrasions, and spalling, as shown in Figure 1 [1]. Wear, as shown in Figure 1, is the gradual loss of metallic material from the rail surface through prolonged contact with the wheels, and may cause trains to bump. A depression is a sunken area caused by localized loss of material or indentation of the rail surface. Common joint defects include looseness and misalignment; a poor joint can cause shock and noise when a train passes, affecting passenger comfort and exacerbating track damage. Corrugation and abrasion roughen the rail surface, increasing friction with the wheels and accelerating wear. Consequently, to keep rail services running smoothly and efficiently, timely and accurate detection of rail surface defects has become an urgent problem for the railroad industry, with both significant practical value and far-reaching research significance.
Early detection of rail surface defects depended primarily on manual inspection. Although this approach is cost-effective, it is slow and inefficient, which makes it hard to catch defects in their initial stages and can allow more severe problems to develop. With the advancement of technologies such as ultrasonic testing, magnetic particle inspection, and eddy current testing, detection capabilities have greatly improved, and these methods have gradually replaced manual inspection. M. Sun et al. [2] introduced a technique that uses the photoacoustic effect to detect rail surface defects, addressing the inadequacy of ultrasonic signals in detecting surface microcracks. Xiong, L et al. [3] explored ultrasonic guided wave technology for non-destructive testing of rail bottoms, a physical testing method suited to areas that are difficult to inspect with conventional techniques. Han S-W et al. [4] proposed electromagnetic ultrasonic testing (UT) using Electromagnetic Acoustic Transducers (EMATs), a non-contact ultrasonic inspection technology that excites and detects ultrasonic waves on the metal surface through the electromagnetic effect and is particularly suitable for rail testing in high-temperature environments. Although effective, these methods still face limitations in accessibility, real-time monitoring, and coverage of long stretches of track. Rough or irregularly shaped rail surfaces may degrade the coupling and the results of ultrasonic inspection, which may also be insufficiently sensitive to defects in the deeper layers of the material [5]; Magnetic Particle Inspection (MPI) usually shows the presence of defects only qualitatively, making it difficult to determine their exact dimensions and depth [6]; and eddy current inspection struggles to achieve uniform results on complex or irregularly shaped workpieces [7].
With the increasing requirements for safety and reliability of railroad systems, traditional manual inspection methods can no longer meet the demand for efficient and accurate inspection. Existing work on vision-based rail defect detection has made significant progress and shows clear potential for practical use. Machine vision systems use image processing techniques to automatically recognize and locate defects on rail tracks [8].
Despite the significant progress of machine vision in rail defect detection, some challenges remain. In practical applications, complex changes in ambient light, dirt, and occlusions on the rail surface can affect the detection results. Moreover, conventional methods usually rely on manually set parameters such as thresholds, filter sizes, and the shape of structuring elements. These parameters need to be tuned for a specific defect size, so when the defect size varies, the original settings may no longer apply and performance degrades. In addition, the operation of high-speed trains requires the system to be highly real-time and stable, which places greater demands on algorithm efficiency and hardware performance. To address these problems, future research directions include developing more robust image processing algorithms, optimizing the computational efficiency of deep learning models, and designing smarter inspection system architectures. Overall, research on machine vision for rail surface defect detection continues to deepen, providing strong technical support for the safety of railroad transportation.
Based on the above, the existing technical approaches to rail surface defect detection have obtained good results but still face problems and challenges in certain respects. CenterNet [9] is a deep-learning-based target detection algorithm built on a Fully Convolutional Network (FCN) [10], which makes it computationally efficient while maintaining a high level of accuracy. It uses key point estimation to locate the center point of a target and regresses the other attributes from it, which makes it flexible in handling defects of different sizes and shapes and well suited to rail surface defects of varying size and type. The main contributions of this paper are:
A novel fusion feature extraction module based on CenterNet is proposed for a rail surface defect detection network.
The original ResNet backbone is replaced with the efficient feature extraction backbone ResNeXt, and the low-level feature layer is changed into a multi-branch layer, which increases the model’s ability to capture low-level features and provides a richer feature representation for the accurate identification of rail surface defects.
The SKNet attention mechanism, combined with the C2f structure of the YOLOv8 network, is introduced so that the model can focus more on the key areas in the image, improving the detection accuracy of small defects and weak features.
Given the specific shape characteristics of rail defects, we change the traditional circular Gaussian kernel into an elliptical Gaussian kernel that also takes into account the aspect ratio of the ground-truth bounding box (GTbbox), which strengthens the size regression term of the network’s loss and speeds up model training.
3. Results
The core idea of CenterNet is to treat a target as its center point and to determine the target’s boundaries by predicting the location, size, and offset of that center point with a single-stage network. The process requires neither candidate box generation nor complex post-processing, which enables CenterNet to achieve fast, concise, and accurate real-time target detection. However, the original CenterNet is still insufficient for the complex and subtle defects found on the rail surface. We therefore make several improvements to CenterNet to enhance its performance in rail surface defect detection. First, we replace the original ResNet backbone with an improved multi-branch ResNeXt structure. Second, after the feature fusion of the backbone, we introduce a fused multi-scale attention mechanism, the C2f_SKNet module, to further enhance the expression of important features. Third, to better handle defects at different scales, we introduce an FPN into the network to enhance multi-scale defect detection. Finally, in the heat map generation part of the network head, we replace the circular Gaussian kernel with an elliptical Gaussian kernel that better matches the aspect ratio of the defects, which improves the accuracy and robustness of defect detection and also accelerates training. The refined structure of our network is presented in Figure 2.
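The center-point formulation can be illustrated with a minimal decoding sketch. It assumes that the network outputs, for each heat-map peak, a sub-pixel offset and a box size in heat-map units, and that the heat map is down-sampled by a factor of 4 relative to the input image; the function name, argument names, and the stride value are illustrative assumptions rather than details taken from the paper.

```python
def decode_center(cx, cy, off_x, off_y, w, h, stride=4):
    """Turn one heat-map peak into a box in input-image pixels.

    (cx, cy): integer peak location on the heat map.
    (off_x, off_y): predicted sub-pixel offset of the center.
    (w, h): predicted box size, assumed to be in heat-map units.
    stride: assumed down-sampling factor between image and heat map.
    """
    x = (cx + off_x) * stride
    y = (cy + off_y) * stride
    half_w = w * stride / 2
    half_h = h * stride / 2
    # Box as (x_min, y_min, x_max, y_max); no anchors or NMS are involved.
    return (x - half_w, y - half_h, x + half_w, y + half_h)


box = decode_center(cx=10, cy=20, off_x=0.5, off_y=0.25, w=8, h=2)
print(box)
```

Each detection is recovered from a single peak, which is why no candidate boxes need to be generated beforehand.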
3.1. ResNeXt_B
ResNeXt [26] is a convolutional neural network architecture based on ResNet that improves the efficiency and performance of the network by extending the residual module with group convolution. However, grouped convolution may lose information in low-level feature extraction, because each convolutional kernel processes only a subset of the channels. The low-level features that tend to be missed are the basic visual elements of the image, such as edges, corners, and textures. For the rail scar detection task these low-level features are very important, since scars usually exhibit edge abnormalities or texture changes, and the original ResNeXt network may lack the features needed for effective detection [27]. We therefore propose a modified ResNeXt_B network architecture, shown in Figure 3. The architecture is based on the classical ResNeXt and enriches low-level feature extraction by using four parallel grouped convolutional structures at the input feature layer; the outputs of the four branches are then merged into a 128-channel feature map to integrate the features of the different branches and enhance the feature representation.
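The channel bookkeeping behind grouped convolution and the four-branch merge can be made concrete with a small calculation. Only the four branches and the 128-channel merged output come from the text; the input channel count, kernel size, group count, and the equal split of 32 channels per branch are hypothetical values chosen for illustration.

```python
def conv_weights(c_in, c_out, kernel, groups=1):
    """Number of weights in a 2D convolution with square kernels (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each kernel only sees c_in // groups input channels.
    return (c_in // groups) * c_out * kernel * kernel


# Hypothetical stem: 64 input channels, 3x3 kernels, 8 groups per branch.
C_IN, KERNEL, GROUPS = 64, 3, 8
BRANCHES, C_BRANCH = 4, 32           # 4 branches x 32 channels = 128

channels_seen = C_IN // GROUPS       # channels visible to one kernel
per_branch = conv_weights(C_IN, C_BRANCH, KERNEL, groups=GROUPS)
four_branches = BRANCHES * per_branch
merged_channels = BRANCHES * C_BRANCH
dense = conv_weights(C_IN, merged_channels, KERNEL)  # ungrouped reference

print(channels_seen, per_branch, four_branches, merged_channels, dense)
```

Under these assumed numbers, each grouped kernel sees only 8 of the 64 input channels, which is the restriction discussed above, while the four branches together still produce 128 channels with far fewer weights than one dense convolution of the same output width.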
3.2. C2f_SKNet Module
Because rail surface defects of different types vary widely in scale, we introduce the SKNet [28] attention module, which adaptively selects among different convolution kernels to better capture features at different scales in the image. The structure of the module is shown in Figure 4.
SKNet is implemented through three operations: Split, Fuse, and Select. The Split operation is a multi-scale feature extraction step in which the input feature map is convolved with several kernels of different sizes, each forming a feature branch. The different kernel sizes capture features at different scales, and a multi-branch SKNet [29] can adapt to features at multiple scales, so we selected three kernel sizes, 1 × 1, 3 × 3, and 5 × 5, for the Split operation. Because the task involves small targets, we implement the 5 × 5 branch with dilated convolution [30], also called atrous convolution, a method for enlarging the receptive field of a convolution kernel. It inserts gaps between neighboring elements of the kernel; the gaps carry no weights and only enlarge the area the kernel covers, so the coverage grows without a significant increase in computation or parameters. In our design, a 3 × 3 kernel with a dilation rate of 2 inserts one gap between every two neighboring elements, so its receptive field equals that of a 5 × 5 kernel and it captures a wider range of contextual information than the original kernel size would. Despite the larger receptive field, the convolution is still computed only at the original element positions of the kernel, so the number of parameters and the amount of computation do not grow with the receptive field. A comparison between the standard and dilated convolution kernels is shown in Figure 5.
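The relationship between kernel size, dilation rate, and receptive field described above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names are illustrative.

```python
def effective_kernel(kernel, dilation):
    """Side length of the area covered by a dilated square kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def sampled_offsets(kernel, dilation):
    """Offsets along one axis, relative to the center, where weights sit."""
    half = kernel // 2
    return [i * dilation for i in range(-half, half + 1)]


span = effective_kernel(3, 2)        # area covered by a 3x3 kernel, rate 2
offsets = sampled_offsets(3, 2)      # positions that actually carry weights
weights_dilated = 3 * 3              # weights per channel pair, dilated 3x3
weights_dense = 5 * 5                # weights per channel pair, dense 5x5

print(span, offsets, weights_dilated, weights_dense)
```

The dilated 3 × 3 kernel spans five pixels per axis but samples only three of them, so it keeps 9 weights where a dense 5 × 5 kernel would need 25.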
The Split operation is followed by the Fuse operation, which integrates the global information of the combined branch features and generates weights for the different scales through fully connected layers and feature stacking. Finally, the Select operation applies these weights to the feature branches to obtain a weighted combined feature map; this weighting lets the network flexibly adjust its receptive field and adapt to targets of different sizes. Building on SKNet’s ability to enhance feature expression through multi-scale convolution and dynamic selection, we append the cross-stage partial connection structure from the C2f module of the YOLOv8 network to form a more adaptable attention module, C2f_SKNet. Through shared feature maps and multi-scale feature fusion, it strengthens the feature expression capability of the model and provides an efficient feature fusion strategy while maintaining the integrity of the feature information [31]. The final network structure is shown in Figure 6.
The overall structure includes an SKNet module, a two-branch convolutional layer, a Bottleneck module, and a feature concatenation layer. The Conv structure is the 2D convolutional module of YOLOv8. The cross-stage partial connection keeps part of the features unchanged while the other part is processed by the Bottleneck module before the two are fused, which retains rich contextual information and significantly improves the accuracy of small-target detection and of detection against complex backgrounds.
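The Fuse and Select steps of the attention module can be sketched on toy data. This simplified example reduces each branch to one value per channel and replaces the fully connected layers with fixed, made-up logits, so it illustrates only the weighting mechanism, not the module as implemented in the paper.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(branches):
    """Fuse: element-wise sum of the per-channel branch descriptors."""
    return [sum(values) for values in zip(*branches)]


def select(branches, logits):
    """Select: softmax across branches per channel, then a weighted sum."""
    n_channels = len(branches[0])
    weights = [softmax([branch_logits[c] for branch_logits in logits])
               for c in range(n_channels)]
    mixed = [sum(w * branch[c] for w, branch in zip(weights[c], branches))
             for c in range(n_channels)]
    return mixed, weights


# Three branches (1x1, 3x3, dilated 3x3), two channels each; made-up values.
branches = [[1.0, 2.0], [4.0, 2.0], [7.0, 2.0]]
# Hypothetical logits standing in for the fully connected layers' output.
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 5.0]]

fused = fuse(branches)
mixed, weights = select(branches, logits)
print(fused, mixed, weights)
```

Because the weights for each channel sum to one across branches, the module can shift emphasis toward the branch whose receptive field best fits the target; here the second channel is dominated by the third branch.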
3.3. Elliptic Gaussian Kernel
In CenterNet, the anchor-free design generates a heat map by applying a Gaussian kernel at the center of each target: the value is highest at the center point and decreases with distance. Each target generates a Gaussian distribution at its center, and all the Gaussian distributions are then superimposed on a shared heat map. The value of the 2D Gaussian kernel at the image coordinates (x, y) is calculated as in Equation (1):

$$Y_{xy} = \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right) \tag{1}$$

where ($x_0$, $y_0$) are the coordinates of the center point of the target in the heat map, and $\sigma$ is the standard deviation of the Gaussian kernel, which determines the spread of the Gaussian distribution. In the inference phase, the model outputs a predicted heat map, and the location and confidence of each target are determined by looking for peaks in it. Relying on the center point alone for size regression, however, poses a training challenge: because only the information at the center point is used, the model may handle the size and scale of the target less efficiently, which in turn makes training more difficult.
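The heat-map generation and peak search described above can be sketched in plain Python. Overlapping Gaussians are merged with an element-wise maximum, following the convention of the original CenterNet; the function names, the 3 × 3 neighborhood test, and the threshold are illustrative assumptions.

```python
import math


def draw_gaussian(heatmap, cx, cy, sigma):
    """Superimpose a circular Gaussian centered at (cx, cy) on the heat map."""
    for y, row in enumerate(heatmap):
        for x in range(len(row)):
            value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            row[x] = max(row[x], value)  # element-wise maximum on overlap


def find_peaks(heatmap, threshold=0.5):
    """Cells above the threshold that are maximal in their 3x3 neighborhood."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            neighbours = [heatmap[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(neighbours):
                peaks.append((x, y, v))
    return peaks


heatmap = [[0.0] * 16 for _ in range(12)]
draw_gaussian(heatmap, cx=5, cy=6, sigma=2.0)
draw_gaussian(heatmap, cx=12, cy=3, sigma=1.0)
peaks = find_peaks(heatmap)
print(peaks)
```

Each peak carries only the location and confidence of a target, which is why the size has to be regressed separately from the center point.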
Training efficiency is usually improved in one of two ways: increasing the learning rate or reducing data augmentation. Too high a learning rate may make training diverge or become unstable, while less augmentation may cause the model to focus on basic feature learning and overfit. Therefore, we draw on the theoretical ideas of [32] and change the original circular Gaussian kernel into an elliptical Gaussian kernel. This improvement takes the aspect ratio of the target into account in addition to its center point, so the model reflects the shape and size of the target more accurately in localization regression, and it also suits rail surface defects, which vary widely in shape and size. We obtain the long-to-short axis ratio of the elliptical Gaussian kernel from the labeled width and height of the target, and give the new elliptical Gaussian kernel in Equations (2)–(4), in which the long and short axes of the kernel are dynamically adjusted according to the width and height of the target, so that the heat map better matches rail surface defects whose length and width are unequal. A comparison of the circular and elliptical Gaussian kernel heat maps is shown in Figure 7.
As Figure 7 shows, for the same target labeling box, the heat map generated by the elliptical Gaussian kernel covers a larger share of the box, which gives the model a larger region of interest and improves its detection accuracy and training efficiency.
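An elliptical Gaussian heat map of the kind discussed here can be sketched as follows. The rule that ties each standard deviation to the box dimension (box size divided by 6) is a hypothetical choice made for illustration; it is not the paper’s Equations (2)–(4), whose exact form should be taken from the original.

```python
import math


def elliptical_heatmap(height, width, cx, cy, box_w, box_h, divisor=6.0):
    """Heat map with separate spreads along x and y.

    sigma_x and sigma_y follow the box width and height; the divisor is an
    assumed constant, not a value from the paper.
    """
    sigma_x = box_w / divisor
    sigma_y = box_h / divisor
    return [
        [math.exp(-((x - cx) ** 2 / (2 * sigma_x ** 2)
                    + (y - cy) ** 2 / (2 * sigma_y ** 2)))
         for x in range(width)]
        for y in range(height)
    ]


# A wide, flat defect: 12 cells wide and 6 cells high.
heat = elliptical_heatmap(height=9, width=17, cx=8, cy=4, box_w=12, box_h=6)
along_width = heat[4][10]    # two cells from the center along the long axis
along_height = heat[6][8]    # two cells from the center along the short axis
print(heat[4][8], along_width, along_height)
```

For an elongated box the response decays more slowly along the long axis than along the short one, so the positive region follows the shape of the defect instead of a circle fitted to its shorter side.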
4. Experiments and Results
4.1. Experimental Environment
The experiments were run on the Windows 11 operating system with an Intel® Core i7-14650HX CPU, an NVIDIA GeForce RTX 4060 graphics card, and 16 GB of RAM. The deep learning configuration uses CUDA 11.7 and cuDNN 8.9, with PyTorch 2.2.1 as the framework.
4.2. Image Acquisition
The rail surface image acquisition system described in this paper primarily consists of a linear image acquisition unit and a rail detection beam, as depicted in Figure 8. The system includes an industrial-grade high-speed linear CCD camera and a non-visible light source. To synchronize image capture and achieve equidistant spatial sampling with the two linear CCD cameras, a high-precision speed sensor is mounted on the wheelset. The key parameters of the image capture unit are provided in Table 1.
The maximum inspection speed of the railroad inspection vehicle is about 80 km/h. The size of each captured image is 512 mm × 2112 mm, and a single image acquisition unit can capture 25 images per second.
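The acquisition figures above can be cross-checked with simple arithmetic. The sketch assumes that the 2112 mm side of each image lies along the rail, which the text does not state explicitly; the variable names are illustrative.

```python
SPEED_KMH = 80.0          # maximum inspection speed from the text
FRAMES_PER_SECOND = 25    # images per second per acquisition unit
IMAGE_LENGTH_M = 2.112    # assumption: the 2112 mm side lies along the rail

speed_m_s = SPEED_KMH * 1000 / 3600
# Distance the vehicle travels between two consecutive images.
travel_per_frame_m = speed_m_s / FRAMES_PER_SECOND
# Coverage is gap-free if each image spans at least that distance.
gap_free = IMAGE_LENGTH_M >= travel_per_frame_m
# Highest speed at which consecutive images would still touch.
max_gap_free_speed_kmh = IMAGE_LENGTH_M * FRAMES_PER_SECOND * 3.6

print(round(travel_per_frame_m, 3), gap_free, round(max_gap_free_speed_kmh, 2))
```

At 80 km/h the vehicle moves just under 0.9 m between frames, so under the stated assumption consecutive images overlap and the rail is covered without gaps.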
4.3. Rail Surface Defect Dataset
This study uses a customized rail surface defect dataset designed to detect and classify different types of defects on the rail surface. The dataset consists of 321 images of rail surface defects, including four common types of rail surface defects: abrasions, scars, dents, and seams. Each image clearly shows a specific type of defect for model training and testing.
4.4. Evaluation Indicators
To validate the improvement of the model, we use precision, recall, and mean average precision (mAP) for comprehensive evaluation. They are calculated as in Equations (5)–(8):

$$Precision = \frac{TP}{TP + FP} \tag{5}$$

where TP denotes the number of positive samples that were correctly predicted as positive, and FP denotes the number of negative samples that were incorrectly predicted as positive. Precision reflects how accurately the model recognizes positive samples, i.e., how many of the results that the model predicts as positive are correct.

$$Recall = \frac{TP}{TP + FN} \tag{6}$$

where FN denotes the number of positive samples that were incorrectly predicted as negative. Recall reflects the model’s ability to cover positive samples, i.e., how many of all positive samples are correctly identified.

$$AP = \int_0^1 P(R)\,dR \tag{7}$$

For each category, the area under the precision-recall curve is calculated as the average precision (AP) for that category. A common calculation is to divide the recall range into points and then average the maximum precision at each point.

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{8}$$

The APs of all categories are averaged to obtain mAP, where N is the total number of categories. In the rail surface defect detection task, mAP is obtained experimentally and measures the overall detection performance of the model across the defect types.
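The evaluation metrics can be sketched in a few lines. The AP function uses the point-sampling scheme mentioned above with 11 equally spaced recall points, which is one common choice and not necessarily the one used in the experiments; all counts and curve values below are made up.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions, n_points=11):
    """Average of the maximum precision at or beyond each recall point."""
    total = 0.0
    for i in range(n_points):
        threshold = i / (n_points - 1)
        reachable = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(reachable) if reachable else 0.0
    return total / n_points


def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)


p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
ap = average_precision(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
m_ap = mean_average_precision([0.75, 0.25])
print(p, r, ap, m_ap)
```

A detector that misses many defects keeps a high precision but a low recall, which is why both values are reported alongside mAP.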
We assess the computational complexity of the model using FLOPs, which indicate the number of floating-point operations required during a single forward pass. The size of the model is measured by the number of parameters; fewer parameters result in a smaller model size and lower storage and computation requirements. In addition, we also use FPS (frames per second) to measure the actual running speed of the model as an important indicator, and the average training time (Training time) to measure the model training efficiency.
4.5. Ablation Experiment
In order to verify the influence of each component on the performance of our proposed model for detecting surface defects on rails, we conducted ablation experiments. The setup and results of the experiment are shown in
Table 2.
Experiment 1 is the unimproved CenterNet model without any of the proposed components; its mAP and Recall reach 0.893 and 0.725, respectively, while its training time is relatively long at 0.2925 s/L. In Experiment 2, we replace the ResNet backbone of the original model with the multi-branch ResNeXt network. Owing to the introduction of group convolution, this improves network performance and enriches the extraction of low-level features: mAP improves by 3.8% and Recall by 2.5%. In Experiment 3, we replace the original circular Gaussian heat map with an elliptical Gaussian heat map that has a larger region of interest, which improves both the prediction ability and the training efficiency of the model, with a 2.4% improvement in mAP and 36.6% faster training than before. To verify how the fused multi-scale attention enhances the feature extraction of the network, in Experiment 4 we add the C2f_SKNet module to the configuration of Experiment 2, which yields better mAP and Recall but slower training, owing to the substantial increase in the number of parameters and the larger number of network layers. Finally, Experiment 5 combines all the improvements, and the model performance improves further, with a 4.9% improvement in mAP, a 10.7% improvement in Recall, and a 35.2% speedup in training time.
4.6. Performance Comparison Experiment
To verify the effectiveness of our proposed method in detecting rail surface defects, we compared it with several mainstream target detection networks. A detailed comparison of the experimental results is shown in Table 3, and a visual comparison of network performance is shown in Figure 9, where the four colored boxes are the detection boxes for the four types of defects. Among the compared networks, Faster R-CNN [33], Cascade R-CNN [34], and TridentNet [35] are two-stage networks with a large number of parameters that improve precision at the expense of processing speed. Faster R-CNN has high detection accuracy; it is optimized for region extraction and feature processing, serves as a benchmark for many high-precision detection tasks, and is well suited to complex backgrounds and multi-scale defects such as rail defects. Cascade R-CNN is an improved version of its predecessor that further improves detection accuracy through a multi-stage cascade structure, especially for difficult objects. TridentNet introduces a multi-branch structure to better handle objects of different scales, which is similar to our improvement, so the experimental results allow a direct comparison of strengths and weaknesses. However, these networks are still not as good as the network in this paper in terms of precision and recall: our network is 9.17%, 9.17%, and 5.43% better than the three in precision, and 41.62%, 27.30%, and 21.12% better in recall. SSD is one of the most classical one-stage networks; it is suitable for real-time inspection tasks and is a useful reference for evaluating the balance between speed and accuracy of new models. Although it has 33.78% fewer parameters than our network and a 55.63% higher FPS, its detection precision and recall are far inferior to those of our network. FCOS is a single-stage detector without anchor boxes and is the closest to CenterNet, the baseline network of this paper; it uses a fully convolutional network to directly predict the location and class of an object, simplifying the detection process and improving speed. YOLOv8 is one of the more effective algorithms in the YOLO family and combines a number of advanced detection techniques. However, because it lacks strategies designed for small objects, it is not as effective as our network, which outperforms it by 1.93% and 5.24% in precision and recall, respectively.
We also plot the number of parameters (Params) of each network against its mAP as a scatter plot in Figure 10 to show the overall performance of each model. The figure shows that our network significantly improves detection accuracy while keeping complexity low, highlighting its overall performance advantage.
5. Conclusions
In this paper, we propose a new model based on CenterNet that integrates a multi-branch ResNeXt backbone network with a fused multi-scale attention mechanism, specifically targeting the challenges of detecting small targets and handling a diverse range of defects on the rail surface. Our model achieves the highest mean Average Precision (mAP) of 0.763, outperforming several mainstream target detection models. In addition, the Recall rate of our model reaches 0.803, significantly higher than other models, which indicates its superior performance in detecting small and less prominent defects.
In terms of efficiency, our model maintains a relatively low computational complexity with 90.86 GFLOPs and 32.498 million parameters, while achieving a processing speed of 33.4 images per second. This makes it not only faster than two-stage networks like Faster R-CNN but also competitive with other single-stage detectors, as the comparison experiments demonstrate. This balance between accuracy and efficiency highlights the effectiveness of integrating the SKNet attention mechanism with the C2f module from YOLOv8, as detailed in the ablation experiment in Table 2.
By comparing the various indices from our experiments, we demonstrate that our model achieves high-accuracy rail surface defect detection without significantly sacrificing speed or computational resources. Specific improvements include a 6.6% increase in mAP and a 35.5% reduction in training time compared to the unimproved model. The model’s detection performance on real data, as shown in Figure 10, reflects its lower missed-detection rate and reliable target detection, making it more suitable for deployment in real-time applications where both accuracy and efficiency are critical. These results confirm that our approach provides a robust and efficient solution for real-time rail defect detection.