1. Introduction
Railway transportation plays a vital role in national economic growth, serving as a key component of the transportation network. Precise railway track extraction is fundamental for creating detailed electronic maps, maintaining smooth operations, and safeguarding lives and property [1]. Traditionally, railway tracks have been inspected manually or with specialized vehicles, but these methods are often inefficient and lack comprehensive, high-frequency monitoring capabilities [2,3]. Manual inspections across extensive railway networks are particularly labor-intensive. Moreover, optical sensors on inspection vehicles often collect unsatisfactory data due to the loss of information details, adversely affecting the accuracy of subsequent defect analysis and detection. With advancements in remote sensing and UAV technology, UAV inspections are increasingly utilized for the intelligent detection of railway tracks, significantly enhancing the safety of high-speed railways. Equipped with high-definition cameras, UAVs can efficiently collect data on railway infrastructure without disrupting normal operations, thus improving both detection accuracy and efficiency [4]. However, challenges remain: the complex backgrounds in UAV images, the diversity of railway track shapes, and the variability of shooting angles make high-precision extraction of track lines particularly difficult.
Traditional methods for railway track extraction often require substantial prior knowledge from researchers. Several studies have applied conventional computer vision algorithms to track detection [5,6]. However, these methods frequently misclassify railway tracks because of their spectral similarity to other features such as buildings, fields, water bodies, and parking lots. As a result, they often suffer from lower-than-expected classification accuracy and from limited extraction precision and robustness. Advanced detection algorithms are therefore needed to improve the effectiveness of UAV-based railway track inspections.
Traditional machine learning techniques play a key role in image feature detection; methods such as the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) provide robust solutions for detecting and describing local features in images under varying conditions [7,8]. The Histogram of Oriented Gradients (HOG) and Haar cascades are also widely used in object detection, as they capture shape information and recognize features rapidly [9,10]. These methods established the groundwork for detecting railway track features in UAV imagery. Deep learning has advanced rapidly in recent years, significantly improving image recognition and object detection from UAV imagery [11]. Notably, the UAV benchmark study on object detection and tracking has been instrumental in driving these advancements, establishing foundational benchmarks and performance metrics for various applications of UAV-based visual systems [12,13]. This progress extends to a range of applications, including road extraction [14], lane detection [15], railway foreign-body detection [16], rail surface defect inspection [17], and crop row detection and guidance systems [18,19]. Deep learning-based methods for detecting railway tracks from high-resolution UAV imagery offer notable advantages over traditional approaches: they require less prior knowledge, reduce researchers’ workload, and handle the complex environments encountered in railway track extraction more effectively. These methods have already been employed in extracting railway tracks from UAV imagery and in creating electronic railway maps [20,21].
Given the similarities between road extraction and railway track detection, methods developed for road extraction are also effective for railway track detection from UAV imagery. These methods identify the network structure of roads or railways in high-resolution remote sensing images (with resolutions of 0.5 to 1 m), sourced from UAV and satellite remote sensing platforms, by recognizing multi-scale features. Unlike unsupervised approaches, which often rely on color-based segmentation, deep learning techniques utilize a range of features, including texture, geometric shapes, and line patterns, to extract roads [22]. Currently, many road detection algorithms for high-resolution remote sensing data use deep neural networks based on encoder-decoder structures such as FCN, UNet, and DeepLabV3+, and researchers continue to optimize network architectures, objective functions, and training strategies to achieve more precise road segmentation.
Research on road extraction has focused on enhancing backbone networks, context information extraction, and attention mechanisms. Convolutional neural networks (CNNs), as a foundational deep learning architecture, contribute significantly to road segmentation. For example, Tao et al. proposed the Seg-Road model, which combines Transformer and CNN techniques for road extraction from remote sensing images [23]. Similarly, Qiu et al. developed the Semantic Geometry Network (SGNet), which uses dual-branch backbones to extract roads from high-resolution images [24]. Unlike CNN models that use dense layers to generate fixed-length feature vectors and therefore require fixed-size inputs, Fully Convolutional Networks (FCNs) employ interpolation layers to upsample feature maps, allowing them to restore the input size and process images of any dimension. Varia et al. utilized the FCN-32 network to segment road sections from ultra-high-resolution UAV imagery [25], while Kestur et al. developed a novel U-shaped FCN (UFCN) specifically designed for road extraction from UAV images [26]. Zhu et al. introduced the Global Context-Aware and Batch-Independent Network (GCB-Net) for this purpose [27]. In GCB-Net, Global Context-Aware (GCA) blocks within the encoder enhance the capture of global spatial relationships, while multi-parallel unfolding convolutions enable the extraction of multi-scale road features, improving the model’s overall efficacy and the connectivity of road topology. Additionally, Dai et al. introduced the Road-Enhanced Deformable Attention Network (RADANet), which leverages road shape priors and deformable attention mechanisms to extract road information from high-resolution images, effectively capturing semantic shape information and long-range dependencies between road features [28].
Despite substantial advances in road segmentation through deep learning, applying these methods to railway track extraction poses several challenges. Existing models often lack the edge detection accuracy required for track lines and struggle with the thin, linear structure of railway tracks. Pixel-level feature extraction with deep networks may introduce noise, gaps, or discontinuities, and these models can fail to adapt to variations in the tilt angles of railway tracks in UAV imagery, which reduces segmentation accuracy. In addition, deep neural networks are often computationally inefficient due to their many layers and large parameter counts. A further challenge is the absence of specialized training datasets designed specifically for railway track extraction.
To address these challenges, this paper introduces an improved NL-LinkNet deep learning network, termed NL-LinkNet-SSR, specifically designed for extracting railway tracks from UAV aerial images [29]. This network provides a robust solution for the automated extraction of railway track lines. Building on the conventional NL-LinkNet architecture, the model integrates a Sobel edge detection module and a parameter-free SimAM attention mechanism. These additions markedly improve the network’s ability to detect railway track edges, increasing the precision and reliability of the track extraction process. The key contributions of this study include:
(1) The encoder integrates Sobel edge detection modules and non-local blocks to extract edge information of railway tracks from the input images and fuse it with the original feature maps. This integration strengthens the network’s edge perception, enabling the model to capture fine details and contextual information about the railway tracks and thereby improving extraction accuracy and robustness.
(2) The decoder incorporates the SimAM attention mechanism, which is applied to the output feature maps of each decoder block. This results in weighted feature maps that emphasize the railway track regions, selectively amplifying the feature responses in these areas. The parameter-free nature of SimAM ensures high computational efficiency without the need for additional learning parameters.
(3) A new dataset consisting of 12,130 high-resolution railway track images and their corresponding label images has been developed, providing a valuable data resource for railway track extraction from UAV images.
The remainder of this paper is organized as follows: Section 2 reviews related work on deep learning-based railway track extraction methods. Section 3 introduces the experimental data from UAV images, the algorithm framework for track extraction, and the network structure. Section 4 describes the experimental setup and evaluation metrics. Section 5 presents a detailed analysis of the experimental results, including comparisons of different models and ablation studies. Finally, Section 6 and Section 7 provide the discussion and conclusions, respectively.
3. Dataset and Methodology
The algorithmic framework outlined in this paper is depicted in Figure 1; it significantly enhances the detection of railway tracks from aerial images captured by drones. The core innovation lies in the adaptation of the NL-LinkNet network, which now incorporates a Sobel edge detection module and a SimAM attention mechanism residual module [31]. These enhancements are specifically designed to improve the network’s sensitivity to railway track features, a critical aspect for achieving higher precision in identifying and delineating these structures in high-resolution drone imagery. Initially, high-resolution images of ground railway tracks were captured using drone aerial photography, and corresponding labeled images were generated. Data augmentation techniques were applied to enhance model training. The NL-LinkNet network was then enhanced through the integration of the edge detection module and the SimAM attention mechanism residual module [33], aimed at improving the model’s ability to identify railway track features. The upgraded NL-LinkNet network was subsequently used for training and testing to assess the accuracy of the model’s predictions. Finally, comparative experiments and ablation studies were conducted to validate the effectiveness and advantages of the model.
3.1. UAV Data Preprocessing and Dataset Introduction
This dataset was acquired through aerial photography with DJI (Da-Jiang Innovation) drones, specifically the DJI Matrice 300 RTK equipped with a Zenmuse P1 camera, covering specific railway lines in Qingdao and Nanjing, China. The drone operated at a typical altitude of 150 m with a pixel size of 4.4 μm and a 35 mm lens, resulting in a Ground Sample Distance (GSD) of 15.08 cm, which ensures detailed and precise imagery for analysis. The dataset comprises 214 high-resolution images, each with a resolution of 8192 × 5460 pixels at 96 dpi, totaling approximately 49.33 GB. The images were captured along a predetermined linear flight path at regular intervals, which introduced some redundancy in the railway information.
Images containing railway tracks were specifically selected to ensure complexity and diversity, meeting the requirements for railway track line extraction tasks. These images were initially preprocessed with Gaussian filtering to reduce noise. Using the coordinate information from the drone images, track line points were calculated through linear interpolation, and these points were connected to form lines, resulting in binarized label images with the track lines. These label images were automatically generated using a Python program.
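The label-generation step can be illustrated with a short sketch. The snippet below is a simplified, hypothetical version of such a script, not the authors’ exact program: it linearly interpolates between sparse track points already projected into image coordinates and rasterizes the connected polyline into a binary mask with OpenCV. The point coordinates, line width, and image size shown are illustrative assumptions.

```python
# Hypothetical sketch of the label-generation step: interpolate track points
# given in image (pixel) coordinates and rasterize them as a binary line mask.
import numpy as np
import cv2

def make_track_label(image_shape, track_points_px, line_width=5):
    """track_points_px: list of (x, y) pixel coordinates of sparse track points."""
    h, w = image_shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    pts = np.asarray(track_points_px, dtype=np.float64)
    # Linear interpolation between consecutive points to densify the polyline.
    dense = []
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        dense.extend(zip(np.linspace(x0, x1, n), np.linspace(y0, y1, n)))
    dense = np.round(np.asarray(dense)).astype(np.int32).reshape(-1, 1, 2)
    # Connect the interpolated points into a line and binarize (255 = track line).
    cv2.polylines(mask, [dense], isClosed=False, color=255, thickness=line_width)
    return mask

# Example call with assumed image size and track points.
label = make_track_label((5460, 8192), [(120, 5400), (4100, 2700), (8100, 60)])
```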
For model training, the filtered images and corresponding binarized track line label images were cropped using a sliding window algorithm (sketched below), resulting in 12,130 image slices. These slices were then split into a training set and a test set in a 9:1 ratio, with the test set also serving as a validation set. Original images and label images from part of the dataset are shown in Figure 2.
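A minimal sketch of the sliding-window cropping and the 9:1 split is given below. The tile size, stride, directory layout, and file-naming scheme are assumptions for illustration; the paper does not specify these values.

```python
# Illustrative sketch of the sliding-window cropping and 9:1 train/test split.
# Tile size, stride, and file naming are assumed values, not the authors' settings.
import os, glob, random
import cv2

TILE, STRIDE = 1024, 1024          # assumed tile size and stride
random.seed(0)

def slice_pair(image_path, label_path, out_dir):
    """Crop an image and its binary track-line label into aligned tiles."""
    os.makedirs(out_dir, exist_ok=True)
    img = cv2.imread(image_path)
    lab = cv2.imread(label_path, cv2.IMREAD_GRAYSCALE)
    h, w = lab.shape
    name = os.path.splitext(os.path.basename(image_path))[0]
    tiles = []
    for y in range(0, h - TILE + 1, STRIDE):
        for x in range(0, w - TILE + 1, STRIDE):
            stem = f"{name}_{y}_{x}"
            cv2.imwrite(os.path.join(out_dir, stem + "_img.png"), img[y:y + TILE, x:x + TILE])
            cv2.imwrite(os.path.join(out_dir, stem + "_lab.png"), lab[y:y + TILE, x:x + TILE])
            tiles.append(stem)
    return tiles

all_tiles = []
for img_path in glob.glob("raw/*_img.tif"):                      # hypothetical naming scheme
    all_tiles += slice_pair(img_path, img_path.replace("_img", "_lab"), "tiles")

random.shuffle(all_tiles)
split = int(0.9 * len(all_tiles))                                # 9:1 train/test split
train_set, test_set = all_tiles[:split], all_tiles[split:]
```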
3.2. Data Augmentation
Several data augmentation techniques were applied to the railway track image datasets to enhance the model’s generalization and robustness. These included random adjustments of image hue, saturation, and brightness to simulate various lighting conditions, with hue altered within a −60 to 60 degree range and saturation and brightness varied between −50 and 50 units. Perspective transformations introduced variability through random translations, scaling, and rotations, with shift limits from −0.7 to 0.7, scale limits from −0.8 to 0.8, and rotation limits from −90 to 90 degrees [34]. Gaussian noise was added with a variance limit ranging from 15 to 60 to simulate sensor noise and realistic image acquisition conditions. Elastic transformations with an alpha of 120, a sigma of 6, and an alpha affine of 3.6 simulated natural scene distortions to help the model adapt to local deformations. Coarse Dropout randomly created black patches or holes, ranging from 2 to 8 holes of 8 × 8 to 16 × 16 pixels, to simulate occlusions. Additionally, random horizontal and vertical flips and 90-degree rotations were employed to further increase the diversity of the dataset and its robustness against directional biases. All augmentations were implemented using the Albumentations library, as sketched below, producing a diverse set of images that trains the model to perform effectively under varied conditions.
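The following sketch assembles the pipeline described above with Albumentations. The numeric limits mirror the ranges reported in the text, while the argument names follow older Albumentations releases (e.g., the 1.3.x API) and may differ in newer versions; the probability values are assumptions.

```python
# Sketch of the augmentation pipeline described above, using Albumentations.
import numpy as np
import albumentations as A

train_transform = A.Compose([
    A.HueSaturationValue(hue_shift_limit=60, sat_shift_limit=50,
                         val_shift_limit=50, p=0.5),
    A.ShiftScaleRotate(shift_limit=0.7, scale_limit=0.8,
                       rotate_limit=90, p=0.5),
    A.GaussNoise(var_limit=(15, 60), p=0.5),
    A.ElasticTransform(alpha=120, sigma=6, alpha_affine=3.6, p=0.5),  # alpha_affine: older API
    A.CoarseDropout(min_holes=2, max_holes=8,
                    min_height=8, min_width=8,
                    max_height=16, max_width=16, p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
])

# Dummy stand-ins; in practice these are the cropped UAV tiles and their labels.
image = np.zeros((512, 512, 3), dtype=np.uint8)
mask = np.zeros((512, 512), dtype=np.uint8)

# The same spatial transforms are applied to the image and its track-line mask.
augmented = train_transform(image=image, mask=mask)
image_aug, mask_aug = augmented["image"], augmented["mask"]
```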
3.3. Algorithm Framework
The improved NL-LinkNet network architecture is illustrated in Figure 1. The framework is based on a ResNet34 [35] encoder-decoder structure, designed to enhance feature representation in railway track line segmentation tasks. The network incorporates non-local attention modules in the third and fourth encoder layers to capture long-range dependencies and global contextual information.
A Sobel edge detection module is also integrated into the network to extract edge features of railway track lines from the images. Sobel filtering is a classical image processing technique that calculates the gradient magnitude at each pixel to highlight areas where intensity changes sharply, indicating edges or boundaries within the image. The module applies horizontal and vertical Sobel filters via convolution to capture intensity changes in both orientations. The combined edge maps from the two directions create a comprehensive depiction of the scene, which is further refined by a 1 × 1 convolution that adjusts the channel dimensions to match subsequent network layers. This integration within the deep learning framework significantly enhances the model’s ability to detect railway tracks with high accuracy. The edge features are then fused with the output of the first encoder layer through a channel adjustment layer, strengthening the model’s capability to extract detailed track line features.
Additionally, a SimAM attention module is appended to each decoder block to refine feature representation through an adaptive mechanism. The module is an attention mechanism that dynamically adjusts the focus of the neural network on important features within an image. By computing attention scores based on the significance of each feature, SimAM effectively directs the model’s computational resources towards areas of interest, enhancing detection accuracy and efficiency. This method is particularly beneficial in environments with variable and intricate backgrounds where distinguishing key features from noise is critical.
After several convolutional layers and nonlinear activation functions, the network produces high-precision segmentation results for railway track lines. By leveraging the strong feature extraction capabilities of the ResNet34 network and incorporating these enhancement modules, the architecture achieves superior performance in extracting railway track lines from complex scenes.
3.3.1. Non-Local Attention Module
Non-local attention blocks (NLBs) are a key enhancement for convolutional neural networks (CNNs), designed to capture long-range dependencies in feature maps [29,36]. Traditional CNNs often face limitations due to their restricted receptive fields, which hinder their ability to reference distant spatial information. Non-local blocks overcome this limitation by computing the response at a given position as a weighted sum of the features from all positions in the input feature map. This allows each spatial point to aggregate contextual information from the entire image, enhancing the network’s ability to process and understand complex patterns that span large areas.
In railway track extraction, nonlocal operations provide significant benefits. High-resolution satellite images of railway tracks may be obscured by shadows, trees, or buildings, making accurate detection challenging for conventional methods. By incorporating the NLBs, models can leverage global information across the entire image, improving the precision of track extraction even in the presence of such obstacles. The ability of nonlocal blocks to compute a weighted sum across the entire feature map enables each spatial point to gather contextual information from the whole image. This capability enhances the model’s performance, as demonstrated by NL-LinkNet’s superior results in the DeepGlobe Challenge, where it outperformed state-of-the-art models with fewer parameters and faster training times.
The function of the non-local block can be described by the following equation:

$$ y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), $$

where $y_i$ is the output at position $i$, and $x_i$ and $x_j$ represent the input features at positions $i$ and $j$, respectively. The function $f(x_i, x_j)$ computes the pairwise relationship (or affinity) between features at positions $i$ and $j$, and $\mathcal{C}(x)$ is a normalization factor, typically set as $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$. A common choice for $f$ is the embedded Gaussian function:

$$ f(x_i, x_j) = \exp\!\big(\theta(x_i)^{\top} \phi(x_j)\big), $$

where $\theta(x_i) = W_{\theta} x_i$ and $\phi(x_j) = W_{\phi} x_j$ are linear embeddings, and $g(x_j) = W_{g} x_j$ is a linear embedding applied to the input features. The non-local block uses the learnable weights $W_{\theta}$, $W_{\phi}$, and $W_{g}$ to transform the input features, computes the pairwise similarity between features using the embedded Gaussian function, aggregates features from all positions weighted by this similarity, and adds the aggregated features back to the original input to form the output, maintaining residual connections that preserve both local and global information.
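A minimal PyTorch sketch of an embedded-Gaussian non-local block following the equations above is shown below. The channel-reduction factor and layer choices are common conventions rather than the exact NL-LinkNet configuration.

```python
# Minimal sketch of an embedded-Gaussian non-local block (NLB).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, in_channels, reduction=2):
        super().__init__()
        inter = in_channels // reduction
        self.theta = nn.Conv2d(in_channels, inter, kernel_size=1)  # W_theta
        self.phi   = nn.Conv2d(in_channels, inter, kernel_size=1)  # W_phi
        self.g     = nn.Conv2d(in_channels, inter, kernel_size=1)  # W_g
        self.out   = nn.Conv2d(inter, in_channels, kernel_size=1)  # restore channel count

    def forward(self, x):
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # B x HW x C'
        phi   = self.phi(x).flatten(2)                     # B x C' x HW
        g     = self.g(x).flatten(2).transpose(1, 2)       # B x HW x C'
        # f(x_i, x_j) = exp(theta_i^T phi_j); softmax performs the 1/C(x) normalization.
        attn  = F.softmax(torch.bmm(theta, phi), dim=-1)   # B x HW x HW
        y     = torch.bmm(attn, g)                         # aggregate features from all positions
        y     = y.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                             # residual connection

# Example: apply an NLB to a deep encoder feature map.
feat = torch.randn(1, 256, 32, 32)
print(NonLocalBlock(256)(feat).shape)   # torch.Size([1, 256, 32, 32])
```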
3.3.2. Edge Detection Module
The Sobel operator is a commonly used method for edge detection in image processing. It utilizes two 3 × 3 kernels: one for detecting horizontal edges and another for detecting vertical edges. These kernels convolve with the input image to approximate the gradients in the horizontal and vertical directions. The gradient magnitude at each pixel is then computed as the square root of the sum of the squares of these horizontal and vertical gradients. In the model, the feature maps generated by the Sobel edge detection module are concatenated with those from the first layer of the LB module in the encoder along the channel dimension. This process enriches the model’s representation of the input image.
During implementation, the Sobel kernels are first transferred to the same GPU device as the input tensor to ensure consistent computation. The input tensor is then padded using reflection padding to handle boundary pixels and preserve edge information. Assuming the input tensor has dimensions $C \times H \times W$, where $C$ is the number of channels, $H$ is the height, and $W$ is the width, the padded tensor has dimensions $C \times (H+2) \times (W+2)$ for the 3 × 3 kernels. The Sobel operator is then applied in both the horizontal and vertical directions to the padded tensor, producing two edge maps that represent the horizontal and vertical edge intensities. The convolved output size remains $C \times H \times W$ for both the horizontal and vertical convolutions, maintaining the same number of channels as the input.
To compute the overall edge strength (gradient magnitude) at each pixel, the square root of the sum of the squares of the horizontal and vertical edge maps is calculated, with a small epsilon (1e-8) added to avoid zero gradients. The resulting edge strength is then maximized along the channel dimension while maintaining the original spatial dimensions, yielding a feature map of size $1 \times H \times W$. Finally, a 1 × 1 convolution layer adjusts the channel dimension of the edge map so that it matches the expected input size of the subsequent layers. This module effectively converts the input image into an edge map that highlights edges in both horizontal and vertical directions, providing crucial information for subsequent railway track line processing.
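The sketch below mirrors this description in PyTorch. The output-channel count of the final 1 × 1 convolution is an assumption chosen to match a typical first encoder stage; registering the kernels as buffers keeps them on the same device as the module.

```python
# Sketch of the Sobel edge-detection module described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdge(nn.Module):
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        # One horizontal and one vertical kernel per input channel (depthwise filtering).
        self.register_buffer("kx", gx.expand(in_channels, 1, 3, 3).clone())
        self.register_buffer("ky", gy.expand(in_channels, 1, 3, 3).clone())
        self.adjust = nn.Conv2d(1, out_channels, kernel_size=1)  # match next layer's channels
        self.in_channels = in_channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")                # preserve border pixels
        ex = F.conv2d(x, self.kx, groups=self.in_channels)        # horizontal gradients, C x H x W
        ey = F.conv2d(x, self.ky, groups=self.in_channels)        # vertical gradients,  C x H x W
        mag = torch.sqrt(ex ** 2 + ey ** 2 + 1e-8)                # gradient magnitude
        mag, _ = mag.max(dim=1, keepdim=True)                     # max over channels -> 1 x H x W
        return self.adjust(mag)                                   # 1x1 conv adjusts channels

edge_feats = SobelEdge()(torch.randn(1, 3, 256, 256))
print(edge_feats.shape)   # torch.Size([1, 64, 256, 256])
```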
3.3.3. SimAM Attention Mechanism
SimAM is a lightweight, parameter-free attention mechanism designed for convolutional neural networks and commonly used in visual tasks such as image classification, object detection, and image segmentation [33]. Unlike traditional channel or spatial attention modules, SimAM calculates three-dimensional attention weights for feature maps directly within the inference layer without increasing the network’s parameter count. Inspired by neuroscience principles, the module employs an energy function to evaluate the importance of each neuron, reducing model complexity and computational cost. This approach enhances the representational capacity of convolutional neural networks, leading to improved performance in visual tasks such as railway track line extraction.
First, the module calculates the mean $\hat{\mu}$ and variance $\hat{\sigma}^2$ of the input feature map $X$. The minimal energy for each neuron $t$ is defined as

$$ e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}, $$

where $\hat{\mu}$ and $\hat{\sigma}^2$ are the mean and variance of the feature map, respectively, and $\lambda$ is a predefined coefficient. This equation indicates that a lower energy value $e_t^{*}$ signifies a more important neuron for visual processing. Consequently, the importance of each neuron can be determined by $1/e_t^{*}$. The squared difference $d = (X - \hat{\mu})^2$ between each feature and the mean is then calculated and normalized by the adjusted variance term $v = \sum d / n$, where $n$ is the number of elements minus one. The inverse of the energy function is computed as $E_{\mathrm{inv}} = d / \big(4(v + \lambda)\big) + 0.5$. Finally, the attention map is generated by applying a sigmoid function to the inverse energy values, and the feature map is refined by scaling: $\tilde{X} = \mathrm{sigmoid}(E_{\mathrm{inv}}) \odot X$.
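The computation above translates almost directly into code. The sketch below follows the reference SimAM formulation; the value of $\lambda$ shown is a commonly used default and is an assumption here.

```python
# Parameter-free SimAM attention as described above (lambda = 1e-4 is an assumed default).
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1                                            # number of elements minus one
        mu = x.mean(dim=(2, 3), keepdim=True)                    # per-channel spatial mean
        d = (x - mu) ** 2                                        # squared difference from the mean
        v = d.sum(dim=(2, 3), keepdim=True) / n                  # adjusted variance term
        e_inv = d / (4 * (v + self.lam)) + 0.5                   # inverse of the minimal energy
        return x * torch.sigmoid(e_inv)                          # refine features by scaling

refined = SimAM()(torch.randn(1, 64, 128, 128))                 # same shape as the input
```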
5. Results and Analysis
5.1. Visualization of Railway Track Lines Extraction
The visualization results of the railway track line extraction are presented in Figure 3. The proposed model clearly identifies railway track line regions across various angles and complex backgrounds, robustly segmenting track lines and differentiating them from diverse environments, including snowy conditions, urban clutter, and foliage. This is evidenced by the distinct contrast between the original images and the processed results, where the track lines are prominently highlighted. These examples demonstrate the model’s strong generalization across a broad range of scenarios, confirming its adaptability to different railway track orientations and environmental conditions. The Sobel edge detection module, combined with the SimAM attention mechanism, enhances the model’s sensitivity to subtle edge details, which is crucial for accurate track delineation in complex scenes. Furthermore, the inclusion of NLBs in the architecture allows the model to capture and exploit global contextual information, maintaining performance even when local visual information is compromised by occlusions or blending with background textures. This integration of NLBs with edge detection and attention mechanisms proves particularly effective where track lines are obscured or blend into backgrounds of similar texture, demonstrating the model’s strong feature extraction capability and robustness under challenging real-world conditions.
To demonstrate the effectiveness and robustness of the method in extracting railway track lines, the model’s performance was compared against four other network models: NL-LinkNet, DeepLabv3+, U-Net, and FCN. All experiments were conducted under identical conditions using the same dataset to ensure fairness and objectivity. The results of the different models in railway track line extraction are illustrated in Figure 4.
The NL-LinkNet model shows relatively accurate track line extraction but exhibits deviations and broken lines, especially in the first and third images, where the continuity of the extracted lines is compromised. The DeepLabv3+ model performs well in capturing most railway lines accurately, but there are minor inconsistencies, particularly in the first and third images, where the lines are not as precise as the ground truth. The U-Net model struggles with accurately extracting railway lines, displaying noticeable gaps and noise, especially in the first, second, and fourth images. The FCN model has difficulty maintaining the continuity of railway lines, resulting in significant gaps and misdetections, especially in the first, third, and fourth images. In contrast, the proposed method outperforms the other models, delivering the most accurate and continuous railway track line extractions. The extracted lines closely match the ground truth with minimal deviations and noise, demonstrating the robustness and effectiveness of the proposed approach. Moreover, compared to the proposed method, the U-Net, DeepLabv3+, and FCN models often miss segments with similar colors and textures and struggle to extract smaller segments. These models perform poorly in extracting railway track lines, particularly in shadowed areas, due to insufficient feature learning when dealing with similar colors, textures, and shadows, leading to a loss of detail.
Overall, the proposed method demonstrates superior performance in accurately extracting railway track lines compared to NL-LinkNet, DeepLabv3+, U-Net, and FCN. This highlights the effectiveness of the proposed improvements in enhancing the precision and reliability of railway track detection from UAV imagery.
Table 2 presents the quantitative extraction metrics for railway track lines on the test set. The proposed model achieved the best performance with an accuracy (Acc) of 98.2%, an F1 score of 74.9%, a mean Intersection over Union (mIoU) of 65.3%, and a Kappa coefficient of 84.1%. The F1 score, which is the weighted average of precision and recall, serves as an objective measure of the model’s performance. The model’s F1 score of 74.9% outperforms the other models, with DeepLabv3+ following closely at 72.8%, which is 2.1% lower than the proposed model. In contrast, the FCN model performed the least effectively, with an F1 score of only 66.3%, which is 8.6% lower than the proposed model.
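For reference, the reported metrics can be computed from the binary confusion counts as in the sketch below. This is a generic formulation; the binarization threshold and the two-class averaging convention for mIoU are assumptions, not details specified in the paper.

```python
# Generic sketch of the evaluation metrics (accuracy, F1, mIoU, Kappa) for binary masks.
import numpy as np

def segmentation_metrics(pred, gt):
    """pred, gt: binary numpy arrays of the same shape (1 = track line)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    total = tp + tn + fp + fn

    acc = (tp + tn) / total
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    iou_fg = tp / (tp + fp + fn + 1e-8)            # IoU of the track-line class
    iou_bg = tn / (tn + fp + fn + 1e-8)            # IoU of the background class
    miou = (iou_fg + iou_bg) / 2
    # Cohen's kappa: observed vs. chance agreement.
    p_o = acc
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (total ** 2)
    kappa = (p_o - p_e) / (1 - p_e + 1e-8)
    return dict(acc=acc, f1=f1, miou=miou, kappa=kappa)
```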
To validate the performance of the proposed method in railway track line extraction, the training loss convergence process of different models on the railway dataset was visualized. During training, the training loss typically decreases gradually, while the validation loss, computed on data outside the training set, assesses the model’s generalization ability. A decrease in both losses indicates effective learning by the model.
The training and validation loss curves for the various models, shown in Figure 5, provide insights into their performance and generalization over the training epochs. All models exhibit a steady decrease in training loss, reflecting effective learning, whereas the validation loss varies more, highlighting differences in how well the models generalize to unseen data. The NL-LinkNet (Figure 5a) and DeepLabv3+ (Figure 5b) models show a stable decrease in training loss, with the validation loss following a similar trend, suggesting good generalization; however, the validation loss stabilizes at a higher value than the training loss, indicating some degree of overfitting. The UNet (Figure 5c) and FCN (Figure 5d) models also show a reduction in training loss but with more fluctuation in validation loss, particularly for UNet, which may imply overfitting or sensitivity to the validation set. Among all models, NL-LinkNet-SSR achieves the lowest training and validation losses, indicating superior learning and generalization; its validation loss closely tracks the training loss with a minimal gap, suggesting robust performance with little overfitting.
Overall, NL-LinkNet-SSR exhibits the most promising results in minimizing both training and validation losses, followed by NL-LinkNet and DeepLabv3+. UNet and FCN show greater variance in validation loss, indicating potential challenges in generalization.
5.2. Ablation Experiment
To validate the impact of the proposed modules and improvements on the performance of the railway track detection network, ablation experiments were designed to quantitatively evaluate track line extraction performance. The experiments were based on the NL-LinkNet network, with the two enhancements incorporated individually: the SimAM attention mechanism and the Sobel edge detection module, resulting in the NL-LinkNet-SimAM and NL-LinkNet-Sobel networks, respectively. When both modules were combined, the network was designated NL-LinkNet-SSR. The experimental results for these networks on the test dataset are presented in Table 3.
The visual results of the ablation experiments are illustrated in Figure 6, highlighting the differences in track line extraction performance among the models. The base NL-LinkNet model, shown in Figure 6c, tends to miss or inaccurately detect track lines in complex scenarios. Incorporating the Sobel edge detection module (NL-LinkNet-Sobel) leads to noticeable improvements in the completeness and accuracy of track line extraction, especially in identifying track line edges, as depicted in Figure 6d. The NL-LinkNet-SimAM model, which adds the SimAM attention mechanism, further enhances track line detection by improving focus on track lines and reducing false detections and omissions, particularly against complex backgrounds, as seen in Figure 6e. Finally, the NL-LinkNet-SSR model, which combines both the Sobel edge detection and SimAM attention mechanisms, delivers the best track line extraction, reconstructing the actual track lines almost perfectly with minimal false detections and omissions, as shown in Figure 6f. This indicates that combining edge detection and attention mechanisms effectively enhances the model’s detection capability in complex scenarios, significantly improving accuracy and completeness.
Table 3 supports these visual observations with quantitative results from the ablation experiments. The base NL-LinkNet model achieves an accuracy of 0.960 and an F1-Score of 0.706. Adding the Sobel edge detection module (NL-LinkNet-Sobel) improves all metrics, with accuracy increasing to 0.971 and the F1-Score rising to 0.718. The incorporation of the SimAM attention mechanism (NL-LinkNet-SimAM) further enhances performance, with accuracy reaching 0.975 and an F1-Score of 0.732. The NL-LinkNet-SSR model, integrating both modules, achieves the highest performance, with an accuracy of 0.978 and an F1-Score of 0.749. These results highlight that combining both the Sobel edge detection and SimAM attention mechanisms significantly enhances the model’s detection capabilities.
From the experimental results in Table 3 and Figure 6, it is evident that the NL-LinkNet-Sobel, NL-LinkNet-SimAM, and NL-LinkNet-SSR models show significant improvements across various metrics compared to the original NL-LinkNet model. Specifically, both the SimAM attention mechanism and the Sobel edge detection module enhance prediction accuracy to varying degrees.
The SimAM attention mechanism adaptively highlights critical features, broadens the perceptual field, and enhances responses in the target areas. This improvement leads to a reduction in missed detections and enhances the model’s accuracy and robustness. In contrast, the Sobel edge detection module enriches the image’s detailed features, minimizes background interference, and emphasizes edge cues, thereby improving feature extraction. This allows the model to better understand image structures and ultimately enhances prediction accuracy.
Figure 7 presents the training and validation loss curves for the different models in the ablation study. In Figure 7a, the original NL-LinkNet model shows a rapid decrease in training loss during the early stages, but the validation loss stabilizes and fluctuates noticeably in the later stages, indicating some degree of overfitting. In contrast, Figure 7b shows that the NL-LinkNet-SimAM model achieves a significantly lower and more stable validation loss than NL-LinkNet, suggesting that the SimAM attention mechanism effectively enhances the model’s generalization ability. Similarly, Figure 7c shows that the NL-LinkNet-Sobel model improves both training and validation loss, further verifying the performance gain provided by the Sobel edge detection module. Finally, Figure 7d illustrates that the combined NL-LinkNet-SSR model performs best among all models, with the lowest training and validation losses, indicating that the combination of the SimAM attention mechanism and the Sobel edge detection module significantly enhances the model’s overall performance.
Overall, the experimental results show that the improved NL-LinkNet model, integrating both the SimAM attention mechanism and the Sobel edge detection module, outperforms all other models on the key metrics, offering a more reliable solution for railway track detection.
6. Discussion
6.1. Enhancements and Limitations of the Improved NL-LinkNet Model
Through the Sobel edge detection module, the improved NL-LinkNet model more accurately captures the edge features of the tracks, significantly enhancing its ability to extract track edges. This allows the model to better identify and segment track lines, even in complex backgrounds. Meanwhile, the SimAM attention mechanism adaptively emphasizes key features and expands the perceptual range, enhancing the responsiveness of target areas. This reduces missed detections and improves both the accuracy and robustness of the model. In addition, the NL-LinkNet architecture was used as the baseline for the railway track detection task due to its unique capabilities in handling complex, linear structures across expansive spatial contexts. Unlike methods that focus narrowly on localized regions, NL-LinkNet integrates spatial relationships on a broader scale, which is essential for accurately detecting the continuous and interconnected nature of railway tracks. The model employs non-local blocks (NLBs) to capture long-range dependencies within the input data, ensuring that each spatial feature is considered in relation to the entire scene. This global contextual awareness is crucial for distinguishing railway tracks from complex backgrounds, where similar-looking features may lead to detection errors. Moreover, NL-LinkNet is specially optimized for linear and elongated structures, providing significant advantages over more generalized attentive models that may not adequately highlight features pertinent to railway detection. Compared to other commonly used models such as FCN, U-Net, and DeepLabv3+, the improved NL-LinkNet model demonstrates significant advantages across several key metrics, including accuracy, F1-Score, and MIoU. This superior performance is largely due to the SimAM attention mechanism’s ability to adaptively emphasize important features after each decoder block and the Sobel edge detection module’s role in enriching edge detail features and reducing background interference.
However, due to the complexity of the background in the railway dataset, extracting clear, continuous, and complete railway networks in complex and varying scenes remains a challenge. Additionally, despite the significant advantages of the proposed model in various complex scenarios, such as shadows and size changes, compared to the comparative models, the overall performance improvement is still moderate. There remains a gap between the model’s results and the precision of manual visual interpretation. In addition, there are certain threats to validity in our approach. One key limitation is the reliance on dataset quality, which can affect model generalization. Variations in lighting and background conditions across different regions could lead to discrepancies in performance. Another concern is the potential overfitting of the model to specific railway track patterns, although various data augmentation techniques have been applied to increase diversity, such as adding Gaussian noise, elastic transformations, and random flipping. The Sobel edge detection module is specifically designed for linear structures. The model’s dependence on detecting linear features may hinder its performance when encountering more complex and non-linear railway track geometries. Future work could focus on expanding the dataset and introducing further diversity in track shapes to improve generalization.
6.2. Comparative Analysis of Improved NL-LinkNet Model with Existing Methods
The improved NL-LinkNet model proposed in this paper introduces the Sobel edge detection module and the SimAM attention mechanism, effectively enhancing railway track detection performance. Compared to existing remote sensing railway track extraction methods, such as the improved DeepLabV3+ model that uses MobileNetV3 and CARAFE for efficient upsampling and accurate segmentation [31], the proposed approach emphasizes edge feature extraction to address the complexity of railway track detection. While DeepLabV3+ optimizes overall track area segmentation, our model specifically targets precise track line detection, ensuring higher accuracy in capturing track edges. In contrast to Weng et al.’s work, which uses an improved D-LinkNet model for detecting railway track areas [21], this paper focuses on extracting finer railway track line details. Weng’s approach is effective at segmenting broader track regions but may miss critical edge details, particularly in complex backgrounds; the integration of the Sobel edge detection module ensures that fine edge features are captured, improving performance in scenarios where accurate line extraction is crucial. ARTNet, with its dual-branch architecture, provides full-angle detection [20], but the NL-LinkNet-SSR model focuses on robustness in edge feature extraction, yielding higher accuracy and stability in the presence of background noise and complex track shapes. Compared to the anchor-adaptive ARTNet, which is designed to handle varying railway track angles through its dual-branch architecture, the improved NL-LinkNet model demonstrates stronger global feature representation by integrating NLBs. This allows each spatial feature point to reference all other contextual information, enhancing the model’s ability to detect tracks in complex backgrounds where track edges and textures may resemble surrounding elements. Additionally, compared to “Rethinking attentive object detection via neural attention learning,” the improved NL-LinkNet model offers a unique advantage by focusing on holistic scene comprehension and robust feature integration. While the neural attention learning approach dynamically prioritizes salient features within an image, our NL-LinkNet model integrates these features with non-local spatial relationships, providing a comprehensive understanding of the scene. This capability is particularly effective in environments where traditional attention mechanisms might overlook crucial interconnections of railway track features due to their localized nature. The SimAM attention mechanism, applied after each decoder block in our model, further emphasizes important features while suppressing irrelevant background information, improving detection accuracy and reducing false positives. Moreover, the Sobel edge detection module enhances edge detail extraction, contributing to more precise track detection.
This combination of NLBs, attention mechanisms, and edge detection gives our model a clear advantage in handling diverse track shapes and challenging backgrounds, offering a significant improvement over existing methods that leverage attentive object detection primarily focused on localized areas.
7. Conclusions
To address the issues of missing and false detections in railway track detection using deep learning algorithms, this study proposes several improvements. A Sobel edge detection module is introduced into the NL-LinkNet semantic segmentation network to enhance edge segmentation performance. Additionally, the SimAM attention mechanism is integrated to focus on important features, further improving the prediction accuracy of the network model.
To overcome the challenge of limited railway track datasets, original UAV data was collected using low-altitude drones and manually annotated to create a comprehensive dataset. This dataset includes images of railway tracks captured under various environmental conditions, such as different seasons and weather scenarios, which enhances the model’s robustness and generalization capability.
The proposed model outperforms all others in railway track line extraction, achieving top marks in accuracy (0.982), F1-Score (0.749), MIoU (0.653), and Kappa coefficient (0.841). It excels in accurately detecting and localizing tracks in complex and variable environments, demonstrating its effectiveness and reliability for practical applications. Compared to DeepLabv3+ and the original NL-LinkNet, the proposed model shows a 0.82% improvement in F1-Score and a 2.9% increase in MIoU over DeepLabv3+, as well as significant enhancements of 2.20% in F1-Score and 4.30% in MIoU compared to NL-LinkNet. These results highlight the model’s improved precision and robustness in challenging conditions. The ablation studies reveal that the NL-LinkNet-SSR model, enhanced with the Sobel edge detection module and the SimAM attention mechanism, provides notable performance improvements in detecting railway tracks, thereby enhancing both the detection capabilities and accuracy of the NL-LinkNet model.
However, while the improved algorithm enhances track detection accuracy, the addition of edge detection modules and attention mechanisms increases computational complexity. Despite these improvements, there remains considerable room for exploration in future research. For instance, combining image enhancement techniques with multitask learning methods could further improve model performance. Additionally, exploring one-shot learning and data augmentation strategies could enhance the model’s effectiveness on small sample datasets. Future research could also focus on designing lightweight network models suitable for deployment on embedded devices or mobile platforms, thereby expanding the range of practical applications. Furthermore, to address the challenges encountered with our current methodologies, future efforts will focus on refining the model and employing oblique photogrammetry to capture multi-angle UAV railway track images.