1. Introduction
Railways are an important lifeline for a country's economy and trade, extending over long distances between regions. This makes monitoring and inspection of railway lines both expensive and laborious. Traditionally, railway monitoring and inspection have been accomplished by deploying specialised personnel and machinery on the track [1,2]. However, this approach suffers from high labour costs and limited inspection frequency and duration, and inspecting railway lines in remote, uninhabited areas also poses personal safety risks.
As a result, unmanned aerial vehicles (UAVs) are becoming increasingly popular for improving the efficiency of the inspection process [3,4,5,6]. UAV inspections can be carried out more frequently and do not conflict spatially with traditional inspection methods (track inspection vehicles, manual patrols, etc.). Since UAVs do not obstruct the railway, normal train operations can continue during inspections. However, current UAV-based inspection methods require a dedicated operator to control the UAV in real time or to set the route manually, which makes operation cumbersome. Autonomous UAV inspection technology is therefore an important direction for future development. Among the technologies related to autonomous UAV inspection, the identification and localization of railway regions in images through machine vision is the central task.
The main challenges of the railway segmentation problem in UAV images include the following:
First, the railway target contains a variety of structures, such as rail tracks, sleepers, ballast and turnouts. Among them, the rail track is the core structure of the railway; its surface is smooth, so its appearance varies considerably with the intensity and angle of illumination.
Second, the scenes around the railway may contain vegetation or terrain with different textures, colours and appearances, as well as other structures such as roads, bridges and power lines. Moreover, some of these scenes have textures similar to those of railway targets and are prone to misidentification.
Third, railway tracks are slender linear structures, and their appearance in images changes considerably with the camera angle.
Fourth, the UAVs used to capture the images may fly at different heights at various times and locations along the railway. Therefore, the tracks may appear in the images at different scales.
To overcome the above challenges, deep-learning-based image segmentation methods are gradually becoming mainstream. To obtain finer segmentation results, researchers have investigated various aspects of network structure design, objective functions and training methods. Encoder–decoder models such as SegNet [7], U-net [8] and DeeplabV3+ [9], together with multi-scale feature combination modules such as the Feature Pyramid Network (FPN) and ASPP, are widely used to achieve semantic segmentation of fine targets such as utility poles, lane lines and blood vessels. CondLaneNet [10], LaneATT [11] and other models directly use a line model (anchor point + angle + offset) as the optimization target to achieve fast detection of lane lines. However, such models adapt poorly to images with complex aerial views. Models such as HED [12], DeepFlux [13] and ALSN [14] obtain high-accuracy segmentation of linear targets such as edges and skeletons by computing loss functions between intermediate features at different resolutions and the ground truth under side-output supervision.
However, in practice, existing models and training methods suffer from obvious mis-segmentation of similar regions and weak inference ability in occluded regions. For linear targets such as railway tracks, humans can easily reject misidentified scenes using global information and fill in the obscured track regions; this is an automatic generation capability for unknown regions. Based on this consideration, a GAN-based accurate segmentation framework for railway tracks, RT-GAN, is proposed, which improves the segmentation accuracy for slender linear targets, such as rail tracks, by introducing gradient information and adversarial training. RT-GAN consists of two parts: a track segmentation network, termed RT-seg, which generates the track segmentation image, and a track discriminator, which determines whether the segmentation image is real. Compared with using the track segmentation network alone, adversarial training lets the track discriminator and the track segmentation network supervise each other, which yields higher accuracy. In deployment, only the track segmentation network is used, so the adversarial component does not affect the efficiency of the segmentation method.
The main contributions of this study include the following:
A framework for accurate segmentation of railway tracks, RT-GAN, is proposed. The framework treats track segmentation as an image-to-image generation task and, through adversarial training, enables the generator RT-seg to more accurately exclude misidentified similar targets and to reason about occluded track regions.
A training process optimized for the continuous gradient features of track images is designed: a gradient loss function and a discriminator input containing gradient images guide the generator RT-seg to focus on the linear features of railway tracks faster and more accurately.
A UAV-based railway image dataset named "iRailway" is presented, which includes illumination changes, occlusions and various complex backgrounds and reflects the challenges of real-world railway images.
The rest of this article is organized as follows. In Section 2, the work related to semantic segmentation of railway targets and GAN-based image segmentation methods is outlined. In Section 3, the proposed RT-GAN framework for track segmentation in UAV images is presented, including the track segmentation network, the track discriminator and the training procedure. In Section 4, the experiments and analysis are discussed. In Section 5, the conclusion is presented. The dataset and code are available at github.com/ksws0499733/RTGAN.
3. Method
The framework of RT-GAN is shown in Figure 1 and can be divided into two parts: the rail track segmentation network RT-seg and the track discriminator, represented by $G$ and $D$, respectively. The input of RT-seg is an RGB three-channel image of arbitrary size, and the output is a pixel-level semantic segmentation image of the same size. Suppose the input image is $I$ and the output result is $\hat{Y}$; then

$$\hat{Y} = G(I).$$

The input of the track discriminator involves three parts: the input image, the ground truth image and the output of RT-seg; its output is a veracity score for the segmentation image. In each training step, the inference of the discriminator is executed twice, outputting two different scores with the output of RT-seg and with the ground truth as input, respectively. Suppose the ground truth image is $Y$ and the output score is $F$; then, the output of the track discriminator is as follows:

$$F_{fake} = D(I, \hat{Y}), \qquad F_{real} = D(I, Y).$$
The RT-seg and track discriminator are explained in detail next.
3.1. Rail Track Segmentation Network
As shown in Figure 1, an encoder–decoder network termed RT-seg is designed as the generator. RT-seg is inspired by [14] and uses a linear span unit (LSU)-based feature pyramid network, termed LSP, to concatenate feature maps with different resolutions.
First, CSPDarknet53 [28] is selected as the backbone network; its structure is shown in Figure 2. It mainly consists of 5 cross stages. In each stage, the input features are downsampled (except in Stage 1) and then split into two parts along the channel dimension. One part passes through a set of "Dark blocks", while the other part is left unchanged. Finally, the two parts are concatenated as the output of the stage. The outputs of the 5 stages ($C_1$, $C_2$, $C_3$, $C_4$ and $C_5$) are selected as the output feature maps of the backbone, with strides of 1, 2, 4, 8 and 16, respectively.
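For concreteness, the following is a minimal PyTorch sketch of one cross stage of the backbone. It assumes a simplified "Dark block" (plain conv–ReLU pairs rather than the full residual blocks of CSPDarknet53) and an even output channel count; the layer widths are illustrative, not the exact configuration used here.

```python
import torch
import torch.nn as nn

class CSPStage(nn.Module):
    """One cross stage: downsample, split channels, transform one half
    through a stack of blocks, and concatenate the two halves again."""
    def __init__(self, in_ch, out_ch, n_blocks, downsample=True):
        super().__init__()
        stride = 2 if downsample else 1          # Stage 1 does not downsample
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        half = out_ch // 2                       # assumes out_ch is even
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, 1), nn.ReLU(inplace=True),
                nn.Conv2d(half, half, 3, padding=1), nn.ReLU(inplace=True),
            ) for _ in range(n_blocks)
        ])

    def forward(self, x):
        x = self.down(x)
        a, b = torch.chunk(x, 2, dim=1)           # channel split
        return torch.cat([self.blocks(a), b], 1)  # merge the two paths
```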
Then, these feature maps are passed through the LSP to obtain an optimised combination of features at different resolutions. In contrast to a basic feature pyramid network, the LSP employs LSUs instead of simple convolution operations. Each LSU accepts and accumulates feature maps from the backbone network and from LSUs at deeper levels; the number of channels and the resolution of its output are aligned to those of the current stage by $1 \times 1$ convolution and upsampling, respectively. An additional LSU accepts the outputs of the LSP and produces the fused features, further extending the feature sum-space. To mitigate the noise caused by upsampling and downsampling, the stride of the fused features is chosen to be 4, and the features of the other levels are resolution-aligned by downsampling or upsampling.
Finally, the fused features $C_{fuse}$ are upsampled, convolved and passed through a SoftMax layer to output a pixel-level semantic segmentation image of the same size as the input.
In the network, the LSU is a configurable subnetwork proposed by Liu et al. [14]. Its structure is shown in Figure 3. If each feature map is considered a node and the layer operations are considered edges, the LSU subnetwork can be regarded as a directed acyclic graph consisting of $K$ input nodes, $P$ intermediate nodes and 1 output node. For multiple input nodes, the LSU first merges all the input feature maps by "concatenate", "$1 \times 1$ convolution" and "ReLU" operations. Then, a series of multi-layer combinations of feature transformations is performed, including "skip", "$1 \times 1$ convolution", "$3 \times 3$ convolution", "dilated convolution", etc. Finally, all the intermediate nodes are connected to the output node by a "summing" operation, and an expanded feature map that combines all the transformed features is output.
Suppose the input nodes are $x_1, \ldots, x_K$, the intermediate nodes are $h_1, \ldots, h_P$, and the output node is $y$. The set of available transformation operations between nodes is denoted $\mathcal{O}$. For an ordered pair of intermediate nodes $(h_i, h_j)$, the transformation operation between them is $o_{i,j} \in \mathcal{O}$, i.e., $h_j = o_{i,j}(h_i)$. Then, the LSU can be represented as follows:

$$h_1 = \sigma\big(\phi(x_1 \oplus x_2 \oplus \cdots \oplus x_K)\big), \qquad h_i = \big(o_{n_i} \circ \cdots \circ o_1\big)(h_1), \qquad y = \sum_{i=1}^{P} h_i,$$

where $\sigma$ denotes the ReLU activation function, $\phi$ denotes the $1 \times 1$ convolution operation, $\oplus$ denotes the concatenate operation, $n_i$ denotes the number of operations experienced from the 1st intermediate node $h_1$ to the $i$-th intermediate node $h_i$, and $o_{n_i} \circ \cdots \circ o_1$ denotes all transform operations experienced from $h_1$ to $h_i$.
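The following PyTorch sketch illustrates the LSU equations above for the simple case of a single chain of intermediate nodes. The particular operation chain (a $3 \times 3$ convolution followed by a dilated convolution) is an illustrative choice, not a structure taken from the heuristic search.

```python
import torch
import torch.nn as nn

class LSU(nn.Module):
    """Linear span unit: concatenate the K inputs, fuse them with a 1x1
    convolution and ReLU (h_1), apply a chain of operations to obtain
    h_2..h_P, and sum all intermediate nodes to produce y."""
    def __init__(self, in_channels_list, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_channels_list), channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.ops = nn.ModuleList([                    # o_1, o_2, ...
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),
        ])

    def forward(self, inputs):  # inputs: feature maps of equal spatial size
        h = self.fuse(torch.cat(inputs, dim=1))       # h_1
        nodes = [h]
        for op in self.ops:                           # h_{i+1} = o_i(h_i)
            h = op(h)
            nodes.append(h)
        return torch.stack(nodes).sum(dim=0)          # y = sum_i h_i
```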
The LSU encourages multiple paths of feature transformation and does not have a fixed network structure. Several efficient network structures were obtained by heuristic search in the literature [14], and the ALSN-Large structure is chosen here after experimental validation. The structures of the LSUs in ALSN-Large are shown in Figure 4, where the concatenate and channel alignment operations are omitted. The $\mathrm{Conv}X\text{-}Y$ symbol indicates a convolution operation with kernel size $X$ and dilation factor $Y$. Among them, $\mathrm{LSU}_1$ and $\mathrm{LSU}_2$ are actually only simple direct connections; $\mathrm{LSU}_3$, $\mathrm{LSU}_4$ and $\mathrm{LSU}_5$ use large convolution kernels and shallow-wide structures; and $\mathrm{LSU}_{fuse}$, which fuses the feature maps of all resolutions, uses a deep-narrow structure with multiple convolutional layers with increasing dilation factors.
Image Segmentation Loss
The loss of RT-seg is divided into 3 parts: category loss, gradient loss, and side output loss.
Category loss. For a binary classification task, the general approach is to measure the distance between the predicted result $\hat{Y}$ and the true classification $Y$ by computing the sum of the binary cross-entropy over all pixels. Considering the diversity of targets in track images, the cross-entropy over multiple classes is a better choice for computing the loss. Because the numbers of track-region and background pixels are unbalanced, the focal loss is used to balance the weight of each pixel, as in Equation (10):

$$L_{cls} = \mathrm{FL}_{\alpha,\gamma}(X, Y), \tag{10}$$

where $X$ is the predicted result $\hat{Y}$, $\mathrm{FL}_{\alpha,\gamma}$ denotes the focal loss function, and $\alpha$ and $\gamma$ are its hyperparameters.
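A minimal sketch of the multi-class focal loss of Equation (10), assuming pixel-wise class-index labels; the $\alpha$ and $\gamma$ values shown are common defaults, not the tuned values used in this paper.

```python
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Equation (10): focal loss averaged over all pixels.
    logits: (B, C, H, W); target: (B, H, W) long tensor of class indices."""
    logp = F.log_softmax(logits, dim=1)
    logp_t = logp.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of true class
    p_t = logp_t.exp()
    return (-alpha * (1.0 - p_t) ** gamma * logp_t).mean()   # down-weight easy pixels
```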
Gradient loss. Since railway tracks are linearly connected regions with distinct edges in the image, explicitly incorporating gradient variations into the loss helps to guide the generation of segmentation results with clear boundaries. Therefore, a root-mean-square loss function with a mask is used to calculate the gradient loss, as in Equation (11):

$$L_{grad} = \sqrt{\overline{M \odot \big( (\nabla_x X - \nabla_x Y)^2 + (\nabla_y X - \nabla_y Y)^2 \big)}}, \tag{11}$$

where $X$ is the predicted result $\hat{Y}$, $Y$ is the ground truth, $\nabla_x$ and $\nabla_y$ are gradient operators, $M$ is the mask of the region of interest, which is used to remove the effects of image augmentations such as masking and cropping, $\odot$ denotes element-wise multiplication, and $\overline{(\cdot)}$ means taking the arithmetic mean.
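A sketch of the masked root-mean-square gradient loss of Equation (11), with forward finite differences standing in for the gradient operators $\nabla_x$ and $\nabla_y$:

```python
import torch

def gradient_loss(pred, gt, mask):
    """Equation (11): masked RMS difference between predicted and
    ground-truth gradients.  pred, gt, mask: (B, 1, H, W)."""
    def grads(t):
        return t[..., :, 1:] - t[..., :, :-1], t[..., 1:, :] - t[..., :-1, :]
    px, py = grads(pred)
    gx, gy = grads(gt)
    mx, my = mask[..., :, 1:], mask[..., 1:, :]   # crop the mask to match
    sq = (mx * (px - gx) ** 2).mean() + (my * (py - gy) ** 2).mean()
    return torch.sqrt(sq + 1e-8)                  # eps for numerical stability
```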
Side output loss. The side output loss measures the difference between intermediate feature layers of a DNN and the ground truth. It supervises the evolution of the intermediate feature layers, guides the DNN to converge faster, and is widely used in linear target segmentation tasks such as edge detection [12] and skeleton detection [13,14]. The different feature layers of the LSP are used as input, and the focal loss function is used to calculate the loss, as in Equation (12):

$$L_{side} = \sum_{k} \mathrm{FL}_{\alpha,\gamma}(X_k, Y), \tag{12}$$

where $X_k$ is the $k$-th side output, which achieves channel alignment with the number of target categories through $1 \times 1$ convolution and is computed from the corresponding feature layer of the LSP, including its fused output $C_{fuse}$.
In summary, the loss function of RT-seg is given by Equation (13):

$$L_{seg} = \lambda_{cls} L_{cls} + \lambda_{grad} L_{grad} + \lambda_{side} L_{side}, \tag{13}$$

where $\lambda_{cls}$, $\lambda_{grad}$ and $\lambda_{side}$ are the weights of each loss term selected in this paper.
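Combining the three terms, a sketch of the total RT-seg loss of Equation (13), building on the focal_loss and gradient_loss sketches above; the class layout (track = class 1) and the default weights are assumptions for illustration, not the tuned ratios.

```python
import torch

def rt_seg_loss(logits, side_logits, target, mask,
                w_cls=1.0, w_grad=1.0, w_side=1.0):
    """Equation (13): weighted sum of category, gradient and side losses."""
    pred = torch.softmax(logits, dim=1)[:, 1:2]               # track probability map
    gt = (target == 1).float().unsqueeze(1)                   # binary track mask
    l_cls = focal_loss(logits, target)                        # Equation (10)
    l_grad = gradient_loss(pred, gt, mask)                    # Equation (11)
    l_side = sum(focal_loss(s, target) for s in side_logits)  # Equation (12)
    return w_cls * l_cls + w_grad * l_grad + w_side * l_side
```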
3.2. Track Discriminator
The structure of the track discriminator, shown in Figure 5, is a fully convolutional Patch-GAN discriminator [29] consisting of 5 convolutional layers. The input of the track discriminator is a 6-channel combined feature, consisting of the input image (3 channels), the segmentation result (1 channel) and the gradient feature (2 channels), and its output, after 4 downsamplings, is a single-channel score map $F$ rather than a single scalar score.
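A sketch of this discriminator matching the stated layout (6 input channels, 5 convolutional layers, 4 stride-2 downsamplings, single-channel score map); the channel widths and LeakyReLU slope follow common Patch-GAN practice and are not specified by the paper.

```python
import torch.nn as nn

class TrackDiscriminator(nn.Module):
    """Fully convolutional Patch-GAN discriminator producing a score map F."""
    def __init__(self, base=64):
        super().__init__()
        chs = [6, base, base * 2, base * 4, base * 8]
        layers = []
        for cin, cout in zip(chs[:-1], chs[1:]):            # 4 downsamplings
            layers += [nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        layers.append(nn.Conv2d(chs[-1], 1, 4, padding=1))  # 5th conv: score map
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```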
In contrast to image-specific discriminators, if only the semantic segmentation results of the tracks are fed to the discriminator, the adversarial loss is difficult to optimise, leading to training failure. This may be due to the sparsity of the track regions in the image: natural images have learnable gradient features in every local region, but the sparse pixel distribution of a track-only segmentation map contains little gradient information, which makes training the discriminator difficult. To solve this problem, the gradients of the input image and the classification result are treated as additional conditions. By concatenating the input image $I$, the segmented image $\hat{Y}$ and the gradient of $\hat{Y}$ as inputs, the track discriminator gains enough knowledge to distinguish the generated distribution from the true distribution, and training becomes stable. Therefore, for the two discriminator inferences with the generated image $\hat{Y}$ and the ground truth image $Y$, the corresponding score maps are as follows:

$$F_{fake} = D\big(I \oplus X \oplus \nabla_x X \oplus \nabla_y X\big), \qquad F_{real} = D\big(I \oplus Y \oplus \nabla_x Y \oplus \nabla_y Y\big),$$

where $X$ is the predicted result $\hat{Y}$, $Y$ is the ground truth, $\oplus$ denotes channel-wise concatenation, and $\nabla_x$ and $\nabla_y$ are the same gradient operators as in Equation (11).
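The 6-channel discriminator input can be assembled as below; finite differences again stand in for the gradient operators, padded back to full resolution so all channels align. Both discriminator passes use the same assembly, once with the RT-seg output and once with the ground truth.

```python
import torch
import torch.nn.functional as F

def discriminator_input(image, seg):
    """Concatenate image (3 ch), segmentation map (1 ch) and its x/y
    gradients (2 ch) into the 6-channel discriminator input."""
    gx = F.pad(seg[..., :, 1:] - seg[..., :, :-1], (0, 1, 0, 0))  # pad right
    gy = F.pad(seg[..., 1:, :] - seg[..., :-1, :], (0, 0, 0, 1))  # pad bottom
    return torch.cat([image, seg, gx, gy], dim=1)                 # (B, 6, H, W)
```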
Adversarial Loss
The task of the track discriminator is to distinguish the RT-seg output $\hat{Y}$ from the ground truth $Y$ by scoring $\hat{Y}$ as false and $Y$ as true. Therefore, the adversarial loss is divided into two parts: $L_D$ for updating the weights of the track discriminator and $L_G$ for updating the weights of RT-seg. The hinge loss function [30] is used to determine whether the input is true or false. The adversarial losses for training RT-GAN are as follows:

$$L_D = \overline{\sigma\big(1 - F_{real}\big)} + \overline{\sigma\big(1 + F_{fake}\big)}, \qquad L_G = -\overline{F_{fake}},$$

where $\sigma$ denotes the ReLU activation function, $\overline{(\cdot)}$ is the arithmetic mean as in Equation (11), and $F_{real}$ and $F_{fake}$ are the two score maps output by the track discriminator.
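A direct transcription of these hinge losses:

```python
import torch.nn.functional as F

def d_loss(f_real, f_fake):
    """Discriminator hinge loss: real scores above +1, fake scores below -1."""
    return F.relu(1.0 - f_real).mean() + F.relu(1.0 + f_fake).mean()

def g_loss(f_fake):
    """Generator adversarial loss: raise the score of generated segmentations."""
    return -f_fake.mean()
```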
3.3. Training
Training RT-GAN is challenging: training often fails if the image segmentation loss and the adversarial loss are applied simultaneously from the start, even when the weights between the two losses are carefully adjusted. To avoid this problem, a two-stage training approach is used. In the first stage, only RT-seg is trained, to reach as high a segmentation accuracy as possible. In the second stage, the trained RT-seg is used as the generator and trained adversarially with the track discriminator. Experimental validation shows that the higher the initial accuracy of RT-seg, the more efficient the adversarial training and the higher the final segmentation accuracy.
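A schematic of the two-stage procedure, reusing the loss sketches above; the model interface (rt_seg returning final and side logits), the data loader and the optimisers are hypothetical names, and the epoch counts follow Section 4.1.2.

```python
# Stage 1: train RT-seg alone with the segmentation loss.
for epoch in range(90):
    for image, target, mask in loader:
        logits, side = rt_seg(image)
        loss = rt_seg_loss(logits, side, target, mask)
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()

# Stage 2: adversarial training, alternating updates 1:1.
for epoch in range(100):
    for image, target, mask in loader:
        gt = (target == 1).float().unsqueeze(1)
        # Discriminator step (generator output detached).
        pred = rt_seg(image)[0].softmax(dim=1)[:, 1:2].detach()
        ld = d_loss(disc(discriminator_input(image, gt)),
                    disc(discriminator_input(image, pred)))
        opt_d.zero_grad()
        ld.backward()
        opt_d.step()
        # Generator step: segmentation loss plus adversarial term.
        logits, side = rt_seg(image)
        pred = logits.softmax(dim=1)[:, 1:2]
        lg = (rt_seg_loss(logits, side, target, mask)
              + g_loss(disc(discriminator_input(image, pred))))
        opt_g.zero_grad()
        lg.backward()
        opt_g.step()
```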
4. Experiment
In this section, the implementation details are first introduced. Then, RT-GAN is compared with the state-of-the-art segmentation networks. Finally, an ablation study is carried out on the improvements adopted in RT-GAN.
4.1. Implementation Details
4.1.1. Dataset
There are few publicly available datasets for railway segmentation. This article builds an image segmentation dataset for railway lines ("iRailway") using more than 30 h of video captured by on-board cameras on trains and by UAVs. The positive set in iRailway includes 21,379 images (3296 labeled) randomly sampled from four railway lines. The ballast, sleeper and track areas in each image are manually marked as ground truth. Figure 6a shows some examples, which include railway lines in the field, in forests, across buildings, and under dazzle, weak-light and backlighting conditions. The negative set includes 552 images collected from the Web, which involve linear targets and potential false positives such as roads, sidewalks and grid floor tiles (Figure 6b).
4.1.2. Implementation Details
RT-GAN is implemented in PyTorch and runs on 2 NVIDIA Tesla V100 GPUs (32 GB RAM each).
For the first stage (RT-seg training), the Adam optimiser with a learning rate of 0.0002 and a batch size of 16 was used, and 90 epochs were trained. The three weight ratios $\lambda_{cls}$, $\lambda_{grad}$ and $\lambda_{side}$ in the image segmentation loss were kept fixed across all experiments.
For the second stage (adversarial training), the Adam optimizer was used, with initial learning rates of 0.0001 for RT-seg and 0.00001 for the discriminator. A total of 100 epochs were trained, alternating updates of RT-seg and the track discriminator at a 1:1 ratio.
These hyperparameters are experimentally validated values, and all network training used the same hyperparameters.
Finally, the intersection over union (IoU) between the semantic segmentation result and the ground truth was used to judge the segmentation quality of all algorithms.
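A sketch of the IoU metric as used here, assuming class-index maps with the track as class 1:

```python
def iou(pred, target, cls=1):
    """Intersection over union for one class; pred, target: (B, H, W) indices."""
    p, t = pred == cls, target == cls
    inter = (p & t).sum().float()
    union = (p | t).sum().float()
    return (inter / union.clamp(min=1)).item()
```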
4.2. Comparison with the State-of-the-Art Segmentation Networks
RT-GAN was compared with popular image segmentation networks: SegNet [7], PSPNet [31], DeeplabV3+ [9], ALSN [14], SETR [32] and SegFormer [33]. All networks were trained on iRailway. Each network was trained in two groups: one with the original images as input (cropping only) and the other with randomly augmented images (affine transformations, block masking, mosaic masking, etc.) as input. Each group of training was repeated 10 times and the results averaged, to reduce the randomness caused by dataset cropping and image augmentation. For each training run, 80% of the samples were used for training and 20% for validation. The results of all experiments are shown in Table 2.
For all networks, the IoU of the unaugmented group was greater than that of the augmented group, because augmentation strategies such as masking, shading and mosaicking considerably increase the recognition difficulty.
RT-GAN obtained IoUs exceeding those of the other networks in both training groups, reaching 81.34% (augmentation) and 88.07% (no augmentation), with a faster computation speed (35.0 fps). Even without adversarial training, RT-GAN achieved IoUs of 71.77% (augmentation) and 84.63% (no augmentation), which are also higher than those of the other comparison networks.
Among the comparison networks, SegNet achieved the highest segmentation accuracy, with IoUs of 68.71% (augmentation) and 80.98% (no augmentation), still substantially lower than RT-GAN. It is followed by DeepLabV3+, which has lower computational cost but also lower IoUs than SegNet: 61.69% (augmentation) and 78.84% (no augmentation). PSPNet has a faster computation speed (43.0 fps), but its IoUs are only 55.98% (augmentation) and 69.37% (no augmentation).
SETR and SegFormer are transformer-based networks that have become popular in image segmentation in recent years. Two SETR models (SETR_MLA and SETR_PUP) and two SegFormer models (mit_b2 and mit_b5) were tested. The experimental results show that the transformer-based models perform poorly in the rail track segmentation task. Among them, SETR_PUP had the slowest computation speed (11.0 fps) but the highest IoUs among the transformer models, 56.32% (augmentation) and 76.52% (no augmentation); mit_b2 has the smallest model size and the fastest computation speed (46.0 fps), with IoUs of 50.86% (augmentation) and 70.17% (no augmentation). The poor performance of the transformer-based models may be due to the small dataset and insufficient training.
Segmentation examples. Figure 7 visualises some segmentation results on railway lines; DeepLabV3+, SegFormer (mit_b5) and ALSN are chosen for comparison with RT-GAN. RT-GAN considerably reduces the false detection of non-track targets in the presence of interfering objects such as utility poles, streetlights, building edges and ground seams. For occluded and mosaicked images, RT-GAN can effectively infer the occluded tracks, whereas large mis-segmented regions remain in the other models.
4.3. Ablation Study
The essence of RT-GAN is to improve the accuracy of an image segmentation network through adversarial training. Compared with the usual GAN model, RT-GAN uses a loss function and a discriminator input that both contain gradient information. To verify the effectiveness of these improvements, this subsection compares the effects of different generators, loss functions and discriminator inputs on RT-GAN.
Different generators. DeepLabV3+ and SegFormer are selected as generators for comparison with the proposed RT-seg. The experimental results are shown in Table 3, where the numbers in parentheses are the IoUs without adversarial training. It can be seen that (1) using RT-seg as the generator is superior to using the other generators, and (2) adversarial training can substantially improve the segmentation accuracy of an image segmentation network. For the mit_b5 network, the IoU on unaugmented images (86.42%) is improved to a level similar to the baseline (88.07%), indicating that adversarial training can effectively reduce the transformer model's demand for training samples and improve training efficiency.
Different image segmentation losses. The effects of different loss functions on the segmentation accuracy of RT-seg and on the adversarial training of RT-GAN are shown in Table 4. Compared with using only $L_{cls}$, adding $L_{grad}$ improves the IoU of RT-seg on augmented images by 4.64% (from 64.98% to 69.62%), while the improvement after adversarial training is smaller: only 0.08% (augmentation, from 78.97% to 79.05%) and 0.1% (no augmentation, from 87.13% to 87.23%). The accuracy improves further if $L_{cls}$, $L_{grad}$ and $L_{side}$ are used simultaneously: before adversarial training, the IoUs increase by 6.79% (augmentation, from 64.98% to 71.77%) and 4.68% (no augmentation, from 79.95% to 84.63%), and after adversarial training, by 2.37% (augmentation, from 78.97% to 81.34%) and 0.94% (no augmentation, from 87.13% to 88.07%).
Different discriminator inputs. Table 5 shows the effect of different discriminator inputs on the segmentation accuracy of RT-GAN. If only the semantic segmentation image is used as the discriminator input, the segmentation IoUs are 79.9% (augmentation) and 83.04% (no augmentation). After adding the input image as auxiliary information to the track discriminator, the IoUs improve by 1.27% (augmentation, from 79.9% to 81.17%) and 4.7% (no augmentation, from 83.04% to 87.74%). The segmentation IoUs improve to 81.34% (augmentation) and 88.07% (no augmentation) after further adding gradient information to the discriminator input.
4.4. Analysis
Experimental results show that RT-GAN obtains higher track segmentation accuracy than existing methods on both unaugmented and augmented images. These improvements come from three main sources: (1) the generator model, (2) adversarial training, and (3) the explicit introduction of gradient information.
First, the generator model is the foundation. The LSU and LSP used in RT-seg can effectively extract image features at different scales and employ an optimal network structure obtained by heuristic search. The results show that RT-seg has more potential as a generator than the other networks.
Second, adversarial training steers the generator network in a global direction. As the segmentation results show, before adversarial training, RT-seg misidentifies linear regions such as streetlights and eaves as rail tracks, which may be caused by smaller weights on RT-seg's higher-level features (with larger receptive fields) and larger weights on its lower-level features (with smaller receptive fields). In adversarial training, the track discriminator scores the segmentation result as a whole, forcing RT-seg to pay more attention to the global features of the track image, which reduces the mis-segmentation rate and improves the generator's inference ability for occluded targets.
Finally, explicitly adding the gradient information of track targets, in the form of the loss function and the discriminator inputs, to both the training and the adversarial training of the semantic segmentation network enables RT-seg to notice track regions with distinct gradient features earlier and improves the semantic segmentation of track targets. This is confirmed by the results of the ablation experiments.