3.1. System Architecture
In this study, we propose an obstacle detection solution that integrates an incremental clustering algorithm with a lightweight shared convolutional segmentation network. The approach targets rapid and accurate identification of obstacles in railway transportation, as illustrated in Figure 1. First, we introduce a detection region segmentation scheme based on an improved RANSAC algorithm. The scheme employs line pass filters to extract the track area and performs an initial screening of the raw point cloud to gather a comprehensive set of track points. A phased RANSAC algorithm is then used for ground fitting, with dynamic thresholding to segment ground points across different distance layers. This improves the efficiency of point cloud processing and accounts for track gradients, which is particularly important for long-distance detection, and in turn improves the accuracy and reliability of obstacle detection.
To further improve the detection system’s performance, we process the LiDAR point cloud with the incremental clustering algorithm. This algorithm removes noise and identifies spatially adjacent point cloud clusters, enabling more precise detection of unknown or emergent obstacles. Additionally, the improved YOLOv8 semantic segmentation algorithm incorporates a custom lightweight shared convolutional detection head (LSCD), which improves the model’s computational efficiency and real-time processing capability. Together, these components improve the accuracy of obstacle detection while meeting real-time requirements, thereby improving railway transportation safety.
3.2. Ground Point Cloud Filtering
To further enhance the performance of the detection system, particularly in handling complex railway environments, we propose a detection area segmentation scheme based on an improved RANSAC algorithm. Describing railway scenes from large datasets composed of 3D points requires meticulous analysis, as the complexity of the algorithm increases with the number of points, leading to longer computation times and potential degradation in detection accuracy due to noise interference. Effective boundary area segmentation can significantly improve detection speed and accuracy. Innovatively building upon the RANSAC algorithm, we optimized the segmentation scheme for detecting areas. Specifically, by employing a line filter to extract the track area and conducting preliminary filtering of the raw point cloud, we aim to capture as many points on the track as possible. Central to this process is the phased application of the RANSAC algorithm to precisely fit the ground surface, accommodating differences in the quality of ground point cloud data across various distances. We introduced dynamic thresholding techniques to effectively segment and process ground point cloud data at different distance levels. Through this approach, we not only enhance the efficiency of point cloud data processing, but also achieve precise segmentation, especially addressing the impacts of track slopes in long-distance detection, thereby ensuring the accuracy and reliability of obstacle detection.
3.2.1. Preprocessing of Point Cloud Data
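The preprocessing formulas for point cloud data are shown in Equations (1) and (2):

$$P' = \{\, p \in P \mid \lnot\,\mathrm{isInvalid}(p) \,\} \tag{1}$$

$$\mathrm{isInvalid}(p) = \mathrm{isNaN}(p) \lor \left( |x_p| < \epsilon \land |y_p| < \epsilon \land |z_p| < \epsilon \right) \tag{2}$$

where $P$ represents the original point cloud set and $P'$ represents the filtered point cloud set. The function $\mathrm{isInvalid}(p)$ returns a Boolean value indicating whether point $p = (x_p, y_p, z_p)$ is invalid. Here, $\mathrm{isNaN}(p)$ checks whether any coordinate of $p$ is a NaN value, and $\epsilon$ is a small threshold used to determine whether a point is too close to the origin. If any coordinate of the point is NaN, or if all coordinates are close to the origin, the point is considered invalid.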
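The validity test described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the names `is_invalid` and `filter_points` and the default threshold `eps=1e-3` are assumptions chosen for the example.

```python
import math

def is_invalid(p, eps=1e-3):
    """True if any coordinate is NaN or all coordinates lie within eps of the origin."""
    x, y, z = p
    has_nan = math.isnan(x) or math.isnan(y) or math.isnan(z)
    near_origin = abs(x) < eps and abs(y) < eps and abs(z) < eps
    return has_nan or near_origin

def filter_points(points, eps=1e-3):
    """Keep only the valid points of the raw cloud."""
    return [p for p in points if not is_invalid(p, eps)]

# Hypothetical raw returns: one NaN point and one point at the sensor origin.
raw = [(1.0, 2.0, 0.5), (float("nan"), 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
filtered = filter_points(raw)
```

Note that the point (0.0, 5.0, 0.0) is kept: only points whose coordinates are all near zero are rejected.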
3.2.2. Spatial Segmentation of Point Cloud
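After preprocessing, the point cloud is spatially segmented into multiple sectors, which can be represented using polar coordinates:

$$S_i = \{\, p \in P' \mid \theta_i \le \operatorname{atan2}(y_p, x_p) < \theta_{i+1} \,\} \tag{3}$$

Here, $S_i$ represents the point cloud set in the $i$-th sector of the filtered point cloud $P'$, and $\theta_i$ and $\theta_{i+1}$ are the angular limits of the sector division.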
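As a rough illustration of the sector division, the sketch below assigns each point to one of `num_sectors` equal angular sectors. The sector count of 8 and the function name are assumptions for the example; the paper does not state how many sectors are used.

```python
import math

def split_into_sectors(points, num_sectors=8):
    """Assign each (x, y, z) point to an angular sector of equal width."""
    width = 2.0 * math.pi / num_sectors
    sectors = [[] for _ in range(num_sectors)]
    for x, y, z in points:
        theta = math.atan2(y, x) % (2.0 * math.pi)  # wrap the angle into [0, 2*pi)
        # Guard against theta rounding up to exactly 2*pi.
        idx = min(int(theta // width), num_sectors - 1)
        sectors[idx].append((x, y, z))
    return sectors

cloud = [(1.0, 0.1, 0.0), (-1.0, 0.1, 0.0), (0.1, -1.0, 0.0)]
sectors = split_into_sectors(cloud)
```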
3.2.3. Selection and Sorting of Ground Points
In each sector, the points are sorted by their vertical height (z-value), and points below a specific threshold are selected as ground points, as shown in Equation (4):

$$G_i = \{\, p \in S_i \mid z_p < h_{th} \,\} \tag{4}$$

Here, $S_i$ is the point cloud set of the $i$-th sector, $G_i$ is the set of ground points in that sector, and $h_{th}$ is the height threshold for selecting ground points.
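The selection step can be sketched as a sort followed by a threshold test. The threshold of -1.5 m assumes, purely for illustration, that heights are measured relative to a sensor mounted above the track; the real value depends on the installation.

```python
def select_ground_candidates(sector_points, height_threshold=-1.5):
    """Sort a sector by height and keep the points below the threshold."""
    ordered = sorted(sector_points, key=lambda p: p[2])
    return [p for p in ordered if p[2] < height_threshold]

# Hypothetical sector: two low returns (track bed) and two higher returns.
sector = [(5.0, 0.0, -1.8), (6.0, 0.0, 0.4), (7.0, 0.0, -1.6), (8.0, 0.0, -1.2)]
candidates = select_ground_candidates(sector)
```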
3.2.4. Fitting Ground Plane
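The ground plane is fitted using the RANSAC algorithm. The plane is represented by the equation $ax + by + cz + d = 0$, and its parameters are obtained as shown in Formula (5):

$$(a, b, c, d) = \operatorname{RANSAC}\left(\{\, (x_p, y_p, z_p) \mid p \in G \,\}\right) \tag{5}$$

Here, $G$ is the collection of ground points from all sectors, and $(x_p, y_p, z_p)$ represents the coordinates of point $p$.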
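For readers unfamiliar with RANSAC plane fitting, the following is a minimal textbook sketch: sample three points, build the plane through them, count inliers, and keep the best model. It is not the paper's phased variant, and the inlier threshold, iteration count, and fixed seed are illustrative assumptions.

```python
import math
import random

def plane_from_points(p1, p2, p3):
    """Plane (a, b, c, d) with unit normal through three points, or None if collinear."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    a, b, c = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx  # cross product
    norm = math.sqrt(a * a + b * b + c * c)
    if norm < 1e-12:
        return None
    a, b, c = a / norm, b / norm, c / norm
    return a, b, c, -(a * p1[0] + b * p1[1] + c * p1[2])

def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Return the plane with the most inliers and its inlier count."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, -1
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        # The normal has unit length, so |ax + by + cz + d| is the point-plane distance.
        inliers = sum(abs(a * x + b * y + c * z + d) < threshold for x, y, z in points)
        if inliers > best_inliers:
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers

# Synthetic flat ground at z = -1.5 plus three elevated outliers.
ground = [(float(x), float(y), -1.5) for x in range(10) for y in range(-2, 3)]
outliers = [(3.0, 0.0, 0.5), (4.0, 1.0, 1.0), (5.0, -1.0, 0.2)]
plane, inlier_count = ransac_plane(ground + outliers)
```

The sign of the returned normal depends on the order of the sampled points, so callers should compare magnitudes or orient the normal explicitly.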
3.2.5. Separation of Ground and Non-Ground Points
Based on the fitted ground model, ground points can be separated from non-ground points, as shown in Formula (6):

$$N = \left\{\, p \in P' \;\middle|\; \frac{|a x_p + b y_p + c z_p + d|}{\sqrt{a^2 + b^2 + c^2}} > \tau \,\right\} \tag{6}$$

In the formula, $N$ represents the set of non-ground points, $P'$ is the filtered point cloud, $(a, b, c, d)$ are the parameters of the fitted ground plane, and $\tau$ is the threshold used to differentiate between ground and non-ground points. To account for slopes, segmented fitting is used: for long-distance detection, a dynamic-threshold RANSAC algorithm fits the ground model segment by segment, as shown in Formula (7):

$$P_j = \{\, p \in P' \mid d_{j-1} \le \operatorname{dist}(p) < d_j \,\}, \quad d_j \in D \tag{7}$$

Here, $D$ represents the set of segmented distances, and $\operatorname{dist}(p)$ calculates the distance from point $p$ to a reference point (such as the LiDAR device). Each segment $P_j$ is fitted and thresholded separately, which completes the segmentation of ground and non-ground point clouds. Figure 2 illustrates an actual ground segmentation result obtained from LiDAR data.
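The segment-wise separation can be sketched as follows, assuming one ground plane has already been fitted per distance segment. The linear rule `base_tau + growth * distance` is a hypothetical stand-in for the dynamic threshold, whose exact form the text does not give; the segment boundaries and plane parameters are likewise invented for the example.

```python
import math

def segment_index(p, boundaries):
    """Index of the distance segment containing p, or None if out of range."""
    r = math.hypot(p[0], p[1])  # horizontal range from the sensor
    for j in range(len(boundaries) - 1):
        if boundaries[j] <= r < boundaries[j + 1]:
            return j
    return None

def split_ground(points, boundaries, planes, base_tau=0.1, growth=0.002):
    """Classify points against the plane of their own distance segment."""
    ground, non_ground = [], []
    for p in points:
        j = segment_index(p, boundaries)
        if j is None:
            continue  # beyond the maximum detection range
        a, b, c, d = planes[j]
        dist = abs(a * p[0] + b * p[1] + c * p[2] + d) / math.sqrt(a * a + b * b + c * c)
        tau = base_tau + growth * boundaries[j]  # assumed: looser threshold farther away
        (ground if dist <= tau else non_ground).append(p)
    return ground, non_ground

boundaries = [0.0, 50.0, 100.0]
# Segment 0 is flat (z = -1.5); segment 1 descends at a 2% gradient.
planes = [(0.0, 0.0, 1.0, 1.5), (0.02, 0.0, 1.0, 0.5)]
points = [(10.0, 0.0, -1.5), (10.0, 0.0, 0.0), (80.0, 0.0, -2.1),
          (80.0, 0.0, -1.0), (150.0, 0.0, 0.0)]
ground, non_ground = split_ground(points, boundaries, planes)
```

With a single flat plane for the whole range, the point at 80 m would sit 0.6 m below the model and be misclassified; fitting per segment avoids this.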
3.3. Research on Train Obstacle Detection Algorithm
In the research of LiDAR-based object detection, particularly in the domain of obstacle detection, point cloud clustering algorithms have rapidly advanced due to their capability to provide precise three-dimensional spatial information. This algorithmic approach holds significant advantages in applications such as autonomous driving and railway transportation because of its accuracy in locating and identifying various obstacles. Unlike deep learning methods, a key advantage of point cloud clustering is its independence from extensive training data, enabling effective handling of unknown or unexpected types of obstacles, which is crucial for railway safety.
However, point cloud clustering faces a challenge: it may indiscriminately cluster all scanned objects, leading to over-detection of obstacles and occasional misidentification of noise or objects outside the boundary. In railway transportation, such false positives are unacceptable as ensuring safe train operations is paramount. To address this issue, this study integrates shared convolutional semantic methods to determine whether clustered obstacles genuinely intrude into safe operational boundaries. Semantic segmentation methods adeptly learn and understand complex railway environment features, significantly enhancing detection accuracy while reducing false alarm and omission rates. The combined application of point cloud clustering and semantic segmentation methods notably enhances the safety of train operations. Point cloud clustering provides precise obstacle localization, while semantic segmentation further discriminates these obstacles with finesse. This synergistic strategy optimally leverages the strengths of both technologies, offering a robust and reliable obstacle detection solution for railway transportation systems. Specifically, in the context of point cloud obstacle detection, we propose a growing clustering approach inspired by morphological operations commonly used in image processing, such as dilation and erosion algorithms. In image processing, dilation expands object boundaries, while erosion contracts them. These algorithms are typically employed for noise removal, element separation, and connection of disjointed elements. In point cloud data processing, this concept is applied to three-dimensional point cloud clustering algorithms, aiming to identify continuous clusters of points within a two-dimensional polar coordinate space. Simulating erosion operations helps eliminate surrounding point cloud noise, reducing the misidentification of obstacles and enhancing detection accuracy. 
Leveraging dilation extends the boundaries of point cloud clusters, merging neighboring point cloud elements into the same cluster, thereby more accurately identifying potential physical objects or obstacles. Specifically, this clustering process initially labels initial point cloud clusters in a polar coordinate grid and progressively expands these clusters, incorporating surrounding neighboring point cloud elements into their respective clusters. This process not only enhances clustering accuracy but also ensures a better representation of actual objects in physical space, particularly in critical applications like obstacle detection. Moreover, this algorithm avoids the need for extensive training data required by traditional deep learning methods, offering greater flexibility and adaptability for handling unknown or unforeseen scenarios, which is crucial for real-time analysis and decision-making. Through this approach, we present an innovative detection solution for safe railway operations.
3.3.1. Creation of Polar Coordinate Grid
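First, the parameters that define the polar coordinate grid are initialized: polar angle resolution $\Delta\theta$, distance resolution $\Delta r$, maximum detection range $R_{max}$, and height threshold $H_{th}$. The non-ground points are then mapped into the polar coordinate grid. For each point $p = (x, y, z)$ in the point cloud, its polar coordinates $(r, \theta)$ are calculated as

$$r = \sqrt{x^2 + y^2}, \qquad \theta = \operatorname{atan2}(y, x)$$

and point $p$ is mapped to grid cell $(i, j)$:

$$i = \left\lfloor \frac{\theta}{\Delta\theta} \right\rfloor, \qquad j = \left\lfloor \frac{r}{\Delta r} \right\rfloor$$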
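A minimal sketch of the grid construction is given below. The resolutions, range limit, and the way the height threshold is applied (points above it are dropped) are assumptions for illustration; the text lists these parameters without giving values.

```python
import math
from collections import defaultdict

def to_polar_cell(p, d_theta, d_r):
    """Map an (x, y, z) point to its (angle index, range index) grid cell."""
    x, y, _ = p
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2.0 * math.pi)
    return int(theta // d_theta), int(r // d_r)

def build_polar_grid(points, d_theta=math.radians(2.0), d_r=0.5, r_max=100.0, h_max=4.0):
    """Group non-ground points by polar cell, skipping out-of-range or too-high points."""
    grid = defaultdict(list)
    for p in points:
        if math.hypot(p[0], p[1]) > r_max or p[2] > h_max:
            continue
        grid[to_polar_cell(p, d_theta, d_r)].append(p)
    return dict(grid)

non_ground = [(10.2, 0.1, 0.3), (0.1, 10.2, 0.3), (200.0, 0.0, 0.0), (5.0, 5.0, 10.0)]
grid = build_polar_grid(non_ground)
```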
3.3.2. The Erosion and Dilation Processes in Agglomerative Clustering Algorithms
For each cell $(i, j)$ in the polar coordinate grid, a clustering label $L(i, j)$ is initialized. During clustering, following the erosion and dilation concept from 2D image processing, a kernel matrix $K$ is first defined; here, a 3 × 3 matrix is chosen as the kernel matrix. In this matrix, the value 1 marks the neighboring cells considered in the erosion and dilation operations. For each grid cell $(i, j)$, the dilation operation is applied:

$$C(i, j) = \bigcup_{(m, n) \in N_K(i, j)} P(m, n)$$

Here, $P(m, n)$ is the set of points in cell $(m, n)$, and $N_K(i, j)$ represents the set of neighboring grid coordinates around $(i, j)$ using $K$ as the kernel. Specifically, for the center cell $(i, j)$, its neighboring cells are determined by the kernel $K$, and the point clouds from all neighboring cells are merged into the $C(i, j)$ cluster. Based on the clustering result after erosion and dilation, clusters that are spatially close to each other are merged:

$$C_{merged} = \bigcup_{C_k \in A} C_k$$

Here, $A$ represents a group of spatially adjacent erosion and dilation clusters. The clustering results in polar coordinates are then transformed back to Cartesian coordinates. For each point in a cluster, the coordinate transformation is given by Formula (13):

$$x = r\cos\theta, \qquad y = r\sin\theta, \qquad z = z \tag{13}$$
The proposed incremental clustering algorithm can handle large volumes of point cloud data in real time, which is crucial for safe train operation. In terms of accuracy, the algorithm can effectively identify potential obstacles ahead of the train. Furthermore, unlike deep learning methods that require extensive training data, it can detect various obstacles without pre-training, which gives it high adaptability and flexibility in dealing with unknown or unexpected obstacles in railway traffic scenarios (Figure 3).
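The growing step can be illustrated with a small sketch that expands clusters over occupied polar cells. It assumes an all-ones 3 × 3 kernel (8-neighborhood), treats the angle index as circular, and approximates the erosion step by discarding clusters with fewer than `min_points` points; all three choices are simplifications made for this example rather than details confirmed by the text.

```python
from collections import deque

# Offsets of the eight neighbors selected by an all-ones 3 x 3 kernel (assumed).
KERNEL_OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]

def grow_clusters(grid, n_theta, min_points=2):
    """grid maps (i, j) polar cells to point lists; returns a list of point clusters."""
    labels, clusters = {}, []
    for seed in grid:
        if seed in labels:
            continue
        labels[seed] = len(clusters)
        queue, members = deque([seed]), []
        while queue:
            i, j = queue.popleft()
            members.extend(grid[(i, j)])
            for di, dj in KERNEL_OFFSETS:
                nb = ((i + di) % n_theta, j + dj)  # the angle index wraps around
                if nb in grid and nb not in labels:
                    labels[nb] = len(clusters)
                    queue.append(nb)
        clusters.append(members)
    # Erosion-like clean-up: very small clusters are treated as noise.
    return [c for c in clusters if len(c) >= min_points]

# Hypothetical occupied cells for a grid with 180 angular bins.
cells = {
    (0, 10): [(5.0, 0.1, 0.2), (5.1, 0.1, 0.3)],
    (1, 10): [(5.0, 0.3, 0.2)],
    (179, 11): [(5.6, -0.1, 0.2)],
    (50, 30): [(-2.0, 15.0, 0.5)],
}
clusters = grow_clusters(cells, n_theta=180)
```

Because each cell stores the original Cartesian points, no explicit inverse transform is needed here; a grid that stored only $(r, \theta)$ values would apply $x = r\cos\theta$, $y = r\sin\theta$ at this stage.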
3.4. Strategies for Detecting Obstacles in Rail Transportation
In railway transportation systems, timely detection of obstacles is crucial to the safe operation of trains. This section proposes a strategy based on an enhanced YOLOv8 semantic segmentation model for efficiently and accurately identifying potential obstacles on the track. The strategy integrates data from both LiDAR (Light Detection and Ranging) and cameras, and uses joint calibration to establish a precise mapping between three-dimensional space and two-dimensional images, which improves the accuracy and reliability of obstacle detection. First, using the physical positions of the obstacles obtained in the clustering stage, the joint calibration matrix between the LiDAR and the camera maps the corner points of the obstacles’ 3D bounding boxes onto the corresponding 2D image plane. This step is crucial because it combines the 3D detection results with the predictions from the 2D image data. The remainder of this section focuses on improving YOLOv8 semantic segmentation at the 2D image level. We introduce an adaptive feature selection module in the final layer of the backbone to capture global spatial information of the track images and facilitate region segmentation. In addition, a custom lightweight shared convolution detection head is integrated into the segment layer to reduce the computational load while preserving accuracy in identifying railway areas in the images.
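The mapping of a 3D corner point onto the image plane can be sketched with the standard pinhole model. The extrinsic and intrinsic values below are made up for the example (an axis swap with no translation, a focal length of 1000 pixels, and a 1280 × 720 principal point); real values come from the joint calibration.

```python
def project_to_image(p_lidar, extrinsic, intrinsic):
    """Project a LiDAR-frame point to pixel coordinates, or None if behind the camera."""
    x, y, z = p_lidar
    # LiDAR frame -> camera frame: apply the 3 x 4 matrix [R | t].
    cam = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in extrinsic]
    if cam[2] <= 0.0:
        return None
    # Camera frame -> homogeneous pixel coordinates via the 3 x 3 intrinsic matrix.
    uvw = [sum(k[c] * cam[c] for c in range(3)) for k in intrinsic]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Assumed axes: LiDAR x forward, y left, z up; camera x right, y down, z forward.
EXTRINSIC = [[0.0, -1.0, 0.0, 0.0],
             [0.0, 0.0, -1.0, 0.0],
             [1.0, 0.0, 0.0, 0.0]]
INTRINSIC = [[1000.0, 0.0, 640.0],
             [0.0, 1000.0, 360.0],
             [0.0, 0.0, 1.0]]

corner = (20.0, 1.0, 0.5)  # one corner of a hypothetical 3D bounding box
pixel = project_to_image(corner, EXTRINSIC, INTRINSIC)
```

Projecting all eight corners of a box and taking the minimum and maximum pixel coordinates gives the 2D region to compare against the segmentation mask.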
3.4.1. YOLOv8s-seg Network Model
With the continuous advancements in deep learning technology, YOLOv8 has gained significant popularity across various application scenarios due to its outstanding precision and efficiency. The core of this algorithm lies in its meticulously designed network structure, comprising three key components. Firstly, the backbone utilizes the enhanced residual structure C2F module to effectively extract deep image features, enrich gradient flow information, optimize the training process, and adjust channel numbers for different scales, thereby enhancing the model’s flexibility and adaptability. Secondly, the neck section combines the C2F module with up-sampling modules to achieve multi-scale feature fusion and enhancement, thereby improving detection accuracy and robustness. Finally, the head section adopts a decoupled head structure that separates classification from detection heads, freeing itself from traditional anchor box constraints, simplifying the model structure, and enhancing the detection capability of objects at multiple scales, allowing YOLOv8 to perform exceptionally well in complex scenes.
Therefore, to meet the high demands for accuracy and real-time performance in railway track obstacle detection, this paper proposes an improved semantic segmentation model based on YOLOv8. YOLOv8-seg is the semantic segmentation model derived from YOLOv8; its network structure is illustrated in Figure 4.
YOLOv8-seg utilizes Darknet as its core backbone network, built upon a series of meticulously designed convolutional layers and residual blocks. These network layers are responsible for accurately extracting crucial features from images and effectively passing them to the subsequent segmentation head for processing. Following the backbone network, YOLOv8-seg integrates a segmentation head dedicated to generating semantic segmentation predictions. This head consists of a series of convolutional and up-sampling layers whose primary task is to restore the feature map size to match that of the original input image, thereby producing pixel-level semantic segmentation results. To further enhance the accuracy and detail representation capability of semantic segmentation, YOLOv8-seg introduces a feature fusion module. Innovatively, this module deeply integrates feature maps from both object detection and semantic segmentation tasks, leveraging the combined information to achieve higher accuracy in semantic segmentation tasks. Ultimately, YOLOv8-seg outputs a semantic segmentation result map that exactly matches the size of the input image. In this result map, each pixel is precisely assigned a class label, clearly identifying the object category or background to which the pixel belongs. In contrast to traditional object detection models, the semantic segmentation model of YOLOv8-seg features unique prototype mask branches and mask coefficients in its head structure. These components work together to generate refined semantic masks, thereby achieving precise segmentation of different semantics. This approach, initially proposed by the YOLACT model and further optimized and applied in YOLOv8-seg, enhances the model’s capability in semantic segmentation tasks.
YOLOv8-seg provides five model variants: YOLOv8n-seg, YOLOv8s-seg, YOLOv8m-seg, YOLOv8l-seg, and YOLOv8x-seg. According to the performance metrics released by Ultralytics (Table 1), these models perform well on the COCO2017 dataset. Taking into account both model size and performance, this study selects YOLOv8s-seg as the base model.
3.4.2. Improving the YOLOv8s-seg Network Model
In addressing the segmentation challenges of unknown or emergent obstacles in railway track traffic, models used for railway track segmentation must meet real-time requirements while ensuring detection accuracy. Therefore, this study proposes two main improvements to the YOLOv8s-seg algorithm. Firstly, a self-adaptive feature selection module is integrated into the last layer of the backbone. This module aims to capture global spatial information from railway track images to facilitate region segmentation. Secondly, a self-developed lightweight shared convolution detection head is incorporated into the segment layer. This addition reduces model computational complexity while maintaining accuracy.
Figure 5 illustrates the modified network architecture of YOLOv8-seg proposed in this study.
3.4.3. Shared Convolutional Layer Module
The core concept of convolutional neural networks (CNNs) lies in local connectivity and parameter sharing. In convolution operations, parameter sharing is achieved through the use of filters, commonly referred to as weights, which are applied consistently across different spatial locations of the input data. When an image is input into a CNN, these filters scan the image, performing convolutional operations that involve computing the dot product between the filter weights and the local regions of the input. This process allows the network to detect spatial hierarchies of features, such as edges, textures, and more complex patterns, at different levels of abstraction. The advantage of weight sharing is that it significantly reduces the number of parameters in the network, making the model more efficient and less prone to overfitting, particularly when dealing with large-scale image data. Furthermore, local connectivity ensures that the network can effectively capture spatial dependencies within the data, which is critical for tasks such as image recognition, object detection, and semantic segmentation. By leveraging these properties, CNNs have become a fundamental tool in the field of deep learning, enabling the development of highly accurate and computationally efficient models for various computer vision applications.
To further reduce the storage requirements and computational burden of convolutional neural networks (CNNs), this paper proposes cyclic weight sharing of convolutional layer parameters in the detection layer network. This method reduces redundant parameters within the network, improving efficiency while maintaining strong feature learning and representation capabilities, which makes the model more suitable for resource-constrained environments and real-time applications. Specifically, as illustrated in Figure 6, consider a 3 × 3 convolutional kernel sliding over an image to extract features. If the input image has k channels, the total number of parameters for this kernel is 3 × 3 × k. Without parameter sharing, as depicted in Figure 5, every position would use independent parameters, resulting in a total of W (width) × H (height) × K (kernel) parameters. For instance, for an input image of size 192 × 192, removing parameter sharing from the first layer’s 3 × 3 × 32 convolutional kernel would increase the parameter count to 192 × 192 × 32 = 1,179,648, which is 4096 times the original 288 parameters. Such a large parameter scale inflates the model size and increases the consumption of computational resources and storage space. Based on this discussion, this paper adopts an optimization strategy within the custom lightweight detection head layer that specifically targets weight sharing in the convolutional layers for classification and regression.
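The arithmetic in this example can be checked directly. The snippet follows the counting convention used in the text (3 × 3 × 32 for the shared kernel and W × H × 32 for the unshared case); the variable names are illustrative.

```python
kernel_h, kernel_w, channels = 3, 3, 32
width, height = 192, 192

shared_params = kernel_h * kernel_w * channels   # 288
unshared_params = width * height * channels      # 1,179,648
ratio = unshared_params // shared_params         # 4096

print(shared_params, unshared_params, ratio)
```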
3.4.4. Introducing a Self-Developed Lightweight Detection Head
Compared with YOLOv5, the most significant changes in YOLOv8 are in its head section, as illustrated in Figure 7.
From the diagrams, it is evident that YOLOv8 has made significant innovations in its head structure compared to YOLOv5. YOLOv5 originally employed an anchor-based coupled head, which has been transformed into an anchor-free decoupled head design in YOLOv8. This transformation signifies the complete separation of classification and detection heads, discarding the traditional objectness branch in favor of decoupled classification and regression branches that incorporate Distribution Focal Loss (DFL). Each branch undergoes a meticulously designed network structure, including Conv modules with two 3 × 3 convolutional kernels and a Conv2d module with a 1 × 1 convolutional kernel. While this design enhances the model’s capacity for feature extraction and representation, it concurrently increases the computational complexity and parameter count of the model.
To address the randomness and unpredictability of obstacles in railway track traffic, it is crucial to balance model accuracy and efficiency to meet real-time detection requirements. Therefore, this paper proposes an innovative solution: integrating a custom lightweight shared convolutional detection head (LSCD) into the segment structure of YOLOv8s-seg, replacing the original detection head. The design of this LSCD structure aims to reduce the computational burden of the model while maintaining its robust feature learning and representation capabilities. Through this improvement, the new segment structure not only inherits the high precision of YOLOv8, but also significantly reduces computational complexity and parameter count, thereby enhancing the real-time performance of the model.
Figure 8 illustrates the segment structure after integrating the LSCD, demonstrating visually how this innovative design optimizes and improves the model structure.
In the YOLOv8-seg model, the segment layer incorporates a crucial component known as the Proto layer, which plays a pivotal role as the prototype mask branch. It is responsible for generating mask coefficients and semantic masks to achieve precise segmentation of different instances in the image. To extract effective features, the network initially passes the feature map through a 1 × 1 convolution module with group normalization (Conv_GN), enhancing feature representation capability while reducing internal covariate shift. Subsequently, the feature map undergoes two 3 × 3 shared convolutional layers with group normalization. These shared convolutional layers not only reduce model parameters and computational load, but also ensure stability and accuracy through their group normalization structure. Following the shared convolutional processing, the feature map is separately processed by Conv_Reg and Conv_Cls convolutional modules for bounding box regression and class prediction. It’s noteworthy that Conv_Reg and Conv_Cls also adopt a shared convolutional design, further enhancing model efficiency. Simultaneously using shared convolutions, the model introduces a scale layer to scale features, addressing scale differences in targets detected by different detection heads, and thus ensuring accurate detection across various object scales.
In the YOLOv8-seg model, another critical component is the Conv_Mask module, which learns and predicts masks for target objects from input features, enabling precise segmentation of objects in the image. Due to the high demand for fine-grained segmentation in semantic segmentation models, the Conv_Mask module does not employ shared convolutions to maintain segmentation accuracy. Furthermore, to reduce parameters and computational load, the improved segment layer replaces one 3 × 3 convolution kernel with a 1 × 1 convolution kernel, optimizing model efficiency while preserving performance.
The original design of the YOLOv8 head layer is depicted in Figure 7, where features from three different scales, P3, P4, and P5, are processed independently through their respective branches. Each branch consists of two 3 × 3 convolutional layers and one 1 × 1 convolutional layer. While this design enhances detection accuracy, it also significantly increases the number of convolutional layer parameters. To address this issue, this paper introduces shared convolution within the detection head, as illustrated in Figure 9. Specifically, the two 3 × 3 convolutional layers originally duplicated across the branches, as well as the convolutional layers for classification and regression, are redesigned to be shared, so that the three feature branches at different scales use the same set of weights. This sharing significantly reduces the number of model parameters, lowers model complexity, and can improve computational efficiency. Because classification and regression extract information from the same input features, the two tasks remain consistent with each other, which contributes to the model’s performance on both. Sharing computation between the tasks also makes training more efficient and reduces training time. In addition, shared convolutional layers learn feature representations across scales and tasks, which can improve generalization to new, unseen data. Finally, the reduced parameter count and computational complexity make the model more lightweight and better suited to deployment on resource-constrained devices.
In summary, this paper proposes a strategy of sharing convolutions for classification and regression operations within the segment layer, aiming to optimize the model’s parameter efficiency and computational performance while maintaining or improving its performance in classification and regression tasks. This design not only enhances the model’s generalization ability, but also makes it more applicable to real-world application scenarios.
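A rough parameter count shows why sharing the head convolutions across scales helps. The sketch assumes that all three scales have already been brought to a common width of 64 channels and that each branch has two 3 × 3 convolutions followed by one 1 × 1 convolution; biases, normalization parameters, and the per-scale scale layers are ignored. The channel widths are hypothetical, not values from the paper.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def branch_params(channels, out_channels):
    """Two 3 x 3 convolutions followed by one 1 x 1 output convolution."""
    return 2 * conv_params(channels, channels, 3) + conv_params(channels, out_channels, 1)

channels, out_channels, num_scales = 64, 64, 3

independent = num_scales * branch_params(channels, out_channels)  # one branch per scale
shared = branch_params(channels, out_channels)                     # one branch reused by P3, P4, P5

print(independent, shared, independent / shared)
```

Under these assumptions the shared design needs one third of the head weights; the actual saving in LSCD depends on the real channel widths and on the layers that remain unshared, such as Conv_Mask.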
3.4.5. Group Normalization
During the training of neural networks, parameters are updated iteratively, and changes in the parameters of one layer are influenced by changes in the preceding layers. This hierarchical propagation can cause gradients to vanish in the shallow layers of the network during backpropagation, hindering convergence. To address vanishing gradients, batch normalization is commonly employed in network architecture design [27]. Batch normalization transforms the input data to approximate a standard normal distribution, enhancing the responsiveness of nonlinear functions to the input. This accelerates convergence and improves prediction accuracy.
However, the effectiveness of batch normalization depends strongly on batch size. With small batches in particular, batch normalization may increase errors in the output and thereby degrade model performance. To remove this dependence, this paper replaces the original batch normalization in the LSCD convolutional layers with group normalization (GN) [28]. GN is a more robust normalization method that maintains stable performance across different batch sizes, so the model is less sensitive to the choice of batch size. This improvement enhances the convergence speed and prediction accuracy of the model and strengthens its robustness and generalization capability.
Group normalization (GN) is a normalization technique used in convolutional neural networks. Unlike batch normalization (BN), GN groups channels and normalizes within each group. Its computation is independent of batch size, so it remains stable even for the small batch sizes used in high-precision image scenarios. Because the mean and variance are computed within each channel group, GN also helps reduce noise and improves model stability. The formula for group normalization is as follows:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \tag{14}$$

where $x$ is the input feature, $\mu$ is the mean computed within each group, $\sigma^2$ is the variance computed within each group, and $\epsilon$ is a small value that prevents division by zero. Specifically, given a feature $x$ with shape $(N, C, H, W)$, where $N$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the spatial dimensions, the $C$ dimension is first divided into $G$ groups, each with $C/G$ channels, and the features of each group are then normalized. According to Reference [29], GN has been shown to improve the performance of detection heads for localization and classification in the FCOS paper. Reference [5] demonstrates that removing group normalization in classification and regression reduces model accuracy by 1%. Therefore, to maintain the detection performance of the detection heads, GN modules are added to the convolutional layers in LSCD, as shown in Figure 10.
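The normalization can be illustrated for a single sample with plain Python lists. This sketch omits the learnable scale and shift that practical GN layers usually add, and the function name and data layout (a list of channels, each a flat list of H × W values) are assumptions for the example.

```python
import math

def group_norm(feature, num_groups, eps=1e-5):
    """Normalize one sample: feature is a list of C channels of flattened values."""
    c = len(feature)
    assert c % num_groups == 0, "channels must divide evenly into groups"
    size = c // num_groups
    out = []
    for g in range(num_groups):
        group = feature[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out

# Four channels with two values each, split into two groups of two channels.
feature = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
out = group_norm(feature, num_groups=2)
```

The statistics are computed from one sample only, which is why the result does not change with batch size.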
3.4.6. Attention
This study utilizes the attention mechanism SegNext_Attention from the SegNeXt semantic segmentation model. SegNeXt, proposed by Guo et al. [30] in 2022, is a network architecture designed for semantic segmentation tasks. It departs from traditional convolutional attention design by introducing the multi-scale convolutional attention module (SegNext_Attention), which effectively encodes contextual information. Compared to the self-attention mechanisms used in transformers, SegNext_Attention demonstrates higher performance.
At the core of SegNeXt lies the SegNext_Attention module, where the attention mechanism is cleverly integrated between the encoder and decoder connections. This mechanism learns pixel-level attention weights, allowing the model to precisely focus on regions of interest while disregarding irrelevant background information. The overall framework of SegNext_Attention is carefully constructed, comprising encoder, attention mechanism, decoder, and loss function components.
The encoder consists of multiple convolutional and pooling layers, progressively reducing feature map size and increasing channel numbers for efficient feature extraction. On the output feature map of the encoder, the SegNext_Attention mechanism generates attention weights. The decoder utilizes up-sampling and convolution operations to map encoder feature maps to pixel-level segmentation results, gradually restoring feature map size and reducing channel numbers. Ultimately, the training process guides the model through the computation of loss functions, such as cross-entropy and Dice loss, which effectively measure prediction accuracy.
In the encoder section, traditional convolutional block designs are overturned by introducing the multi-scale convolutional attention mechanism. This mechanism leverages multi-scale convolutional features and performs efficient spatial attention computation through element-wise multiplication. The encoder is structured with four down-sampling stages, each employing the same MSCAN module to form a pyramid-like structure for extracting multi-scale contextual feature information, as shown in
Figure 11a.
In the decoder section, the Hamburger structure is employed to further extract global contextual information, enabling multi-scale contextual feature extraction from local to global levels. The core design within the MSCAN module is the multi-scale convolutional attention SegNext_Attention, illustrated in
Figure 11b. It comprises three parts: a depth-wise convolution, multi-branch depth-wise strip convolutions, and a 1 × 1 convolution. The depth-wise convolution extracts local feature information, the multi-branch depth-wise strip convolutions capture multi-scale contextual feature information, and the 1 × 1 convolution models correlations between different channels. The output serves as the attention weights, which reweight the input to SegNext_Attention to produce the final output. The computation process of SegNext_Attention is given in Equations (15) and (16).
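\[
\mathrm{Att} = \mathrm{Conv}_{1 \times 1}\left( \sum_{i=0}^{3} \mathrm{Scale}_i\big(\mathrm{DW\text{-}Conv}(F)\big) \right) \tag{15}
\]
\[
\mathrm{Out} = \mathrm{Att} \otimes F \tag{16}
\]
where \(F\) represents the input feature, \(\mathrm{Att}\) represents the attention weight parameter, \(\mathrm{Out}\) represents the output feature, and \(\otimes\) represents the element-wise multiplication operation. \(\mathrm{DW\text{-}Conv}\) denotes the depth-wise convolution operation, and \(\mathrm{Scale}_i\), where \(i \in \{0, 1, 2, 3\}\), represents the i-th branch, as shown in Figure 11b. \(\mathrm{Scale}_0\) represents a direct (identity) connection, while each of the other three branches uses two consecutive depth-wise strip convolutions (DWSCs), one per direction, to approximate a standard depth-wise convolution with a large kernel. The kernel sizes for these three strip convolutions are set to 7, 11, and 21, respectively.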
The use of paired vertical and horizontal strip convolution kernels (d × 1 and 1 × d, where d is the strip length) achieves an effective receptive field equivalent to d × d with far fewer parameters. Additionally, when separating railway tracks from the ground in transportation systems, these strip-shaped kernels are particularly beneficial for extracting features related to track boundaries. The use of depth-wise separable convolution further reduces the parameter count, thereby mitigating the risk of overfitting. The primary advantages of SegNext_Attention are its ability to make better use of global contextual information and to focus more accurately on regions of interest, thereby enhancing the performance of image segmentation.
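The parameter saving described above can be checked with a short calculation. The sketch below compares, per channel and ignoring bias terms, a full d × d depth-wise kernel with a 1 × d and d × 1 strip pair, and confirms that stacking the two strips (stride 1, no dilation) covers a d × d receptive field. The function names are hypothetical and chosen only for this illustration.

```python
def full_depthwise_params(d, channels=1):
    """Weights in a d x d depth-wise convolution (one kernel per channel, no bias)."""
    return channels * d * d


def strip_pair_params(d, channels=1):
    """Weights in a 1 x d followed by a d x 1 depth-wise strip convolution (no bias)."""
    return channels * 2 * d


def stacked_receptive_field(kernels):
    """Receptive field (height, width) of stacked stride-1, undilated convolutions."""
    height = 1 + sum(kh - 1 for kh, _ in kernels)
    width = 1 + sum(kw - 1 for _, kw in kernels)
    return height, width


# Kernel sizes of the three strip branches given in the text.
for d in (7, 11, 21):
    rf = stacked_receptive_field([(1, d), (d, 1)])
    print(d, rf, full_depthwise_params(d), strip_pair_params(d))
```

For the largest branch (d = 21), the strip pair needs 42 weights per channel where a full kernel would need 441, while covering the same 21 × 21 area.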
In this study, we propose a train obstacle detection and safety response strategy based on incremental clustering and lightweight shared convolution track segmentation. The approach is as follows. First, the incremental clustering algorithm is applied to obtain more accurate clustering results, and a 3D bounding box is computed for each merged cluster. Through joint calibration of the camera and LiDAR, we obtain the transformation matrix T between them. The obstacle bounding boxes are then mapped to the 2D image plane and combined with the track segmentation model, so that potential obstacles on the track can be precisely identified and responded to, ensuring the safety and efficiency of train operations. The formula for mapping 3D obstacle bounding boxes to the 2D image plane is as follows:
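Under a standard pinhole camera model, this mapping can be sketched in the following form. The intrinsic matrix \(K\), the scale factor \(s\), and the reading of \(T\) as the 3 × 4 extrinsic matrix \([R \mid t]\) are assumptions made for illustration and are not defined in the text:

\[
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, T \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \qquad P_{2D} = (u, v),
\]

where \((X, Y, Z)\) is a bounding-box corner in the LiDAR frame and \((u, v)\) is its pixel position in the image.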
Using the improved YOLOv8 semantic segmentation model for railway area segmentation in images, we obtain segmentation masks \(M\). The strategy for detecting traffic obstacles is as follows: For each projection point \(x\), check if this point lies within the railway segmentation area \(M\), as shown in Equation (9). If any projection point \(P_{2D}\) falls within the railway area, trigger a safety alert.
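The alert strategy can be illustrated with a minimal sketch that projects bounding-box corners into the image and tests them against a binary track mask. It assumes a pinhole camera with a 3 × 3 intrinsic matrix K (not named in the text), treats T as a 3 × 4 LiDAR-to-camera matrix, and represents the mask as a nested list of 0/1 values; all function names and the toy calibration values are hypothetical.

```python
def project_point(point_lidar, T, K):
    """Project a 3D LiDAR point to pixel (u, v); return None if behind the camera."""
    x, y, z = point_lidar
    homog = (x, y, z, 1.0)
    # LiDAR frame -> camera frame using the assumed 3x4 extrinsic matrix T = [R | t].
    cam = [sum(T[r][c] * homog[c] for c in range(4)) for r in range(3)]
    if cam[2] <= 0:
        return None
    # Camera frame -> image plane using the assumed intrinsic matrix K.
    uvw = [sum(K[r][c] * cam[c] for c in range(3)) for r in range(3)]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]


def box_triggers_alert(corners_lidar, T, K, mask):
    """Return True if any projected corner lands on a track pixel of the mask."""
    height, width = len(mask), len(mask[0])
    for corner in corners_lidar:
        pixel = project_point(corner, T, K)
        if pixel is None:
            continue
        u, v = int(round(pixel[0])), int(round(pixel[1]))
        if 0 <= v < height and 0 <= u < width and mask[v][u]:
            return True
    return False


# Toy setup: identity extrinsics, an 8 x 6 image, track occupying columns 3 to 5.
T = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
K = [[100.0, 0.0, 4.0],
     [0.0, 100.0, 3.0],
     [0.0, 0.0, 1.0]]
mask = [[1 if 3 <= col <= 5 else 0 for col in range(8)] for _ in range(6)]

print(box_triggers_alert([(0.0, 0.0, 10.0)], T, K, mask))  # corner projects onto the track
print(box_triggers_alert([(0.3, 0.0, 10.0)], T, K, mask))  # corner projects beside the track
```

Testing only the box corners is a simplification: a large obstacle whose corners all fall outside the mask while its body covers the track would be missed, so a fuller implementation might test the whole projected box region instead.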