1. Introduction
Rail transit plays a critical role in fostering urban economic development and fulfilling citizens’ transportation needs. However, the intrusion of obstacles poses a significant threat to the operational safety of rail systems [1,2]. Traditional methods for detecting obstacles often rely on manual inspection, which is both time-consuming and labor-intensive. The current mainstream solution employs video surveillance [3,4]. Nevertheless, the significant degradation in video image quality caused by fog presents a formidable challenge to traditional techniques used for detecting obstacles along rail transit perimeters.
Obstacle detection in rail transit using video images predominantly encompasses three methodologies: feature-based, machine learning-based, and deep learning-based methods. Feature-based object detection techniques extract features from images and employ classifiers or regressors for object identification. Commonly used features include Haar features [5], HOG features [6], and SIFT features [7]. These methods typically require manual feature and classifier design, and their performance is significantly constrained by the selection and design of these features. Machine learning-based object detection techniques acquire feature representations and target models by training classifiers or regressors. Conventional machine learning methods include support vector machines, random forests, and AdaBoost [8,9]. These approaches require manual feature extraction, followed by training and inference with machine learning algorithms. In recent years, remarkable advances have been achieved in object detection through deep learning. These approaches harness deep convolutional neural networks (CNNs) to learn rich feature representations from images and precisely localize objects within them. Prominent deep learning detectors include Faster R-CNN, You Only Look Once (YOLO), and the Single-Shot MultiBox Detector (SSD) [10]. These methods enable end-to-end object detection, obviating the need for manual feature engineering and thereby yielding substantial gains in both accuracy and computational efficiency, along with stronger discriminative capability and better generalization.
In the realm of object detection, YOLO epitomizes single-stage algorithms that harness convolutional neural networks to extract target features, recasting classification as a regression problem. This paradigm directly outputs target categories and their positions, markedly improving detection efficiency. Deqiang He proposed the FE-YOLO algorithm, an improved CNN-based one-stage object detection method, to enhance obstacle detection accuracy in rail transit environments, focusing on small and irregular obstacles [11]. Tao Ye proposed the SEF-Net algorithm, which enhances railway obstacle detection by improving accuracy and speed, particularly for small objects in complex environments; it integrates stable bottom feature extraction, lightweight feature extraction, and adaptive feature fusion modules [12]. The algorithms described above effectively address obstacle detection in rail transit under clear or lightly foggy weather conditions. However, detecting obstacles in severe fog remains a significant unresolved challenge for rail transit safety. Compared with obstacle detection in clear weather, research on detection in dense fog is relatively limited. Using defogging models for image enhancement has emerged as a powerful technique and has seen significant advances in recent years. Kaiming He proposed a method to remove haze from a single image using the dark channel prior, which relies on the observation that haze-free outdoor images contain pixels with very low intensities in at least one color channel [13]. Vishwanath A. Sindagi introduced an unsupervised domain-adaptive object detection framework designed for adverse weather conditions such as haze and rain, using weather-specific priors to improve detection performance by minimizing weather-related distortions in image features [14]. However, image degradation caused by dense fog significantly lowers detection accuracy, resulting in missed detections and false alarms.
A target detection model trained on high-quality, clear images often fails to achieve satisfactory performance under adverse weather conditions, such as dense fog. One effective approach is to decompose images captured in these adverse conditions into clean images and their corresponding weather information. Based on weather information, image quality can be appropriately enhanced, potentially recovering more latent information about originally blurred and misidentified objects. This image enhancement technique not only improves the clarity and contrast of images but also enhances the performance of target detection models under adverse weather conditions. As a result, it allows for accurate identification and detection of target objects in environments such as dense fog, rain, and snow, thereby increasing the reliability and stability of the system. However, enhancing image quality under varying levels of fog density remains a challenging problem that requires further investigation.
To address the aforementioned issues, this paper proposes the MSA-YOLO algorithm for rail transit under foggy weather conditions. The algorithm applies a multi-scale adaptive technique to the foggy image to suppress interferences such as dense fog and recover latent useful information from the images. Subsequently, the YOLOv3 network is employed to detect obstacles in rail transit. Utilizing its powerful convolutional neural network architecture, YOLOv3 performs multi-scale object detection, ensuring high accuracy while maintaining real-time performance. The system effectively identifies and locates obstacles on the tracks even under complex foggy conditions, significantly enhancing detection accuracy and robustness.
3. Methodology
Figure 1 illustrates the MSA-YOLO algorithm. This algorithm comprises three main components: a multi-scale adaptive module, an image processing module, and a YOLO module. The image processing module comprises several advanced filters: a defogging filter, a white balance filter, a gamma filter, a contrast filter, a tone filter, and a sharpen filter. The multi-scale adaptive module is employed to optimize the parameters within the filtering module. These filters collaborate to enhance image quality by mitigating the effects of fog, adjusting color balance, optimizing brightness and contrast, refining tonal range, and sharpening details. This comprehensive approach significantly enhances the algorithm’s overall performance, especially in adverse weather conditions. The YOLO algorithm integrates both local and global information, substantially enhancing the network’s capability to detect objects of varying scales, even in challenging foggy conditions. This integration significantly improves detection accuracy and robustness, making the algorithm highly effective for real-world applications.
3.1. Multi-Scale Adaptive Module
The multi-scale adaptive network module demonstrates exceptional performance in the field of image processing, particularly in tasks such as defogging and various filtering operations. By employing convolution kernels of different scales, this module effectively processes multi-level information in images, thereby enhancing image quality and detail representation. Specifically, the module includes three different scales of convolution kernels: 7 × 7, 5 × 5, and 3 × 3.
The input to the network is a foggy image, and the architecture consists of three parallel branches for multi-scale feature extraction and fusion. The first branch employs five layers of 7 × 7 convolutions, designed to cover a larger receptive field and capture global features and long-range dependencies of the image. This branch effectively handles images with complex backgrounds and extracts large-scale global information through multiple layers of 7 × 7 convolutions. The second branch uses five layers of 5 × 5 convolutions, balancing receptive field size and computational complexity. Compared to the 7 × 7 convolutions, the 5 × 5 convolutions better capture medium-scale features while being more computationally efficient. This branch enriches the overall feature representation by extracting medium-scale features through multiple layers of 5 × 5 convolutions. The third branch directly inputs the original foggy image, preserving the initial information and details. This ensures that the original image information is retained during feature extraction, providing a reliable foundation for subsequent processing.
At the output stage, the outputs of the three branches are summed to achieve multi-scale feature fusion. By combining features from different scales, the network can comprehensively represent the image information, enhancing defogging effectiveness and feature representation capability. The fused features are then passed through five layers of 3 × 3 convolutions for further detail extraction. The 3 × 3 convolutions, with their smaller receptive field, capture fine structures and textures in the image. This multi-layered approach allows for fine-grained processing of the fused features, enhancing the detailed representation of the image.
Finally, the output is processed through two dense layers for higher-level feature extraction and classification. These dense layers linearly combine the features extracted by the convolutional layers and enhance the model’s expressive power through nonlinear activation functions. The processing through the two dense layers results in a high-quality, defogged image.
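To make the branch layout concrete, the following is a minimal TensorFlow sketch of the module described above. The kernel sizes, branch depths, identity branch, summation fusion, and two dense layers follow the text; the channel widths, activations, input resolution, and the size of the output head (assumed here to regress the filter hyperparameters of Section 3.2) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_branch(x, kernel_size, depth=5, width=16):
    """A branch of `depth` same-padded convolutions sharing one kernel size."""
    for _ in range(depth - 1):
        x = layers.Conv2D(width, kernel_size, padding="same", activation="relu")(x)
    # Project back to 3 channels so all branches can be summed with the input.
    return layers.Conv2D(3, kernel_size, padding="same")(x)

def build_msa_module(input_shape=(256, 256, 3), num_filter_params=15):
    foggy = layers.Input(shape=input_shape)      # foggy input image
    b7 = conv_branch(foggy, 7)                   # large receptive field: global features
    b5 = conv_branch(foggy, 5)                   # medium-scale features
    fused = layers.Add()([b7, b5, foggy])        # identity branch keeps original detail
    x = fused
    for _ in range(5):                           # five 3x3 convolutions refine fine detail
        x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(64, activation="relu")(x)   # first dense layer
    out = layers.Dense(num_filter_params)(x)     # second dense layer
    return Model(foggy, out)
```

Summing the two convolutional branches with the untouched input realizes the multi-scale fusion described above while guaranteeing that the original image information is never lost.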
3.2. Image Processing Module
3.2.1. Defogging Filter
The dark channel prior defogging filter is an effective image-defogging algorithm that removes fog effects by estimating the dark channel prior of an image, thereby restoring the image’s clarity and contrast [13]. Based on the atmospheric scattering model, a fog image can be constructed as follows:

$$ I(x) = J(x)\,t(x) + A\big(1 - t(x)\big) \tag{1} $$

where $I(x)$ represents the fog image, $J(x)$ denotes the scene radiance (clean image), $A$ is the global atmospheric light, and $t(x)$ is the medium transmission map, defined as follows:

$$ t(x) = e^{-\beta d(x)} \tag{2} $$

where $\beta$ represents the scattering coefficient of the atmosphere, and $d(x)$ denotes the scene depth.
For each pixel in the input image, the minimum value among all color channels within a local window is selected; this forms the dark channel image:

$$ I^{\mathrm{dark}}(x) = \min_{y \in \Omega(x)} \Big( \min_{c \in \{r,g,b\}} I^{c}(y) \Big) \tag{3} $$

where $I^{\mathrm{dark}}(x)$ is the dark channel image, $\Omega(x)$ represents the local window around pixel $x$, and $I^{c}(y)$ is the pixel value of color channel $c$. The top 0.1% brightest pixels in the dark channel image are then selected, and the corresponding pixels in the original image are used to estimate the atmospheric light $A$. The transmission map $\tilde{t}(x)$ describes the portion of light that is not scattered and reaches the camera. It is estimated as follows:

$$ \tilde{t}(x) = 1 - \omega \min_{y \in \Omega(x)} \Big( \min_{c} \frac{I^{c}(y)}{A^{c}} \Big) \tag{4} $$
In this context, $\omega$ is a hyperparameter that is optimized through backpropagation by the multi-scale adaptive network to improve the performance of the defogging filter for detection in foggy images.
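As a concrete reference, the following is a minimal NumPy/SciPy sketch of Equations (1)–(4). The 15-pixel window and the lower bound on the transmission are common choices from the dark channel prior literature [13], not values specified in this paper; `omega` is the parameter learned by the multi-scale adaptive network.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def defog(I, omega, window=15, t_min=0.1):
    """I: H x W x 3 float image in [0, 1]; omega: learned haze-retention weight."""
    # Eq. (3): per-pixel channel minimum, then local window minimum.
    dark = minimum_filter(I.min(axis=2), size=window)
    # Atmospheric light A: mean of original pixels at the top 0.1% of the dark channel.
    n_top = max(1, int(0.001 * dark.size))
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n_top:], dark.shape)
    A = I[idx].mean(axis=0)                            # one value per color channel
    # Eq. (4): transmission from the dark channel of the A-normalized image.
    t = 1.0 - omega * minimum_filter((I / A).min(axis=2), size=window)
    # Invert Eq. (1): J = (I - A) / max(t, t_min) + A.
    t = np.clip(t, t_min, 1.0)[..., None]
    return np.clip((I - A) / t + A, 0.0, 1.0)
```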
3.2.2. White Balance Filter
The white balance algorithm adjusts the proportions of the red, green, and blue (RGB) channels in an image to ensure that white objects appear truly white. The white balance filter is defined as follows:

$$ (r_o,\; g_o,\; b_o) = (W_r\,r_i,\; W_g\,g_i,\; W_b\,b_i) \tag{5} $$

where $r_i$, $g_i$, and $b_i$ represent the values of the three color channels of the input image, and $r_o$, $g_o$, and $b_o$ represent the corresponding output values. The coefficients of the white balance filter are denoted by $W_r$, $W_g$, and $W_b$; these are the parameters that require optimization.
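Equation (5) amounts to one learned gain per channel; a minimal sketch, with the gains supplied by the multi-scale adaptive network:

```python
import numpy as np

def white_balance(I, Wr, Wg, Wb):
    """I: H x W x 3 image; Wr, Wg, Wb: learned per-channel gains (Eq. (5))."""
    return I * np.array([Wr, Wg, Wb])
```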
3.2.3. Gamma Filter
The Gamma filter is a nonlinear operation commonly used in image processing. It enhances the visual quality of an image by adjusting its brightness and contrast. The filter is defined as follows:

$$ P_o = P_i^{\,\gamma} \tag{6} $$

where $P_i$ and $P_o$ are the input and output pixel values, and $\gamma$ is a hyperparameter that requires optimization.
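A sketch of Equation (6) in code; the small clipping constant is an implementation detail added here to keep the power stable near zero, not a value from the paper:

```python
import numpy as np

def gamma(I, g):
    """I: image in [0, 1]; g: learned gamma exponent (Eq. (6))."""
    return np.power(np.clip(I, 1e-6, 1.0), g)   # clip avoids pow(0, g) instabilities
```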
3.2.4. Contrast Filter
A contrast filter is an image processing tool used to enhance the contrast of an image. By increasing the difference between light and dark areas, a contrast filter makes the features of the image more distinguishable and visually appealing. Its primary goal is to improve the image’s clarity and detail, making it more suitable for visual analysis or presentation. The filter is defined as follows:

$$ P_o = \alpha \cdot \mathrm{En}(P_i) + (1 - \alpha) \cdot P_i \tag{7} $$

where $\mathrm{En}(P_i)$ denotes a luminance-based enhancement of the input pixel $P_i$, and $\alpha$ is the parameter that requires optimization.
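A sketch of Equation (7) follows. The paper does not spell out $\mathrm{En}(\cdot)$; the luminance-based S-curve used below is an assumption borrowed from image-adaptive filtering pipelines such as IA-YOLO:

```python
import numpy as np

def contrast(I, alpha):
    """I: H x W x 3 image in [0, 1]; alpha: learned blending weight in [0, 1]."""
    lum = 0.27 * I[..., 0] + 0.67 * I[..., 1] + 0.06 * I[..., 2]
    en_lum = 0.5 * (1.0 - np.cos(np.pi * lum))            # S-curve on luminance
    en = I * (en_lum / np.maximum(lum, 1e-6))[..., None]  # assumed En(.)
    return alpha * en + (1.0 - alpha) * I                 # Eq. (7)
```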
3.2.5. Tone Filter
A tone filter is an image processing tool designed to adjust the tonal range and distribution of an image. It enhances visual quality by manipulating highlights, midtones, and shadows, thereby bringing out details, improving contrast, and creating a more balanced and aesthetically pleasing image. The filter is represented as a monotonic piecewise-linear curve and is defined as follows [29]:

$$ P_o = \frac{1}{T_L} \sum_{j=0}^{L-1} \mathrm{clip}(L \cdot P_i - j,\; 0,\; 1)\, t_j \tag{8} $$

where the curve parameters are represented as $\{t_0, t_1, \ldots, t_{L-1}\}$ and $T_L = \sum_{k=0}^{L-1} t_k$. In the tone filter, $\{t_k\}$ are the parameters that require optimization.
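A sketch of the piecewise-linear tone curve of Equation (8); $L = 8$ segments is an assumed choice. When all $t_j$ are equal, the curve reduces to the identity, which serves as a quick sanity check:

```python
import numpy as np

def tone(I, t):
    """I: image in [0, 1]; t: array of L positive tone parameters (Eq. (8))."""
    L = len(t)
    j = np.arange(L).reshape((L,) + (1,) * I.ndim)
    # Each segment contributes clip(L * P - j, 0, 1) * t_j; normalize by sum(t).
    contrib = np.clip(L * I[None, ...] - j, 0.0, 1.0) * t.reshape((L,) + (1,) * I.ndim)
    return contrib.sum(axis=0) / t.sum()

out = tone(np.random.rand(4, 4, 3), np.ones(8))   # identity curve when all t_j equal
```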
3.2.6. Sharpen Filter
A sharpen filter is an image processing tool used to enhance details and edges in an image, making it appear clearer. By emphasizing edges and fine structures, a sharpen filter makes the visual effect more vivid and dynamic. The filter is defined as follows [30]:

$$ F(x, \lambda) = I(x) + \lambda \big( I(x) - \mathrm{Gau}(I(x)) \big) \tag{9} $$

where $I(x)$ represents the input image, $\mathrm{Gau}(I(x))$ represents the Gaussian-filtered image, and $\lambda$ is a positive scaling factor. In the sharpen filter, $\lambda$ is the parameter that requires optimization.
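Equation (9) is classical unsharp masking; a minimal sketch follows, with an assumed Gaussian width, since the paper does not specify one:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sharpen(I, lam, sigma=1.5):
    """I: H x W x 3 image in [0, 1]; lam: learned positive scaling factor."""
    blurred = gaussian_filter(I, sigma=(sigma, sigma, 0))   # Gau(I(x)), per channel
    return np.clip(I + lam * (I - blurred), 0.0, 1.0)       # Eq. (9)
```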
3.3. Detection Network Module
YOLOv3 utilizes a network architecture known as Darknet-53, which comprises 53 convolutional layers. Unlike previous models that use fully connected layers, YOLOv3 employs anchor boxes for predicting bounding boxes. It performs these predictions at three different scales, enabling more accurate detection of objects of varying sizes. Each prediction layer corresponds to a specific scale, enhancing the model’s flexibility and precision. Darknet-53 leverages residual blocks, skip connections, and upsampling to extract more meaningful features from images, thus improving learning and detection accuracy. Due to its outstanding performance in object detection, YOLOv3 is well suited to applications such as image editing, security surveillance, crowd detection, and autonomous driving [31,32]. In the detection network presented in this paper, we employ an identical architecture to detect obstacles in rail transit systems.
4. Experiments
4.1. Data Training
The dataset comprises five annotated object classes: person, bicycle, car, bus, and motorcycle. These classes are sourced from two public datasets, VOC2007 and VOC2012. The foggy images are generated using Equations (1) and (2), where $d(x)$ is defined as follows:

$$ d(x) = -0.04\,\rho + \sqrt{\max(M, N)} \tag{10} $$

where $\rho$ represents the Euclidean distance from the current pixel to the central pixel, and $M$ and $N$ denote the number of rows and columns in the image, respectively. By fixing the atmospheric light $A$ and setting $\beta = 0.05 + 0.01\,i$, where $i$ is an integer from 0 to 9, ten different levels of fog can be applied to each image.
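For reference, the following is a sketch of this fog synthesis under Equations (1), (2), and (10). The value of the atmospheric light is not stated in the text, so $A = 0.5$ here is an assumption:

```python
import numpy as np

def add_fog(J, i, A=0.5):
    """J: H x W x 3 clean image in [0, 1]; i: fog level, integer 0..9; A assumed."""
    beta = 0.05 + 0.01 * i                                  # beta in [0.05, 0.14]
    H, W = J.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    rho = np.sqrt((ys - H / 2) ** 2 + (xs - W / 2) ** 2)    # distance to image center
    d = -0.04 * rho + np.sqrt(max(H, W))                    # Eq. (10)
    t = np.exp(-beta * d)[..., None]                        # Eq. (2)
    return J * t + A * (1.0 - t)                            # Eq. (1)
```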
To accommodate obstacle detection in rail transit under both normal and foggy weather conditions, a hybrid approach is adopted for the training dataset. This method enhances the model’s robustness across various weather conditions and improves its applicability in real-world scenarios. Before each image is fed into the network for training, there is a 2/3 probability that it will be randomly augmented with simulated fog. This augmentation enables the model to learn and adapt to obstacle detection in foggy conditions. By combining normal images with foggy ones and training the entire pipeline end-to-end with the YOLOv3 detection loss, the model learns to detect obstacles in both normal and foggy environments.
Furthermore, to enhance the model’s detection performance, a multi-scale adaptive module has been integrated into the training process. This module is weakly supervised through detection loss, eliminating the need for manually annotated ground truth images and thus reducing the dependency on large-scale labeled datasets. The multi-scale adaptive module can extract and fuse features at different scales, allowing the model to more accurately detect obstacles of various sizes and shapes. This multi-scale feature extraction and fusion method significantly enhances the model’s performance in complex scenarios.
4.2. Implementation Details
The MSA-YOLO model is trained using the Adam optimizer for a total of 80 epochs. The learning rate is scheduled with a warm-up phase followed by cosine annealing, decaying from its initial value to a final value. During training, the MSA-YOLO model predicts bounding boxes at three different scales, with three anchors at each scale, to ensure accurate detection of objects of various sizes. This multi-scale prediction approach enhances the model’s robustness and accuracy in detecting objects of different sizes and shapes.
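A sketch of the warm-up plus cosine annealing schedule is shown below; the initial rate, final rate, warm-up length, and step count are placeholders, not values reproduced from the paper:

```python
import math
import tensorflow as tf

class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, lr_init=1e-4, lr_final=1e-6, warmup_steps=1000, total_steps=80000):
        super().__init__()
        self.lr_init, self.lr_final = lr_init, lr_final
        self.warmup_steps, self.total_steps = warmup_steps, total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warm = self.lr_init * step / self.warmup_steps      # linear warm-up
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        cos = self.lr_final + 0.5 * (self.lr_init - self.lr_final) * (
            1.0 + tf.cos(math.pi * tf.minimum(progress, 1.0)))
        return tf.where(step < self.warmup_steps, warm, cos)

optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupCosine())
```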
Our experiments are conducted using the TensorFlow framework and executed on an NVIDIA RTX3090 Ti GPU (Santa Clara, CA, USA). The computational power of this high-performance GPU significantly accelerates the training process, enabling us to complete numerous training iterations within a reasonable timeframe. This setup ensures that the model can be efficiently trained to achieve high performance in object detection tasks.
4.3. Experimental Results
4.3.1. Defogging Experimental Results
To validate the effectiveness of the defogging network proposed in this paper, a comparative analysis was conducted between the IA network [33] and the multi-scale adaptive (MSA) network introduced in this study. With β values ranging from 0.05 to 0.14, the defogging effects of the IA and MSA networks are illustrated in Figure 2. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are employed as metrics to evaluate the defogging performance of the different algorithms. As shown in the figure, the PSNR values for the fog image, the IA-defogged image, and the MSA-defogged image are 13.4221, 12.713, and 13.472, respectively; the corresponding SSIM values are 0.6594, 0.7398, and 0.7672. From these PSNR and SSIM values, it is evident that MSA achieves the most satisfactory defogging results for railway images, exhibiting the highest PSNR and SSIM. The superior performance of MSA can be attributed to the incorporation of a multi-scale network, which enables the algorithm to adaptively remove dense fog while simultaneously enhancing critical target information within the images through multiple filters.
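Both metrics are available directly in TensorFlow, which the experiments use; a minimal sketch follows, with placeholder file names and images assumed to be float tensors in [0, 1]:

```python
import tensorflow as tf

def load(path):
    img = tf.io.decode_image(tf.io.read_file(path), channels=3, dtype=tf.float32)
    return img[tf.newaxis, ...]                    # add batch dimension

clean, defogged = load("clean.png"), load("defogged.png")   # placeholder paths
psnr = tf.image.psnr(clean, defogged, max_val=1.0)
ssim = tf.image.ssim(clean, defogged, max_val=1.0)
print(float(psnr[0]), float(ssim[0]))
```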
To further validate the reliability of the proposed algorithm, we evaluated PSNR and SSIM on 200 images. The experimental results are presented in Table 1. As shown in the table, the proposed MSA network consistently outperforms the IA algorithm. Specifically, MSA achieves higher PSNR and SSIM values, indicating a significant advantage in preserving image quality and detail. The MSA algorithm is better equipped to handle images with varying fog densities, thereby enhancing the accuracy and robustness of target detection. These results further confirm the potential and reliability of MSA in practical applications, providing an effective solution for defogging railway images.
4.3.2. Object Detection Experimental Results
To validate the effectiveness of the object detection algorithm proposed in this paper, a comparative analysis was conducted between IA-YOLO [33] and MSA-YOLO. Figure 3 illustrates the object detection performance of IA-YOLO and MSA-YOLO under different β values and various foggy conditions. As shown in the figure, when β < 0.1, both IA-YOLO and MSA-YOLO effectively detect cars. However, when β ≥ 0.1, IA-YOLO fails to detect cars in foggy environments, whereas MSA-YOLO continues to perform effectively. This indicates that MSA-YOLO excels at removing dense fog while preserving and enhancing critical obstacle information in the images.
The MSA module improves the network’s ability to detect objects of varying scales by integrating high-resolution detail information with low-resolution global information. High-resolution detail information allows the system to capture fine features of target objects, while low-resolution global information provides a comprehensive view, enabling the system to recognize a wider range of target objects in complex environments. This multi-scale information fusion allows MSA-YOLO to perform exceptionally well in various conditions, particularly in complex, foggy weather. The model can accurately identify and locate obstacles in the images, thereby enhancing the system’s robustness and detection accuracy.
Figure 4 shows the obstacle detection results of MSA-YOLO and IA-YOLO under varying fog densities. In the first column, IA-YOLO misses a bus. In the second column, IA-YOLO detects a section of the road as a bus. In the third column, IA-YOLO identifies a tree trunk as a pedestrian. MSA-YOLO, by contrast, correctly identifies the objects in all three images. These examples demonstrate MSA-YOLO’s strong performance in identifying objects of different sizes, shapes, and orientations in complex scenes, confirming the high sensitivity and accuracy of the algorithm and further validating the practicality of MSA-YOLO in object detection. These visual examples showcase the broad potential and value of MSA-YOLO in real-world applications.
5. Conclusions
We propose the MSA-YOLO algorithm to enhance obstacle detection performance in rail transit under foggy conditions. The input image is processed through a series of filters: defogging, white balance, Gamma, contrast, tone, and sharpening. The MSA module effectively removes dense fog while preserving and enhancing critical obstacle information, facilitating obstacle detection by YOLO. Experimental results demonstrate that the proposed MSA-YOLO algorithm achieves higher detection accuracy in foggy scenarios.
While the MSA-YOLO algorithm has shown promising results, several areas for future work remain. First, the algorithm could be further optimized for real-time performance in practical rail transit applications. Enhancing computational efficiency without compromising detection accuracy will be critical for deployment in real-world scenarios. Second, testing and refining the algorithm on more diverse and extensive datasets, including real-world foggy weather images, will help to improve its generalizability and robustness. Additionally, exploring the integration of the MSA-YOLO algorithm with other advanced defogging techniques and adaptive learning mechanisms could further enhance its performance.