1. Introduction
With the continuous expansion of power networks and the continuous growth of power demand, the safe and stable operation of transmission lines, the “arteries” of power transmission, is crucial to ensuring the reliability and stability of the power supply [1,2]. During operation, transmission lines face a variety of threats from the natural environment and human factors. Among them, foreign objects and mountain fires on transmission lines are two major risk factors that cannot be ignored, posing great challenges to power operation and maintenance and to service life [3,4]. Therefore, the timely and accurate detection of foreign objects and mountain fires on transmission lines has become an important part of power operation and maintenance. Traditional detection methods often rely on manual inspections and simple equipment monitoring, which are not only inefficient but also make it difficult to achieve comprehensive coverage and real-time monitoring of transmission lines [5]. In recent years, with the rapid development of science and technology, deep learning has become widely used in the field of electrical engineering, such as power equipment fault diagnosis, smart grids, and digital circuit vulnerability analysis [6], and it is also widely used in the field of power line safety detection. These technologies can not only realize real-time monitoring and early warning of faults in transmission lines but also improve the accuracy and efficiency of detection, providing strong support for power operation and maintenance [7,8,9].
Therefore, we developed an autonomous inspection Unmanned Aerial Vehicle (UAV) and a cable robot, whose structures are shown in Figure 1. They achieve real-time and accurate detection through deployed visual edge devices, greatly improving the accuracy and efficiency of transmission line safety detection [10,11]. In the future, an intelligent regional transmission line safety monitoring system is expected to be deployed, as shown in Figure 1. The system uses cable robots for static area safety monitoring and UAVs for autonomous dynamic area safety inspection. At the same time, the robots and UAVs use wireless interactive communication to achieve high-precision positioning for transmission line safety detection [12,13]. The system can achieve comprehensive monitoring and provide early warning of faults in transmission lines, can discover and eliminate potential safety hazards in a timely manner, and can ensure the stable operation of the power network. It has important practical significance and broad application prospects [14,15].
With the great achievements of deep learning algorithms in image recognition and detection, the detection of foreign objects and mountain fires on power lines can be performed by deploying deep learning-based object detection algorithms on UAVs and visual robot edge devices, thereby improving the accuracy of power line safety object detection and the timely handling of power line safety faults and effectively realizing the automation and intelligence of the monitoring system [16,17,18]. At present, object detection methods based on deep learning are mainly divided into two categories. The first category comprises two-stage methods, such as the R-CNN series of algorithms based on candidate regions [19,20]. These methods first generate candidate regions and then classify samples through convolutional neural networks (CNNs). Feng Jun used the Faster R-CNN algorithm to build a Siamese network model, using an improved Region Proposal Network (RPN) module to generate high-quality prediction boxes, and performed correlation matching on the ROI features of the support and query images in the detection head; this model alleviates the difficulty and low efficiency of power system inspection [21]. Xue proposed a detection method based on an improved Faster R-CNN model, which improves the feature extraction capability of the model by increasing the network depth, addressing the problem of small object detection on power lines [22]. The second category comprises one-stage methods, such as the YOLO series of algorithms [23,24,25]. These methods directly extract features in the network to predict object classification and location. Yan et al. proposed an improved Single Shot MultiBox Detector (SSD) algorithm to detect small object defects in transmission line inspection images [26]. Xue et al. proposed a transmission line foreign body detection algorithm that combines a window-based self-attention network with YOLOv5; the algorithm uses a large convolution kernel to expand the receptive field of the model, enhances the ability to extract effective information, and improves spatial adaptability [27]. Liu et al. added a squeeze-and-excitation module to the YOLOv5s backbone network to enhance the feature extraction ability of the network, thereby effectively improving the performance of the algorithm [28]. Liu et al. proposed an information aggregation algorithm based on YOLOX-S; the algorithm aggregates spatial information and channel information in the feature map, which enhances relevant features, suppresses irrelevant features, improves the overall learning ability of the network, and improves detection accuracy [29]. Wang et al. proposed a new method for foreign body detection on transmission lines based on YOLOv8n; the method integrates MSDA attention into the YOLOv8n network, which optimizes the feature fusion process and enables the model to effectively capture feature information at different scales [30]. Zhang et al. proposed a mountain fire detection method for the transmission line channel based on an improved DETR, which adds multi-scale feature information in the feature extraction stage and uses dilated convolution to improve the algorithm’s perception of low-level features; meanwhile, the self-attention mechanism in the Transformer module is improved, and finally the optimal mountain fire detection model is established [31]. Yan et al. used EfficientNet networks to replace the main feature extraction network in the original YOLOv4 model; in addition, a grouped convolution module replaces the conventional convolution operation in the feature pyramid structure. The resulting model not only reduces the model parameters but also effectively ensures detection accuracy [32].
The above methods have improved the performance of safety-oriented object detection on transmission lines to a certain extent, but each detects only a single type of object and cannot account for foreign objects and mountain fires at the same time. In addition, when the target lies against a complex background, it is difficult to effectively extract the features of multiple targets, and missed detections and false detections occur easily. Moreover, the models are complex and not conducive to deployment on edge devices. Therefore, in order to improve the effectiveness and accuracy of foreign object and mountain fire detection on transmission lines, enhance the environmental adaptability of the algorithm, reduce the complexity of the model, and facilitate deployment on edge devices, this paper follows an efficiency–accuracy-driven design strategy and proposes an edge real-time transmission line safety hazard detection method (ETLSH-YOLO). The model comprehensively considers the various components of YOLO and designs a lightweight layer aggregation network and a spatial channel decoupled downsampling module, which significantly reduce computational redundancy, improve computational efficiency, and enhance feature expression. To further improve accuracy, a coordinate attention module is added to the multi-scale feature fusion process to enhance the model’s global perception and explore the potential for performance improvement at low cost. To improve the detection performance and convergence speed of the model, the Mish function is used instead of the SiLU function so that the model can capture complex nonlinear relationships in the data. The main contributions of this study are listed as follows:
- (1)
Designing a re-parameterized Ghost efficient layer aggregation network (RepGhostCSPELAN), which enhances the model’s feature extraction and gradient flow capabilities while reducing its complexity, parameters, and floating-point operations.
- (2)
Designing a spatial channel decoupled downsampling block (CSDovn), which reduces computational redundancy, improves the model’s computational efficiency and information retention during downsampling, and obtains stronger feature expression capabilities, thus improving the model’s detection capabilities.
- (3)
Adding coordinate attention to help the model extract the relationship between position information and channel information in the feature map at a lower cost, enhancing the model’s global perception capabilities and improving the model’s detection accuracy.
- (4)
Using the Mish function instead of the SiLU function to capture complex nonlinear relationships in the data, thereby improving the model’s stability, convergence speed, and generalization.
2. Materials and Methods
The YOLOv9s model is mainly composed of five parts: the input, backbone, neck, head, and auxiliary reversible branch. The input stage is mainly used for data augmentation. The backbone is composed of multiple Conv, ADown, and RepNCSPELAN4 modules and is mainly used to extract features from images. The neck adopts the PAN structure to fuse the high-level and low-level features extracted by the backbone, integrating spatial detail and semantic information in the feature maps. The head is mainly composed of multiple detection heads, which predict the object position and category based on the feature information refined by the neck. The auxiliary reversible branch mainly generates reliable gradients so that the main branch can receive more complete and rich information, thereby improving the accuracy of the model.
Because the computing resources of edge devices are limited, an efficiency–accuracy-driven design was applied to each part. The network structure of the ETLSH-YOLO method is shown in Figure 2. To better enable the reversible branch to generate reliable gradient information during training and provide the main branch with gradients for backpropagation, ETLSH-YOLO retains the YOLOv9 reversible branch structure, which extracts reversible branch features from the input feature map rather than from the intermediate features of the backbone. This ensures a complete information flow from the data to the target, so the model can learn a more comprehensive feature representation, which helps to improve the model’s detection accuracy and generalization ability. Because the large complexity of the model is not conducive to deployment on edge devices, the re-parameterized Ghost efficient layer aggregation network (RepGhostCSPELAN) and the spatial channel decoupled downsampling block (CSDovn) are designed. RepGhostCSPELAN applies the GhostNet lightweight structure and uses cheap operations to generate part of the redundant feature maps, reducing the number of calculations and parameters. At the same time, to make up for the performance loss caused by discarding the residual block, RepConv is used on the gradient flow branch to enhance feature extraction and gradient flow. CSDovn separates spatial downsampling from channel adjustment, applying group convolution and pooling operations first and point convolution afterward; this avoids the lack of information interaction between feature map channels that pure spatial downsampling would cause, while helping to reduce the computational cost of the model and improve the inference speed. To improve the detection accuracy of the model, coordinate attention (CA) is added to the feature fusion layer to help the network learn key feature information and enhance the model’s global perception ability. For the activation function, the Mish function is used instead of the SiLU function to capture the complex nonlinear relationships in the data, improve the convergence and stability of the model, and enhance its generalization ability, thereby improving the performance of the object detection model.
2.1. RepGhostCSPELAN
In order to effectively reduce the number of model parameters, improve model efficiency, enhance feature expression capabilities, and optimize inference efficiency so that the model can be better deployed on resource-constrained devices while maintaining high performance, the re-parameterized Ghost efficient layer aggregation network (RepGhostCSPELAN) is designed; its structure is shown in Figure 3.
GhostNBottleneck adopts the Ghost lightweight structure, which generates additional feature maps through group convolution and simple linear transformations. These maps fully reveal the information of the intrinsic features and enrich the feature expression ability of the model, thereby improving its detection performance. This module significantly reduces the computational cost of the model and improves the inference speed, making the model more efficient during inference.
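To make the Ghost idea concrete, the following is a minimal PyTorch-style sketch (the name `GhostConv`, the 1:1 split between primary and cheap channels, and the kernel sizes are illustrative assumptions rather than the exact configuration used in ETLSH-YOLO): an ordinary pointwise convolution produces half of the output channels, and a cheap depthwise convolution generates the remaining half from them.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Minimal Ghost convolution sketch: half of the output channels come from an
    ordinary 1x1 convolution, the other half from a cheap depthwise operation."""
    def __init__(self, in_ch, out_ch, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // 2
        # Primary features: ordinary (pointwise) convolution.
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.Mish(),
        )
        # "Cheap" features: depthwise convolution applied to the primary features.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, primary_ch, kernel_size=cheap_kernel,
                      padding=cheap_kernel // 2, groups=primary_ch, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.Mish(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # concatenate primary + cheap maps

# Example: 64-channel input, 128-channel output, spatial size unchanged.
x = torch.randn(1, 64, 80, 80)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```

Because the cheap branch is depthwise, its cost grows linearly rather than quadratically with the channel count, which is where the savings over a full convolution come from.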
GhostNCSP divides the input feature map into two branches for feature extraction, which helps the model better capture features at different scales in the image and reduces the number of parameters in each branch, thereby reducing the computational complexity of the entire model. Since the GhostNBottleneck branch does not use shortcut connections, re-parameterized convolution is used on the other branch to compensate for the performance loss of abandoning them. Re-parameterized convolution uses a multi-branch structure to increase the gradient feedback paths during training, which helps the model learn richer feature representations, and merges multiple layers (such as Conv + BN) into one convolution operation during inference, thereby reducing the number of calculations and improving inference efficiency. Finally, the different features of the two branches are fused so that the model obtains richer semantic information, improving the accuracy of foreign object and mountain fire detection, as illustrated by the sketch below.
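The re-parameterization step can be illustrated with a small, self-contained sketch (the class `RepConvSketch`, the 3 × 3 + 1 × 1 branch choice, and the omission of an activation are illustrative simplifications, not the paper's implementation): two parallel Conv + BN branches provide extra gradient paths during training and are folded into a single 3 × 3 convolution for inference.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv, bn):
    """Fold a BatchNorm layer into the preceding bias-free convolution (inference only)."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                       # per-output-channel scale
    w = conv.weight * scale.reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * scale
    return w, b

class RepConvSketch(nn.Module):
    """Training: parallel 3x3 and 1x1 Conv+BN branches (richer gradient paths).
    Inference: both branches merged into a single 3x3 convolution."""
    def __init__(self, ch):
        super().__init__()
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(ch)
        self.conv1 = nn.Conv2d(ch, ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.fused = None

    def forward(self, x):
        if self.fused is not None:                # single conv after re-parameterization
            return self.fused(x)
        return self.bn3(self.conv3(x)) + self.bn1(self.conv1(x))

    @torch.no_grad()
    def reparameterize(self):
        w3, b3 = fuse_conv_bn(self.conv3, self.bn3)
        w1, b1 = fuse_conv_bn(self.conv1, self.bn1)
        w1 = nn.functional.pad(w1, [1, 1, 1, 1])  # place the 1x1 kernel at the 3x3 center
        self.fused = nn.Conv2d(w3.shape[1], w3.shape[0], 3, padding=1)
        self.fused.weight.copy_(w3 + w1)
        self.fused.bias.copy_(b3 + b1)

# Sanity check: outputs match before and after merging (in eval mode).
m = RepConvSketch(16).eval()
x = torch.randn(1, 16, 40, 40)
y_before = m(x)
m.reparameterize()
assert torch.allclose(y_before, m(x), atol=1e-5)
```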
RepGhostCSPELAN fuses features from different levels of convolution and GhostNCSP, which can capture more contextual information and detailed features, improve the feature expression ability of the model, enable it to better capture target information of different scales, and make the model more robust, thereby improving the detection accuracy of safety-based object detection on transmission lines.
2.2. CSDovn
In order to further reduce the number of parameters and the computational complexity of the model while improving the diversity and expressiveness of the extracted features, the spatial channel decoupled downsampling block (CSDovn) is designed; its structure is shown in Figure 4.
Spatial channel decoupling allows the model to extract features independently in the spatial and channel dimensions, which means that the model can process the features of each channel or each spatial position in a more refined way; this helps the model more accurately capture the location information and semantic information of targets in transmission line images. The module first uses two branches to reduce the spatial size. One branch independently extracts features from different groups through group convolution, which reduces the spatial size of the feature map while reducing the number of calculations and parameters and helps the model learn a variety of feature representations. The other branch reduces the spatial size of the feature map through max pooling, which requires no additional computational overhead; moreover, max pooling has a certain robustness to small translations of the input features, which helps the model extract more stable features. Then, the feature maps produced by group convolution and max pooling are added together to achieve feature fusion, which allows the extracted features to complement each other, forming a more comprehensive feature representation and enhancing the expressiveness of the features. Finally, the number of channels of the feature map is adjusted through point convolution, which further fuses features from different channels and realizes feature interaction between channels, thereby improving the accuracy of transmission line safety detection. Compared with the downsampling in YOLOv10 [33], the proposed spatial channel decoupling module significantly improves the interaction between the feature information of different channels and enhances the expressiveness of the features.
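A minimal sketch of such a spatial–channel decoupled downsampling block is given below (the name `CSDovnSketch`, the depthwise grouping, the kernel sizes, and the Mish activation are illustrative assumptions, not the exact settings of CSDovn): both branches halve the spatial resolution, their outputs are added, and a pointwise convolution then mixes and adjusts the channels.

```python
import torch
import torch.nn as nn

class CSDovnSketch(nn.Module):
    """Illustrative spatial-channel decoupled downsampling block:
    spatial size is halved first (grouped conv + max pooling in parallel),
    then a pointwise convolution adjusts and mixes the channels."""
    def __init__(self, in_ch, out_ch, groups=None):
        super().__init__()
        groups = groups or in_ch  # depthwise by default (an assumption, not the paper's setting)
        # Branch 1: grouped convolution with stride 2 (spatial downsampling).
        self.gconv = nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1,
                               groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(in_ch)
        # Branch 2: parameter-free max pooling (spatial downsampling).
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # Channel adjustment: pointwise conv fuses information across channels.
        self.pw = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Mish(),
        )

    def forward(self, x):
        y = self.bn(self.gconv(x)) + self.pool(x)  # fuse the two downsampled maps
        return self.pw(y)                          # channel interaction via 1x1 conv

# Example: 80x80 feature map with 64 channels -> 40x40 with 128 channels.
x = torch.randn(1, 64, 80, 80)
print(CSDovnSketch(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```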
When a feature map of size $C \times H \times W$ is downsampled to $2C \times \frac{H}{2} \times \frac{W}{2}$, the number of parameters and calculations required by the CSDovn block is as follows:
$$Params_{CSDovn} = k^{2}C + 2C^{2},$$
$$FLOPs_{CSDovn} = \left(k^{2}C + 2C^{2}\right) \times \frac{H}{2} \times \frac{W}{2},$$
where $k$ represents the size of the convolution kernel, $C$ represents the number of channels of the feature map, and $H$ and $W$ represent the height and width of the feature map.
The number of parameters and calculations required by ordinary convolution is as follows:
$$Params_{Conv} = 2k^{2}C^{2},$$
$$FLOPs_{Conv} = 2k^{2}C^{2} \times \frac{H}{2} \times \frac{W}{2}.$$
In terms of parameter quantity and computational complexity, the CSDovn block has fewer parameters and lower computational complexity, which gives the model a faster inference speed to meet the real-time detection requirements of transmission lines, better generalization for safety object detection on transmission lines in complex environments, and lower computing resource consumption, making it easy to deploy and integrate on visual edge devices.
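As a rough numerical illustration of this comparison, the parameters of the hypothetical `CSDovnSketch` block defined above can be counted directly in PyTorch and set against an ordinary strided convolution performing the same 64 → 128 channel, stride-2 downsampling; the figures below refer only to this illustrative configuration, not to measurements from the paper.

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Ordinary 3x3, stride-2 convolution performing the same 64 -> 128 downsampling.
plain = nn.Sequential(
    nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
)

print(n_params(CSDovnSketch(64, 128)))  # about 9.2k parameters (sketch defined above)
print(n_params(plain))                  # about 74.0k parameters
```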
2.3. Coordinate Attention
Attention mechanisms are widely used in computer vision tasks. Common attention modules include squeeze-and-excitation (SE), the convolutional block attention module (CBAM), and efficient channel attention (ECA). The SE module considers only the information of different channels and ignores location information. Building on the SE module, CBAM uses convolution to obtain position information, but convolution can focus only on local information and lacks the ability to capture long-range information. ECA introduces one-dimensional convolution to capture the correlation between channels and reduce the amount of computation, but it does not directly take spatial information into account. The coordinate attention (CA) module overcomes the above problems; it extracts the relationship between the position information and the channel information in the feature map in an efficient way, obtains feature maps with direction-aware and position-aware information, and applies them to the input feature map in a complementary manner. CA enhances the feature representation of the target, allowing the model to focus more effectively on the information of a specific location or region rather than relying solely on global or local feature representations, which helps to accurately locate and identify targets and improves the detection accuracy of the model. Its structure is shown in Figure 5.
The calculation process of CA is divided into two steps: coordinate information embedding and coordinate attention generation. The overall process is formulated as follows:
$$z_{c}^{h}(h) = \frac{1}{W}\sum_{0 \le i < W} x_{c}(h, i), \qquad z_{c}^{w}(w) = \frac{1}{H}\sum_{0 \le j < H} x_{c}(j, w),$$
$$f = \delta\left(F_{1}\left(\left[z^{h}, z^{w}\right]\right)\right),$$
$$g^{h} = \sigma\left(F_{h}\left(f^{h}\right)\right), \qquad g^{w} = \sigma\left(F_{w}\left(f^{w}\right)\right),$$
$$y_{c}(i, j) = x_{c}(i, j) \times g_{c}^{h}(i) \times g_{c}^{w}(j).$$
Among them, $H$ and $W$ represent the height and width of the input feature map; $(H, 1)$ and $(1, W)$ represent the sizes of the pooling kernels; $z_{c}^{h}(h)$ and $z_{c}^{w}(w)$ represent the horizontal feature vector at height $h$ and the vertical feature vector at width $w$ of the feature map of channel $c$; $z^{h}$ and $z^{w}$ represent the horizontal and vertical coordinate information vectors over all channels; $F_{1}$, $F_{h}$, and $F_{w}$ represent convolution operations; $\delta$ and $\sigma$ represent different activation functions; $f$ represents the intermediate feature encoding the spatial information; $f^{h}$ and $f^{w}$ represent the two independent vectors obtained by splitting $f$ along the spatial dimension; and $g^{h}$ and $g^{w}$ represent the attention vectors.
Specifically, when embedding the coordinate information, each channel is pooled with kernels of size $(H, 1)$ and $(1, W)$ to encode single-dimensional features along the height and width directions, obtaining the coordinate information vectors $z^{h}$ and $z^{w}$ in the horizontal and vertical directions. The two vectors are then concatenated, and a 1 × 1 convolution, a BatchNorm (BN) layer, and a nonlinear activation layer are applied. The intermediate feature $f$ is then split into two independent feature tensors, $f^{h}$ and $f^{w}$, whose dimension is adjusted from $C/r$ channels (where $r$ is the reduction ratio) back to $C$ channels by 1 × 1 convolutions. The attention weights in the horizontal and vertical directions are then obtained with the Sigmoid activation function. Finally, the weights of the two directions are multiplied with the feature map at the corresponding coordinate positions to obtain the final output feature map, which is endowed with channel, position, and direction attention information.
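The two-step computation described above corresponds closely to the published coordinate attention design; a compact PyTorch sketch is shown below (the reduction ratio of 32 and the use of Mish in place of the hard-swish nonlinearity of the original CA paper are illustrative choices).

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention sketch: pool along H and W separately, encode jointly,
    split, and produce per-direction attention maps."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1): average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W): average over height
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Mish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Coordinate information embedding: 1D pooling along each direction.
        x_h = self.pool_h(x)                          # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        # Joint encoding, then split back into the two directions.
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Coordinate attention generation: sigmoid weights per direction.
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w

x = torch.randn(1, 128, 40, 40)
print(CoordAtt(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```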
2.4. Activation Function
In the safety-oriented object detection task on transmission lines, selecting a suitable activation function has an important impact on the performance and training of the model. Therefore, the Mish function is used to replace the SiLU function. Compared with the SiLU function, the Mish function has more pronounced nonlinear characteristics, which can improve the convergence and stability of the model and enhance its generalization ability, thereby improving the performance of the object detection model and better coping with safety-oriented object detection on transmission lines in complex environments.
The SiLU function and the Mish function are calculated as shown in the formulas below:
$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}},$$
$$\mathrm{Mish}(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right).$$
The Mish function uses the hyperbolic tangent ($\tanh$) and logarithmic ($\ln$) functions, whereas the SiLU function tends toward linearity in some regions. By contrast, the Mish function exhibits pronounced nonlinearity over the entire domain and can better capture the complex nonlinear relationships in the data, which is particularly important for identifying and locating targets with complex features in transmission line object detection and improves the model’s expressiveness and prediction performance.
The derivatives of the SiLU and Mish functions are shown in the formulas below:
$$\mathrm{SiLU}'(x) = \sigma(x)\left(1 + x\left(1 - \sigma(x)\right)\right),$$
$$\mathrm{Mish}'(x) = \tanh\left(\ln\left(1 + e^{x}\right)\right) + x \cdot \mathrm{sech}^{2}\left(\ln\left(1 + e^{x}\right)\right) \cdot \sigma(x).$$
It can be seen from the formulas that when $x$ is very large or very small, $\mathrm{SiLU}'(x)$ may change rapidly due to the rapid growth or decay of the exponential function. Because of the nature of the $\tanh$ and $\ln$ functions, $\mathrm{Mish}'(x)$ is relatively smooth throughout the domain of $x$; in particular, when $x$ is very large or very small, $\mathrm{Mish}'(x)$ does not change as rapidly as $\mathrm{SiLU}'(x)$. Therefore, the Mish function is analytically smoother than the SiLU function. It can also be seen from Figure 6 that the Mish function is smoother than the SiLU function, allowing better information to penetrate the neural network for better accuracy and generalization. At the same time, the Mish function has a smoother gradient, which more effectively alleviates the problem of gradient vanishing or explosion compared with the SiLU function. Mish can make it easier for the model to reach a good local optimal solution, thereby improving the training stability and convergence speed of the model.
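A short PyTorch check of the two activations discussed in this subsection (the Conv–BN–activation block at the end is an illustrative sketch): it verifies the Mish formula against the built-in `nn.Mish`, prints the range of each derivative on an interval, and shows that switching from SiLU to Mish is a one-line change in a typical convolution block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Mish(x) = x * tanh(softplus(x));  SiLU(x) = x * sigmoid(x).
x = torch.linspace(-6.0, 6.0, 2001, requires_grad=True)

mish_manual = x * torch.tanh(F.softplus(x))
assert torch.allclose(mish_manual, nn.Mish()(x), atol=1e-5)  # matches the built-in module

# Print the range of each activation's derivative on [-6, 6].
for name, act in [("SiLU", nn.SiLU()), ("Mish", nn.Mish())]:
    (grad,) = torch.autograd.grad(act(x).sum(), x)
    print(f"{name}: derivative in [{grad.min():.3f}, {grad.max():.3f}]")

# Swapping the activation in a Conv-BN-activation block is a one-line change.
block = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.Mish(),  # nn.SiLU() in the baseline
)
```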
4. Conclusions
In order to realize the intelligent, safety-oriented detection of hidden danger targets in transmission lines and construct an intelligent, regional transmission line safety monitoring system, this paper proposes an edge real-time transmission line safety hazard detection method (ETLSH-YOLO). Among its components, the RepGhostCSPELAN module significantly reduces the number of model parameters and floating-point calculations and effectively integrates the information of different feature layers to improve the feature expression ability of the model. The CSDovn module decouples the spatial and channel parts of the convolutional downsampling operation, which not only effectively reduces the complexity of the model but also processes the channel and spatial dimensions of the features in a more refined way, thereby improving the model’s feature capture ability. The CA module effectively improves the detection accuracy of hidden danger targets against complex backgrounds in transmission lines and enhances the model’s attention to key areas. The Mish activation function better captures the complex nonlinear relationships in the data, improving model convergence, stability, and detection performance. On the transmission line safety hazard data set, ETLSH-YOLO significantly improved detection accuracy and significantly reduced model complexity compared with the baseline model, and it achieved higher accuracy and better adaptability to complex environments than other object detection models. In the future, we will continue to optimize the model structure and search for more efficient and accurate structures that meet the needs of edge deployment through continued research and experiments. We also aim to strengthen cross-domain adaptability, training and optimizing the model for different application scenarios and weather conditions so as to improve its generalization ability in various environments, promote the application of the model in actual scenarios, and build an intelligent, high-precision protective barrier for power grid security.