1. Introduction
The transmission tower is one of the essential components of the power transmission system [1]. It is vital for carrying transmission lines, regulating line tension, and safeguarding power safety. The structure of transmission towers varies with their designs and intended purposes [2], but it typically comprises several vital components, including the tower body, cross arms, insulator strings, lightning lines, and additional auxiliary equipment [3,4,5]. The transmission tower is a complex system: its structures complement each other to jointly perform the power transmission function, and some parts are easily affected by internal and external environmental impacts and aging damage, causing power accidents and resulting economic losses [6]. Therefore, it is necessary to inspect transmission towers regularly to protect the normal operation of the power transmission system [7].
Manual inspection requires workers to perform visual checks with telescopes or to climb the towers; it suffers from high labor intensity, low work efficiency, and strong influence of human factors, and the inspection status cannot be recorded in real time [8]. Therefore, inspection robots have gradually emerged to replace manual inspection efficiently. Ground inspection robots are generally suitable for inspection tasks in simple power scenarios such as small substations, while flying robots (e.g., UAVs) and hybrid robots are more suitable for complex power scenarios such as large-area substations and transmission corridors [9,10].
With the increasing maturity of UAV technology, there is a growing trend among electric power companies and maintenance firms to adopt novel technologies, including UAVs, to enhance the efficiency and precision of inspections [11,12,13]. UAV electricity inspection is generally divided into transmission corridor inspection and transmission tower component inspection; this paper focuses on methods for the latter. Commonly used transmission tower UAV inspection technology depends heavily on route planning prepared in advance, in which the precise 3D coordinates of the key tower components must be identified and collected point by point from transmission tower point cloud data and then manually input to produce the UAV waypoints and camera shooting events [13,14]. After the flight path has been planned, the multi-rotor inspection UAV follows the planned route to complete autonomous flights and photograph the components at the planned waypoints [15], normally 3–5 m away from the target on the tower. This route planning work is a key but cumbersome and labor-intensive process, and it increases UAV safety risk due to human oversight and error. Therefore, the automatic identification and localization of inspection target point coordinates is a means of improving the efficiency of route planning.
To automatically extract the UAV inspection target points, the corresponding locations need to be found first, and segmentation is then accomplished using a deep learning point cloud segmentation algorithm trained on a specific dataset. In the field of deep learning, there are various public datasets adapted to different needs, among which the ShapeNet dataset [16] is widely used for testing part segmentation [17,18,19]. For segmentation tasks on point cloud data, there are three main types of approaches: projected image-based segmentation, voxel-based segmentation, and direct point-based segmentation [20,21]. Both projected image-based and voxel-based segmentation essentially convert the point cloud data into other forms first. The core idea of the former is to use a 2D CNN [22] to extract features from 2D projections of the 3D data and then fuse these features for label prediction. In the latter, the point cloud is first converted into voxels, analogous to pixels in a 2D image, and then processed with a 3D convolutional network. SqueezeSeg, a real-time 3D point cloud segmentation network proposed by Wu et al. [23] inspired by SqueezeNet [24], applies spherical projection to the 3D point cloud for feature extraction and segmentation and then uses a conditional random field (CRF) [25] as a recurrent layer to refine the results. Similar to 2D semantic segmentation networks, Liu et al. introduced 3DCNN-DQN-RNN [26], which fuses a 3D CNN and a DQN to control an eye window for fast localization and segmentation and then further refines the parsing accuracy of the point cloud with the 3D CNN and a residual RNN. Since point cloud data are irregularly distributed in three-dimensional space, the applicability of traditional two-dimensional and three-dimensional convolutional neural networks is limited; current point-based segmentation methods mainly fall into three categories, based on multilayer perceptrons [27], point convolution, and graph convolution. PointNet [28] pioneered the direct processing of point clouds; the original authors subsequently improved and upgraded it and released PointNet++ [29], which uses a shared MLP to extract features and completes point cloud segmentation after capturing and learning global and local features, providing the basic framework for many subsequent methods. The Point Transformer models [30,31,32,33] are built on the transformer architecture [34], which has been used in a large number of natural language processing and image processing tasks and has achieved essential breakthroughs. They apply the attention mechanism to 3D point cloud processing: each point in the point cloud is regarded as an element in a sequence and encoded to capture inter-point relationships and neighborhood information, and the result is then decoded, with this mechanism also achieving excellent results. The DGCNN model [35] combines the ideas of graph convolutional networks (GCNs) and dynamic graph construction: its EdgeConv module incorporates the relationships between points into point cloud processing, recomputes the neighborhood of the sampled points in the feature space of each layer, and updates the graph model, which allows information to propagate better among similar structures and accelerates the learning of local semantic information.
Despite the commendable performance of current deep learning models on public datasets, they encounter certain challenges in the practical application of power inspection. For example, the quality of the training samples is uneven and their number is small [36]; the number of points in different point clouds varies considerably; and point clouds carrying real geographic information have widely differing coordinates and variable orientations. Hu et al. [37] combined PointNet++ [29] with a self-attention mechanism [38] and utilized positional encoding to somewhat alleviate the challenges posed by sparse point clouds and the disproportionately small proportion of insulator string points. Huang et al. [39] improved the PointNet++ [29] model by adjusting the receptive field and extracting point cloud features using kernel point convolution [40], which slightly improved segmentation accuracy compared to the classical PointNet++ model [29].
Considering the advantages and shortcomings of the existing methods, and in order to better solve the above problems, this paper proposes a new point cloud transmission tower segmentation (PCTTS) model for the automatic segmentation of transmission towers. The method applies specific preprocessing to the original point cloud data to make the point distribution more reasonable and uniform, while unit sphere normalization ensures translation invariance. Fusing rotation-invariant features with multi-scale feature extraction and the offset-attention mechanism solves the problem of the variable orientation of transmission towers while improving the feature extraction capability of the model, ensuring rotation invariance. Finally, a dedicated target point localization algorithm extracts the location coordinates of the UAV inspection target points from the segmented results to help improve the efficiency of UAV route planning. The method has the advantages of low cost, high efficiency, and strong generalizability. It can greatly reduce manual, repetitive, and cumbersome operation steps, effectively shorten the route planning time in UAV electric power inspection, enhance the operational efficiency of the staff, and has high practical application value.
3. Methodology
In the automatic refined UAV inspection of transmission towers, one important task is to locate the camera shooting target points and then plan the UAV flight route based on this information. At present, route planning relies heavily on human visual interaction to find the target points in the transmission tower point cloud data. The target points are mainly distributed on the tower head, the insulator strings, and each structural connection. Generally, there is one target point at each end of the tower head, and each insulator string corresponds to 2–3 target points depending on its length.
In this paper, an automatic target point localization method is proposed, which is carried out in three steps: first, data preprocessing; second, segmentation of each structure of the transmission tower based on the PCTTS; and third, localization of the target points for the UAV's automatic inspection. The technical flow of the method is shown in Figure 4.
The first step is to carry out basic processing of the transmission channel point cloud data, such as removing redundant information and standardizing the format, and, according to specific practical needs, completing denoising, cropping, normal vector computation, and other operations. Subsequently, the preprocessed data are input into the PCTTS network to acquire segmentation results. We have carried out experiments with two different segmentation modes, part segmentation and instance segmentation, which behave differently in different situations. At this point, we have obtained individual point cloud data for each structure; in the case of obvious errors, post-processing corrections or even manual intervention are required. Finally, center point fitting is carried out on the point cloud data of each component individually to obtain the center point coordinates, which are then processed to obtain the final UAV inspection target point coordinates; the result is verified by superimposing the obtained target points on the corresponding transmission tower point cloud data.
3.1. General Architecture of the PCTTS Network
As shown in Figure 5, the overall architecture of our proposed PCTTS network is redesigned based on the structures of PointNet [28] and PointNet++ [29]. It mainly consists of four parts, namely the data preprocessing module, the feature extraction module, the attention mechanism module, and the feature propagation module, which complement each other and ultimately realize the effective segmentation of transmission tower point cloud data.
The data preprocessing module performs functional transformations before the point cloud data are input into the neural network, such as down-sampling, rotational transformation, and normalization. Down-sampling not only simplifies the data according to practical needs, thinning hundreds of thousands or even millions of points down to a few tens of thousands or even a few thousand points, but also plays an auxiliary role in the subsequent extraction of local features. Rotational transformation serves as data augmentation during training and as a check of the generalizability of the model during testing. Normalization addresses the problem that the point clouds of different transmission towers have very different coordinates because of their different actual locations, or that the coordinates of each structure differ greatly because the same type of transmission tower is deployed in different scenarios at different actual sizes. The feature extraction module is one of the most important parts of the whole network; it abstracts the transmission tower point cloud data input into the network and extracts the global and local features to be used for learning, helping the network understand and analyze the point cloud data better. The attention mechanism module improves the performance of the model by selecting and utilizing information more efficiently: relatively more important information receives higher weights, less important information receives lower weights, and the weight allocation is continuously adjusted so that more helpful information is selected in different situations. The feature propagation module performs feature fusion and feature transfer; it is essentially an up-sampling operation that combines the global information of the higher level with the local information of the lower level, transfers the feature information upward across the hierarchy, updates the obtained features step by step into the feature vectors of each point, and completes the per-point classification.
In summary, the workflow of our proposed PCTTS network can be summarized as data preprocessing → feature extraction → feature propagation → point cloud segmentation; how to better adapt to transmission tower point cloud data and learn more helpful feature information from it, so as to make the segmentation results more accurate, is the key focus of this research.
3.2. Data Preprocessing Module
In addition to basic preprocessing operations such as cropping and denoising, we also need to down-sample the data to reduce its size and minimize storage and computation needs.
Due to specification differences between the LiDAR devices used to collect the point clouds and differences in collection methods, the point density and data size of a tower may vary drastically. For example, some towers may have millions of points, while others may have only a few tens of thousands, which significantly influences neural network training and performance. Therefore, we chose the octree [41] sampling method to first thin the data of the various transmission towers to a certain extent. After this processing step, almost all transmission tower point clouds are reduced to roughly 10,000 points. The effect of octree sampling is shown schematically in Figure 6: the method organizes 3D space into a hierarchical structure of octree nodes, recursively dividing each region into eight sub-regions until a stopping condition is reached. Compared with random sampling, octree down-sampling retains the shape characteristics of the transmission towers well; like isometric sampling, it makes the sampled points approximately uniformly distributed, avoiding the influence of the differing point densities that arise at different locations because of the different materials of each part during data acquisition; and it maintains high sampling quality at a computational complexity much lower than that of farthest point sampling.
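To make the procedure concrete, the following is a minimal numpy sketch of octree down-sampling under the assumptions stated in the comments; the stopping parameters min_cell and max_depth are illustrative, not the paper's settings, and a production pipeline would more likely rely on a point cloud library's implementation.

```python
import numpy as np

def octree_downsample(points, min_cell=0.5, max_depth=8):
    """Recursively split the bounding box into 8 octants and keep one
    representative point (the one nearest the cell centroid) per leaf."""
    kept = []

    def recurse(pts, lo, hi, depth):
        if len(pts) == 0:
            return
        # Stop splitting when the cell is small enough or depth is exhausted.
        if len(pts) == 1 or depth == max_depth or np.max(hi - lo) <= min_cell:
            centroid = pts.mean(axis=0)
            kept.append(pts[np.argmin(np.linalg.norm(pts - centroid, axis=1))])
            return
        mid = (lo + hi) / 2.0
        # Octant index 0..7 from the three per-axis comparisons.
        octant = ((pts > mid) * np.array([1, 2, 4])).sum(axis=1)
        for i in range(8):
            child_lo = np.where([i & 1, i & 2, i & 4], mid, lo)
            child_hi = np.where([i & 1, i & 2, i & 4], hi, mid)
            recurse(pts[octant == i], child_lo, child_hi, depth + 1)

    recurse(points, points.min(axis=0), points.max(axis=0), 0)
    return np.asarray(kept)
```

Tightening min_cell (or raising max_depth) yields more retained points; the values would be tuned so that a tower thins to roughly the 10,000-point budget described above.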
After down-sampling, two random rotational transformations are applied to the transmission tower point cloud data, i.e., a random rotation around the Z-axis and a random SO(3) rotation in three-dimensional space, computed with the rotation matrices as follows:

$$R_x(\alpha)=\begin{bmatrix}1 & 0 & 0\\ 0 & \cos\alpha & -\sin\alpha\\ 0 & \sin\alpha & \cos\alpha\end{bmatrix},\quad R_y(\beta)=\begin{bmatrix}\cos\beta & 0 & \sin\beta\\ 0 & 1 & 0\\ -\sin\beta & 0 & \cos\beta\end{bmatrix},\quad R_z(\gamma)=\begin{bmatrix}\cos\gamma & -\sin\gamma & 0\\ \sin\gamma & \cos\gamma & 0\\ 0 & 0 & 1\end{bmatrix}$$

$$R = R_z(\gamma)\,R_y(\beta)\,R_x(\alpha)$$

where $R$ denotes the rotation matrix, $\alpha$ denotes the angle of rotation around the X-axis, $\beta$ denotes the angle of rotation around the Y-axis, and $\gamma$ denotes the angle of rotation around the Z-axis.
The former simulates the different orientations that transmission towers have in realistic scenarios, serving as more realistic data augmentation and improving the generalizability of the model; the latter is used only for testing model generalizability and is not used in practical application.
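As a hedged illustration, the sketch below builds the two random rotations in numpy. Composing random Euler angles as R = Rz·Ry·Rx is one common construction; it is not exactly uniform over SO(3), which is adequate here since the rotation serves as augmentation and a stress test rather than a statistical sampler.

```python
import numpy as np

def rot_x(a):
    return np.array([[1, 0, 0],
                     [0, np.cos(a), -np.sin(a)],
                     [0, np.sin(a),  np.cos(a)]])

def rot_y(b):
    return np.array([[ np.cos(b), 0, np.sin(b)],
                     [ 0, 1, 0],
                     [-np.sin(b), 0, np.cos(b)]])

def rot_z(g):
    return np.array([[np.cos(g), -np.sin(g), 0],
                     [np.sin(g),  np.cos(g), 0],
                     [0, 0, 1]])

rng = np.random.default_rng(0)

def augment_z(points):
    """Training augmentation: random yaw about the Z-axis only."""
    return points @ rot_z(rng.uniform(0, 2 * np.pi)).T

def augment_so3(points):
    """Generalizability test: full 3D rotation from random Euler angles."""
    a, b, g = rng.uniform(0, 2 * np.pi, size=3)
    return points @ (rot_z(g) @ rot_y(b) @ rot_x(a)).T
```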
Finally, a unit sphere normalization operation is applied to the transmission tower point cloud data. In this way, point clouds of towers located in different places, of the same type but different sizes, or of different tower types are uniformly translated and scaled into a unit sphere centered at the origin (0, 0, 0) with a radius of 1, which unifies and standardizes the point cloud data of each transmission tower. The data preprocessing module is shown schematically in Figure 7.
Translation and Rotation Invariance
For transmission tower point cloud data, the coordinates of each point correspond to a real position in the real world. If segmentation is performed directly on the original point set, the locational and geometric differences between two towers will be significant, since they stand in different positions and may be oriented in different directions even when their tower types are identical. To solve this problem, we need to ensure the translational and rotational invariance of the point cloud data so as to minimize the influence of coordinate differences on the trained model and realize effective segmentation of the various components of the transmission towers.
We adopt the currently commonly used unit sphere normalization method to normalize the data, which can be formulated as the following equations:

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i,\qquad \bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i,\qquad \bar{z}=\frac{1}{n}\sum_{i=1}^{n}z_i$$

$$m=\max_{1\le i\le n}\left\|(x_i,y_i,z_i)-(\bar{x},\bar{y},\bar{z})\right\|_2$$

$$(x_i',y_i',z_i')=\frac{(x_i,y_i,z_i)-(\bar{x},\bar{y},\bar{z})}{m}$$

where $(\bar{x},\bar{y},\bar{z})$ denotes the original x, y, and z coordinates of the center of mass point, $n$ denotes the total number of points in the point cloud, and $m$ denotes the scaling factor required for the sphere normalization operation.
The point cloud data of each transmission tower are processed in turn until all the transmission towers are normalized into the unit sphere. After the final segmentation is completed, it is necessary to carry out the inverse normalization according to the unique coordinate offset parameter and scaling parameter of each transmission tower point cloud data to restore the coordinates to the real position in the real world.
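A minimal numpy sketch of this normalization and its inverse, assuming the per-tower offset (the centroid) and the scaling factor m are stored alongside each cloud so that segmented components can be restored to real-world coordinates:

```python
import numpy as np

def normalize_unit_sphere(points):
    """Translate the centroid to the origin and scale so the farthest
    point lies on the unit sphere. Returns the parameters to undo it."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    m = np.linalg.norm(centered, axis=1).max()
    return centered / m, centroid, m

def denormalize(points, centroid, m):
    """Inverse normalization: restore real-world coordinates after segmentation."""
    return points * m + centroid
```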
In order to ensure the rotational invariance of the transmission tower point cloud data, one solution is to unify all point clouds into a standard orientation [42]; another is to weaken the effect of rotational transformations through data augmentation. However, both methods are problematic in practice. In the former approach, it is difficult to define a rotation matrix that completely unifies the orientation of every transmission tower, while in the latter, a finite number of random rotations can only mitigate the effect rather than truly solve the problem, since rotation in three dimensions is continuous and unbounded. Instead of relying on such preprocessing, the approach we take is to extract rotation-invariant features from the point cloud to help the model learn the intrinsic rotation-invariant characteristics of the transmission towers [43]. The operations ensuring translation and rotation invariance are shown schematically in Figure 8.
3.3. Multi-Scale Feature Extraction and Fusion Module
This module is the core of the whole network, aiming to extract features from the point cloud data for learning. Its overall structure borrows the multi-scale grouping (MSG) design of PointNet++ [29], applying grouping layers at different scales, with multiple radii, to extract features at different resolutions and concatenating them to form multi-scale features.
The preprocessing module does not standardize the number of points in each transmission tower point cloud; consequently, random sampling is used to reduce each point cloud file to 4096 points before it is fed into the network. The network then applies farthest point sampling twice to choose subsets of the points. The first pass retains 1024 points, which serve as grouping centers dividing the entire transmission tower point cloud into 1024 local point sets; each point set contains its center point and a certain number of neighboring points within a given radius of that center. The same operation is then applied again, preserving 256 local centers and forming 256 local point sets. Note that the benefit of the octree pre-sampling also shows in this step: for point clouds of the same shape but different densities, an excessively high local point density risks missing the more critical structural features, as shown in Figure 9.
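For reference, below is a compact numpy sketch of the two standard operations this paragraph relies on, farthest point sampling and fixed-radius grouping (ball query); the radius and max_neighbors values are illustrative, not the paper's settings.

```python
import numpy as np

def farthest_point_sampling(points, k, rng=np.random.default_rng()):
    """Iteratively pick the point farthest from the already-chosen set."""
    chosen = np.empty(k, dtype=int)
    chosen[0] = rng.integers(len(points))
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, k):
        chosen[i] = np.argmax(dist)  # farthest from all chosen so far
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[i]], axis=1))
    return chosen

def ball_query(points, centers, radius=0.1, max_neighbors=32):
    """Group up to max_neighbors points within `radius` of each center."""
    groups = []
    for c in centers:
        idx = np.where(np.linalg.norm(points - c, axis=1) <= radius)[0]
        groups.append(idx[:max_neighbors])
    return groups
```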
After determining the grouping of the local point sets, the center of mass of each local point set is calculated, a local coordinate system is constructed with the center of mass as the origin (0, 0, 0), and a multilayer perceptron (MLP) is used to extract features from each local point set. In addition to encoding the position and normal vector of each point, we also compute the interrelationships between the center of mass point, the center point, and each neighboring point in the local point set, extracting Euclidean distance and cosine similarity features, which are fused with the basic point information and encoded together as the feature of the point set:

$$A = F(A_1, A_2), \qquad A_2 = F\big(d,\ \cos\theta\big)$$

where $F$ denotes a feature encoding function, $A_1$ denotes the features obtained by encoding the base information of the points, $A_2$ denotes the features obtained by encoding the interrelationships between the points, $d$ denotes the Euclidean distance, and $\cos\theta$ denotes the cosine similarity. The local feature extraction is shown schematically in Figure 10, where the red point C indicates the center point of the local point set obtained by farthest point sampling, the blue point O indicates the computed center of mass of the local point set, and the green point P indicates any neighboring point in the local point set.
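The exact relational terms are specified only at the level of Figure 10, so the numpy sketch below is an illustrative combination of the named quantities (distances and a cosine similarity among the center of mass O, the center C, and the neighbors P), not the paper's exact feature set.

```python
import numpy as np

def relational_features(neighbors, center):
    """Per-neighbor relations for one local point set: |OP|, |CP|, |OC|,
    and the cosine of the angle between OP and OC (O = centroid, C = FPS center)."""
    O = neighbors.mean(axis=0)            # center of mass of the local set
    OP = neighbors - O
    CP = neighbors - center
    d_op = np.linalg.norm(OP, axis=1)     # |OP| for every neighbor
    d_cp = np.linalg.norm(CP, axis=1)     # |CP| for every neighbor
    d_oc = np.linalg.norm(center - O)     # |OC|, shared by the whole set
    eps = 1e-8
    OC_unit = (center - O) / (d_oc + eps)
    cos_theta = (OP / (d_op[:, None] + eps)) @ OC_unit
    # These rotation- and translation-invariant relations (A2) would be
    # concatenated with raw coordinates/normals (A1) and fed to the MLP.
    return np.stack([d_op, d_cp, np.full(len(neighbors), d_oc), cos_theta], axis=1)
```

Distances and angles between O, C, and P are unchanged by rigid rotations and translations, which is why features of this form support the invariance argued for in Section 3.2.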
The structure of the multi-scale feature extraction module is shown in Figure 11. From the point coordinates and normal vectors, the required Euclidean distance and cosine similarity information can be calculated. The MSG structure extracts features hierarchically at multiple scales and resolutions; stacking it twice allows the model to enlarge its receptive field more smoothly and strengthens its feature extraction capability. At the same time, we borrow the design of the ResNet [44,45] network and introduce residual connections to further optimize the network structure, avoid the degradation problem, and improve segmentation accuracy. Finally, the features extracted at the three radius scales are concatenated to obtain more comprehensive information, which is passed to the next module.
3.4. Offset-Attention Mechanism
The attention mechanism [34] is an important technique for improving the performance of neural networks. Its core idea is to find the correlations within the original data and highlight its important features: the model assigns different weights to each part of the feature map, emphasizing key information and suppressing unimportant information so that it can make more accurate judgments.
The offset-attention mechanism we use is an improvement on the self-attention mechanism and was first applied in the point cloud transformer (PCT) network [32]. The self-attention mechanism computes semantic correlations between different elements of a data sequence. Specifically, each element of an input sequence is first converted into a query, a key, and a value. An attention score is then obtained by computing the dot product of the query with all keys, indicating how well the query matches each key. Next, a softmax operation over these scores makes them sum to 1, yielding the attention weights. Finally, the values are weighted and summed with these weights, and the result is the output of the self-attention mechanism:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ denotes the dimensionality of the key, and $Q$, $K$, and $V$ all originate from the product of the input sequence $X$ with learned weight matrices and are essentially linear transformations of $X$.
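A minimal numpy rendering of this formula; Wq, Wk, and Wv stand in for the learned weight matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a point sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # linear transformations of X
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # query-key match scores
    return softmax(scores, axis=-1) @ V       # weighted sum of the values
```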
The offset-attention mechanism calculates the offset between the self-attention features and the input features by element-wise subtraction and passes this offset, rather than the original self-attention output, to the subsequent steps. The structure of the offset-attention mechanism is shown in Figure 12. In the traditional self-attention mechanism, the first dimension is scaled by $1/\sqrt{d_k}$ and the second dimension is normalized with softmax. In contrast, the offset-attention mechanism applies the softmax operator to the first dimension and normalizes the attention map with the L1 norm along the second dimension. This increases the attention weights and reduces the effect of noise, which is more favorable for downstream tasks. The formula for the offset-attention mechanism is shown below:

$$F_{\text{out}} = \mathrm{LBR}\big(F_{\text{in}} - F_{sa}\big) + F_{\text{in}}$$

where $F_{\text{in}}$ denotes the input features, $F_{sa}$ denotes the self-attention features, and LBR denotes the combination of a linear layer, batch normalization, and ReLU used in PCT [32].
3.5. Feature Propagation Module
The difference between the point cloud segmentation task and the classification task is that classification requires only one overall feature to discriminate the class of the whole point cloud, whereas segmentation can be understood as classifying each point in the point cloud, so each point needs its own feature.
PCTTS reduces the number of sampled points layer by layer to ensure that the network receives enough global information, similar to PointNet++ [29]. In order to obtain features for all points in the point cloud while avoiding high computational costs, we interpolate from the known feature points to complete the up-sampling, so that the network outputs features for the same points it received as input.

Nearest neighbor interpolation is used to complete the up-sampling layer by layer, and the features of the previous layer are concatenated with the features of the current layer using the cross-level skip link concatenation strategy of hierarchical propagation to form a new feature representation. In PCTTS, the specific up-sampling process takes the globally obtained features, copies them, and concatenates them with the 256 sampled center features from the preceding layer. Nearest neighbor interpolation is then applied to the result based on the distance matrix, modifying the feature values; this operation can be interpreted as a form of distance weighting. After one round of up-sampling, the interpolated point features correspond to the 1024 sampled centers of the next higher layer, and the features are concatenated once again. This entire operation is repeated twice, gradually completing the up-sampling and obtaining a feature for each of the 4096 points to fulfill the per-point classification task. The formula followed for the interpolation is shown below:

$$f(x)=\frac{\sum_{i=1}^{k} w_i(x)\, f_i}{\sum_{i=1}^{k} w_i(x)},\qquad w_i(x)=\frac{1}{d(x, x_i)^2},\quad k=3$$

Here, we use three-point interpolation ($k=3$), where $f_i$ denotes the feature of the $i$-th nearest known point $x_i$, and $w_i(x)$ denotes the weight of each point, which is based on its distance to the interpolated point $x$.
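A compact numpy sketch of this three-point inverse-distance interpolation; the explicit loop favors clarity over speed.

```python
import numpy as np

def three_point_interpolate(query_xyz, known_xyz, known_feats, eps=1e-8):
    """Propagate features from sampled centers back to a denser point set
    by inverse-distance weighting over the three nearest known points."""
    out = np.zeros((len(query_xyz), known_feats.shape[1]))
    for j, q in enumerate(query_xyz):
        d = np.linalg.norm(known_xyz - q, axis=1)
        idx = np.argsort(d)[:3]              # three nearest known points
        w = 1.0 / (d[idx] ** 2 + eps)        # inverse squared distance
        w /= w.sum()                         # normalize the weights
        out[j] = w @ known_feats[idx]        # weighted sum of their features
    return out
```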