Semantic Segmentation Algorithm Fusing Infrared and Natural Light Images for Automatic Navigation in Transmission Line Inspection

Yuan, Jie; Wang, Ting; Huo, Guanying; Jin, Ran; Wang, Lidong

doi:10.3390/electronics12234810

by Jie Yuan ¹,*, Ting Wang ¹, Guanying Huo ², Ran Jin ¹ and Lidong Wang ³

¹ College of Big Data and Software Engineering, Zhejiang Wanli University, Ningbo 315104, China
² School of Internet of Things Engineering, Hohai University, Changzhou 213022, China
³ School of Engineering, Hangzhou Normal University, Hangzhou 310030, China
* Author to whom correspondence should be addressed.

Electronics 2023, 12(23), 4810; https://doi.org/10.3390/electronics12234810

Submission received: 20 October 2023 / Revised: 16 November 2023 / Accepted: 19 November 2023 / Published: 28 November 2023

(This article belongs to the Special Issue Recent Advances in Unmanned System Navigation and Control)

Abstract

Unmanned aerial vehicles (UAVs) are now widely used in power transmission line inspection, and to navigate automatically they need to recognize the category and accurate position of transmission pylon equipment. Semantic segmentation is an effective method for recognizing such equipment. In this paper, a semantic segmentation algorithm that fuses infrared and natural light images is proposed. A cross-modal attention interaction activation mechanism is adopted to fully exploit the complementarity between natural light and infrared images. A global information block with a feature pyramid structure is first used to deeply mine and fuse the multi-scale global contextual information of the fused features, and its output then guides feature aggregation during decoding. Sufficient aggregation of the multi-scale features of infrared and natural light images enhances the expressive ability of the model and improves the accuracy of semantic segmentation of transmission pylon equipment in complex scenes. Our method guides the low-level up-sampling and restoration process with denser global and high-level features. Experimental results on a dataset of transmission pylon equipment that we collected show that the proposed method achieves better semantic segmentation results than state-of-the-art methods.

Keywords: transmission pylon equipment; inspection and navigation; infrared light; semantic segmentation; multi-modal fusion

1. Introduction

Nowadays, unmanned aerial vehicles (UAVs) are widely used in the inspection of power transmission lines. Some detection devices, such as natural light cameras and infrared cameras, are installed on UAVs to capture imagery data from the field of transmission lines. Natural light cameras can capture RGB images and videos while infrared cameras can capture infrared energy images of target objects presented as temperature distribution fields. All those imagery data aid surveillance personnel in gaining insights into the on-site conditions of the transmission lines.

However, using UAVs to inspect lines requires prior planning of the route, and the UAVs can only fly along the specified route during inspection, which makes them unable to handle unexpected anomalies effectively. To achieve truly real-time automatic navigation and line inspection, it is necessary to perform vision-based detection and semantic segmentation of the various power facilities on the transmission line. Transmission pylons and their equipment are important power equipment, so their semantic segmentation during automatic navigation and inspection has top priority in transmission line inspection work. Common image object detection can only obtain a rectangular box around the target object. As the transmission pylon and equipment occupy only a portion of the box, the resulting accuracy of navigation and obstacle avoidance for UAVs cannot meet practical needs. Semantic segmentation performs pixel-level classification, which yields the accurate position and irregular extent of the target object in the image. It is therefore better suited to the actual needs of UAV navigation and line inspection, and the study of semantic segmentation algorithms for power pylons is of great significance and value.

Traditional semantic segmentation algorithms generally use surface information of images, such as edges, grayscale and texture, to segment images into multiple object regions. However, they are generally aimed at single targets and have low segmentation accuracy and efficiency, so they cannot meet practical application needs. Research on semantic segmentation using deep learning networks has increased rapidly. Compared with traditional methods, semantic segmentation methods based on deep learning have been shown to significantly improve segmentation efficiency and accuracy [1,2]. In 2015, Long et al. [3] proposed the fully convolutional network (FCN) architecture for end-to-end semantic segmentation of images. FCN replaces the fully connected layers in CNN models with convolution layers to achieve pixel-level prediction for images of any size. It was the first algorithm to use deep learning networks for semantic segmentation. Since then, many researchers have proposed semantic segmentation methods based on deep learning architectures [4,5,6,7,8,9,10,11,12,13,14,15,16], including UNet [4], UNet++ [9], Mask R-CNN [5], the Deeplab series [10,11,12,13] and Segment Anything [16]. Liu Siyuan et al. [17] proposed a method based on an improved R-FCN algorithm for detecting transmission line equipment. They improved the R-FCN [18] framework at multiple stages, which significantly improved the detection accuracy of transmission line equipment. Ronneberger et al. [4] proposed the U-Net network with symmetric encoding and decoding structures for medical image segmentation. The encoder performs down-sampling for feature extraction, while the decoder performs up-sampling to recover resolution. Skip connections are introduced to fuse shallow detail information with deep semantic information. Liu He et al. [19] constructed a Res-UNet network to segment electrical equipment in infrared images, using a deep residual network for feature extraction and encoding. He et al. [5] proposed the Mask R-CNN model by introducing a parallel mask branch based on FCN. The mask branch outputs a segmentation mask map. Mask R-CNN maintains both high detection accuracy and high detection speed, so it has been widely used. Xiong et al. [20] proposed a semantic segmentation method for power equipment based on a dual-layer network architecture containing the Mask R-CNN network and Bayesian networks to provide refined segmentation prediction maps. The Deeplab series methods [10,11,12,13] are also widely used in semantic segmentation. Deeplab v1 [10] used a fully connected conditional random field (CRF) and atrous (hole) convolution to enhance the effectiveness of deep CNNs (DCNNs). Deeplab v2 [11] used atrous spatial pyramid pooling and merged the DCNN and CRF to obtain better results. Deeplab v3 [12] added a multi-scale module and designed serial and parallel atrous convolution modules. Deeplab v3+ [13] used Xception instead of ResNet-101 as the backbone network. Guanke et al. [21] proposed an improved Deeplab v3+ based power line segmentation method. They used a one-shot aggregation feature pyramid and a feature fusion module to achieve a larger receptive field and higher-level features.

The above research on semantic segmentation is based on single-modal images. In recent years, scholars have begun to focus on semantic segmentation tasks based on multi-modal images, attempting to fully exploit feature information through the fusion of different modalities to improve segmentation performance [22,23,24,25,26,27]. Infrared thermal imaging generates infrared images by sensing the thermal radiation of the target, so infrared images are not affected by changes in lighting and are beneficial for detecting and segmenting targets under insufficient light. However, the edges in infrared images are generally blurred, and the images lack color, texture information and clear details. Natural light images have rich details and texture information, but their imaging quality is easily affected by illumination in outdoor scenes. The two modalities are therefore strongly complementary, and some studies fuse them to achieve more accurate semantic segmentation results. Ha et al. [22] proposed a multi-modal fusion network (MFNet) based on a CNN architecture for semantic segmentation with multi-modal image fusion. MFNet has an encoder–decoder structure consisting of two identical encoders, which extract feature maps of natural light images and infrared thermal images, respectively, and one decoder. MFNet showed that the collaborative processing of infrared and natural light images can significantly improve the performance of semantic segmentation algorithms. Sun et al. [23] proposed a network called RTFNet, which introduces a residual structure on the basis of MFNet. RTFNet uses two identical encoders to extract natural light and infrared image features and then fuses the features by element-wise summation. However, most of the above methods use simple fusion strategies, which lead to insufficient fusion of multi-modal image information and cannot fully utilize the specific features of the different modalities. Sun et al. [24] proposed the FuseSeg network, which uses DenseNet as the backbone encoder network. FuseSeg adopts a two-stage fusion strategy to fuse the image features of the different modalities, in order to fully utilize the complementary information between infrared and natural light images, and it reduces the loss of spatial information caused by down-sampling. Zhou et al. [25] used the MobileNetV2 network to extract features from infrared and natural light images and a fusion strategy with an embedded control gate to segment urban scenes. Wang et al. [27] proposed a novel semantic-guided fusion network (SGFNet) for RGB–thermal semantic segmentation. SGFNet consists of an asymmetric encoder with a TIR branch and an RGB branch, and a decoder.

However, there are few studies on power equipment semantic segmentation using cross-modal image fusion. Yan et al. [28] proposed a multi-modal image fusion framework based on Mask R-CNN for instance segmentation of power equipment, and compared the effects of different fusion methods on the segmentation results. The results indicate that multi-modal fusion methods can improve the segmentation accuracy of power equipment in different scenarios. Shu et al. [29] proposed an end-to-end multi-modal semantic segmentation model on power equipment instance segmentation. First, they designed a novel multi-modal feature extraction block to extract rich features from natural light images and infrared images. Second, they constructed a feature fusion block to fully fuse the extracted multi-modal features. Finally, a multi-scale instance segmentation block was used to segment power equipment. The proposed model demonstrates good performance in power equipment instance segmentation.

To obtain more accurate semantic segmentation results to guide the autonomous navigation of UAVs for transmission line inspection, we propose a semantic segmentation model based on multi-scale feature differentiated fusion for power equipment on transmission pylons. Compared with other similar methods, the main innovations of our method are as follows:

(1) We use a global information block and a multi-level information aggregation block with a feature pyramid structure to deeply mine and fuse multi-scale contextual information, and utilize the multi-scale features of infrared and natural light images to better characterize the essential features of the images;
(2) We design differentiated fusion strategies for different levels of natural light and infrared dual-modal feature maps, and use cross-modal attention interaction activation mechanisms to fully mine the complementarity between natural light and infrared images, thereby improving semantic segmentation results;
(3) We conduct experiments with seven semantic segmentation algorithms on our self-made TTS200 multi-modal dataset for transmission pylon equipment and analyze the experimental results in depth.

The rest of this paper is organized as follows: Section 2 introduces in detail the proposed semantic segmentation algorithm for transmission line automatic navigation that fuses infrared and natural light images for transmission pylon equipment; in Section 3 we conduct experiments using multiple semantic segmentation algorithms on the multi-modal dataset for transmission pylon equipment and deeply analyze the experimental results; and Section 4 summarizes this article.

2. Semantic Segmentation Algorithm for Transmission Pylon Equipment Based on Fusion of Infrared and Natural Light Images

We propose a semantic segmentation algorithm for power transmission pylon equipment that integrates infrared and natural light images. The main body of the algorithm is a neural network based on multi-scale feature differentiation fusion, which applies a differentiated fusion strategy to different levels of feature maps to better capture the spatial location information and the class-discriminative semantic information in bimodal images. The structure of our network is shown in Figure 1. The network uses two symmetrical encoders, with ResNet50 as the backbone network, to extract feature maps of different levels from infrared and natural light images. We designed a differentiated fusion strategy consisting of a lower-level fusion block (LLFB) and a higher-level fusion block (HLFB) to fuse features at different levels of the network. The LLFB fuses the first three layers of feature maps in the network, and a spatial attention mechanism is used to activate each of the two modalities along the spatial dimension. The spatial attention mechanism suppresses the interference of background noise in the lower-level feature maps and improves the representation ability of the fused feature maps, so that the network focuses more on the spatial location of the target to be segmented. The HLFB fuses the high-level semantic information in the last two layers of feature maps, and it uses a channel attention mechanism to activate the two modalities along the channel dimension. From the fused higher-level feature maps, the global information block (GIB) extracts multi-scale high-level semantic information and guides the aggregation of feature maps at each layer during decoding. Finally, the information aggregation block (IAB) progressively aggregates the fused feature maps at the various levels to generate the final segmented image.

Our method extracts distinct information for different layers, and densely integrates the high-level and global information into the restoration process of the original resolution of low-level information for each layer, so it can improve the effectiveness of semantic segmentation by better integrating information from both natural light and infrared images.

2.1. Lower-Level Fusion Block (LLFB)

First, the feature maps extracted by ResNet50 are divided into lower-level and higher-level feature maps according to network depth, and a differentiated fusion strategy is then adopted for the different levels. The first three feature maps, which contain plenty of spatial information, are fused by a lower-level fusion block. This block first uses a spatial attention mechanism to activate the lower-level infrared feature maps $\{T_i \mid i = 0, 1, 2\}$ and the lower-level natural light feature maps $\{R_i \mid i = 0, 1, 2\}$ along the spatial dimension, and then fuses the activated cross-modal feature maps to capture cross-modal spatial complementary information.

The structure of the lower-level fusion block is shown in Figure 2. The lower-level natural light feature map $R_i$ is first fed into a 1 × 1 convolution layer to integrate information from different channels, and a max pooling layer is then applied along the integrated channel dimension to keep the most significant feature information at each spatial location. Finally, a tanh activation function is applied to obtain the spatial attention activation map $M_i^R$ of the lower-level natural light feature map $R_i$:

$$M_i^R = \varphi\left(\mathrm{Maxpool}\left(\mathrm{Conv}\left(R_i; \alpha^P\right)\right)\right) \tag{1}$$

In Formula (1), $\mathrm{Conv}(*; \alpha^P)$ denotes a 1 × 1 convolution operation with $\alpha^P$ as its parameter, $\mathrm{Maxpool}(*)$ denotes the max pooling operation and $\varphi(*)$ represents the tanh activation function.

The higher the value in the spatial attention activation map $M_i^R$, the more important the spatial information of the corresponding region in the natural light feature map, and vice versa. As natural light images and infrared images are different modalities of the same scene, the attention responses of the same spatial regions in the lower-level infrared feature map $T_i$ and the corresponding lower-level natural light feature map $R_i$ should be similar. Based on this, cross-modal spatial activation is used to enhance both modalities. The lower-level infrared feature map $T_i$ is enhanced using the spatial attention activation map $M_i^R$ of natural light, as in Formula (2):

$$T_i' = R_i \cdot M_i^R + T_i \tag{2}$$

Similarly, the enhanced lower-level natural light feature map is obtained by cross-modal spatial activation, as shown in Formula (3), where $M_i^T$ is the spatial attention activation map computed from $T_i$ in the same way as in Formula (1):

$$R_i' = T_i \cdot M_i^T + R_i \tag{3}$$

Through the cross-modal activation path along the spatial dimension, the lower-level natural light and infrared feature maps absorb complementary information from each other, which enriches the spatial detail information available for semantic segmentation. Meanwhile, the spatial attention mechanism helps the network focus on the spatial detail features of the target to be segmented and suppresses the interference of redundant information and background noise on the segmentation predictions. The enhanced natural light feature map $R_i'$ and infrared feature map $T_i'$ are fused by channel concatenation, a 1 × 1 convolution layer then converts the fused feature maps to the same channel number as the input feature maps, and finally the maps are batch-normalized to obtain the fused feature map $F_i$:

$$F_i = \mathrm{BatchNorm}\left(\mathrm{ConvBlock}\left(\left[T_i', R_i'\right]\right)\right), \quad i = 0, 1, 2 \tag{4}$$
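The cross-modal spatial activation in Formulas (1)–(3) can be sketched in plain Python on tiny nested-list tensors. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy weights are hypothetical, the 1 × 1 convolution is taken to have no bias, and the concatenation, second 1 × 1 convolution and batch normalization of Formula (4) are omitted.

```python
import math

def conv1x1(x, weight):
    """Bias-free 1x1 convolution: mixes channels independently at every pixel.
    x is [C_in][H][W]; weight is [C_out][C_in] (hypothetical toy parameters)."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(w_c * x[c][h][w] for c, w_c in enumerate(row))
              for w in range(width)]
             for h in range(height)]
            for row in weight]

def spatial_attention(x, weight):
    """Formula (1): tanh(max over channels(Conv1x1(x))), giving one [H][W] map."""
    y = conv1x1(x, weight)
    height, width = len(y[0]), len(y[0][0])
    return [[math.tanh(max(ch[h][w] for ch in y)) for w in range(width)]
            for h in range(height)]

def activate(source, attention, target):
    """Formulas (2)/(3): source * attention + target, with the [H][W]
    attention map broadcast over every channel."""
    return [[[s * a + t for s, a, t in zip(s_row, a_row, t_row)]
             for s_row, a_row, t_row in zip(s_ch, attention, t_ch)]
            for s_ch, t_ch in zip(source, target)]

def cross_modal_spatial_activation(rgb, thermal, w_rgb, w_thermal):
    """Return (T', R') for one lower-level pair of feature maps."""
    m_rgb = spatial_attention(rgb, w_rgb)
    m_thermal = spatial_attention(thermal, w_thermal)
    thermal_enh = activate(rgb, m_rgb, thermal)   # T' = R * M^R + T
    rgb_enh = activate(thermal, m_thermal, rgb)   # R' = T * M^T + R
    return thermal_enh, rgb_enh
```

Note that each modality is enhanced with the other modality's features weighted by that other modality's own attention map, while the additive term keeps its original features intact.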

2.2. Higher-Level Fusion Block (HLFB)

As shown in Figure 3, unlike the LLFB, the proposed higher-level fusion block aims to fuse the high-level semantic information of the higher-level natural light feature maps $\{R_i \mid i = 3, 4\}$ and the higher-level infrared feature maps $\{T_i \mid i = 3, 4\}$. It uses the interactive attention mechanism to mutually activate the two modalities along the channel dimension, achieving interactive fusion of cross-modal category judgment information.

First, the higher-level natural light feature map $R_i$ is sent into a 1 × 1 convolution layer to integrate information from different channels, and an average pooling layer is then used to obtain global information along every integrated channel. After a tanh activation function, the channel attention activation vector $V_i^R$ of the higher-level natural light feature map $R_i$ is obtained, as formulated in Formula (5):

$$V_i^R = \varphi\left(\mathrm{AP}\left(\mathrm{Conv}\left(R_i; \beta^P\right)\right)\right) \tag{5}$$

In Formula (5), $\mathrm{Conv}(*; \beta^P)$ denotes a 1 × 1 convolution operation with $\beta^P$ as its parameter, $\mathrm{AP}(*)$ denotes the average pooling operation and $\varphi(*)$ denotes the tanh activation function.

Similarly, the channel attention activation vector $V_i^T$ of the higher-level infrared feature map $T_i$ is obtained.

As in the LLFB, cross-modal mutual activation, here along the channel dimension, is used to explore complementary information between the higher-level feature maps of the two modalities, and this complementary information is used to enhance the semantic information. The formulas are as follows:

$$\begin{aligned} T_i' &= R_i \cdot V_i^R + T_i \\ R_i' &= T_i \cdot V_i^T + R_i \end{aligned} \tag{6}$$

With this cross-modal activation along the channel dimension, the higher-level natural light and infrared feature maps fuse each other's advanced semantic information. The enhanced feature maps are then concatenated along the channel dimension, a 1 × 1 convolution layer converts the result to the same channel number as the input feature maps, and finally the maps are batch-normalized to obtain the fused feature map:

$$F_i = \mathrm{BatchNorm}\left(\mathrm{Conv}\left(\left[T_i', R_i'\right]\right)\right), \quad i = 3, 4 \tag{7}$$
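The channel-wise counterpart in Formulas (5) and (6) can be sketched in the same plain-Python style. This is only an illustration under stated assumptions: the function names are hypothetical, the 1 × 1 convolution is bias-free and keeps the channel count (a square weight matrix) so that the attention vector has one entry per input channel, and the average pooling is taken to be global over all pixels of a channel.

```python
import math

def channel_attention(x, weight):
    """Formula (5): tanh(global average pool(Conv1x1(x))), one value per channel.
    x is [C][H][W]; weight is a square [C][C] matrix (assumed, toy parameters)."""
    height, width = len(x[0]), len(x[0][0])
    vector = []
    for row in weight:
        # Mix the input channels at every pixel, then average over all pixels.
        mixed = [[sum(w_c * x[c][h][w] for c, w_c in enumerate(row))
                  for w in range(width)]
                 for h in range(height)]
        mean = sum(sum(r) for r in mixed) / (height * width)
        vector.append(math.tanh(mean))
    return vector

def channel_activate(source, vector, target):
    """Formula (6): source * V + target, with V[c] broadcast over all pixels."""
    return [[[s * v + t for s, t in zip(s_row, t_row)]
             for s_row, t_row in zip(s_ch, t_ch)]
            for s_ch, v, t_ch in zip(source, vector, target)]
```

Compared with the spatial case, the attention here is a vector indexed by channel rather than a map indexed by pixel, so it reweights whole channels instead of locations.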

2.3. Multi-Scale Feature Fusion

In deep convolutional neural networks, the feature maps extracted from the top layers of the encoder contain advanced semantic information about the image. Abstract semantic information is beneficial for improving the performance of semantic segmentation networks, and multi-scale advanced semantic information helps the network perceive targets of different sizes. Considering that the deeply fused feature map $F_4$ obtained after cross-modal fusion contains rich advanced semantic information, a global information block is designed to mine multi-scale global context information from it.

The structure of the global information block is shown in Figure 4. The higher-level fusion feature map $F_4$ is first fed into a feature pyramid structure with five parallel branches to capture multi-scale global context information. The five parallel branches contain a 1 × 1 standard convolution layer for channel adjustment, and 3 × 3 and 5 × 5 atrous (hole) convolution layers with different dilation rates. Atrous convolutions with different dilation rates produce feature maps with the same spatial resolution but different receptive field scales, which contain high-level semantic information at different scales. The multi-scale feature maps obtained via convolution are fused by channel concatenation, and a 1 × 1 standard convolution layer is then used for feature refinement and channel adjustment. Finally, the fused multi-scale feature map is combined with the original cross-modal fusion feature map $F_4$ through a skip connection, a batch-normalization operation and an up-sampling layer to obtain the final multi-scale fusion feature map $G_4$. This parallel multi-scale feature pyramid structure is beneficial for exploring and enhancing the multi-scale context information of higher-level fusion feature maps, and the extracted multi-scale global context feature map $G_4$ serves as global semantic information to guide the fusion of feature maps at each stage.
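The claim that atrous convolutions keep the spatial resolution while enlarging the receptive field follows from standard convolution arithmetic, which the short sketch below works through. The branch settings are hypothetical, since the text does not state the dilation rates, and the 20-pixel input width is only an example size.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated (atrous) kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def same_padding(kernel_size, dilation):
    """Padding per side that preserves spatial size at stride 1 (odd kernels)."""
    return dilation * (kernel_size - 1) // 2

def output_size(input_size, kernel_size, dilation, padding, stride=1):
    """Standard convolution output-size formula along one axis."""
    return (input_size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Hypothetical (kernel size, dilation) pairs for five branches; the rates
# actually used in the global information block are not given in the text.
branches = [(1, 1), (3, 2), (3, 4), (5, 2), (5, 4)]

# For each branch: (kernel, dilation, effective span, output width for width 20).
summary = [(k, d, effective_kernel_size(k, d),
            output_size(20, k, d, same_padding(k, d)))
           for k, d in branches]
```

With padding chosen this way, every branch returns a map of the input width, so the branch outputs can be concatenated along the channel dimension even though their effective kernel spans differ.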

Higher-level feature maps provide advanced semantic information that is beneficial for category discrimination, while lower-level feature maps provide rich detailed texture information that helps refine the edges of segmentation regions. Lower-level and higher-level feature maps thus contain different spatial and semantic information. For the feature maps $F_0 \sim F_4$ generated at different levels after cross-modal fusion, an information aggregation block is designed to gradually aggregate features from the different levels. During the bottom-up aggregation and decoding process, the multi-scale advanced semantic information $G_4$ and $G_k$ ($k > i + 1$) is used as a guide to gradually supplement the spatial detail information in the lower-level fusion features and obtain finer and more accurate semantic segmentation results.

Figure 5 shows the structure of the proposed information aggregation block, which contains a 2 × 2 transposed convolution layer, a standard 1 × 1 convolution layer, a batch-normalization layer and an up-sampling layer. The multi-scale global contextual information $G_4$, the higher-level contextual information $G_k$ ($k > i + 1$) and the feature map $G_{i+1}$ aggregated in the previous stage are up-sampled using bilinear interpolation and recovered to the same resolution as the cross-modal fusion feature map $F_i$ using transposed convolution. Feature aggregation is then performed by pixel-wise addition. Gradually adding cross-modal fusion features and multi-scale global contextual information at each level of feature aggregation effectively alleviates the loss of detail and semantic information during decoding, which benefits the final semantic segmentation prediction. Finally, the aggregated feature maps are batch-normalized, up-sampled and sent to the classification prediction block, which consists of a 1 × 1 convolution layer and a softmax layer, for the final semantic segmentation prediction.

2.4. Loss Function

For the loss function, we use a focal loss [30] to address the class imbalance issue, as in Formula (8):

$$\mathrm{Loss}(P, G) = -\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} W(x_{ij}) \left(1 - P(x_{ij})\right)^{\gamma} \log\left(P(x_{ij})\right) \tag{8}$$

In Formula (8), $W$ and $H$ denote the width and height of the input images, respectively; $(i, j)$ denotes the position coordinates of a pixel; $P(x_{ij})$ denotes the predicted category weight of pixel $x_{ij}$; and $W(x_{ij})$ denotes the ground-truth category weight of pixel $x_{ij}$. The value of $\gamma$ is set to 2 here.
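Formula (8) can be evaluated directly on a small grid, as in the sketch below. It is an illustration under an assumption borrowed from the usual focal loss formulation: $P(x_{ij})$ is read as the predicted probability of the true class at pixel $(i, j)$, strictly greater than zero, and $W(x_{ij})$ as the class weight of that true class. The function name and argument layout are hypothetical.

```python
import math

def focal_loss(probs, weights, gamma=2.0):
    """Formula (8) on an [H][W] grid.
    probs[i][j]   : predicted probability of the true class at pixel (i, j),
                    assumed to lie in (0, 1]
    weights[i][j] : class weight W(x_ij) of the true class at that pixel
    """
    height, width = len(probs), len(probs[0])
    total = 0.0
    for p_row, w_row in zip(probs, weights):
        for p, w in zip(p_row, w_row):
            # (1 - p)^gamma shrinks the contribution of well-classified pixels.
            total += w * (1.0 - p) ** gamma * math.log(p)
    return -total / (height * width)
```

With `gamma=0.0` the modulating factor becomes 1 and the expression reduces to a weighted cross-entropy, which makes the effect of $\gamma = 2$ easy to compare on the same inputs.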

3. Results

3.1. Dataset

Currently, there is no publicly available dataset of infrared and natural light images of power transmission equipment. We collected 200 pairs of infrared and natural light images, covering vertical insulators, pylons, shockproof hammers and connecting hardware. After alignment and annotation, they were gathered into a multi-modal dataset named TTS200. The annotated data are divided into a training part and a testing part with a ratio of 7:3. The natural light and infrared images are resized to 640 × 480. Some image pairs from the TTS200 dataset are shown in Figure 6.
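For 200 image pairs, a 7:3 ratio corresponds to 140 training pairs and 60 testing pairs. The sketch below shows one way to produce such a split. It assumes a seeded random shuffle and integer pair identifiers, neither of which is stated in the text, and the function name is hypothetical.

```python
import random

def split_pairs(pair_ids, train_ratio=0.7, seed=0):
    """Shuffle pair identifiers reproducibly and split them by train_ratio.
    Each identifier stands for one aligned infrared/natural light pair, so
    both images of a pair always land in the same subset."""
    ids = list(pair_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]

train_ids, test_ids = split_pairs(range(200))
```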

3.2. Evaluation Indicators

This article uses two evaluation indicators, accuracy (Acc) and intersection over union (IoU), to quantitatively assess the semantic segmentation performance of a method. For each category, Acc is the ratio of correctly predicted pixels to all pixels that truly belong to that category, while IoU is the ratio between the intersection and the union of the predicted region and the true region. Mean accuracy (MAcc) and mean IoU (MIoU) are global evaluation metrics calculated by averaging the Acc and IoU values over all categories. They are defined as:

$$\begin{aligned} \mathrm{MAcc} &= \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i} \\ \mathrm{MIoU} &= \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i + FP_i} \end{aligned} \tag{9}$$

In Formula (9), $N$ represents the number of segmentation categories including the background, and is set to 5 in this paper. $TP_i$, $FN_i$ and $FP_i$ denote the numbers of true positive, false negative and false positive samples of the $i$-th category, respectively.
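Both metrics in Formula (9) can be computed from a pixel-level confusion matrix, as in the minimal sketch below. The function name and the matrix layout (rows are true classes, columns are predicted classes) are assumptions for illustration, and every class is assumed to occur at least once so that no denominator is zero.

```python
def mean_acc_and_miou(confusion):
    """confusion[t][p] = number of pixels of true class t predicted as class p.
    Returns (MAcc, MIoU) as defined in Formula (9)."""
    n = len(confusion)
    accs, ious = [], []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                           # true class i, predicted otherwise
        fp = sum(confusion[t][i] for t in range(n)) - tp      # predicted i, true class differs
        accs.append(tp / (tp + fn))
        ious.append(tp / (tp + fn + fp))
    return sum(accs) / n, sum(ious) / n
```

Because false positives appear only in the IoU denominator, a method that over-segments a region can keep a high MAcc while its MIoU drops.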

3.3. Experiment Results

Our method is developed using the open-source PyTorch framework and is trained and tested on an NVIDIA GeForce RTX3080 graphics card. Data augmentation, including random flipping and cropping, is applied to the training data. To verify the semantic segmentation performance of the proposed method, this section compares it with six other semantic segmentation methods: BiSeNet [31], RTFNet [23], FuseSeg [24], GMNet [32], ABMDRNet [33] and LASNet [34]. Among them, BiSeNet is a semantic segmentation algorithm based on a single natural light image, while the others are based on the fusion of infrared and natural light images. To ensure consistent input data, the input layer of the single-modality BiSeNet network was modified: the infrared and natural light images were fused in advance into multi-channel images, which were then used as input to the network.
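When augmenting paired data, the same random flip and crop have to be applied to the natural light image, the infrared image and the label map, otherwise the pixels no longer correspond. The sketch below illustrates this on single-channel nested lists. The flip direction (horizontal), the flip probability of 0.5 and the crop size are assumptions, as the text does not specify them, and the function name is hypothetical.

```python
import random

def paired_flip_and_crop(rgb, thermal, label, crop_h, crop_w, rng):
    """Apply one random horizontal flip and one random crop to all three maps.
    Each map is an [H][W] nested list (a single channel is shown for brevity)."""
    maps = [rgb, thermal, label]
    # One coin toss decides the flip for all maps together.
    if rng.random() < 0.5:
        maps = [[row[::-1] for row in m] for m in maps]
    height, width = len(rgb), len(rgb[0])
    # One crop window is drawn and reused for all maps.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [[row[left:left + crop_w] for row in m[top:top + crop_h]]
            for m in maps]
```

Passing an explicit `random.Random` instance as `rng` keeps the augmentation reproducible across runs.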

Table 1 presents the quantitative comparison results of various algorithms.

The table shows that, compared with the other semantic segmentation algorithms, our method achieves the best results in both the MAcc and MIoU indicators. The MAcc value of our method is 0.6% higher than that of the next-best algorithm, LASNet, while the MIoU value is 3.8% higher. Although LASNet obtains an MAcc value similar to that of our method, its segmentation results in edge areas are not as precise as ours, resulting in a lower MIoU. Our method therefore performs better than the other algorithms for semantic segmentation of transmission pylons and attached equipment, and its segmentation results can effectively guide UAVs to navigate automatically during line inspection.

Figure 7 compares the segmentation results of the different algorithms on one example. It shows the original natural light and infrared images, the manually labeled ground truth, the segmentation results of the other algorithms and the segmentation result of our algorithm. BiSeNet produces poor semantic segmentation results: the segmented power equipment area is very rough, making it difficult to correctly distinguish the location areas of different equipment. The edge contour obtained by our method is the finest among the compared algorithms on this example image pair. This indicates that the differentiated feature fusion strategy proposed in this paper can effectively fuse the features of complex infrared and natural light images.

BiSeNet [31] is designed for single natural light images, so it achieved a poor result. RTFNet [23] and FuseSeg [24] fuse the two modalities only through element-wise summation and down-sampling/up-sampling, and they treat low-level and high-level information in the same way, thus achieving moderate results. GMNet [32] uses only a deep feature fusion module to segment semantic regions, so its IoU metric is not good enough. ABMDRNet [33] and LASNet [34] employ different strategies to fuse low-level and high-level information, so they obtained good results. In our method, distinct information is extracted for different layers, and the extracted high-level and global information is densely integrated into the process of restoring the original resolution of the low-level information for each layer, so our method obtains the best result.

4. Conclusions

This article proposes a semantic segmentation algorithm for power transmission pylon equipment that fuses infrared and natural light images and is intended for UAV navigation and line inspection. The main body of the algorithm is a semantic segmentation network based on differentiated multi-scale feature fusion, and its segmentation results can be provided to line inspection UAVs for real-time navigation. Under the differentiated fusion strategy, a lower-level fusion block and a higher-level fusion block are designed according to the characteristics of feature maps at different levels. To enhance feature representation, the lower-level fusion block uses a spatial attention mechanism, while the higher-level fusion block uses a channel attention mechanism. A cross-modal attention interaction activation mechanism is used to fully exploit the complementary information between infrared and natural light images. A global information block with a feature pyramid structure is designed to deeply mine and fuse the multi-scale global context information of the features, which guides feature aggregation during decoding. Our method extracts distinct information for different layers and densely integrates the high-level and global information into the process that restores low-level information to the original resolution at each layer, so it improves semantic segmentation by better integrating information from both natural light and infrared images. The experimental results on the power pylon equipment dataset show that the proposed algorithm obtains better semantic segmentation results than the other methods compared. Because UAVs cannot currently support real-time edge analysis with hardware such as the NVIDIA 3080 GPU, we are porting the existing models to the NVIDIA Jetson AGX Xavier edge computing chip. In the future, we will conduct efficiency tests directly on the edge computing chip and explore methods to improve computing efficiency.
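The distinction between spatial attention at the lower levels and channel attention at the higher levels can be illustrated with a minimal NumPy sketch. This is a generic illustration only, not the authors' LLFB or HLFB: the element-wise sum of the two modalities and the fixed sigmoid gates are placeholders for the learned cross-modal layers, and all function names and tensor shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """Re-weight every pixel of a (C, H, W) feature map with one weight per location."""
    # Pool across channels; a real block would pass this through a learned convolution.
    pooled = 0.5 * (feat.mean(axis=0) + feat.max(axis=0))  # shape (H, W)
    weights = sigmoid(pooled)
    return feat * weights[None, :, :]

def channel_attention(feat):
    """Re-weight every channel of a (C, H, W) feature map with one weight per channel."""
    # Global average pooling over space; a real block would use a small learned MLP.
    pooled = feat.mean(axis=(1, 2))  # shape (C,)
    weights = sigmoid(pooled)
    return feat * weights[:, None, None]

rng = np.random.default_rng(0)
# Hypothetical shapes: low-level maps are large with few channels,
# high-level maps are small with many channels.
rgb_low, ir_low = rng.normal(size=(2, 8, 32, 32))
rgb_high, ir_high = rng.normal(size=(2, 64, 4, 4))

# Placeholder fusion: element-wise sum followed by the level-appropriate attention.
low_fused = spatial_attention(rgb_low + ir_low)
high_fused = channel_attention(rgb_high + ir_high)
print(low_fused.shape, high_fused.shape)
```

The design intuition in this sketch is that low-level maps carry fine spatial detail, so a per-pixel weight is the natural gate, whereas high-level maps carry semantic channels at coarse resolution, so a per-channel weight is the natural gate.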

Author Contributions

Methodology, J.Y.; Formal analysis, J.Y.; Resources, G.H.; Data curation, L.W.; Writing—original draft, T.W.; Writing—review & editing, T.W.; Visualization, R.J.; Project administration, J.Y.; Funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Ningbo City (grant number 2023J297) and the School Foundation of Zhejiang Wanli University (grant number SC1032211780350).

Data Availability Statement

The data that support the findings of this study are not publicly available due to confidentiality restrictions. However, they may be available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Luo, H.; Zhang, Y. A Survey of Image Semantic Segmentation Based on Deep Network. Acta Electron. Sin. 2019, 47, 2211–2220.
2. Tian, X.; Wang, L.; Ding, Q. Review of Image Semantic Segmentation Based on Deep Learning. J. Softw. 2019, 30, 440–468.
3. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
4. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
5. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
6. Peng, D.; Lei, Y.; Hayat, M.; Guo, Y.; Li, W. Semantic-Aware Domain Generalized Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2584–2595.
7. Lee, S.; Seong, H.; Lee, S.; Kim, E. WildNet: Learning Domain Generalized Semantic Segmentation from the Wild. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9926–9936.
8. Hoyer, L.; Dai, D.; Van Gool, L. DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9914–9925.
9. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2020, 39, 1856–1867.
10. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062.
11. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
12. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
13. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
14. Fan, Q.; Pei, W.; Tai, Y.-W.; Tang, C.-K. Self-Support Few-Shot Semantic Segmentation. In European Conference on Computer Vision, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022.
15. Liu, Q.; Wen, Y.; Han, J.; Xu, C.; Xu, H.; Liang, X. Open-World Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding. In European Conference on Computer Vision, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022.
16. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643.
17. Liu, S.; Wang, B.; Gao, K.; Wang, Y.; Gao, C.; Chen, J. Object Detection Method for Aerial Inspection Image Based on Region-based Fully Convolutional Network. Autom. Electr. Power Syst. 2019, 43, 162–168.
18. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016; pp. 379–387.
19. Liu, H.; Zhao, T.; Liu, J.; Jiao, L.; Xu, Z.; Yuan, X. Deep Residual UNet Network-based Infrared Image Segmentation Method for Electrical Equipment. Infrared Technol. 2022, 44, 1351–1357.
20. Xiong, S.; Liu, Y.; Rui, X.; He, K.; Dollár, P. Power equipment recognition method based on mask R-CNN and Bayesian context network. In Proceedings of the IEEE Power & Energy Society General Meeting (PESGM), Montreal, QC, Canada, 2–6 August 2020; pp. 1–5.
21. Chen, G.; Hao, K.; Wang, B.; Li, Z.; Zhao, X. A power line segmentation model in aerial images based on an efficient multibranch concatenation network. Expert Syst. Appl. 2023, 228, 120359.
22. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5108–5115.
23. Sun, Y.; Zuo, W.; Liu, M. RTFNet: RGB-Thermal fusion network for semantic segmentation of urban scenes. IEEE Robot. Autom. Lett. 2019, 4, 2576–2583.
24. Sun, Y.; Zuo, W.; Yun, P.; Wang, H.; Liu, M. FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Trans. Autom. Sci. Eng. 2020, 18, 1000–1011.
25. Zhou, W.; Lv, Y.; Lei, J.; Yu, L. Embedded Control Gate Fusion and Attention Residual Learning for RGB–Thermal Urban Scene Parsing. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4794–4803.
26. Wu, W.; Chu, T.; Liu, Q. Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation. Pattern Recognit. 2022, 131, 108881.
27. Wang, Y.; Li, G.; Liu, Z. SGFNet: Semantic-Guided Fusion Network for RGB-Thermal Semantic Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023.
28. Yan, N.; Zhou, T.; Gu, C.; Jiang, A.; Lu, W. Bimodal-based object detection and instance segmentation models for substation equipments. In Proceedings of the Annual Conference of the IEEE Industrial Electronics Society (IES), Singapore, 18–21 October 2020; pp. 428–434.
29. Shu, J.; He, J.; Li, L. MSIS: Multispectral instance segmentation method for power equipment. Comput. Intell. Neurosci. 2022, 2022, 2864717.
30. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
31. Yu, C.; Wang, J.; Peng, C.; Jiang, A.; Lu, W. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341.
32. Zhou, W.; Liu, J.; Lei, J.; Yu, L.; Hwang, J.-N. GMNet: Graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation. IEEE Trans. Image Process. 2021, 30, 7790–7802.
33. Zhang, Q.; Zhao, S.; Luo, Y.; Zhang, D.; Huang, N.; Han, J. ABMDRNet: Adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2633–2642.
34. Li, G.; Wang, Y.; Liu, Z.; Zhang, X.; Zeng, D. RGB-T semantic segmentation with location, activation, and sharpening. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1223–1235.

Figure 1. Semantic segmentation network structure.

Figure 2. The structure of the LLFB.

Figure 3. The structure of the HLFB.

Figure 4. The structure of the global information block (GIB).

Figure 5. The structure of the information aggregation block (IAB).

Figure 6. Image samples in the TTS200 dataset. (a) Infrared images of transmission pylons; (b) natural light images of transmission pylons.

Figure 7. Experimental results of different methods. (a) Original image; (b) ground truth; (c) BiSeNet; (d) RTFNet; (e) FuseSeg; (f) ABMDRNet; (g) GMNet; (h) LASNet; (i) our method.

Table 1. Experimental results of different semantic segmentation algorithms.

Method	Pylon Acc	Pylon IOU	Vertical Insulator Acc	Vertical Insulator IOU	Shockproof Hammer Acc	Shockproof Hammer IOU	Connecting Hardware Acc	Connecting Hardware IOU	MAcc	MIOU
BiSeNet	89.7	85.0	89.2	35.6	62.8	12.1	79.2	26.2	80.2	39.7
RTFNet	89.5	86.1	87.3	64.8	85.3	37.2	85.8	48.4	87.0	59.1
FuseSeg	89.8	86.8	88.2	76.4	80.6	64.3	87.1	67.1	86.4	73.7
ABMDRNet	89.2	87.7	89.4	75.8	86.2	63.1	86.6	67.4	87.9	73.5
GMNet	89.0	88.2	89.5	72.0	88.8	44.4	88.4	54.7	88.9	64.8
LASNet	89.5	88.0	89.5	79.2	88.6	61.5	88.6	67.8	89.0	74.1
Ours	89.9	88.3	89.4	81.8	88.9	68.1	89.5	73.1	89.6	77.9
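As a reading aid for Table 1, the short script below recomputes the mean of the four per-class IOU values for each method and compares it with the reported MIOU. It assumes that MIOU is the unweighted mean over the four listed classes; that definition is an assumption here, and small differences from the reported values may simply reflect rounding of the published per-class figures.

```python
# Per-class IOU values copied from Table 1, in the order: pylon, vertical
# insulator, shockproof hammer, connecting hardware; then the reported MIOU.
TABLE_IOU = {
    "BiSeNet":  ([85.0, 35.6, 12.1, 26.2], 39.7),
    "RTFNet":   ([86.1, 64.8, 37.2, 48.4], 59.1),
    "FuseSeg":  ([86.8, 76.4, 64.3, 67.1], 73.7),
    "ABMDRNet": ([87.7, 75.8, 63.1, 67.4], 73.5),
    "GMNet":    ([88.2, 72.0, 44.4, 54.7], 64.8),
    "LASNet":   ([88.0, 79.2, 61.5, 67.8], 74.1),
    "Ours":     ([88.3, 81.8, 68.1, 73.1], 77.9),
}

def mean(values):
    """Unweighted arithmetic mean (assumed definition of MIOU)."""
    return sum(values) / len(values)

# Largest gap between the recomputed mean and the reported MIOU.
max_gap = max(abs(mean(per_class) - reported)
              for per_class, reported in TABLE_IOU.values())

# Method with the highest reported MIOU.
best = max(TABLE_IOU, key=lambda name: TABLE_IOU[name][1])

for name, (per_class, reported) in TABLE_IOU.items():
    print(f"{name:9s} recomputed={mean(per_class):6.2f} reported={reported:5.1f}")
print(f"largest gap: {max_gap:.3f}, best reported MIOU: {best}")
```

Under this assumption, the recomputed means agree with the reported MIOU column to within about 0.1 for every method.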

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
