1. Introduction
Mineral oil is commonly used as an insulator and coolant for power transformers, which are critical pieces of equipment in power substations. Mineral oil leaks need to be detected quickly and reliably: a reduced amount of mineral oil lowers the insulation strength of the power transformer and can even lead to breakdown, short circuits, fires, explosions, and other faults [1,2]. Thus, detecting the leaked oil of a power transformer has become a routine task to ensure the safe operation of a substation [3,4].
The traditional method of detecting oil leakages from a power transformer relies on human operators visually inspecting transformers in a prearranged inspection cycle, which can result in late detection and inefficient troubleshooting. With the wide application of inspection robots and video surveillance in substations, many surveillance images can be captured quickly, which helps to detect oil leakages early and maintain the normal operation of power transformers [5,6,7]. However, recognizing an oil leak from many captured images is difficult: manual visual inspection performs poorly because it demands the time and effort of skilled technicians. Thus, it is important to automatically detect transformer oil leakage from surveillance images.
With the development and application of machine vision technology, image recognition methods have been used to detect transformer oil leaks. Oil leaks should be detected as quickly as possible because early detection helps power operators take action to repair the transformer promptly. Because leaked oil appears yellow and transparent under natural light, it is difficult to recognize automatically using image processing. Thus, to detect oil leakages effectively, most researchers exploit the fluorescence of mineral oil when it is irradiated with ultraviolet (UV) light [8,9]. Using fluorescence imagery, image processing is typically applied to recognize oil leaks [3]. Based on change detection and a double Gaussian fit of the greyscale histogram, oil leakages of current transformers have been detected [10]. However, traditional image processing methods have limitations, such as low accuracy on images with complex backgrounds and detection thresholds that are difficult to set adaptively under varied illumination.
In recent years, with the rapid development of deep learning (DL) in the computer vision field, DL has provided a new approach to the detection of oil leakage. Because DL can automatically learn and extract higher-level latent features from complex image backgrounds by constructing deep neural networks, DL methods have been widely used in object detection tasks such as pedestrian detection, vehicle detection, and medical imaging and tumour detection [11,12,13]. DL-based object detection methods can mainly be categorized as two-stage or single-stage. Two-stage detectors first generate region proposals and then identify whether an object exists in each candidate region. Representative two-stage detectors include the region-based convolutional neural network (R-CNN) [14], Fast R-CNN [15], and Faster R-CNN [16]. A long detection time is the main disadvantage of two-stage detectors, since they operate in two stages. In contrast, single-stage detectors predict categories and locations directly in a single pass. Representative single-stage detectors include the Single Shot Detector (SSD) [17] and the YOLO (You Only Look Once) series [18], e.g., YOLOv1∼YOLOv4 [19,20,21].
Although there are many DL networks for object detection, only a few, e.g., the U-net network [4,22], YOLO [23], and LSTM combined with a Genetic Algorithm [24], have recently been applied to detect leaked mineral oil from transformers. Among these networks, the U-net network has achieved the highest accuracy of leaked oil detection in experimental research. Additionally, U-net networks can segment the target object (i.e., oil), which is conducive to evaluating the level of oil leakage and provides a basis for further research. However, when U-net is used in a practical scenario, we face the following issues. First, the detection performance of DL methods generally relies on a large amount of labeled training data and on the constructed network model; in practice, however, the probability of transformer oil leakage is low, so the number of surveillance images containing insulating oil leakage is limited. Second, due to the varied environments in which transformers are installed and the fluidity of insulating oil, the areas stained by leaked oil exhibit a variety of shapes in the images. These issues make it difficult for conventional U-net networks to focus fully on the characteristics of oil pixels, causing them to misjudge non-oil portions of an image as leaked oil or to miss oil-stained areas, which leads to low detection accuracy.
To accurately detect leaked oil and further evaluate the seriousness of oil staining, we propose a dual attention residual U-net (DAttRes-Unet) network that embeds a residual module and dual attention modules, from both spatial and channel-wise perspectives, in the existing U-net architecture. The primary contributions of this study are as follows: (1) integrating spatial attention and channel attention in the decoder path recalibrates the feature responses towards the most informative oil components; (2) a residual learning mechanism is introduced into the standard U-net framework, and the residual module is used in the encoder path of the proposed DAttRes-Unet network to overcome gradient vanishing, which is partly caused by the small amount of training data; (3) a transfer learning strategy is used: ResNet18 weights pretrained on the ImageNet dataset serve as the initial weights for training the DAttRes-Unet network, improving its performance given the limited size of the training dataset. Experimental results on fluorescence images of transformers captured in field substations show that the proposed DAttRes-Unet detects and segments leaked oil better than other improved U-nets (VGG16-Unet [25] and Res18-Unet) and than the variant of DAttRes-Unet with only one attention module.
The remainder of this paper is organized as follows. Section 2 briefly introduces the framework of the classical U-net. Section 3 describes the proposed method, including the design of the device that acquires the fluorescent images and the proposed DAttRes-Unet architecture with a residual block and two attention modules. Section 4 describes the experiments and results of this study, and conclusions are given in Section 5.
2. Basic Principles of the U-Net Network
In this section, we briefly introduce the classical U-net network [22]. Its structure is shown in Figure 1. The topology of the U-net is symmetrical, consisting mainly of the encoder on the left and the decoder on the right. The encoder is the backbone network that extracts abstract features and information and is composed of four basic convolutional units. Each basic convolutional unit consists of a traditional convolutional layer, a batch normalization layer, and a Rectified Linear Unit (ReLU) activation layer, followed by a pooling layer for downsampling. Thus, the encoder uses downsampling to gradually increase the image depth for pixel classification, while the decoder uses upsampling to restore the image size for pixel positioning.
The decoder is the feature enhancement network that precisely positions and gradually restores the original size of the image; it consists of four upsampling layers, each followed by a ReLU, and four skip connection layers. The upsampling layer expands a deep feature map into a larger one. Since downsampling is performed four times in the encoder, upsampling is symmetrically performed four times in the decoder to restore the extracted feature maps to the size of the original images. The skip connection layer fuses features of different levels from the encoder and the decoder. This operation supplements the pixel position information lost during downsampling and improves segmentation accuracy. Finally, a softmax classifier is utilized for pixel-by-pixel classification to achieve semantic segmentation.
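As an illustration of the symmetric four-level encoder/decoder described above, the following sketch traces the spatial sizes of the feature maps through the four 2× downsamplings and the four mirroring 2× upsamplings. It is a minimal sketch; the 256 × 256 input size is an assumption for illustration, not a figure from this paper.

```python
def unet_spatial_sizes(input_size, depth=4):
    """Spatial size after each 2x downsampling in the encoder and the
    sizes restored by the symmetric 2x upsamplings in the decoder."""
    encoder = [input_size]
    for _ in range(depth):
        encoder.append(encoder[-1] // 2)  # each max-pooling halves H and W
    decoder = list(reversed(encoder))     # each upsampling doubles H and W
    return encoder, decoder

enc, dec = unet_spatial_sizes(256)
print(enc)  # [256, 128, 64, 32, 16]
print(dec)  # [16, 32, 64, 128, 256]
```

The equal sizes at matching levels are what allow the skip connections to concatenate encoder and decoder feature maps directly.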
3. The Proposed Method
Although the classical U-net has achieved high accuracy in segmentation, stacking multiple traditional convolutional layers causes the network to suffer from gradient vanishing, which impedes the optimization of the network weights. To overcome this problem, four residual blocks replace the original basic convolutional units in the encoder to facilitate the training of deep networks. Additionally, considering that the downsampling operations in the encoder cause information loss, dual attention modules, i.e., attention gate (AG) and squeeze-and-excitation (SE) blocks, are integrated into the decoder to enhance the richer low- and high-level information and to assign higher weight coefficients to the more relevant channels. For these purposes, the DAttRes-Unet network is proposed in this work.
Figure 2 shows the procedure of detecting leaked oil from a transformer, including the acquisition device for fluorescence images, data processing, the proposed DAttRes-Unet network, and the training and testing of the proposed networks, described below.
3.1. Image Acquisition Device
To facilitate the acquisition of fluorescent images, we designed a portable acquisition device that integrates a UV light source and a digital camera. The appearance of the designed device is shown in Figure 3a. The rated power of the UV light source is 15 W, which can produce at most approximately 5000 µW/cm² of UV light. With a lithium battery as the supply, different light levels can be set according to practical situations. In our experiments, the focus of the UV light source was fixed, and the rated wavelength was 365 nm. The cooperative operation of the light source and the digital camera was implemented through a control unit, and its simplified circuit is shown in Figure 3b, where the symbols “S” and “KT” denote the primary switch and time relay, respectively, and the symbols “L” and “Camera” denote the UV light source and digital camera, respectively.
To collect oil-stained images, the primary switch “S” was first turned on to concurrently power the UV lamp and the coil of the time relay. To guarantee that the UV light source provided stable illumination, the camera was automatically turned on after the time relay reached a predetermined delay (10 s in this study). The operators then used the camera to acquire fluorescent images under UV light. After multiple fluorescent images were collected, they were transferred to a computer via a USB interface and processed by the constructed DAttRes-Unet network to automatically recognize the transformer oil leakage area.
3.2. DAttRes-Unet Architecture
The proposed DAttRes-Unet model is an end-to-end network for leaked oil detection, and its architecture is similar to that of the classical U-net, symmetrically consisting of an encoder and a decoder, as shown in Figure 4. In the encoder, the basic convolutional unit, composed of a 3 × 3 convolutional layer, batch normalization, and a ReLU activation layer, is the same as in the classical U-net. A max-pooling layer with a 2 × 2 kernel and a stride of 2 is used to select the most powerful features. Different from the classical U-net shown in Figure 1, four residual blocks are embedded in the encoder to overcome the vanishing gradient problem. In the decoder, unlike in the classical U-net, an AG block is embedded to enhance the richer low- and high-level information by weighting oil pixels. After a feature concatenation layer, an SE block is embedded as channel attention to assign higher weight coefficients to the more relevant channels. Finally, a softmax layer is used to obtain the final segmentation results. The details of the residual block and the SE block are described in the following subsections.
3.2.1. Residual Block
To mitigate network degradation and realize feature reuse, the residual block introduces a shortcut connection and an element-wise addition operation, making the network more accurate without any extra parameters. By stacking residual blocks [26], different residual networks [27], e.g., ResNet18, ResNet34, ResNet50, and ResNet101, can be constructed as the encoder within the U-net architecture.
Considering that the number of acquired fluorescence images of transformer oil leakage is limited, adopting a deep residual network model, whose number of training parameters increases dramatically, would make the network difficult to train and leave the trained model prone to over-fitting. Secondly, since ResNet18 has been well trained on the ImageNet dataset with good performance, we can adopt a transfer learning strategy and use the trained weights of ResNet18 as initial weights for leakage oil images, which increases the training speed and improves the performance of the network. Thus, as a trade-off between computation cost and accuracy, we mainly adopted ResNet18 as the encoder path to extract features in our DAttRes-Unet network. Further, to fit the U-net architecture, we removed the fully connected layer of the original ResNet18 network and only used its basic convolutional units to construct the residual blocks.
The structure of the residual block is shown in Figure 5 and primarily consists of two convolution units and an identity mapping. Each convolution unit is composed of a 3 × 3 convolutional layer, batch normalization, and a ReLU activation layer. The identity mapping connects the input and output of the residual block, and the residual operation is denoted as

$$x_{l+1} = x_l + F(x_l, W_l), \qquad (1)$$

where $x_l$ and $x_{l+1}$ are the input and output vectors of the $l$-th residual unit, respectively; $F(\cdot)$ is the mapping function for the residual path; and $W_l$ is the trainable weight. From (1), the output $x_{l+1}$ of the residual unit can approximate the identity mapping (i.e., $x_{l+1} \approx x_l$), even when $F(x_l, W_l) \to 0$. Taking the derivative of Equation (1), we have $\partial x_{l+1}/\partial x_l = 1 + \partial F(x_l, W_l)/\partial x_l$. The constant term 1 guarantees that the gradient does not vanish even when $\partial F/\partial x_l$ is small, which implies that the residual operation avoids gradient vanishing and accelerates network convergence. By stacking the residual blocks, we can obtain hierarchical latent feature maps of the input fluorescence image, including the low-level detailed information from shallow blocks and the high-level semantic information from deep blocks.
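The identity-mapping argument above can be checked numerically. The following is a minimal 1-D sketch of $x_{l+1} = x_l + F(x_l, W_l)$ and its gradient $1 + \partial F/\partial x_l$; the toy residual function $F(x) = wx$ and its values are illustrative assumptions, not the paper's network.

```python
import numpy as np

def residual_forward(x, w):
    """Residual unit x_{l+1} = x_l + F(x_l) with a toy branch F(x) = w * x."""
    return x + w * x

def residual_grad(w):
    """d x_{l+1} / d x_l = 1 + dF/dx; the leading 1 keeps it from vanishing."""
    return 1.0 + w

x = np.array([0.5, -1.0, 2.0])
w = 1e-6                              # near-degenerate residual branch, F -> 0
y = residual_forward(x, w)
print(np.allclose(y, x, atol=1e-5))   # output collapses to the identity mapping
print(residual_grad(w))               # gradient stays close to 1, never 0
```

Even when the residual branch contributes almost nothing, the gradient through the block stays near 1, which is the mechanism that lets the stacked blocks train without gradient vanishing.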
3.2.2. The Combination of AG and SE Attention Blocks
The proposed DAttRes-Unet network has four spatial attention modules and four channel attention modules, as shown in Figure 4. We use the attention gate [28] as spatial attention to highlight the oil-stained region on the feature maps while suppressing the background and irrelevant parts. The SE block [29] is used as channel attention to assign higher weight coefficients to the more relevant channels [30]. These attention modules are described in the following subsections.
(1) The AG attention block
Four AG blocks in the decoder learn attention at four different resolution levels, and the structure of each AG module is shown in Figure 6. Given that $x^l$ is the feature map of the $l$-th layer, the attention coefficient $\alpha^l \in [0, 1]$ is used to determine focus regions and to suppress useless feature information. The output of the AG, $\hat{x}^l$, is the multiplication of $x^l$ with $\alpha^l$ as follows:

$$\hat{x}^l = \alpha^l \cdot x^l, \qquad (2)$$

where the operator “·” denotes element-wise multiplication. Through this multiplication, the AG attention module can overlook non-oil background information by giving more weight to feature maps with higher semantic information. $\alpha^l$ is obtained using additive attention, whose diagram is shown in Figure 6. As the result after upsampling, the attention gating signal vector $g$ is used for each pixel to determine focus regions. The additive attention is formulated as follows:

$$\alpha^l = \sigma_2\big(\psi^{T}\,\sigma_1(W_x^{T} x^l + W_g^{T} g + b_g) + b_\psi\big), \qquad (3)$$

where $\sigma_1$ is often chosen as the ReLU activation function [31] (i.e., $\sigma_1(x) = \max(0, x)$), and $\sigma_2$ is often chosen as the sigmoid function (i.e., $\sigma_2(x) = 1/(1 + e^{-x})$). $b_g$ and $b_\psi$ are bias terms, and $W_x$, $W_g$, and $\psi$ are linear transformations implemented using 1 × 1 convolutions in the channel direction of the input tensor. The parameter set $\Theta_{att}$ consists of $W_x$, $W_g$, $\psi$, $b_g$, and $b_\psi$, which are updated by the backpropagation algorithm during model training.
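The additive attention in Eqs. (2)–(3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the trained network: the tensor shapes, random weights, and zero biases are assumptions, and the 1 × 1 convolutions are written as per-pixel matrix products over the channel axis, which is what a 1 × 1 convolution computes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def attention_gate(x, g, Wx, Wg, psi, b_g, b_psi):
    """x, g: (C, H, W) feature and gating maps.
    Eq. (3): alpha = sigmoid(psi^T relu(Wx^T x + Wg^T g + b_g) + b_psi),
    Eq. (2): output = alpha * x (element-wise)."""
    # 1x1 convs as channel-direction matrix products at every pixel
    q = (np.einsum('cd,dhw->chw', Wx, x)
         + np.einsum('cd,dhw->chw', Wg, g)
         + b_g[:, None, None])
    q = np.maximum(q, 0.0)                                   # sigma_1 = ReLU
    alpha = sigmoid(np.einsum('c,chw->hw', psi, q) + b_psi)  # sigma_2 = sigmoid
    return alpha[None, :, :] * x, alpha                      # gated map, weights

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x, g = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
Wx, Wg = rng.standard_normal((C, C)), rng.standard_normal((C, C))
psi, b_g, b_psi = rng.standard_normal(C), np.zeros(C), 0.0
out, alpha = attention_gate(x, g, Wx, Wg, psi, b_g, b_psi)
print(out.shape, alpha.shape)  # (4, 8, 8) (8, 8)
```

Note that $\alpha^l$ is a single spatial map in $[0, 1]$ broadcast across all channels, which is why the AG acts as spatial (not channel) attention.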
(2) The SE attention block
Typically, convolutional blocks treat channel-wise features equally and thus have difficulty managing various shapes and sizes of the leaked oil region in the images. This issue encourages us to use the SE attention block to selectively emphasize informative feature maps by calculating channel-wise summary statistics.
Figure 7 shows the structure of the SE attention module.
We let $X \in \mathbb{R}^{H \times W \times C}$ be the feature map input of the SE module, where $C$ is the channel number, and $H$ and $W$ are the height and width of the feature maps, respectively. The SE module is composed of global average pooling, two fully connected (FC) layers, and their corresponding activation functions. To obtain the global information of each channel, $z \in \mathbb{R}^{C}$, global average pooling is performed on individual feature channels along the spatial dimensions so that the global spatial information is squeezed into a channel descriptor. The $c$-th element of $z$ is calculated as the global information of each channel:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \qquad (4)$$

where $(i, j)$ is the position within the $c$-th channel $x_c$.

To obtain the channel attention coefficient $s \in \mathbb{R}^{C}$, a multilayer perceptron (MLP) is constructed from two fully connected (FC) layers around a nonlinearity. The first FC layer is a dimensionality-reducing layer with reduction ratio $r$ (i.e., the output channel number is $C/r$), which is followed by a ReLU. In this study, $r$ is set to 16. The second FC layer is a dimensionality-increasing layer with the same channel number $C$ as the input $X$, and its result is fed into a sigmoid to obtain $s$, given by

$$s = \sigma_2\big(W_2\,\sigma_1(W_1 z)\big), \qquad (5)$$

where $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$. Finally, the output $\tilde{X}$ of the SE module is obtained by scaling the feature map input $X$ with $s$:

$$\tilde{x}_c = s_c \cdot x_c, \qquad (6)$$

where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$.
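The squeeze–excite–scale pipeline of Eqs. (4)–(6) can be sketched directly in NumPy. The random weights are illustrative assumptions; the reduction ratio is 2 here rather than the paper's 16 only because the toy input has just 8 channels.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def se_block(X, W1, W2):
    """X: (C, H, W); W1: (C/r, C); W2: (C, C/r).
    Eq. (4): squeeze by global average pooling;
    Eq. (5): excite through FC -> ReLU -> FC -> sigmoid;
    Eq. (6): rescale each channel by its coefficient s_c."""
    z = X.mean(axis=(1, 2))                  # (C,) channel descriptor
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0))  # (C,) attention coefficients
    return s[:, None, None] * X, s

rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 2
X = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
Xt, s = se_block(X, W1, W2)
print(Xt.shape, s.shape)  # (8, 16, 16) (8,)
```

In contrast to the AG block, which produces one weight per pixel, the SE block produces one weight per channel, so the two modules are complementary.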
3.2.3. Loss Function
To update the parameters of the proposed model, we used binary cross entropy as the loss function, defined as

$$L = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \,\big], \qquad (7)$$

where $N$ is the number of data, $y_i$ is the class label with a value of 0 or 1, and $p_i$ is the predicted probability that $y_i = 1$.
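As a small numeric check of Eq. (7), the following evaluates the binary cross entropy on a few made-up pixel labels and predicted probabilities (the sample values are illustrative, not data from this study):

```python
import numpy as np

def bce_loss(y, p, eps=1e-12):
    """Binary cross entropy of Eq. (7); eps-clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])  # ground-truth oil / non-oil labels
p = np.array([0.9, 0.1, 0.8, 0.2])  # predicted probabilities of "oil"
print(round(bce_loss(y, p), 4))     # 0.1643
```

The loss approaches 0 as the predictions approach the labels and grows without bound for confident wrong predictions, which is what drives the pixel-wise updates during backpropagation.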
5. Conclusions
In this paper, a new DAttRes-Unet network designed to segment oil-stained regions by integrating ResNet and dual attention modules is proposed. The proposed model uses pretrained ResNet18 as the encoder to mitigate the issues caused by scarce training data and by gradient vanishing during backpropagation. Combining spatial and channel attention modules in the decoder improves the feature representation of oil-stained regions while suppressing the background. Experimental results showed that the proposed model outperforms the commonly used VGG16-Unet and Res18-Unet, reaching an accuracy of 98.49%. Extensive ablation studies also confirm the effectiveness of the dual attention modules.
In spite of the good performance of the DAttRes-Unet network, some aspects remain open for improvement. For example, the SE attention module or other attention modules could also be applied in the encoder to emphasize more features in the image, and the influence of embedding attention modules in the encoder versus the decoder needs further study. The hyperparameters (e.g., $\beta_1$ and $\beta_2$) of the Adam optimizer have a considerable influence on network performance, and their sensitivity needs to be analyzed. The proposed network, trained with fluorescence images, has not been tested on datasets of other categories, so its generality needs to be verified on other segmentation tasks. Additionally, the generality of the proposed network could be improved with strategies such as dropout, which disables feature detectors with a random probability at each training epoch, and a careful selection of the loss function. Further, to avoid over-fitting, we will continue to collect more types of fluorescent images with leaked oil to increase the amount of training data and will improve the architecture of the network through the choice of learning mode and optimization method.