1. Introduction
Wind turbines are the main equipment for converting natural wind energy into electricity [1]. In recent years, as large numbers of wind turbines have been put into operation and the wind power industry has matured, the inspection and maintenance of wind turbines have become a challenge. Wind farms are generally located in remote areas, with scattered equipment and inconvenient transportation [2]. Manual inspection of wind farms is not only time-consuming and labour-intensive but also fails to detect some abnormal conditions of wind turbines in time. Moreover, staff entering the wind turbine nacelle for inspection face the risk of working at height. In contrast, an inspection robot can work continuously around the clock and can monitor several important parts of the wind turbine nacelle in real time (e.g., checking the level of grease in the generator oil box, whether the oil in the waste oil tray under the hydraulic station has overflowed, and whether the water pump has leaked). Some equipment abnormalities in the nacelle can thus be found in time and reported to maintenance personnel for prompt treatment, which prolongs the unit's service life and saves manpower. Therefore, studying the detection of multiple objects in the wind turbine nacelle is not only a prerequisite for the inspection robot to monitor the condition of multiple vital parts of the nacelle but is also of great significance for improving the efficiency of the inspection robot.
The development of machine learning has provided new ideas for object detection algorithms. Deng et al. employed a classifier based on the Histogram of Oriented Gradients (HOG) and a Support Vector Machine (SVM) to identify and classify defect types on wind turbine blades, effectively improving the identification accuracy of scratch-type, crack-type, sand-hole-type, and speckle-type defects [3]. Abedini et al. applied the Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), Features from Accelerated Segment Test (FAST), Brute-Force, and Fast Library for Approximate Nearest Neighbors (FLANN) methods to detect wind turbine towers, achieving a detection accuracy of 0.894 [4]. Zhu et al. used a semi-supervised method based on anomaly detection to find internal defects in aluminum conductor composite core (ACCC) wires, achieving a detection accuracy of 0.761 [5]. Although these methods improve detection accuracy to varying degrees, the final accuracies lie roughly between 0.75 and 0.92, which is too low to meet the accuracy requirements of multi-object detection in the wind turbine nacelle. Moreover, these methods have poor robustness, since they apply only to specific environments and are unsuitable for the complicated environment of wind turbine nacelles.
With the development of computer technology, deep learning-based object detection algorithms have been widely applied in recent years [6,7]. Current deep learning-based object detection algorithms fall into two categories: two-stage algorithms, such as the Region-based Convolutional Neural Network (R-CNN) [8], Fast R-CNN [9], and Faster R-CNN [10]; and one-stage algorithms, such as the You Only Look Once (YOLO) series [11,12,13,14] and the Single Shot MultiBox Detector (SSD) [15]. Two-stage object detection algorithms achieve high detection accuracy but have slow inference speeds, whereas one-stage algorithms offer much faster inference with comparable, and sometimes higher, detection accuracy. Ran et al. achieved 58% detection accuracy at 4.9 frame s−1 using Faster R-CNN, a two-stage algorithm, to detect defects in wind turbine blades, and 75.6% detection accuracy at 35.8 frame s−1 using YOLOv3, a one-stage algorithm, on the same task [16]. Hu et al. obtained 83.32% detection accuracy at 3.2 frame s−1 using Faster R-CNN to detect fastener defects on high-speed railways, and 82.34% detection accuracy at 47.27 frame s−1 using YOLOv5, a one-stage algorithm, on the same task [17]. Therefore, the one-stage object detection algorithm is more suitable for detecting multiple objects (oil box, pump, oil pan) in the wind turbine nacelle.
Liu et al. [18] proposed YOLOX, a one-stage object detection algorithm suited to industrial deployment, which incorporates the high-accuracy advantages of the YOLO family of networks and has been widely adopted in various applications. Yi et al. effectively improved the accuracy and speed of this algorithm for strip steel surface defect detection by enhancing the feature extraction layer and feature pyramid network of the YOLOX model [19]. Wu et al. proposed an improved YOLOX-TR model based on the Transformer encoder and structurally reparameterized VGG (RepVGG) blocks to achieve end-to-end tank detection and classification in dense regions of large-scale synthetic aperture radar (SAR) images [20]. Ru et al. presented a lightweight ECA-YOLOX-Tiny unmanned aerial vehicle inspection model by embedding the efficient channel attention (ECA) module into the lightweight network model YOLOX-Tiny, effectively improving the localization accuracy of defective insulator self-detonation areas [21]. These methods are designed to detect objects across a range of sizes and achieve particularly high accuracy on small objects. However, as shown in Figure 1, the space inside the wind turbine nacelle is small, so the detected objects occupy a relatively large portion of the inspection robot's monitoring image, and the probability that small objects appear in multi-object detection in the nacelle is extremely low. In addition, the inspection robot monitors the detected objects in the nacelle only briefly, leaving little time to check the status of multiple objects. When the original YOLOX algorithm is applied to multi-object detection in the nacelle, its detection accuracy is high but its inference speed is slow. The layer structure of the original YOLOX associated with detecting large and medium objects is sufficient for detecting the multiple objects in the nacelle, whereas the layer structure associated with detecting small objects contributes little to the accuracy of multi-object detection in the nacelle while adding redundant computation during inference and lengthening the inference time; it is therefore unsuitable for multi-object detection in the wind turbine nacelle.
An algorithm with high accuracy, low latency, and the capability of being deployed on embedded devices is needed for multi-object detection in the wind turbine nacelle. Therefore, this study proposes a multi-object detection algorithm for wind turbine nacelles based on an improved YOLOX-Nano. The main contributions of this paper are as follows:
The feature extraction layer of YOLOX-Nano is replaced by CSPDarkNet-Tiny, the backbone network of YOLOv4-Tiny, to improve the speed of feature extraction for images.
The detection layer structure associated with the 8× downsampling rate feature layer in YOLOX-Nano is removed to reduce the model computation and speed up the inference.
The CSPLayer module of the YOLOX-Nano feature pyramid layer is replaced by the CSPBlock module to speed up feature fusion across feature layers with different downsampling rates.
2. Models and Datasets
2.1. Dataset
The dataset of multiple objects in the nacelle was collected at a wind farm in Northwest China. It was gathered from multiple wind turbine nacelles by an orbital inspection robot inside the nacelle using Hikvision surveillance cameras at different times, locations, and lighting conditions. The dataset contains 5000 RGB images with resolutions of 1920 × 1080 and 480 × 270. As shown in Figure 2, (a) shows the hydraulic station in the wind turbine nacelle, where the inspection robot needs to determine whether the waste oil tray under the hydraulic station is full; (b) shows the pump in the wind turbine nacelle, where the inspection robot needs to judge whether the pump is leaking water; and (c) shows the oil box that supplies grease to the generator, where the inspection robot needs to check whether there is grease in the oil box. Detecting these components is a prerequisite for the condition monitoring of the oil pan, oil box, and water pump in the nacelle by the inspection robot.
The LabelImg annotation tool was used to annotate the 5000 collected images, of which 3000 were randomly assigned to training and 2000 to testing. Models of different sizes have different learning abilities: for large models with stronger learning ability, stronger data augmentation can effectively improve generalization, whereas for small models with weaker learning ability, stronger data augmentation prevents the model from fitting well. All the work in this paper builds on the lightweight YOLOX-Nano network model. Therefore, only weak data augmentation methods, such as random horizontal flipping, random scaling, random translation, and random Hue, Saturation, Value (HSV) jitter, are used during training to enhance generalization while preserving the model's fitting ability.
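As a concrete illustration, the weak augmentations listed above can be sketched in a few lines of Python. The function below is a minimal example; its parameter ranges (flip probability, scale and translation ranges, HSV gains) are our own assumptions rather than the settings used in this study.

```python
import cv2
import numpy as np

def weak_augment(image: np.ndarray) -> np.ndarray:
    """Weak augmentation: random horizontal flip, random scaling,
    random translation, and random HSV jitter. Parameter ranges are
    illustrative assumptions, not the paper's values."""
    h, w = image.shape[:2]

    # Random horizontal flip with probability 0.5.
    if np.random.rand() < 0.5:
        image = cv2.flip(image, 1)

    # Random scaling and translation via a single affine warp.
    scale = np.random.uniform(0.8, 1.2)
    tx = np.random.uniform(-0.1, 0.1) * w
    ty = np.random.uniform(-0.1, 0.1) * h
    M = np.array([[scale, 0.0, tx], [0.0, scale, ty]], dtype=np.float32)
    image = cv2.warpAffine(image, M, (w, h), borderValue=(114, 114, 114))

    # Random HSV jitter: perturb hue, saturation, and value independently.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-5, 5)) % 180
    hsv[..., 1] = np.clip(hsv[..., 1] * np.random.uniform(0.7, 1.3), 0, 255)
    hsv[..., 2] = np.clip(hsv[..., 2] * np.random.uniform(0.7, 1.3), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```

In a detection pipeline, the same flip and affine transform must of course also be applied to the bounding-box annotations.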
2.2. YOLOX-Nano Network Model
YOLOX, as a current mainstream one-stage object detection algorithm, builds on YOLOv3 and incorporates the advantages of the earlier YOLO series networks. It also introduces the decoupled head, the Simplified Optimal Transport Assignment (SimOTA) dynamic label assignment scheme, and an anchor-free detection method, improving both detection accuracy and speed. YOLOX is available in various models (e.g., YOLOX-s, YOLOX-m, YOLOX-l, YOLOX-x, YOLOX-DarkNet53, YOLOX-Tiny, and YOLOX-Nano).
YOLOX-Nano, one of the lightest YOLOX models, consists of a backbone network, a neck module, and a head module; the network structure is shown in Figure 3. The backbone network is mainly employed to extract image feature information. It follows the CSPDarkNet backbone design used in YOLOv5 [22], with part of the regular convolutions changed to depthwise separable convolutions (DWConv) to reduce the number of network parameters. It also adopts the Spatial Pyramid Pooling-Fast (SPPF) structure proposed by YOLOv5 author Glenn Jocher et al., which fuses local and global image feature information and enriches the feature representation capability of the backbone network.
The neck module consists of a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN) [23,24]. The top-down semantic information of the FPN is laterally connected with the bottom-up semantic information of the PAN, fusing deep semantic information with shallow semantic information and ensuring that the neck module outputs high-resolution, strongly semantic multi-layer features.
While the head module of previous YOLO series models used a single convolutional module to predict both the bounding-box regression and the category, the head module of YOLOX-Nano adopts the decoupled head proposed by YOLOX, which uses separate convolutional branches to predict the category and the bounding-box regression, effectively improving the convergence speed and accuracy of the network model.
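A simplified single-level sketch of such a decoupled head is shown below; the stem width and branch depths are illustrative assumptions, not the exact YOLOX-Nano configuration.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class DecoupledHead(nn.Module):
    """YOLOX-style decoupled head for one feature level: category
    prediction and bounding-box regression / objectness prediction run
    through separate branches instead of one shared convolution."""
    def __init__(self, c_in, num_classes, width=64):
        super().__init__()
        self.stem = conv_bn_act(c_in, width, k=1)
        self.cls_branch = nn.Sequential(conv_bn_act(width, width), conv_bn_act(width, width))
        self.reg_branch = nn.Sequential(conv_bn_act(width, width), conv_bn_act(width, width))
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # per-class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box offsets (x, y, w, h)
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness score

    def forward(self, x):
        x = self.stem(x)
        cls_feat = self.cls_branch(x)
        reg_feat = self.reg_branch(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)
```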
2.3. Backbone Improvements
YOLOv4-Tiny is the lightest model in the YOLOv4 series. Its very fast detection speed is widely recognized, making it suitable for devices with tight computing resources. Its backbone network, CSPDarkNet-Tiny, consists mainly of CSPBlock modules and max pooling layers. The CSPBlock module divides the feature map into two parts and then recombines them through a cross-stage residual edge. This propagates the gradient flow along two different network paths, increasing the diversity of gradient information and the learning ability of the convolutional network, while also reducing the computational cost and speeding up image feature extraction to some extent [25,26]. In this study, to improve the inference speed of the network model without introducing too many parameters or reducing accuracy, we replace the backbone of the original YOLOX-Nano with the CSPDarkNet-Tiny backbone of YOLOv4-Tiny, halving the number of output channels of all its convolutions. The improved backbone network is shown in Figure 4.
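A PyTorch sketch of this CSPBlock structure follows our reading of the description above, with LeakyReLU activations and the channel-split convention of YOLOv4-Tiny; details may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k=3, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class CSPBlock(nn.Module):
    """CSPBlock from CSPDarkNet-Tiny: the feature map is split along the
    channel dimension, one half passes through two convolutions with a
    cross-stage residual concatenation, and the result is re-joined with
    the untouched half before 2x2 max pooling. The output has twice the
    input channel count."""
    def __init__(self, channels):
        super().__init__()
        c = channels
        self.conv1 = conv_bn_leaky(c, c)
        self.conv2 = conv_bn_leaky(c // 2, c // 2)
        self.conv3 = conv_bn_leaky(c // 2, c // 2)
        self.conv4 = conv_bn_leaky(c, c, k=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.conv1(x)
        route = x
        # Keep only the second half of the channels for the inner path.
        x = x[:, x.shape[1] // 2:, :, :]
        inner = self.conv2(x)
        x = self.conv3(inner)
        feat = self.conv4(torch.cat([x, inner], dim=1))
        # In YOLOv4-Tiny the pre-pooling branch also feeds the neck.
        return self.pool(torch.cat([route, feat], dim=1)), feat
```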
2.4. Neck Module and Head Module Improvements
The original YOLOX-Nano detects objects of different sizes on the 8×, 16×, and 32× downsampling rate feature layers of the network. Each feature point on the 8× downsampling rate feature layer has a small receptive field, so this layer focuses on local feature information and small objects in the image. Each feature point on the 32× downsampling rate feature layer has a large receptive field, so this layer focuses on global information and large objects; the 16× downsampling rate feature layer lies in between. The 8× layer of the original YOLOX-Nano is mainly used to detect small objects: its focus on local feature information aids the bounding-box regression of small objects, but it is ineffective for large and medium objects, whose bounding-box regression it handles poorly. Large and medium objects are detected mainly on the 32× and 16× downsampling rate feature layers. The space inside the wind turbine nacelle is small, the detected objects occupy a relatively large proportion of the pictures taken by the inspection robot, and the probability that small objects appear in multi-object detection inside the nacelle is extremely low. Therefore, we reduce the computation of the network model and speed up inference by removing the layer structure associated with the 8× downsampling rate feature layer; multiple objects in the nacelle are detected only by the layer structures associated with the 16× and 32× downsampling rate feature layers. Specifically, we remove the upsampling layer within the FPN for the 16× downsampling rate feature layer together with its associated layer structure, and we further remove the detection head used mainly for small objects. The model then predicts information about multiple objects in the nacelle on the 16× and 32× downsampling rate feature layers. To speed up the fusion of feature layers with different downsampling rates, we replace the CSPLayer module used for multi-scale feature fusion within the FPN with the CSPBlock module, and we add a convolution layer after the CSPBlock module to integrate the number of output channels. The improved network model is shown in Figure 5. The backbone network extracts feature information at the 16× and 32× downsampling rates, this feature information is fed into the FPN module for fusion, and finally the fused features are processed by the detection head to classify and regress the bounding boxes of the multiple objects in the wind turbine nacelle.
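The resulting two-level neck can be summarized in the following structural sketch. The pool-free CSPBlock-style fusion stage and the channel widths are our own assumptions based on the description above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_leaky(c_in, c_out, k=3, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class CSPFusion(nn.Module):
    """CSPBlock-style fusion stage (split / inner convolutions /
    cross-stage concatenation) without the downsampling max pool,
    followed by the 1x1 convolution that integrates the output
    channels, as described in Section 2.4."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = conv_bn_leaky(c_in, c_in)
        self.conv2 = conv_bn_leaky(c_in // 2, c_in // 2)
        self.conv3 = conv_bn_leaky(c_in // 2, c_in // 2)
        self.conv4 = conv_bn_leaky(c_in, c_in, k=1)
        self.integrate = nn.Conv2d(c_in * 2, c_out, 1)

    def forward(self, x):
        x = self.conv1(x)
        route = x
        x = x[:, x.shape[1] // 2:, :, :]
        inner = self.conv2(x)
        x = self.conv4(torch.cat([self.conv3(inner), inner], dim=1))
        return self.integrate(torch.cat([route, x], dim=1))

class TwoLevelNeck(nn.Module):
    """Simplified neck: only the 16x and 32x feature levels are kept;
    the 8x (small-object) branch and its upsampling path are removed."""
    def __init__(self, c16, c32):
        super().__init__()
        self.lateral = nn.Conv2d(c32, c16, 1)  # align 32x channels to the 16x level
        self.fuse = CSPFusion(c16 * 2, c16)

    def forward(self, feat16, feat32):
        top = F.interpolate(self.lateral(feat32), scale_factor=2, mode="nearest")
        out16 = self.fuse(torch.cat([feat16, top], dim=1))
        return out16, feat32  # detection heads run on these two levels only
```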
3. Experiment
The experimental environment of this study was built on the Windows 10 operating system, and the model was built, trained, and tested on the same PC. The computer configuration is as follows: the CPU is an 11th Gen Intel Core i5-11400H with 16 GB of RAM, the GPU is an NVIDIA GeForce RTX 3050 Laptop GPU (8 GB), the Python version is 3.7, and the deep learning framework is PyTorch 1.10.0. The GPU's general-purpose parallel computing architecture is CUDA 11.3, and the deep learning acceleration library is cuDNN 8.2.
In the field of object detection, precision, recall, average precision (AP), and mean average precision (mAP) are usually used as model performance evaluation metrics. Precision is the percentage of predicted positive samples that are truly positive. Recall is the percentage of positive samples that are correctly predicted as positive. AP is the area under the P–R curve of a single category, with recall and precision as the horizontal and vertical coordinates, respectively. mAP denotes the average of the APs of multiple categories. mAP@0.75 is the mAP obtained when a prediction counts as a positive sample only if the Intersection over Union (IoU) between the predicted box and the ground-truth box is greater than 0.75. mAP@0.5:0.95 averages the mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. In this study, mAP@0.75, mAP@0.5:0.95, and Frames Per Second (FPS) are used as the evaluation metrics of the model. Precision, recall, AP, and mAP are calculated as follows:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,\mathrm{d}R$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where TP stands for the number of correctly predicted positive samples, FP denotes the number of samples incorrectly predicted as positive, and FN is the number of positive samples that were not detected. N, P, and R represent the number of categories, the precision value, and the recall value, respectively.
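These definitions translate directly into code. The NumPy sketch below computes precision, recall, and a single-category AP from ranked detections using standard all-point interpolation; it is a generic illustration, not the evaluation script used in this study.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve for one category.
    scores: confidence of each detection; is_tp: whether the detection
    matched a ground-truth box at the chosen IoU threshold (e.g., 0.75);
    num_gt: number of ground-truth boxes of this category."""
    order = np.argsort(scores)[::-1]                 # rank detections by confidence
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp_cum = np.cumsum(hits)
    fp_cum = np.cumsum(~hits)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Integrate precision over recall (all-point interpolation).
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([1.0], precision, [0.0]))
    precision = np.maximum.accumulate(precision[::-1])[::-1]  # monotone envelope
    return np.sum(np.diff(recall) * precision[1:])

# mAP is the mean of per-category APs; mAP@0.5:0.95 additionally averages
# over the IoU thresholds 0.5, 0.55, ..., 0.95.
```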
In this study, we design two sets of experiments. The first set performs ablation experiments to determine the effectiveness of each improvement to the model. The second set comprehensively analyzes the performance of the improved YOLOX-Nano by comparing its experimental results with those of several current lightweight models. All models trained in this study use the same training parameters: the batch size is 8, the input image size is 416 × 416, the number of iterations is set to 100, and the initial learning rate is fixed at 0.00125.
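For clarity, the shared training setup can be collected in a single configuration object. This is a hedged sketch in which we read "the number of iterations" as training epochs; the field names are our own.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Shared hyperparameters used for every model in this study."""
    batch_size: int = 8
    input_size: tuple = (416, 416)  # network input resolution
    epochs: int = 100               # "number of iterations" in the text
    base_lr: float = 0.00125        # initial learning rate

cfg = TrainConfig()
```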
3.1. Model Ablation Experiments
In this study, we verify the effectiveness of each improvement to the original YOLOX-Nano through ablation experiments, as shown in Table 1. YOLOX-Nano-A denotes the model after replacing the backbone network of the original YOLOX-Nano with CSPDarkNet-Tiny. YOLOX-Nano-B is the model after removing the layer structure associated with the 8× downsampling rate feature layer from the original YOLOX-Nano. YOLOX-Nano-C is the model after removing that layer structure from YOLOX-Nano-A. The improved YOLOX-Nano is the model obtained from YOLOX-Nano-C by changing the CSPLayer in the FPN structure to a CSPBlock and adding a convolutional layer after it to integrate the channels. All these models are trained and tested with the same dataset, equipment, and training parameters.
Comparing the experimental results of YOLOX-Nano and YOLOX-Nano-A shows that replacing the backbone network of the original YOLOX-Nano with CSPDarkNet-Tiny and halving the output channels of all its convolutional layers improves the model's mAP@0.5:0.95 and mAP@0.75 by 0.12% and 0.18%, respectively, and raises the FPS from 81 frame s−1 to 84 frame s−1. The results show that the backbone improvement benefits both the accuracy and the inference speed of the model. After further removing the layer structure associated with the 8× downsampling rate feature layer from YOLOX-Nano-A, the mAP@0.5:0.95 of YOLOX-Nano-C improved by 0.04%, the mAP@0.75 decreased by 0.01%, and the FPS improved from 84 frame s−1 to 117 frame s−1; the accuracy of the model is thus almost unchanged, while the inference speed improved by 33 frame s−1. After removing the layer structure associated with the 8× downsampling rate feature layer from the original YOLOX-Nano, YOLOX-Nano-B shows decreases of 0.12% and 0.19% in mAP@0.5:0.95 and mAP@0.75, respectively, and a 17 frame s−1 improvement in FPS compared with the original YOLOX-Nano. Within the FPN layer of the original YOLOX-Nano, the global information and local features of the 32×, 16×, and 8× downsampling rate feature layers are fused and output; the 32× and 16× layers absorb the local feature information captured by the 8× layer, which effectively improves the detection accuracy of the model. Therefore, once the original YOLOX-Nano loses the layer structure associated with the 8× downsampling rate feature layer, its detection accuracy decreases while its parameter count shrinks and its inference speed rises. Comparing the original YOLOX-Nano, YOLOX-Nano-B, and YOLOX-Nano-C shows that YOLOX-Nano-C surpasses the other two in both detection accuracy and speed.
Therefore, compared with YOLOX-Nano-B, which directly removes the layer structure associated with the 8× downsampling rate feature layer from YOLOX-Nano, YOLOX-Nano-C, which replaces the backbone network before removing that layer structure, is more beneficial to both the detection accuracy and the speed of the model. The mAP@0.5:0.95, mAP@0.75, and FPS of the improved YOLOX-Nano rise by 0.14%, 0.14%, and 23 frame s−1, respectively, over YOLOX-Nano-C. Therefore, replacing the CSPLayer in the FPN with a CSPBlock and adding a convolutional layer after the CSPBlock to integrate the channels effectively improves both the model's accuracy and its speed; in particular, the inference speed improves by 19.66%.
From Table 1, we can ascertain that the mAP@0.75 and FPS of the improved YOLOX-Nano are improved by 0.44% and 72.8%, respectively, over the original YOLOX-Nano. This verifies that the improvements to YOLOX-Nano effectively raise the model's detection accuracy and, above all, its inference speed. The low latency of the improved YOLOX-Nano algorithm effectively improves the efficiency with which the inspection robot checks the status of multiple objects in the nacelle within a short period. Although the number of parameters of the improved YOLOX-Nano increases by 2.03 M compared with the original YOLOX-Nano, the improved model can still be deployed on embedded devices, and the cost of the extra parameters is far outweighed by its speed advantage.
To compare the detection accuracy of the improved YOLOX-Nano and the original YOLOX-Nano, three images randomly collected by the inspection robot were detected using each model. The detection results are shown in Figure 6: (a) shows the detection results of the original YOLOX-Nano, and (b) shows those of the improved YOLOX-Nano. In the first comparison, the confidence of the original YOLOX-Nano for the oil box was only 0.21, while that of the improved YOLOX-Nano was 0.84. In the second comparison, the confidence of the original YOLOX-Nano for the oil pan was only 0.12, whereas that of the improved YOLOX-Nano was 0.63. In the third comparison, the confidences of the original and improved YOLOX-Nano for the pump were 0.82 and 0.88, respectively. These three comparisons show that in actual detection the improved YOLOX-Nano achieves higher detection accuracy than the original model, demonstrating the effectiveness of the improvements.
3.2. Model Comparison
To verify the effectiveness of the improved YOLOX-Nano, we compared its experimental results with those of the original YOLOX-Nano, YOLOv4-Tiny, and YOLOX-Tiny on the same datasets and devices, as shown in Table 2. The improved YOLOX-Nano has higher detection accuracy than the other models, fewer parameters than YOLOX-Tiny and YOLOv4-Tiny, and a faster detection speed than YOLOX-Nano and YOLOX-Tiny. Although the improved YOLOX-Nano has more parameters than the original YOLOX-Nano, this does not prevent its deployment on embedded devices. The original YOLOX-Nano and YOLOX-Tiny have higher detection accuracy but slower inference speed, while YOLOv4-Tiny has a faster detection speed but lower detection accuracy. The improved YOLOX-Nano combines high detection accuracy with fast detection speed, and its relatively small number of parameters allows it to be deployed on embedded devices. Collectively, the overall performance of the improved YOLOX-Nano is better than that of the other lightweight network models.
To verify the performance of the improved YOLOX-Nano on an embedded device, we compared the results of the original YOLOX-Nano, YOLOv4-Tiny, YOLOX-Tiny, and the improved YOLOX-Nano when deployed on the Jetson Nano, an embedded device from NVIDIA. The hardware model is the Jetson Nano B01 (4 GB) and the system version is JetPack 4.6. As shown in Figure 7, the FPS of the original YOLOX-Nano, YOLOX-Tiny, YOLOv4-Tiny, and the improved YOLOX-Nano are 8.96 frame s−1, 10.53 frame s−1, 14.45 frame s−1, and 17.18 frame s−1, respectively. The improved YOLOX-Nano delivers a 91.74% improvement in detection speed over the original YOLOX-Nano and also outperforms the other lightweight models on the Jetson Nano. Ma et al. noted that the inference speed of a network model depends not only on the number of network parameters but also on the frequency of device memory access and the characteristics of the hardware platform [27]. The original YOLOX-Nano has only 0.88 M parameters, but its use of depthwise separable convolution increases the memory access frequency to some extent, leading to a lower inference speed than the other lightweight network models. The larger parameter counts of YOLOX-Tiny and YOLOv4-Tiny result in lower inference speed than the improved YOLOX-Nano when deployed on embedded devices with limited computational resources. The improved YOLOX-Nano keeps its parameter count below those of YOLOX-Tiny and YOLOv4-Tiny without introducing too many depthwise separable convolutions. Therefore, when deployed on the Jetson Nano, the improved YOLOX-Nano achieves a faster detection speed than the other lightweight network models. On the whole, the improved YOLOX-Nano performs better on embedded devices and is more suitable for deployment on them.
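FPS figures such as those above are typically obtained by timing warmed-up forward passes. The sketch below is our own generic measurement harness, not the authors' benchmarking code.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_size=(416, 416),
                warmup: int = 20, runs: int = 200) -> float:
    """Average frames per second over repeated single-image inferences."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    dummy = torch.randn(1, 3, *input_size, device=device)

    for _ in range(warmup):           # warm-up: stabilize clocks and caches
        model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()      # finish queued GPU work before timing

    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)
```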