Cleaning of Abnormal Wind Speed Power Data Based on Quartile RANSAC Regression
Abstract
1. Introduction
- (1)
- The first category relies mainly on statistical and clustering methods. Lou et al. employed the Optimal In-group Variance (OIV) method [12], which identifies anomalies quickly but is sensitive to how the data are grouped. Zheng et al. used the Local Outlier Factor (LOF) to distinguish normal from abnormal data [13]; the method adapts well and handles data with different density distributions. Reference [14] first used the quartile method to eliminate dispersed data and then applied k-means clustering to clean the stacked data, although the choice of parameters strongly affects the clustering result. Zhao et al. proposed a strategy combining Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and the quartile method [15]. Reference [16] constructed a combined Isolation Forest and mean shift algorithm; it cleans efficiently, but its results may degrade on datasets with high noise levels. Luo et al. designed a method based on density clustering and boundary extraction, but this approach incurs higher time costs [17].
- (2)
- The second category determines the upper and lower boundaries of the wind power curve for cleaning. In reference [18], a wind power curve model was established using the copula conditional function, which is highly effective for identifying sparse anomalous data. Villanueva and Feijóo proposed a real power curve model that fits the wind power within each wind speed range to a normal probability distribution and treats data more than three standard deviations from the mean as anomalies [19]. Wang et al. improved the binning algorithm for regional calculation [20], but data at regional boundaries may be difficult to fit accurately, which reduces accuracy. Cleaning with neural networks allows the network structure and parameters to be adjusted for different types of anomalies; examples include the Artificial Neural Network (ANN) algorithm [21] and Graph Convolutional Neural Networks combined with Long Short-Term Memory networks (GCN-LSTM) [22]. Neural network algorithms can extract features automatically; however, an overly complex model risks overfitting.
- (3)
- The third category applies image processing to wind power curves. Liang et al. used the pixel counting method to generate feature grayscale images and combined it with image threshold segmentation to eliminate abnormal data [23]. Wang et al. proposed a fast data cleaning algorithm that keeps the longest run of continuous pixels in each column and each row of the binary curve image [24]. Long et al. proposed a wind power abnormal data cleaning algorithm based on color space transformation and image feature detection of wind power curve images [25]. However, image processing methods require powerful computing resources and often entail relatively high time costs. In summary, the existing research provides references and theoretical support for subsequent work.
2. Wind Power Curve
- (1)
- Type Ⅰ abnormal data are lower horizontal band abnormal data: points where the wind speed is greater than the cut-in wind speed but the power is less than or equal to zero. In the wind power curve graph they form a horizontal band of dense data points at the lower end of the curve. The main causes are internal failures of wind turbines and shutdowns for maintenance.
- (2)
- Type Ⅱ abnormal data are sparse abnormal data: scattered, irregularly distributed points that display significant randomness but maintain a certain correlation with the standard power curve. The main causes include meteorological fluctuations, signal transmission noise, sensor failures, and various other unpredictable random factors.
- (3)
- Type Ⅲ abnormal data are stacked abnormal data: points that typically appear over a continuous period and cluster into one or more distinct horizontal bands in the middle region of the power curve. This type of abnormality is closely related to wind curtailment and communication failures. The technical factors that lead directly to wind curtailment include power system failures, insufficient system frequency regulation capability, and inadequate transmission and storage technologies [26]. In particular, wind curtailment and power rationing have become prominent issues restricting the sustainable and healthy development of the wind power industry.
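The rule for Type Ⅰ data is simple enough to state as a predicate. The following is a minimal sketch, assuming each record is a (wind speed, power) pair; the `Sample` type, the function name, and the 3.0 m/s cut-in speed are illustrative assumptions, not values taken from the paper.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    wind_speed: float  # m/s
    power: float       # kW


def is_type_one(sample: Sample, cut_in_speed: float) -> bool:
    """Type I rule: wind speed above cut-in, yet power is zero or negative."""
    return sample.wind_speed > cut_in_speed and sample.power <= 0.0


# Hypothetical records; the cut-in speed of 3.0 m/s is an assumed value.
samples = [Sample(2.0, 0.0), Sample(5.0, 0.0), Sample(6.0, -3.5), Sample(6.0, 420.0)]
flags = [is_type_one(s, cut_in_speed=3.0) for s in samples]
```

The first record is not flagged because the turbine is legitimately idle below the cut-in speed; only the second and third satisfy both conditions.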
3. Building an Outlier Data Cleaning Model Based on Quartile RANSAC
- 1.
- Data collection: utilize the SCADA system to collect wind speed, power, and rotational speed data from the wind farm to form a dataset.
- 2.
- Data preprocessing: compare the wind speed-power scatter plot with the standard wind power curve and filter the first type of abnormal data based on the basic operating principles of the standard wind power curve. Eliminate data points where wind speed (v), power (P), or rotational speed (n) is less than zero, and mark data as abnormal when the power is less than or equal to zero while the wind speed is greater than the cut-in wind speed.
- 3.
- Elimination of sparse abnormal data using the quartile method: sort the preprocessed data pairs by power in ascending order and divide the data into equal interval power bins. Apply the quartile method to filter the data within each power bin, marking wind speed data that fall outside the inner limits as abnormal and removing them from the dataset.
- 4.
- Elimination of stacked data based on RANSAC regression fitting: extend the original two-dimensional data to a three-dimensional space through polynomial features to fit more complex nonlinear relationships. On this basis, the RANSAC regression algorithm is employed to predict wind speed values: fit a model to randomly sampled points, calculate the distance of all data points to the fitted model, and classify points into inliers and outliers based on a threshold. Iterate until the model performance is optimized. Finally, a data point whose deviation exceeds the threshold is classified as abnormal; otherwise, it is considered normal.
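The RANSAC step in the list above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a quadratic model of wind speed as a function of power (features 1, P, P²), fits each candidate exactly through three random points, and keeps the candidate with the most inliers. The function names, the threshold, the iteration count, and the synthetic data are all hypothetical.

```python
import random


def fit_quadratic(points):
    """Return the quadratic v(p) passing exactly through three (p, v) points."""
    (p0, v0), (p1, v1), (p2, v2) = points

    def model(p):
        # Lagrange interpolation through the three sampled points.
        return (v0 * (p - p1) * (p - p2) / ((p0 - p1) * (p0 - p2))
                + v1 * (p - p0) * (p - p2) / ((p1 - p0) * (p1 - p2))
                + v2 * (p - p0) * (p - p1) / ((p2 - p0) * (p2 - p1)))

    return model


def ransac_quadratic(data, threshold, n_iter=200, seed=0):
    """Return a list of bools (True = inlier) for the best model found."""
    rng = random.Random(seed)
    best_mask, best_count = None, -1
    for _ in range(n_iter):
        sample = rng.sample(data, 3)
        if len({p for p, _ in sample}) < 3:
            continue  # repeated power values make the fit degenerate
        model = fit_quadratic(sample)
        mask = [abs(v - model(p)) <= threshold for p, v in data]
        count = sum(mask)
        if count > best_count:
            best_mask, best_count = mask, count
    return best_mask


# Synthetic data: 21 points on a smooth curve plus 5 stacked points at 500 kW.
curve = [(float(p), 3 + 0.01 * p - 4e-6 * p * p) for p in range(0, 1001, 50)]
stacked = [(500.0, v) for v in (9.0, 10.0, 11.0, 12.0, 13.0)]
mask = ransac_quadratic(curve + stacked, threshold=0.5)
```

Here the stacked points share one power value but spread over wind speeds, so no quadratic through them also agrees with most of the curve; the consensus model therefore leaves them outside the threshold. A production version would usually refit the model on all inliers by least squares.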
3.1. Data Preprocessing
3.2. Quartile Method
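- 1.
- Calculate the second quartile $Q_2$, which is the median of the sample $X = \{x_1, x_2, \ldots, x_n\}$ sorted in ascending order: $Q_2 = x_{(n+1)/2}$ when n is odd, and $Q_2 = \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$ when n is even.
- 2.
- Calculate the first and third quartiles $Q_1$ and $Q_3$: When n = 2k (k = 1, 2, …), divide X into two parts at $Q_2$, with $Q_2$ excluded from both parts of the data, and calculate the medians of the lower and upper parts, which are $Q_1$ and $Q_3$, respectively. When n = 4k + 3 (k = 0, 1, 2, …), there are $Q_1 = 0.75\,x_{k+1} + 0.25\,x_{k+2}$ and $Q_3 = 0.25\,x_{3k+2} + 0.75\,x_{3k+3}$. When n = 4k + 1 (k = 0, 1, 2, …), there are $Q_1 = 0.25\,x_{k} + 0.75\,x_{k+1}$ and $Q_3 = 0.75\,x_{3k+1} + 0.25\,x_{3k+2}$.
- 3.
- The interquartile range IQR can be obtained from $Q_1$ and $Q_3$: $\mathrm{IQR} = Q_3 - Q_1$.
- 4.
- Based on the IQR, the inner limits for identifying outliers in the data sample X are determined as follows: $[\,Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR}\,]$.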
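The four steps above can be sketched as a small function. This sketch assumes the common convention in which the n = 4k + 1 and n = 4k + 3 cases weight two adjacent order statistics by 0.25 and 0.75, and in which the inner limits lie 1.5 × IQR beyond the first and third quartiles; the function names are illustrative.

```python
def median(sorted_x):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_x)
    mid = n // 2
    return sorted_x[mid] if n % 2 else (sorted_x[mid - 1] + sorted_x[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the case split n = 2k, 4k + 3, 4k + 1."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two data points")
    q2 = median(x)
    if n % 2 == 0:
        # Even n: the median splits the data into two equal halves.
        q1, q3 = median(x[: n // 2]), median(x[n // 2:])
    elif n % 4 == 3:
        k = (n - 3) // 4  # 1-based x_{k+1} is x[k] in Python
        q1 = 0.75 * x[k] + 0.25 * x[k + 1]
        q3 = 0.25 * x[3 * k + 1] + 0.75 * x[3 * k + 2]
    else:
        k = (n - 1) // 4  # n = 4k + 1 with k >= 1
        q1 = 0.25 * x[k - 1] + 0.75 * x[k]
        q3 = 0.75 * x[3 * k] + 0.25 * x[3 * k + 1]
    return q1, q2, q3


def inner_limits(values, factor=1.5):
    """Lower and upper inner limits; factor=1.5 is the assumed multiplier."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


limits = inner_limits([1, 2, 3, 4, 5, 6, 7])
```

For the seven values 1 to 7 (n = 4k + 3 with k = 1), the quartiles are 2.25, 4, and 5.75, so the IQR is 3.5 and the inner limits are −3.0 and 11.0.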
3.3. RANSAC Regression Algorithm
4. Case Study
4.1. Data Description
4.2. Case Study on Data Cleaning of Wind Turbine with High Proportion of Stacked Abnormalities
- (1)
- Data Preprocessing: Under normal operating conditions, it is unreasonable for the generator to produce negative wind power. Therefore, values with wind speed, power, and rotational speed less than zero are eliminated from the scatter plot. Based on the basic operating principles of wind turbines, data with wind speeds exceeding the cut-in wind speed but power less than or equal to zero are also marked as abnormal, accurately identifying the first type of abnormal data.
- (2)
- Elimination of Sparse Outliers: After preprocessing, the data are sorted in ascending order of power and divided into intervals with a spacing of 25 kW. The quartile method is then applied to each interval to eliminate abnormal data points lying outside the inner limits. After data preprocessing and the application of the quartile method, most sparse outliers have been eliminated, revealing a clearer data distribution profile with very distinct boundaries for the stacked outliers. In the box plot generated by the quartile method for data within the power interval [1000, 1025] kW, most data points fall within the inner limits and are identified as normal data (marked blue), while those outside the inner limits are recognized as sparse outliers (marked gold).
- (3)
- Elimination of Stacked Outliers: When performing RANSAC regression fitting, the two-dimensional data are first extended to three-dimensional data through polynomial feature expansion to better capture the nonlinear relationship. Then, hierarchical sampling of the minimum effective sample subset is performed on the data, and the corresponding model parameters are calculated using the least variance estimation method. Subsequently, the deviation between each sample data point and the estimated model is calculated and compared with the threshold: if the deviation is less than the threshold, the point is considered normal data; if it is greater, the point is identified as abnormal data. Iterations continue until the model achieves the optimal effect. As shown in Figure 5, the clear boundaries of the stacked outliers reduce their impact on the curve fitting. After the RANSAC regression algorithm identifies the stacked outliers, the stacked abnormal data are clearly marked and the wind speed-power curve is clearly outlined. Gold marks the eliminated sparse outliers, green marks the eliminated stacked outliers, and blue marks normal data. The power curve fitted by the “bin” method on the cleaned data is highly consistent with the standard power curve, demonstrating the effectiveness of the proposed method for cleaning abnormal data with a high proportion of stacked outliers.
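The per-interval quartile filtering in step (2) can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's code: it groups samples into 25 kW power bins and flags wind speeds outside Q1 − 1.5·IQR and Q3 + 1.5·IQR, but it swaps in the standard library's `statistics.quantiles` (default exclusive method) for the quartile computation, which may differ slightly from the paper's quartile definition. The minimum bin size of four and the sample values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import quantiles


def flag_sparse_outliers(samples, bin_width=25.0, factor=1.5, min_points=4):
    """samples: list of (wind_speed, power). Returns bools, True = sparse outlier."""
    bins = defaultdict(list)
    for i, (_, power) in enumerate(samples):
        bins[math.floor(power / bin_width)].append(i)  # index of the power bin

    flags = [False] * len(samples)
    for indices in bins.values():
        if len(indices) < min_points:
            continue  # too few points for meaningful quartiles (assumed rule)
        speeds = [samples[i][0] for i in indices]
        q1, _, q3 = quantiles(speeds, n=4)
        iqr = q3 - q1
        low, high = q1 - factor * iqr, q3 + factor * iqr
        for i in indices:
            if not low <= samples[i][0] <= high:
                flags[i] = True
    return flags


# Six hypothetical samples in the 1000-1025 kW bin and one in a sparse bin.
samples = [(9.8, 1001.0), (9.9, 1004.0), (10.0, 1010.0), (10.1, 1015.0),
           (10.2, 1020.0), (14.0, 1012.0), (11.0, 1100.0)]
flags = flag_sparse_outliers(samples)
```

Only the 14.0 m/s reading is flagged: within its bin the other wind speeds lie between 9.8 and 10.2 m/s, so it falls well above the upper inner limit.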
4.3. Algorithm Comparison and Analysis
4.3.1. Comparative Experiment
4.3.2. Evaluation Metrics
5. Conclusions
- (1)
- A novel method for abnormal data cleaning based on a classification processing framework is proposed, which employs operational guidelines, the quartile method, and the RANSAC regression algorithm for three types of abnormal data. This staged approach significantly enhances the robustness and accuracy of cleaning data with a high proportion of stacked anomalies.
- (2)
- In a case study on cleaning abnormal data from wind turbines with a high proportion of stacked abnormalities, the high degree of consistency between the cleaned data and the standard power curve indicates that the cleaning effect is good. Furthermore, the proposed method is accurate when cleaning data from wind turbines with other types of abnormalities, which confirms the effectiveness of the quartile RANSAC algorithm.
- (3)
- To validate the advantages of the proposed method, it was compared with the quartile, isolation forest, and k-means algorithms. The cleaning results show that the proposed method significantly outperforms the other three algorithms in cleaning effectiveness. The evaluation metrics confirm this, with the proposed method achieving a more reasonable data deletion rate and excellent cleaning efficiency. Compared to the quartile method, which performed the best among the compared algorithms, the proposed method reduces MAE by 54%, 78%, 56%, and 15% and RMSE by 67%, 78%, 66%, and 18%, respectively, across the four wind turbines, demonstrating the better performance of the quartile RANSAC algorithm in abnormal data cleaning. The performance of the proposed algorithm depends to some extent on its parameter configuration: if the parameters are not set reasonably, the results may deviate. The authors will strive to address this issue in future work.
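The percentage reductions quoted in conclusion (3) can be checked against the MAE and RMSE values reported in the results table. The short calculation below copies the quartile and quartile RANSAC columns from that table; the variable and function names are illustrative.

```python
# (MAE, RMSE) per wind turbine, copied from the evaluation table.
quartile = {
    "No. 2": (0.0697, 0.1429),
    "No. 3": (0.1365, 0.2215),
    "No. 7": (0.0739, 0.1488),
    "No. 8": (0.0334, 0.0505),
}
quartile_ransac = {
    "No. 2": (0.0316, 0.0476),
    "No. 3": (0.0303, 0.0495),
    "No. 7": (0.0322, 0.0511),
    "No. 8": (0.0283, 0.0414),
}


def reduction_percent(baseline, improved):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - improved) / baseline


reductions = {
    turbine: tuple(reduction_percent(b, p)
                   for b, p in zip(quartile[turbine], quartile_ransac[turbine]))
    for turbine in quartile
}

for turbine, (mae_red, rmse_red) in reductions.items():
    print(f"{turbine}: MAE reduced by {mae_red:.1f}%, RMSE by {rmse_red:.1f}%")
```

The computed values agree with the quoted figures to within about one percentage point; the small differences come from rounding.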
References
- [1] Global Wind Energy Council, Global Wind Report 2024 (Global Wind Energy Council, 2024). Available online: https://gwec.net/global-wind-report-2024/ (accessed on 1 July 2024).
- [2] Li, Z.; Jiang, Y.; Guo, Q.; Hu, C.; Peng, Z. Multi-dimensional variational mode decomposition for bearing-crack detection in wind turbines with large driving-speed variations. Renew. Energy 2018, 116, 55–73.
- [3] Xu, Y.; Jia, L.; Yang, W. Correlation based neuro-fuzzy Wiener type wind power forecasting model by using special separate signals. Energy Convers. Manag. 2022, 253, 115173.
- [4] Tian, B.; Zhang, Y. Energy storage operation control strategy for smoothing wind power based on multi-objective cooperative game. High Voltage Eng. 2023, 49, 2546–2557.
- [5] Nielson, J.; Bhaganagar, K.; Meka, R.; Alaeddini, A. Using atmospheric inputs for Artificial Neural Networks to improve wind turbine power prediction. Energy 2020, 190, 116273.
- [6] McKinnon, C.; Carroll, J.; McDonald, A.; Koukoura, S.; Plumley, C. Investigation of isolation forest for wind turbine pitch system condition monitoring using SCADA data. Energies 2021, 14, 6601.
- [7] Maldonado-Correa, J.; Martín-Martínez, S.; Artigao, E.; Gómez-Lázaro, E. Using SCADA data for wind turbine condition monitoring: A systematic literature review. Energies 2020, 13, 3132.
- [8] Yang, W.; Tavner, P.J.; Crabtree, C.J.; Feng, Y.; Qiu, Y. Wind turbine condition monitoring: Technical and commercial challenges. Wind Energy 2014, 17, 673–693.
- [9] Liu, J.; An, B.; Zhang, W.; Gan, Q. Review of health condition evaluation of large wind turbines. Power Syst. Prot. Control 2023, 51, 176–187.
- [10] Wen, X.; Xu, Z. Wind turbine fault diagnosis based on ReliefF-PCA and DNN. Expert Syst. Appl. 2021, 178, 115016.
- [11] Elusakin, T.; Shafiee, M. Fault diagnosis of offshore wind turbine gearboxes using a dynamic Bayesian network. Int. J. Sustain. Energy 2022, 41, 1849–1867.
- [12] Lou, J.; Xu, J.; Lu, H.; Qu, C.; Li, S.; Liu, R. Wind turbine data cleaning algorithm based on power curve. Autom. Electr. Power Syst. 2016, 40, 116–121.
- [13] Zheng, L.; Hu, W.; Min, Y. Raw wind data preprocessing: A data-mining approach. IEEE Trans. Sustain. Energy 2014, 6, 11–19.
- [14] Zhao, Y.; Ye, L.; Zhu, Q. Characteristics and processing method of abnormal data clusters caused by wind curtailments in wind farms. Autom. Electr. Power Syst. 2014, 38, 39–46.
- [15] Hou, G.; Wang, J.; Fan, Y. Wind power forecasting method of large-scale wind turbine clusters based on DBSCAN clustering and an enhanced hunter-prey optimization algorithm. Energy Convers. Manag. 2024, 307, 118341.
- [16] Wang, W.; Yang, S.; Yang, Y. An improved data-efficiency algorithm based on combining isolation forest and mean shift for anomaly data filtering in wind power curve. Energies 2022, 15, 4918.
- [17] Luo, Z.; Fang, C.; Liu, C.; Liu, S. Method for cleaning abnormal data of wind turbine power curve based on density clustering and boundary extraction. IEEE Trans. Sustain. Energy 2021, 13, 1147–1159.
- [18] Ye, X.; Lu, Z.; Qiao, Y.; Min, Y.; O’Malley, M. Identification and Correction of Outliers in Wind Farm Time Series Power Data. IEEE Trans. Power Syst. 2016, 31, 4197–4205.
- [19] Wang, S.; Zhang, Z.; Wang, P.; Tian, Y. Failure warning of gearbox for wind turbine based on 3σ-median criterion and NSET. Energy Rep. 2021, 7, 1182–1197.
- [20] Wang, X.; Wang, Z. Wind speed-power data cleaning of wind turbine based on improved bin algorithm. Chin. J. Intell. Sci. Technol. 2020, 2, 62–71.
- [21] Li, T.; Liu, X.; Lin, Z.; Morrison, R. Ensemble offshore wind turbine power curve modelling–an integration of isolation forest, fast radial basis function neural network, and metaheuristic algorithm. Energy 2022, 239, 122340.
- [22] Li, L.; Liang, Y.; Lin, N.; Yan, J.; Meng, H.; Liu, Y. Wind speed cleaning method for wind turbine considering spatial-temporal correlation. Acta Energy Sol. Sin. 2024, 45, 461–469.
- [23] Liang, G.; Su, Y.; Chen, F.; Long, H.; Song, Z.; Gan, Y. Wind power curve data cleaning by image thresholding based on class uncertainty and shape dissimilarity. IEEE Trans. Sustain. Energy 2020, 12, 1383–1393.
- [24] Wang, Z.; Wang, L.; Huang, C. A fast abnormal data cleaning algorithm for performance evaluation of wind turbine. IEEE Trans. Instrum. Meas. 2020, 70, 5006512.
- [25] Long, H.; Xu, S.; Gu, W. An abnormal wind turbine data cleaning algorithm based on color space conversion and image feature detection. Appl. Energy 2022, 311, 118594.
- [26] Chen, H.; Chen, J.; Han, G.; Cui, Q. Winding down the wind power curtailment in China: What made the difference? Renew. Sustain. Energy Rev. 2022, 167, 112725.
- [27] Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
Wind Turbine | Total Number of Data | Quartile RANSAC R (%) | Quartile RANSAC T (s) | Quartile R (%) | Quartile T (s) | Isolation Forest R (%) | Isolation Forest T (s) | K-Means R (%) | K-Means T (s) |
---|---|---|---|---|---|---|---|---|---|
No. 2 | 38,855 | 11.13 | 0.42 | 3.33 | 0.18 | 5.14 | 0.90 | 9.93 | 0.56 |
No. 3 | 38,995 | 33.54 | 0.42 | 1.44 | 0.18 | 4.53 | 0.96 | 9.96 | 0.60 |
No. 7 | 43,324 | 12.93 | 0.42 | 4.27 | 0.17 | 5.38 | 1.06 | 9.94 | 0.67 |
No. 8 | 38,470 | 7.62 | 0.39 | 4.62 | 0.16 | 4.99 | 0.87 | 9.92 | 0.63 |

Wind Turbine Number | Original Data MAE | Original Data RMSE | Quartile RANSAC MAE | Quartile RANSAC RMSE | Quartile MAE | Quartile RMSE | Isolation Forest MAE | Isolation Forest RMSE | K-Means MAE | K-Means RMSE |
---|---|---|---|---|---|---|---|---|---|---|
No. 2 | 0.0982 | 0.2158 | 0.0316 | 0.0476 | 0.0697 | 0.1429 | 0.0853 | 0.1916 | 0.0758 | 0.1783 |
No. 3 | 0.1437 | 0.2582 | 0.0303 | 0.0495 | 0.1365 | 0.2215 | 0.1330 | 0.2389 | 0.1277 | 0.2309 |
No. 7 | 0.0912 | 0.1998 | 0.0322 | 0.0511 | 0.0739 | 0.1488 | 0.0786 | 0.1731 | 0.0648 | 0.1484 |
No. 8 | 0.0671 | 0.1665 | 0.0283 | 0.0414 | 0.0334 | 0.0505 | 0.0542 | 0.1338 | 0.0459 | 0.1193 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, F.; Zhang, X.; Xu, Z.; Dong, K.; Li, Z.; Liu, Y. Cleaning of Abnormal Wind Speed Power Data Based on Quartile RANSAC Regression. Energies 2024, 17, 5697. https://doi.org/10.3390/en17225697