1. Introduction
In addressing climate change, air pollution, and the energy security concerns stemming from conventional energy sources, numerous countries are vigorously advancing renewable energy development. The ongoing expansion of the total installed capacity of photovoltaic systems significantly mitigates these environmental issues. Nevertheless, the intrinsically high variability of photovoltaic power output presents considerable challenges to the safe and stable operation of power systems. Forecasting photovoltaic power generation is pivotal in proactively managing potential power supply gaps, optimizing resource distribution, judiciously allocating reserve capacity, and reducing the risks associated with power undersupply and oversupply. Photovoltaic power forecasting is categorized into long-term, medium-term, and short-term forecasts, based on the time horizon covered by the predictions [1]. Long-term forecasts, which extend beyond a month, are utilized for strategic energy planning and system expansion [2]. Medium-term forecasts, projecting the power output for the forthcoming days, assist in load balancing and power dispatch [3]. Short-term forecasts, focusing on power predictions ranging from a few minutes to several hours, are essential for real-time system scheduling, a swift demand response, and maintaining grid stability [4,5,6]. Given the high dependence of photovoltaic power generation on weather conditions and its inherent variability, the accuracy of these forecasts directly impacts the efficiency and stability of grid operations. Therefore, there is a need for greater precision and robustness in the forecasting models.
To address the photovoltaic power forecasting problem, numerous scholars both domestically and internationally have conducted extensive research on theoretical methods, including physical models [7], statistical models [8], machine learning models [9,10], and hybrid models that integrate these approaches [11]. Among these methods, machine-learning-based photovoltaic power forecasting has emerged as the dominant approach. Reference [12] introduced a photovoltaic power prediction method combining the genetic algorithm (GA) and particle swarm optimization (PSO) to refine the adaptive neuro-fuzzy inference system (ANFIS), with its effectiveness validated using data from the Beijing Goldwind microgrid system. Reference [13] proposed an advanced Improved Grey Wolf Algorithm (DIGWO) in conjunction with Bidirectional Long Short-Term Memory (BILSTM) for developing a fault diagnosis model for photovoltaic arrays. Reference [14] employed two hybrid models, namely Convolutional Neural Network-LSTM (CNN-LSTM) and Convolutional LSTM (ConvLSTM), for short-term photovoltaic power forecasting, corroborating their effectiveness with data from a Moroccan solar power plant. Reference [15] proposed a photovoltaic power prediction method utilizing an improved ant colony optimization algorithm (ACO) and support vector machine (SVM), which demonstrated a higher forecasting accuracy than a traditional SVM. Traditional machine learning methods exhibit drawbacks in terms of their generalization ability and prediction accuracy. As a forefront research area within machine learning, deep learning can learn the inherent patterns and hierarchical representations of the sample data more deeply, and it has been extensively researched in the context of photovoltaic power forecasting. Reference [16] utilizes the random forest algorithm for photovoltaic power prediction. Reference [17] employed an eight-layer fully convolutional network (FCN-8) and an enhanced bidirectional gated recurrent unit (EBiGRU) to develop a deep hybrid network. However, the complexity of the deep learning network architecture, particularly the choice of the number of hidden layers and nodes, significantly impacts the training outcomes. Improper network design often leads to issues such as falling into local minima and overfitting. These characteristics limit the further advancement of deep learning in photovoltaic power forecasting.
In recent years, tree ensemble algorithms have demonstrated significant advancements. In Kaggle data science competitions, tree ensemble algorithms, notably XGBoost, have outperformed many deep learning algorithms. Reference [18] employed XGBoost to develop a fault diagnosis model targeting various fault types in photovoltaic arrays. Reference [19] suggests developing a photovoltaic power prediction model for various weather types by integrating the Fuzzy C-Means (FCM) clustering algorithm with XGBoost. Reference [20] introduces an adaptive transfer learning framework for XGBoost-based photovoltaic power prediction, validating the model with distributed photovoltaic power generation data from the Alice Springs region. Reference [21] integrates LSTM with XGBoost to diminish the error rate of individual models and enhance their accuracy in photovoltaic power prediction. Reference [22] integrates the physical aspects of distributed photovoltaics with XGBoost for more accurate distributed photovoltaic forecasting.
Signal decomposition technology provides a powerful tool within the machine learning framework for photovoltaic systems, enhancing aspects such as data processing [23], feature extraction, performance optimization, and predictive maintenance [24]. In photovoltaic power forecasting, these technologies are primarily used to accurately extract key features from the time-series data on photovoltaic power, such as periodicity and trends, and are also employed for data denoising, which involves separating and removing environmental and equipment noise to enhance the accuracy of predictions. Furthermore, signal decomposition technology enables models to adaptively adjust according to changes in the environmental and operational conditions, thereby maintaining the flexibility and accuracy of the forecasting strategy. Reference [25] employed Fast Iterative Filtering Decomposition (FIFD) to extract the complex features of photovoltaic power time series. Reference [26] introduces Variational Mode Decomposition (VMD) to address the volatility of raw photovoltaic data. However, VMD requires presetting parameters such as the number of decomposition modal components and the quadratic penalty term, where inappropriate values may affect the prediction outcomes. Reference [27] utilized Empirical Mode Decomposition (EMD) to decompose photovoltaic power data. They employed the Sine Cosine Algorithm (SCA) and Extreme Learning Machine (ELM) to develop a model, validating its effectiveness with distributed photovoltaic data from the electrical department of SOA University in Bhubaneswar, Odisha, India. EMD can effectively remove noise interference and improve the prediction accuracy, but it faces the issues of frequency aliasing and poor noise robustness. Reference [28] employs Ensemble Empirical Mode Decomposition (EEMD) to decompose photovoltaic power data into multiple stable components, and by calculating the Sample Entropy (SE) of each component and reconstructing those with similar SE values, it reduces the superposition errors, lowers the computational costs, and enhances the prediction accuracy. Reference [29] proposes a secondary decomposition technique combining VMD and Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) to process historical PV output data, enhancing the model’s predictive performance. Thus, by employing decomposition techniques to analyze photovoltaic data, it is possible to effectively uncover the inherent nonlinear characteristics of the data, thereby significantly enhancing the precision and reliability of predictive models.
In the context of rapid advancements in artificial intelligence technology in photovoltaic power forecasting, this paper proposes a short-term photovoltaic power forecasting model based on ICEEMDAN-Bagging-XGBoost, where ICEEMDAN denotes Improved Complete Ensemble Empirical Mode Decomposition with Adaptive Noise. The paper first outlines the mechanisms of the algorithms involved and then divides the IMF components obtained from the ICEEMDAN decomposition into high-frequency, medium-frequency, and low-frequency components based on the zero-crossing rate of each component. Bagging-XGBoost is utilized for predicting the high-frequency and medium-frequency components, while the sparrow search algorithm (SSA) is incorporated to reduce the time required for XGBoost hyperparameter optimization. XGBoost-Linear is used to predict the low-frequency component, as it offers strong smoothing and curve-fitting capabilities along with a rapid calculation speed. The output of the first-layer prediction model is fed into the second-layer model, comprising SSA-XGBoost, for nonlinear fusion and reconstruction, yielding the final prediction result. In contrast to the traditional superposition reconstruction method, this study employs XGBoost to perform nonlinear fusion-based reconstruction of the predicted high-frequency, medium-frequency, and low-frequency components. In the case study, this method’s efficacy was validated using real power generation data from a photovoltaic power station in Hebei Province, China. The results demonstrate that the proposed model exhibits superior stability and smaller prediction errors than the other models considered, underscoring its practical value.
2. Basic Principles
2.1. ICEEMDAN Algorithm Mechanism
ICEEMDAN represents an advancement over EEMD and CEEMDAN, aiming to further enhance the precision in processing non-stationary and nonlinear signals. EEMD reduces mode mixing by adding white noise to the signal and performing multiple decompositions, yet it may leave residual noise. CEEMDAN builds upon this by adaptively adjusting the noise level to further decrease the reconstruction error. In contrast, ICEEMDAN introduces specific adaptive white noise during the extraction of each intrinsic mode function (IMF), effectively mitigating the impact of residual noise and false modal components, thus improving the accuracy of the signal decomposition and reconstruction. The detailed procedure encompasses the following steps:
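Step 1: Add special noise $E_1(w^{(i)})$ to the original signal $x$, as shown in Formula (1):
$$x^{(i)} = x + \beta_0 E_1\left(w^{(i)}\right), \quad i = 1, 2, \ldots, M \tag{1}$$
In the formula, $w^{(i)}$ is the i-th Gaussian white noise added, $\beta_0 = \varepsilon_0 \, \mathrm{std}(x) / \mathrm{std}\left(E_1(w^{(i)})\right)$, $\varepsilon_0$ is the noise standard deviation of the first decomposed signal, $\mathrm{std}(\cdot)$ is the standard deviation operator, and $E_k(\cdot)$ is the operator that computes the k-th mode of the EMD decomposition.
Step 2: Calculate the local average value of each $x^{(i)}$ signal, and obtain the residual $r_1$ of the first decomposition, as shown in Formula (2):
$$r_1 = \left\langle \mathcal{M}\left(x^{(i)}\right) \right\rangle \tag{2}$$
In the formula, $\langle \cdot \rangle$ is the operator for calculating the average value of the M signals, and $\mathcal{M}(\cdot)$ is the operator for calculating the local average value of the signal.
Step 3: Subtract the first residual $r_1$ from the original signal $x$, thereby obtaining the first IMF component of the original signal, as shown in Formula (3):
$$IMF_1 = x - r_1 \tag{3}$$
Step 4: Determine the second IMF component, $IMF_2$, as follows:
$$IMF_2 = r_1 - r_2 = r_1 - \left\langle \mathcal{M}\left(r_1 + \beta_1 E_2\left(w^{(i)}\right)\right) \right\rangle \tag{4}$$
Step 5: For k = 3, 4, …, N, calculate the k-th IMF component, $IMF_k$, as follows:
$$IMF_k = r_{k-1} - r_k = r_{k-1} - \left\langle \mathcal{M}\left(r_{k-1} + \beta_{k-1} E_k\left(w^{(i)}\right)\right) \right\rangle \tag{5}$$
Here, $\beta_k = \varepsilon_0 \, \mathrm{std}(r_k)$ for $k \ge 1$.
Step 6: Increment k by 1 and repeat Step 5 until the extraction of the final IMF component is achieved.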
2.2. XGBoost Algorithm Mechanism
XGBoost is a prominent example of the Boosting ensemble algorithm, falling within the category of a Gradient Boosting Decision Tree (GBDT). The underlying principle of the traditional GBDT algorithm involves utilizing CART as the base learner. In each iteration, the new base learner persistently adapts to the residual generated by the previous base learner, thereby reducing the loss function of this iteration expeditiously. Ultimately, the weak learners produced in each iteration are combined to form a strong learner. Building upon a GBDT, XGBoost incorporates a regularization term into the loss function to manage the model’s complexity, and subsequently executes a second-order Taylor expansion on the loss function. In contrast to a GBDT, XGBoost enhances the model’s generalization ability.
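The algorithm mechanism of XGBoost is as follows. For the dataset $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), the ensemble model of the tree is shown using Equation (6):
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{6}$$
In the formula, $\hat{y}_i^{(t)}$ is the prediction result after the t-th iteration, $\hat{y}_i^{(t-1)}$ is the sum of the prediction results of the previous t − 1 trees, and $f_t(x_i)$ is the newly added t-th tree. The loss function (objective function) of XGBoost consists of two parts:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k), \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{7}$$
In the formula, the first part of the objective function of the t-th iteration is the error between the predicted value $\hat{y}_i^{(t)}$ and the true value $y_i$, and the second part is the regularization term, that is, the sum of the complexity of each tree. $\gamma$ and $\lambda$ are the penalty coefficients, $T$ is the number of leaves in the tree, and $w$ is the leaf weight.
Given that the complexity of the first t − 1 trees is a known constant, Equation (7) can be reformulated as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{8}$$
where $C$ denotes the constant complexity of the first t − 1 trees.
XGBoost performs a second-order Taylor expansion on the loss function, defining $g_i$ and $h_i$ as the first and second derivatives of the loss function $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ with respect to $\hat{y}_i^{(t-1)}$, respectively. Substituting $g_i$ and $h_i$ into the objective function gives the approximate objective function:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C \tag{9}$$
Because the error $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ between the predicted value $\hat{y}_i^{(t-1)}$ of the first t − 1 trees and the actual value $y_i$ is already known in the t-th iteration, and the complexity constant $C$ is likewise a fixed value, neither affects the optimization, and the objective function is approximated as:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{10}$$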
The XGBoost algorithm adopts an incremental training method where, in each iteration, new trees are continually added to fit the preceding error values, thereby minimizing the objective function, as depicted in Equation (11).
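As a hedged illustration of the second-order approximation, the sketch below assumes a squared-error loss l = ½(y − ŷ)², for which the first derivative is ŷ − y and the second derivative is 1. For the samples falling into a single leaf, the weight that minimizes the approximate objective with an L2 penalty λ is w* = −G/(H + λ), where G and H are the sums of the first and second derivatives. The function names, the choice of loss, and the toy numbers are assumptions made for illustration and are not taken from the paper's implementation.

```python
def grad_hess_squared_error(y_true, y_prev):
    """First and second derivatives of 0.5 * (y - y_hat)^2 with respect to y_hat."""
    g = [p - y for y, p in zip(y_true, y_prev)]
    h = [1.0 for _ in y_true]
    return g, h


def optimal_leaf_weight(g, h, lam):
    """Weight w minimizing sum(g_i * w + 0.5 * h_i * w^2) + 0.5 * lam * w^2."""
    return -sum(g) / (sum(h) + lam)


def leaf_objective(g, h, lam, w):
    """Approximate objective contribution of one leaf that outputs weight w."""
    data_term = sum(gi * w + 0.5 * hi * w * w for gi, hi in zip(g, h))
    return data_term + 0.5 * lam * w * w


# Toy example: three samples in one leaf, previous trees predict 0 everywhere.
y_true = [1.0, 2.0, 3.0]
y_prev = [0.0, 0.0, 0.0]
g, h = grad_hess_squared_error(y_true, y_prev)
w_star = optimal_leaf_weight(g, h, lam=1.0)
print(w_star, leaf_objective(g, h, 1.0, w_star))
```

Without regularization the leaf would output the mean residual (2.0 here); the penalty λ shrinks the weight toward zero, which is one way the regularization term moderates model complexity.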
Within the XGBoost framework, users have the option to select from two types of base learners: the Tree Booster and the Linear Booster, with the latter also referred to as XGBoost-Linear. These learners offer flexible solutions tailored to accommodate diverse data characteristics and predictive task requirements. The Tree Booster elucidates the complex structure of the data through the construction of a series of decision trees. It enhances traditional Gradient Boosted Decision Trees with several optimizations, including efficient tree-splitting algorithms, regularization to prevent overfitting, and parallel processing techniques to expedite the training process. These improvements endow the XGBoost Tree Model with remarkable capabilities in handling nonlinear relationships and feature interactions, making it applicable to a wide range of datasets and machine learning tasks.
Conversely, XGBoost-Linear provides a succinct and efficient solution for scenarios where there exists a linear relationship between the features and the target or when dealing with high-dimensional data. By incorporating L1 and L2 regularization, XGBoost-Linear effectively mitigates overfitting, showcasing its advantages in terms of computational efficiency and model simplicity. Although the tree-based ensemble of XGBoost is more commonly utilized in practical applications, XGBoost-Linear demonstrates unique strengths under specific conditions, especially in datasets characterized by smoothness and strong regularity, offering superior fitting capabilities and an exceptional generalization performance.
2.3. Parallel Ensemble Learning Method of the Bagging Mechanism
Bagging (bootstrap aggregating) is a parallel ensemble learning method that can effectively reduce variance. Its algorithmic architecture is depicted in Figure 1. The core concept involves sampling the original dataset using the self-sampling method (Bootstrap) to generate multiple random datasets. For example, for a dataset $D$ with m samples, random sampling with replacement is performed m times, yielding a new dataset $D'$ with the same number of samples as the original dataset. Because the sampling is performed with replacement, the resulting dataset will contain duplicate samples. After $n$ rounds of self-sampling, $n$ new datasets $D_1, D_2, \ldots, D_n$, each containing $m$ samples, are obtained, and these $n$ datasets are used to train $n$ independent base learners $h_1, h_2, \ldots, h_n$. In regression problems, the outputs of the n trained base learners are averaged to yield the final result.
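The bootstrap-and-average procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the base learner is a one-dimensional least-squares line standing in for XGBoost, and the function names, seed, and toy data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = a * x + b; falls back to the mean if all x are equal."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def bagging_fit(xs, ys, n_learners, seed=0):
    """Train n_learners base learners, each on a bootstrap sample of size m."""
    rng = random.Random(seed)
    m = len(xs)
    models = []
    for _ in range(n_learners):
        # Draw m indices with replacement, so duplicates are expected.
        idx = [rng.randrange(m) for _ in range(m)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Regression output: the average of the base learners' predictions."""
    return sum(a * x + b for a, b in models) / len(models)


# Toy data on an exact line, so every bootstrap fit recovers y = 2x + 1.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
models = bagging_fit(xs, ys, n_learners=10)
prediction = bagging_predict(models, 10.0)
print(prediction)
```

With noisy data the individual fits would differ from one bootstrap sample to the next, and averaging them is what reduces the variance of the final prediction.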
2.4. Sparrow Search Optimization Algorithm Mechanism
The SSA is a novel heuristic swarm intelligence optimization algorithm that mimics the foraging and anti-predation behaviors of sparrows. In the SSA, the key roles are the discoverer, the joiner, and the sparrows that monitor for predators. Assuming an N-dimensional search space, the position (or potential solution) of the i-th sparrow is represented as follows:
$$X_i^t = \left(x_{i,1}^t, x_{i,2}^t, \ldots, x_{i,N}^t\right), \quad i = 1, 2, \ldots, n$$
In the formula, $t$ is the current iteration number, and $n$ is the population size.
The formula for updating the position of the discoverer within the sparrow population is as follows:
$$x_{i,j}^{t+1} = \begin{cases} x_{i,j}^t \cdot \exp\left(\dfrac{-i}{\alpha \cdot iter_{max}}\right), & R_2 < ST \\ x_{i,j}^t + Q \cdot L, & R_2 \ge ST \end{cases}$$
In the formula, $x_{i,j}^t$ is the position of the i-th sparrow in the j-th dimension of the N-dimensional search space at iteration t; $iter_{max}$ is the maximum number of iterations; $\alpha$ is a random number between 0 and 1; $Q$ is a random number that follows a normal distribution; $L$ is a $1 \times N$ matrix whose elements are all 1; $R_2$ indicates the warning value; $ST$ indicates the safety value; when $R_2 < ST$, the population is in a safe area, and the discoverer can randomly search for potential solutions in the current area; and when $R_2 \ge ST$, there are predators around the population, and the sparrows need to move to a safe area to search.
To find a better potential solution, the joiner may either follow the discoverer in pursuit of the optimal solution or independently explore other regions, with the position update occurring as follows:
$$x_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{x_{worst,j}^t - x_{i,j}^t}{i^2}\right), & i > n/2 \\ x_{P,j}^{t+1} + \left|x_{i,j}^t - x_{P,j}^{t+1}\right| \cdot A^{+} \cdot L, & i \le n/2 \end{cases}$$
In the formula, $x_{worst,j}^t$ is the worst solution in the j-th dimension of the search space in the current population; $x_{P,j}^{t+1}$ is the optimal solution in the j-th dimension of the search space in the current population; $A^{+} = A^{T}(AA^{T})^{-1}$, where $A$ is a $1 \times N$ matrix whose elements are randomly assigned 1 or −1; when $i > n/2$, the i-th joiner does not get a better solution and searches elsewhere; when $i \le n/2$, the joiner monitors the discoverer and competes with it for the optimal solution, thereby replacing the discoverer and seeking out a larger space to search.
Additionally, the sparrow algorithm accounts for predators. The sparrows tasked with monitoring predators constitute 10% to 20% of the entire population. The positions of these sparrows are randomly determined and are updated as follows:
$$x_{i,j}^{t+1} = \begin{cases} x_{best,j}^t + \beta \cdot \left|x_{i,j}^t - x_{best,j}^t\right|, & f_i > f_g \\ x_{i,j}^t + K \cdot \dfrac{\left|x_{i,j}^t - x_{worst,j}^t\right|}{(f_i - f_w) + \varepsilon}, & f_i = f_g \end{cases}$$
In the formula, $x_{best,j}^t$ is the current global optimal position in the j-th dimension; $\beta$ is a random number obeying the standard normal distribution; $K \in [-1, 1]$ is a random number; $\varepsilon$ is a small constant that prevents the denominator from being 0; $f_w$ and $f_g$ are the worst and optimal fitness of the current population; $f_i$ is the fitness value of the i-th sparrow; when $f_i > f_g$, the sparrow is at the edge of the population and is more threatened; when $f_i = f_g$, the sparrow is aware of the danger and needs to approach other sparrows to avoid being preyed on.
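The discoverer update described above can be sketched as follows, assuming the standard SSA formulation: when the warning value is below the safety value, each coordinate is shrunk by an exponential factor, and otherwise the same normally distributed step is added to every coordinate. The function name, the argument names, and the choice to draw the random factor once per update are assumptions for illustration, not the paper's implementation.

```python
import math
import random


def update_discoverer(position, i, iter_max, st, rng):
    """One discoverer position update (sketch of the standard SSA rule).

    position : list of floats, current position of the i-th sparrow
    i        : 1-based rank of the sparrow in the population
    iter_max : maximum number of iterations
    st       : safety value ST
    rng      : random.Random instance
    """
    r2 = rng.random()                      # warning value R2 in [0, 1)
    if r2 < st:
        # Safe area: contract every coordinate by the same factor in [0, 1).
        alpha = 1.0 - rng.random()         # random number in (0, 1]
        factor = math.exp(-i / (alpha * iter_max))
        return [x * factor for x in position]
    # Danger detected: shift every coordinate by the same normal random step Q.
    q = rng.gauss(0.0, 1.0)
    return [x + q for x in position]


rng = random.Random(0)
start = [1.0, -2.0, 3.0]
safe_move = update_discoverer(start, i=1, iter_max=100, st=1.1, rng=rng)
alarm_move = update_discoverer(start, i=1, iter_max=100, st=0.0, rng=rng)
print(safe_move, alarm_move)
```

Setting `st` above 1 or to 0 in the usage lines simply forces one branch or the other so that both behaviours can be inspected; in practice the safety value lies between these extremes.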
In this paper, the SSA is utilized to optimize the hyperparameters of XGBoost. The SSA exhibits a fast convergence speed and a strong optimization capability.
Figure 2 illustrates the performance of six heuristic optimization algorithms on various test functions: PSO, the SSA, SOA, GWO, the Whale Optimization Algorithm (WOA), and the Memetic Algorithm (MA). These convergence curves facilitate a comparative analysis of the optimization capabilities of these algorithms during the iterative process. Each subplot displays the results for a different test function, with changes in the objective function values reflecting the quality of the solutions provided by the algorithms. In the graph, lower values on the y-axis indicate better optimization results, while the x-axis represents the number of iterations, showing the progression over time.
The marked decline in the SSA curve during the initial iterations on each test function exemplifies its rapid initial convergence rate, suggesting its efficacy in reducing errors or costs in practical applications. Two performance characteristics of the SSA are of interest: its convergence rate and its ability to avoid or escape local optima. The convergence rate is observed through the slope of the curve, where a steeper slope indicates a quicker improvement in the quality of the solution; the ability to escape local optima is inferred from the trend in the curve after its initial decline. If the curve flattens at a higher value, this implies that the algorithm is trapped in a local optimum. The absence of such flattening in the SSA’s curve indicates that it avoids local optima effectively and explores the search space more comprehensively. Especially after 1000 iterations, the SSA attains lower objective function values than the other algorithms, indicating its efficiency in finding globally optimal or near-optimal solutions. The small improvements it continues to make during the later stages of optimization are also valuable in practice, as such minor gains can sometimes yield substantial real-world impacts. Through this comparison with the other algorithms, Figure 2 underscores the superiority of the SSA in its optimization efficiency and convergence speed, supporting its use for optimizing the hyperparameters of intricate models like XGBoost.
3. A Photovoltaic Prediction Method Based on ICEEMDAN and Multi-Model Fusion
3.1. The Overall Framework of the Photovoltaic Prediction Model
XGBoost continuously adds new trees during training and employs a greedy algorithm to progressively reduce the loss function value. The regularization term in the loss function moderates the model complexity, enhancing the fitting capability and consequently reducing prediction bias. However, a serial ensemble learning model such as XGBoost often exhibits excessive variance due to its complexity. Minor changes in the sample data can alter the learned model’s performance and stability, leading to a decreased prediction accuracy on the test set, which is indicative of overfitting. The Bagging parallel ensemble learning method draws multiple random samples from the original data and trains a separate base learner on each resulting dataset. The prediction results are then aggregated to mitigate the impact of data disturbances and thereby decrease the risk of overfitting. Consequently, this paper adopts XGBoost as the base learner in the Bagging parallel ensemble learning method, combining the strengths of both algorithms to minimize the prediction model’s generalization error. The variation in the error during model training is illustrated in Figure 3.
Figure 5 depicts the overall framework of the photovoltaic prediction model based on ICEEMDAN-Bagging-XGBoost, as proposed in this paper. The first-layer prediction model comprises SSA-Bagging-XGBoost, while the second-layer prediction model consists of SSA-XGBoost. Detailed descriptions of the model analysis and algorithm flow are provided below.
This study utilizes the ICEEMDAN algorithm to decompose the original photovoltaic power sequence. Based on the zero-crossing rate of each IMF component, these are categorized into high-frequency, intermediate-frequency, and low-frequency components. The SSA-Bagging-XGBoost model is employed for predicting the high-frequency and intermediate-frequency components, which capture the intricacies and randomness of photovoltaic power fluctuations, thereby reducing the variance and bias during training and enabling more detailed tracking of the photovoltaic power change curve. For the low-frequency components, which roughly indicate the photovoltaic power variation trends, the XGBoost-Linear model is utilized for prediction. This approach is adopted because the XGBoost-Linear model’s simplicity enhances the fitting of smooth curves and minimizes the risk of overfitting.
Figure 4 presents the comparison results between the XGBoost-Linear and Multiple Linear Regression (MLR) models in terms of smooth curve prediction. Although MLR also achieves a high degree of fit for smooth curves, XGBoost-Linear demonstrates a superior prediction performance, together with significant advantages in computation speed and parameter adjustment flexibility. XGBoost-Linear is therefore well suited to predicting the low-frequency components.
Based on the preceding analysis, the first-layer prediction model comprises SSA-Bagging-XGBoost. XGBoost is used to calculate the feature importance scores of the input features based on tree gain. This process ascertains the contributions of features including irradiance, temperature, pressure, humidity, wind direction, and wind speed to the predicted target. These feature importance scores aid in selecting the final input feature set for the prediction model. The high-frequency, intermediate-frequency, and low-frequency components are taken as the target outputs, and the dataset required by the model is created by combining them with the selected input features. Then, the dataset is split into an initial training set and a test set, and the initial training set is further divided into a training set and a validation set.
The validation set performs two key functions. The first is to facilitate the selection of optimal hyperparameters for XGBoost in the first-layer prediction model. Most machine learning algorithms have both hyperparameters and model parameters, and both significantly impact the model’s performance. Model parameters are adjusted automatically during training, whereas hyperparameters normally require manual, iterative tuning to find their optimal combination. XGBoost includes numerous hyperparameters, such as the number of trees, the tree depth, the learning rate, and the minimum loss reduction required for node splitting. Therefore, this study introduces SSA optimization to assist in identifying the optimal hyperparameters for XGBoost.
Utilizing the SSA for hyperparameter optimization, both the tree-based and linear ensemble XGBoost models are fine-tuned. Subsequently, initial training samples are put into the optimally configured Bagging-XGBoost model to generate predictions for the high-frequency and mid-frequency components. Concurrently, the XGBoost-Linear model is employed to predict the low-frequency portion of the data, leveraging its proficiency in handling smoothly varying data.
The XGBoost models, under the optimal hyperparameter configuration, are employed to make predictions on the validation and test sets. The prediction results from the validation set are merged with the corresponding meteorological factors and actual values to form a new training set. Similarly, the prediction results from the test set are combined with the corresponding meteorological factors to form a new input feature set. Building on this, the new training set is utilized to further train XGBoost, enabling it to comprehensively learn the correlations between each frequency component, the meteorological factors, and the actual values, thus achieving the fusion and reconstruction of the preliminary prediction results.
The second function of the validation set is the optimization of the hyperparameters of the second-layer prediction model. The second-layer prediction model, composed of SSA-XGBoost, is employed to nonlinearly fuse and reconstruct the high-frequency, intermediate-frequency, and low-frequency components from the first-layer prediction to obtain the final output. The high-frequency, intermediate-frequency, and low-frequency components of the sub-training set from the first-layer prediction model serve as the inputs, and the corresponding actual photovoltaic power values are the target output, forming the training set for the second-layer prediction model. The high-frequency, intermediate-frequency, and low-frequency components predicted by the first-layer model on the validation set act as inputs, and the corresponding actual photovoltaic power values serve as the target output, forming the validation set for the second-layer prediction model. Then, by utilizing the SSA, the optimal hyperparameters for the second-layer prediction model are identified. The high-frequency, intermediate-frequency, and low-frequency components predicted by the first-layer model on the test set are fed into the SSA-XGBoost model with the optimal hyperparameters for nonlinear fusion reconstruction, yielding the final prediction results.
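The way the second-layer inputs are assembled can be sketched as follows. This is only an illustration of the data flow: each second-layer input row concatenates the three first-layer component predictions with the meteorological factors for the same time step, and the actual power is the target. The function names and all numbers are hypothetical, and the superposition function is included only to show the traditional reconstruction that the nonlinear fusion replaces.

```python
def build_second_layer_rows(high, mid, low, weather):
    """Concatenate first-layer component predictions with weather features per sample."""
    return [[h, m, l, *w] for h, m, l, w in zip(high, mid, low, weather)]


def superposition(high, mid, low):
    """Traditional reconstruction: add the component predictions sample by sample."""
    return [h + m + l for h, m, l in zip(high, mid, low)]


# Hypothetical first-layer predictions for three samples.
high = [0.5, -0.2, 0.1]
mid = [1.0, 1.5, 0.8]
low = [10.0, 10.5, 11.0]
# Hypothetical meteorological factors per sample, e.g. irradiance and temperature.
weather = [[800.0, 25.0], [820.0, 26.0], [790.0, 24.5]]

second_layer_inputs = build_second_layer_rows(high, mid, low, weather)
baseline = superposition(high, mid, low)
print(second_layer_inputs[0], baseline[0])
```

A second-layer regressor trained on such rows against the actual power can weight the components differently under different weather conditions, which a fixed sum cannot do.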
Figure 5 presents the ICEEMDAN-Bagging-XGBoost photovoltaic prediction model framework.
3.2. Model Evaluation Metrics
To rigorously evaluate the performance of the predictive model outlined in this manuscript, three principal metrics are utilized: the relative root mean square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE).
In the definitions of these metrics, $y_t$ represents the actual value at time t, $\hat{y}_t$ denotes the predicted value, and m is the number of samples.
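For reference, the textbook forms of these error metrics can be sketched as below. This is an assumption-laden sketch: the normalization used for the relative root mean square error is not recoverable from the text, so the plain root mean square error is shown, and the percentage error assumes that no actual value is zero (night-time samples with zero power would have to be excluded). The function names and sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error over m samples."""
    m = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / m)


def mae(actual, predicted):
    """Mean absolute error over m samples."""
    m = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / m


def mape(actual, predicted):
    """Mean absolute percentage error in percent; requires nonzero actual values."""
    m = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / m


# Hypothetical actual and predicted power values.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```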
4. Case Analysis
The experimental dataset utilized in this paper originates from a centralized photovoltaic power plant in Hebei Province, China. This original dataset comprises daily photovoltaic power data from June to August 2018, with each day featuring 96 sampling points. Additionally, it encompasses meteorological information including irradiance, temperature, humidity, air pressure, wind direction, and wind speed. The model programming related to this paper was conducted in the MATLAB 2023a environment. The prediction evaluation indices employed are the relative root mean square error, the mean absolute error, and the mean absolute percentage error.
4.1. Photovoltaic Power Time-Series Decomposition
Photovoltaic power prediction is influenced by numerous factors and exhibits complex volatility. To enhance the prediction accuracy, signal decomposition technology has increasingly gained prominence in photovoltaic power prediction. This technology facilitates more accurate prediction of photovoltaic power by analyzing the fluctuation patterns in time series. Currently, EMD is extensively employed, yet it encounters a modal aliasing problem during decomposition. To address this issue, the ICEEMDAN decomposition algorithm is utilized in this study. As described in Section 2.1, the algorithm adds adaptive white noise at each decomposition stage and averages the local means of the resulting signals to extract the corresponding modal components. The ICEEMDAN algorithm exhibits significant anti-noise advantages in signal processing. This advantage stems from its noise-assisted mechanism, which effectively reduces the modal aliasing in the signal, thus ensuring more accurate frequency separation and enhanced signal analysis reliability.
At present, most studies introducing decomposition algorithms for photovoltaic forecasting typically put the decomposed IMF components sequentially into the forecasting model for training and then aggregate the forecast results of each component to obtain the final output. Given that training each component separately can result in prolonged training durations and increased resource consumption, this approach is not optimal. Additionally, the cumulative prediction errors of each component during the final reconstruction can diminish the overall prediction accuracy.
As shown in Figure 6, the ICEEMDAN decomposition reveals no significant mode mixing in the intrinsic mode functions, with each component exhibiting a relatively stable frequency. Based on this, the components are divided into high-frequency, medium-frequency, and low-frequency parts by calculating the zero-crossing rate of each intrinsic mode function. In this paper, a component is categorized as high-frequency when its zero-crossing rate lies within the range of 0.1 to 1, medium-frequency from 0.01 to 0.1, and low-frequency below 0.01. The zero-crossing rate of each component is calculated using Formula (1), and the results are depicted in Figure 7.
Examination of Figure 7 reveals that the zero-crossing rate of IMF1 to IMF3 exceeds 0.1, leading to their classification as high-frequency components. The zero-crossing rate of IMF4 to IMF8 surpasses 0.01, categorizing them as intermediate-frequency components, whereas the zero-crossing rate of IMF7 to Res falls below 0.01, thus classifying them as low-frequency components. The intrinsic mode functions belonging to each of these three frequency bands are summed and reconstructed into new modal functions, as depicted in Figure 8.
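The grouping step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: Formula (1) is not reproduced in this section, so the zero-crossing rate is assumed here to be the number of sign changes divided by the number of adjacent sample pairs, the treatment of values exactly on a threshold is an assumption, and the function names `zero_crossing_rate`, `band_of`, and `regroup` are hypothetical.

```python
from typing import Dict, List, Sequence


def zero_crossing_rate(component: Sequence[float]) -> float:
    """Share of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(component) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(component, component[1:]) if a * b < 0)
    return crossings / (len(component) - 1)


def band_of(rate: float) -> str:
    """Map a zero-crossing rate to a band using the thresholds 0.1 and 0.01."""
    if rate >= 0.1:   # boundary handling is an assumption
        return "high"
    if rate >= 0.01:
        return "medium"
    return "low"


def regroup(components: List[Sequence[float]]) -> Dict[str, List[float]]:
    """Sum the components that fall into the same band, sample by sample."""
    length = len(components[0])
    bands = {name: [0.0] * length for name in ("high", "medium", "low")}
    for component in components:
        target = bands[band_of(zero_crossing_rate(component))]
        for i, value in enumerate(component):
            target[i] += value
    return bands
```

Calling `regroup` on the list of IMFs plus the residual would return three reconstructed series, one per frequency band, each with the same length as the original signal.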
Following the ICEEMDAN decomposition and reconstruction of the processed photovoltaic data, the low-frequency components show a smoother trend and stronger periodicity. Consequently, predicting them with multivariate linear regression avoids complex parameter tuning while yielding satisfactory results quickly and accurately. The SSA-Bagging-XGBoost model is used to predict both the high-frequency and intermediate-frequency components. In this model, XGBoost serves as the base learner within the Bagging parallel ensemble learning method, which compensates for Bagging's limited ability to reduce bias. At the same time, Bagging mitigates the impact of minor data perturbations and effectively reduces the variance. Furthermore, optimization with the SSA substantially shortens the hyperparameter optimization time. Consequently, employing SSA-Bagging-XGBoost for the high-frequency and intermediate-frequency components allows the local details of the photovoltaic power fluctuations to be predicted more effectively.
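The Bagging mechanism itself (bootstrap resampling, independent base learners, averaged predictions) can be illustrated with a short sketch. To keep the example dependency-free, the XGBoost base learner is swapped for a one-feature least-squares line; this stand-in, the names `fit_line`, `bagging_fit`, and `bagging_predict`, and the default of 10 estimators are all assumptions for illustration and are not taken from the paper.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Stand-in base learner: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # degenerate sample: predict the mean
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def bagging_fit(xs: Sequence[float], ys: Sequence[float],
                n_estimators: int = 10, seed: int = 0) -> List[Model]:
    """Train each base learner on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models: Sequence[Model], x: float) -> float:
    """Average the base learners' predictions, which lowers the variance."""
    return mean(model(x) for model in models)
```

Because every learner sees a different resample, a perturbation in a few training points affects only some of the learners, and averaging dampens its effect on the final prediction.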
In this study, the decomposed low-frequency, intermediate-frequency, and high-frequency components are considered the target outputs, and the dataset required for the model is formed by integrating the established input features. Subsequently, the dataset is segmented into an initial training set and a test set, with the initial training set further divided into a training set and a validation set.
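The two-stage split can be sketched as below. The split ratios are not stated in this section, so the 20% fractions are placeholders, and keeping the samples in chronological order (no shuffling) is an assumption that suits time series; `chronological_split` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def chronological_split(
    samples: Sequence,
    test_fraction: float = 0.2,   # placeholder ratio, not from the paper
    val_fraction: float = 0.2,    # placeholder ratio, not from the paper
) -> Tuple[List, List, List]:
    """Split into (train, validation, test) without shuffling."""
    n = len(samples)
    n_test = int(n * test_fraction)
    # Stage 1: initial training set versus test set.
    initial_train = list(samples[: n - n_test])
    test = list(samples[n - n_test:])
    # Stage 2: initial training set into training and validation sets.
    n_val = int(len(initial_train) * val_fraction)
    split_at = len(initial_train) - n_val
    return initial_train[:split_at], initial_train[split_at:], test
```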
4.2. Feature Engineering Analysis
Feature importance analysis plays a vital role in both training and prediction performance. The XGBoost model can calculate the importance of each feature during the training process: at every branch of a tree, the feature whose split yields the largest gain is selected as the split point. The importance of a feature is then determined by the total number of times it is used as a split point across the trees; the more frequently it is used, the more important it is. This paper employs XGBoost to carry out the feature engineering, as shown in Figure 9.
Figure 9 illustrates the cumulative contribution of multiple features over time (measured in hours). Each feature has 24 bars representing its cumulative importance for each hour of the past 24 h. For features such as wind speed, wind direction, temperature, humidity, air pressure, irradiance, and photovoltaic power, each bar signifies the contribution of that feature to the model's predictive results within a specific hour. The figure indicates that historical photovoltaic power is highly correlated with the power generated in the target forecast period. Irradiance also shows a high feature contribution, which matches expectations for a solar power generation prediction model, as irradiance directly affects the solar output.
Meteorological conditions such as wind speed, wind direction, temperature, humidity, and air pressure have a relatively lower contribution. These factors do impact photovoltaic power generation, but their effects are not as significant as irradiance and photovoltaic power. In specific periods, they may significantly influence the predictive model’s output, but overall, their impact is smaller. Although temperature and humidity contribute less than irradiance and photovoltaic power, they show an increased correlation in certain specific periods, indicating their potential importance to the prediction model. Rising temperatures may reduce the efficiency of photovoltaic panels during the day, thereby affecting power generation, while changes in humidity might impact the panel performance in the morning due to dew or fog or interact with temperature during hot periods to affect the cooling efficiency of the panels. The fluctuations in these environmental variables provide important contextual information for the model, especially under conditions of drastic weather changes. Their contribution to the prediction accuracy should not be overlooked. Including temperature and humidity as input features in the construction of a photovoltaic power prediction model helps capture the variations in the power generation caused by meteorological condition fluctuations, enhancing the model’s predictive capability.
Therefore, when constructing a prediction model, selecting photovoltaic power, irradiance, temperature, and humidity as inputs not only improves the model’s fit to the historical data but also enhances its adaptability to future condition changes and predictive accuracy.
4.3. Prediction Method Analysis
In this paper, the rolling prediction method is used for the three frequency components, and the sliding window method is used to process the decomposed and reconstructed component sequences, so that each original sequence is segmented into multiple sub-sequences. Figure 10 illustrates the assumption that $X_t = \{x_t, x_{t+1}, \ldots, x_{t+K-1}\}$ delineates a sub-time series with a timestep of K, under the premise $t \ge 1$ and $t + K - 1 \le N$. This formulation leads to the subsequent sub-time series $X_{t+1} = \{x_{t+1}, x_{t+2}, \ldots, x_{t+K}\}$, and consequently, a time series of length N is segmented into N − K + 1 subsequences.
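A minimal sketch of the sliding window segmentation is given below, assuming a stride of one step, which is what the count of N − K + 1 subsequences implies. The helper names are hypothetical, and pairing each window with the next value as its target in `make_supervised` is an assumption about how the rolling prediction is set up.

```python
from typing import List, Sequence, Tuple


def sliding_windows(series: Sequence[float], k: int) -> List[List[float]]:
    """Return all N - K + 1 windows of length k, moving one step at a time."""
    if k < 1 or k > len(series):
        raise ValueError("window length must satisfy 1 <= k <= len(series)")
    return [list(series[t:t + k]) for t in range(len(series) - k + 1)]


def make_supervised(
    series: Sequence[float], k: int
) -> Tuple[List[List[float]], List[float]]:
    """Pair each window with the value that follows it (assumed target)."""
    windows = sliding_windows(series, k)[:-1]  # last window has no successor
    targets = list(series[k:])
    return windows, targets
```

For a series of length 5 and K = 3, `sliding_windows` yields 5 − 3 + 1 = 3 windows, and `make_supervised` yields two input/target pairs.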
4.4. Model Effect Comparison Analysis
To validate the superiority of the proposed Nonlinear Fusion Prediction method, it was compared with the non-thresholding reconstruction and thresholding reconstruction prediction methods after decomposition, both of which fuse the component predictions by direct summation. The results of this comparison are depicted in Figure 11 and Table 1. The Nonlinear Fusion Prediction method using XGBoost achieved the highest prediction accuracy. Moreover, the evaluation metrics indicated that the prediction performance of the non-thresholding reconstruction method was slightly inferior to that of the thresholding reconstruction method. This can be attributed to the reconstruction phase, in which the decomposed components are merged, potentially enhancing the useful information within the signal and reducing noise. Such enhanced information can be used more effectively in subsequent predictions, thereby improving the prediction accuracy.
For high-frequency and medium-frequency components, Bagging-XGBoost effectively minimized the bias and variance, thereby enhancing its ability to track local variation trends in the photovoltaic data and improving the prediction accuracy. Regarding low-frequency components, XGBoost-Linear utilized its exceptional fitting capability to smooth the curve, thereby achieving a remarkably high prediction accuracy. XGBoost conducts Nonlinear Fusion Prediction on the high-, medium-, and low-frequency components predicted by the first-layer model, further enhancing the prediction accuracy compared to the thresholding reconstruction method.
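The difference between direct summation and nonlinear fusion can be written compactly. The notation below is introduced here only for illustration and does not appear in the original: $\hat{y}^{H}_t$, $\hat{y}^{M}_t$, and $\hat{y}^{L}_t$ denote the first-layer predictions of the high-, medium-, and low-frequency components at time $t$, and $f$ denotes the second-layer XGBoost model.

```latex
% Direct summation, as used by the two reconstruction baselines
\hat{y}^{\mathrm{sum}}_t = \hat{y}^{H}_t + \hat{y}^{M}_t + \hat{y}^{L}_t

% Nonlinear fusion: a learned mapping of the three component predictions
\hat{y}^{\mathrm{fusion}}_t = f\left(\hat{y}^{H}_t,\ \hat{y}^{M}_t,\ \hat{y}^{L}_t\right)
```

Direct summation is the special case in which $f$ is fixed to an unweighted sum, so a learned $f$ has the freedom to correct systematic errors in the individual component predictions.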
To further investigate the predictive performance of this model, it was compared with the XGBoost, MLR, Transformer, LSTM, and BPNN models. The prediction errors of each model are presented in Table 1. As indicated by the results in Table 1 and Figure 12, the proposed model significantly outperforms the single models in prediction accuracy and tracks changes in the photovoltaic power more precisely. A single model frequently risks converging to a local minimum during training, which may diminish its generalization ability and make its predictions unstable. While the XGBoost algorithm can incrementally reduce the bias by adding new trees during training, it struggles to mitigate the effects of data disturbances at later stages, leading to increased deviation and a reduced prediction accuracy on the test set. The ICEEMDAN-Bagging-XGBoost model proposed in this paper demonstrates a high predictive accuracy and overcomes the effects of the data variability.
5. Conclusions
This paper draws upon recent research in the field of artificial intelligence to propose a short-term photovoltaic power prediction model. The model employs ICEEMDAN-Bagging-XGBoost and integrates SSA optimization to efficiently identify the optimal hyperparameters for XGBoost, thereby minimizing the time required for hyperparameter optimization. The analysis shows that the proposed model performs well in short-term forecasting of photovoltaic power generation and accurately tracks the power curve during periods of significant fluctuation. Compared to single models, the proposed model achieves a superior prediction accuracy and a more stable performance, with a reduced overfitting risk and an enhanced generalization ability. Consequently, it has significant practical value for short-term forecasting in photovoltaic power plants.
In future work, we will continue to carry out research on the application of advanced artificial intelligence technology in power systems:
At the forecasting scenario level, the data used in this example encompass complete meteorological factors. In practice, however, the variability in photovoltaic power is a critical factor, and it stems largely from meteorological factors that are unknown at forecast time. Consequently, future work will involve integrating real-time weather forecasting with rolling correction of the short-term meteorological factors to enhance the accuracy of the photovoltaic power forecasting for each cycle.
At the algorithmic level, the volatility and randomness of photovoltaic power generation systems, along with the anticipated large volume and high dimensionality of future photovoltaic system data, place heightened requirements on photovoltaic power forecasting algorithms. It is essential to use algorithms like XGBoost effectively to extract critical information from the high-dimensional features and to apply batch learning techniques to photovoltaic power forecasting, thereby continuously optimizing the model's accuracy. Additionally, implementing offline training and online prediction would significantly enhance the practical applicability of the model.
At the application level, it is imperative to continue exploring the application of artificial intelligence technology in the prediction of renewable energy output. Furthermore, it is crucial to fully leverage the role of artificial intelligence technology, embedded into power knowledge and experience, in power grid optimization and scheduling, fault diagnosis and analysis, and other fields to ensure the power system embodies the characteristics of intelligent interaction, safety, and controllability.