1. Introduction
Amid the intensifying challenges of global environmental pollution, ecological change, and the energy crisis, developing renewable energy has emerged as the key strategy for transitioning from fossil fuels to a cleaner energy paradigm [1]. Optimizing the energy structure by incorporating renewable energy into the supply system can effectively alleviate the problems caused by the overuse of fossil fuels. Solar energy, distinguished by its extensive global distribution and abundant availability, has seen a substantial rise in penetration within electrical power and energy systems [2]; modern power systems are typically multi-energy systems, and photovoltaic generation can contribute a significant share of their supply. Nevertheless, solar power generation is intermittent, fluctuating, and stochastic, so mitigating these uncertainties and effectively predicting solar power generation have become urgent needs [3]. The foremost factor behind the instability of solar power output is the variability of solar irradiance, and solar irradiance forecasting therefore plays a pivotal role in addressing this challenge [4,5]. Predicting solar energy resources over a future horizon has thus become an important measure for optimizing energy management systems. For example, accurate irradiance forecasts support the careful planning of energy storage charging schedules and the optimization of energy transmission, thereby reducing energy losses [6]. Furthermore, solar irradiance prediction enables photovoltaic power generation to be anticipated in advance, reducing the required reserve capacity and overall generation costs [7]. More broadly, irradiance forecasting not only accommodates the intrinsic instability of solar power generation but also allows energy systems to operate securely, economically, and efficiently [8]. Hence, accurate solar irradiance forecasting is of central importance in electrical power and energy systems [9], mitigating the challenges posed by the dynamic nature of solar energy and strengthening the reliability of energy systems [10].
So far, research on solar irradiance prediction can be categorized into four types: physical methods, statistical methods, machine learning, and deep learning [11]. Physical methods employ the principles of meteorology and numerical weather models to simulate atmospheric conditions and predict irradiance variations. This approach accounts for complex meteorological interactions and reflects a deep understanding of atmospheric processes [12], making it suitable for long-term forecasts. The research in Ref. [13] combines numerical weather prediction with support vector regression for large-scale photovoltaic power forecasts and satellite-based cloud motion vector forecasts. The study in Ref. [14] introduces an approach to generating real-time current-voltage characteristics and forecasting the peak power of photovoltaic modules under actual meteorological conditions using the power-law model and single-diode model parameters. However, physical methods suffer from high model complexity, high computational cost, and sensitivity to initial conditions, resulting in relatively low prediction accuracy [15]. Statistical prediction methods use historical meteorological data and solar energy system output information to forecast future solar irradiance with statistical models. The research reported in Ref. [16] introduces an ARMAX model with exogenous inputs to forecast photovoltaic power output, significantly improving prediction accuracy compared to the traditional ARIMA model. A new SARIMA model to forecast hourly wind speeds in the coastal areas of Scotland has been proposed [17], demonstrating superior accuracy in predicting future offshore wind-speed time series compared to deep learning-based algorithms. While statistics-based forecasting methods offer high computational efficiency and can handle some nonlinear relationships well, their ability to capture complex meteorological dynamics is limited, leading to relatively poor accuracy in long-term forecasting [18,19].
Machine learning methods predict solar irradiance by training neural network models using historical data and are capable of handling nonlinear relationships and large-scale inputs [20]. Compared to physical methods and statistical models, machine learning-based prediction models better capture the complex relationships among input data, thereby improving their accuracy and suitability for medium- and short-term forecast tasks [21]. In a previous study [22], an irradiance prediction method with an integrated framework of robust local mean decomposition and bidirectional long short-term memory is proposed. The study reported in Ref. [23] presents a novel multibranch attentive gated recurrent residual network that is capable of modeling data at various resolutions, extracting the hierarchical features, and capturing short- and long-term dependencies. A novel approach is proposed in Ref. [24] to predict one-week-ahead half-hourly photovoltaic power output in the United Kingdom, leveraging sloped extra-terrestrial irradiance and weather data, enabling the better balancing of electricity supply and demand. Another study [25] introduces a novel hourly stepwise forecasting framework for solar irradiance, employing an integrated hybrid model combined with error correction and variational mode decomposition, significantly enhancing the model's anti-interference capability and prediction accuracy. However, as the number of model layers increases, machine learning-based prediction models encounter challenges such as the curse of dimensionality and slow network convergence [26].
Deep learning models automatically extract features and are suitable for complex nonlinear problems, possessing strong expressive power to handle large-scale, high-dimensional data [27]. They are particularly well suited for short-term and real-time prediction [28]. The authors of Ref. [29] propose a prediction model based on dual decomposition with error correction, an improved hybrid deep learning strategy. This method adopts complete ensemble empirical mode decomposition with adaptive noise together with variational mode decomposition: the historical sequence and the error sequence of solar radiation are decomposed, and the short- and long-term features of the data are extracted by a BiLSTM deep learning network, which effectively improves the prediction accuracy. In another study [30], a hybrid solar irradiance forecasting model based on partial mutual information, an enhanced whale optimization algorithm, and deep reinforcement learning is proposed, which, compared to traditional methods and single deep learning models, can more effectively address dynamic variations and exhibits superior performance across multiple forecasting horizons. A deep learning framework utilizing convolutional neural networks and attention mechanisms to extract the spectral information from geostationary satellites for accurate ground-level solar irradiance estimation is proposed in Ref. [31], outperforming traditional databases. The authors of Ref. [32] propose selecting input variables using the information gain factor to enhance the accuracy of solar irradiance prediction models, validating its superiority over the Pearson correlation coefficient. A comprehensive review of deep learning for renewable energy forecasting can be found in Ref. [33]. Traditional deep learning algorithms suited to time series data, such as recurrent neural networks (RNNs), transmit information between hidden layers through linear or nonlinear activation functions to capture effective features from historical sequences. However, this mode of information propagation tends to dilute crucial features with noise, leading to vanishing or exploding gradients and thereby constraining the ability of such models to handle long time series, including solar irradiance sequences. Although some deep learning algorithms, such as long short-term memory (LSTM) networks, incorporate memory and forget gates to selectively filter the input information [34], the fundamental operational logic of the network remains unchanged, so the accumulation of errors and data noise cannot be completely mitigated.
The attention mechanism is a pivotal technique for processing sequential data and has been widely applied in domains such as natural language processing and computer vision [35]. By modeling the correlations among different positions within input sequences, it automatically learns the importance weights of those positions and prioritizes the more relevant ones during information transmission [36]. This makes attention mechanisms particularly effective at handling long sequences and capturing internal dependencies [37]. ProbSparse attention (PSA), a representative attention mechanism, derives the Q, K, and V matrices from the encoding layer and subsequently obtains attention values using correlation scores [38]. Despite PSA's widespread use in natural language processing and image generation, no studies on its application to solar irradiance prediction have been reported. The ProbSparse attention mechanism computes attention weights for each position within the input sequences, facilitating interaction and information transmission between positions and thereby effectively mitigating cumulative errors and data noise [39], which makes it highly suitable for solar irradiance prediction tasks. In this paper, a solar irradiance prediction model is designed, which makes the following core contributions:
- We have innovatively designed an artificial intelligence model for short-term solar irradiance prediction, leveraging the ProbSparse attention mechanism to efficiently capture the inherent short-term and long-term dependencies within input sequences.
- The dingo algorithm has been redesigned to optimize the hyperparameters of the proposed AI prediction model, enhancing the convergence and performance of the prediction model.
- A comprehensive data preprocessing method incorporating feature selection, multiple imputation, and median filtering is introduced to ensure the quality and accuracy of input data.
The performance of the proposed prediction model has been evaluated using the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2). The simulation results demonstrate that the proposed AI prediction model exhibits higher applicability and practicality in solar irradiance prediction, offering new solutions for addressing challenges in the energy management of hybrid electrical systems.
2. PSA-Based Solar Irradiance Prediction Model
The subject of this paper is the time series of global horizontal irradiance. The sequence features are mainly reflected in the long-term and short-term correlations at various points. The value of irradiance exhibits clear periodic similarity with the variation in solar zenith angle, and the sequence values do not diverge but, rather, only change within a certain range [40]. Therefore, in the case of a sufficient number of historical sequence samples, it can be assumed that the sequence to be predicted is, to some extent, contained within the historical sequence, or shares a similar distribution with a certain part of the historical sequence [41].
The prediction model used in this paper is based on the ProbSparse attention mechanism and, thus, adopts an encoder-decoder architecture that is suitable for this mechanism in the model structure. The overall structure of the model can be divided into the embedding layer, feature extraction layer, and regression layer, as shown in Figure 1.
2.1. Embedding Layer
The embedding layer is the starting layer of the model structure, and its main purpose is to encode the input sequence and transform it into a form that can be processed by the feature extraction layer. Since input time sequences are typically long, they are often batched into a set of shorter sequence inputs. In this article, the embedding layer transforms the input sequence group into a high-dimensional tensor using three embedding methods.
The first approach is numerical embedding, wherein the scalar inputs from the original input sequence are fed into a Conv1d layer. Through convolutional operations, high-dimensional vectors in the same format as the other embedding sequences are generated. This allows the retention of numerical features from the input sequence, and the vectors are then summed with the other embedding sequences for subsequent processing. This layer can be represented by the following formula:
where $P(X)$ and $P(Y)$ denote the probabilities of $X$ and $Y$ occurring, respectively, and $P(X,Y)$ denotes the joint probability of $X$ and $Y$ occurring simultaneously.
The second method is positional embedding. The core mechanism used by the feature extraction layer is the attention mechanism, which effectively provides a relevance look-up table: swapping the positions of any two elements in the table does not change the attention values, only where they appear. The attention mechanism therefore cannot, by itself, extract positional relationship features from the input sequence, and cannot effectively capture the long-term features of the original input sequence, so the embedding layer needs to provide positional information for the subsequent calculations. This can be done by encoding the positions of the input vectors with sine and cosine functions. Even index positions can be encoded as:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right)$$
Odd index positions can be encoded as:
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right)$$
In this context, PE represents the positional encoding sequence, pos indicates the position of an element in the sequence, i indexes the embedding dimension, and $d_{\mathrm{model}}$ represents the size of the hidden layer, i.e., the dimensionality of the sequence after positional embedding.
The third step is time embedding. Since the irradiance sequence is time series data, the time series usually records the exact moment of each set of data through time units such as the year, month, day, hour, and minute. This exact time information can provide new features for model training, so the embedding layer is required to integrate this feature information into the input sequence.
The encoded input vector set is obtained by adding the positional encoding to the sequence encoding. The encoded input vectors contain both the inherent sequence features and the positional information. These vectors are then passed through a trained linear layer that maps the sequence features into the $Q$, $K$, and $V$ matrices, which serve as the input to the feature extraction layer.
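As an illustration of the three embedding steps described in this subsection, the following PyTorch sketch sums a Conv1d value embedding, a fixed sinusoidal positional embedding, and a linear time-feature embedding. The layer sizes, kernel size, and set of calendar features are illustrative assumptions, and the subsequent linear projection to the $Q$, $K$, and $V$ matrices is omitted.

```python
import math
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Minimal sketch: value (Conv1d) + positional (sin/cos) + time embeddings are summed."""
    def __init__(self, c_in, d_model=512, max_len=5000, n_time_feats=5):
        super().__init__()
        # Numerical embedding: Conv1d maps each time step's c_in scalar features to a d_model vector.
        self.value_emb = nn.Conv1d(c_in, d_model, kernel_size=3, padding=1)
        # Positional embedding: fixed sinusoidal table (even dims -> sin, odd dims -> cos).
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Time embedding: linear projection of calendar features (month, day, hour, ...).
        self.time_emb = nn.Linear(n_time_feats, d_model)

    def forward(self, x, x_time):
        # x: (batch, seq_len, c_in); x_time: (batch, seq_len, n_time_feats)
        v = self.value_emb(x.permute(0, 2, 1)).permute(0, 2, 1)   # value embedding
        p = self.pe[: x.size(1)].unsqueeze(0)                     # positional embedding
        t = self.time_emb(x_time)                                 # time embedding
        return v + p + t

# Toy usage: a batch of 8 sequences, 96 steps, 16 features, 5 calendar features.
emb = InputEmbedding(c_in=16)
out = emb(torch.randn(8, 96, 16), torch.randn(8, 96, 5))          # shape (8, 96, 512)
```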
2.2. Feature Extraction Layer
In deep learning models, the feature extraction layer is often a key component for achieving predictive capabilities. In this article, the feature extraction layer consists of an encoder and a decoder. The encoder-decoder structure enables training, validation, and prediction for time series data.
2.2.1. Encoder
The encoder consists of three self-attention layers. In the traditional self-attention mechanism, the multi-head attention mechanism processes the feature matrices $Q$, $K$, and $V$ of the input vectors to obtain the attention values and then uses these attention values to make sequence predictions. The $Q$ matrix is the query matrix, with one of its vectors referred to as the q vector. The $K$ matrix is the key matrix, with one of its vectors known as the k vector. The $V$ matrix is the value matrix, and one of its vectors is called the v vector. The $Q$, $K$, and $V$ matrices are all obtained through the training of the embedding layer model. For each input vector, its query vector is dot-multiplied with the key vectors of all input vectors to acquire the respective relevance scores. These relevance scores are then transformed into a probability form through softmax, and the outcomes are used to weight and combine the value vectors to derive the output sequence. This part of the structure can be represented by Figure 2, and this part of the process can be expressed by Equation (4):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $\mathrm{Attention}(Q,K,V)$ refers to the attention value obtained after calculation and $d_k$ is the dimension of the key vectors. The product of the $Q$ and $K^{T}$ matrices is known as the correlation score; this score is normalized by dividing by $\sqrt{d_k}$, and its probability form is obtained through softmax.
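For reference, a minimal NumPy sketch of the conventional scaled dot-product attention in Equation (4) is given below, assuming the $Q$, $K$, and $V$ matrices have already been produced by the embedding layer; the shapes and values are toy examples.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # correlation scores
    weights = softmax(scores, axis=-1)               # probability form
    return weights @ V

# Toy usage: 96 positions, 64-dimensional heads.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((96, 64)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (96, 64)
```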
When calculating the self-attention values, the query vector of each encoded input vector needs to be dot-produced with the key vectors of all input vectors. If the length of the original input sequence is long, this will lead to excessive computational load in the encoding layer, affecting the efficiency of the model. Therefore, ProbSparse attention improves the attention mechanism based on traditional attention. In reality, not every key vector has a strong correlation with each query vector. Experimental evidence shows that most query vectors have little correlation with key vectors, and their dot product values tend to approach zero, making this part of the computation meaningless. The ProbSparse attention mechanism optimizes self-attention by effectively reducing the computational load of the attention mechanism if it can find effective pairings in the entire sequence.
The ProbSparse attention mechanism searches for informative query vectors by measuring the sparsity of the $Q$ matrix, rephrasing the original self-attention in probabilistic terms as follows:
$$\mathcal{A}(q_i, K, V) = \sum_{j} p(k_j \mid q_i)\, v_j$$
where $q_i$ represents the $i$-th query vector and $p(k_j \mid q_i)$ represents the distribution of the key vectors conditioned on $q_i$. The effective information content of the $Q$ and $K$ matrices can be measured by the distribution of key vectors under the $i$-th query vector. Using the probability form of self-attention in Equation (5), this distribution function can be expressed in the form of Equation (6):
$$p(k_j \mid q_i) = \frac{\exp\!\big(q_i k_j^{T}/\sqrt{d}\big)}{\sum_{l} \exp\!\big(q_i k_l^{T}/\sqrt{d}\big)}$$
If this distribution is close to a uniform distribution, the query vector is "lazy" and cannot contribute effectively to the attention values. Conversely, if the distribution fluctuates strongly, the query vector contributes significantly to the calculation of attention values. The problem can therefore be transformed into measuring the similarity between distributions, and this similarity can be measured by the KL divergence. By calculating the KL divergence between the distribution of key values conditioned on the $i$-th query vector and the uniform distribution, the sparsity of each query vector can be obtained. The KL divergence can be obtained with Equation (7):
$$KL\big(q \,\|\, p\big) = \ln \sum_{l} e^{\,q_i k_l^{T}/\sqrt{d}} \;-\; \frac{1}{L_K}\sum_{j} \frac{q_i k_j^{T}}{\sqrt{d}} \;-\; \ln L_K$$
where the uniform distribution is taken as $q(k_j \mid q_i) = 1/L_K$, with $L_K$ the number of key vectors, so that the final term $\ln L_K$ is an arbitrary constant.
From this calculation, the required query vectors can be filtered. The model extracts the required query and key vectors through self-attention distilling, which is realized by a maximum pooling layer of the following form:
$$X_{j+1} = \mathrm{MaxPool}\Big(\mathrm{ELU}\big(\mathrm{Conv1d}\big([X_j]_{AB}\big)\big)\Big)$$
where $[X_j]_{AB}$ contains the key operations in the ProbSparse attention mechanism, $\mathrm{ELU}$ is the activation function, given by Equation (9):
$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x}-1), & x \le 0 \end{cases}$$
and $\mathrm{MaxPool}$ indicates the maximum pooling operation.
The new attention mechanism obtained in this way is called ProbSparse Attention (PSA). In the actual experimental process, PSA also has a positive impact on prediction accuracy. After the PSA mechanism is calculated in the encoding layer, a new encoding sequence will be output, which contains the transformed positional encoding information and “semantic” encoding information.
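To make the sparsity measurement concrete, the following NumPy sketch scores each query with a max-minus-mean approximation of the KL-divergence criterion described above and keeps only the top-$u$ "active" queries. The exact scoring form follows the Informer formulation of ProbSparse attention and is an assumption here, not necessarily the precise variant used in this model.

```python
import numpy as np

def query_sparsity(Q, K, sample_k=None, rng=None):
    """Approximate sparsity score M(q_i, K) = max_j(q_i.k_j/sqrt(d)) - mean_j(q_i.k_j/sqrt(d)).

    Queries whose key distribution is far from uniform (large M) are "active" and kept;
    "lazy" queries are dropped, which is what reduces the attention cost.
    """
    L_K, d = K.shape
    rng = rng or np.random.default_rng(0)
    if sample_k is not None:                # optionally score against a random key subset
        K = K[rng.choice(L_K, size=sample_k, replace=False)]
    scores = Q @ K.T / np.sqrt(d)           # (L_Q, L_K) correlation scores
    return scores.max(axis=1) - scores.mean(axis=1)

# Keep the top-u queries with the largest sparsity scores.
rng = np.random.default_rng(1)
Q, K = rng.standard_normal((96, 64)), rng.standard_normal((96, 64))
M = query_sparsity(Q, K, sample_k=32)
u = 24
active_idx = np.argsort(M)[-u:]             # indices of the informative queries
```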
2.2.2. Decoder
The decoder consists of a masked ProbSparse attention layer and a ProbSparse attention layer. During training, the input to the decoder is the encoded target sequence, while during prediction the target sequence is unknown information yet to be predicted. However, the decoder still requires an input sequence, which is generally a randomly generated data sequence. Since the self-attention mechanism can handle global data, in order to avoid the influence of future random data on the prediction process, the masked ProbSparse attention layer is needed to process the randomly generated input sequence. This involves retaining only the confirmed data that have been predicted and masking out the future random interference data. This step can be represented as follows:
$$X_{de} = \mathrm{Concat}\big(X_{token},\, X_{0}\big)$$
where $X_{de}$ is the decoder input sequence, whose form is consistent with the input sequence, $X_{token}$ is the sequence to be trained, and $X_{0}$ is the placeholder of the target sequence.
The traditional transformer network applies "dynamic decoding" in its decoding layer: owing to the characteristics of NLP problems, it can only output results serially, which lowers the output efficiency of the program. The present work, however, addresses a time series problem rather than an NLP problem, so the decoder adopted here uses a single forward pass that outputs all sequence positions in parallel, thereby improving the output efficiency of the model.
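A short sketch of constructing the decoder input described above is given below: the last few known values serve as the start token and zero placeholders stand in for the horizon to be predicted, which the masked attention then hides from earlier positions. The lengths used here mirror those adopted later in the case study and are illustrative only.

```python
import numpy as np

def build_decoder_input(history, label_len=48, pred_len=1):
    """X_de = Concat(X_token, X_0): last label_len observed steps + zero placeholders."""
    x_token = history[-label_len:]                       # confirmed, already-observed data
    x_zero = np.zeros((pred_len,) + history.shape[1:])   # placeholder for the target sequence
    return np.concatenate([x_token, x_zero], axis=0)

history = np.random.rand(96, 16)        # 96 past steps, 16 features
x_dec = build_decoder_input(history)    # shape (49, 16); masking hides the placeholder part
```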
2.2.3. Regression Layer
The regression layer’s task is to reconstruct the initial sequence from the encoded sequence that is derived following the decoding process. In the embedding layer, the input sequence is transformed into a 512-dimensional encoded vector, and, after model computation, it still remains a high-dimensional vector. Therefore, the regression layer is required to perform inverse embedding on the vectors.
In this paper, a fully connected layer is used to implement the functionality of the regression layer. The fully connected layer can be represented as follows:
$$y = f\big(Wx + b\big)$$
where $y$ is the output sequence, $f$ is the activation function, and $W$ and $b$ are the weight matrix and bias vector, respectively. The activation function can be chosen as required, and, by training and iteratively updating the values of the weight matrix and bias vector, sequence restoration can be achieved.
3. Feedforward Network and Parameter Iteration Method
To achieve the best prediction performance, certain hyperparameters of the network model need to be fine-tuned in practical applications. Different target input sequences differ in the complexity of their features, their length, and the strength of noise disturbance, among other factors. Therefore, the hyperparameters that determine the interpretability of the model will affect the accuracy of its predictions. Take, for instance, the number of attention heads, which influences the model's capacity to grasp global dependencies in time series data: too few attention heads may cause the model to miss intricate relationships within the data, whereas too many may cause it to overfit. Likewise, other parameters that can affect prediction accuracy include the number of encoding and decoding layers, the dimensionality of the hidden layers in the feedforward network, and the sequence lengths of the encoder and decoder inputs.
Therefore, optimizing the hyperparameters in the model’s feedforward network can effectively improve the model’s performance and enhance its generalizability, enabling it to achieve good results when dealing with different datasets.
3.1. Dingo Optimizer (DOX) Algorithm Optimization
The DOX is an optimization algorithm that takes n-heads, d-model, and c-out as its three-dimensional input in order to obtain the optimal hyperparameters for the current dataset. The overall prediction model serves as the fitness function, with the MSE used as the optimization criterion, so the fitness function outputs the MSE; the algorithm then selects the best solution by minimizing this value.
The operating principle of DOX is shown in Figure 3. Its framework shares similarities with the grey wolf algorithm, as their position update methods and optimal solution search approaches are fundamentally similar. However, the grey wolf algorithm lacks a greedy mechanism, leading to slower convergence. The DOX, by contrast, integrates the grey wolf optimizer with genetic algorithm principles, adding a screening procedure that reduces the impact of search points lying too far from the target point. In order to preserve the native algorithm's exploration ability and prevent it from getting stuck in local optima, the DOX introduces supplementary position-update strategies.
The dingo algorithm begins by partitioning the initial population into three distinct groups using two randomly produced values, rand_1 and rand_2. Each group undergoes a different position update behavior, with the encirclement behavior involving a search near the current global optimal position. The formula for updating positions in this approach closely mirrors that of the grey wolf algorithm, as depicted in Equation (14).
The hunting behavior performs a localized search around the best solution found so far. Unlike the encircling behavior, in which the movement of each individual is guided by the collective actions of the group, producing a consistent pattern of positional shifts across the swarm, the hunting behavior updates positions based on the previous iteration's optimal solution and the location of a randomly chosen member of the population. This introduces a stochastic element into the positional updates, which reduces the likelihood of converging on a suboptimal local solution. This method of updating positions is captured in Equation (15).
The mechanism for updating the position in the search behavior is not contingent upon the optimal solution; it entails exploring the space between the current position and a position chosen at random, which, in turn, amplifies the stochastic nature of the position updates. This refinement boosts the model’s ability to explore and reduces the probability of settling for an inferior solution. This idea is quantitatively represented in Equation (16).
Following the updating of positions, a fitness-based selection process is employed where points with a fitness below a certain criterion are subject to further position updates. This selective strategy enhances the model’s rate of convergence. The hyperparameters to be optimized in this paper are all integers; therefore, the search capabilities required for the optimization algorithm are relatively low. Due to the complexity of the deep learning model, when the model itself is used as the fitness function, a higher convergence speed is required for the optimization algorithm. The dingo optimizer algorithm is more suitable for this optimization task.
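The following Python sketch illustrates how such a DOX-style search can wrap the prediction model as a fitness function. It is schematic only: the position-update rules are simplified stand-ins for Equations (14)-(16), the search bounds are illustrative, and `train_and_validate` is a hypothetical placeholder for training the full PSA model and returning its validation MSE.

```python
import numpy as np

def train_and_validate(n_heads, d_model, c_out):
    # Hypothetical stand-in for the real fitness function: in practice this would train the
    # PSA prediction model with the given hyperparameters and return its validation MSE.
    return (n_heads - 8) ** 2 * 1e-2 + (d_model - 256) ** 2 * 1e-5 + (c_out - 1) ** 2 * 1e-1

rng = np.random.default_rng(42)
bounds = {"n_heads": (4, 16), "d_model": (128, 512), "c_out": (1, 8)}  # illustrative integer ranges
names = list(bounds)
lo = np.array([bounds[k][0] for k in names], float)
hi = np.array([bounds[k][1] for k in names], float)

def fitness(x):
    return train_and_validate(**{k: int(v) for k, v in zip(names, x)})

def clip(x):
    return np.clip(np.rint(x), lo, hi)   # keep candidates integer and inside the bounds

pop = np.array([clip(p) for p in rng.uniform(lo, hi, size=(10, 3))])
fit = np.array([fitness(p) for p in pop])
best = pop[fit.argmin()].copy()

for _ in range(30):
    for i in range(len(pop)):
        r1, r2 = rng.random(), rng.random()
        j = rng.integers(len(pop))
        if r1 < 0.5:       # "encirclement": move towards the current best solution
            cand = best - rng.random() * np.abs(rng.random() * best - pop[i])
        elif r2 < 0.5:     # "hunting": combine the best solution with a random individual
            cand = 0.5 * (best + pop[j]) + rng.normal(size=pop.shape[1])
        else:              # "searching": explore between the current and a random position
            cand = pop[i] + rng.random() * (pop[j] - pop[i])
        cand = clip(cand)
        f = fitness(cand)
        if f < fit[i]:     # greedy screening keeps only improving moves
            pop[i], fit[i] = cand, f
    best = pop[fit.argmin()].copy()

print({k: int(v) for k, v in zip(names, best)}, "validation MSE:", fit.min())
```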
3.2. Parameter Iteration Method
In the process of model construction, there are certain parameters that need to be learned. During training, these parameters are optimized and updated at each iteration to ensure that the model achieves better predictive performance. To enable iterative updates to the parameters, the model requires a closed-loop feedback system, which necessitates the use of the backpropagation algorithm and the Adam optimization algorithm.
3.2.1. Backpropagation Algorithm
The backpropagation algorithm is applied to those layers of the network that require parameter optimization. The MSE is selected as the error measure; the error of each layer is computed during forward propagation, the gradient of the loss with respect to each parameter is obtained through the chain rule, and gradient descent is then used to minimize the loss, thereby training the parameters. This process can be represented by Equation (17):
$$w' = w - \eta\,\frac{\partial E}{\partial w}$$
where $w'$ is the updated parameter, $E$ is the overall error of the optimized part of the model, $w$ is the weight parameter to be optimized, and $\eta$ is the learning rate.
3.2.2. Adam Optimization Algorithm
The Adam algorithm is an optimization algorithm that automatically updates the learning rate. Updating the learning rate at each epoch can improve the training effectiveness of the model and alleviate potential overfitting or underfitting. The Adam algorithm uses the exponentially decaying moving average of the first moment (Equation (18)) and of the second moment (Equation (19)):
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}$$
By combining these two, the final update formula is obtained, as shown below:
$$w_t = w_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \theta}, \qquad \hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}$$
where $m_t$ is the first-order momentum of the current step, $m_{t-1}$ is the first-order momentum of the previous step, $v_t$ is the second-order momentum of the current step, $v_{t-1}$ is the second-order momentum of the previous step, and $g_t$ is the current gradient. $\beta_1$ and $\beta_2$ are the two decay hyperparameters, $\theta$ is the coefficient that prevents the denominator from approaching 0, and $\alpha$ is the model learning rate.
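A minimal NumPy implementation of the standard Adam update described by Equations (18)-(20) is sketched below; here $\theta$ plays the role of the small constant that keeps the denominator away from zero, and the toy quadratic loss is purely illustrative.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=1e-4, beta1=0.9, beta2=0.999, theta=1e-8):
    """One Adam update: exponentially decayed first/second moments plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # Eq. (18): first-order momentum
    v = beta2 * v + (1 - beta2) * grad ** 2     # Eq. (19): second-order momentum
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + theta)
    return w, m, v

# Toy usage on a single weight vector with a simple quadratic loss.
w = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 101):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))  # gradient of ||w - target||^2
    w, m, v = adam_step(w, grad, m, v, t)
```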
4. Data Preprocessing
Generally, due to the influence of equipment or human factors, the photovoltaic power generation data recorded by photovoltaic power stations may exhibit phenomena such as repetition, missing data, or anomalies [42]. Therefore, data preprocessing is required before using the dataset.
4.1. Feature Selection
The use of multidimensional time series data in this study involves the incorporation of environmental data that exhibits a high correlation with the historical irradiance data, aside from the irradiance data itself. This inclusion serves to enhance the predictive accuracy of the model. Therefore, the selection of environmental data sequences with high correlation to the irradiance series in the dataset is one of the crucial steps in data preprocessing.
The dataset used in this paper has 21 features; apart from the global horizontal irradiance, the remaining 20 features can be regarded as environmental features. Certain features, such as the solar elevation angle, diffuse irradiance, and azimuthal irradiance, show a strong linear relationship with the horizontal irradiance. In contrast, characteristics such as wind speed, wind direction, and humidity exhibit a less pronounced linear correlation with solar irradiance. Despite the lower linear correlation of some features with the horizontal irradiance, they still exhibit a strong correlation with the target feature. When training the model to predict future irradiance, features with a strong nonlinear correlation are more meaningful. Therefore, this research employs mutual information to quantify the relationship between the environmental variables and the sequence of global horizontal irradiance:
$$I(X;Y) = D_{KL}\big(P(X,Y)\,\big\|\,P(X)P(Y)\big) = \sum_{x,y} P(x,y)\,\log\frac{P(x,y)}{P(x)P(y)}$$
where $D_{KL}$ denotes the Kullback-Leibler divergence, $P(X,Y)$ is the joint distribution of the two variables, and $P(X)P(Y)$ is the product of their independent (marginal) distributions.
Evaluating the mutual information between the sequence of global horizontal irradiance and other sequences of environmental characteristics allows for the identification of features that strongly correlate with the target sequence and are suitable for use as model inputs.
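As a sketch of this step, scikit-learn's mutual_info_regression can rank candidate features against the global horizontal irradiance target; the synthetic data and feature names below are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in for the station data: "ghi" is the target and the other
# columns play the role of environmental features (names are illustrative).
rng = np.random.default_rng(0)
n = 5000
zenith = rng.uniform(0, 90, n)
ghi = np.cos(np.radians(zenith)) ** 2 * 1000 + rng.normal(0, 20, n)
df = pd.DataFrame({
    "ghi": ghi,
    "solar_zenith": zenith,               # strongly (nonlinearly) related to ghi
    "humidity": rng.uniform(20, 100, n),  # weakly related
    "wind_speed": rng.gamma(2.0, 2.0, n), # essentially unrelated
})

X, y = df.drop(columns=["ghi"]), df["ghi"]
mi = mutual_info_regression(X, y, random_state=0)
ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(ranking)   # features with the highest mutual information are kept as model inputs
```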
4.2. Data Cleaning
Datasets often contain errors, duplicates, anomalies, or missing data. Such flawed data can strongly affect model predictions, so the data must be cleaned during preprocessing. The main problem with the dataset adopted in this paper is a large amount of missing data; however, the dataset has multiple features with high linear correlation, and the missing values are randomly distributed among the features.
Therefore, the multiple imputation method is used to complete the dataset. Multiple imputation estimates the missing values by building models, such as linear regression based on the sequences containing the missing values and decision trees based on the related features. In general, multiple imputation requires that the proportion of missing values is not too large, to ensure the accuracy of the constructed models. The proportion of missing values in the dataset used in this paper is about 4%, so the multiple imputation method can be used.
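A possible realization of this step is scikit-learn's IterativeImputer, which regresses each incomplete feature on the others; with posterior sampling it approximates the multiple-imputation idea, although it is not necessarily the exact procedure used here. The synthetic data and the 4% missing rate below are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer

# Synthetic correlated features with ~4% of values removed at random,
# mimicking the situation described for the station dataset.
rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 1))
data = pd.DataFrame(base @ np.ones((1, 4)) + rng.normal(scale=0.1, size=(2000, 4)),
                    columns=["f1", "f2", "f3", "f4"])
data_missing = data.mask(rng.random(data.shape) < 0.04)

# Each incomplete feature is regressed on the other features; repeated refinement
# rounds (and posterior sampling) approximate the multiple-imputation idea.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(data_missing), columns=data.columns)

print("missing before:", data_missing.isna().mean().mean(), "after:", filled.isna().mean().mean())
```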
4.3. Filtering
The primary objective of filtering is to diminish noise and remove anomalies. For this research, the dataset is subjected to median filtering following multiple imputations, successfully eradicating the biases that arise from the imputation of missing values while preserving the integrity of the original data’s curve and boundaries.
The process of median filtering can essentially be broken down into two main stages. Initially, it is crucial to establish the dimensions of the filter window, a decision that markedly influences the outcome of the filtering. An inadequately sized window might not adequately remove substantial noise and disturbances, whereas an excessively large window could compromise the integrity of the original data and impair predictive accuracy. Consequently, the optimal window size should be chosen in consideration of the noise and signal properties.
After establishing the window size, the median filtering process is carried out on every feature in the dataset. The window moves systematically through the dataset, from beginning to end, and the contained data points are sorted. Subsequently, the median of these points is adopted as the filtered result. This procedure can be formulated mathematically as:
$$y_k = \mathrm{med}\big(x_{k-w}, \ldots, x_k, \ldots, x_{k+w}\big)$$
where $y_k$ denotes the outcome of the filtering process, $x_{k-w}, \ldots, x_{k+w}$ signify the data contained in the filter's window, and the $\mathrm{med}(\cdot)$ function computes the median value of the data within the window.
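A brief SciPy sketch of column-wise median filtering is shown below; the window size of 5 is an illustrative choice and should be tuned as discussed above.

```python
import numpy as np
from scipy.signal import medfilt

def median_filter_columns(data, window=5):
    """Apply y_k = med(x_{k-w}, ..., x_{k+w}) independently to every feature column."""
    if window % 2 == 0:
        raise ValueError("median filter window must be odd")
    return np.column_stack([medfilt(data[:, j], kernel_size=window)
                            for j in range(data.shape[1])])

x = np.random.rand(1000, 16)
x[100, 3] = 50.0                    # an injected spike that the filter should suppress
x_smooth = median_filter_columns(x)
```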
5. Performance Metrics
In the process of training and evaluating deep learning models, various statistical metrics are commonly used to measure their predictive performance. These include the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R2). The MSE is obtained by calculating the average of the squared differences between the predicted and actual values. In the model constructed in this paper, the MSE is used as the evaluation metric during the training process. The MSE is sensitive to outliers and can effectively correct large errors. Therefore, the MSE is used as an evaluation index for model training and hyperparameter optimization in this paper.
During the forecasting phase, the assessment of a model's efficacy is multifaceted. Consequently, this study employs the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R2) as metrics to gauge the model's predictive capabilities. The MAE quantifies the mean absolute discrepancy between the predictions and actual outcomes, which is valuable when greater robustness is required. The RMSE computes the square root of the average squared deviations between the predicted and actual data. The coefficient of determination primarily helps in assessing how closely the model's predictive trajectory matches the actual data trajectory, which is instrumental in comparing the predictive precision of different models applied to the same dataset.
The MAE can be calculated using the following formula:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\big|y_i - \hat{y}_i\big|$$
The MSE and RMSE can be calculated as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^{2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^{2}}$$
R2 can be calculated as follows:
$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^{2}}{\sum_{i=1}^{n}\big(y_i - \bar{y}\big)^{2}}$$
where $n$ is the total number of samples, $\hat{y}_i$ is the predicted value, $y_i$ is the true value, and $\bar{y}$ is the mean of the true values.
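For reference, the same metrics can be computed with scikit-learn, as in the following minimal example with toy irradiance values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([420.0, 510.0, 0.0, 305.0])   # example irradiance values (W/m^2)
y_pred = np.array([400.0, 530.0, 5.0, 290.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```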
6. Case Study
In order to verify the performance of the proposed model on an actual forecasting task, we used data from the Vaulx-en-Velin photovoltaic power station in Lyon, France. This dataset provides the irradiance of the photovoltaic power station for the whole year of 2018, together with 21 environmental characteristic series, which provides a sufficient volume of data for model training.
6.1. Data Description and Simulation Setting
The dataset used in this study is sourced from the Vaulx-en-Velin photovoltaic power station in Lyon, France. The Vaulx-en-Velin region is located at 45.7786° N latitude and 4.9225° E longitude, and it has a temperate maritime climate with distinct seasonal variations. The region exhibits characteristic temperate maritime climate features in terms of temperature, annual precipitation, and relative humidity throughout the year. Solar radiation is higher in the summer and autumn, and lower in the spring and winter seasons. The dataset provides solar irradiance and environmental characteristics data for the entire year of 2018. The total number of data points in this dataset is 525,450, consisting of 16 features.
According to the DOX algorithm and the ProbSparse attention-based PV prediction algorithm described above, the input parameters of the prediction model were selected using the mutual information method, of which 16 input parameters were finally used. The model's learning rate was updated in real time using the Adam optimizer in each training iteration, with the initial learning rate set to 0.0001. For this research, traditional time series forecasting methods, namely LSTM, CNN, and the Elman neural network, were chosen as benchmark algorithms.
6.2. Irradiance Prediction Based on ProbSparse Attention
Based on the theory presented in Section 3, an experimental model for solar irradiance simulation prediction was constructed. Prior to prediction, it was essential to optimize the model's hyperparameters to ensure the effectiveness of the predictions. The choice of hyperparameters varies for different datasets and significantly impacts the model's interpretability and training efficiency, thereby influencing the prediction results. The critical hyperparameters that were fine-tuned were the count of attention heads, the size of the model, the batch size for the training data, and the learning rate of the optimizer. To bolster the model's capacity for generalization and to confirm its robustness on various datasets, we utilized the DOX algorithm for the tuning of hyperparameters. Following DOX-based optimization, the experiment yielded the following optimal hyperparameter setup: 15 attention heads, a model dimension of 512, an input sequence length of 96, a training batch size of 32, and an initial learning rate for the optimizer of 0.001. The split ratio between the training set and the validation set was 8:2.
The input sequence length was set to 96, and the label length for the decoding layer input was set to 48. Predictions for the next timestep's solar irradiance value were made using a sliding time window approach. The first 80% of the dataset was used as the training set, while the remaining 20% was used as the validation set to evaluate the model. Typical daily sequences were randomly selected from the obtained prediction results, and the predicted images were compared with the original solar irradiance sequence images, as shown in Figure 4 and Figure 5. It is evident that the predicted sequences closely align with the original sequences, effectively enabling short-term solar irradiance prediction. Additionally, a scatter plot comparing the predicted images with the original sequences was generated using statistical methods, as illustrated in Figure 6. The plot demonstrates that the majority of predicted data points linearly coincide with the original data and exhibit a clear convergence trend, confirming the model's capability to effectively predict solar irradiance sequences.
6.3. Comparison with the Traditional Time Series Prediction Model
At present, prediction models such as LSTM, GRU, BP, CNN, and the Elman neural network are typically used in time series prediction tasks. For this paper, the LSTM, CNN, and Elman neural network models were selected as the control models, and the prediction performance of the ProbSparse attention-based irradiance prediction model was analyzed against them.
Data on global horizontal irradiance, as recorded by photovoltaic facilities, is subject to a range of environmental conditions, including the dry bulb temperature, zenith brightness, and atmospheric humidity. For this research, the dataset was sourced from the Lyon area in France, which is characterized by a temperate oceanic climate and marked by well-defined seasonal changes. The impact of these environmental variables is pronounced across the different seasons, and fluctuations in certain factors can alter the patterns of global irradiance measurements at photovoltaic plants. Consequently, the experimental data can be considered to exhibit different patterns depending on the season. During the comparative testing of the LSTM and CNN models, the data were segmented into four distinct subsets corresponding to spring, summer, autumn, and winter. The models were then trained and validated using subsets from each of the four seasons. The performance metrics derived from these experiments are presented in Table 1.
Since datasets from different seasons can be considered to have different data patterns, training and validating the model using different seasonal datasets can test the model's adaptability to data exhibiting different patterns. The experimental data were randomly selected for comparison with images of predictions using the proposed method, CNN, and LSTM in spring, summer, autumn, and winter. The resulting images are shown in Figure 7, Figure 8, Figure 9 and Figure 10.
It can be observed that the prediction model based on ProbSparse Attention outperforms CNN and LSTM in most cases, with slightly lower performance metrics compared to LSTM in winter, a slightly lower R2 and a slightly higher MAE in spring, and higher performance metrics in summer and autumn compared to LSTM. The performance metrics of the proposed method in this study are consistently superior to CNN networks across the different seasonal patterns. The images also visually demonstrate that the proposed method yields better prediction results compared to traditional baseline prediction methods. This indicates that the prediction model based on ProbSparse Attention is more suitable for short-term irradiance prediction.
Box plots provide a more intuitive comparison of the prediction models. By plotting the data predicted by the four models, along with the original data, on the same box plot, the visual results shown in Figure 11 are obtained. It can be seen that, compared with the LSTM network, the box plot of the PSA method is more similar to that of the original data. This indicates that, in its overall prediction, the proposed method outperforms the comparative networks in terms of prediction effectiveness.
At the same time, changing the length of the model's input sequence can affect the model's final predictive performance. The input sequence lengths for the ProbSparse attention predictive model, LSTM, and CNN neural networks were modified to 32, 96, and 168. In the predictive model designed in this paper, in order to match the input sequence length, the label length of the decoding layer was set to 18, 68, and 116, respectively. This resulted in four sets of evaluation metrics, as shown in Table 2.
It can be observed that all predictive models have improved the predictive performance to some extent as the input sequence length increases. However, during the experimental process, in order to achieve continuous point prediction, the input sequence length serves as the window size of the sliding time window. Increasing the window size of the sliding time window will increase the overall data input volume, leading to a decrease in the model’s operational efficiency. The proposed method exhibits stronger interpretability for sequences, with relatively small performance variations across different input sequence lengths. When the input sequence length is relatively small, its performance is significantly better than that of the LSTM and CNN neural network models.
7. Conclusions
Ensuring the stability of electrical grids is paramount, and this necessitates the accurate forecasting of photovoltaic power generation, due to the inherent variability and instability of solar energy. The high-precision prediction of PV output has, therefore, become a vital component in the deployment of solar energy within the power system. Accordingly, this paper proposes a deep learning network based on the ProbSparse attention mechanism for short-term irradiance prediction. The main advantages of this method are as follows: (1) compared to the neural network models commonly used in short-term irradiance prediction, this method demonstrates better predictive performance and is less prone to issues such as vanishing or exploding gradients. (2) This method introduces a novel artificial intelligence prediction model that utilizes the ProbSparse attention mechanism, which enhances the overall operational efficiency of the model, ensuring real-time and highly efficient irradiance prediction. (3) Employing dingo optimization for the autonomous optimization of the model's hyperparameters reduces the manual cost of model deployment and enhances the model's versatility.
Through extensive experimental research evaluating various performance indicators, the proposed method has shown superior precision over other reference algorithms across diverse seasons and forecasting time frames. Consequently, the forecasting model introduced in this study is considered to have considerable promise for utilization in the domain of short-term solar irradiance forecasting.
In the future, we will delve deeper into the application of attention mechanisms in the field of solar irradiance prediction. Currently, the core of our prediction approach involves utilizing the ProbSparse attention mechanism to forecast long-term time series data. However, recent studies have indicated that the practical predictive performance of attention mechanisms in time series forecasting may not surpass that of linear networks. This discrepancy may stem from the fact that the attention mechanisms we employ rely on segmenting the input sequences, akin to their application in natural language processing. However, the representation of time series features occurs over an extended period, and segmenting the input sequences might disrupt the inherent characteristics of the original sequence. This is why some studies have reverted to using linear networks for time series prediction. Nevertheless, attention mechanisms retain immense potential in the realm of time series forecasting. We aspire to address the aforementioned issues through in-depth, interpretable research into attention mechanisms, aiming to enhance their effectiveness in solar irradiance prediction.