1. Introduction
The rapid growth in energy demand and intensifying climate change have promoted the widespread application of renewable energy [1]. However, the discontinuity and fluctuation of photovoltaic power generation present severe challenges to the stability, reliability, and operational efficiency of the smart grid [2]. To navigate these difficulties, the precision of photovoltaic power-generation forecasting is of paramount importance, as it enhances the operational efficiency of power grids and mitigates the costs associated with balancing supply and demand [3]. Moreover, accurate forecasting is instrumental in advancing the integration of photovoltaic power generation with state-of-the-art technologies, including energy storage solutions and smart grid systems [4].
Generally speaking, the field of photovoltaic power-generation forecasting encompasses a diverse array of methodologies, including physical modeling, conventional machine learning, and deep learning approaches, with a predominant focus on model design and optimization [5,6,7,8]. These techniques are fundamentally data-dependent, and their efficacy hinges on data quality. However, photovoltaic systems periodically experience downtime for maintenance and other reasons, resulting in gaps in data records [9]. The resultant data incompleteness severely disrupts the continuity of time-series signals, compromising the accuracy of prediction models, particularly in short-term forecasts [1].
To tackle intermittent data gaps, Brooks M.J. [10] and Layanun V. [11] propose linear and cubic spline interpolation methods, yet these prove inadequate in the face of abrupt weather shifts [12], and their precision wanes significantly when confronted with extensive, continuous missing photovoltaic output data [13]. References [14,15] employ regression trees for data imputation, assuming linear or predefined relationships between variables. However, these methods may fall short in accurately capturing the data's intrinsic patterns when faced with sparsity, dramatic fluctuations, or nonlinear dependencies, leading to suboptimal or failed imputation outcomes.
Additionally, supervised learning models such as Multi-Layer Perceptrons (MLPs) [16], Super-Resolution Perception Convolutional Neural Networks (SRPCNNs) [17], and Long Short-Term Memory networks (LSTMs) [12,18] have emerged as the norm in contemporary data imputation technologies. In an effort to relax the assumption that missing data share the same distribution or features as observed data, semi-supervised learning-based imputation methods such as SolarGAN [19], CC-GAIN [13], and CM-GAN [20] have also gained widespread application.
Although CC-GAIN and SolarGAN are both strong approaches, their generators have equal input and output sequence lengths and are primarily designed to impute missing values within a portion of the input sequence. When the length of the missing data exceeds the predefined input sequence length, these models fail to perform effectively. CM-GAN focuses on the relationships between cross-domain data but overlooks the representation of sequential dependencies in time-series data.
To mitigate these challenges, this work introduces a photovoltaic power data imputation approach that leverages the Wasserstein Generative Adversarial Network (WGAN) in conjunction with LSTM networks. The integration of a gradient penalty mechanism, along with a single-batch multi-iteration training strategy, ensures the stability of the model during training. The key contributions of this work are as follows:
- (1)
Designing an LSTM-WGAN framework built on the WGAN architecture. The network adopts a data-driven design with quasi-convex properties, specifically tailored to the complexities of photovoltaic power data imputation.
- (2)
Utilizing real-world photovoltaic-system data for testing. The quality of the generated data is demonstrated through three validation approaches: frequency-domain assessment, t-SNE visualization, and a comparative analysis of forecasting performance.
The organization of this paper is as follows: Section 2 provides an in-depth description of the photovoltaic power-generation data imputation method based on LSTM-WGAN. Section 3 presents the experimental results and tests the proposed method on a real-world dataset to demonstrate its effectiveness. Finally, conclusions are drawn in Section 4.
2. LSTM-WGAN Method
In the imputation task, capturing the essence of the sequence’s characteristics takes precedence over the precise replication of values. In such scenarios, a GAN network emerges as the ideal candidate. Although there are several GAN variants, such as Conditional GAN (CGAN) and CycleGAN, most of them are not suitable for this task. For example, CGAN requires labeled data, which are not at our disposal. Meanwhile, CycleGAN, with its dual generator and discriminator architecture, is crafted for domain-to-domain translations, thus rendering it incompatible with our data imputation needs. In contrast, WGAN excels in generating data that closely adheres to the real data distribution. Consequently, WGAN is chosen for data imputation, due to its superior training stability, reliable convergence, and ability to generate high-quality data through the gradient penalty mechanism.
A conventional GAN network consists of two main components: the generator and the discriminator. The generator generates simulated data through input noise. Nevertheless, when applying standard GANs to the task of imputing missing data in photovoltaic power time-series, ensuring continuity at the edges of the imputed data can be quite challenging. To address this issue, our objective is to integrate historical data into the generation process, thereby infusing temporal patterns and imposing smoothness constraints to enhance the quality of the generated data.
Considering the temporal dynamics inherent in photovoltaic power data, the LSTM network was selected for our approach. Although other Recurrent Neural Networks (RNNs) such as Gated Recurrent Units (GRUs) are capable of capturing sequential dependencies, LSTM was favored, due to its extensive adoption and proven track record. It offers the optimal balance between performance and model complexity for the particular requirements of our application.
Thus, by integrating the strengths of both WGAN and LSTM, we achieve a stable and dependable data imputation process that adheres to the sequential patterns intrinsic to photovoltaic time-series data. This hybrid approach is particularly tailored to the task we aim to accomplish. The complete system architecture is depicted in
Figure 1, and a thorough explanation is provided in the subsequent sections.
2.1. Generator
Let $g_{out}$ represent a vector comprising $k$ photovoltaic power-generation data points starting at time $t_0$, and let $q$ be a vector consisting of $m$ consecutive photovoltaic power-generation values that are known before $t_0$. The feature extraction model is constructed as follows:
where LSTM denotes the Long Short-Term Memory network [21], and its fundamental building block is depicted as follows: where $h_{i-1}$ represents the output from the previous level and $q_i$ represents the input at the current level; $\sigma$ and $\tanh$ denote the activation functions; and $W_o$, $W_f$, $W_b$, $W_c$, $b_o$, $b_f$, $b_b$, $b_c$ are the parameters of the network model.
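For reference, a standard LSTM cell consistent with the parameters listed above can be written as follows; assigning the subscript $b$ to the input gate is our assumption based on the parameter list rather than the authors' notation:

$$
\begin{aligned}
f_i &= \sigma\left(W_f[h_{i-1}, q_i] + b_f\right), &&\text{(forget gate)}\\
b_i &= \sigma\left(W_b[h_{i-1}, q_i] + b_b\right), &&\text{(input gate, assumed)}\\
\tilde{c}_i &= \tanh\left(W_c[h_{i-1}, q_i] + b_c\right), &&\text{(candidate cell state)}\\
c_i &= f_i \odot c_{i-1} + b_i \odot \tilde{c}_i, &&\text{(cell state update)}\\
o_i &= \sigma\left(W_o[h_{i-1}, q_i] + b_o\right), &&\text{(output gate)}\\
h_i &= o_i \odot \tanh(c_i). &&\text{(hidden output)}
\end{aligned}
$$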
Given that photovoltaic power generation is subject to a multitude of stochastic elements, including weather conditions, temperature fluctuations, cloud cover, and more, these influences typically emerge as noise or random variations. Consequently, the global feature representation of photovoltaic power-generation sequences encapsulates overarching patterns associated with randomness or noise, and is built as
where
FC1 and
FC2 both represent fully connected layers. Subsequently, by concatenating the feature vector
from Equation (1) with the global feature vector
, the comprehensive feature can be expressed as follows:
The final generated vector for photovoltaic power generation, denoted as $g_{out}$, is formulated as follows: where $FC_3$ and $FC_4$ represent fully connected layers.
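As a minimal sketch of the generator structure just described (LSTM features from the known history $q$, a global feature derived from input noise via $FC_1$ and $FC_2$, concatenation, and output through $FC_3$ and $FC_4$), the PyTorch code below may help; the hidden size, noise dimension, and Leaky ReLU slope are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of the generator; layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, m, k, hidden=64, noise_dim=32):
        super().__init__()
        # LSTM extracts sequential features from the m known values preceding t0.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        # FC1/FC2 map the input noise to a global feature vector.
        self.fc1 = nn.Linear(noise_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        # FC3/FC4 map the concatenated feature to the k generated values.
        self.fc3 = nn.Linear(2 * hidden, hidden)
        self.fc4 = nn.Linear(hidden, k)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, q, z):
        # q: (batch, m, 1) known history; z: (batch, noise_dim) random noise.
        _, (h_n, _) = self.lstm(q)                              # h_n: (1, batch, hidden)
        seq_feat = h_n.squeeze(0)                               # sequential feature vector
        glob_feat = self.act(self.fc2(self.act(self.fc1(z))))   # global feature vector
        feat = torch.cat([seq_feat, glob_feat], dim=1)          # comprehensive feature
        return self.fc4(self.act(self.fc3(feat)))               # g_out: (batch, k)
```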
To facilitate training, we selected activation functions with quasi-concave or quasi-convex properties, such as ReLU and Rectified Exponential Unit (REU) [
1]. These functions were tested to evaluate their effects on training stability and convergence speed. Leaky ReLU was ultimately chosen for our model, due to its outstanding performance.
Notably, to prevent the generator from simply shortening the input sequence to produce the output sequence, it is crucial to impose the constraint m < k. This ensures that the generator is forced to learn meaningful correlations and patterns within the existing data, which allows it to accurately predict missing values. By repeating this data generation process, the model can produce sequences of arbitrary length. This approach, therefore, enables the model to handle missing data at both short and long intervals in a consistent manner.
To prevent overflow due to large data values, Min–Max Scaling was employed to normalize the input data before feeding it into the model. Once the model generates the output, the inverse transformation is applied to ensure that the normalization process does not affect the results.
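As a sketch of how the two preceding points fit together (not the authors' code), the routine below scales the history, conditions the generator on the last $m$ known values, and appends $k$ generated values per pass until a gap of arbitrary length is filled, before inverting the scaling. The `Generator` interface is the one assumed in the sketch above.

```python
# Minimal sketch of repeated generation for arbitrary-length gaps with Min-Max scaling.
import numpy as np
import torch

def impute_gap(history, gap_len, gen, m, k, noise_dim=32):
    # history: 1-D array of known values preceding the gap (length >= m).
    lo, hi = history.min(), history.max()
    filled = list((history - lo) / (hi - lo + 1e-8))     # Min-Max scaling to [0, 1]
    n_hist = len(filled)
    while gap_len > 0:
        q = torch.tensor(filled[-m:], dtype=torch.float32).view(1, m, 1)
        z = torch.randn(1, noise_dim)
        with torch.no_grad():
            g_out = gen(q, z).view(-1).tolist()          # k new values per pass
        take = min(k, gap_len)
        filled.extend(g_out[:take])
        gap_len -= take
    imputed = np.array(filled[n_hist:])
    return imputed * (hi - lo) + lo                      # reverse normalization
```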
2.2. Discriminator
The primary task of the discriminator is to distinguish between real and generated data, which is a task that falls under binary classification. Its role is to evaluate the authenticity of the input data and provide critical feedback to refine the generator’s performance throughout the training process. The discriminator’s architecture is typically composed of convolutional layers, followed by fully connected layers, and culminates in a sigmoid activation function for binary classification purposes. The underlying formula for this process is as follows:
where
Xin and
Xout denote the input and output of the discriminator, respectively; LReLU(.) denotes the Leaky ReLU non-linear activation function;
FC stands for the fully connected layer; and
conv(.) refers to the convolutional layer, which is employed to extract features from the input data. Lastly, the output is transformed to the range [0, 1] via the
sigmoid function, which offers a probabilistic indication of whether the input data are actual or generated.
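A minimal PyTorch sketch of such a discriminator is given below; the kernel sizes, channel counts, and layer depth are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the convolutional discriminator described in the text.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, k, channels=16):
        super().__init__()
        self.features = nn.Sequential(          # conv(.) layers extract features
            nn.Conv1d(1, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),                  # LReLU(.) non-linear activation
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * k, 1),         # FC layer
            nn.Sigmoid(),                       # maps the score to [0, 1]
        )

    def forward(self, x):
        # x: (batch, k) real or generated power sequence
        return self.classifier(self.features(x.unsqueeze(1)))
```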
2.3. Model Training
2.3.1. Model Training
In this proposed model, the training procedure entails the concurrent optimization of two key components: the generator G and the discriminator D. The objective of the generator is to fabricate data of such high fidelity that they elude detection by the discriminator, whereas the discriminator is responsible for precisely differentiating actual and generated data. Customarily, this adversarial framework is formulated as a minimax optimization problem, where the design of the loss function is critical for ensuring convergence.
Unfortunately, ordinary GANs are subject to training instability. Accordingly, the Wasserstein GAN (WGAN) with Gradient Penalty is employed to stabilize training and mitigate overfitting. Gradient penalty is a regularization technique, which encourages the discriminator to remain Lipschitz continuous without relying on weight clipping.
The loss functions for both the generator and the discriminator are delineated as follows: where $\mathbb{E}$ denotes the expected value; $P_r$ and $P_g$ represent the distributions of real and generated data, respectively; $\lambda$ and $\varepsilon$ denote the regularization coefficient and a random value drawn from the interval [0, 1], respectively; and the real data, the generated data, and the interpolated data enter the loss, where the interpolated data are expressed as a convex combination of a real sample and a generated sample weighted by $\varepsilon$.
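For reference, the standard WGAN-GP formulation that this description matches is given below in our own notation, where $x$, $\tilde{x}$, and $\hat{x}$ denote real, generated, and interpolated samples; the sign conventions may differ from the authors':

$$
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x}\sim P_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x\sim P_r}\big[D(x)\big] + \lambda\,\mathbb{E}_{\hat{x}}\Big[\big(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1\big)^{2}\Big],\\
L_G &= -\,\mathbb{E}_{\tilde{x}\sim P_g}\big[D(\tilde{x})\big], \qquad \hat{x} = \varepsilon x + (1-\varepsilon)\tilde{x}.
\end{aligned}
$$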
The discriminator, acting as a binary classifier, produces an output between 0 and 1, indicating whether the input is generated or real data, respectively. The objective of the generator is to deceive the discriminator by maximizing $L_G$, thereby generating data that closely mimic the distribution of real data. Concurrently, the discriminator sharpens its discriminative capabilities by minimizing $L_D$, thereby improving its ability to differentiate between real and generated data.
2.3.2. Hyperparameter Determination
The regularization penalty λ is a critical parameter in the gradient penalty mechanism of WGAN. It controls the trade-off between enforcing the Lipschitz continuity condition and maintaining model stability. To determine λ, we referred to commonly used values (e.g., 10) in the WGAN literature as a starting point. We then conducted a grid search over a range of values (5, 10, 20). Through experimentation, we found that λ = 10 offered the best balance between stable training and high-quality data generation for our specific task.
The optimal number of iterations for the generator and discriminator is a key question. GANs can suffer from instability if the generator and discriminator are not balanced in terms of updates. If one component is updated too frequently, it can become too powerful, making it difficult for the other to learn.
To address this issue, we initially performed a series of training runs with varying numbers of iterations (1, 2, 5, and 10 iterations for each component). During these runs, we monitored the loss values for both, observing whether there were sharp fluctuations or divergence. By experimenting with different iteration counts, we found that updating the generator 5 times for each batch and the discriminator 2 times provided stable training and reliable convergence.
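The sketch below illustrates one training step under this schedule (two discriminator updates followed by five generator updates per batch, with $\lambda = 10$); the gradient-penalty term follows the standard WGAN-GP recipe and is not the authors' exact code.

```python
# Minimal sketch of one WGAN-GP training step with the 5/2 update schedule.
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1)                       # random epsilon in [0, 1]
    inter = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(inter).sum(), inter, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()   # Lipschitz penalty

def train_step(G, D, opt_g, opt_d, q, real, noise_dim=32):
    for _ in range(2):                                      # discriminator updates
        z = torch.randn(real.size(0), noise_dim)
        fake = G(q, z).detach()
        loss_d = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    for _ in range(5):                                      # generator updates
        z = torch.randn(real.size(0), noise_dim)
        loss_g = -D(G(q, z)).mean()                         # try to fool the discriminator
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```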
The batch size determines the number of samples used in each training iteration. It directly affects the training dynamics, including convergence speed and stability. A smaller batch size provides more frequent updates but can lead to noisy gradients, while a larger batch size stabilizes training but may require more computational resources. We started with a series of batch sizes (32, 64, 128, and 256) and tested various configurations. Through these experiments, we found that a batch size of 128 provided an optimal trade-off, ensuring stable training and efficient use of computational resources.
2.3.3. Overfitting Problem
Unlike supervised models, which have explicit feedback in the form of labels, GANs only receive feedback based on whether the generated samples are realistic (via the discriminator). This less explicit feedback increases the risk of the model focusing too much on irrelevant aspects of the data. As a result, GANs are generally more prone to overfitting compared to supervised learning models.
Overfitting in GANs typically manifests in two ways: generator overfitting and discriminator overfitting. In the former case, the generator might learn to produce outputs that are too similar to the training data, resulting in a lack of diversity in the generated samples. In the latter case, if the discriminator is too powerful or is trained for too many iterations without sufficient regularization, it may become overly sensitive to small fluctuations in the data, thereby providing poor feedback to the generator.
To address potential overfitting issues, we implemented two strategies during model training: (1) we applied gradient penalty regularization in the WGAN framework to stabilize training and mitigate overfitting; and (2) we randomly deactivated a fraction of neurons during training (i.e., dropout) to encourage better generalization.
3. Experiments and Results
All experiments in this work were conducted on a computer equipped with an NVIDIA GeForce RTX 4070 graphics card (sourced from Santa Clara, CA, USA), powered by an Intel i9-14900HX processor (sourced from Santa Clara, CA, USA), and supplemented with 64 GB of Samsung RAM (sourced from Seoul, South Korea). Note that to train the model with limited resources, fewer LSTM layers or fewer units per layer can be used to reduce computational load and memory usage. However, this may come at the cost of reduced model performance.
3.1. Data Analysis
Figure 2 shows photovoltaic power generation data from the Desert Knowledge Australia Solar Centre in 2023. Due to regular shutdowns for equipment maintenance or component aging, as well as forced stoppages when electricity demand is insufficient, the photovoltaic power-generation system often experiences intermittent operation, resulting in discontinuous generation records in the photovoltaic data.
To visually analyze the characteristics of the photovoltaic power-generation data sequence, Discrete Cosine Transform (DCT) analysis was employed. DCT is a technique used to transform a time-domain signal into the frequency domain. The result consists of frequency components that represent various characteristics of the signal. In the context of time-series data such as photovoltaic power generation, DCT helps break down the signal into a series of frequency components, each of which carries specific information about the data.
Low-frequency components in the DCT represent the overall trends or slow changes in the data. In time-series data, these correspond to the smooth, large-scale patterns or longer-term behaviors. For example, in the context of photovoltaic data, low frequencies might correspond to overall seasonal variations or long-term power generation trends.
High-frequency components correspond to rapid fluctuations or short-term variations in the data. These are typically associated with the fine details, such as daily changes in power generation, noisy variations, or transient events.
If the imputed data results in significantly higher or lower amplitudes in the low-frequency range, it suggests that the imputation method has altered the global structure of the data. A significant increase in low-frequency amplitude could mean that the imputation method has introduced artificial or over-smooth trends, while a decrease might suggest that the imputation has removed important long-term patterns.
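The following sketch (assuming SciPy) shows the kind of check described above: computing the first few low-frequency DCT coefficients of the original and imputed series and comparing their amplitudes.

```python
# Minimal sketch of the low-frequency DCT comparison; a large change in these
# amplitudes after imputation would indicate a distorted global structure.
import numpy as np
from scipy.fft import dct

def low_freq_components(series, n_components=5):
    coeffs = dct(np.asarray(series, dtype=float), norm="ortho")
    return coeffs[:n_components]        # low-frequency amplitudes (global trend)

# Example usage:
# print(low_freq_components(original_series))
# print(low_freq_components(imputed_series))
```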
Relative to the low-frequency components, high-frequency components are difficult to visualize. Therefore, only a subset of the low-frequency components is shown in
Figure 3.
Figure 3b,c illustrates the first five low-frequency components of the DCT obtained with the zero-padding and 200-value filling methods for the missing data, respectively; clear differences in amplitude are evident between them.
When the imputed data deviate from the true data trends, they provide erroneous input that misguides data-driven models into learning incorrect relationships. This ultimately results in inaccurate predictions and flawed conclusions, which can significantly impact operational decisions and potentially increase costs. Consequently, accurate data imputation is vital to avoid changes in the amplitude of frequency components, thereby ensuring the overall integrity and accuracy of the dataset.
3.2. Validation Analysis
Figure 4 illustrates the evolution of loss values throughout the training process of the proposed network. In the experiment, the batch-optimization parameters for the generator and discriminator were configured to perform five and two iterations per batch, respectively. Observing the convergence of loss values across training iterations, it is evident that the adversarial strategy introduced in this work has effectively demonstrated its utility. Despite minor fluctuations observed during the training process, the system promptly regains stability, maintaining an overall robust stability.
Figure 5 displays the generated data along with their associated DCT frequency components. In contrast to the result with zero-padding, it can be observed that the frequency components of the generated data have been improved. These generated data effectively assist the prediction model, enhancing its ability to capture fundamental frequency characteristics.
To more intuitively illustrate the quality and reliability of the data generation process, a t-SNE (t-Distributed Stochastic Neighbor Embedding) visualization method is adopted. t-SNE conducts a nonlinear dimensionality reduction by first computing the pairwise similarities between data points in high-dimensional space and then finding a lower-dimensional representation that preserves these similarities. It preserves the relative distances between data points, so that similar points in high-dimensional space remain close to each other in the lower-dimensional representation. This technique is very useful for visualizing high-dimensional data in lower-dimensional spaces (typically 2D or 3D) to better understand the structure and distribution of data.
In our case, t-SNE is used to compare the real and generated data and interpret how well the generator has learned the underlying distribution of the real data. If the real and generated data points are well separated, it may indicate that the model is generating data that is not representative of the real data distribution. However, if the generated data overlaps significantly with the real data, it suggests that the generator has learned to mimic the real data distribution effectively, generating high-quality synthetic data.
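A minimal sketch of this comparison (assuming scikit-learn and Matplotlib) is shown below; windows of real and generated power data are embedded into two dimensions and plotted together, and the perplexity value is an illustrative choice.

```python
# Minimal sketch of the t-SNE comparison between real and generated data windows.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_compare(real_windows, generated_windows):
    data = np.vstack([real_windows, generated_windows])
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(data)
    n = len(real_windows)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=8, label="real")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=8, label="generated")
    plt.legend()
    plt.show()
```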
t-SNE provides a qualitative insight into the relationship between real and generated data, as depicted in
Figure 6. It can be seen that there is a substantial overlap between the original and generated data within the two-dimensional plane. This indicates that the generated data have effectively encapsulated the structural and distributive characteristics of the original dataset. This alignment not only confirms the effectiveness of the generation model, but also highlights the promising potential of the proposed method for data synthesis tasks.
3.3. Performance Test
To quantitatively assess the predictive accuracy, the experiment employed three principal evaluation metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the Coefficient of Determination ($R^2$). Given that the actual data and the predicted data are denoted by $y$ and $\hat{y}$, respectively, the mathematical formulations for these metrics are outlined below.
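For reference, the standard definitions of these metrics, with $n$ samples, $\hat{y}_i$ denoting the $i$-th prediction, and $\bar{y}$ the mean of the actual values, are:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}.
$$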
The MAE is an intuitive metric because it averages the error magnitudes across all predictions; a lower MAE indicates a better result. The RMSE penalizes larger errors more heavily, while remaining easy to interpret because it shares the units of the target variable. $R^2$ quantifies the proportion of variance in the actual data that is explained by the predictions, with values closer to 1 indicating a better fit.
In this experiment, six models widely used in the field of time-series forecasting were selected to evaluate the impact of the proposed method on photovoltaic output prediction. The chosen forecasting models include Long Short-Term Memory (LSTM) [22], Bidirectional LSTM (BiLSTM) [23], Stacked LSTM (SLSTM) [24], a hybrid model of CNN and LSTM (CNN_LSTM) [25], Gated Recurrent Units (GRUs) [26], and Bidirectional GRU (BiGRU) [23].
To ensure fairness in the comparative analysis, all algorithms in the experiment used a standardized configuration: a batch size of 64, 50 training epochs, the Adam optimizer, and a fixed learning rate of 0.001. In all cases, 80% of the data was allocated for training, while the remaining 20% was reserved for testing.
The photovoltaic power-generation dataset we used has a sampling interval of 5 min and spans 1 year. The number of missing records varies across datasets. However, the total number of records exceeds 100,000 in the datasets.
Table 1 presents the performance comparison for the missing and imputed data at the DKA Solar Centre. In these tables, the suffix letter "F" stands for "imputed data". From the experimental results, it can be observed that the performance of photovoltaic output prediction after data imputation has improved in terms of MAE, RMSE, and $R^2$.
Although the improvement in each evaluation metric is relatively small, this does not indicate poor performance of the data imputation method. The primary reason is the fact that missing data account for less than 3% of the dataset, with the majority of the data being complete and intact. Additionally, testing datasets include varying lengths of missing data, ranging from 1 to 1797 points. The results indicate that the performance remains robust, even for short and long periods of missing data.
We chose the data from the DKA Solar Centre due to their high quality, detailed records, and public availability. These characteristics make the data a valuable resource for research reproducibility and benchmarking. However, including data from various geographical and climatic conditions would enhance the credibility of the results. For this reason, we conducted additional experiments on our own dataset to evaluate the effect of geographical and climatic conditions. The experimental results, as tabulated in
Table 2, indicate that our proposed method improves prediction accuracy after data imputation, even when applied to different datasets. However, due to company policies, we are unable to make the dataset from West China publicly available.
3.4. Practicality Analysis
Our model can be trained on standard hardware, with a cost not exceeding USD 1500. However, using larger datasets or deploying the model on smaller systems with limited computational power may result in longer training times. To mitigate this, one can reduce the computational load by using fewer LSTM layers or fewer units per layer, though this may come at the cost of slight reductions in model performance.
In terms of real-time processing, while the training process for the LSTM-WGAN model is relatively time-consuming, the generator’s inference speed is quite fast. Our tests show that the average time for the generator to complete a single data-generation process is approximately 0.0026 s. Considering that the current minimum sampling interval for PV data is 1 min, our method is well suited to meet the timeliness requirements for real-time processing.
Our model can handle time-series data, including abrupt changes or anomalies, by leveraging historical data and sequential patterns. Based on these results, it is capable of adapting to seasonal variations, since our experimental data span a full year. If the historical data include indications of sudden changes in weather conditions or unexpected shifts, our method is likely to accurately capture the change trends and correctly impute the data. However, if the historical data do not show any signs of such changes, the model may not be able to respond effectively.
In addition, applying the model to data from different geographical locations presents a challenge. The main reason is that the relative position of the Sun to the Earth varies, resulting in differences in sunlight hours, and it is difficult to infer geographic information solely from short-term historical data. Therefore, our method may require additional adaptations to effectively handle data from diverse geographical areas.
Note that although our method can produce sequences of arbitrary length by repeating the data-generation process, its performance may degrade when the missing data span prolonged periods. This is because our method relies heavily on historical data and sequential patterns to estimate missing values. Extensive gaps in the data can result in the loss of critical temporal features or trends, which the model depends on to make accurate predictions.
4. Conclusions
This work tackles the challenge of missing data in photovoltaic power-generation records, and proposes a photovoltaic power data imputation method based on WGAN and LSTM. The method robustly maintains model training stability and the integrity of the generated data by utilizing a designed data-driven GAN network, along with an implemented gradient penalty mechanism. Through frequency domain analysis of the generated data and t-SNE metric evaluation, the effectiveness of the proposed method in generating high-quality photovoltaic data is verified. The testing datasets include varying lengths of missing data, ranging from 1 to 1797 points. The results indicate that the performance remains robust, even for short and long periods of missing data.
Our model supports real-time processing, with an average generator inference time of 0.0026 s. It can handle time-series data with seasonal variations. While the model performs well when historical data reflect abrupt changes, it struggles if such patterns are absent. Furthermore, the trained model fails to generate the data at different geographical locations due to variations in sunlight hours and the difficulty of inferring geographic information from short-term data, requiring further adjustments for diverse regions. As a result, expanding our approach to distributed systems with varying geographic and climatic conditions is a key focus of our future research.