Identification and Correction of Abnormal, Incomplete Power Load Data in Electricity Spot Market Databases

Li, Jingjiao; Lv, Yifan; Zhou, Zhou; Du, Zhiwen; Wei, Qiang; Xu, Ke

doi:10.3390/en18010176


by Jingjiao Li, Yifan Lv *, Zhou Zhou, Zhiwen Du, Qiang Wei and Ke Xu

School of Electric Power Engineering, Nanjing Institute of Technology, Nanjing 211167, China

* Author to whom correspondence should be addressed.

Energies 2025, 18(1), 176; https://doi.org/10.3390/en18010176

Submission received: 26 November 2024 / Revised: 17 December 2024 / Accepted: 23 December 2024 / Published: 3 January 2025

(This article belongs to the Special Issue Trends and Challenges in Power System Stability and Control)


Abstract: The development of electricity spot markets necessitates more refined and accurate load forecasting capabilities to enable precise dispatch control and the creation of new trading products. Accurate load forecasting relies on high-quality historical load data, with complete load data serving as the cornerstone for both forecasting and transactions in electricity spot markets. However, historical load data at the distribution network or user level often suffer from anomalies and missing values. Data-driven methods have been widely adopted for anomaly detection due to their independence from prior expert knowledge and precise physical models. Nevertheless, single network architectures struggle to adapt to the diverse load characteristics of distribution networks or users, hindering the effective capture of anomaly patterns. This paper proposes a PLS-VAE-BiLSTM-based method for anomaly identification and correction in load data by combining the strengths of Variational Autoencoders (VAE) and Bidirectional Long Short-Term Memory Networks (BiLSTM). The method begins with data preprocessing, including normalization and preliminary missing value imputation based on Partial Least Squares (PLS). Subsequently, a hybrid VAE-BiLSTM model is constructed and trained on a load dataset incorporating influencing factors to learn the relationships between different data features. Anomalies are identified and corrected by calculating the deviation between the model’s reconstructed values and the actual values. Finally, validation on both public and private datasets demonstrates that the PLS-VAE-BiLSTM model achieves average performance metrics of 98.44% precision, 94% recall, and 96.05% F1 score. Compared with VAE-LSTM, PSO-PFCM, and WTRR models, the proposed method exhibits superior overall anomaly detection performance.

Keywords: anomaly identification and correction; bidirectional long short-term memory network; power load data; partial least square; variational auto-encoders

1. Introduction

The widespread adoption of smart meters and cloud computing technologies has enabled modern power systems to collect and store vast amounts of load data. These data hold significant value for grid companies, electricity retailers, and virtual power plant operators alike [1]. High-quality historical load data are foundational for accurate future load curve prediction. Precise load forecasting is critical for informed decision-making across various aspects of power system operations, including scheduling and control, demand response, and electricity spot market transactions. With the advancement of electricity spot markets, more granular and accurate load forecasting is becoming an essential capability for fine-tuning scheduling control and designing innovative trading products. However, historical load data at the distribution network or user level are frequently plagued by outliers and missing values. These data quality issues arise from factors such as erroneous signal inputs, inaccurate measurements, data extraction problems, communication failures, and integration inconsistencies [2,3]. Poor data quality severely hinders the accurate characterization and prediction of more fine-grained electricity consumption patterns (e.g., at the distribution network or individual user level), thus impacting subsequent decision-making and actions. To avoid the “garbage in, garbage out” problem, it is imperative to identify and rectify missing and outlier values in historical load data to improve data quality. This ensures the reliability of the analysis, prediction, and scheduling processes involved in power system operations and trading [4].

Currently, scholars have researched the identification of anomalous data and have proposed corresponding solutions and measures to improve data quality. These methods can be broadly categorized into four types: (1) statistical methods; (2) distance-based methods; (3) density-based methods; and (4) data-driven methods [5,6]. In the analysis of load data, direct identification and correction of raw daily electricity data often lack a thorough analysis of the underlying components and fail to capture the patterns and influencing factors directly affecting electricity consumption. Statistical methods, including Z scores, box plots, and hypothesis testing [7,8,9,10,11,12,13], typically assume that the data follow a specific distribution and contain a certain proportion of normally distributed data. Furthermore, their parameter settings tend to be subjective. Distance-based methods operate on the assumption that normal data points are densely distributed within their local regions while outliers are sparsely distributed [14]. This can be effective for anomaly detection; however, the computational speed slows down when the distance formulas become more complex. Among density-based methods, DBSCAN is a representative technique. It can identify noise points without prior knowledge of the number of clusters to form and can discover clusters of any shape, giving it some applicability. However, it struggles to incorporate influencing factors [15]. Data-driven learning methods, which do not require expert prior knowledge or precise physical models, have also been applied to anomaly detection. However, single-detection network structures often struggle to adapt to the diverse characteristics of distribution grids or user loads and cannot effectively capture data anomaly patterns. In comparison, combined models demonstrate greater potential. 
For example, the model proposed in [16] combines a novel residual Convolutional Neural Network (CNN) with a layered Echo State Network (ESN) to capture both spatial and temporal dependencies in the data. Furthermore, the Variational Autoencoder (VAE)–Long Short-Term Memory (LSTM)-combined models adopted in [17,18,19] have achieved effective anomaly detection across multiple data types. These methods primarily learn the patterns of load changes from the training dataset to predict load changes in the test dataset, thus realizing anomaly detection through prediction. However, they often overlook the fact that data after the anomaly detection point already exist and contain a significant amount of pattern information. In reference [20], a bidirectional LSTM (BiLSTM) network was successfully utilized to capture the latent patterns and dependencies in power load data. By encapsulating a contextual understanding of the data, it significantly enhances predictive capabilities and delivers outstanding performance, providing valuable insight. In addition, current approaches to missing value imputation during data preprocessing typically rely on traditional methods such as interpolation, curve fitting, or clustering. These approaches often struggle to consider the underlying influencing factors.

Addressing the aforementioned issues, this paper proposes a method for identifying and correcting missing and anomalous values in power spot market load data. This method is based on a Partial Least Squares (PLS)–Variational Autoencoder (VAE)–Bidirectional Long Short-Term Memory (BiLSTM) network. First, missing values are identified and then initially imputed using a PLS-based approach. This approach considers both the underlying load variation patterns and the influencing factors, thereby preventing the omission of extreme impacts. The completed load curves are then normalized to finalize the data preprocessing step. Second, a VAE-BiLSTM deep hybrid network model is designed, leveraging the representation learning and modeling capabilities of the Variational Autoencoder along with the advantages of the bidirectional long short-term memory network in capturing temporal features and learning contextual information. This model effectively maps relationships between different data features. Finally, the effectiveness of the proposed method is validated through outlier detection and correction experiments conducted on both public and non-public datasets.

2. The Types of Anomalous Power Load Data

In power load data, common types of anomalous data can be broadly categorized into two groups: missing values and outliers. Missing values take a single form: they appear as “NaN” when the data are read. In comparison, outliers appear in more complex forms, such as continuous duplicate data, abnormal peaks (or troughs), or trends that deviate entirely from established electricity usage patterns. In addition to their different manifestations, the two kinds of anomalous data often differ in their causes and handling methods.

2.1. Missing Values

Missing data refer to power load records that were not captured or lost over a specific time point or period. Common causes include equipment failures, sensor disconnections, communication interruptions, or human errors. Handling methods for missing data typically include interpolation, imputation, or predicting missing values using statistical models. The imputation of missing values should aim to preserve the temporal continuity and physical consistency of the data as much as possible.

2.2. Outliers

(1): Continuous Duplicate Data

Continuous duplicate data refer to identical or similar load values recorded over the same period, often caused by sensor malfunctions, data acquisition system failures, or communication delays. These anomalies compromise data validity and may bias load forecasting models. They should be addressed through timestamp inspections or data deduplication algorithms;

(2): Abnormal Peaks (or Troughs)

Abnormal peaks or troughs refer to extreme high or low values in load data that far exceed the normal operating range. These anomalies may arise from equipment failures, sudden load increases or decreases, or environmental factors (e.g., climate changes). Although often transient, such peaks or troughs can significantly impact data analysis and system stability. Detection methods include threshold-based approaches, statistical analysis (e.g., standard deviation), or anomaly detection models;

(3): Abnormal Consumption Trends

Abnormal consumption trends refer to load data patterns or periodic fluctuations that deviate significantly from normal loads. These trends often indicate long-term system issues, equipment aging, or changes in load composition. For instance, load curves may exhibit fluctuations inconsistent with regular operating cycles or sustained deviations. Detecting such anomalies typically relies on time series analysis techniques, such as seasonal adjustments, trend analysis, or machine learning-based pattern recognition methods.
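As a rough illustration of the simpler checks mentioned above (inspection for repeated values and standard-deviation thresholds), the following Python sketch flags runs of identical consecutive readings and points that lie far from the mean. The function names, the minimum run length of 4, and the z-score limit of 3 are illustrative assumptions and are not taken from this paper.

```python
from statistics import mean, pstdev


def flag_repeated_runs(values, min_run=4):
    """Flag every point inside a run of at least `min_run` identical consecutive readings."""
    flags = [False] * len(values)
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                for k in range(start, i):
                    flags[k] = True
            start = i
    return flags


def flag_spikes(values, z_limit=3.0):
    """Flag points whose distance from the mean exceeds `z_limit` standard deviations."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - mu) / sd > z_limit for v in values]
```

Simple rules of this kind ignore calendar and weather effects, which is one motivation for the model-based approach described in Section 3.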

3. The Proposed Model

3.1. Partial Least Squares Regression

Partial Least Squares (PLS) is a regression technique suited to high-dimensional datasets with multicollinearity. It not only extracts the principal components of the data but also establishes a linear regression relationship between the input variables and the response variable.

Partial least squares regression with a single dependent variable proceeds as follows. Let the dependent variable be $Y \in R^{n}$ and the independent variables be $X = [x_{1}, x_{2}, \dots, x_{p}]$, $x_{j} \in R^{n}$. Partial least squares regression extracts the components $t_{1}$ and $u_{1}$ from $X$ and $Y$, respectively. For the regression analysis, the extracted components are required to satisfy two conditions:

(1): $t_{1}$ and $u_{1}$ should capture as much variation as possible from their respective datasets;
(2): The correlation between $t_{1}$ and $u_{1}$ must be maximized.

After the first components $t_{1}$ and $u_{1}$ are extracted, partial least squares regression is performed to model $X$ and $Y$ based on $t_{1}$. If the regression equations achieve satisfactory accuracy, the algorithm terminates. Otherwise, a second iteration is conducted using the residuals of $X$ and $Y$ that remain after they are explained by $t_{1}$. This process repeats until satisfactory accuracy is reached. If a total of $m$ components $t_{1}, t_{2}, \dots, t_{m}$ are extracted from $X$, partial least squares regression is performed by regressing $Y$ on these components. This regression is subsequently expressed as a relationship between $Y$ and the original variables $x_{1}, x_{2}, \dots, x_{p}$, completing the modeling process.
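The iterative procedure above can be sketched in plain Python for the single-response case. This is a minimal sketch under assumptions, not the authors' implementation: components are extracted by the usual deflation scheme, the number of components is fixed in advance rather than chosen by an accuracy criterion, and the names `pls1_fit`, `pls1_predict`, and `impute_missing` are hypothetical. Missing load values are marked as `None` and filled from the influencing factors in `X`.

```python
import math


def pls1_fit(X, y, n_components=2, eps=1e-12):
    """Fit single-response PLS by extracting one component at a time."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # residual of X
    f = [v - y_mean for v in y]                                # residual of Y
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in X with the largest covariance with the Y residual.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < eps:
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        if tt < eps:
            break
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: keep only what this component did not explain.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Predict the response for one row of influencing factors."""
    x_mean, y_mean, comps = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, load, q in comps:
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * load[j] for j in range(len(e))]
    return y_hat


def impute_missing(X, y, n_components=2):
    """Fill entries of y that are None using a PLS model fitted on the known entries."""
    known = [i for i, v in enumerate(y) if v is not None]
    model = pls1_fit([X[i] for i in known], [y[i] for i in known], n_components)
    return [pls1_predict(model, X[i]) if v is None else v for i, v in enumerate(y)]
```

In practice the columns of `X` would hold factors such as calendar and weather variables, and the number of components would be chosen by validation rather than fixed.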

3.2. Variational Autoencoder Model

The Variational Autoencoder (VAE) is a generative model that learns a latent representation of the data and generates new data similar to the input data [20]. The objective of the VAE is to maximize the similarity between the reconstructed data $L^{'}$ and the input data $L$ while ensuring that the latent variable $z$ follows a predefined prior distribution, typically a standard normal distribution. It consists of an encoder and a decoder. The encoder maps the input data $L$ to the parameters of a distribution in the latent space, e.g., the mean and standard deviation. The latent variable $z$ is then sampled from this distribution, as shown in Equation (1). The decoder reconstructs the data $L^{'}$ from the sampled latent variable $z$.

$$z \sim Enc(L) = q_{\varphi}(z | L) \tag{1}$$

where $\varphi$ denotes the parameters of the encoder distribution, $z$ is the latent feature, and $L$ is the input data.

The decoding process recovers the data from the latent variable space, as shown in Equation (2).

$$L^{'} \sim Dec(z) = p_{\theta}(L | z) \tag{2}$$

where $\theta$ denotes the parameters of the decoder distribution and $L^{'}$ is the reconstructed data.

Because the latent feature $z$ cannot be observed directly, the true posterior distribution $p_{\theta}(z | L)$ is intractable and is replaced by the encoder distribution $q_{\varphi}(z | L)$.

To bring the two distributions close together, their KL divergence (the expected difference between the logarithms of the two distributions) is measured, and the parameters $\varphi, \theta$ are chosen to minimize it, as shown in Equation (3).

$$\begin{aligned} \varphi, \theta &= \arg\min D_{KL}(q_{\varphi}(z | L) \,\|\, p_{\theta}(z | L)), \\ D_{KL}(q_{\varphi}(z | L) \,\|\, p_{\theta}(z | L)) &= E_{q_{\varphi}(z | L)} [\log q_{\varphi}(z | L) - \log p_{\theta}(L, z)] + \log p_{\theta}(L) \end{aligned} \tag{3}$$
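To make the role of the prior concrete, the sketch below shows the two pieces commonly used when training a VAE with a diagonal Gaussian encoder and a standard normal prior: reparameterized sampling of $z$ and the closed-form KL term. The squared-error reconstruction term and all function names are assumptions made for illustration; the paper does not state its exact loss.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element by element."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def vae_loss(x, x_rec, mu, log_var):
    """Reconstruction error (squared error, an assumption) plus the KL regularizer."""
    rec = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return rec + kl_to_standard_normal(mu, log_var)
```

The KL term is zero when the encoder outputs exactly the prior (zero mean, unit variance) and grows as the encoded distribution moves away from it.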

3.3. Bidirectional Long Short-Term Memory Networks

Unlike in load forecasting, in anomaly identification and correction the load data both before and after the point under examination have already been recorded. This paper therefore introduces a bidirectional structure, which is well suited to learning from both preceding and following context, to improve the ability of the LSTM model to learn the features of the complete load data. The Bi-LSTM uses two LSTM layers: one reads the window data in the forward direction and the other reads it in the backward direction, so the load information before and after each data point is used. Figure 1 and Figure 2 show the structures of the LSTM and Bi-LSTM models, respectively. The LSTM cell is computed as follows:

$$f_{t} = \sigma(W_{f} * [h_{t-1}, x_{t}] + b_{f}) \tag{4}$$

$$i_{t} = \sigma(W_{i} * [h_{t-1}, x_{t}] + b_{i}) \tag{5}$$

$$\hat{C}_{t} = \tanh(W_{c} * [h_{t-1}, x_{t}] + b_{c}) \tag{6}$$

$$C_{t} = f_{t} * C_{t-1} + i_{t} * \hat{C}_{t} \tag{7}$$

$$o_{t} = \sigma(W_{o} * [h_{t-1}, x_{t}] + b_{o}) \tag{8}$$

$$h_{t} = o_{t} * \tanh(C_{t}) \tag{9}$$

where $f_{t}$ is the output of the forget gate; $i_{t}$ is the control signal of the input gate, which controls how much of the candidate cell state $\hat{C}_{t}$ enters the cell; $C_{t}$ is the current cell state; $o_{t}$ is the control signal of the output gate; and $h_{t}$ is the output value after the output gate. The outputs of the forward and backward layers are combined as follows:

$$y_{t} = w_{t1} * h_{t1} + w_{t2} * h_{t2} \tag{10}$$

where $y_{t}$ is the output of the Bi-LSTM in the current window; $h_{t1}$ is the output of the forward LSTM in the current window; $h_{t2}$ is the output of the backward LSTM in the current window; and $w_{t1}$ and $w_{t2}$ are the corresponding weights.
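Equations (4) to (10) can be traced with a small scalar example. The sketch below uses a single hidden unit so that every weight is a plain number; the parameter layout (one `(w_h, w_x, b)` triple per gate), the zero initial states, and the equal combination weights are assumptions made for illustration only.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps gate name ('f', 'i', 'c', 'o') to (w_h, w_x, b)."""
    def gate(name, act):
        w_h, w_x, b = p[name]
        return act(w_h * h_prev + w_x * x_t + b)

    f_t = gate("f", sigmoid)           # Eq. (4): forget gate
    i_t = gate("i", sigmoid)           # Eq. (5): input gate
    c_hat = gate("c", math.tanh)       # Eq. (6): candidate cell state
    c_t = f_t * c_prev + i_t * c_hat   # Eq. (7): new cell state
    o_t = gate("o", sigmoid)           # Eq. (8): output gate
    h_t = o_t * math.tanh(c_t)         # Eq. (9): hidden output
    return h_t, c_t


def run_lstm(seq, p):
    """Run the cell over a sequence, starting from zero states."""
    h, c, outputs = 0.0, 0.0, []
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(seq, p_fwd, p_bwd, w1=0.5, w2=0.5):
    """Eq. (10): weighted sum of the forward pass and the re-aligned backward pass."""
    fwd = run_lstm(seq, p_fwd)
    bwd = run_lstm(seq[::-1], p_bwd)[::-1]
    return [w1 * a + w2 * b for a, b in zip(fwd, bwd)]
```

Because the backward pass reads the window from its end, the output at each position depends on readings both before and after it, which is the property exploited for anomaly detection.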

3.4. VAE-BiLSTM

In this paper, a bidirectional LSTM (BiLSTM) is embedded into the encoder and decoder parts of the Variational Autoencoder (VAE). The bidirectional LSTM allows the model to learn information in both directions along the time steps, which is particularly useful for processing time series data.

In the encoder part, each bidirectional LSTM layer can be seen as consisting of two unidirectional LSTM layers, one forward (from t = 1 to t = T) and one backward (from t = T to t = 1).

$$\begin{cases} \overrightarrow{h}_{t}^{(1)} = {LSTM}_{1}^{\to}(x_{t}, \overrightarrow{h}_{t-1}^{(1)}) \\ \overleftarrow{h}_{t}^{(1)} = {LSTM}_{1}^{\leftarrow}(x_{t}, \overleftarrow{h}_{t+1}^{(1)}) \end{cases} \tag{11}$$

Output:

$$h_{t}^{(1)} = [\overrightarrow{h}_{t}^{(1)}; \overleftarrow{h}_{t}^{(1)}] \tag{12}$$

For the second bidirectional LSTM layer, the input is the output of the first layer:

$$\begin{cases} \overrightarrow{h}_{t}^{(2)} = {LSTM}_{2}^{\to}(h_{t}^{(1)}, \overrightarrow{h}_{t-1}^{(2)}) \\ \overleftarrow{h}_{t}^{(2)} = {LSTM}_{2}^{\leftarrow}(h_{t}^{(1)}, \overleftarrow{h}_{t+1}^{(2)}) \end{cases} \tag{13}$$

Output:

$$h_{t}^{(2)} = [\overrightarrow{h}_{t}^{(2)}; \overleftarrow{h}_{t}^{(2)}] \tag{14}$$

The decoder has the same structure. Overall, when the BiLSTM is embedded in the VAE, the encoder extracts features from the input sequence L through the bidirectional LSTM layers and compresses them into the latent space z, while the decoder reconstructs the original sequence from the latent variable z through its own bidirectional LSTM layers.

3.5. Flowchart

Figure 3 shows the overall process of identifying and correcting incomplete and abnormal power load data based on the VAE-BiLSTM model. The details are as follows. First, the load and influencing-factor data are loaded and preprocessed. The preprocessing includes preliminary filling of missing values based on PLS and data normalization. Second, drawing on the representation learning and modeling ability of the VAE network and the ability of the bidirectional LSTM to capture long-term and short-term dependencies between contextual data, the bidirectional LSTM replaces the BP neural network layer of the traditional VAE, giving a VAE-BiLSTM hybrid model for abnormal power load data detection. The model is trained to simulate the underlying power consumption patterns. Third, the test data $L_{test}(t)$ containing anomalies are input into the trained VAE-BiLSTM model, and the VAE encoder estimates the embedding sequence $Z(t)$ of the test data $L_{test}(t)$; the embedding sequence is then input into the Bi-LSTM model for self-supervised learning, which outputs the reconstructed embedding sequence; finally, the VAE decoder reconstructs this sequence to obtain the reconstructed result $L_{test}^{'}(t)$. The deviation between the input $L_{test}(t)$ and the reconstructed output $L_{test}^{'}(t)$ is compared with a threshold to determine whether the recorded load data are abnormal. Abnormal points are marked and replaced with the reconstructed values.
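The final detection and correction step can be sketched as follows, assuming the reconstructed sequence has already been produced by a trained model. Treating the threshold as a percentile of the absolute deviations is an assumption suggested by the percentage thresholds discussed in Section 5.2; the function names and the default of 97.5 are hypothetical.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def detect_and_correct(actual, reconstructed, pct=97.5):
    """Flag points whose deviation exceeds the percentile threshold and replace them."""
    deviation = [abs(a - r) for a, r in zip(actual, reconstructed)]
    threshold = percentile(deviation, pct)
    flags = [d > threshold for d in deviation]
    corrected = [r if flagged else a
                 for a, r, flagged in zip(actual, reconstructed, flags)]
    return flags, corrected, threshold
```

Only the flagged points are overwritten, so normal readings keep their measured values.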

4. Performance Evaluation Index

In order to evaluate the detection and prediction accuracy of the model for the user’s daily load, this study uses precision, recall, and F1 scores as indicators. The calculation expressions are as follows:

$$Precision = \frac{TP}{TP + FP} \tag{15}$$

$$Recall = \frac{TP}{TP + FN} \tag{16}$$

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{17}$$

Here, TP (True Positive) is the number of anomalies that are correctly detected, that is, samples with a detection label of 1 and a real label of 1; TN (True Negative) is the number of samples correctly detected as normal, that is, samples with a detection label of 0 and a real label of 0; FP (False Positive) is the number of false detections, that is, normal samples misjudged as abnormal; and FN (False Negative) is the number of missed detections, that is, abnormal samples that are not detected. Precision measures how many of the samples flagged as abnormal are truly abnormal, and recall measures how completely the abnormal samples are detected; for both, larger is better. Because precision and recall constrain each other, the F1 score is introduced as a comprehensive indicator; again, larger is better.
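A direct implementation of Equations (15) to (17) from 0/1 label lists might look like the sketch below. The function name and the convention of returning 0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(true_labels, pred_labels):
    """Compute precision, recall, and F1 from 0/1 labels (1 = anomaly)."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a hypothetical example, detecting 9 of 10 true anomalies with no false alarms gives a precision of 1.0, a recall of 0.9, and an F1 score of about 0.947.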

5. Case Analysis

5.1. Test Platform and Data Sources

The PLS-VAE-BiLSTM model proposed in this paper is built with the PyTorch deep learning framework in PyCharm 2024.3.1.1. The CPU is an Intel Core i7-9750H at 2.60 GHz with 16 GB of memory, and the graphics card is an NVIDIA GTX1650 with 4 GB of video memory.

This paper uses public and private power load datasets to test the algorithm, including main grid load, distribution grid load [21], a manufacturing user’s load, and a residential user’s load [22]. The datasets cover not only the load data and meteorological factors of different grid levels but also the load data of power users in different industries. The model is trained on two months of historical load data, and the trained model is then used to detect abnormal load data in the following 10 days (sampled every 30 min, i.e., 48 data points per day).
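The data layout described above can be sketched as follows. Taking two months as 60 days, using min-max scaling, and fitting the scaling range on the training days only are assumptions; the paper states only that the data are normalized. All function names are hypothetical.

```python
def daily_windows(series, points_per_day=48):
    """Cut a flat half-hourly series into complete days of `points_per_day` readings."""
    n_days = len(series) // points_per_day
    return [series[d * points_per_day:(d + 1) * points_per_day] for d in range(n_days)]


def split_train_test(series, train_days=60, test_days=10, points_per_day=48):
    """Use the first `train_days` days for training and the next `test_days` for detection."""
    days = daily_windows(series, points_per_day)
    if len(days) < train_days + test_days:
        raise ValueError("series is too short for the requested split")
    return days[:train_days], days[train_days:train_days + test_days]


def min_max_scale(days, lo, hi):
    """Scale every reading to [0, 1] using a range fitted elsewhere (here: training days)."""
    span = hi - lo
    return [[0.0 if span == 0 else (v - lo) / span for v in day] for day in days]


# Synthetic example: 70 days of a repeating daily ramp.
series = [float(i % 48) for i in range(70 * 48)]
train, test = split_train_test(series)
lo = min(min(day) for day in train)
hi = max(max(day) for day in train)
train_scaled = min_max_scale(train, lo, hi)
test_scaled = min_max_scale(test, lo, hi)
```

Fitting the range on the training days keeps information from the detection period out of the preprocessing step.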

5.2. Network Training and Result Analysis

The trend of the loss function of the PLS-VAE-BiLSTM model is shown in Figure 4.

In the initial stage of training, the loss of PLS-VAE-BiLSTM is high because the model has not yet learned the distribution of the data. Over the first epochs, the loss decreases rapidly, indicating that the model is learning quickly and adjusting its parameters to better fit the data. In the subsequent training process, the loss stabilizes, indicating that the model has essentially converged and that further training has little effect on the loss.

As shown in Figure 5, the analysis of the abnormal load data identification results across the manufacturing, residential, distribution grid, and main grid reveals that the F1 index initially increases and then sharply decreases within the threshold range of 80% to 100%. Specifically, the F1 index for the manufacturing user increases gradually with the threshold, peaking at 0.9474 at 98% before rapidly declining. In contrast, the residential user reaches its highest value of 0.9677 at 96.5%, indicating optimal identification performance. The distribution grid shows consistent improvement, with the F1 index steadily increasing and reaching a peak of 0.9796 at 97.5%, the highest among the four sectors. Meanwhile, the main grid F1 index peaks at 0.9474 at 96%, after which a further threshold increase leads to a significant decline. These results demonstrate that the PLS-VAE-BiLSTM-based abnormal load data identification method proposed in this study performs well across various thresholds, with a significant improvement in the F1 index near the optimal threshold. The model achieves efficient and stable identification across all industries, validating its applicability and strong generalization in multi-industry contexts.
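The rise and sharp fall of the F1 index can be reproduced on synthetic data with a simple threshold sweep. In the sketch below the threshold percentage is interpreted as a quantile of the reconstruction deviations, which is an assumption about how the thresholds in Figure 5 are defined; the function names and the synthetic deviations are illustrative only.

```python
def f1_score(true_flags, pred_flags):
    """F1 score from boolean anomaly flags."""
    pairs = list(zip(true_flags, pred_flags))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(deviations, true_flags, percents):
    """Return (percent, F1) for each candidate threshold percentage."""
    ranked = sorted(deviations)
    results = []
    for pct in percents:
        idx = min(len(ranked) - 1, int(len(ranked) * pct / 100.0))
        cutoff = ranked[idx]
        pred = [d >= cutoff for d in deviations]
        results.append((pct, f1_score(true_flags, pred)))
    return results


def best_threshold(results):
    """Pick the candidate with the highest F1 score."""
    return max(results, key=lambda item: item[1])


# Synthetic deviations: 95 normal points and 5 clearly abnormal ones.
deviations = [0.01 * i for i in range(95)] + [5.0, 6.0, 7.0, 8.0, 9.0]
truth = [False] * 95 + [True] * 5
results = sweep_thresholds(deviations, truth, [80, 90, 95, 99])
```

Below the best threshold, normal points are flagged and precision suffers; above it, true anomalies are missed and recall collapses, which gives the peaked curve.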

5.3. Comparison Experiments

Figure 6 compares the performance of the PLS-VAE-BiLSTM model proposed in this study with the VAE-LSTM, PSO-PFCM [23], and WTRR [24] models in detecting abnormal load data for the manufacturing user. The results show that the precision of the PLS-VAE-BiLSTM model is 100.00%, outperforming VAE-LSTM (95%), PSO-PFCM (92%), and WTRR (90%). The recall rate is 90.00%, so the model avoids false detections while still recovering most of the anomalies. The F1 index reaches 94.74%, outperforming the other models (VAE-LSTM: 89%, PSO-PFCM: 87.20%, WTRR: 84.21%) and demonstrating a strong balance between accurate detection and comprehensive coverage of abnormal data. Manufacturing user load data exhibit strong periodicity and complex fluctuations, with abnormal loads frequently occurring during peak production periods. Traditional methods, such as PSO-PFCM and WTRR, struggle to handle nonlinear load fluctuations, leading to a lower F1 index.

Figure 7 compares the performance of the PLS-VAE-BiLSTM model proposed in this study with the VAE-LSTM, PSO-PFCM, and WTRR models in detecting abnormal loads for the residential user. The results show that the PLS-VAE-BiLSTM model exhibits clear advantages in precision, recall, and F1 score. Specifically, the precision of the PLS-VAE-BiLSTM model is 93.75%, outperforming VAE-LSTM (90%), PSO-PFCM (85%), and WTRR (80%). The recall rate is 100.00%, higher than that of the other models, indicating that PLS-VAE-BiLSTM captures abnormal data more comprehensively. In terms of F1 score, PLS-VAE-BiLSTM achieves 96.77%, surpassing VAE-LSTM (92.11%), PSO-PFCM (87.62%), and WTRR (82.61%), demonstrating an effective balance between high-precision detection and a low false-positive rate.

Figure 8 illustrates the performance of the PLS-VAE-BiLSTM model proposed in this study, alongside the VAE-LSTM, PSO-PFCM, and WTRR models, in detecting abnormal loads in the distribution grid. The results show that the PLS-VAE-BiLSTM model outperforms the other models in terms of precision, recall, and F1 score. Specifically, the precision of PLS-VAE-BiLSTM is 100%, surpassing VAE-LSTM (97%), PSO-PFCM (93%), and WTRR (85%). Its recall rate is 96%, higher than that of the other models, indicating the superior ability of PLS-VAE-BiLSTM to capture and identify abnormal loads. The F1 score of PLS-VAE-BiLSTM is 97.96%, leading VAE-LSTM (94%), PSO-PFCM (90%), and WTRR (82.35%), demonstrating an effective balance between accurate detection and a low false-positive rate. In contrast, the PSO-PFCM and WTRR models struggle to handle the random fluctuations and sudden changes in load within the complex distribution grid, leading to lower precision and recall than PLS-VAE-BiLSTM. The improvement achieved by PLS-VAE-BiLSTM in the distribution grid is attributed to its nonlinear feature extraction and its ability to capture time-series dependencies, which make it well suited to power system load monitoring.

Figure 9 compares the performance of the PLS-VAE-BiLSTM model proposed in this study with the VAE-LSTM, PSO-PFCM, and WTRR models in detecting main grid load anomalies. The results demonstrate that the PLS-VAE-BiLSTM model outperforms the other models in precision, recall, and F1 score. Specifically, PLS-VAE-BiLSTM achieves 100% precision, surpassing VAE-LSTM (98%), PSO-PFCM (92%), and WTRR (88%). Its recall rate is 90.00%, higher than that of the other models, indicating its superior ability to capture abnormal data. The F1 score of PLS-VAE-BiLSTM is 94.74%, outperforming VAE-LSTM (91.84%), PSO-PFCM (85.71%), and WTRR (81.63%), demonstrating a good balance between high precision and a low false-positive rate. Unlike anomalies in the distribution grid and residential user loads, main grid load anomalies typically manifest as widespread trend deviations, significantly impacting power system stability and exhibiting strong global and long-term characteristics. Although PSO-PFCM and WTRR can handle some load fluctuations, they lack the nonlinear modeling and time-series processing capabilities required to address the complex global anomalies of the main grid load, resulting in lower precision and recall than PLS-VAE-BiLSTM.

As illustrated in Figure 10, the PLS-VAE-BiLSTM model demonstrates superior performance in the anomaly detection task, achieving a precision of 98.44%, a recall of 94%, and an F1 score of 96.05%. This highlights its excellent ability to accurately capture anomalies while maintaining high precision. The VAE-LSTM model follows closely, with a precision of 94.88%, a recall of 89%, and an F1 score of 91.91%, reflecting strong overall performance. In comparison, the PSO-PFCM model shows moderate performance, achieving a precision of 90.5%, a recall of 85%, and an F1 score of 87.63%, which are slightly lower than those of the VAE-based models. The WTRR model performs the worst, with a precision of 85.75%, a recall of 80%, and an F1 score of 82.7%, indicating limited anomaly capture capability and a higher false alarm rate. Overall, the PLS-VAE-BiLSTM model outperforms all other compared models, showcasing its clear advantage in anomaly detection tasks.
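As a consistency check, the averages quoted for the proposed model can be recovered from the per-dataset values reported above in this section (manufacturing user, residential user, distribution grid, main grid, in that order). The computation below assumes a simple unweighted mean over the four datasets:

$$\begin{aligned}
\overline{Precision} &= \frac{100 + 93.75 + 100 + 100}{4} = 98.4375\% \approx 98.44\% \\
\overline{Recall} &= \frac{90 + 100 + 96 + 90}{4} = 94\% \\
\overline{F1} &= \frac{94.74 + 96.77 + 97.96 + 94.74}{4} = 96.0525\% \approx 96.05\%
\end{aligned}$$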

6. Discussion

Currently, the volume of fine-grained power load data is rapidly increasing, and the value of these data is contingent upon their quality meeting application requirements. To enhance anomaly detection in fine-grained load data from distribution networks or users and address the limitations of single-network models in capturing data patterns or accounting for sequential data relationships, this paper has proposed a method for identifying and correcting missing and anomalous load data based on a PLS-VAE-BiLSTM hybrid model. This method introduces the PLS approach for missing value preprocessing and combines it with the representation learning capabilities of the VAE and the contextual modeling advantages of the BiLSTM network. First, data preprocessing is performed, including PLS-based missing value imputation and data normalization. Second, a VAE-BiLSTM hybrid model is constructed and trained using a load dataset incorporating influencing factors. This training process allows the model to learn the relationships between different data features. Anomalous data are then identified and corrected by calculating the deviations between the model’s reconstructed values and the actual values. Finally, the proposed method is validated on four real-world datasets. The experimental results demonstrate that the PLS-VAE-BiLSTM model achieves average performance metrics of 98.44% for precision, 94% for recall, and 96.05% for F1 score. In comparison to the VAE-LSTM, PSO-PFCM, and WTRR models, the proposed model exhibits superior overall anomaly detection performance. Specifically, in relative terms, the PLS-VAE-BiLSTM model achieves 3.75%, 8.77%, and 14.79% higher precision than the VAE-LSTM, PSO-PFCM, and WTRR models, respectively. It also demonstrates 5.62%, 10.59%, and 17.5% higher recall than those models. Furthermore, the F1 score is 4.51%, 9.6%, and 16.15% higher compared to the VAE-LSTM, PSO-PFCM, and WTRR models, respectively. 
For each method, the precision and recall results across different datasets show a characteristic trade-off relationship. Overall, the experiments demonstrate that the PLS-VAE-BiLSTM hybrid model can effectively capture the interdependencies within load data and between load data and their influencing factors, exhibiting strong overall detection performance.
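The percentage gains quoted in this section appear to be relative improvements over each baseline rather than percentage-point differences. Using the average values reported for Figure 10, the comparison with VAE-LSTM works out as follows (small differences in the last digit elsewhere may come from rounding of the averages):

$$\begin{aligned}
\frac{98.44 - 94.88}{94.88} &\approx 3.75\% \quad \text{(precision)} \\
\frac{94 - 89}{89} &\approx 5.62\% \quad \text{(recall)}
\end{aligned}$$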

In the current work, the single-dependent-variable PLS model establishes a linear regression model between key influencing factors and the load. This is computationally efficient but is limited by its linear form and its inability to decouple the correlations between influencing factors, which calls for further improvement in future research. In the VAE-BiLSTM hybrid model, the anomaly detection thresholds are dataset-dependent and learned through training, so model training time and resource consumption need to be considered when deploying the model on a platform. Future work will focus on optimizing model performance, making the model lightweight, and reducing detection time.

7. Conclusions

Based on an analysis of the issues and limitations of existing methods, this paper proposes a power load anomaly detection and correction method based on the PLS-VAE-BiLSTM model. The conclusions and recommendations are summarized as follows:

1. Data Preprocessing:

The Partial Least Squares (PLS) method is used to establish a linear regression model between the historical load and its influencing factors. The model uses these factors to impute missing values, yielding a complete time-series load curve that is then normalized for further analysis;

2. Anomaly Identification and Correction:

The VAE-BiLSTM model is trained to reproduce historical data trends for anomaly detection. In this model, BiLSTM replaces the BP neural network in the VAE framework for encoding and decoding, effectively integrating influencing factors such as calendar and weather data. Through training, the model learns the features of the load data and reconstructs the data to capture load variation trends. Anomalies are identified by comparing the reconstructed data with the actual data, and the deviating points are then corrected. Compared to LSTM models used in previous predictive studies, the BiLSTM model demonstrates superior performance in anomaly cleaning due to its ability to better leverage contextual information from both preceding and succeeding data;

3. Consideration of Influencing Factors and Future Improvements:

The proposed model considers major factors affecting power loads, such as calendar and weather, during both data preprocessing and cleaning stages. However, there is room for further improvement in algorithm design and computational performance. Additionally, the exploration of more influencing factors holds potential for enhancing the model’s effectiveness.
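The preprocessing step summarized in the first conclusion can be sketched as follows. This is a minimal single-response PLS (PLS1) fit using the textbook NIPALS-style deflation, followed by imputation of missing load values and min-max normalization. It is an illustrative assumption, not the authors' implementation: the function names, the two influencing factors (temperature and a weekend flag), and the toy data are all hypothetical.

```python
from math import sqrt


def pls1_fit(X, y, n_components=2):
    """Fit PLS1 (one response variable) by successive deflation of X and y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered factors
    f = [v - y_mean for v in y]                                 # centered load
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in factor space with maximal covariance with f.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: remove the part of E and f explained by this component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Predict the response for one factor vector x."""
    x_mean, y_mean, comps = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, load, q in comps:
        t = sum(e[j] * w[j] for j in range(len(e)))
        e = [e[j] - t * load[j] for j in range(len(e))]
        y_hat += q * t
    return y_hat


def impute_missing(factors, load, n_components=2):
    """Fill None entries of load using PLS1 fitted on the observed points."""
    known = [i for i, v in enumerate(load) if v is not None]
    model = pls1_fit([factors[i] for i in known], [load[i] for i in known], n_components)
    return [v if v is not None else pls1_predict(model, factors[i])
            for i, v in enumerate(load)]


def min_max(values):
    """Scale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


# Hypothetical factors: [temperature, weekend flag]; toy load = 50 + 2*temp - 5*flag.
factors = [[20, 0], [22, 0], [25, 1], [30, 1], [28, 0], [24, 1]]
load = [90.0, 94.0, 95.0, None, 106.0, 93.0]  # one missing reading

filled = impute_missing(factors, load)
normalized = min_max(filled)
```

Because the toy load is an exact linear function of the two factors, the imputed value recovers the missing reading of 105. With real load data the fit would only be approximate, which reflects the linear-form limitation noted in the discussion.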

Author Contributions

Conceptualization, J.L.; software, Y.L.; validation, Z.Z.; formal analysis, Z.D. and Q.W.; writing—original draft, Y.L.; writing—review and editing, J.L.; supervision, K.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Talent Introduction Scientific Research Foundation of Nanjing Institute of Technology: Research on the Mechanism of Demand Side Response Resources Participating in the Electricity Market (No. YKJ202010).

Data Availability Statement

The original contributions presented in this study are included in this article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Gope, P.; Sikdar, B. Lightweight and Privacy-Friendly Spatial Data Aggregation for Secure Power Supply and Demand Management in Smart Grids. IEEE Trans. Inf. Forensics Secur. 2019, 14, 1554–1566.
Li, Y.; Yang, R.; Guo, P. Spark-based Parallel OS-ELM Algorithm Application for Short-term Load Forecasting for Massive User Data. Electr. Power Compon. Syst. 2020, 48, 603–614.
Sohei, I.; Takayuki, N.; Yusuke, K.; Masahiro, M.; Koji, Y. Distillation-Based Semi-Supervised Federated Learning for Communication-Efficient Collaborative Training with Non-IID Private Data. IEEE Trans. Mob. Comput. 2021, 22, 191–205.
Thomson, V.E. Garbage in, Garbage Out: Solving the Problems with Long-Distance Trash Transport; University of Virginia Press: Charlottesville, VA, USA, 2009; pp. 1–173.
Barnett, V.; Lewis, T.; Abeles, F. Outliers in statistical data. Phys. Today 1979, 32, 73–74.
Zhuo, L.; Zhao, H.; Zhan, S. Overview of Anomaly Detection Methods and Their Applications. Comput. Appl. Res. 2020, 37, 9–15.
Mendenhall, W.M.; Sincich, T.L. Statistics for Engineering and the Sciences; CRC Press: Boca Raton, FL, USA, 2016.
Bury, K. Statistical Distributions in Engineering; Cambridge University Press: Cambridge, UK, 1999.
Tong, S.; Wen, F.; Chen, L.A. Two-dimension Wavelet Threshold De-noising Method for Electric Load Data Pre-processing. Autom. Electr. Power Syst. 2012, 36, 101–105.
Wang, H.; Bah, M.J.; Hammad, M. Progress in Outlier Detection Techniques: A Survey. IEEE Access 2019, 7, 1.
Li, C. Preprocessing Methods and Pipelines of Data Mining: An Overview. arXiv 2019, arXiv:1906.08510.
Patel, V.; Kapoor, A.; Sharma, A.; Chakrabarti, S. Taxonomy of outlier detection methods for power system measurements. Energy Convers. Econ. 2023, 4, 73–88.
Simmons, J.P.; Nelson, L.D.; Simonsohn, U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 2011, 22, 1359–1366.
Knox, E.M.; Ng, R.T. Algorithms for mining distance based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases, New York, NY, USA, 24–27 August 1998; pp. 392–403.
Schubert, E.; Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. (TODS) 2017, 42, 1–21.
Alanazi, M.D.; Saeed, A.; Islam, M.; Habib, S.; Sherazi, H.I.; Khan, S.; Shees, M.M. Enhancing Short-Term Electrical Load Forecasting for Sustainable Energy Management in Low-Carbon Buildings. Sustainability 2023, 15, 16885.
Lin, S.; Clark, R.; Birke, R.; Schönborn, S.; Trigoni, N.; Roberts, S. Anomaly Detection for Time Series Using VAE-LSTM Hybrid Model. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4322–4326.
Wang, C.; Zhang, A.; Yang, L.; Zhang, B.; Li, S. Anomaly detection of UAV flight data based on VAE-LSTM modeling. Electron. Meas. Technol. 2024, 47, 187–196.
Jing, Z.; Chai, L.; Hu, S. Research on abnormal load detection method for distribution network based on improved LSTM-VAE. Electr. Meas. Instrum. 2024, 61, 71–76.
Pavlatos, C.; Makris, E.; Fotis, G.; Vita, V.; Mladenov, V. Enhancing Electrical Load Prediction Using a Bidirectional LSTM Neural Network. Electronics 2023, 12, 4652.
Available online: https://archive.ics.uci.edu/dataset/849/power+consumption+of+tetouan+city (accessed on 1 June 2024).
Available online: https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption (accessed on 23 March 2022).
Li, Q. Power big data anomaly detection method based on an improved PSO-PFCM clustering algorithm. Power Syst. Prot. Control. 2021, 49, 161–166.
Karkhaneh, M.; Ozgoli, S. Anomalous load profile detection in power systems using wavelet transform and robust regression. Adv. Eng. Inform. 2022, 53, 101639.

Figure 1. The structure of LSTM network.

Figure 2. The structure of Bi-LSTM network.

Figure 3. Flowchart of the proposed model.

Figure 4. Trend chart of load data loss function.

Figure 5. F1 score of abnormal load identification under different thresholds.

Figure 6. Comparison of abnormal load identification results in manufacturing user load dataset.

Figure 7. Comparison of abnormal load identification results in residential user load dataset.

Figure 8. Comparison of abnormal load identification results in distribution grid load dataset.

Figure 9. Comparison of abnormal load identification results in main grid load dataset.

Figure 10. The average performance of each model.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

