1. Introduction
Reliable electric power transmission and distribution have long been at the forefront of power system studies. Power system reliability, as defined by [1], encompasses both system adequacy and system security. The former concerns the balance between energy supply and energy demand within the system, while the latter concerns the system's ability to recover from disturbances. These days, more and newer devices such as renewable energy sources and converter-interfaced generation are being introduced to the network. These induce uncertainties, as shown by [2], and harmonic disturbances in the voltage signal, as shown by [3]. Further, the integration of DC sources and loads into hybrid AC/DC distribution systems is also increasing [4]. To help maintain reliability, reference [5] notes that operators need full system monitoring to maximize renewable energy production and to make the power system more versatile. However, system monitoring instruments such as SCADA, AMI, and PMUs are expensive; thus, optimal use is normally required. As discussed in [6], fully covering the system with measurement devices ensures reliable service but yields diminishing returns on investment. For this reason, optimal placement studies such as [6,7,8] attempt to find a configuration that makes the system observable while remaining cost effective. This leads to the problem of completing the missing measurements of the current system state given limited information; thus, power system state estimation was developed.
State estimation works by pooling the limited measurement data available and finding a suitable system state that approximates the given measurements with the least amount of error. This is largely analogous to regression analysis, wherein a best-fit function's parameters are adjusted in order to reconcile the data with the function mapping. However, instead of the usual forecasting paradigm, state estimation aims to deduce current values given current and historical input data. The key difficulty is establishing an effective model to relate the measurement values to the system state, that is, completely modeling the power system including the non-linear components, the noise acquired during data acquisition, and the statistical uncertainties of the renewable energy sources. Few references consider hybrid distribution system implementations; some of these include [9,10], wherein the conventional state estimation method has been decentralized to evaluate the AC and DC subgrids separately. The work in [9] capitalized on the duality principle, while the work in [10] introduced an intermediate non-linear variable in its three-stage approach. Since the conventional state estimation method is a snapshot in time, the distributed generation and renewable energy sources can be summarily modeled as current injections in the DC subgrid. For references [9,10] too, however, the conventional process is computationally intensive and is open to failures in its assumptions, as [11] mentions.
Due to the nature of this problem and its analog, regression analysis, there exists a method that excels in summarily modeling unseen and complex relationships between data: machine learning, specifically, neural networks. Several allied works have already used neural networks in some form for power system state estimation. Works such as [12,13] used neural networks simply to prepare and clean measurement data to aid in state estimation. One work [14], as far back as 1996, already explored the use of neural networks for topology processing and static state estimation. A more recent work [15] adopts Bayesian state estimation using machine learning so that estimation of unobservable parts of the system becomes possible. Other works try to include the original components of state estimation in the training process, such as [16]. In the same category, another recent work utilizes the concept of optimal measurement device placement to dictate the structure of the neural network [17]. Yet another work used neural networks to make state forecasts instead of estimates [18]. These works show that neural networks can indeed be used for state estimation. However, for large, practical systems, the neural networks tend to become large, heavy, and computationally taxing as well.
This paper proposes taking the neural network application three steps further by (i) adopting the historical measurements and states to improve the accuracy of state estimation, especially for those states near transformers, (ii) utilizing standardized model optimization techniques to lighten the neural network for state estimation, and (iii) expanding its target application to the state estimation of hybrid AC/DC distribution systems. To lay out the thought process, this paper is structured as follows:
Section 2 discusses the basic intuition on power system state estimation and the conventional method: weighted least squares state estimation (WLS-SE). It also shows the main objective of state estimation, i.e., to minimize the error between the calculated and the measured system state.
Section 3 discusses neural networks and their learning process, which involves minimizing errors between initial and target predictions and using those errors to update the network's own parameters. The section also ties machine learning to state estimation by showing where it can be used within state estimation and the advantages it has over the conventional WLS-SE.
Section 4 carefully lays out the process of developing and verifying the proposal from data preparation all the way through performance evaluation.
Section 5 provides insight into the performance of the proposed neural networks, and Section 6 provides a summary, conclusions, and possible future work on the topic.
2. Power System State Estimation
The intuition behind state estimation (SE) for power systems is to solve for a system's state given a set of measurement data (making the system observable) while also considering possible data acquisition error. The basic formulation is shown in Equation (1), and also in most state estimation works such as [19,20,21,22,23], where z is the measurement data, h is a function that translates the true system states, x, to measurement values, and e is the modeled acquisition error. For every state estimation problem, there exist required conditions so that the system state can be accurately and validly estimated. First is confirmation of the network topology, achieved through confirmation and/or estimation of breaker statuses. Second is observability, which intuitively means that the states of all the nodes of the system can be solved given a set of measurements in the form shown in Equation (1) [24].
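Written out, the measurement model described by Equation (1) takes the standard form below; the notation here is assumed for readability, with the paper's own Equation (1) remaining the authoritative statement:

$$ z = h(x) + e $$

where z is the measurement vector, x the state vector, and e the acquisition error.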
Due to the strong dependency of this process on measured data, any estimate of the system state will only be as good as the measurement data it was fed [23,25]. Thus, the third requirement is bad data processing, which is why papers that deal with bad data itself [26], its detection [27], and its suppression [28] exist. Once the data have been cleaned and prepared, all of this information is inserted into the weighted least squares (WLS) formulation shown in Equation (2), with the aim of minimizing the weighted sum of squared residuals, hence the name WLS-SE. The weighting is based on the standard deviation of the measurements (or, loosely put, their error), which depends on the device and the quantity measured; separate deviations are assigned to voltage-based, current-based, and power-based measurements. The general flow of WLS-SE is shown in Figure 1, which closely follows [19].
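With the weights taken as the inverse measurement variances, the objective being minimized takes the standard WLS form shown below; the notation is assumed here for illustration, while the paper's Equation (2) remains the authoritative statement:

$$ J(x) = \sum_{i=1}^{m} \frac{\left(z_i - h_i(x)\right)^2}{\sigma_i^2} = \left(z - h(x)\right)^{\mathsf{T}} R^{-1}\left(z - h(x)\right), \qquad R = \mathrm{diag}\left(\sigma_1^2, \ldots, \sigma_m^2\right) $$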
One of the more important relationships to be established is that between the measurements and the different system states, i.e., the relationship of z with x, which is expressed by h. At this point, it is important to be clear about which parameters constitute the system state and which parameters are measured. Based on traditional power system analysis methods, knowledge of system-wide bus voltage magnitudes and angles is sufficient to fully establish current flows and phase differences. Thus, the system state, x, becomes a 2n x 1 vector consisting of voltage magnitude and angle quantities for all n buses; the measurements largely become a combination of either a subset of the bus voltage vector or a subset of the known real and reactive power injections. With that in mind, and depending on the measurement data used, h can take its forms from traditional power flow analysis, with power measurements derived from the system-wide power balance equation shown in Equation (3) and related to the system state via nodal analysis as shown in Equation (4), as presented in textbooks such as [29].
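For concreteness, the real and reactive power injections that Equations (3) and (4) relate to the bus voltages are conventionally written as follows (notation assumed here; the terms G_ik + jB_ik are entries of the bus admittance matrix and the angle differences are taken between buses i and k):

$$ P_i = \sum_{k=1}^{n} |V_i||V_k|\left(G_{ik}\cos\theta_{ik} + B_{ik}\sin\theta_{ik}\right), \qquad Q_i = \sum_{k=1}^{n} |V_i||V_k|\left(G_{ik}\sin\theta_{ik} - B_{ik}\cos\theta_{ik}\right) $$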
The solved state estimates will only be as precise as the model and data used, which is why papers that deal with parameter estimation also exist [30]. Other ways to improve the state estimate include transforming the measurements into more indicative quantities such as branch current measurements [20]. The details of these steps have been neatly laid out by [21], from network topology modeling down to bad data processing. While this paper focuses on static state estimation, other forms also exist, such as dynamic state estimation, where [31] uses the Extended Kalman Filter to effectively consider unknown inputs of synchronous machines (generators). This requires complete modeling of all components of the power system, including uncertain components such as RE sources. The authors of [11], one of whom also published the state estimation textbook [22], acknowledge that machine learning/deep learning-based state estimation is another direction that state estimation research can take, owing to its ability to provide estimates without the need for power system models and without being vulnerable to assumptions that do not hold for realistic cases. As [11] also mentions, deep learning-based state estimators can also be developed for dynamic state and parameter estimation. As will be mentioned in Section 3, the estimate is only valid for the situations the network has been trained on. This paper assumes a stance of static state estimation and trains only on steady-state measurements; therefore, should a fault or a slow-acting stability issue occur, the current trained model's estimates will not be valid. However, should the proposed method be trained for dynamic state estimation, where generator rotor angles are also estimated, then these issues should be addressable as well.
3. Neural Network-Based State Estimation
The buzzwords big data and machine learning have been making the rounds in the academic community for years. These fields of study go hand in hand in a multitude of applications ranging from basic image classifiers, handwriting recognition, object detection, and simple media recommendations to self-driving cars, as popularly posited by [32]. Indeed, machine learning can be used to address various problems that can be solved by iteratively minimizing certain parameters of a given model. State estimation is clearly one such problem, as seen in its formulation presented in Section 2. One of the more successful approaches available is the neural network, which has a plethora of parameters to tune and a good number of connected neurons to hold and remember certain information. A simple neural network is shown in Figure 2.
3.1. Neural Network Theory
On the surface, the output of a neural network can be treated as a very long linear equation, as shown by Equation (5) for an R-layer, m-input, n-output fully connected neural network as shown in [33,34,35], where the jth neuron in the rth layer is the sum of the weighted and connected neurons i of the (r - 1)th layer and the rth layer's bias vector, all of which are subject to the activation function, h. An important thing to note about the activation function, h, is that it squeezes the numbers into digestible ranges, with different behaviors depending on the choice. Usual choices of activation functions include the sigmoid, the hyperbolic tangent, and the rectified linear unit. In terms of dimensions, the 2nd layer's neurons take from the weighted m x 1 input vector, and the Rth layer gives the n x 1 output vector. Equation (6) shows this relationship in matrix form, with the weight matrix dimensions set by the number of neurons of the rth and (r - 1)th layers and H representing the activation function applied to each row. This is the equation generally followed during testing and actual implementation, called the feed-forward path.
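Spelled out, the per-neuron and matrix forms just described correspond to the standard fully connected relations (notation assumed here; Equations (5) and (6) remain the authoritative statements):

$$ a_j^{(r)} = h\left(\sum_{i} w_{ji}^{(r)} a_i^{(r-1)} + b_j^{(r)}\right), \qquad \mathbf{a}^{(r)} = H\left(W^{(r)}\mathbf{a}^{(r-1)} + \mathbf{b}^{(r)}\right) $$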
During the process of training, however, there is an additional step, called back-propagation, that reliably tunes the weights to their proper values. Intuitively, this step takes the gradient of the error with respect to the individual weights, as shown in [33,34,35], and uses it to update the weight values in order to minimize the error. The loss function, which is the "modeled difference" between the predicted output and the ground truth, is at the forefront of this step. One of the most common loss functions is the mean square error (MSE), which is shown in Equation (7).
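In its usual form, and for n outputs with predictions and targets denoted here by assumed notation, the MSE referenced by Equation (7) is:

$$ C = \frac{1}{n}\sum_{k=1}^{n}\left(y_k - \hat{y}_k\right)^2 $$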
The challenge is finding the gradient of the cost function with respect to the weights. The succeeding equations show the derivation needed to achieve the desired form. To obtain the gradient, recall from Equation (7) that the cost C depends on the predicted output and the ground truth, and, by the chain rule, the predicted output depends on the sum of the weighted neuron values seen in Equation (5) (when r is the last/output layer, the neuron values are the predicted outputs). Performing the simplified differentiation and applying the chain rule, we obtain Equation (8).
Now, Equation (9) recognizes that the weighted neuron values of the previous layer depend, in turn, on their own sums of weighted neuron values. Therefore, the gradient propagates back up to the first layer, as implied by Equation (10).
Thus, we arrive at the final point, Equation (11), where the change to the weight value is a decrement by a fraction of the gradient dictated by the learning rate, as shown in [33,34,35]. The differentiation with specific activation functions included is left to the reader for personal exploration. Here, the activation function is simply a pass-through, that is, h(x) = x.
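Collecting the chain-rule terms, the resulting update applied to each weight is the familiar gradient-descent step, written here with assumed notation and the learning rate denoted by eta:

$$ w_{ji}^{(r)} \leftarrow w_{ji}^{(r)} - \eta\,\frac{\partial C}{\partial w_{ji}^{(r)}} $$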
As mentioned before, this paper proposes the use of machine learning, specifically model-optimized neural networks, to solve the state estimation problem. Using neural networks eliminates the need to formulate and compute the measurement function, h(x), the gain matrix, G, which in turn stems from the measurement Jacobian, H, and the covariance matrix, R. All of the information contained within those quantities can be assumed to be summarily modeled within the neural network's weight values, thus making neural networks a worthy candidate for a state estimator. In particular, the model grows into a direct mapping from measurements to state estimates, which is mathematically shown in Equation (12).
3.2. Implementation to State Estimation
This paper takes off from a previous work shown in [36], where the standard MSE loss function was replaced with the WLS loss function traditionally used in power system state estimation, to help the neural network model learn which of the measurements to trust more. Coming off of Equation (2), the weight, W, is used with predetermined weights that correspond to the measurement devices known to the operator. The resulting equation is shown in Equation (14).
Note that, in this paper's notation, the system states are the neural network's target prediction, and the system measurements are the neural network's inputs. The paradigm this paper takes is inferring the system state given the known system states some time steps before, as shown in Figure 3, i.e., approximating the inverse of the measurement function and learning to consider or reject the acquisition error. Furthermore, reference [36] used a special kind of neural network structure to improve the accuracy of estimation: the long short-term memory (LSTM) neural network, an improved variation of the recurrent neural network (RNN). It works better for time series data by taking into account the measurements observed a few time steps prior and introduces new weights that correspond to remembering, forgetting, and passing certain information from the input data. These weights are what set it apart from standard fully connected feed-forward or multilayer perceptron (MLP) networks.
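To illustrate how such a loss can be wired into training, the sketch below shows one way a per-output weighting could enter a Keras loss function. It is a minimal sketch only: the vector `wls_weights`, holding hypothetical per-quantity 1/sigma^2 terms, and the output dimension are assumptions, not the paper's Equation (14).

```python
import numpy as np
import tensorflow as tf

# Hypothetical per-output weights (1 / sigma^2), one entry per predicted quantity.
N_OUTPUTS = 64
wls_weights = tf.constant(np.ones(N_OUTPUTS, dtype=np.float32))

def wls_loss(y_true, y_pred):
    """WLS-style loss: each squared residual is scaled by its measurement weight
    before averaging, so residuals from trusted measurements dominate training."""
    residual = y_true - y_pred
    return tf.reduce_mean(wls_weights * tf.square(residual), axis=-1)

# Usage (sketch): model.compile(optimizer="adam", loss=wls_loss)
```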
For this paper, the LSTM networks are sequentially stacked with three hidden LSTM layers and a fully connected output layer, as shown in Figure 4. The network structure may vary depending on the scale of the application. The activation function for the layers is the default linear (or pass-through) activation. This is possible because the measurements are already acquired using the per-unit system; thus, the numbers are already within manageable ranges, and typical machine learning problems such as vanishing and exploding gradients are prevented. Furthermore, the learning rate is set to a reasonably small value of 0.001. The loss functions, both MSE and WLS, are minimized using the adaptive moments (Adam) optimization algorithm.
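A minimal Keras sketch of this architecture is shown below. The layer widths, sequence length, and input/output dimensions are chosen purely for illustration (the text does not list them here), so they are placeholders rather than the paper's actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

TIME_STEPS = 4    # hypothetical number of historical time steps fed to the network
N_MEAS = 32       # hypothetical number of measurement inputs per time step
N_STATES = 68     # hypothetical number of estimated states (e.g., 2n for an n-bus system)

model = models.Sequential([
    # Three stacked LSTM layers with linear (pass-through) activation,
    # matching the structure described in the text.
    layers.LSTM(64, activation="linear", return_sequences=True,
                input_shape=(TIME_STEPS, N_MEAS)),
    layers.LSTM(64, activation="linear", return_sequences=True),
    layers.LSTM(64, activation="linear"),
    # Fully connected output layer producing the state estimates.
    layers.Dense(N_STATES),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")  # or the WLS-style loss sketched above
```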
3.3. Neural Network Model Optimization
The key proposal of this paper is the use of model optimization techniques to reduce the model's size. Proposed as early as 1990 by [33], the concept of neural network pruning is to remove weights that are not responsive to back-propagation, as [33] mentions. As discussed in Section 3, back-propagation is in charge of updating the weight values given the gradient of the error with respect to the weights. If the responsiveness of a weight to back-propagation, or in this paper's terms, its gradient, is negligible, then it is plausible to conclude that the weight (neural connection) is not needed. These days, not only the gradient is evaluated but the value itself is also checked; that is, if a weight is zero (or hovers around zero), that connection is also removed. The work by [37] has only recently proposed a standard way to benchmark the gains received from pruning. Popular machine learning frameworks such as TensorFlow and PyTorch also released their pruning libraries only recently, at nearly the same time as [37] was published.
There are a few general axes along which pruning methods differ, as [37] points out: Structure, Scoring, Scheduling, and Fine-tuning. Each of these choices can exploit advantages but carries its own intrinsic disadvantages too. An example would be networks whose structure has been pruned, which largely means removing weight connections as shown in Figure 5. Since the pruned parameters are exactly zero, deploying these models on small-scale devices such as micro-controllers and remote terminal units is largely plausible due to the advantage they obtain when their model files are compressed. Compression algorithms work better when there are true zeros in the data, more so if the file is sparse in general. However, speed-up features such as those in CUDA-enabled hardware may or may not be available to these models, as the APIs themselves may or may not be tailored to handle random sparsity. Scheduling, on the other hand, determines whether the network is pruned bit by bit at every step of the training process or immediately to the target sparsity. An example of a pruning schedule is shown in Figure 6. It is also possible to be creative with the training process and start the network with all the weights active to help the network reach a near-optimum point before pruning. Once there, pruning can proceed as scheduled to save space and to focus on updating the responsive weights, which leads to this paper's main proposal: the use of scheduled weight pruning to optimize the neural network-based power system state estimator.
The pruning schedule used is polynomial decay, which starts from 0% sparsity at training step 0 and ends at the target sparsity (25%, 50%, or 75%). The rate at which the weights are pruned follows the default power setting, which is a cubic ramp-up function as stated in the TensorFlow Model Optimization Toolkit documentation. It is important to remember that the TensorFlow way of weight pruning does not immediately cull the weights; rather, a masking layer is created alongside the neural network. The mask is Boolean: 1 for an active weight and 0 for an inactive (or pruned) weight. Thus, it is important at the end of the process to strip the mask using the strip pruning function so that, when the model is saved, the active and inactive weights are committed.
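Assuming the TensorFlow Model Optimization Toolkit is the library in use, as the text suggests, a minimal sketch of this schedule-then-strip workflow could look like the following. The training data, epoch count, and step budget are placeholders, and `model` refers to the LSTM sketch given earlier.

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder training data matching the earlier sketch's assumed dimensions.
x_train = np.random.rand(256, TIME_STEPS, N_MEAS).astype("float32")
y_train = np.random.rand(256, N_STATES).astype("float32")
end_step = 10_000  # total number of training steps over which sparsity ramps up

# Polynomial decay from 0% sparsity at step 0 up to the target sparsity,
# using the library's default cubic ramp (power = 3).
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.50,   # 0.25 / 0.50 / 0.75 depending on the experiment
    begin_step=0,
    end_step=end_step,
)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
pruned_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

# The UpdatePruningStep callback advances the pruning masks during training.
pruned_model.fit(x_train, y_train, epochs=20,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers/masks so the saved file keeps only the final weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
final_model.save("pruned_lstm_se.h5")
```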
5. Results and Discussion
Shown in Table 1 and Table 2 are the errors in the state estimates per state estimator and subgrid. As can be observed, the LSTM networks performed better than the conventional WLS-SE, largely due to the learning capability of neural networks and to the relative simplicity of the network and data. Another important note is that the error metrics increase steadily as the target sparsity increases, all while the estimation time decreases. This is largely because there are fewer weights to hold information and fewer weights to actually multiply. It is also important to note that the speed-up for each method may not be accurately represented because it was only tested on one PC equipped with an Intel Core i5, 8 GB of RAM, and a hard disk drive. However, it can be inferred that with better equipment, such as a GPU with support for Nvidia's CUDA cores, which are optimized for neural network operations, the estimation time will decrease. This effect will be more pronounced when the methodology is applied to large, practical system configurations, that is, networks with hundreds or thousands of buses, tens or hundreds of measuring devices, and, of course, millions of data entries.
Shown in Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17 are the graphical representations of the state estimation errors against the true system states. Here we can see individually how the methods perform and at which points they could possibly fail. One trend that might be inferred is that non-pruned networks tend to under-predict, but this is not actually the case. The under- and over-prediction of the model is highly dependent on the sequence of data it has seen during training. Recall that the data are shuffled at every training step in a session and that the weights that end up pruned are nearly random, being at the mercy of the scheduler and the gradients. However, this does not mean that there is absolutely no definite and concrete answer. One possible reason lies in the modeled (or anticipated) errors in the measurement units, which is why the model tends to under- (or over-) predict. What is certain, however, is that a neural network's aim is always to generalize the dataset; thus, the testing methodology exhausts the dataset. Therefore, by consequence, the error metrics are good enough approximations of their performances. The file sizes of the neural networks after compression are 837 kB for the normal and modified LSTM, 380 kB for the 25% pruned LSTM, 375 kB for the 50% pruned LSTM, and 370 kB for the 75% pruned LSTM. The reasons behind this compression behavior are outside the scope of this paper, but, loosely put, they come down to the compression algorithms and their fundamental limits and tradeoffs, i.e., processing time, memory, chunk size, etc. [41].
Finally, one of the telling signs in the graphs is the profile of the system state. The x-axis represents the bus numbers of the system. Here we can see that there are two large dips in the voltage and angle profiles, which occur at the interfaces with an under-load tap changer (ULTC) and with a three-phase transformer, respectively. Recall that the true system state is acquired from the PSCAD simulations; thus, we can be confident in the models, that is, the behavior is consistent with the documentation for the IEEE 34-bus test system. One example of this consistency is that, in between the big voltage drops, there are long line segments which introduce relatively large impedances. The bulk of the spot loads are located at the far ends of the network; thus, the required current flow forces the voltage difference. It is also important to note that the sharp dip in the angle profile is not as severe as it appears; mind the scale of the y-axis (0 to -4).