3.1. Construction Method
The PCMD is divided into three steps: data collection, failure diagnosis, and preventive control policy fitting. In the first step, the PCMD collects data including node voltages, the frequency, node currents, and the robustness against CCF, and it also considers the control cost reduction as a feedback variable. In the second step, it detects failure symptoms and predicts a CCF propagation sequence from the occurrence of an initial failure. In the third step, it uses a deep neural network to maintain a parameterized PCP l(s,wl) that pursues the maximum of the infinite horizon discounted expected cumulative reward. Based on the model defined in Definition 1, the infinite horizon discounted expected cumulative reward from an initial state g0 is E[∑_{i=0}^{∞} γ^i r_i], where E is the expectation over different trajectories, r_i is the immediate reward of the i-th transition, Fi is a steady state transition function in a trajectory, and γ is the discount factor. In addition, the PCMD also uses deep neural networks to maintain a parameterized target PCP l’(s,wl’) and parameterized action-value functions (Q(s,l(s,wl),wQ) and Q’(s,l’(s,wl’),wQ’)) for constructing l(s,wl) steadily, where s∈S is a CCF propagation sequence, si is the CCF propagation sequence in a steady state gi, and wl, wl’, wQ, and wQ’ are the parameters of the fitted functions.
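For illustration, a minimal sketch of the four parameterized functions maintained by the PCMD is given below; it assumes a PyTorch implementation, and the module names (ActorPCP, CriticQ), layer sizes, and state/action dimensions are illustrative assumptions rather than part of the PCMD.

```python
# Minimal sketch (assumed PyTorch implementation) of the four parameterized
# functions of the PCMD: the PCP l(s, wl), the target PCP l'(s, wl'), and the
# action-value functions Q(s, l(s, wl), wQ) and Q'(s, l'(s, wl'), wQ').
# Network sizes and dimensions are illustrative assumptions.
import copy
import torch
import torch.nn as nn

class ActorPCP(nn.Module):                      # l(s, wl): maps a CCF sequence s to a control action
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh())   # bounded control action (per-unit values)
    def forward(self, s):
        return self.net(s)

class CriticQ(nn.Module):                       # Q(s, a, wQ): evaluates a state-action pair
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = ActorPCP(32, 4), CriticQ(32, 4)
actor_target = copy.deepcopy(actor)             # l'(s, wl'), initialized with wl' = wl
critic_target = copy.deepcopy(critic)           # Q'(s, l'(s, wl'), wQ'), initialized with wQ' = wQ
```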
According to Definition 1, many possible expected steady states could be reached from the current steady state when a PCP is applied to block a CCF propagation sequence. The goal is to construct a PCP that reaches an optimal steady state from the current steady state with minimum cost. These requirements on the expected steady state are described by an objective function, as shown in Formula (1).
where VP1 denotes the set of power generation nodes (distributed generators), VP5 denotes the set of external nodes (the energy from power transmission lines), WGi denotes the weight of the control cost objective, WRP denotes the weight of the robustness (against CCF) objective, ΔPGi is defined in Definition 1, and RPF (a discrete quantity) denotes the proportion of failure nodes during the propagation of CCF in an ADN.
Formula (1) describes the control cost objective and the robustness (against CCF) objective. Instead of using the Pareto idea, the commonly used weighting method [19] is used to integrate multiple objectives into a single objective function.
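As a concrete reading of this weighted-sum formulation (Formula (1) itself is not reproduced here), the objective can be sketched as follows; the use of the absolute value of ΔPGi in the cost term is an assumption.

```latex
% Hedged sketch of a weighted-sum objective consistent with the description of
% Formula (1); the exact form used in the paper may differ.
\min \;\; \sum_{n_{Gi}\,\in\, V_{P1}\cup V_{P5}} W_{Gi}\,\bigl|\Delta P_{Gi}\bigr| \;+\; W_{RP}\, R_{PF}
```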
During the process of blocking a CCF propagation sequence, the safety requirements and the basic physical constraints should be obeyed. Formula (2) describes inequality constraints (including continuous quantities and discrete quantities) corresponding to safety requirements. Formula (3) describes the equality constraint of power flow equations in the PN.
where PGi and QGi denote the active power and the reactive power of a power generator node nGi, respectively; PLi and QLi denote the active power and the reactive power of a load node nLi, respectively; vi denotes the voltage of a node ni in the PN; IPi denotes the current of a node ni in the PN; and fref denotes the frequency of an ADN. PGi, QGi, PLi, QLi, vi, IPi, and fref are all defined in Definition 1. To be safe, it is necessary to limit the value ranges of these variables. Correspondingly, PGi_min and PGi_max denote the lower and upper active power bounds of a power generator node nGi, and QGi_min and QGi_max denote the lower and upper reactive power bounds of a power generator node nGi. Similarly, ΔPGi_min and ΔPGi_max denote the lower and upper control action bounds; PLi_min and PLi_max denote the lower and upper active power bounds of a load node nLi; QLi_min and QLi_max denote the lower and upper reactive power bounds of a load node nLi; vi_min and vi_max denote the lower and upper voltage bounds of a node ni in the PN; IPi_max denotes the upper current bound of a node ni in the PN; and frefmin and frefmax denote the frequency limits of an ADN. NONP is the number of nodes in the PN, ζi denotes the voltage phase angle of a node ni in the PN, and Yij and αij denote the admittance and the admittance phase angle between a node ni and a node nj, respectively, in the PN.
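For reference, one common polar form of the nodal power flow equations, written with the symbols defined above, is sketched below; Formula (3) may use a different but equivalent sign convention.

```latex
% Illustrative polar form of the power flow equality constraints referenced by
% Formula (3); the sign convention in the paper may differ.
P_i = v_i \sum_{j=1}^{N_{ONP}} v_j\, Y_{ij}\, \cos\!\bigl(\zeta_i - \zeta_j - \alpha_{ij}\bigr),
\qquad
Q_i = v_i \sum_{j=1}^{N_{ONP}} v_j\, Y_{ij}\, \sin\!\bigl(\zeta_i - \zeta_j - \alpha_{ij}\bigr),
\qquad i = 1,\dots,N_{ONP}
```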
In order to get the fitted PCP, the objective function and inequality constraints shown in Formulas (1) and (2) are integrated into the infinite horizon discounted expected cumulative reward, and then the immediate reward r of a transition should be given. The immediate reward r is composed of a reward r1 reflecting voltage stability, a reward r2 for preventing the CCF caused by current overload, a reward r3 reflecting frequency stability, a reward r4 reflecting the control cost, and a reward r5 reflecting robustness against CCF. The reward r1 is shown in Formula (4).
where r1 is the weighted sum of rni_1, and the definition of rni_1 is shown in Formula (5). wni is the weight (proportional to the importance of a node) of a node ni, and VP is the set of nodes in the PN. r1_ni_1 (greater than zero) and r1_ni_2 (less than zero), respectively, denote the specific immediate reward values when the voltage of the node ni is in different ranges, ‖●‖ denotes the norm, vi_normal is the reference voltage value of a node ni, and vni = min{‖vi_normal − vi_min‖, ‖vi_max − vi_normal‖} is the voltage threshold of a node ni.
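To make the piecewise structure concrete, a minimal sketch of the per-node voltage reward is given below; since Formula (5) is not reproduced here, the threshold test (voltage deviation within vni versus outside it) is an assumption, and the rewards r2, r3, and r4 follow the same pattern with the current, frequency, and control-action deviations.

```python
# Hedged sketch of the per-node voltage reward described around Formulas (4)-(5).
# The exact piecewise form is an assumption: a positive reward inside the
# threshold v_ni and a negative reward outside it.
def r_ni_1(v_i, v_i_normal, v_ni, r1_ni_1=1.0, r1_ni_2=-1.0):
    """Per-node voltage reward: r1_ni_1 (> 0) inside the threshold, r1_ni_2 (< 0) outside."""
    return r1_ni_1 if abs(v_i - v_i_normal) <= v_ni else r1_ni_2

def r1(voltages, v_normals, v_thresholds, weights):
    """Weighted sum of per-node voltage rewards over the nodes in VP."""
    return sum(w * r_ni_1(v, vn, vt)
               for v, vn, vt, w in zip(voltages, v_normals, v_thresholds, weights))
```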
The reward r2 is shown in Formula (6), where r2 is the weighted sum of rni_2, and the definition of rni_2 is shown in Formula (7). EP is the set of edges in the PN, r2_ni_1 (greater than zero) and r2_ni_2 (less than zero), respectively, represent the specific reward values when the current of the node ni is in different ranges, Ii_normal is the reference current value of a node ni in an ADN, and Ini = IPi_max is the current threshold value of the node ni.
The reward r3 is shown in Formula (8), where r3_fref_1 (greater than zero) and r3_fref_2 (less than zero), respectively, represent the specific reward values when the frequency is in different ranges, frefnormal is the reference frequency value, and cfref = min{‖frefmin − frefnormal‖, ‖frefmax − frefnormal‖} is the frequency threshold value of an ADN.
The reward r4 is shown in Formula (9), where the control action is ΔPGi (defined in Definition 1), r4_c_1 (greater than zero) and r4_c_2 (less than zero), respectively, denote the specific reward values when the control action is in different ranges, and cPc is the control action threshold value.
The reward r5 is the sum of r5_1 and r5_2, as shown in Formula (10), where r5_1 denotes the number of failure nodes at the current time t, r5_2 denotes the cumulative number of failure nodes from the start time 0 to the current time t, rnf1 and rnf2 denote the weights of r5_1 and r5_2, respectively (both greater than zero), and nfailure(i*Δt) (i∈{0,1,2,…,t/Δt}) denotes the number of failure nodes in an ADN at discrete time i*Δt.
Larger values of the rewards r1, r2, r3, and r4 are preferred, while a smaller value of the reward r5 is preferred. Thus, the total immediate reward r is shown in Formula (11).
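A minimal sketch of how the rewards can be combined is given below; Formulas (10) and (11) are not reproduced here, so the weighting of r5_1 and r5_2 by rnf1 and rnf2 follows the description above, and the negative sign on r5 in the total reward is an assumption reflecting that a smaller r5 is better.

```python
# Hedged sketch of the robustness reward r5 (Formula (10)) and the total
# immediate reward r (Formula (11)); the sign convention on r5 is an assumption.
def r5(n_failure_history, rnf1, rnf2):
    """r5 = rnf1*r5_1 + rnf2*r5_2: current failure-node count plus cumulative count."""
    r5_1 = n_failure_history[-1]          # failure nodes at the current time t
    r5_2 = sum(n_failure_history)         # cumulative failure nodes from time 0 to t
    return rnf1 * r5_1 + rnf2 * r5_2

def total_reward(r1, r2, r3, r4, r5_value):
    """Total immediate reward: r5 enters negatively because smaller r5 is better."""
    return r1 + r2 + r3 + r4 - r5_value
```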
The convergence and optimization conditions of the construction process in the PCMD need to be guaranteed, and they are considered in the following sections.
3.2. Convergence
In the PCMD, it is necessary to ensure that the construction process converges and that the PCP against CCF generated from the construction process exists. These requirements are guaranteed by a two-step parameter adjustment. In the first step, the convergence of the construction process can be guaranteed by adjusting the parameters of the fitted functions (l(s,wl), l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)). In the second step, the existence of the PCP against CCF can be guaranteed by additionally adjusting the parameters in the reward r.
The parameters of the fitted functions that need to be adjusted include the learning rate δ, the initial weights of the fitted functions (l(s,wl), l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)), the parameter α, and other parameters of the deep-neural-network-based fitted functions.
Theorem 1. Given the equation w = φ(w), if its derivative φ’(w) is continuous on a closed interval [w1, w2], φ(w)∈[w1, w2] whenever w∈[w1, w2], and there exists a constant 0 < z < 1 such that ‖φ’(w)‖ ≤ z < 1 whenever w∈[w1, w2], then the following results hold.
- (1)
φ(w) has a unique fixed point w* on the closed interval [w1, w2].
- (2)
For any initial value w0∈[w1, w2], the iterative scheme wk+1 = φ(wk) (k = 0, 1, 2, …) converges, and limk→∞ wk = w*.
- (3)
For the sequence {wk}, there exists an asymptotic error estimate shown in Formula (12).
The process of fitting l⊥(s) is actually a fixed-point iterative scheme, and the weight wl of the fitted PCP l(s,wl) is obtained by solving the equation wl = φ(wl). The iterative function φ is defined as shown in Equation (13), where N5 is the size of a trajectory data sample from a data pool and si is a CCF sequence in a steady state gi.
If the adjustments of the learning rate δ and the initial weight of the fitted function l(s,wl) meet the preconditions of Theorem 1, then the weight sequence of l(s,wl) converges to the solution of the equation wl = φ(wl).
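A toy illustration of the fixed-point iteration underlying Theorem 1 is sketched below; the contraction used here is an arbitrary example and is not the iterative function φ of Equation (13).

```python
# Illustrative fixed-point iteration in the sense of Theorem 1: w_{k+1} = phi(w_k)
# converges when |phi'(w)| <= z < 1 on the interval. The contraction phi below is
# a toy example, not the iterative function of Equation (13).
def fixed_point(phi, w0, tol=1e-10, max_iter=1000):
    w = w0
    for _ in range(max_iter):
        w_next = phi(w)
        if abs(w_next - w) < tol:     # the asymptotic error shrinks geometrically (cf. Formula (12))
            return w_next
        w = w_next
    return w

w_star = fixed_point(lambda w: 0.5 * w + 1.0, w0=0.0)   # converges to the fixed point w* = 2.0
```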
The following analysis then explains how the initial weights of the fitted functions (l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)) are adjusted. For example, due to the weight update of the Q’ at step k (the soft update wQ’ ← αwQ + (1 − α)wQ’ used in Algorithm 1), there is an error equation shown in Equation (14), where wQ’(k) denotes the weight of the Q’(s,l’(s,wl’),wQ’) at step k, wQ(u) denotes the weight of the Q(s,l(s,wl),wQ) at step u with u ≥ k, and wQ⊥ denotes the true value of the weights wQ and wQ’.
An inequation is obtained by taking the norm of the elements in Equation (14), as shown in Formula (15).
The true action-value function Q⊥(s,a) of a steady state gi has the relation shown in Formula (16). Thus, the objective function value yi of the Q(si,l(si,wl),wQ) in a steady state gi is derived from Formula (16), as shown in Formula (17), where si+1 is a CCF sequence of a steady state gi+1.
From Formula (17), the loss function LQ for updating the weight wQ of the Q(s,l(s,wl),wQ) is shown in Equation (18).
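A minimal sketch of the target value and the loss LQ is given below, assuming a PyTorch implementation and that LQ is the usual mean squared error between the target yi and Q(si,l(si,wl),wQ); the target expression matches line 17 of Algorithm 1 in Section 3.4.

```python
# Sketch of the target value of Formula (17) and the loss LQ of Equation (18),
# assuming LQ is the mean squared error between the target y_i and
# Q(s_i, l(s_i, wl), wQ); terminal-state handling is omitted for brevity.
import torch
import torch.nn.functional as F

def critic_loss(critic, critic_target, actor_target, s, a, r, s_next, gamma):
    with torch.no_grad():
        # y_i = r_i + gamma * Q'(s_{i+1}, l'(s_{i+1}, wl'), wQ')
        y = r + gamma * critic_target(s_next, actor_target(s_next))
    return F.mse_loss(critic(s, a), y)      # LQ, minimized with respect to wQ
```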
According to Formulas (17) and (18), the weights of the Q(s,l(s,wl),wQ) converge to the weights of the Q’(s,l’(s,wl’),wQ’), and when u >> k, wQ(u) and wQ’(k) become arbitrarily close. From Formula (12) and Formula (15), the error between wQ’(k) and the true value wQ⊥ then tends to zero as k tends to infinity. So, the weight sequence of the Q’(s,l’(s,wl’),wQ’) converges to the true value wQ⊥. Likewise, the weight sequence of the Q(s,l(s,wl),wQ) converges to the true value wQ⊥ as u tends to infinity.
An inequation is obtained by taking the norm of the elements in Equation (14) and repeating the substitutions, as shown in Formula (19), where wQ(0) denotes the initial weight of the Q(s,l(s,wl),wQ), and wQ’(0) denotes the initial weight of the Q’(s,l’(s,wl’),wQ’).
According to Formula (19), the weight of the fitted function Q’(s,l’(s,wl’),wQ’) tends to the true value wQ⊥ as u and k both tend to infinity. Because of the parameters α < 1 and ηQ ≤ 1, the initial weight wQ’(0) of the Q’(s,l’(s,wl’),wQ’) can be selected randomly. The initial weight wQ(0) of the fitted function Q(s,l(s,wl),wQ) can also be selected randomly due to Formulas (16) and (17). By a similar analysis, the weight sequence of the l’(s,wl’) converges to the true value, and the initial weight wl’(0) of the l’(s,wl’) can also be selected randomly. As for the adjustment of the parameter α, its value (α < 1) can be set relatively high in order to speed up the convergence of the fitted functions Q’(s,l’(s,wl’),wQ’) and l’(s,wl’). However, because of ηQ ≤ 1 and ηQ’ ≤ 1, the value of α should not be set too high, in order to preserve the computational stability of the fitted function Q(s,l(s,wl),wQ).
In addition, in order to prevent the outputs of the fitted functions (l(s,wl), l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)) from being saturated, the data should be standardized (per-unit values are used in this paper), the number of neural network layers should be appropriately reduced, batch normalization layers should be added, and the activation function layers should be placed after them. After applying the above parameter adjustments, the convergence of the construction process can be guaranteed, and a PCP can be generated.
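As an illustration of this layer ordering, a minimal PyTorch sketch of a fitted PCP network is given below; the layer sizes are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch layout) of the ordering described above:
# batch normalization layers are inserted before the activation function layers
# to keep the outputs of the fitted functions away from saturation.
import torch.nn as nn

actor_pcp = nn.Sequential(
    nn.Linear(32, 64),
    nn.BatchNorm1d(64),   # normalize the pre-activations
    nn.ReLU(),            # activation placed after batch normalization
    nn.Linear(64, 4),
    nn.Tanh())            # bounded control action output
```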
In order to guarantee the existence of the PCP against CCF in an ADN, the parameters in the reward r should be adjusted in addition to the parameter adjustments in the fitted functions. The idea of the parameter adjustment in the reward r is to reduce the propagation width and depth of CCF. The basic idea of the parameter adjustments based on reducing the propagation width of CCF in an ADN is to ensure that the number of failure nodes at any one time is as small as possible, while the requirements of voltage stability, frequency stability, current overload prevention, and control cost reduction are met. Specifically, the parameters vni, Ini, r1_ni_2, r2_ni_2, cfref, r3_fref_2, and r4_c_2 should be set as low as possible, and the values of the parameters r1_ni_1, r2_ni_1, r3_fref_1, r4_c_1, and rnf1 should be increased. Similarly, the basic idea of the parameter adjustments based on reducing the propagation depth of CCF in an ADN is to keep the number of successive failure steps of nodes as small as possible while meeting the requirements of voltage stability, frequency stability, current overload prevention, and control cost reduction. Specifically, the values of the parameters vni, Ini, r1_ni_2, r2_ni_2, cfref, r3_fref_2, and r4_c_2 should be set as low as possible, and r1_ni_1, r2_ni_1, r3_fref_1, r4_c_1, and rnf2 should be set as high as possible.
So, after the two-step parameter adjustments, the PCMD converges, and the PCP against CCF in an ADN generated from the construction process exists.
3.3. Satisfiability of Suboptimal Solution
If the PCMD has been guaranteed to be convergent, it is further necessary to ensure that the process converges to an optimal solution. The following analysis shows that if the compatibility condition is satisfied, the PCP against CCF in an ADN generated by the PCMD converges to a local optimal solution.
The performance function for evaluating the PCP against CCF generated in the PCMD is defined as the expected fitted function Q(s,l(s,wl),wQ), which is shown in Equation (20).
The gradient theorem of the deterministic control policy and the condition of compatibility between the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) against CCF are given, respectively. The gradient theorem of the deterministic control policy shown in Theorem 2 ensures the existence of the gradient of the deterministic control policy. In addition, once the condition of the compatibility between the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) against CCF is satisfied, shown in Theorem 3, a local optimum PCP l(s,wl) against CCF can be constructed through the construction process proposed in this paper.
Theorem 2. (Gradient Theorem of Deterministic Control Policy): It is assumed that the environment (the controlled object) satisfies the condition of a Markov decision process (MDP), and that the fitted action-value function Q(s,a) and the fitted PCP l(s,wl) are both continuously differentiable. It is also assumed that the gradient ∇aQ(s,a) of the function Q(s,a) with respect to the control action a and the gradient ∇wl l(s,wl) of the function l(s,wl) with respect to the parameter wl exist. Then, according to Formula (20), the gradient of the performance function J(l(s,wl)) exists, and it is shown in Equation (21). (The proof of Theorem 2 is given in Appendix A).
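For reference, the generic deterministic policy gradient from the DDPG literature, written in the notation of this paper, is sketched below; it indicates what Equation (21) expresses but is not a reproduction of Equation (21).

```latex
% Standard deterministic policy gradient, written in the paper's notation as a
% sketch of what Equation (21) expresses; the exact form in the paper may differ.
\nabla_{w_l} J\big(l(s,w_l)\big)
  = \mathbb{E}_{s}\!\left[\,\nabla_{w_l}\, l(s,w_l)\;
    \nabla_{a} Q(s,a,w_Q)\big|_{a=l(s,w_l)}\right]
```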
In the PCMD, there are two fitted PCPs l(s,wl) and l’(s,wl’), and there are two fitted action-value functions Q(s,l(s,wl)) and Q’(s,l’(s,wl’)). It is assumed that the gradients of Q and Q’ with respect to the control action and the gradients of l and l’ with respect to their parameters exist. Then, according to Theorem 2, the gradient of the performance function for the fitted PCP l(s,wl) is shown in Equation (22). Correspondingly, the gradient of the performance function for the fitted target PCP l’(s,wl’) is shown in Equation (23).
Theorem 3. (Compatibility Theorem of the Fitted Function Q(s,a,wQ) and the Fitted PCP l(s,wl)): In a learning process, the Q(s,a,wQ) is a fitted function of the control action a that approximates the true action-value function Q⊥(s,a). If the action-value function Q⊥(s,a) and the fitted PCP l(s,wl) are both continuously differentiable, their gradients with respect to the control action a and the parameter wl exist, the required second-order gradient also exists, the compatibility condition shown in Equation (24) is satisfied, and the parameter wQ minimizes the function shown in Formula (25), then there is a local optimal PCP, which is noted as l(s,wl) and shown in Equation (26). (The proof of Theorem 3 is given in Appendix B.)
where the transpose in Equation (24) is taken of the gradient of l(s,wl) with respect to the parameter wl.
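For context, the compatibility condition used in the deterministic policy gradient literature, written in the notation of this paper, is sketched below; it indicates the kind of condition Equation (24) expresses but is not a reproduction of Equation (24).

```latex
% Generic compatible-function-approximation condition from the deterministic
% policy gradient literature, written in the paper's notation as a sketch of
% what Equation (24) expresses; the paper's exact form may differ.
\nabla_{a} Q(s,a,w_Q)\big|_{a=l(s,w_l)} \;=\; \nabla_{w_l}\, l(s,w_l)^{\mathsf T}\, w_Q
```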
According to Theorem 1, if the parameter adjustment conditions are satisfied, then the construction process for a PCP against CCF converges. According to Theorem 2, the existence of the gradient of the performance function J(l(s,wl)) is guaranteed. According to Theorem 3, if the condition of compatibility between the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) is satisfied, then the alternative gradient formed by the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) is equal to the gradient of the performance function J(l(s,wl)), and a local optimum PCP l(s,wl) can be constructed through the PCMD.
3.4. Algorithm Implementation
Reinforcement learning algorithms based on the stochastic control policy gradient can deal with problems with a discrete or continuous control action space and a discrete or continuous state space. However, because the stochastic control policy assigns a probability to each value in the control action space, it is computation intensive and could be practically infeasible in large-scale networks. Hence, the deterministic control policy is introduced. Deep deterministic policy gradient (DDPG) is a deep deterministic control policy reinforcement learning algorithm [23]. It differs from the stochastic control policy gradient method mentioned above: it integrates deep neural networks and the deterministic control policy [24], and it updates the parameters of the deep neural networks by using the gradient of the deterministic control policy. DDPG uses the deterministic control policy gradient to deal with the continuous action space, and it is also applicable to both the continuous state space and the discrete state space.
The DDPG idea is used to implement the construction process for the PCP against CCF in an ADN through offline interactive learning, as shown in Algorithm 1. Algorithm 1 adopts a deep neural network to fit a PCP l(s,wl), and the weight wl of l(s,wl) is learned by the PCMD algorithm.
Algorithm 1 PCMD Algorithm
Input: the state gt*Δt and the immediate reward rt*Δt at discrete time t*Δt.
//Initialization
1 wQ←rand, wl←rand, RB←N1, wQ’←wQ, wl’←wl;
//Training Process
2 for episode = 1 to N2 do
3   N3←randN;
4   Receive the state g1;
5   Detect failure symptoms;
6   Predict a CCF propagation sequence s1*Δt from the occurrence of an initial failure;
7   for t = 1 to N4 do
8     at*Δt←l(st*Δt,wl) + N3;
9     Perform the action at*Δt and observe the immediate reward rt*Δt and the next state g(t+1)*Δt;
10    Detect failure symptoms;
11    Predict a CCF propagation sequence s(t+1)*Δt from the occurrence of an initial failure;
12    Put the data (st*Δt, at*Δt, rt*Δt, s(t+1)*Δt) into the replay buffer RB;
13    Take randomly a mini-batch of N5 samples (si*Δt, ai*Δt, ri*Δt, s(i+1)*Δt) from the replay buffer;
14    if gi*Δt is the final state do
15      yi*Δt = ri*Δt;
16    else
17      yi*Δt = ri*Δt + γQ’(s(i+1)*Δt,l’(s(i+1)*Δt,wl’),wQ’);
18    Update wQ by one gradient descent step with learning rate δ on the loss LQ of Equation (18);
19    Update wl by one gradient step with learning rate δ along the deterministic policy gradient of Equation (22);
20    wQ’←αwQ + (1-α)wQ’;
21    wl’←αwl + (1-α)wl’;
Output: A preventive control policy l(s).
where rand denotes a random number, RB denotes the replay buffer, and randN denotes the selected random process. N1 (the size of the replay buffer), N2, N4, and N5 denote the parameters in Algorithm 1, and N3 denotes a random process (the exploration noise). α and δ are described in Section 3.2.
The execution order of the PCMD algorithm for an ADN is as follows. For each time step Δt in each episode, the algorithm first detects failure symptoms and predicts a CCF propagation sequence st*Δt from the occurrence of an initial failure. Then, the fitted PCP l(s,wl) selects a control action at*Δt at discrete time t*Δt according to the current CCF propagation sequence st*Δt, the ADN transits from the current state gt*Δt to another state g(t+1)*Δt under the influence of the CCF propagation sequence st*Δt, and the data (st*Δt, at*Δt, rt*Δt, s(t+1)*Δt) are put into a replay buffer. Finally, the weights of the Q(s,l(s,wl),wQ), Q’(s,l’(s,wl’),wQ’), l(s,wl), and l’(s,wl’) are updated.
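To make the update step concrete, the following sketch mirrors lines 13–21 of Algorithm 1, assuming a PyTorch implementation and reusing the hypothetical actor/critic modules and the critic_loss helper from the earlier sketches; the terminal-state case of line 15 and the hyperparameter values are simplifications.

```python
# Sketch of one PCMD/DDPG update step (lines 13-21 of Algorithm 1), assuming the
# PyTorch modules and the critic_loss() helper sketched earlier; terminal-state
# handling and hyperparameter values are illustrative simplifications.
def update_step(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, alpha=0.005):
    s, a, r, s_next = batch                                  # mini-batch of N5 samples (line 13)

    # Lines 14-18: temporal-difference target y_i and critic update (loss LQ).
    critic_opt.zero_grad()
    loss_q = critic_loss(critic, critic_target, actor_target, s, a, r, s_next, gamma)
    loss_q.backward()
    critic_opt.step()

    # Line 19: actor update along the deterministic policy gradient of Equation (22)
    # (implemented as gradient ascent on Q, i.e., descent on -Q).
    actor_opt.zero_grad()
    (-critic(s, actor(s)).mean()).backward()
    actor_opt.step()

    # Lines 20-21: soft update of the target networks with the parameter alpha.
    for p, p_t in zip(critic.parameters(), critic_target.parameters()):
        p_t.data.copy_(alpha * p.data + (1 - alpha) * p_t.data)
    for p, p_t in zip(actor.parameters(), actor_target.parameters()):
        p_t.data.copy_(alpha * p.data + (1 - alpha) * p_t.data)
```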
The replay buffer mechanism used in Algorithm 1 was first proposed in [25]. The replay buffer makes the correlated data independent and lets the noise cancel out, which makes the learning process of Algorithm 1 converge faster. Without the replay buffer, Algorithm 1 could make gradient descent steps in the same direction for a period of time, and, under the same step size, calculating the gradient directly may prevent the learning process from converging. The replay buffer mechanism selects some samples randomly from a memory pool, updates the weight of the fitted Q(s,l(s,wl),wQ) by temporal-difference learning, then uses this weight information to estimate the gradients of the Q(s,l(s,wl),wQ) and the l(s,wl), and then updates the weight of the fitted PCP. The introduction of the random noise in Algorithm 1 ensures the execution of the exploration process.
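A minimal sketch of such a replay buffer is given below; the capacity corresponds to N1 and the batch size to N5 in Algorithm 1, while the implementation details are assumptions.

```python
# Minimal sketch of the replay buffer used in Algorithm 1: transitions are
# stored and sampled uniformly at random, which breaks the temporal correlation
# of consecutive samples. Capacity N1 and batch size N5 follow the notation of
# Algorithm 1; the implementation details are assumptions.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):                 # capacity corresponds to N1
        self.buffer = deque(maxlen=capacity)
    def put(self, s, a, r, s_next):               # line 12 of Algorithm 1
        self.buffer.append((s, a, r, s_next))
    def sample(self, batch_size):                 # line 13: mini-batch of N5 samples
        return random.sample(self.buffer, batch_size)
```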
According to Section 3.2, the convergence of Algorithm 1 can be guaranteed via the parameter adjustments, and a PCP against CCF can be obtained. In addition, the introduction of the parameter α in Algorithm 1 ensures the stability of the weight updates in the l’(s,wl’) and the Q’(s,l’(s,wl’),wQ’) and thus ensures the stability of the whole learning process. According to Section 3.3, once the compatibility condition is satisfied, the PCP against CCF generated in an ADN converges to the local optimal solution. Thus, after interacting with the offline trajectory dataset of an ADN iteratively, a fitted PCP can be learned by Algorithm 1.