Article

Preventive Control Policy Construction in Active Distribution Network of Cyber-Physical System with Reinforcement Learning

School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(1), 229; https://doi.org/10.3390/app11010229
Submission received: 28 November 2020 / Revised: 15 December 2020 / Accepted: 16 December 2020 / Published: 29 December 2020

Abstract

Once an active distribution network of a cyber-physical system is in an alert state, it is vulnerable to cross-domain cascading failures. It is therefore necessary to transfer the state of an active distribution network of a cyber-physical system from the alert state back to a normal state using a preventive control policy against cross-domain cascading failures. In practice, it is difficult to construct and analyze a preventive control policy with theoretical analysis methods or physical experimental methods: the theoretical analysis methods may be inaccurate because of their approximated models, and the physical experimental methods are expensive and time consuming because prototypes must be built. This paper presents a preventive control policy construction method based on the deep deterministic policy gradient idea (abbreviated as PCMD) to generate and optimize a preventive control policy with Artificial Intelligence (AI) technologies. It adopts the reinforcement learning technique to make full use of the available historical data and thereby overcome the problems of high cost and low accuracy. Firstly, a preventive control model is designed based on finite automaton theory, which guides the data collection and the learning policy selection. The control model treats voltage stability, frequency stability, current overload prevention, and control cost reduction as feedback variables, without requiring the specific power flow equations and differential equations. Then, after sufficient training, a local optimal preventive control policy can be constructed under the compatibility condition between a fitted action-value function and a fitted policy function. The constructed preventive control policy contains control actions that achieve a low cost and follow the principle of shortening a cross-domain cascading failure propagation sequence as far as possible. The PCMD is more flexible and closer to reality than the theoretical analysis methods and has a lower cost than the physical experimental methods. To evaluate the performance of the proposed method, an experimental case study, the China Electric Power Research Cyber-Physical System (abbreviated as CEPR-CPS) from the China Electric Power Research Institute, is carried out. The result shows that preventive control policy construction with the PCMD is more effective than most current methods, such as the multi-agent method, in terms of reducing the number of failure nodes and avoiding state space explosion.

1. Introduction

Active distribution network (ADN) is a typical cyber-physical system (CPS), which consists of a wide variety of distributed generators connected to a power network (PN) and a communication network (CN) [1]. An ADN at every moment of time can be abstracted as a directed graph, where equipment is abstracted as nodes and power lines or communication connections among equipment are abstracted as edges [1]; the directed graph at different time points may differ because the edges or nodes of an ADN change. There are five types of nodes in the PN according to the role of nodes: a power generation node type VP1, a substation node type VP2, a distribution node type VP3, a load node type VP4, and an external node type VP5 (the energy from power transmission lines). Similarly, nodes in the CN are categorized into an information relay node class VC1, a sensor node class VC2, an actuator node class VC3, and a control node class VC4. Nodes in the PN are mutually interdependent with nodes in the CN. Specifically, a substation node nP2 ∈ VP2 in the PN supplies power to nodes in the CN, a sensor node nC2 ∈ VC2 in the CN collects information from nodes in the PN, and an actuator node nC3 ∈ VC3 drives the behaviors of nodes in the PN. There is a risk of potential cross-domain cascading failures (CCF) spreading alternately across the PN and CN due to the interdependence between the PN and CN in an ADN. The propagation process of CCF can be divided into two stages [2]. The first stage is a slowly changing process lasting for several minutes. The second stage is an avalanche-collapse process, which could lead to large-area power outages and cause large economic losses. Hence, it is essential to prevent the propagation of CCF so as to alleviate its side effects and improve the safe and stable operation of an ADN.
A directed graph at every time point can be described as a state of an ADN. These states can be divided into steady states and transient states during ADN operation. A steady state means that an ADN is in stable operation, and a transient state is an intermediate state between steady states. During the propagation process of CCF, an ADN goes through a series of transient states from one steady state to another. If the CCF is not prevented while it is propagating, the ADN is driven into one transient state after another until it stops at a final steady state, and the ADN may collapse. Hence, once the propagation process of CCF is initiated, it is better to use preventive control to block it.
The preventive control process against CCF in an ADN is shown in Figure 1; there are five steps in each routine loop operation: state perception, failure diagnosis, decision, preventive control policies, and control action. The state perception step gathers the real-time data (such as the voltage and current of nodes) of an ADN. The failure diagnosis step detects failure symptoms and predicts failures that could be triggered along some propagation paths of the CCF. The goal of the decision step is to select an optimal preventive control policy (PCP) to prevent potential accidents and the CCF propagation in an ADN. The preventive control policies (PCPs) are strategies that can be adopted in the decision step, and they are constructed in advance according to the safety and stability requirement specifications of an ADN. The control action step is the last routine step; it carries out a sequence of operations on an ADN according to the decisions and affects the state transition process of the ADN.
In the preventive control routine, a PCP is very important: it provides roadmaps of operations against CCF and determines the performance of the preventive control routine. A PCP strategy needs to minimize the length of CCF sequences so as to block the propagation of CCF in an ADN by exerting control actions with low cost. The selected PCP needs to meet some prevention goals optimally, such as preventing voltage collapses, transient instabilities, etc. Theoretical analysis methods [3,4,5,6,7,8] and physical experimental methods [9,10,11] are two typical solutions for PCP construction. The theoretical analysis methods need to obtain approximated models of the controlled plants, and the experimental methods need to build physical prototypes and incur a heavy cost. Therefore, traditional methods for constructing PCPs face challenges such as high costs, long durations, and inaccuracies.
A data-driven approach can make full use of the large amount of available historical data to overcome the shortcomings of the existing methods, generating and optimizing a PCP through AI technologies at lower cost. According to the goals of preventive control against failures, many AI technologies can be adopted for constructing PCPs. Chengxi Liu et al. propose a dynamic security assessment method and obtain a PCP based on a decision tree idea to ensure dynamic security [12]. In order to improve the voltage stability margin of power systems, Mauricio C. Passaro et al. propose a preventive control method for rescheduling power generation based on a neural sensitivity model. Time series data samples from time domain simulations and the dynamic behavior information of the system are used, and the sensitivity is used to select the most effective set of generators to improve the security of the power system [13]. C. Fatih Kucuktezcan et al. use population optimization techniques to construct a PCP. They reduce the search space of each algorithm according to the size of an objective function and make multiple optimization algorithms run continuously, so as to improve the transient security of a power CPS [14]. Soni B. P. et al. use a wide area measurement system (WAMS) and phasor measurement devices to conduct real-time transient stability assessment. Specifically, the support vector machine (SVM) based on least squares is used to identify the steady state of the power system in real time, and then the appropriate dispatching generators are selected for preventive control to ensure transient stability [15]. Kou P. et al. use the deep deterministic policy gradient algorithm with a safety exploration layer for preventive control to ensure that the node voltages of an active distribution network are within their limits [16]. However, these studies focus on the construction of PCPs for only a single failure in a CPS, so the constructed PCPs cannot prevent the occurrence of cascading failures. To prevent cascading failures, researchers have constructed various PCPs. Rabie Belkacemi et al. use a distributed adaptive multi-agent algorithm to obtain a PCP. The PCP blocks the propagation paths of cascading failures by dispatching the power of generators based on the N-1 criterion [17]. Sina Zarrabian et al. use neural network techniques to construct a PCP to prevent the propagation of cascading failures [18], and they also use a reinforcement learning method based on Q learning to obtain a PCP for preventing cascading failures [19]. Mojtaba Khederzadeh and Arash Beiranvand use a strategy of specific thresholds to identify lines with high vulnerabilities and then obtain a PCP based on a genetic algorithm. The PCP eliminates the overload of highly vulnerable lines through load shedding, so as to prevent the spread of cascading failures in a CPS [20]. Dutta O. et al. develop a distributed management system using adaptive critic design based on adaptive dynamic programming. The system can flexibly take preventive actions and corrective actions to deal with the thermal overload of lines in an active distribution network [21]. However, the existing data-driven work focuses only on a single PN and does not consider the interdependence between the PN and CN.
The CCF propagates between the PN and CN alternately, whereas the existing data-driven methods for PCP construction collect observation data only from the PN; they are therefore not applicable to preventing the CCF because the data are insufficient. In order to prevent the propagation of CCF in an ADN, the measured data from both the PN and the CN should be collected.
It is difficult to establish a collaborative simulation of the physical process and the computational process in an ADN, such as a digital twin. However, reinforcement learning techniques can analyze and summarize the interaction between the physical process and the computational process of an ADN by using the existing empirical data, playing the role of simulation and experiment. Using reinforcement learning techniques, there are three advantages for constructing a PCP: less time, lower cost, and higher accuracy. Therefore, a PCP construction method based on the deep deterministic policy gradient idea (abbreviated as PCMD) is proposed to prevent the propagation of CCF in an ADN, and this method collects data about voltage stability, frequency stability, current overload prevention, robustness against CCF, and control cost. However, several challenges arise during the construction process: how to choose the concrete objectives and weight them into the reward function; how to adjust parameters so that the construction process of the PCMD converges and a PCP against CCF exists; how to ensure that the construction process converges to a local optimal solution; and how to verify the effectiveness of the proposed PCMD.
The contributions are summarized as follows. (1) A modeling method is proposed for preventive control using finite automaton theory. The preventive control model describes the effect of CCF in an ADN after the intervention of control actions and can guide the collection of the required trajectory dataset and the construction of a PCP. (2) The PCMD for PCP construction is presented, and it guarantees the voltage stability, the frequency stability, the current overload prevention, and the improved robustness against CCF of an ADN. The constructed PCP can generate control actions with low cost based on the principle of shortening a CCF propagation sequence as far as possible. (3) An experimental China Electric Power Research Cyber-Physical System (CEPR-CPS) case from the China Electric Power Research Institute is carried out to validate the performance of the proposed method. The result shows that the effectiveness of the PCP constructed with the PCMD is better than that of other methods, such as a multi-agent method, in terms of reducing the number of failure nodes and avoiding state space explosion.
The paper is organized as follows. A preventive control model is given in Section 2. In Section 3, the specific construction method is presented. Section 4 gives a case provided by China Electric Power Research Institute to validate the effectiveness of the construction method. Section 5 outlines the discussion.

2. Preventive Control Model

In order to obtain the learning dataset for PCP construction, a preventive control model is constructed to describe the steady state transition process with external interventions against CCF in an ADN. The model describes the transition relationships among the steady states after preventive actions are applied to block the propagation of CCF in an ADN, and CCF sequences should be as short as possible. For example, there are three steady states (noted as g0, g1, and g2), as shown in Figure 2. In the steady state g0, the substation nodes (Bus1 and Bus3) jointly supply power to all connected nodes of the CN, and Bus1 is a backup power source that enables all alternative power edges when the primary power source Bus3 fails and its edges to the nodes in the CN are disconnected; in this case, the steady state g0 transitions to another steady state g1 after the nodes (PG1, Bus3, and EN1) fail. Similarly, when the power generation node PG1, the substation node Bus3, and the external node EN1 are repaired via external interventions, and all disconnected edges are reconnected but remain inactive, the ADN transfers from state g1 to a steady state g2. This transfer process can be described by a finite state machine.
Definition 1.
(Preventive Control Model): A preventive control model against CCF is formally described by an extended finite automaton as a six tuple U = (G, S, A, F, Prb, r, g0).
(1) G = {<V, E>} represents a finite set of directed graphs, each of which is a steady state under the stable operation of an ADN. The steady state gi = <Vi, Ei> ∈ G (i ≥ 0) represents the directed graph of an ADN at discrete time i * Δt, where Δt denotes the sample interval and Vi represents the node set in the steady state gi of an ADN. g0 = <V0, E0> is the initial steady state of an ADN, which is in the normal work condition shown in Figure 2a. A node ni ∈ Vi is represented by a feature vector (ai1, ai2, …, aiNOF)T, where NOF is the number of features of the node ni. The set Ei represents the connection relations between pairs of nodes among the PN and CN under the steady state gi of an ADN at discrete time i * Δt. For example, a power generation node ni = PG1 has six features, which include the voltage ai1 = vi, the current ai2 = Ii, the active power ai3 = PGi, the reactive power ai4 = QGi, the power adjustment ai5 = ΔPGi, and the frequency ai6 = fref.
(2) S represents a CCF set, and its element si = Xi0→Xi1→…→Xin is a cross-domain cascading failure propagation sequence from an initial failure Xi0 in a source node to a failure Xin in a sink node of a state gi = <Vi, Ei>. If two sequences with different orders have the same length and elements, they are considered to be the same. A failure Xij is represented by a vector Xij = (xij1, xij2, ..., xijNOP)T, which represents the operation states of an ADN, where NOP is the number of nodes of an ADN and xijk is the operation state of the kth (k ∈ {1,2,…, NOP}) node of an ADN; when xijk = 0, the operation state of the kth node is normal, and when xijk = 1, it is faulty. For example, a failure X00 in a source node Bus3 of an ADN occurs in a steady state g0 = <V0, E0> ∈ G, as shown in Figure 2a, and its operation states are described as X00 = (xPG1, xBus3, xDN1, xEN1, xBus1, xBus2, xPG2, xSN1, xIR1, xCN1, xIR2, xAN1, xCN2, xIR3, xSN2, xAN2)T = (0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0)T. During the CCF propagation from the failure in the source node Bus3 to a failure Xin in a sink node, three failures occur, and s0 = X00→X01→X03 = (0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0)T → (1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)T → (0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0)T. When a PCP is enabled to prevent the CCF propagation sequence s0, the ADN is transferred from g0 to g1, as shown in Figure 2. In the next state g1, there could be CCF propagation sequences from other source nodes of the ADN.
(3) A⊆Rm represents a set of control actions, which is composed of some control actions following a PCP to prevent the CCF propagation. A PCP is a mapping (l: S→A) from the CCF set to the control action set. For example, in order to prevent the instability of an ADN caused by the failure of the source node Bus3, a backup power Bus1 is activated to enable alternative power edges to the nodes in the CN primarily powered by Bus3, and all connections from nodes (PG1, Bus3 and EN1) to other devices are cut off, the topology of the ADN is changed, as shown in Figure 2b.
(4) F: G×S→G represents a steady state transition function between two steady states of an ADN. For example, the CCF propagation sequence s0 of a state g0 = <V0, E0> ∈ G in an ADN is blocked by a control action l(s0) ∈ A, so that the ADN can quickly reach a new steady state g1 = <V1, E1> ∈ G; this is expressed as $\langle V_0, E_0\rangle \xrightarrow{s_0:\, l(s_0)} \langle V_1, E_1\rangle$.
(5) Prb: F→R represents a state transition probability function under a certain action during the propagation process of CCF, R represents the set of real numbers. The propagation process of CCF has the Markov property [22], and this process is described as a Markov stochastic process. So, the state transition probability function is only related with the current state and action.
(6) r: F→R is an immediate reward function after taking a certain action on a transition.
The preventive control model is a finite automaton, which can be described as a transition system. The state transition of an ADN shown in Figure 3 can be described via a finite automaton, which corresponds to Figure 2.
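For concreteness, the tuple of Definition 1 can be held in plain data structures. The following is a minimal sketch in Python (the class, field, and type names are illustrative assumptions, not the authors' implementation):

# A minimal sketch of Definition 1 as plain Python data structures;
# all names below are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple

Node = str                                   # e.g., "PG1", "Bus3", "IR1"
Edge = Tuple[Node, Node]                     # directed edge between two nodes
SteadyState = Tuple[FrozenSet[Node], FrozenSet[Edge]]   # g_i = <V_i, E_i>
Failure = Tuple[int, ...]                    # X_ij: one 0/1 entry per node
CCFSequence = Tuple[Failure, ...]            # s_i = X_i0 -> X_i1 -> ... -> X_in
Action = Tuple[float, ...]                   # a control action vector in R^m

@dataclass
class PreventiveControlModel:
    """U = (G, S, A, F, Prb, r, g0) from Definition 1."""
    G: List[SteadyState]                                  # steady states
    S: List[CCFSequence]                                  # CCF propagation sequences
    A: List[Action]                                       # control action set
    F: Dict[Tuple[int, int], int] = field(default_factory=dict)           # (state, CCF) -> next state
    Prb: Dict[Tuple[int, int, int], float] = field(default_factory=dict)  # transition probability
    r: Callable[[int, int, int], float] = lambda g, s, g_next: 0.0        # immediate reward
    g0: int = 0                                            # index of the initial steady state

    def step(self, g_idx: int, s_idx: int) -> int:
        """Apply the transition function F to a steady state and a blocked CCF sequence."""
        return self.F[(g_idx, s_idx)]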

3. Preventive Control Policy Construction

In the preventive control model, a PCP (l: S→A) against CCF is described. Theoretical analysis methods and physical experimental methods can be used to construct this mapping (l: S→A). However, the theoretical analysis methods may not be accurate due to approximated models, and the physical experimental methods are expensive and time consuming because prototypes must be built. A PCP construction method based on the deep deterministic policy gradient idea (abbreviated as PCMD) is therefore proposed to obtain the mapping (l: S→A). In the PCMD, two problems need to be solved: the convergence and the optimization conditions of the construction process. The specific construction method, the solutions to these two problems, and an algorithm implementation are explained below.

3.1. Construction Method

The PCMD is divided into three steps: data collection, failure diagnosis, and preventive control policy fitting. The first step of the PCMD collects data including the voltage of nodes, the frequency, the current of nodes, and the robustness against CCF, and it also considers the control cost reduction as a feedback variable. The second step of the PCMD detects failure symptoms and predicts a CCF propagation sequence from the occurrence of an initial failure. The third step of the PCMD uses a deep neural network to maintain a parameterized PCP l(s,wl) pursuing the maximum of the infinite horizon discounted expected cumulative reward. Based on the model defined in Definition 1, the infinite horizon discounted expected cumulative reward from an initial state g0 is $E\left[\sum_{i=0}^{\infty}\gamma^{i} r(F_i)\right]$, where E is the expectation over different trajectories, Fi is a steady state transition in a trajectory, and γ is the discount factor. In addition, the PCMD also needs deep neural networks to maintain a parameterized target PCP l’(s,wl’) and parameterized action-value functions (Q(s,l(s,wl),wQ) and Q’(s,l’(s,wl’),wQ’)) for constructing l(s,wl) steadily. Here, s ∈ S is a CCF propagation sequence, si is a CCF propagation sequence in a steady state gi, and wl, wl’, wQ, and wQ’ are the parameters of the parameterized functions.
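As a small illustration of this objective, the discounted cumulative reward of sampled trajectories can be estimated by truncating the infinite horizon at the trajectory length. The sketch below is only illustrative; the function names, reward values, and discount factor are assumptions:

# A minimal sketch of the learning objective: estimate the discounted expected
# cumulative reward E[sum_i gamma^i r(F_i)] from finite sampled trajectories.
from typing import List

def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Discounted return of a single trajectory, truncated at its length."""
    g, acc = 1.0, 0.0
    for r in rewards:
        acc += g * r
        g *= gamma
    return acc

def expected_return(trajectories: List[List[float]], gamma: float = 0.99) -> float:
    """Monte Carlo estimate of the expectation over trajectories."""
    return sum(discounted_return(tr, gamma) for tr in trajectories) / len(trajectories)

# Example: two short trajectories of immediate rewards r(F_i)
print(expected_return([[1.0, 0.5, -0.2], [0.8, 0.8]], gamma=0.9))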
According to Definition 1, many possible expected steady states could be reached from the current steady state when a PCP is attempted to block a CCF propagation sequence. The goal of the constructed PCP is to be selected to reach an optimal steady state from the current steady state with minimum cost. Those requirements about the expected steady state are described by an objective function, as shown in Formula (1).
$$\min \sum_{i=1}^{\left|V_{P1}\cup V_{P5}\right|} W_{Gi}\,\Delta P_{Gi} + W_{RP}\left(1 - R_{PF}\right)$$
where VP1 denotes the set of power generation nodes (distributed generators), and VP5 denotes the set of external nodes (the energy from power transmission lines). WGi denotes the weight of the control cost objective, and WRP denotes the weight of the robustness (against CCF) objective. ΔPGi is defined in Definition 1. RPF (a discrete quantity) denotes the proportion of failure nodes during the propagation of CCF in an ADN.
Formula (1) describes the control cost objective and the robustness (against CCF) objective. Instead of using the Pareto idea, the commonly used weighting method [19] is used to integrate multiple objectives into a single objective function.
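The following sketch illustrates how the weighted single-objective value of Formula (1) can be evaluated for a candidate control action; the weight values and node count are illustrative assumptions, not values from the paper:

def preventive_control_objective(delta_pg, w_g, r_pf, w_rp):
    """Literal evaluation of Formula (1): weighted control cost over the
    dispatchable nodes plus the weighted robustness term, to be minimized."""
    control_cost = sum(w * dp for w, dp in zip(w_g, delta_pg))
    return control_cost + w_rp * (1.0 - r_pf)

# Example: three dispatchable nodes, 20% of nodes failed during the CCF
print(preventive_control_objective([0.10, 0.05, 0.00], [1.0, 1.0, 2.0], 0.2, 5.0))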
During the process of blocking a CCF propagation sequence, the safety requirements and the basic physical constraints should be obeyed. Formula (2) describes inequality constraints (including continuous quantities and discrete quantities) corresponding to safety requirements. Formula (3) describes the equality constraint of power flow equations in the PN.
$$\begin{aligned}
& P_{Gi\_\min} \le P_{Gi} \le P_{Gi\_\max}, && Gi \in V_{P1}\\
& Q_{Gi\_\min} \le Q_{Gi} \le Q_{Gi\_\max}, && Gi \in V_{P1}\\
& I_{Pi} \le I_{Pi\_\max}, && Pi \in V_{P}\\
& v_{i\_\min} \le v_{i} \le v_{i\_\max}, && i \in V_{P}\\
& \Delta P_{Gi\_\min} \le \Delta P_{Gi} \le \Delta P_{Gi\_\max}, && Gi \in V_{P1}\\
& 0 \le R_{PF} \le 1\\
& f_{ref\min} \le f_{ref} \le f_{ref\max}
\end{aligned}$$
$$\begin{aligned}
& P_{Gi} - P_{Li} - \sum_{j=1}^{Num_B} v_i v_j Y_{ij}\cos\left(\alpha_{ij} - \zeta_i + \zeta_j\right) = 0, && i = 1,\dots,NONP\\
& Q_{Gi} - Q_{Li} + \sum_{j=1}^{Num_B} v_i v_j Y_{ij}\sin\left(\alpha_{ij} - \zeta_i + \zeta_j\right) = 0, && i = 1,\dots,NONP
\end{aligned}$$
where PGi and QGi denote the active power and the reactive power of a power generator node nGi, respectively, PLi and QLi denote the active power and the reactive power of a load node nLi, respectively, vi denotes the voltage of a node ni in the PN, IPi denotes the current of a node ni in the PN, fref denotes the frequency of an ADN, and PGi, QGi, PLi, QLi, vi, IPi, and fref are all defined in Definition 1. To be safe, it is necessary to limit the value ranges of these variables. Correspondingly, PGi_min and PGi_max denote the lower and upper active power bounds of a power generator node nGi, respectively, and QGi_min and QGi_max denote the lower and upper reactive power bounds of a power generator node nGi, respectively. Similarly, ΔPGi_min and ΔPGi_max denote the lower and upper control action bounds, PLi_min and PLi_max denote the lower and upper active power bounds of a load node nLi, respectively, QLi_min and QLi_max denote the lower and upper reactive power bounds of a load node nLi, respectively, vi_min and vi_max denote the lower and upper voltage bounds of a node ni in the PN, and IPi_max denotes the upper current bound of a node ni in the PN. frefmin and frefmax denote the frequency limits of an ADN. NONP is the number of nodes in the PN. ζi denotes the voltage phase angle of a node ni in the PN. Yij and αij denote the admittance and the admittance phase angle between a node ni and a node nj, respectively, in the PN.
In order to get the fitted PCP, the objective function and inequality constraints shown in Formulas (1) and (2) are integrated into the infinite horizon discounted expected cumulative reward, and then the immediate reward r of a transition should be given. The immediate reward r is composed of a reward r1 reflecting voltage stability, a reward r2 for preventing the CCF caused by the current overload, a reward r3 reflecting the frequency stability, a reward r4 reflecting the control cost, and a reward r5 reflecting robustness against CCF. The reward r1 is shown in Formula (4).
$$r_1 = \sum_{i \in V_P} w_{ni}\, r_{ni\_1}$$
where r1 is the weighted sum of rni_1, and the definition of rni_1 is shown in Formula (5). wni is the weight (proportional to the importance) of a node ni, and VP is the set of nodes in the PN. r1_ni_1 (greater than zero) and r1_ni_2 (less than zero) denote the specific immediate reward values when the voltage of the node ni is in different ranges, ‖●‖ denotes the norm, vi_normal is the reference voltage value of a node ni, and vni = min{‖vi_normal − vi_min‖, ‖vi_max − vi_normal‖} is the voltage threshold of a node ni.
$$r_{ni\_1} = \begin{cases} r_{1\_ni\_1}, & \left\|v_i - v_{i\_normal}\right\| \le v_{ni}\\ r_{1\_ni\_2}, & \left\|v_i - v_{i\_normal}\right\| > v_{ni}\end{cases}$$
The reward r2 is shown in Formula (6).
$$r_2 = \sum_{i \in E_P} w_{ni}\, r_{ni\_2}$$
where r2 is the weighted sum of rni_2, and the definition of rni_2 is shown in Formula (7). EP is the set of edges in the PN, r2_ni_1 (greater than zero) and r2_ni_2 (less than zero) represent the specific reward values when the current of the node ni is in different ranges, Ii_normal is the reference current value of a node ni in an ADN, and Ini = IPi_max is the current threshold value of the node ni.
$$r_{ni\_2} = \begin{cases} r_{2\_ni\_1}, & \left\|I_i - I_{i\_normal}\right\| \le I_{ni}\\ r_{2\_ni\_2}, & \left\|I_i - I_{i\_normal}\right\| > I_{ni}\end{cases}$$
The reward r3 is shown in Formula (8).
$$r_3 = \begin{cases} r_{3\_fref\_1}, & \left\|f_{ref} - f_{refnormal}\right\| \le c_{fref}\\ r_{3\_fref\_2}, & \left\|f_{ref} - f_{refnormal}\right\| > c_{fref}\end{cases}$$
where r3_fref_1 (greater than zero) and r3_fref_2 (less than zero), respectively, represent the specific reward value when the frequency value is in different ranges, frefnormal is the reference frequency value, cfref = min{‖frefmin − frefnormal‖, ‖frefmax − frefnormal‖} is the frequency threshold value of an ADN.
The reward r4 is shown in Formula (9).
$$r_4 = \begin{cases} r_{4\_c\_1}, & \left\|a\right\| \le c_{Pc}\\ r_{4\_c\_2}, & \left\|a\right\| > c_{Pc}\end{cases}$$
where a ∈ A ⊆ Rm is the control action, r4_c_1 (greater than zero) and r4_c_2 (less than zero) denote the specific reward values when the control action is in different ranges, and cPc is the control action threshold value.
The reward r5 is the sum of r5_1 and r5_2, as shown in Formula (10).
$$r_5 = r_{5\_1} + r_{5\_2} = r_{nf1}\, n_{failure}(t) + \sum_{i=0}^{t/\Delta t} r_{nf2}\, n_{failure}(i \ast \Delta t)$$
where r5_1 penalizes the number of failure nodes at the current time t, and r5_2 penalizes the cumulative number of failure nodes from the start time 0 to the current time t. rnf1 and rnf2 denote the weights of r5_1 and r5_2, respectively, and both are greater than zero; nfailure(i * Δt) (i ∈ {0,1,2,…, t/Δt}) denotes the number of failure nodes in an ADN at discrete time i * Δt.
Larger values of the rewards r1, r2, r3, and r4 are better, while a smaller value of the reward r5 is better. Thus, the total immediate reward r is shown in Formula (11).
$$r = \sum_{j=1}^{4} r_j - r_5$$
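The sketch below illustrates how the immediate reward of Formulas (4) to (11) can be evaluated for one transition. All thresholds, weights, and reward constants are illustrative assumptions, and the per-node and per-edge sums of r1 and r2 are collapsed into simple loops:

# A minimal sketch of the immediate reward r; all numeric constants below are
# illustrative assumptions, not the paper's tuned parameters.
from typing import Dict, Sequence

def band_reward(value: float, reference: float, threshold: float,
                inside: float, outside: float) -> float:
    """Piecewise reward used by r1, r2, and r3: `inside` (> 0) when the deviation
    from the reference stays within the threshold, `outside` (< 0) otherwise."""
    return inside if abs(value - reference) <= threshold else outside

def immediate_reward(voltages: Dict[str, float], currents: Dict[str, float],
                     freq: float, action: Sequence[float],
                     n_fail_now: int, n_fail_cum: int) -> float:
    # r1: voltage stability (unit node weights assumed, per-unit quantities)
    r1 = sum(band_reward(v, 1.0, 0.05, 1.0, -1.0) for v in voltages.values())
    # r2: current-overload prevention
    r2 = sum(band_reward(i, 0.0, 1.0, 1.0, -1.0) for i in currents.values())
    # r3: frequency stability (50 Hz reference, 0.5 Hz band assumed)
    r3 = band_reward(freq, 50.0, 0.5, 1.0, -1.0)
    # r4: control-cost reward based on the norm of the control action
    norm_a = sum(a * a for a in action) ** 0.5
    r4 = 1.0 if norm_a <= 0.2 else -1.0
    # r5: robustness penalty, failed nodes now plus cumulative failed nodes
    r5 = 1.0 * n_fail_now + 0.5 * n_fail_cum
    return r1 + r2 + r3 + r4 - r5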
The convergence and optimization conditions of the construction process in the PCMD need to be guaranteed, and they are considered in the following sections.

3.2. Convergence

In the PCMD, it is necessary to ensure that the construction process converges and that the PCP against CCF generated from the construction process exists. These requirements are guaranteed by a two-step parameter adjustment. In the first step, the convergence of the construction process is guaranteed by adjusting the parameters of the fitted functions (l(s,wl), l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)). In the second step, the existence of the PCP against CCF is guaranteed by additionally adjusting the parameters of the reward r.
The parameters of the fitted functions that need to be adjusted include the learning rate δ, the initial weights of the fitted functions (l(s,wl), l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)), the parameter α, and the other parameters of the fitted functions based on deep neural networks.
Theorem 1.
Given the equation w = φ(w), if the derivative φ’(w) is continuous on a closed interval [w1, w2], φ(w) ∈ [w1, w2] whenever w ∈ [w1, w2], and there exists a constant 0 < z < 1 such that ‖φ’(w)‖ ≤ z < 1 for all w ∈ [w1, w2], then the following results hold.
(1)
φ(w) has a unique fixed point w* on the closed interval [w1, w2].
(2)
For any initial value w0 ∈ [w1, w2], the iterative scheme $w_{k+1} = \phi(w_k)$ (k = 0, 1, 2, …) converges, and $\lim_{k\to\infty} w_k = w^*$.
(3)
For the sequence $\{w_k\}_{k=0}^{\infty}$, there exists an asymptotic error estimate, as shown in Formula (12).
$$\lim_{k\to\infty}\frac{\left\|w_k - w^*\right\|}{\left\|w_{k-1} - w^*\right\|} = \left\|\phi'(w^*)\right\|$$
The process of fitting l(s) is actually a fixed-point iterative scheme, and the weight wl of the fitted PCP l(s,wl) is obtained by solving the equation wl = φ(wl). The iterative function φ is defined as shown in Equation (13).
$$\phi(w_l) = w_l + \delta\,\frac{1}{N_5}\sum_{i=1}^{N_5}\nabla_{a} Q\left(s_i, l(s_i,w_l), w_Q\right)\,\nabla_{w_l} l\left(s_i, w_l\right)$$
where N5 is the size of a trajectory data sample from a data pool, si is a CCF sequence in a steady state gi.
If the adjustments of the learning rate δ and the initial weight of the fitted function l(s,wl) meet the preconditions of Theorem 1, then the weight sequence of l(s,wl) converges to the solution $w_l^{(\infty)}$ of the equation wl = φ(wl).
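To make the role of Theorem 1 concrete, the sketch below runs a fixed-point iteration on a simple contraction map. The map is purely illustrative and is not the actual actor-update map of Equation (13):

# A minimal numeric sketch of the fixed-point iteration w_{k+1} = phi(w_k):
# when |phi'(w)| <= z < 1 on the interval, the iteration converges to the
# unique fixed point.
def fixed_point_iteration(phi, w0: float, tol: float = 1e-10, max_iter: int = 1000) -> float:
    w = w0
    for _ in range(max_iter):
        w_next = phi(w)
        if abs(w_next - w) < tol:
            return w_next
        w = w_next
    return w

phi = lambda w: 0.5 * w + 1.0              # phi'(w) = 0.5 < 1, fixed point w* = 2
print(fixed_point_iteration(phi, w0=0.0))  # converges to approximately 2.0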
The following analysis explains how the initial weights of the fitted functions (l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)) are adjusted. For example, due to the weight update $w_{Q'}^{(k+1)} = \alpha\, w_Q^{(u)} + (1-\alpha)\, w_{Q'}^{(k)}$ at step k, there is an error equation shown in Equation (14).
$$w_{Q'}^{(k+1)} - w_{Q'}^{(\infty)} = \alpha\left(w_Q^{(u)} - w_{Q'}^{(\infty)}\right) + (1-\alpha)\left(w_{Q'}^{(k)} - w_{Q'}^{(\infty)}\right)$$
where $w_{Q'}^{(k)}$ denotes the weight of Q’(s,l’(s,wl’),wQ’) at step k, $w_Q^{(u)}$ denotes the weight of Q(s,l(s,wl),wQ) at step u, with u ≥ k, and $w_{Q'}^{(\infty)}$ denotes the true value of the weights wQ and wQ’.
An inequality is obtained by taking the norm of the elements in Equation (14), as shown in Formula (15).
$$\left\|w_{Q'}^{(k+1)} - w_{Q'}^{(\infty)}\right\| \le \alpha\left\|w_Q^{(u)} - w_{Q'}^{(k)}\right\| + \left\|w_{Q'}^{(k)} - w_{Q'}^{(\infty)}\right\|$$
where 0 < α < 1.
The true action-value function Q(si, ai) of a steady state gi satisfies the relation shown in Formula (16).
$$Q(s_i, a_i) = E\left[r(F_i) + \gamma\, Q(s_{i+1}, a_{i+1})\right]$$
Thus, the objective function value yi of the Q(si,l(si,wl),wQ) in a steady state gi is derived from Formula (16), as shown in Formula (17).
$$y_i = r(F_i) + \gamma\, Q'\left(s_{i+1}, l'(s_{i+1}, w_{l'}), w_{Q'}\right)$$
where si+1 is a CCF sequence of a steady state gi+1.
From Formula (17), a loss function LQ for updating the weight wQ of the Q(s,l(s,wl),wQ) is shown in Equation (18).
$$L_Q = E\left[\left(Q(s_i, l(s_i,w_l), w_Q) - y_i\right)^2\right] \approx \frac{1}{N_5}\sum_{i=1}^{N_5}\left(Q(s_i, l(s_i,w_l), w_Q) - y_i\right)^2$$
According to Formulas (17) and (18), the weights of Q(s,l(s,wl),wQ) converge to the weights of Q’(s,l’(s,wl’),wQ’), and when u ≫ k, $\left\|w_Q^{(u)} - w_{Q'}^{(k)}\right\| \to 0$. From Formulas (12) and (15), it follows that $\left\|w_{Q'}^{(k+1)} - w_{Q'}^{(\infty)}\right\| \le \eta_{Q'}\left\|w_{Q'}^{(k)} - w_{Q'}^{(\infty)}\right\|$ with $0 < \eta_{Q'} \le 1$. So, the weight sequence of Q’(s,l’(s,wl’),wQ’) converges to the true value $w_{Q'}^{(\infty)}$. Likewise, the weight sequence of Q(s,l(s,wl),wQ) converges to the true value $w_Q^{(\infty)}$, and $\left\|w_Q^{(k+1)} - w_Q^{(\infty)}\right\| \le \eta_Q\left\|w_Q^{(k)} - w_Q^{(\infty)}\right\|$ with $0 < \eta_Q \le 1$.
An inequality is obtained by taking the norm of the elements in Equation (14) and repeating the substitutions, as shown in Formula (19).
$$\left\|w_{Q'}^{(k+1)} - w_{Q'}^{(\infty)}\right\| \le \alpha\,\eta_Q^{u}\left\|w_Q^{(0)} - w_{Q'}^{(\infty)}\right\| + (1-\alpha)\,\alpha\,\eta_Q^{u-1}\left\|w_Q^{(0)} - w_{Q'}^{(\infty)}\right\| + \cdots + (1-\alpha)^{k}\left\|w_{Q'}^{(0)} - w_{Q'}^{(\infty)}\right\|$$
where $w_Q^{(0)}$ denotes the initial weight of Q(s,l(s,wl),wQ), and $w_{Q'}^{(0)}$ denotes the initial weight of Q’(s,l’(s,wl’),wQ’).
According to Formula (19), the weight of the fitted function Q’(s,l’(s,wl’),wQ’) tends to the true value $w_{Q'}^{(\infty)}$ as u and k both tend to infinity. Because of the parameters α < 1 and ηQ ≤ 1, the initial weight $w_{Q'}^{(0)}$ of Q’(s,l’(s,wl’),wQ’) can be selected randomly. The initial weight $w_Q^{(0)}$ of the fitted function Q(s,l(s,wl),wQ) can also be selected randomly due to Formulas (16) and (17). By a similar analysis, the weight sequence of l’(s,wl’) converges to the true value $w_{l'}^{(\infty)}$, and the initial weight $w_{l'}^{(0)}$ of l’(s,wl’) can also be selected randomly. As for the adjustment of the parameter α, the value of the parameter α < 1 can be set as high as possible in order to speed up the convergence of the fitted functions Q’(s,l’(s,wl’),wQ’) and l’(s,wl’). However, because ηQ ≤ 1 and ηQ’ ≤ 1, the value of the parameter α < 1 should not be set too high, in order to preserve the computational stability of the fitted function Q(s,l(s,wl),wQ).
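The sketch below illustrates the target-weight update wQ’ ← αwQ + (1 − α)wQ’ discussed above; a larger α tracks the online weights faster, while a smaller α keeps the targets, and hence the critic update, more stable. Plain Python lists stand in for the network weight tensors; the numbers are illustrative:

from typing import List

def soft_update(w_online: List[float], w_target: List[float], alpha: float) -> List[float]:
    """Target-weight update used by the PCMD: w' <- alpha*w + (1 - alpha)*w'."""
    return [alpha * wo + (1.0 - alpha) * wt for wo, wt in zip(w_online, w_target)]

w_q, w_q_target = [0.8, -0.3, 1.2], [0.0, 0.0, 0.0]
for _ in range(50):                        # repeated updates drive w' toward w
    w_q_target = soft_update(w_q, w_q_target, alpha=0.1)
print(w_q_target)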
In addition, in order to prevent the outputs of the fitted functions (l(s,wl), l’(s,wl’), Q(s,l(s,wl),wQ), and Q’(s,l’(s,wl’),wQ’)) from becoming saturated, the data should be standardized (the per-unit system is used in this paper), the number of neural network layers should be appropriately reduced, batch normalization layers should be added, and the activation function layers should be placed after them. After applying the above parameter adjustments, the convergence of the construction process can be guaranteed, and a PCP can be generated.
In order to guarantee the existence of the PCP against CCF in an ADN, the parameters of the reward r should be adjusted in addition to the parameter adjustments of the fitted functions. The idea of the parameter adjustment in the reward r is to reduce the propagation width and depth of CCF. The basic idea of the parameter adjustments for reducing the propagation width of CCF in an ADN is to ensure that the number of failure nodes at the same time is as small as possible while the requirements of voltage stability, frequency stability, current overload prevention, and control cost reduction are met. Specifically, the parameters vni, Ini, r1_ni_2, r2_ni_2, cfref, r3_fref_2, and r4_c_2 should be set as low as possible, and the values of the parameters r1_ni_1, r2_ni_1, r3_fref_1, r4_c_1, and rnf1 should be increased. Similarly, the basic idea of the parameter adjustments for reducing the propagation depth of CCF in an ADN is to keep the number of successive node failures in a propagation sequence as small as possible while meeting the requirements of voltage stability, frequency stability, current overload prevention, and control cost reduction. Specifically, the values of the parameters vni, Ini, r1_ni_2, r2_ni_2, cfref, r3_fref_2, and r4_c_2 should be set as low as possible, and r1_ni_1, r2_ni_1, r3_fref_1, r4_c_1, and rnf2 should be set as high as possible.
So, after the two-step parameter adjustment, the PCMD converges, and the PCP against CCF in an ADN generated by the construction process exists.

3.3. Satisfiability of Suboptimal Solution

Once the convergence of the PCMD has been guaranteed, it is necessary to ensure that the process converges to an optimal solution. The following analysis shows that, if the compatibility condition is satisfied, the PCP against CCF in an ADN generated by the PCMD converges to a local optimal solution.
The performance function for evaluating the PCP against CCF generated in the PCMD is defined as the expected fitted function Q(s,l(s,wl),wQ), which is shown in Equation (20).
$$J\left(l(s,w_l)\right) = E\left[Q\left(s, l(s,w_l), w_Q\right)\right]$$
The gradient theorem of the deterministic control policy and the condition of compatibility between the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) against CCF are given, respectively. The gradient theorem of the deterministic control policy shown in Theorem 2 ensures the existence of the gradient of the deterministic control policy. In addition, once the condition of the compatibility between the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) against CCF is satisfied, shown in Theorem 3, a local optimum PCP l(s,wl) against CCF can be constructed through the construction process proposed in this paper.
Theorem 2.
(Gradient Theorem of Deterministic Control Policy): It is assumed that the environment (the controlled object) satisfies the conditions of a Markov decision process (MDP) and that the fitted action-value function Q(s,a) and the fitted PCP l(s,wl) are both continuously differentiable. It is also assumed that the gradient $\nabla_a Q(s,a)$ of the function Q(s,a) with respect to the control action a and the gradient $\nabla_{w_l} l(s,w_l)$ of the function l(s,wl) with respect to the parameter wl exist. Then, according to Formula (20), the gradient of the performance function J(l(s,wl)) exists, and it is shown in Equation (21). (The proof of Theorem 2 is given in Appendix A).
$$\nabla_{w_l} E\left[Q\left(s, l(s,w_l)\right)\right] = E\left[\nabla_{a} Q\left(s, a\right)\big|_{a = l(s,w_l)}\;\nabla_{w_l} l\left(s, w_l\right)\right]$$
In the PCMD, there are two fitted PCPs, l(s,wl) and l’(s,wl’), and two fitted action-value functions, Q(s,l(s,wl)) and Q’(s,l’(s,wl’)). It is assumed that the gradients $\nabla_{w_l} Q(s,l(s,w_l))$, $\nabla_{w_l} l(s,w_l)$, $\nabla_{w_{l'}} l'(s,w_{l'})$, and $\nabla_{w_{l'}} Q'(s,l'(s,w_{l'}))$ exist. Then, according to Theorem 2, the gradient of the performance function for the fitted PCP l(s,wl) is shown in Equation (22).
$$\nabla_{w_l} J\left(l(s,w_l)\right) = E\left[\nabla_{l} Q\left(s, l(s,w_l)\right)\,\nabla_{w_l} l\left(s, w_l\right)\right]$$
Correspondingly, the gradient of the performance function for the fitted target PCP l’(s,wl’) is shown in Equation (23).
$$\nabla_{w_{l'}} J\left(l'(s,w_{l'})\right) = E\left[\nabla_{l'} Q'\left(s, l'(s,w_{l'})\right)\,\nabla_{w_{l'}} l'\left(s, w_{l'}\right)\right]$$
Theorem 3.
(Compatibility Theorem of the Fitted Function Q(s,a,wQ) and the Fitted PCP l(s,wl)): In a learning process, Q(s,a,wQ) is a fitted function of the action-value function Q(s,a) with respect to the control action a. If the action-value function Q(s,a) and the fitted PCP l(s,wl) are both continuously differentiable, their gradients $\nabla_a Q(s,a,w_Q)$, $\nabla_a Q(s,a)$, and $\nabla_{w_l} l(s,w_l)$ exist, the second gradient $\nabla_{w_Q}\nabla_a Q(s,a,w_Q)$ also exists, the compatibility condition shown in Equation (24) is satisfied, and the parameter wQ minimizes the function shown in Formula (25), then there is a local optimal PCP, denoted l(s,wl), as shown in Equation (26). (The proof of Theorem 3 is given in Appendix B).
$$\nabla_{a} Q\left(s, a = l(s,w_l), w_Q\right) = \nabla_{w_l} l\left(s, w_l\right)^{T} w_Q$$
$$E\left[\left\|\nabla_{a} Q\left(s, a = l(s,w_l), w_Q\right) - \nabla_{a} Q\left(s, a = l(s,w_l)\right)\right\|^{2}\right]$$
$$E\left[\nabla_{a} Q\left(s, a = l(s,w_l), w_Q\right)\,\nabla_{w_l} l\left(s, w_l\right)\right] = E\left[\nabla_{a} Q\left(s, a = l(s,w_l)\right)\,\nabla_{w_l} l\left(s, w_l\right)\right]$$
where $\nabla_{w_l} l(s,w_l)^{T}$ is the transpose of $\nabla_{w_l} l(s,w_l)$.
According to Theorem 1, if the parameter adjustment conditions are satisfied, then the construction process for a PCP against CCF converges. According to Theorem 2, the existence of the gradient of the performance function J(l(s,wl)) is guaranteed. According to Theorem 3, if the compatibility condition between the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) is satisfied, then the alternative gradient formed by the fitted function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) is equal to the gradient of the performance function J(l(s,wl)), and a local optimal PCP l(s,wl) can be constructed through the PCMD.

3.4. Algorithm Implementation

Reinforcement learning algorithms based on the stochastic control policy gradient can deal with problems with discrete or continuous control action spaces and discrete or continuous state spaces. Because the stochastic control policy assigns a probability to each value in the control action space, it is computation intensive and can be practically infeasible in large-scale networks. Hence, the deterministic control policy was introduced. Deep deterministic policy gradient (DDPG) is a deep reinforcement learning algorithm with a deterministic control policy [23]. It differs from the stochastic control policy gradient methods mentioned above: it integrates deep neural networks and the deterministic control policy [24], and it updates the parameters of the deep neural networks by using the gradient of the deterministic control policy. DDPG uses the deterministic control policy gradient to deal with the continuous action space, and it is applicable to both continuous and discrete state spaces.
The DDPG idea is used to implement the construction process for the PCP against CCF in an ADN through offline interactive learning, as shown in Algorithm 1. Algorithm 1 adopts a deep neural network to fit a PCP l(s,wl). The weight wl in l(s,wl) is learned by the PCMD algorithm.
Algorithm 1 PCMD Algorithm
Input: the state gt*Δt and the immediate reward rt*Δt at discrete time t * Δt.
//Initialization
1 wQ←rand, wl←rand, RB←N1, wQ’←wQ, wl’←wl;
//Training Process
2 for episode = 1 to N2 do
3  N3←randN;
4  Receive the state g1;
5  Detect failure symptoms;
6  Predict a CCF propagation sequence s1*Δt from the occurrence of an initial failure;
7  for t = 1 to N4 do
8   at*Δt←l(st*Δt,wl) + N3;
9   Perform the action at*Δt and observe the immediate reward rt*Δt and the next state g(t+1)*Δt;
10   Detect failure symptoms;
11   Predict a CCF propagation sequence s(t+1)*Δt from the occurrence of an initial failure;
12   Put the data (st*Δt, at*Δt, rt*Δt, s(t+1)*Δt) into the replay buffer RB;
13   Take randomly a mini-batch of N5 samples (si*Δt, ai*Δt, ri*Δt, s(i+1)*Δt) from the replay buffer;
14   if gi*Δt is the final state do
15    yi*Δt = ri*Δt;
16   else
17    yi*Δt = ri*Δt + γQ’(s(i+1)*Δt, l’(s(i+1)*Δt,wl’), wQ’);
18   wQ ← $\arg\min_{w_Q}\frac{1}{N_5}\sum_{i=1}^{N_5}\left(y_{i*\Delta t} - Q(s_{i*\Delta t}, a_{i*\Delta t}, w_Q)\right)^2$;
19   wl ← $w_l + \delta\frac{1}{N_5}\sum_{i=1}^{N_5}\nabla_a Q(s_{i*\Delta t}, l(s_{i*\Delta t},w_l), w_Q)\,\nabla_{w_l} l(s_{i*\Delta t}, w_l)$;
20   wQ’←αwQ + (1 − α)wQ’;
21   wl’←αwl + (1 − α)wl’;
Output: A preventive control policy l(s).
where rand denotes a random number, RB denotes the size of the replay buffer, and randN denotes the selected random process. N1, N2, N4, and N5 denote the parameters of Algorithm 1, and N3 denotes a random process. α and δ are described in Section 3.2.
The execution order of the PCMD algorithm for an ADN is as follows. For each time step Δt in each episode, the algorithm firstly detects failure symptoms and predicts a CCF propagation sequence st*Δt from the occurrence of an initial failure. Then, the fitted PCP l(s,wl) selects a control action at*Δt at discrete time t*Δt according to the current CCF propagation sequence st*Δt, the ADN transits from the current state gt*Δt to another state g(t+1)*Δt under the influence of the CCF propagation sequence st*Δt, and the data (st*Δt, at*Δt, rt*Δt, s(t+1)*Δt) are put into the replay buffer. Finally, the weights of Q(s,l(s,wl),wQ), Q’(s,l’(s,wl’),wQ’), l(s,wl), and l’(s,wl’) are updated.
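The sketch below mirrors steps 13 to 21 of Algorithm 1 for one mini-batch update. Linear function approximators stand in for the deep networks purely to keep the update rules explicit; the dimensions, noise scale, and hyper-parameters are illustrative assumptions:

# A minimal sketch of one PCMD/DDPG update with linear approximators (an
# assumption for illustration only, not the paper's network architecture).
import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM = 8, 3                        # CCF-sequence feature dim, action dim
w_l  = rng.normal(size=(S_DIM, A_DIM))     # actor l(s, w_l) = s @ w_l
w_q  = rng.normal(size=(S_DIM + A_DIM,))   # critic Q(s, a, w_q) = [s, a] @ w_q
w_lp, w_qp = w_l.copy(), w_q.copy()        # target networks l', Q'
gamma, delta, alpha = 0.99, 1e-3, 0.01

def actor(s, w):  return s @ w
def critic(s, a, w):  return np.concatenate([s, a]) @ w

# a mini-batch sampled from the replay buffer: (s_i, a_i, r_i, s_{i+1})
batch = [(rng.normal(size=S_DIM), rng.normal(size=A_DIM),
          rng.normal(), rng.normal(size=S_DIM)) for _ in range(32)]

# critic update: move Q(s_i, a_i) toward y_i = r_i + gamma * Q'(s_{i+1}, l'(s_{i+1}))
grad_q = np.zeros_like(w_q)
for s, a, r, s_next in batch:
    y = r + gamma * critic(s_next, actor(s_next, w_lp), w_qp)
    grad_q += (critic(s, a, w_q) - y) * np.concatenate([s, a])
w_q -= delta * grad_q / len(batch)

# actor update: deterministic policy gradient  grad_a Q * grad_{w_l} l
grad_l = np.zeros_like(w_l)
for s, _, _, _ in batch:
    grad_a = w_q[S_DIM:]                   # d/da of the linear critic
    grad_l += np.outer(s, grad_a)          # chain rule through l(s) = s @ w_l
w_l += delta * grad_l / len(batch)

# soft update of the target networks
w_qp = alpha * w_q + (1 - alpha) * w_qp
w_lp = alpha * w_l + (1 - alpha) * w_lp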
The replay buffer mechanism used in Algorithm 1 was first proposed in [25]. The replay buffer mechanism decorrelates the stored data and lets the noise in individual samples cancel out, which makes the learning process of Algorithm 1 converge faster. Without the replay buffer mechanism, Algorithm 1 could perform gradient descent in the same direction for a period of time; under the same step size, calculating the gradient directly on correlated samples may prevent the learning process from converging. The replay buffer mechanism selects some samples randomly from a memory pool, updates the weight of the fitted Q(s,l(s,wl),wQ) by temporal-difference learning, uses this weight information to estimate the gradients of Q(s,l(s,wl),wQ) and l(s,wl), and then updates the weight of the fitted PCP. The introduction of random noise in Algorithm 1 ensures the execution of the exploration process.
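A replay buffer of this kind can be implemented as a bounded memory with uniform random sampling. The sketch below is illustrative; the class and method names are assumptions:

# A minimal sketch of the replay-buffer mechanism: transitions are stored in a
# bounded memory and sampled uniformly at random, which breaks the temporal
# correlation of consecutive samples.
import random
from collections import deque
from typing import Any, Deque, List, Tuple

Transition = Tuple[Any, Any, float, Any]   # (s_t, a_t, r_t, s_{t+1})

class ReplayBuffer:
    def __init__(self, capacity: int) -> None:
        self._buffer: Deque[Transition] = deque(maxlen=capacity)  # old samples are evicted

    def put(self, transition: Transition) -> None:
        self._buffer.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        return random.sample(self._buffer, min(batch_size, len(self._buffer)))

    def __len__(self) -> int:
        return len(self._buffer)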
According to Section 3.2, the convergence of Algorithm 1 can be guaranteed via parameter adjustments, and a PCP against CCF can be obtained. In addition, the introduction of the parameter α in Algorithm 1 ensures the stability of the weight updates of l’(s,wl’) and Q’(s,l’(s,wl’),wQ’) and thus the stability of the whole learning process. According to Section 3.3, once the compatibility condition is satisfied, the generated PCP against CCF in an ADN converges to a local optimal solution. Thus, after interacting with the offline trajectory dataset of an ADN iteratively, a fitted PCP can be learned by Algorithm 1.

4. Case Study

The experimental CEPR-CPS is an ADN case designed from the actual system provided by the China Electric Power Research Institute. The topology and node numbering of the PN and CN in the CEPR-CPS are shown in Figure 4. Node numbers in the PN range from 1 to 80, and node numbers in the CN range from 81 to 109. Nodes 1–4, 23, 60, and 75–80 (abstracted from the contact switches) are distribution nodes. Nodes 17, 18, 24, 25, 43, 44, and 56 are substation nodes. Nodes 66–74 are power generation nodes (66, 68, and 70: battery energy storage nodes; 67 and 73: photovoltaic power generation nodes; 69 and 74: wind power generation nodes; 71: a hydraulic turbine node; 72: a micro-gas turbine node). Nodes 63–65 are external nodes. The remaining nodes in the PN are load nodes. Node 81 is the control node, nodes 82–87 are information relay nodes, and nodes 88–109 are sensor/actuator nodes. There are 109 nodes in the CEPR-CPS: 80 nodes and 107 edges in the PN, and 29 nodes and 28 edges in the CN. The directed interdependence edges between the PN and CN in the CEPR-CPS are shown in Table 1. A directed interdependence edge goes from a node to its successor node. Nodes of the PN supply power to nodes of the CN, and nodes of the CN collect the information of nodes in the PN and thereby control their actions. For example, node 43 of the PN supplies power to nodes 104, 106, and 107 of the CN (as shown by the dotted orange line in Figure 4), and node 99 of the CN collects the information of node 16 in the PN (as shown by the dotted blue line in Figure 4).
The CEPR-CPS is used to verify the effectiveness of the PCP against CCF generated from the PCMD compared with the one from the multi-agent method.

4.1. Trajectory Data Collection

In order to construct a PCP against CCF in the CEPR-CPS using the PCMD, the offline trajectory data {(gi*Δt, ai*Δt, ri*Δt, g(i+1)*Δt) | i ∈ Tra} need to be collected from the CEPR-CPS, where Tra = {0,1,2,…,Tr/Δt} and Tr denotes the duration of a trajectory. According to Definition 1, the data are collected in the CEPR-CPS at discrete time i * Δt (i ≥ 0), and the collected data include the state data gi*Δt, the control action data ai*Δt, and the reward data ri*Δt of the CEPR-CPS.
The state data gi*Δt of CEPR-CPS include the three-phase voltage and the three-phase current values (complex number, per unit) of the 80 nodes, the rotor mechanical angle (deg) and the rotor speed (per unit) of a hydraulic turbine node, the active and the reactive power values (complex number, per unit) of two wind power generation nodes, and the three-phase voltage values and the three-phase current values (complex number, per unit) from inverters of one photovoltaic node, one micro-gas turbine node, and three energy storage nodes.
According to different operating conditions, leading conditions and subsequent constraints, the control action data ai*Δt of CEPR-CPS include the phase voltage of the controllable voltage source of energy storage nodes, the reactive power of wind power generation nodes, the three-phase voltage of the controllable voltage source of micro-gas turbine nodes and photovoltaic nodes, and the active power of a hydraulic turbine node.
The reward data ri*Δt include a reward r1 reflecting the voltage stability, a reward r2 for preventing the CCF caused by the current overload, a reward r3 reflecting the frequency stability, a reward r4 reflecting the control cost, and a reward r5 reflecting robustness against CCF.
After the state data gi*Δt, the action data ai*Δt and the reward data ri*Δt are obtained at the discrete time i*Δt (i ≥ 0), then the transition data (gi*Δt, ai*Δt, ri*Δt, g(i+1)*Δt) can be constructed. After obtaining the transition data in a successively dependent manner, trajectory data can be obtained. Then, multiple different trajectory data form an offline trajectory dataset {{iTra|(gi*Δt, ai*Δt, ri*Δt, g(i+1)*Δt)}}.
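The chaining of per-sample records into transitions and trajectories can be sketched as follows; the record layout and function names are illustrative assumptions:

# A minimal sketch of assembling the offline trajectory dataset: per-sample
# records (g, a, r) taken at discrete times i*dt are chained into transitions
# (g_i, a_i, r_i, g_{i+1}), and multiple trajectories form the dataset.
from typing import Any, List, Tuple

Transition = Tuple[Any, Any, float, Any]

def build_trajectory(states: List[Any], actions: List[Any],
                     rewards: List[float]) -> List[Transition]:
    """Chain successive samples into (g_i, a_i, r_i, g_{i+1}) transitions."""
    return [(states[i], actions[i], rewards[i], states[i + 1])
            for i in range(len(states) - 1)]

def build_dataset(trajectories: List[Tuple[List[Any], List[Any], List[float]]]
                  ) -> List[List[Transition]]:
    """Multiple trajectories form the offline dataset used by the PCMD."""
    return [build_trajectory(g, a, r) for g, a, r in trajectories]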

4.2. Learning Process

After the trajectory data collection step, the PCMD interacts with the offline trajectory dataset of the CEPR-CPS iteratively. Once the parameters of the PCMD are adjusted and the compatibility condition between the fitted action-value function Q(s,l(s,wl),wQ) and the fitted PCP l(s,wl) against CCF is satisfied, the weights of the fitted action-value functions and a local optimal preventive control policy are learned. The parameters used in the PCMD for the CEPR-CPS are shown in Table 2.
Here, TS is the sampling period (seconds), TF = Tr is the simulation time (seconds), N3.var is the variance of the random process, N3.decR is the decay rate of the variance of the random process, gr_Th denotes the gradient threshold, stopTrainV denotes the stopping training threshold, cl_RB denotes whether to clear the replay buffer, and isD denotes the stopping threshold of every simulation subprocess. The state is 504 dimensional in total, and the control action is 29 dimensional, corresponding to the outputs of the distributed generators. The CCF propagation sequence input layer of the fitted Q function is regularized and fully connected, including 600 neuron nodes. Its CCF propagation sequence output layer uses relu, is fully connected, and contains 400 neuron nodes. The action input layer of the fitted Q function is fully connected and contains 400 neuron nodes; the layer merging the CCF propagation sequence output and the action output of the fitted Q function uses relu, is fully connected, and contains only one neuron node. The input layer of the fitted PCP l is regularized and fully connected, including 300 neuron nodes. The output layer of the fitted PCP l uses tanh, is fully connected, and contains 29 neuron nodes. There is no additional hidden layer.
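As an illustration only, the layer sizes described above can be approximated in PyTorch as follows; the original implementation and framework are not specified here, and the regularization and exact layer wiring are assumptions:

# An approximate rendering of the described critic and actor layer sizes;
# PyTorch is an assumed framework used purely for illustration.
import torch
import torch.nn as nn

class CriticQ(nn.Module):
    """Q(s, a): a 504-dim CCF-sequence/state path merged with a 29-dim action path."""
    def __init__(self, s_dim: int = 504, a_dim: int = 29) -> None:
        super().__init__()
        self.s_path = nn.Sequential(nn.Linear(s_dim, 600), nn.ReLU(),
                                    nn.Linear(600, 400))
        self.a_path = nn.Linear(a_dim, 400)
        self.merge = nn.Sequential(nn.ReLU(), nn.Linear(400, 1))

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.merge(self.s_path(s) + self.a_path(a))

class ActorPCP(nn.Module):
    """l(s, w_l): maps the state to a 29-dim control action in [-1, 1]."""
    def __init__(self, s_dim: int = 504, a_dim: int = 29) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 300), nn.Tanh(),
                                 nn.Linear(300, a_dim), nn.Tanh())

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

q, l = CriticQ(), ActorPCP()
s, a = torch.randn(4, 504), torch.randn(4, 29)
print(q(s, a).shape, l(s).shape)   # torch.Size([4, 1]) torch.Size([4, 29])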

4.3. Results and Analysis

The case study focuses on a scenario in which the open-circuit failure of a node is the initial failure; that is, the initial failure is a physical equipment failure in the CEPR-CPS. The multi-agent method [11] is selected as the comparison method. Comparison diagrams of CCF initiated from a node failure in the PN and in the CN are shown in Figure 5 and Figure 6, respectively.
According to Figure 5 and Figure 6, the PCMD proposed in this paper blocks the occurrence of CCF in the CEPR-CPS more effectively than the existing multi-agent method. Meanwhile, the conclusion that the convergence of the PCMD can be guaranteed by the parameter adjustments is verified. The conclusion that a local optimal PCP can be constructed under the compatibility condition between a fitted Q function and a fitted policy function is also verified. The PCMD is better than the multi-agent method in terms of reducing the number of failure nodes and avoiding the state space explosion problem. An existing problem of the multi-agent method is the curse of dimensionality (feature dimension), and the multi-agent method is not suitable for large-scale networks due to its poor scalability. The PCMD overcomes the curse of dimensionality and can be extended to large-scale networks.
The PCMD is also compared with the other two kinds of methods, theoretical analysis methods and physical experimental methods, in terms of time, cost, reliability, and accuracy. The specific comparison results are shown in Table 3. It can be seen that the PCMD is better than the theoretical analysis methods and the physical experimental methods in terms of time and cost.

5. Discussion

In this paper, a preventive control model based on finite automaton theory is designed; it is a six-tuple that describes preventive actions for blocking the propagation of cross-domain cascading failures in an active distribution network of a cyber-physical system, under the requirement that cross-domain cascading failure sequences should be as short as possible. This model is helpful for guiding the trajectory data collection and the learning policy selection.
Then, this paper proposes a methodology (named PCMD) for constructing a preventive control policy to block the propagation of cross-domain cascading failures in an active distribution network of a cyber-physical system. This methodology is based on the deep deterministic policy gradient idea; it trains the deep neural networks with trajectory data samples originating from simulations and does not need to consider the specific power flow equations. In addition, parameter adjustments are proposed to guarantee the convergence of the construction process for generating a preventive control policy against cross-domain cascading failures. The gradient theorem of the deterministic control policy and the compatibility condition between the fitted action-value function and the fitted preventive control policy have been given to ensure the suboptimality of the generated preventive control policy.
Finally, the proposed PCMD has been verified in the CEPR-CPS provided by the China Electric Power Research Institute. It is shown that the PCMD is better than the multi-agent method in terms of reducing the number of failure nodes and avoiding the state space explosion problem; the scalability of the multi-agent method is poor, making it unsuitable for large-scale networks due to the curse of dimensionality. However, the space complexity of the PCMD is high because it must store a large amount of data. In addition, the proposed PCMD is compared with the theoretical analysis methods and the physical experimental methods in terms of time, cost, reliability, and accuracy. It is shown that the PCMD is better than the theoretical analysis methods and the physical experimental methods in all respects except reliability.
In future work, one further point should be considered. The propagation process of cross-domain cascading failures is time-dependent, and it also takes time for a preventive control policy to take effect. Therefore, a remaining question is how to reduce the deviation between the time when the preventive control policy takes effect and the time of failure occurrence, that is, how to ensure that the control actions applied by the preventive control policy take effect before the next failure in a cross-domain cascading failure propagation sequence occurs, so as to achieve an accurate preventive control effect.

Author Contributions

Conceptualization, P.S. and Y.D.; methodology, P.S.; software, P.S. and S.Y.; validation, P.S., Y.D.; formal analysis, P.S.; investigation, P.S. and Y.D.; resources, Y.D.; data curation, P.S., S.Y. and C.W.; writing—original draft preparation, P.S.; writing—review and editing, Y.D.; visualization, P.S.; supervision, Y.D.; project administration, Y.D.; funding acquisition, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (Basic Research Class)—Basic Theories and Methods of Analysis and Control of the Cyber–Physical Systems for Power Grid under Grant No. 2017YFB0903000.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Appendix A. A Proof of Theorem 2

Proof.
The gradient of Equation (20) with respect to the parameter wl is calculated as shown in Equation (A1).
$$\nabla_{w_l} J\left(l(s,w_l)\right) = \nabla_{w_l} E\left[Q\left(s, l(s,w_l)\right)\right] = E\left[\nabla_{w_l} Q\left(s, l(s,w_l)\right)\right]$$
By the gradient properties of a composite function and a = l ( s , w l ) , Formula (A2) can be derived.
w l Q ( s , l ( s , w l ) ) = a Q ( s , a = l ( s , w l ) ) w l l ( s , w l )
After substituting Formula (A2) into Formula (A1), then Formula (A3) is obtained.
w l J ( l ( s , w l ) ) = E a Q ( s , a = l ( s , w l ) ) w l l ( s , w l )
Where w l l ( s , w l ) is a Jacobian matrix, if l ( s , w l ) is an m-dimensional vector function, and the weight wl is M-dimensional, then the Jacobian matrix w l l ( s , w l ) is represented shown in Formula (A4).
w l l ( s , w l ) = l 1 ( s , w l ) w 1 l 1 ( s , w l ) w 2 l 1 ( s , w l ) w M . . . . . . l m ( s , w l ) w 1 l m ( s , w l ) w 2 l m ( s , w l ) w M
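As a quick sanity check of the chain-rule step in (A2) and (A3), the snippet below compares the autograd gradient of the composition $Q(s, l(s, w_l))$ with the product of $\nabla_a Q$ and the Jacobian $\nabla_{w_l} l$ for a toy case with one action dimension and two policy weights. The policy and action-value functions used here are arbitrary illustrative choices, not the fitted networks of the paper.

```python
# Numerical sanity check of the chain rule in (A2)/(A3) for a toy case with
# m = 1 action dimension and M = 2 policy weights; l and Q below are arbitrary
# illustrative functions, not the fitted networks used in PCMD.
import torch

w_l = torch.tensor([0.3, -0.7], requires_grad=True)  # policy weights (M = 2)
s = torch.tensor([1.5, 2.0])                          # a fixed state

def l(s, w_l):
    # Toy deterministic policy a = l(s, w_l), returning an m = 1 dimensional action.
    return torch.tanh(s @ w_l).reshape(1)

def Q(s, a):
    # Toy action-value function.
    return -(a - 0.5 * s.sum()) ** 2

# Left-hand side of (A2): differentiate the composition Q(s, l(s, w_l)) directly.
lhs = torch.autograd.grad(Q(s, l(s, w_l)).sum(), w_l)[0]

# Right-hand side of (A2): grad_a Q at a = l(s, w_l), times the Jacobian of l.
a = l(s, w_l)
grad_a = torch.autograd.grad(Q(s, a).sum(), a, retain_graph=True)[0]
jac = torch.autograd.grad(a.sum(), w_l)[0]            # the 1 x M Jacobian, flattened
rhs = grad_a * jac

print(torch.allclose(lhs, rhs))                       # expected output: True
```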

Appendix B. A Proof of Theorem 3

Proof.
The minimization shown in Formula (25) means that the condition shown in Equation (A5) is satisfied.

$$\nabla_{w^Q} \mathbb{E}\left[ \left( \nabla_a Q(s, a = l(s, w_l), w^Q) - \nabla_a Q(s, a = l(s, w_l)) \right)^2 \right] = 0 \tag{A5}$$

Then, Formula (A6) can be derived from Formula (A5).

$$\mathbb{E}\left[ 2 \left( \nabla_a Q(s, a = l(s, w_l), w^Q) - \nabla_a Q(s, a = l(s, w_l)) \right) \nabla_{w^Q} \nabla_a Q(s, a = l(s, w_l), w^Q) \right] = 0 \tag{A6}$$

Thus, Formula (A7) is derived from Formula (24) and Formula (A6).

$$\mathbb{E}\left[ \left( \nabla_a Q(s, a = l(s, w_l), w^Q) - \nabla_a Q(s, a = l(s, w_l)) \right) \nabla_{w_l} l(s, w_l) \right] = 0 \tag{A7}$$

Finally, Formula (26) is derived from Formula (A7). □
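For context, one critic parameterization that is known to satisfy a compatibility condition of this kind in the deterministic policy gradient literature [24] is the linear-in-features form below, rewritten in the Jacobian convention of (A4). It is quoted only as a reference point; the symbols $w$ and $v$ are the critic weights of [24], and this is not necessarily the fitted state-action function defined in Formula (24).

```latex
% A compatible critic form from the deterministic policy gradient literature [24],
% rewritten in the Jacobian convention of (A4); shown for context only, not as
% Formula (24) of this paper.
Q^{w}(s, a) \;=\; \bigl(a - l(s, w_{l})\bigr)^{\top}\, \nabla_{w_{l}} l(s, w_{l})\, w \;+\; V^{v}(s),
\qquad
\nabla_{a} Q^{w}(s, a) \;=\; \nabla_{w_{l}} l(s, w_{l})\, w .
```

Here $V^{v}(s)$ is any differentiable baseline that does not depend on the action, so the action gradient of the critic is linear in the policy Jacobian.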

References

  1. Sun, P.; Dong, Y.-W.; Wang, C.; Lv, C.; War, K.Y.; Sun, D.; Wang, L. Cyber-Physical Active Distribution Networks Robustness Evaluation against Cross-Domain Cascading Failures. Appl. Sci. 2019, 9, 5021. [Google Scholar] [CrossRef] [Green Version]
  2. Voropai, N.; Kurbatsky, V.G.; Tomin, N.; Panasetsky, A.D. Preventive and emergency control of intelligent power systems. In Proceedings of the 2012 3rd IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), Berlin, Germany, 14–17 October 2012. [Google Scholar]
  3. Rabiee, A.; Soroudi, A.; Keane, A. Risk-Averse Preventive Voltage Control of AC/DC Power Systems Including Wind Power Generation. IEEE Trans. Sustain. Energy 2015, 6, 1494–1505. [Google Scholar] [CrossRef]
  4. Tan, Y.; Li, Y.; Cao, Y.; Tan, Y.; Jiang, L.; Keune, B. Comprehensive decision-making method considering voltage risk for preventive and corrective control of power system. IET Gener. Transm. Distrib. 2016, 10, 1544–1552. [Google Scholar]
  5. Xu, Z.; Julius, A.A.; Chow, J.H. Robust testing of cascading failure mitigations based on power dispatch and quick-start storage. IEEE Syst. J. 2017, 12, 3063–3074. [Google Scholar] [CrossRef]
  6. Li, S.; Tan, Y.; Li, C.; Cao, Y.; Jiang, L. A Fast Sensitivity-Based Preventive Control Selection Method for Online Voltage Stability Assessment. IEEE Trans. Power Syst. 2018, 33, 4189–4196. [Google Scholar] [CrossRef]
  7. Dong, Y.; Xie, X.; Shi, W.; Zhou, B.; Jiang, Q. Demand-Response-Based Distributed Preventive Control to Improve Short-Term Voltage Stability. IEEE Trans. Smart Grid 2018, 9, 4785–4795. [Google Scholar] [CrossRef]
  8. Khazali, A.; Rezaei, N.; Ahmadi, A.; Hredzak, B. Information Gap Decision Theory Based Preventive/Corrective Voltage Control for Smart Power Systems with High Wind Penetration. IEEE Trans. Ind. Inform. 2018, 14, 4385–4394. [Google Scholar] [CrossRef]
  9. Alburguetti, L.M.; Grilo, A.P.; Ramos, R.A. Preventive Control for Voltage Stability Enhancement Using Reactive Power from Wind Power Plants. In Proceedings of the Power and Energy Society General Meeting, Atlanta, GA, USA, 4–8 August 2019. [Google Scholar]
  10. Xypolytou, E.; Zseby, T.; Fabini, J.; Gawlik, W. Detection and mitigation of cascading failures in interconnected power systems. In Proceedings of the IEEE PES Innovative Smart Grid Technologies Conference, Torino, Italy, 26–29 September 2017. [Google Scholar]
  11. Babalola, A.A.; Belkacemi, R.; Zarrabian, S. Real-Time Cascading Failures Prevention for Multiple Contingencies in Smart Grids Through a Multi-Agent System. IEEE Trans. Smart Grid 2018, 9, 373–385. [Google Scholar] [CrossRef]
  12. Liu, C.; Sun, K.; Rather, Z.H.; Chen, A.; Bak, C.L.; Thøgersen, P.; Lund, P. A Systematic Approach for Dynamic Security Assessment and the Corresponding Preventive Control Scheme Based on Decision Trees. IEEE Trans. Power Syst. 2014, 29, 717–730. [Google Scholar] [CrossRef]
  13. Passaro, M.C.; da Silva, A.P.A.; Lima, A.C.S. Preventive Control Stability via Neural Network Sensitivity. IEEE Trans. Power Syst. 2014, 29, 2846–2853. [Google Scholar] [CrossRef]
  14. Kucuktezcan, C.F.; Genc, V.M.; Erol, O.K. An optimization method for preventive control using differential evolution with consecutive search space reduction. In Proceedings of the IEEE PES Innovative Smart Grid Technologies Conference, Ljubljana, Slovenia, 9–12 October 2016. [Google Scholar]
  15. Soni, B.P.; Saxena, A.; Gupta, V.; Surana, S.L. Transient stability-oriented assessment and application of preventive control action for power system. J. Eng. 2019, 2019, 5345–5350. [Google Scholar] [CrossRef]
  16. Kou, P.; Liang, D.; Wang, C.; Wu, Z.; Gao, L. Safe deep reinforcement learning-based constrained optimal control scheme for active distribution networks. Appl. Energy 2020, 264, 114772. [Google Scholar] [CrossRef]
  17. Belkacemi, R.; Babalola, A.; Zarrabian, S. Experimental implementation of Multi-Agent System Algorithm to prevent Cascading Failure after N-1-1 contingency in smart grid systems. In Proceedings of the 2015 IEEE Power & Energy Society General Meeting, Denver, CO, USA, 26–30 July 2015. [Google Scholar]
  18. Zarrabian, S.; Belkacemi, R.; Babalola, A.A. Intelligent mitigation of blackout in real-time microgrids: Neural Network Approach. In Proceedings of the Power and Energy Conference at Illinois, Urbana, IL, USA, 19–20 February 2016. [Google Scholar]
  19. Zarrabian, S.; Belkacemi, R.; Babalola, A.A. Reinforcement learning approach for congestion management and cascading failure prevention with experimental application. Electr. Power Syst. Res. 2016, 141, 179–190. [Google Scholar] [CrossRef]
  20. Khederzadeh, M.; Beiranvand, A. Identification and Prevention of Cascading Failures in Autonomous Microgrid. IEEE Syst. J. 2018, 12, 308–315. [Google Scholar] [CrossRef]
  21. Dutta, O.; Mohamed, A. Reducing the Risk of Cascading Failure in Active Distribution Networks Using Adaptive Critic Design. IET Gener. Transm. Distrib. 2020, 14, 2592–2601. [Google Scholar] [CrossRef]
  22. Rahnamaynaeini, M.; Hayat, M.M. Cascading Failures in Interdependent Infrastructures: An Interdependent Markov-Chain Approach. IEEE Trans. Smart Grid 2016, 7, 1997–2006. [Google Scholar] [CrossRef] [Green Version]
  23. Lillicrap, T.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2016, arXiv:1509.02971. [Google Scholar]
  24. Silver, D.; Lever, G.; Heess, N. Deterministic Policy Gradient Algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  25. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
Figure 1. The preventive control process against cross-domain cascading failures (CCF) in an active distribution network (ADN).
Figure 2. State transitions of an ADN during the propagation of CCF.
Figure 3. A finite automaton preventive control model of an ADN.
Figure 4. Topology and node numbering of PN and CN in the CEPR-CPS.
Figure 5. Comparison diagram of CCF initiated from a node failure in the PN.
Figure 6. Comparison diagram of CCF initiated from a node failure in the CN.
Table 1. The directed interdependence edges between the power network (PN) and communication network (CN) in the China Electric Power Research-Cyber-Physical System (CEPR-CPS).
Node | Successor Node | Node | Successor Node
17 | 84, 95, 96, 97, 98, 99, 109 | 95 | 12, 63, 64
18 | 100, 101, 102 | 96 | 13, 65
24 | 81, 83, 85 | 97 | 14, 66, 67, 68
25 | 103 | 98 | 15, 69, 70, 71, 72
43 | 104, 106, 107 | 99 | 16
44 | 105 | 100 | 23
56 | 82, 86, 108 | 101 | 21, 22, 24
88 | 1, 2, 3, 4, 5 | 102 | 25, 30, 31, 32
89 | 6, 17, 18, 19, 20 | 103 | 33, 34, 35, 36
90 | 7, 26, 27, 28, 29 | 104 | 40, 41, 42, 44
91 | 9, 46, 47, 48, 50 | 105 | 43, 51
92 | 9, 46, 47, 48, 50 | 106 | 45, 56, 58, 59
93 | 10, 52, 53, 54, 55 | 107 | 49
94 | 11, 60, 61, 62 | 108 | 57, 73
 |  | 109 | 72, 74
Table 2. Parameters used in the PCP construction method based on deep deterministic policy gradient idea (PCMD) for the CEPR-CPS.
CaseParameters
CEPR-CPSstopTrainVcl_RBVnir1_ni_1N3.varTsγα
20.0 × 1017False0.210.04.51.0 × 10−60.965.0 × 10−5
r1_ni_2wniInir2_ni_1N3.decRTfδN2
−501.00.20.05.0 × 10−53.03.0 × 10−6100.0
r2_ni_2cfrefr3_fref_1r3_fref_2cPcr4_c_1r4_c_2rnf1
−500.20.0−5020.010.0−50.010.0
N1N5N4gr_Thrnf2isD
30.0 × 1061024.03.0 × 1061.010.0
Table 3. Comparison between different methods.
Property | PCMD | Theoretical Analysis Methods | Physical Experimental Methods
time | short | long | long
cost | low | high | high
reliability | medium | high | high
accuracy | high | high | medium
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
