1. Introduction
With increasing PV penetration, power system operation suffers from the mismatch between forecast and actual generation. A major research focus is that distribution network voltages become vulnerable under this PV impact [1]. The battery energy storage system (BESS) is a potential solution to mitigate the effect of PV uncertainty.
Engineering has a long history of handling uncertainties in applications [2,3,4]. From a mathematical point of view, solutions mitigating the effect of uncertainties can be classified into two major groups. The first group is probability-based optimization; stochastic programming [5] is a standard algorithm in this group. These techniques are well developed and have been used in real power system applications [6,7]. Chance-constrained optimization, a subset of stochastic programming, is a natural framework because it quantifies the probability of voltage violation [8,9]. While probability-based optimization methods are useful, it is difficult to maintain the balance between their reliability and conservativeness [10], which often leads to over-engineered solutions. Conservativeness is necessary for some applications, but it can increase the control cost of the system [5]; systems that use probability-based optimization to operate BESS-supplemented grids often need larger-than-necessary batteries. Another limitation of probability-based optimization is its reliance on an uncertainty model: the distribution or bounds of the uncertainty must be known or estimated, and in practice such a model can be inaccurate and hard to derive.
A more recently popular group of solutions for managing uncertainty is reinforcement learning (RL). RL was designed to solve the Markov decision process (MDP), which can account for uncertainties without an explicit uncertainty model; the RL agent learns the uncertainty features from historical training data. In voltage control, researchers have already proposed RL methods for different applications: tap-changing transformers [11], capacitor-based Volt/Var control [12,13], load restoration [14], and autonomous voltage control [15] have been proven effective in power systems. RL for BESSs has also been studied for voltage control [16], but the algorithm proposed in that paper uses RL to select which BESS should be used rather than directly controlling the charge/discharge power of the BESSs. Despite these achievements, the long training time, i.e., the low training efficiency, of RL remains a challenge in real applications [17].
One of the reasons for the low training efficiency of RL is the large exploration process. The exploration strategies of RL can be categorized as directed and undirected. Exploration is undirected when the action selection is fully random; typical exploration methods such as ε-greedy, the Boltzmann distribution, and SoftMax [18] are undirected strategies. The undirected exploration strategy is characterized by the lack of utilization of specific information: the RL agent randomly selects actions with uniform probability for exploration. Directed exploration methods, by contrast, take advantage of specific information or knowledge to guide the exploration. The active exploration (AE) method was first proposed in [19], whereby human knowledge is used to correct improper actions during the training of an inverted pendulum system. A similar knowledge-based exploration method guides the RL agent in a maze environment by modifying the activation function [20]. Dynamic parameter tuning for exploration in a meta-learning method is verified on a human-interaction robot system in [21]; the parameter changes the probability distribution of the action selection, which avoids ineffective training. In a recent study, the AE concept is applied to controlling an HVAC system with deep Q-learning to accelerate the training [22], where the authors use engineering knowledge to judge the actions picked by the RL agent and achieve faster convergence. Traditional RL agents adopt a simple strategy of randomly selecting an action from the set of all available actions, which wastes significant training time. The above studies indicate that if human knowledge or engineering sense is utilized to guide the action selection during training, the overall training time is reduced.
The existing literature shows the potential of probability-based optimization and RL methods in obtaining better performance of voltage regulation with renewable uncertainties. However, answers to the following three issues are still unsettled, which motivates this paper’s research:
- (1)
The forecasting model of renewable generation is hard to obtain accurately. The mismatch between predicted and actual values diminishes the performance of probability-based optimization. Therefore, how to overcome the effect of the forecasting mismatch is a key challenge for probability-based optimization methods.
- (2)
The conservativeness of probability-based optimization methods requires a larger BESS size to compensate for the forecasting mismatch. Knowing how to reduce the conservativeness of the traditional methods so that the BESS size can be managed properly is valuable for industry.
- (3)
RL has advantages in handling uncertainties, but its low training efficiency limits its application. Utilizing probability-based optimization to shorten the training time is therefore meaningful.
In this paper, we integrate conventional engineering knowledge with a relatively novel RL approach to address the aforementioned issues. Specifically, chance-constrained optimization is employed as the engineering knowledge of the proposed AE method to improve the training efficiency in voltage regulation problems, while the RL framework relieves the dependency on forecasting accuracy and reduces the conservativeness. The contributions of this work are summarized as follows:
- (1)
We propose an AE method for the voltage regulation problem using BESSs in an RL framework. The performance of the proposed method is verified on an IEEE standard test feeder, and simulation results reveal that it outperforms the benchmark methods.
- (2)
The proposed method speeds up the training process by modifying the action-selection distribution according to the engineering knowledge, which discourages unnecessary exploration. To validate the improvement in training efficiency, the proposed method is compared with conventional Q-learning as a benchmark.
- (3)
The effectiveness of the BESSs’ usage is improved by the proposed method. Compared with a chance-constrained optimization, the proposed method achieves the voltage regulation with a smaller BESS size in cases where the chance-constrained optimization returns an infeasible solution.
The rest of the paper is organized as follows. Section 2 presents the chance-constrained optimization, which is regarded as the engineering knowledge. Conventional Q-learning and our proposed method are given in Section 3. Section 4 illustrates the case studies, and a conclusion is drawn in Section 5.
3. Proposed Method: Battery Energy Storage System (BESS) Operation by Active Exploration (AE) Reinforcement Learning
The Markov decision process is the foundation of RL. In this section, we discuss the MDP formulation of the BESS operation problem. Then, a modified Q-learning framework with active exploration is proposed to solve it.
3.1. Markov Decision Process (MDP) of BESS Operation
An MDP is defined by the 4-tuple $(S, A, P, R)$. $S$ denotes the state space with $s \in S$, and $A$ represents the action space with $a \in A$. $P(s' \mid s, a)$ is the transition probability, and $R(s, a, s')$ is the immediate reward obtained after transitioning from state $s$ to state $s'$ with the action $a$.
State: In our problem, the first state variable is the per-unit voltage magnitude of the node. The second state variable is the SOC of the BESS, which ranges from 0 to 1; the interval between SOC states is determined by the charging or discharging power values of the action space. The total state space is the product of the voltage states and the SOC states, $S = S_V \times S_{SOC}$.
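As an illustration of this discretization, the following MATLAB sketch maps a measured voltage and SOC pair to a single Q-table state index. The bin edges and resolution are our assumptions (the SOC step corresponds to a 20 kW action over 15 min on a 400 kWh BESS), not necessarily the paper's exact settings.

```matlab
% Illustrative state discretization (assumed bin edges)
V_edges   = 0.90:0.01:1.10;            % per-unit voltage bins
SOC_edges = 0:0.0125:1;                % 20 kW * 0.25 h / 400 kWh = 0.0125 per step

v_meas   = 1.032;  soc_meas = 0.47;    % example measurements
v_idx    = discretize(v_meas,   V_edges);
soc_idx  = discretize(soc_meas, SOC_edges);

% Combine the two indices into a single state index for the Q-table
n_soc = numel(SOC_edges) - 1;
s_idx = (v_idx - 1)*n_soc + soc_idx;
```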
Action: Given a state $s$, a BESS executes the action $a_t$ at time slot $t$, where $a_t$ represents the charging or discharging power. To avoid violating the power constraints, the action is kept within the minimum and maximum power,
$$a_t \in \{P_{\min},\, P_{\min}+\Delta P,\, \ldots,\, P_{\max}\}, \qquad \Delta P = \frac{P_{\max} - P_{\min}}{N_a - 1},$$
where $\Delta P$ discretizes the action set and $N_a$ is the number of actions for a single BESS. In this work, $N_a$ is set to 21, $P_{\min}$ is −200 kW, $P_{\max}$ is 200 kW, and $\Delta P$ is 20 kW. This range is the same as in the chance-constrained optimization.
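For concreteness, the discrete action set can be generated in MATLAB as below; the variable names are ours, not the paper's notation.

```matlab
% 21 discrete charge/discharge power levels from -200 kW to +200 kW
P_min = -200;                         % kW (negative = discharging)
P_max =  200;                         % kW (positive = charging)
N_a   =  21;                          % actions per BESS
dP    = (P_max - P_min)/(N_a - 1);    % 20 kW step
actions = P_min:dP:P_max;             % [-200, -180, ..., 180, 200] kW
```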
Transition probability: The state space contains two terms, the voltage and the SOC. The voltage of the feeder is updated with the change of the injected net power. The SOC is determined by the previous state and the action,
$$SOC_{t+1} = SOC_t + \frac{\eta\, a_t\, \Delta t}{E_{cap}},$$
where $\eta$ is the charge/discharge efficiency and $E_{cap}$ is the energy capacity of the BESS. If $a_t$ is positive, the BESS is charging and $\eta = \eta_{ch}$; if $a_t$ is negative, the BESS is discharging and $\eta = 1/\eta_{dis}$. $\eta_{ch}$ and $\eta_{dis}$ refer to the charging and discharging efficiencies, respectively. $\Delta t$ is the time interval between actions; our problem uses 15 min, so the total number of steps for one day is 96.
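A minimal MATLAB sketch of this SOC transition is shown below; the efficiency values and capacity in the example call are placeholders, and the equation form follows the reconstruction above.

```matlab
function soc_next = soc_update(soc, p_kw, dt_h, E_kwh, eta_ch, eta_dis)
% One-step SOC transition (sketch): p_kw > 0 charging, p_kw < 0 discharging.
    if p_kw >= 0
        eta = eta_ch;        % charging losses reduce the stored energy gain
    else
        eta = 1/eta_dis;     % discharging losses increase the energy drawn
    end
    soc_next = soc + eta*p_kw*dt_h/E_kwh;
end

% Example: soc_update(0.5, 200, 0.25, 400, 0.95, 0.95) returns about 0.619
```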
Reward: The principle of the reward design is to encourage proper actions and discourage improper ones. The total reward consists of three categories: voltage, SOC, and degradation. A detailed explanation of each follows.
For the voltage reward, if the selected action keeps the voltage magnitudes within the safety range of [0.95, 1.05] p.u., the reward is 1. If the action causes any voltage violation, the punishment is −100. The punishment should have a larger absolute value than the reward accumulated over one episode: since an episode contains 96 steps, a punishment with magnitude smaller than 96 could still leave a violating episode with a positive total reward, and the training process may converge to a local optimum.
The SOC reward follows a similar logic: if the BESS's SOC stays within $[SOC_{\min}, SOC_{\max}]$, the reward is 1; otherwise, the punishment is −100.
The degradation cost reward is the same as in the chance-constrained optimization. Because Q-learning maximizes the sum of rewards, the degradation reward is a negative value determined by a constant related to the degradation coefficient of the BESS in Equation (1).
The overall immediate reward is a weighted combination of the individual rewards,
$$r_t = \omega_V r_V + \omega_{SOC} r_{SOC} + \omega_{deg} r_{deg},$$
where $\omega_V$, $\omega_{SOC}$, and $\omega_{deg}$ are the weighting coefficients.
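The reward logic can be sketched in MATLAB as follows. The weighting vector, the SOC limits, and the throughput-based degradation term are placeholders for illustration; the paper's actual degradation cost follows Equation (1).

```matlab
function r = immediate_reward(v_pu, soc, p_kw, soc_lim, w, kappa, dt_h)
% Weighted immediate reward (sketch): voltage term + SOC term + degradation term.
% w = [w_V, w_SOC, w_deg] are assumed weighting coefficients.
    if all(v_pu >= 0.95 & v_pu <= 1.05)
        r_v = 1;                       % all monitored voltages within limits
    else
        r_v = -100;                    % |punishment| exceeds 96 one-step rewards
    end
    if soc >= soc_lim(1) && soc <= soc_lim(2)
        r_soc = 1;
    else
        r_soc = -100;
    end
    r_deg = -kappa*abs(p_kw)*dt_h;     % placeholder throughput-based degradation
    r = w(1)*r_v + w(2)*r_soc + w(3)*r_deg;
end
```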
3.2. Active Exploration (AE) during the Training
The final product of Q-learning is an optimal policy $\pi^*$ that maximizes the action-value function $Q(s,a)$ such that
$$\pi^*(s) = \arg\max_{a \in A} Q(s, a),$$
where $Q(s,a)$ denotes the value of selecting action $a$ at state $s$, and $\pi$ is the policy determining the actions of the BESSs under a given state. According to the observed state, a traditional RL agent selects actions based on the ε-greedy algorithm [26].
In contrast with the traditional action-selection approach, modifications are introduced to improve the training efficiency. Conventional RL performs exploration by assuming that the action selection follows a uniform distribution: the RL agent has $N_a$ different actions to take, and each possible action has equal probability,
$$P(a) = \frac{1}{N_a}, \quad \forall a \in A,$$
where $P(a)$ is the probability distribution function.
After the optimization problem (10) is solved, the optimal BESS operation is known, and the action can be selected based on that value. In this study, we regard this chance-constrained solution, denoted $a_t^{opt}$, as the engineering knowledge. Such knowledge or experience can accelerate the training process through active exploration. The chance-constrained optimization in our study can only derive a relatively conservative solution; the RL method is expected to be less conservative because RL converges to an average-optimal policy instead of a worst-case-satisfying one.
In this study, we define active exploration as action selection that is actively assumed to follow a Gaussian distribution during exploration,
$$P(a) \propto \exp\!\left(-\frac{(a-\mu)^2}{2\sigma^2}\right),$$
where $\sigma^2$ is the variance of the distribution and $\mu$ is its mean. In our active exploration, $\mu = a_t^{opt}$, so the RL agent has a higher probability of selecting a rational action during the training phase, which improves the convergence rate and the accuracy. The variance should be designed according to the confidence level of the engineering knowledge: if the forecasting algorithm performs well in terms of accuracy, the chance-constrained optimization generates reliable results and the exploration can be concentrated tightly around $\mu$. Otherwise, the distribution should be widened to extend the exploration area and its corresponding probability. As a trade-off, a wider exploration distribution results in a longer training period.
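The contrast between the two exploration schemes can be illustrated with the short MATLAB sketch below; the normalization of the Gaussian over the discrete action set, the example knowledge action, and the value of sigma are our assumptions.

```matlab
% Undirected vs. active exploration over the discrete action set (sketch)
actions = -200:20:200;        % 21 charge/discharge levels in kW
a_opt   = 120;                % example action from the chance-constrained solution
sigma   = 40;                 % kW; reflects confidence in the knowledge (assumed)

% Conventional exploration: uniform probability over all actions
p_uniform = ones(size(actions))/numel(actions);

% Active exploration: discretized Gaussian centered on a_opt
p_ae = exp(-(actions - a_opt).^2/(2*sigma^2));
p_ae = p_ae/sum(p_ae);        % normalize over the discrete actions

% Sample one exploratory action from the AE distribution (inverse-CDF sampling)
idx = find(rand <= cumsum(p_ae), 1, 'first');
a_t = actions(idx);
```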
The comparison between active exploration and conventional exploration is illustrated in Figure 2. Compared with the uniform-distribution exploration used by conventional reinforcement learning, the probability that the RL agent chooses an action close to $a_t^{opt}$ is higher under the Gaussian-distribution exploration. The corresponding reward of active exploration is expected to be larger than that of the conventional uniform exploration because of the engineering knowledge. The larger reward can be regarded as positive feedback that accelerates the learning.
After the actions of the BESSs are selected, the environment interacts with the actions and generates the immediate rewards. The environment is the IEEE 13 Node Test Feeder in OpenDSS [27]. The implementation of the RL is demonstrated in Figure 3.
In contrast with the optimization method, the RL method does not require a linear approximation of the voltage variation. The mapping from injected power to voltage is non-linear and even non-convex, and solving the full power flow preserves this accuracy. The unbalanced load flow is calculated with the backward-forward sweep (BFS) algorithm in the OpenDSS software. The SOC update follows Equation (3).
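For readers unfamiliar with the co-simulation setup, the sketch below shows how MATLAB can drive a power flow through the public OpenDSS COM interface. The feeder path and the simplistic negative-load model of the BESS injection are placeholders, not the authors' exact implementation.

```matlab
% One environment step through the OpenDSS COM interface (Windows, sketch)
DSSObj = actxserver('OpenDSSEngine.DSS');   % start the OpenDSS engine
DSSObj.Start(0);
DSSText    = DSSObj.Text;
DSSCircuit = DSSObj.ActiveCircuit;

DSSText.Command = 'compile "C:\feeders\IEEE13Nodeckt.dss"';   % placeholder path

% Apply the selected BESS action as an equivalent load (negative kW = injection)
p_kw = -120;                                % e.g., discharging 120 kW at node 611
DSSText.Command = sprintf( ...
    'new Load.BESS611 bus1=611.3 phases=1 kV=2.4 kW=%g pf=1', p_kw);

DSSCircuit.Solution.Solve;                  % unbalanced power flow (BFS)
v_pu = DSSCircuit.AllBusVmagPu;             % per-unit voltage magnitudes
```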
When the action $a_t$ is executed, the reward $r_t$ in (17) is calculated and the next state $s_{t+1}$ is evaluated. The action-value function is updated according to the Q-learning rule [26]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right],$$
where $\alpha$ and $\gamma$ stand for the learning rate and the discount rate, respectively, and $a'$ ranges over all potential actions. The Q-learning update (23) takes place in MATLAB.
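A tabular version of this update is compact in MATLAB; the example usage in the comments assumes the hyperparameters reported in Section 4 (learning rate 0.8, discount rate 0.5).

```matlab
function Q = q_update(Q, s, a, r, s_next, alpha, gamma)
% One tabular Q-learning update (sketch); Q is numStates-by-numActions.
    td_target = r + gamma*max(Q(s_next, :));          % bootstrap from best next action
    Q(s, a)   = Q(s, a) + alpha*(td_target - Q(s, a));
end

% Example usage:
%   Q = zeros(nStates, nActions);
%   Q = q_update(Q, s, a, r, s_next, 0.8, 0.5);
%   [~, a_greedy] = max(Q(s, :));   % greedy action after training
```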
4. Case Study and Results
We evaluate the proposed algorithm in four aspects: (1) voltage regulation performance; (2) training efficiency improvement; (3) conservativeness; (4) success rate of the voltage regulation under different uncertainty levels and different BESS sizes. Specifically, the feasibility of the proposed method for voltage regulation is shown in Section 4.1. The proposed AE method is compared with conventional RL to demonstrate the training efficiency improvement in Section 4.2. The conservativeness comparison between the proposed method and the chance-constrained optimization is presented in Section 4.3. In the last subsection, the success rates of the proposed method, conventional Q-learning, and the chance-constrained optimization are studied to reveal their performance under different weather conditions.
Two BESSs are placed in the IEEE 13 Node Test Feeder, at node 611 and node 675. Node 611 is the node most vulnerable to the undervoltage problem, and node 675 is the most likely location of the overvoltage problem. The rest of the feeder keeps the default values, and in our experiment the tap position of the transformer remains unchanged. The power capacity of each BESS is 200 kW. The individual robust parameter is set to 0.99. For RL, the learning rate is 0.8 and the discount factor is 0.5. These parameters are tuned based on the training performance on the historical data. Different parameters result in different training speeds, but the reinforcement learning agent converges to the same optimal policy for different learning and discount rates.
4.1. Voltage Regulation Results of the Proposed Method
In this section, the feasibility of the proposed method is tested. We use 300 days of load and PV data as input and examine the proposed method on a day outside the training set. The energy capacity is 400 kWh for each BESS.
Figure 4 shows the voltage regulation results and illustrates the voltage profiles for the algorithms. The blue lines denote the voltage profiles of node 611, and the red lines denote those of node 675. The x-axis is the hour of the day, and the y-axis denotes the voltage magnitude in per unit. The minimum allowed voltage in our formulation is 0.95 p.u. and the maximum voltage limit is 1.05 p.u. The thin curves are the original voltage profiles, and the bold curves are the results of the proposed method. After the RL agent executes, the voltage profiles are kept within the range of [0.95, 1.05] p.u. Undervoltage violations happen at node 611 after sunset, and overvoltage violations take place at node 675 at noon. Both the overvoltage and undervoltage issues caused by the PV are resolved by the proposed algorithm.
In Figure 5, the operating power and SOC of the two BESSs are given. The x-axis is the hour of the day. The curves with circle markers represent the operating power in kW, and the remaining lines denote the state of charge of the BESSs. From the simulation, we can conclude that the BESSs run within the designed SOC constraints, so the feasibility of the proposed method is verified.
4.2. Comparison with Conventional Q-Learning on Training Efficiency
The feasibility and the training efficiency of the proposed method and conventional Q-learning (QL) are tested in this section. We use 100 days of load and PV data as inputs and examine the proposed method on a day outside the training set for the feasibility test. The aim is to show that the proposed method has an advantage in training efficiency compared with QL.
In Figure 6, the operating power and SOC of the two BESSs are given. The maximum number of training episodes is set to 3000 in this case. The x-axis is the hour of the day, the left y-axis is the operating power of the BESS in kW, and the right y-axis denotes the SOC. From the simulation with 3000 episodes, we can conclude that the BESSs run within the designed constraints under the proposed AE method. However, the SOC under QL is outside the limits of [0, 1]. That means the proposed method converges to an optimal policy within 3000 episodes while conventional QL is still infeasible at the same training effort, which indicates the improvement brought by the proposed AE method.
Figure 7 gives the accumulated rewards and the average values for QL and AE. The maximum number of training episodes is again 3000. The x-axis is the episode number and the y-axis is the reward value. According to the reward formulation, if the final reward value is larger than zero, there is no voltage violation and all operation constraints are satisfied; if it is smaller than zero, some constraints are violated. Both the reward and the average reward of the proposed method are higher than those of QL. At the end of the training, the reward value of the proposed method is positive while that of QL is negative. From the numerical simulation, we can conclude that the proposed AE method accelerates the training process compared with conventional QL.
Figure 8 gives the success rates of QL and AE. The maximum number of training episodes is varied from 1000 to 6000 in this case. We use 300 days of load and PV data as inputs and evaluate the proposed method on 100 different days outside the training set.
The x-axis is the episode number and the y-axis is the success rate. For a specific test profile, OpenDSS calculates the voltage magnitudes and MATLAB updates the SOC of all BESSs according to (13). We set the allowed voltage range to [0.95, 1.05] and the SOC range to [0, 1]. If all the voltages and SOCs remain within the allowed ranges for a test profile, that case is counted as a success for the corresponding algorithm. We use the 100 test profiles to evaluate the trained policies of the AE and conventional Q-learning algorithms and record the total number of successful cases. The success rate is calculated as the number of successful cases divided by the total number of test profiles.
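This calculation can be sketched in MATLAB as below; test_one_profile is a hypothetical helper that runs one trained policy on one test day and reports whether all limits were respected.

```matlab
% Success rate over the test profiles (sketch)
n_test  = 100;                     % number of unseen daily profiles
success = 0;
for k = 1:n_test
    ok = test_one_profile(k);      % hypothetical helper: true if all 96 steps keep
                                   % voltages in [0.95, 1.05] p.u. and SOC in [0, 1]
    success = success + ok;
end
success_rate = success/n_test*100; % percent
```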
From the simulation, the success rate of QL is zero before 3000 episodes and is always lower than that of AE. When the maximum number of episodes increases to 6000, the success rate of QL is the same as that of AE, which indicates that both algorithms converge to the same optimal policy if the training effort is large enough. With a limited number of training episodes, the proposed AE has better training efficiency than conventional QL.
4.3. Comparison with Chance-Constrained Optimization on Conservativeness
In this part, the conservativeness of the proposed method and of the conventional chance-constrained optimization is studied. The conservativeness of a method can be regarded as the BESS capacity needed for a given voltage regulation case; in other words, the difference in SOC usage reflects the conservativeness of the algorithms.
In Figure 9, the voltage regulation performances of the chance-constrained optimization and AE are given. We use 300 days of load and PV data as input and examine the methods on a day outside the training set. The x-axis is the hour of the day, and the y-axis is the voltage magnitude. From the simulation, both the proposed AE and the chance-constrained optimization achieve voltage regulation for the selected case.
The BESS usage is directly related to the SOC: a high variation of the SOC profile represents a large utilization of the BESS. In Figure 10, the SOC profiles of the chance-constrained optimization and AE are given. The x-axis is the hour of the day, and the y-axis is the SOC. Because the SOC is constrained to [0, 1] in this study, both the AE method and the optimization fulfill the SOC constraints. From the results, the proposed AE uses a smaller SOC range than the chance-constrained optimization for the voltage regulation, which indicates that the proposed method is less conservative. In a real application, this feature can reduce the investment in BESSs because less capacity is required. The reason behind this conservativeness is that the chance-constrained optimization requires the BESS operation to satisfy all of the training profiles. To handle all the uncertainties in the training data, the optimization solver generates a conservative operation strategy and leaves a larger stability margin on the voltage profile. This conservative characteristic improves reliability but leads to an unnecessarily large BESS size.
4.4. Success Rate Comparison under Different Sizes and Weather Conditions
In this part, we examine the success rate under different BESS capacities and weather conditions. We use 300 days of load and PV data as inputs and evaluate the proposed method on 150 days outside the training set. Among the selected 150 days, the numbers of clear-sky, cloudy, and overcast days are each 50.
Figure 11a shows the success rates for a maximum of 3000 training episodes, and Figure 11b shows the success rates for a maximum of 6000 training episodes. In both figures, the proposed AE method, conventional QL, and the chance-constrained optimization are compared; this comparison again highlights the training efficiency.
The success rate is the robustness metric in this work. We investigate the effect of both weather conditions and BESS size. The BESS in Section 4.1 is a 4 h BESS (200 kW with 800 kWh). In this part, a 3 h BESS and a 2 h BESS are also studied: we keep the power rating and change the energy capacity, so the 3 h BESS has 200 kW and 600 kWh and the 2 h BESS has 200 kW and 400 kWh.
In Figure 11, the x-axis is the BESS size in kWh, with values of 400 kWh, 600 kWh, and 800 kWh. The y-axis denotes the weather condition, and the z-axis represents the success rate. For a fixed weather condition, the success rate of all algorithms decreases as the BESS size is reduced. In general, the success rate of the RL methods is higher than that of the chance-constrained optimization if the number of training episodes is large enough.
The success rate of the chance-constrained optimization is zero in some cases. The reason is that the solver declared these cases infeasible because of the insufficient BESS size. The operation profiles in these cases are then based only on the forecasted values, without introducing uncertainties, and the mismatch between the forecast and the actual net power creates voltage violations.
From Figure 11, the success rate of the chance-constrained optimization drops with the weather condition: the cloudy days have the highest forecasting mismatch, and the clear-sky days have the lowest. The success rates of AE and Q-learning drop as well, but much more slowly than that of the chance-constrained optimization, which means the chance-constrained optimization is more vulnerable to uncertainties. One potential explanation is that, as the uncertainty level increases, the difference between the training profiles and the test profiles grows. The chance-constrained optimization may handle the cases in the training profiles, but the probability that it encounters a new, worse profile during testing increases with this difference. Because the optimization never saw such a profile before the test process, the generated BESS operation naturally cannot perform well. A possible remedy for the chance-constrained optimization is to enlarge the number of training profiles so that the optimizer sees more uncertainties; for fairness of comparison, the same training profiles were provided to all three algorithms.
During the optimization, the chance-constrained method tries to cover as many of the training profiles as possible, which results in over-engineered solutions (for example, a large BESS size in our application). If the BESS size fails to satisfy the requirement of the chance-constrained optimization, the solver does not generate any operation profile. The goal of the RL agent in Q-learning and AE, by contrast, is to find an average-optimal policy for the BESS, so even when the BESS size is small, the agent can still find an operating policy.
When the maximum number of episodes is set to 3000, Q-learning cannot converge to a feasible policy, so its success rate in Figure 11a is low. When the maximum number of episodes becomes 6000, the success rate of Q-learning is similar to that of AE, which indicates that both algorithms converge to the optimal policy and confirms the convergence of AE and Q-learning.