1. Introduction
Increasing interest in renewable energy sources has led to massive deployment of microgrids, as they offer a scalable way of integrating renewable sources into the main grid while allowing maximum usage of battery energy storage systems. In the long run, the installation of microgrids is expected to reduce the cost of power, lower dependency on the utility grid, and increase rural electrification [1]. Nonetheless, increased integration of distributed renewable energy raises significant challenges for the stable and economic operation of the microgrid, since these resources are extremely volatile and random. These multiple stochastic resources, combined with the load demand, make the preparation of accurate generation schedules very challenging. Deploying a battery energy storage system (BESS) [2] can significantly buffer the impacts of these uncertainties, as it provides various auxiliary services to the power system, i.e., load shifting, frequency regulation, voltage support and grid stabilization [3]. However, for a microgrid to guarantee a reliable supply of power and efficient utilization of the battery storage, an energy management system (EMS) needs to be developed to optimally dispatch and distribute these energy resources based on their availability and associated costs.
Optimal energy management (OEM) involves the management/scheduling of various power system variables, in a day-ahead context, in order to satisfy the load demand at minimal or acceptable cost while satisfying all technical and operational constraints. The main goal of developing an effective EMS is to achieve different objectives such as levelling peak loads, balancing energy fluctuations, maximizing renewable energy usage, reducing power losses, and increasing the system load factor, among others [4]. The EMS faces significant challenges arising from the microgrid's characteristics, including its small size, DRES volatility and intermittency, demand uncertainty, and fluctuating electricity prices. Further advancements in microgrid design and control are needed to address these obstacles. To balance the high volatility of DRESs, additional sources of flexibility must be utilized at the architectural level. Furthermore, new and intelligent control mechanisms are required to optimize energy dispatch and overcome the microgrid's uncertainties.
Aimed at maximizing energy usage or reducing operational cost by intelligently managing the different types of energy resources and controllable loads in a grid-tied microgrid, several control approaches have been proposed. For years, conventional techniques such as mixed-integer linear programming, linear programming, and dynamic programming have been used to optimally manage energy in microgrids [5,6,7]. These methods, however, are reported to suffer from the famous curse of dimensionality and are highly susceptible to producing sub-optimal results in highly stochastic environments, i.e., those containing volatile variables such as load demand, grid tariffs and renewable energy. Such techniques, therefore, have limited flexibility and scalability. Further, metaheuristic techniques including particle swarm optimization (PSO), genetic algorithms (GA), and their hybrids have also been used in the literature to tackle the issue of energy management in microgrids [4,8,9,10]. However, these techniques involve extensive computational time and hence cannot be executed online. Online operation allows computing resources to be used more economically, as it does not require a separate, dedicated computer to perform the optimization process offline. The aforementioned algorithms also lack a learning component, i.e., they are incapable of storing the optimization knowledge and reusing it for a new optimization task [11]. Given that the load demand varies on an hourly basis, the schedule must be recalculated for every new generation and demand profile, which is not computationally efficient. In addition, the performance of these techniques may deteriorate if accurate models or appropriate state-variable forecasts are unavailable. Often, metaheuristic methods are hybridized with other linear methods so that their advantages complement each other. A comprehensive review of these decision-making strategies and their methods of solution has been presented in [12,13].
In the last decade, intelligent learning-based techniques have made major progress in decision-making problems and have also proved ideal for overcoming these limitations, as they can automatically extract, monitor, and optimize generation and demand patterns. Additionally, they can relax the requirement of an explicit system model while still ensuring optimal control. This is of great benefit, since the energy management problem is normally a partially observable problem, i.e., hidden or unknown information always exists.
The reinforcement learning (RL) method, one of the machine learning approaches, is well known for its ability to solve problems in stochastic environments. It aims at making optimal time-sequential decisions in an uncertain environment. Reinforcement learning involves a decision maker (agent) that learns how to act (action) in a particular situation (state) through continuous interaction with the environment so as to maximize cumulative rewards [14,15]. In the learning process, the agent is in a position to learn about the system and to take actions that affect the environment so as to achieve its objective. In RL, the agent considers the long-term reward, instead of simply seeking the immediate maximum reward. This is very important for resource optimization problems in renewable-powered microgrids, where supply and demand change rapidly. Q-learning, one of the RL methods, is commonly used to solve sequential decision-making problems, as explained by the authors in [16]. Q-learning is an off-policy algorithm that does not require any prior knowledge of the rewards or state transition probabilities of a system, thus making it applicable to systems that manage real-time data. Many scholars focusing on microgrid EMSs [11,17,18,19] have used Q-learning specifically to control energy. The key benefit of RL techniques is their adaptability to stochastic systems and their ability to transfer knowledge, i.e., the information gained when learning policies for a specific load demand can be retrieved to learn an optimal schedule for other load profiles [11].
Taking advantage of these characteristics, several scholars have used this approach to solve the microgrid energy management problem. For instance, Brida et al. [20] used batch reinforcement learning to implement a microgrid EMS that optimizes battery schedules. The charge and discharge efficiency of the battery and the microgrid nonlinearity caused by inverter efficiency were considered. Elham et al. [21] presented a multi-agent RL method for adaptive control of energy management in a microgrid. The results indicate that the grid-tied microgrid learned to reduce its dependency on the utility grid significantly. The authors in [22] presented an optimal battery scheduling scheme for microgrid energy management. A Q-learning technique is implemented to reduce the overall power consumption from the utility in [22], and simulation results show that the algorithm reduces dependency on the main grid. However, this work fails to consider battery trading with the utility and the impact of such actions on the battery life cycle. In [23] Zeng et al. suggested an approximate dynamic programming (ADP) method to tackle microgrid energy management, considering the volatility of the demand, renewable energy availability, real-time grid tariffs, and power flow constraints. The authors in [24] explored the feasibility of applying RL to schedule energy in a grid-connected PV-battery electric vehicle (EV) charging station. From the results, the algorithm managed to successfully obtain a day-to-day energy schedule that decreases the transactive cost between the microgrid and the utility grid. The authors in [25,26] proposed a battery management strategy in microgrids using an RL technique; however, the battery wear cost was not incorporated in the EMS model. The work in [27] used RL to develop a real-time incentive-based demand response program; the RL algorithm focused on aiding the service provider to buy power from its subscribed customers to balance load demand and power supply and improve grid reliability. Lu et al. [28] leveraged RL to design a dynamic pricing demand response (DR) algorithm in a hierarchical electricity market. From the results, the algorithm is seen to successfully balance energy supply and demand and reduce energy costs for consumers. Nakabi and Toivanen [29] proposed a new microgrid architecture consisting of a wind generator, an energy storage system (ESS), a collection of thermostatically controlled and price-responsive loads, and a utility grid connection. The proposed EMS was modelled to coordinate the different energy sources. Different scenarios were investigated using various deep RL methods. The proposed A3C++ algorithm was established to have improved convergence and it also acquired superior control policies. In [30] a microgrid control problem focusing on energy trading with the utility is formulated. A deep Q-learning algorithm is used to learn the optimal decision-making policies. Simulation results on real data confirmed that the approach was effective and that it outperformed rule-based heuristic methods. Samadi et al. [31] proposed a multi-agent based decentralized energy management approach in a grid-connected microgrid. The different microgrid components were designed as autonomous agents that adopted a model-free RL approach to optimize their behavior. Simulation results confirmed that the proposed approach was efficacious. Shang et al. [32] proposed an EMS model aimed at minimizing the microgrid's operation cost, considering the nonconvex battery degradation cost. An RL method combined with Monte-Carlo tree search and knowledge rules is used to optimize the system. Although the simulation results show the efficacy of the proposed algorithm, a detailed model of battery degradation is not considered in [32].
In recent advances reported on the implementation of RL in microgrid energy management [20,21,22,23,25,26,27,28,29,30,31,33,34,35], the modelling of microgrid operational cost with consideration of battery degradation cost has not yet been thoroughly studied. Most studies only consider the generation cost and the power exchange cost. Estimating the degradation process is very difficult, and finding a simple and precise mathematical degradation model that can be used in the energy management algorithm is not easy. As the charging and discharging behaviors of a BESS have a direct impact on its life span, lifecycle degradation costs should be factored into the complex dispatch model of BESSs [36]. It is important to note that lithium-ion batteries are quite expensive, and incorporating a battery degradation model while computing the overall system cost is critical so that a realistic system cost estimate is established. Thus, this paper reports on the development of an EMS for a grid-tied solar PV-battery microgrid that considers battery degradation in the energy trading process, with a focus on reducing the strain on the battery. The aim of the designed EMS is to manage energy flows from and to the main grid by scheduling the battery such that the overall system cost (including the cost of power purchased from the utility and the battery wear cost) is reduced and the utilization of solar PV is maximized. The EMS problem is modelled as a Markov Decision Process (MDP) that fully specifies the state set, action set and reward function formulation. In addition, two case studies have been considered: in the first case, energy trading with the utility grid is permitted, whereas in the second case it is not. To minimize the operational costs, a Q-learning based algorithm is implemented to learn the control actions for the battery energy storage system (BESS) under a very complex environment (e.g., battery degradation, intermittent renewable energy supply and grid tariff uncertainty). Simulation results show that the agent learns to improve battery actions at every time step by experiencing the environment modelled as an MDP.
The key contributions of this work are outlined below:
Considering the technical constraints of the BESS, and the uncertainty of solar PV generation, load consumption, and grid tariff.
Developing an EMS architecture for a grid-tied solar PV-battery microgrid and formulating the control problem as an MDP considering the state, action, and reward function. The investigation of incorporating the microgrid's constraints such that no power is scheduled back to the utility is also presented.
Using an RL algorithm to learn the electrical resource and demand patterns such that system costs are reduced and an optimized battery schedule is achieved.
Simulation results verify that the proposed algorithms substantially reduce daily operating costs under typical load demand and PV (summer and winter) generation data sets.
The novelty of the paper lies in the design of an energy storage strategy that focuses on energy consumption optimization by maximizing the use of available PV energy and energy stored in the battery, instead of focusing solely on direct storage control. In this architecture, excess microgrid energy can be sold back to the utility to increase revenue; however, a non-trading algorithm scheme has also been studied, where constraining rules are embedded into the learning process to prevent excess energy from being sold back to the utility. In addition, a battery degradation model is incorporated to reduce strain on the battery during the (dis)charge operation.
The rest of the paper is structured as follows:
Section 2 presents the EMS problem formulation and introduces the two cost models considered, i.e., the grid transaction cost and the battery degradation cost.
Section 3 presents the MDP framework for the EMS problem formulation.
Section 4 explains the proposed Q-learning algorithm.
Section 5 presents the simulation setup.
Section 6 presents the results and evaluates the performance of the algorithms.
Section 7 recaps the paper's major points and introduces future work ideas.
3. Markov Decision Framework as Applied to EMS Formulation
The Markov decision framework, or MDP, is a mathematical framework used to model decision-making in situations where results are partly random and partly controllable, and it has been broadly adopted to map optimization problems solved through RL [41]. An MDP is defined as a four-tuple (S, A, T, R), where S and A are the state and action spaces, and T and R denote the state transition probability and the reward function, respectively. Since the state transitions are deterministic in this case, state transition modelling is not necessary [42] and only the state space, action space, and reward function are considered.
3.1. State and State Space Formulation
The information provided by the state is essential for energy management as it contains the information that the agent uses in the decision-making process at each time step t. The state space of the EMS at any given time is defined by the utility tariff (R/kWh), the BESS state of charge, the load demand (kW) and the PV generation (kW).
Let the state of charge of the battery at time step t be denoted as SoC_t. So as not to exceed the battery constraints, a guard ratio β ∈ [β_min, β_max] is considered [19], which restricts the usable battery range to a fraction of the energy capacity of the battery E_bat (kWh). At each time step, the state of charge of the battery is therefore constrained by SoC_min ≤ SoC_t ≤ SoC_max, where SoC_min and SoC_max represent the lower and upper bounds of the battery.
Considering the above battery safety limits, the state s_t at each time step t is
s_t = (t, P_pv(t), ρ(t), P_L(t), SoC_t)
where t is the time component denoting the hour of the day, P_pv(t) is the generation from the solar PV at time t, ρ(t) denotes the current electricity tariff at time t notified by the utility company, and P_L(t) is the instantaneous load demand. The state space S is enumerated as the union of all states within the optimization horizon, S = {s_0, s_1, …, s_(T−1)}. The intraday microgrid operation has been divided into T time steps, indexed as {0, 1, 2, …, T−1}, where T represents the optimization horizon under consideration.
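To make the state definition concrete, the short Python sketch below shows one way the hourly state tuple could be represented; the class and field names (MicrogridState, hour, pv, tariff, load, soc) are illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MicrogridState:
    """Hourly EMS state s_t = (t, PV, tariff, load, SoC)."""
    hour: int      # time component t (hour of the day, 0..T-1)
    pv: float      # solar PV generation at time t (kW)
    tariff: float  # utility electricity tariff at time t (R/kWh)
    load: float    # instantaneous load demand at time t (kW)
    soc: float     # battery state of charge (kWh), within [SoC_min, SoC_max]

def clip_soc(soc: float, soc_min: float, soc_max: float) -> float:
    """Keep the state of charge within the battery guard limits."""
    return max(soc_min, min(soc, soc_max))
```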
3.2. Action and Action Space Formulation
In order to meet the load demand at every time step t, the EMS of the microgrid first uses the available energy from the solar PV and the BESS, and then the remaining energy is purchased from the utility. The net load P_net(t) of the microgrid at each time step t is described as the total demand P_L(t) minus the energy generated by the solar PV P_pv(t), as shown below:
P_net(t) = max(P_L(t) − P_pv(t), 0)
Here, the max operator ensures that the net load is never negative. For instance, if the PV generation is larger than the load, the difference P_L(t) − P_pv(t) would be negative, which cannot be the case since the net load is non-negative. Taking zero (the maximum value at that time step) instead indicates that the load has been fully covered by the solar PV.
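As a minimal illustration (assuming a 1 h time step so that kW and kWh can be used interchangeably), the clipping can be written as:

```python
def net_load(load_kw: float, pv_kw: float) -> float:
    """Net load after PV self-consumption; clipped at zero so that surplus
    PV never shows up as negative demand."""
    return max(load_kw - pv_kw, 0.0)
```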
Since the total load demand P_L(t) and the PV generation P_pv(t) fluctuate stochastically in a real microgrid, the net demand of the microgrid, P_net(t), is an unknown variable. First, the EMS tries to satisfy the net demand P_net(t) through the energy stored in the BESS. Then, the remaining load demand that cannot be covered by the BESS is provided by the utility; it is described as the remainder energy, i.e., the portion of the net demand exceeding the energy that the BESS can supply without violating its lower bound. This remainder is the amount of energy that needs to be purchased from the utility at each time step. At each time step, after covering the load demand, the quantity of energy contained in the BESS, denoted as SoC_(t+1), is calculated as shown in (9).
Equation (9) computes the amount of energy remaining in the battery. The first part checks whether any solar power remains after supplying the load: if so, it is used to charge the battery; if not, zero is taken. Since the EMS is designed to first check whether there is any energy in the battery before purchasing from the utility, as shown in (8), the second part of the equation calculates the energy remaining in the battery after supplying the load, so that the accurate state of charge is available for the next time step.
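The dispatch order described above (PV first, then the BESS, then the utility) is sketched below. This is an illustrative reading of Equations (8) and (9) under stated assumptions (1 h time step, ideal efficiencies), not the paper's exact formulation.

```python
def dispatch_step(load_kw: float, pv_kw: float, soc_kwh: float,
                  soc_min: float, soc_max: float):
    """One EMS dispatch step: PV covers the load first, the BESS covers the
    residual demand, and anything left over is purchased from the utility."""
    net = max(load_kw - pv_kw, 0.0)              # residual demand after PV
    surplus_pv = max(pv_kw - load_kw, 0.0)       # PV left over after the load
    from_bess = min(net, soc_kwh - soc_min)      # energy the BESS can supply
    from_grid = net - from_bess                  # remainder bought from the grid
    # surplus PV charges the battery, respecting the upper guard limit
    soc_next = min(soc_kwh - from_bess + surplus_pv, soc_max)
    return from_grid, soc_next

# Example: 120 kW load, 80 kW PV, 150 kWh stored, limits [40, 360] kWh
print(dispatch_step(120.0, 80.0, 150.0, 40.0, 360.0))  # -> (0.0, 110.0)
```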
The agent can only dispatch the battery, i.e., manage its charging and discharging. To simplify the problem, the actions are discretized into a discharging/charging action category. The power unit Δp depicts the amount of power that is used to discharge/charge the battery in each discrete instant. The discrete action space is defined as
A = {−P_max, …, −Δp, 0, Δp, …, P_max}
where P_max and −P_max are the maximum amounts of charge and discharge power from the BESS in each time step, respectively, while 0 indicates that the battery is idle. The action selected by the agent at time step t is denoted as a_t ∈ A(s_t), where A(s_t) represents all the possible actions in the action space A under state s_t.
Given the action set at every time step t, the agent chooses one possible action a_t from A(s_t) by following a policy π that describes a decision-making strategy for the selection of actions. More details on π can be found in the next section.
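A possible construction of this discrete action set, with the power unit Δp and limit P_max passed in as parameters (the example values are placeholders consistent with the simulation setup later in the paper):

```python
def build_action_space(delta_p: float, p_max: float) -> list:
    """Discrete charge/discharge powers {-P_max, ..., -Δp, 0, Δp, ..., P_max}.
    Negative values discharge the battery, positive values charge it."""
    n = int(round(p_max / delta_p))
    return [k * delta_p for k in range(-n, n + 1)]

print(build_action_space(25, 150))
# [-150, -125, -100, -75, -50, -25, 0, 25, 50, 75, 100, 125, 150]
```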
Let the amount of power supplied to the battery when an action a_t is taken by the agent be denoted as BESS(a_t), where negative values indicate discharging of the battery and positive values indicate charging of the battery. The effect of the agent's action a_t on the battery depends on the status of the BESS, SoC_t.
It is presumed that if the action taken would increase the SoC past the maximum guard capacity SoC_max, only the chargeable amount of energy is used to charge the battery and the extra energy is discarded. Similarly, for a discharging action, only the energy available above SoC_min is discharged and the extra discharge energy is discarded; hence the battery constraints are never violated.
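A hedged sketch of this constraint handling, in which the requested (dis)charge power is clipped so that the resulting state of charge never leaves the guard limits (again assuming a 1 h time step):

```python
def apply_battery_action(action_kw: float, soc_kwh: float,
                         soc_min: float, soc_max: float):
    """Clip a charge (+) or discharge (-) action to what the battery can
    actually absorb or deliver, returning the realised power and the new SoC."""
    headroom = soc_max - soc_kwh       # maximum chargeable energy this step
    available = soc_kwh - soc_min      # maximum dischargeable energy this step
    if action_kw >= 0:
        realised = min(action_kw, headroom)
    else:
        realised = -min(-action_kw, available)
    return realised, soc_kwh + realised
```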
3.3. Reward Function Formulation
Reward is a scalar value used to express to the agent the goal of the learning process. Once the agent performs an action and moves to the next state, a reward is presented. Intelligent "reward engineering" is key, as it links the agent's actions to the objective of the algorithm [43]. The objective of the optimization process is to minimize the transaction cost of power purchased from the utility and to reduce the battery wear cost.
The reward of the proposed EMS is structured to evaluate the system management in terms of the objective function, together with two additional aspects suggested by [44] that are adopted to improve the agent's performance. The objective function factors in the amount of money incurred by purchasing energy from the main grid (the pay reward) and the battery degradation cost. To improve algorithm performance, a pre-charge benefit reward and an over-limit penalty have been incorporated: the former represents gains from pre-charged energy, and the latter is a penalty payment charged to the agent when it chooses an action that exceeds the limits of the battery.
The pay reward represents the cost incurred by trading power with the utility at each time step. The agent receives a negative reward if the amount of energy purchased from the grid is greater than the amount of energy sold; otherwise, the agent receives a positive pay reward, calculated as given in (12). In (12), the unmet-load term represents the total unmet load in the microgrid at each time step (kWh), while the tariff term denotes the instantaneous grid tariff (R/kWh); together they give the power being exchanged with the utility grid at each time step t.
In the non-trading mode of operation, the energy supplied to the load (when a discharge action is selected) at any time slot cannot be higher than the load demand. Equation (13) ensures that energy cannot be sold back to the utility. During training, if the learning agent tries to select actions that would cause power to be scheduled back to the utility, a small negative penalty is charged.
Next, the pre-charge benefit reward is computed as the amount of energy available in the battery to cover the net load demand P_net(t) from the energy stored in the BESS. This reward mainly encourages the agent to always ensure that the SoC of the battery can satisfy the net load at any time. When the current grid tariff increases, this benefit reward increases as well. In simple terms, the reward reflects the reduced payment that results from using the battery instead of purchasing power from the grid.
Then, the over-charge penalty, as shown in (14) below, is received by the agent at each time step for any extra energy supplied but not used in the charging/discharging of the battery due to the enforced constraints. As the grid tariff increases, this over-charge penalty becomes higher.
Finally, the cost of battery degradation is considered as a negative reward received by the agent, and it is calculated as shown in (4).
Let r(s_t, a_t) denote the cumulative reward that the agent receives when it takes an action a_t at state s_t. The total reward that the agent gets at each time step is given by (17); in the non-trading mode of operation, however, the non-trading penalty is also incorporated in Equation (17).
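A minimal sketch of how these reward components could be combined is given below. The term names follow the descriptions above, the degradation cost of Equation (4) is passed in as a precomputed value, and the equal, unweighted summation is an assumption made for illustration only.

```python
def total_reward(grid_energy_kwh: float, tariff: float,
                 pre_charge_benefit: float, over_charge_penalty: float,
                 degradation_cost: float, non_trading_penalty: float = 0.0) -> float:
    """Illustrative per-step reward: the grid transaction cost, the battery
    degradation cost and the over-charge penalty enter negatively, while the
    pre-charge benefit enters positively. In the non-trading mode an extra
    penalty is added for attempts to export power."""
    pay = -grid_energy_kwh * tariff   # cost of energy bought (negative reward)
    return (pay + pre_charge_benefit
            - over_charge_penalty - degradation_cost - non_trading_penalty)
```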
As an RL agent traverses the state space, it observes a state s_t, takes an action a_t, and moves to the next state s_(t+1). In order to compute the impact of an action taken by the agent on future rewards while following a certain policy π, the return G_t has to be computed. It is defined as the cumulative discounted reward at time slot t and calculated as
G_t = r_(t+1) + γ G_(t+1)    (18)
The first term in (18) is the immediate reward at time step t and the second term is the discounted reward from the next state s_(t+1). Here, γ ∈ [0,1] is the discount factor, which determines the weight given to future rewards by the agent, where a high value makes the agent more forward thinking. π is used to represent a stochastic policy that maps states to actions, π: S → A. The agent's goal is to find a policy π (battery schedules) that maximizes the long-term discounted rewards. An optimal policy π* is the MDP's solution, i.e., a policy that constantly selects actions that maximize the cumulative rewards over the T-hour horizon starting from the initial state s_0 [14]. To solve the MDP, several RL techniques can be applied. Model-based methods, such as dynamic programming (DP), assume that the dynamics of the MDP are known (i.e., all state transition probabilities). On the other hand, model-free techniques such as Q-learning learn directly from experience and do not assume any knowledge of the environment's dynamics. To obtain the solution of the MDP designed above, Q-learning has been adopted, and it is explained in detail below.
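For reference, the discounted return in (18) can be evaluated recursively from a sequence of per-step rewards, as in this short sketch:

```python
def discounted_return(rewards, gamma: float) -> float:
    """G_t = r_{t+1} + gamma * G_{t+1}, evaluated backwards over an episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 2.0, 3.0], gamma=0.9))  # 1 + 0.9*(2 + 0.9*3) = 5.23
```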
4. Q-Learning Algorithm for Energy Management Problem
Q-learning is the most widely used model-free RL algorithm, i.e., it can implicitly learn an optimal policy (a sequence of battery action selection strategies) by interacting with the environment without any prior knowledge of the environment (as opposed to model-based methods, where the agent has to learn the entire dynamics of the system and then plan to obtain the optimal policy) [14]. Q-learning involves finding the so-called Q-values, which are defined for all state-action pairs (s, a). The Q-value gives a measure of the goodness of selecting an action a in state s. Let Q_π(s_t, a_t) represent the state-action value function that computes the estimated total discounted rewards as calculated in (20), if an action a_t is executed at state s_t when a policy π is followed. It is described as
Q_π(s_t, a_t) = E_π[G_t | s_t, a_t]
where E_π indicates the expected action value for each state-action pair.
The Q-value that reflects the optimal policy is denoted as Q*(s, a). If all possible actions in each state s are selected and executed multiple times in the environment, and their Q-values are updated a sufficient number of times, then the Q-values eventually converge [16] and the optimal action in that state can be found by taking the action that maximizes the Q-values. The optimal Q-value is given by
Q*(s, a) = max_π Q_π(s, a)
and the optimal policy is acquired as (22) for each state s,
π*(s) = argmax_a Q*(s, a)    (22)
Equation (22) implies that the optimal action value in any state s is described as Q*(s, a*), where a* is the optimal action for state s, commonly known as the greedy action. During the learning process, the agent interacts directly with the dynamic environment by performing actions. Generally, the agent observes a state s_t as it occurs, with the possible action set A(s_t), and by use of an action selection technique it selects an action a_t; consequently, it moves to the next state s_(t+1) and receives an immediate reward r_(t+1). The Q-values are then updated based on the Bellman equation, as shown in (23),
Q_(k+1)(s_t, a_t) = Q_k(s_t, a_t) + α [r_(t+1) + γ max_a Q_k(s_(t+1), a) − Q_k(s_t, a_t)]    (23)
where α ∈ [0,1] denotes the learning rate, which determines the extent to which the new Q-value is modified, Q_k(s_t, a_t) is the current estimate of the Q-value, Q_(k+1)(s_t, a_t) represents the estimated Q-value in the next iteration, γ denotes the discounting factor and k is the specific iteration number. When α is sufficiently small, and all possible state-action pairs are visited enough times, Q_k eventually converges to the optimal value Q*, so that the best action will be selected at each state in the successive iterations [16]. When the agent reaches the terminal state, since there are no future rewards, the Q-value is updated as shown in (24) below:
Q_(k+1)(s_t, a_t) = Q_k(s_t, a_t) + α [r_(t+1) − Q_k(s_t, a_t)]    (24)
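A minimal sketch of the tabular updates (23) and (24), assuming states and actions have been mapped to integer indices:

```python
import numpy as np

def q_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
             alpha: float, gamma: float, terminal: bool) -> None:
    """One Q-learning update; the bootstrap term is dropped at the terminal state."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```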
As the agent chooses actions from the action set, it is always necessary to deal cleverly with the exploitation-versus-exploration dilemma [11,45]. Exploration helps the agent avoid getting stuck in a local optimum, whereas exploitation allows the agent to select the best actions in the later episodes. The epsilon-greedy (ε-greedy) method is adopted here because of its simplicity. Epsilon-greedy is a method of selecting actions with uniform distribution from an action space. Using this strategy, a random action (exploration) is selected from the action space A(s_t) with probability ε, and a greedy action (exploitation) is chosen with probability 1 − ε from the Q-values at the given state in each episode. An exponential decay function is also leveraged, so in each iteration the value of ε is modified as follows:
ε = ε_min + (ε_max − ε_min) exp(−λ n)
where ε_min and ε_max represent the minimum and maximum values of ε respectively, λ is the exponential decay rate and n denotes the iteration number.
It should be noted that epsilon (ε) varies from case to case depending on the system design. The idea is to allow the agent to explore all the actions in the initial episodes so as to learn; as learning proceeds, ε should gradually be decreased to enable the agent to choose greedy actions. A very small probability of taking a random action should still be left, as the current estimate may be wrong and a better action may exist. For practical problems, training typically starts with a large value of epsilon, i.e., ε = 1, which is lowered towards 0.01 or 0.001 so that the agent can exploit the best action in the final iterations.
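An ε-greedy selector with the exponential decay described above might look as follows; the decay-rate symbol and the default bounds are assumptions consistent with the text, not the paper's exact settings:

```python
import math
import random
import numpy as np

def epsilon_greedy(Q: np.ndarray, s: int, epsilon: float) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])   # explore
    return int(np.argmax(Q[s]))               # exploit

def decayed_epsilon(n: int, eps_min: float = 0.01, eps_max: float = 1.0,
                    decay_rate: float = 0.001) -> float:
    """Exponentially decay epsilon with the episode index n."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay_rate * n)
```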
Algorithm for Learning Energy Management
To tackle the MDP, a Q-table is first created and initialized with zeros. At the beginning of the learning, the hyperparameters (α, γ and ε) are initialized in lines 2-3 of the algorithm shown below. In line 5 the microgrid environment is initialized, and lines 6-12 form the loop over every time step Δt. In line 7 the algorithm reads the current state, and in line 8 an action a_t is selected according to the ε-greedy action selection policy. In line 9, the selected action is executed in the environment, which returns a reward r_(t+1) and the next state s_(t+1). Based on this return, the Q-value is updated in line 10 according to Equation (23), or by (24) if the state is terminal. In line 11, the time step t is incremented by one and the system moves to the next state. After the terminal state T−1, the next episode proceeds with an updated value of ε. The learning process then continues as shown in Algorithm 1 below.
Algorithm 1 EMS Algorithm Using Q-Learning
1. Create a Q-table and initialize it with zeros
2. Initialize the learning rate and discount factor (α and γ)
3. Initialize epsilon (ε)
4. For episode n = 1 to maxEpisode do
5.  Initialize the microgrid environment
6.  For time step t = 0 to T−1 do
7.   Read the current state s_t
8.   Select an action a_t from A(s_t) using the ε-greedy policy (5)
9.   Execute the selected action in the simulation environment and observe the reward r_(t+1) and the next state s_(t+1)
10.   Update the Q-values according to (23)
11.   t ← t + 1
12.  End
13.  Update ε
14.  n ← n + 1
15. End
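A compact Python rendering of Algorithm 1 is given below. It is a sketch only: env is assumed to be a hypothetical environment object exposing Gym-style reset() and step(action) methods and a horizon attribute, and epsilon_greedy, decayed_epsilon and q_update are the helper functions sketched earlier.

```python
import numpy as np

def train_ems(env, n_states: int, n_actions: int, max_episodes: int,
              alpha: float = 0.01, gamma: float = 0.85):
    """Tabular Q-learning loop for the EMS (illustrative rendering of Algorithm 1)."""
    Q = np.zeros((n_states, n_actions))            # line 1: Q-table of zeros
    for episode in range(max_episodes):            # line 4: episode loop
        epsilon = decayed_epsilon(episode)         # line 13: epsilon schedule
        s = env.reset()                            # line 5: initialise environment
        for t in range(env.horizon):               # line 6: time-step loop
            a = epsilon_greedy(Q, s, epsilon)      # line 8: epsilon-greedy action
            s_next, r, done = env.step(a)          # line 9: execute and observe
            q_update(Q, s, a, r, s_next,           # line 10: Q-value update
                     alpha, gamma, terminal=done)
            s = s_next                             # line 11: advance to next state
            if done:
                break
    return Q
```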
5. Simulation Setup
To evaluate the performance of the proposed energy management algorithm using Q-learning, this work considers a commercial-load grid-tied microgrid environment with solar PV and a BESS. Numerical simulations are performed based on a commercial building load profile adopted from [46]. Summer and winter solar PV output data (for November (summer) and June (winter)) from a 250 kWp solar PV system located in Cape Town, South Africa, adopted from [47], are used in the simulation. To facilitate the assessment of the optimized control strategy, the work considers an hourly time-of-use (ToU) tariff obtained from Eskom, a utility company operating in South Africa, which specifies three price levels applied based on the time of day during the summer and winter seasons. Peak prices are equivalent to 130.69 R/kWh, mid-peak prices to 90.19 R/kWh and off-peak prices to 57.49 R/kWh during summer, while during winter the peak tariff is 399.17 R/kWh, the mid-peak tariff is 121.46 R/kWh and the off-peak price is 66.27 R/kWh [48]. The forecasted time-series inputs to the algorithm, which include the commercial load demand and the solar PV generation, are shown in Figure 2a,b for the summer and winter seasons respectively. The peak load of the commercial consumption profile is noted to occur between 09:00 and 16:00, when most HVAC and other loads are switched on. For the BESS, two lithium-ion batteries are used, each with a capacity of 200 kWh. The initial SoC of the BESS is set to 0.25, and a guard ratio within its allowed range is applied, which sets the maximum and minimum energy limits of the BESS (in kWh). The initial battery cost is determined based on the current market price of a Li-ion battery, which is 135 USD/kWh (2025 R/kWh) [49]. The charge and discharge power unit Δp is set to 25 kW, and the charge power of the BESS is uniformly discretized into 6 levels. Thus, the discretized charging and discharging power of the battery is {−150, …, −50, −25, 0, 25, 50, …, 150} in kW, where 150 and −150 represent the maximum charging and discharging power, 0 indicates that the battery is idle, and the rest are values within the limits' interval. The maximum charge and discharge power are limited to 150 kW and −150 kW to ensure safe battery operation, while the charge and discharge power unit is set to 25 kW to give the agent more variables in the action space. For simplicity, the power inverter efficiencies for the solar PV and the battery are assumed to be 1. The algorithm is implemented in Python (version 3.7.6) and executed on a computer with a 1.60 GHz processor and 8 GB RAM.
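The action discretization and ToU price levels quoted above can be captured directly in code; the hour-to-period mapping of the Eskom tariff is not listed in this section, so only the price levels are encoded here.

```python
import numpy as np

DELTA_P = 25          # charge/discharge power unit (kW)
P_MAX = 150           # maximum (dis)charge power (kW)
ACTIONS_KW = np.arange(-P_MAX, P_MAX + DELTA_P, DELTA_P)
# array([-150, -125, ..., -25, 0, 25, ..., 150])

# Eskom ToU price levels used in the simulations (R/kWh)
TOU_PRICES = {
    "summer": {"peak": 130.69, "mid": 90.19, "off": 57.49},
    "winter": {"peak": 399.17, "mid": 121.46, "off": 66.27},
}
```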
It is critical to properly select the parameters to which the algorithm is highly sensitive, such as the learning rate and the discount factor, in order to achieve a suitable convergence speed and quality policies. If a large step size is selected, the Q-values can oscillate significantly, and if it is too small, the Q-values might take long to converge. The learning rate α was chosen by trial and error, and a value of 0.01 gave the best convergence. The ε-greedy parameter ε was initialized to 1 to ensure that the search space is explored as much as possible, and a discount factor γ of 0.85 (for the winter case) and 1 (for the summer case) is taken, as the future rewards are as important as the immediate rewards. The simulation input parameters for the EMS algorithm are listed in Table 1.
In order to evaluate the performance of the proposed grid-tied microgrid energy management system, two case studies are simulated on the basis of the data characteristics mentioned above. First, two different seasons are examined to assess the impact of PV penetration. Second, a comparison between including and excluding grid constraints at the interconnection point is performed with the aim of studying the impact on total operating costs. In the case of grid constraints (the non-trading algorithm), Equations (13) and (14) are included in the optimization model to ensure that the microgrid does not sell its surplus energy back to the utility grid, while for no grid constraints (the trading algorithm) they are removed.