Review

Reinforcement Learning Model-Based and Model-Free Paradigms for Optimal Control Problems in Power Systems: Comprehensive Review and Future Directions

1 The Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering, Technion—Israel Institute of Technology, Haifa 3200003, Israel
2 Department of Software Science, Tallinn University of Technology, Akadeemia tee 15a, 12618 Tallinn, Estonia
3 Advanced Energy Industries, Caesarea 38900, Israel
4 Computer Science Faculty of Electrical and Computer Engineering, Technion—Israel Institute of Technology, Haifa 3200003, Israel
* Author to whom correspondence should be addressed.
Energies 2024, 17(21), 5307; https://doi.org/10.3390/en17215307
Submission received: 30 September 2024 / Revised: 15 October 2024 / Accepted: 16 October 2024 / Published: 25 October 2024
(This article belongs to the Section F5: Artificial Intelligence and Smart Energy)

Abstract

This paper reviews recent works related to applications of reinforcement learning in power system optimal control problems. Based on an extensive analysis of works in the recent literature, we attempt to better understand the gap between reinforcement learning methods that rely on complete or incomplete information about the model dynamics and data-driven reinforcement learning approaches. More specifically, we ask how such models change based on the application or the algorithm, what the currently open theoretical and numerical challenges are in each of the leading applications, and which reinforcement learning-based control strategies will gain prominence in the coming years. The reviewed research works are divided into “model-based” methods and “model-free” methods in order to highlight the current developments and trends within each of these two groups. The optimal control problems reviewed include energy markets, grid stability and control, energy management in buildings, electrical vehicles, and energy storage.

1. Introduction

Nowadays, power systems and energy markets are experiencing rapid growth. As the population grows and economies develop, the demand for electricity rises. This necessitates larger and more intricate power systems to meet the growing demand for energy [1,2,3]. Moreover, as technological developments march forward, modern power systems incorporate a wider variety of energy sources, including renewables like solar and wind power, alongside traditional sources such as coal and natural gas. Managing this diverse mix requires complex infrastructure and control systems. The emergence of renewable energy sources complicates the grid topology in more than one way; it also stimulates the integration of energy storage devices such as batteries into power grids. While beneficial for balancing supply and demand and integrating renewables, this introduces new challenges related to managing and optimizing storage assets within the system. Some of these considerations are presented in [4,5]. Renewable energy sources induce additional complexity through the trend towards decentralization in power generation, with an emphasis on distributed energy resources (DERs) like rooftop solar panels and small-scale wind turbines. Integrating these decentralized sources into the grid complicates its behavior and dynamics, as discussed in [6,7,8].
On top of that, outdated infrastructure in many countries requires upgrades to improve reliability, efficiency, and resilience. This often involves the implementation of advanced technologies such as smart grids, which introduce additional complications, as examined in [9,10]. In light of these advances, there is an increase in cyber-security threats, which creates the need to deal with environments that are not only intricate and unpredictable but also sometimes even malicious. This necessitates using redundant components or the development of cyber-security protocols to ensure systems’ robustness and resilience. All of this contributes to system complexity [11,12,13]. Finally, regulatory frameworks governing the power sector are becoming stricter, requiring utilities to meet various standards related to environmental impact, reliability, and safety, such as those addressed in [14,15]. Compliance with these regulations often involves implementing complex technologies and processes.
Overall, the combination of technological advancements, changing energy landscapes, regulatory demands, and the need for greater resiliency is driving the increased complexity of power systems in the modern era. The preceding factors are enough to conclude that in today’s world, well-known and widely studied control problems in power systems, such as grid stability control or storage energy management, escalate into large-scale problems with extremely high dimensions. The computational burden and intricate dynamics of current power systems establish the need for more advanced and efficient algorithms to solve different control problems in this domain. To address these challenges, power experts are motivated to leverage various machine learning models that exhibit remarkable performance in a variety of different domains to aid in the assessment and control of these intricate systems. Specifically, one major area of interest is the field of reinforcement learning, which is mainly used for control problems that involve a continuous decision-making process. Several other works in the recent literature review different applications in power systems and present multiple reinforcement learning techniques for solving them. For example, work [16] addresses four power system operating states: normal, preventive, emergency, and restorative. They examine the following control levels: local, household, microgrid, subsystem, and wide area. Moreover, the following applications are reviewed: cyber-security, big data analysis, short-term load forecast, and composite load modeling. Another review paper [17] divides various works in the literature into seven areas of application. In each category, the authors examine several aspects such as the RL algorithms that are used, a comparison with white- and gray-box models where relevant, and reproducibility. Work [18] focuses on the various reinforcement learning approaches used for optimizing power flow and distribution. Finally, work [19] focuses on three key applications: frequency regulation, voltage control, and energy management, and presents different RL approaches to produce an optimal control policy. Further, they discuss critical issues in the domain, such as safety, robustness, scalability, and data, and point out several possible future research directions.
Our aim is to extend the work that has been performed so far and present an in-depth comparison and discussion regarding model-based and model-free reinforcement learning paradigms and the various considerations arising from them. Thus, continuing the line of thinking presented in previous works, this paper presents a comprehensive review of the state-of-the-art reinforcement learning methods used for optimal control problems in power systems. While there are review papers discussing this subject, we focus here specifically on the comparison between model-based and model-free paradigms, introducing recent challenges and trends. Hence, in this work, we systematically review the latest model-based and model-free reinforcement learning frameworks and their application to control problems in power systems. There is an inherent difference between the approaches: when following the model-based paradigm, agents learn the probability distribution from which the transition function and reward are generated. In contrast, in model-free configurations, the agent follows the optimal strategy, which maximizes the cumulative reward, without explicitly learning the mapping of the transition function and reward function. We strive to gain a deeper understanding of how well both of these methods perform in different environments and also aim to fundamentally understand what properties of the state and action space affect the learning process. Furthermore, we aim to assess the implications of deterministic and stochastic policy definitions. Finally, our most fundamental question is whether there are types of control problem settings in which the model-based method outperforms the model-free one, or vice versa. In this light, we also attempt to emphasize current trends, highlight intriguing theoretical and practical open challenges that arise in this domain, and suggest exciting future research directions. The main contributions of this paper are summarized as follows:
1.
We study the structure of the MDP formulation and the nature of the learned policy for different control problems in power systems in model-based and model-free paradigms, focusing on discrete or continuous state space and stochastic or deterministic policy.
2.
We identify correlations between reinforcement learning algorithms and control problems in power systems and examine whether these correlations change when following a model-based or model-free paradigm.
3.
We highlight current trends, such as the growing body of literature on the subject, the prominent RL algorithms under investigation, and the degree of data standardization, which reflects reproducibility in the recent literature.
4.
We discuss future research possibilities, including the incorporation of multi-agent reinforcement learning frameworks, safe reinforcement learning study, and the investigation of RL applications in emerging power system technologies.
This paper is organized as follows: Section 2 establishes a framework of fundamental reinforcement learning terminology, explains how it is used in power systems, and presents an overview of core algorithms used for various control applications. In addition, the terms “model-free” and “model-based” are defined. Section 3 is concerned with the model-based paradigm of reinforcement learning applied to various control problems in power systems and reviews recent work in five main control applications: energy market management, power grid stability and control, building energy management, electrical vehicle control problems, and energy storage control problems. The section ends with a discussion of notable trends. Section 4 is structured similarly to the previous one but focuses on the model-free paradigm in reinforcement learning. A comparison between the two paradigms is then outlined in Section 5. Next, Section 6 lists the latest open challenges that are yet to be solved for leveraging reinforcement learning approaches to address optimal control in power systems, highlighting specific considerations for the model-based and model-free paradigms, and suggests future research ideas. Subsequently, Section 7 concludes this article.

2. Technical Background on Reinforcement Learning

2.1. Markov Decision Processes

Reinforcement learning (RL) provides a framework for analyzing sequential decision-making problems, where the focus is on learning from interaction to achieve a goal [20,21,22,23]. In this context, the entity that learns and performs the decision making is called an “agent”, while the element it interacts with is called the “environment”. This process is depicted in Figure 1.
This type of task is assumed to satisfy Markovian properties, and thus in most scenarios, the reinforcement learning problem is formulated as a Markov Decision Process (MDP). To formally define a reinforcement learning problem as an MDP, we must specify each element of the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$, where $\mathcal{S}$ defines the state space of the environment, $\mathcal{A}$ induces the action space of the agent, $\mathcal{R}$ is the set of all possible numeric rewards the agent may receive, and $\mathcal{P}$ represents the probability distribution of the states; that is, given a state $s$ and action $a$, the probability of transitioning to a new state $s'$ and receiving a reward $r$ is given by
$$P(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}.$$
The flow of the process is as follows: The agent continually interacts with the environment by choosing actions, while the environment responds to those actions by presenting the agent with new situations and by giving it rewards. More specifically, at each discrete time step $t = 0, 1, 2, \ldots$, the agent receives some representation of the environment's state $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of possible states, and relying on that information, the agent chooses an action $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$. In the next time step, in part influenced by its actions, the agent receives a numerical reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and observes a new state of the environment $S_{t+1}$.
At each time step, the agent relies on a mapping from states to probabilities, which is called the agent’s policy, to choose an action to perform. This mapping is denoted by $\pi$, where $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$. Reinforcement learning methods outline how an agent adapts its policy based on its experiences, as the agent’s primary objective is to maximize the cumulative reward it receives over time.
To demonstrate how reinforcement learning is adopted in a power system’s optimal control framework, let us examine a system consisting of a single controllable generator, a single storage device, and a load, as depicted in Figure 2. In such a system it is necessary to decide at every moment in time how much energy should be generated and how much energy should be stored. The best combination is found by solving an optimization problem in which the objective is to minimize the overall cost. To simplify the analysis, it is assumed that the load profile can be estimated with reasonable accuracy, meaning that the power $p_L(\cdot) \in \mathbb{R}_{\geq 0}$ consumed over the time interval $[0, T]$ is known. The generator has an output power $p_g(\cdot) \in \mathbb{R}_{\geq 0}$ that can be controlled and is characterized by a cost function $F(p_g)$. It is assumed that $p_g(t) = p_s(t) + p_L(t)$, where $p_s(\cdot)$ is the power that flows into the storage. Now, formulating this problem as an MDP, we represent the generator’s controller as the agent and the storage as the environment. The agent is defined by its action space $\mathcal{A}$, which represents all the possible values that the generator can produce. The environment is represented by its state space and rewards, where the state space denotes all the possible states of charge of the storage, while the instantaneous rewards are $R_t = -F(p_g(t))$. The agent’s objective is to find a policy $\pi$, which determines what action to take at each state, meaning how much power to generate according to the storage’s state of charge, to maximize the cumulative reward $\sum_{t=0}^{T} R_t = -\sum_{t=0}^{T} F(p_g(t))$.
Consider a numerical example, where $F(p_g) = p_g^2/2$ and the storage capacity is $e_{\max} = 10\,[\mathrm{J}]$. The current state is $S_t = (e_s[t] = 5\,[\mathrm{J}],\, p_L[t] = 3\,[\mathrm{W}])$, where $e_s[t]$ is the state of charge of the storage at time step $t$, and $p_L[t]$ is the power demand of the load at time step $t$; the previous reward was $R_t = -10$. The agent decides to produce $p_g(t) = 6\,[\mathrm{W}]$ for a duration of $1\,[\mathrm{s}]$, where $1\,[\mathrm{s}]$ is the time resolution we sample at. As a result, assuming a deterministic setting, the load demand is satisfied, and the rest flows into the storage, meaning the new energy value that is stored is $e_s[t+1] = 8\,[\mathrm{J}]$, and the resulting next state is $S_{t+1} = (e_s[t+1] = 8\,[\mathrm{J}],\, p_L[t+1] = 4\,[\mathrm{W}])$, where $p_L[t+1] = 4\,[\mathrm{W}]$ is the load power demand at the next time step. The resulting reward is calculated as $R_{t+1} = -0.5 \cdot (6)^2 = -18$. The sequential decision-making process is described in Figure 3. Table 1 presents some additional examples of optimal control problems in power systems and their corresponding MDP formulations.
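To make this example concrete, the following minimal Python sketch implements one deterministic transition of the storage MDP described above. The function name, the clamping of the state of charge to the capacity limits, and the use of the negative generation cost as the reward are illustrative assumptions consistent with the example, not an implementation taken from any reviewed work.

# Minimal sketch of one step of the generator/storage/load MDP described above.
def step(e_s, p_load, p_gen, e_max=10.0, dt=1.0):
    """e_s: state of charge [J]; p_load: load demand [W]; p_gen: chosen generation [W]."""
    cost = 0.5 * p_gen ** 2                               # generation cost F(p_g) = p_g^2 / 2
    p_storage = p_gen - p_load                            # surplus power flows into the storage
    e_next = min(max(e_s + p_storage * dt, 0.0), e_max)   # clamp to the storage capacity
    return e_next, -cost                                  # reward is the negative generation cost

# Reproduces the numerical example: 5 J stored, 3 W load, the agent produces 6 W for 1 s.
e_next, reward = step(e_s=5.0, p_load=3.0, p_gen=6.0)     # -> (8.0, -18.0)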

2.2. Model-Based and Model-Free Reinforcement Learning

In reinforcement learning, the key distinction between model-based and model-free paradigms lies in how they learn and plan. Model-based methods involve learning a model of the environment’s dynamics, including transition probabilities and rewards, which is then used for planning and decision making. In contrast, model-free approaches directly learn a policy or value function from experience without explicitly modeling the environment. They rely solely on observed interactions with the environment to optimize the policy without requiring knowledge of its underlying dynamics. Nonetheless, these two paradigms are not contradictory approaches; rather, they form the two ends of a full spectrum of solutions, with each algorithm on the spectrum combining different traits from each of them. For more perspectives on the topic, one may refer to [24,25,26]. While model-based methods may potentially leverage a learned model for more efficient planning, model-free methods often offer greater simplicity and flexibility, especially in complex or uncertain environments where it is hard to learn an accurate model. Figure 4 presents the inherent difference between both methods. Furthermore, a graphical summary of the different approaches to each paradigm is given in Figure 5.

2.3. Model Computation

In this section, we present analytical algorithms from control theory that may be used to solve the model of the environment, after which planning approaches are used to find the optimal policy.

2.3.1. Dynamic Programming

Dynamic programming is an optimization technique that tackles complex problems by breaking them down into smaller, manageable subproblems through a multi-stage decision process [27,28]. Unlike many other optimization methods, dynamic programming algorithms explore all possible solutions to find the globally optimal one. Given the impracticality of directly scanning the entire solution space, these algorithms solve the problem step by step using a recursive formula. They are versatile, applicable to both linear and nonlinear objective functions and constraints, whether convex or nonconvex. If a globally optimal solution exists, dynamic programming guarantees convergence. However, dynamic programming is constrained by several factors. It relies on a recursive formulation of the cost function, necessitating knowledge of all past and future signals, which can be unrealistic in practice. Moreover, it suffers from the “curse of dimensionality” [29]. Specifically, in power system control problems like those involving energy storage, the method’s complexity increases linearly with the number of time samples but exponentially with the number of storage devices and the number of state variables describing each device. To demonstrate a basic dynamic programming solution, recall the energy balancing problem in Figure 3. We base our example on [30]. The arising optimization problem is
$$
\begin{aligned}
\underset{\{u(k)\}}{\text{minimize}} \quad & \Delta \sum_{k=1}^{N} f\big(u(k) + d(k)\big), \\
\text{s.t.} \quad & u(k) = \frac{e(k) - e(k-1)}{\Delta \cdot \eta\big(e(k)\big)}, \quad \text{for } k = 1, \ldots, N, \\
& 0 \leq e(k) \leq e_{\max}, \quad \text{for } k = 1, \ldots, N, \\
& e(0) = 0.
\end{aligned}
$$
Consequently, the optimal powers in this dispatch problem are given by a recursive formula:
$$u(k)^* = \frac{e(k)^* - e(k-1)^*}{\Delta \cdot \eta\big(e(k)^*\big)}$$
for $k = 1, \ldots, N$. The initial condition is
$$u(0) = \frac{e(0)}{\Delta \cdot \eta\big(e(0)\big)}.$$
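For illustration, a tabular dynamic programming solution of this dispatch problem can be sketched by discretizing the state of charge and performing a backward recursion over the stages. The grid resolution, horizon, flat demand profile, quadratic cost, and ideal efficiency below are placeholder assumptions and are not taken from [30].

import numpy as np

N, delta, e_max = 24, 1.0, 10.0            # horizon, time step, storage capacity (assumed)
e_grid = np.linspace(0.0, e_max, 51)       # discretized states of charge
d = np.ones(N)                             # assumed flat demand profile d(k)
f = lambda p: 0.5 * p ** 2                 # assumed quadratic generation cost f(.)
eta = lambda e: 1.0                        # assumed ideal storage efficiency

J = np.zeros((N + 1, len(e_grid)))         # J[k, i]: optimal cost-to-go from stage k, state e_grid[i]
best = np.zeros((N, len(e_grid)), dtype=int)

for k in range(N - 1, -1, -1):             # backward recursion over the stages
    for i, e_prev in enumerate(e_grid):
        u = (e_grid - e_prev) / (delta * eta(e_grid))     # storage power for every candidate next state
        total = delta * f(u + d[k]) + J[k + 1]            # stage cost plus cost-to-go
        best[k, i] = int(np.argmin(total))
        J[k, i] = total[best[k, i]]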

2.3.2. Model Predictive Control (MPC)

Model predictive control (MPC) is a control strategy used in engineering and control theory to optimize the performance of dynamic systems subject to constraints. It involves repeatedly solving an optimization problem over a finite time horizon using a model of the system dynamics to predict future behavior and adjust control inputs accordingly. By iteratively updating the control actions, MPC aims to minimize a cost function while satisfying system constraints, thus achieving desired performance objectives, as explained in [31,32,33]. This approach is adequate for real-time applications and utilizes continuously updated information. This method is particularly well suited for energy storage system management and is also being used extensively for optimal control problems that involve thermal energy systems, where the decisions made in each time step are highly correlated due to inherent time coupling caused by state-of-charge dynamics and high inertia, as stated in [34].
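As a rough illustration of the receding-horizon idea, the sketch below formulates a small storage dispatch problem with the cvxpy modeling library, solves it over a finite horizon, and applies only the first control action. The horizon length, quadratic generation cost, and power/energy limits are assumed for illustration and are not tied to any specific work reviewed here.

import cvxpy as cp
import numpy as np

H, dt, e_max, p_max = 12, 1.0, 10.0, 5.0       # horizon, time step, capacity, power limit (assumed)

def mpc_step(e0, load_forecast):
    """Solve the finite-horizon problem and return only the first control action."""
    p_s = cp.Variable(H)                        # storage charging power over the horizon
    e = cp.Variable(H + 1)                      # state-of-charge trajectory
    p_g = p_s + load_forecast                   # generator power balances storage and load
    cost = cp.sum(0.5 * cp.square(p_g))         # assumed quadratic generation cost
    constraints = [e[0] == e0,
                   e[1:] == e[:-1] + dt * p_s,  # state-of-charge dynamics
                   e >= 0, e <= e_max,
                   cp.abs(p_s) <= p_max]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return p_s.value[0]

# At each time step the problem is re-solved with the newly measured state of charge
# and an updated load forecast, which is what gives MPC its closed-loop character.
action = mpc_step(e0=5.0, load_forecast=np.full(H, 3.0))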

2.4. Policy Learning Basic Concepts

In the following section, we discuss the two approaches of the model-free learning paradigm. First, we present the foundational concepts behind many of the most common model-free algorithms that are based on policy iteration, such as Monte Carlo, Temporal Difference learning, and Q-learning. The main ideas are encapsulated in the “Value Function”, “Policy Iteration”, and “Value Iteration”; for a deeper discussion, one may refer to [20]. Finally, we present a model-free algorithm that relies on direct policy optimization: the basic “Policy Gradient” algorithm.

2.4.1. Value Function

Almost all reinforcement learning algorithms involve estimating a value function of the state, representing the potential benefit of being in that state in terms of the future rewards that can be expected or, more precisely, the expected return. The rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies. The value of a state $s$ under a policy $\pi$, denoted by $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi(s)$ formally as
$$v_\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right],$$
where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$. We call the function $v_\pi(\cdot)$ the state-value function for policy $\pi$. For all $s \in \mathcal{S}$, the state-value function, assuming policy $\pi$ is being followed, may be calculated recursively:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big],$$
where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$. Under the assumption that either (1) $\gamma < 1$ or (2) the final state is accessible from all other states under policy $\pi$, the existence and uniqueness of $v_\pi$ are guaranteed.
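Because this recursion is linear in $v_\pi$ for a finite MDP with known dynamics, the state-value function can also be obtained exactly by solving the linear system $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$. The small randomly generated tabular MDP in the sketch below is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.95                       # small illustrative MDP
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s']: transition probabilities
R = rng.standard_normal((S, A))                # R[s, a]: expected immediate rewards
pi = np.full((S, A), 1.0 / A)                  # uniform random policy

P_pi = np.einsum('sa,sat->st', pi, P)          # state-to-state transition matrix under pi
r_pi = np.einsum('sa,sa->s', pi, R)            # expected one-step reward under pi
v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact state-value function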

2.4.2. Policy Iteration

First, we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$. This process is called policy evaluation. The initial approximate value function, $v_0$, is chosen arbitrarily, and each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:
$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big]$$
for all $s \in \mathcal{S}$. Indeed, $v_k = v_\pi$ is a fixed point of this update rule. It can be shown that the sequence $\{v_k\}$ converges to $v_\pi$ as $k \to \infty$. This process is called iterative policy evaluation. We compute the value function for a policy to help discover improved policies. Let us denote by $v_\pi$ the value function for an arbitrary policy $\pi$ and assume $\pi$ is deterministic. The objective is to determine for some state $s$ whether we should change the policy to choose an action $a \neq \pi(s)$. We have an estimate of how good it is to follow the current policy from $s$, that is, $v_\pi(s)$, but perhaps there is a better one. To answer this question, we can consider selecting an action $a$ in $s$ and following the existing policy $\pi$ thereafter. The value of this behavior is given by the q-value function:
$$q_\pi(s, a) = \mathbb{E}_\pi\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big].$$
If this value is greater than $v_\pi(s)$, meaning it is better to select $a$ once in $s$ and follow $\pi$ thereafter than to follow $\pi$ all the time, one would expect that it is better to select $a$ every time $s$ is encountered, which introduces a new policy $\pi'$. Fundamentally, if for all $s \in \mathcal{S}$
$$q_\pi\big(s, \pi'(s)\big) \geq v_\pi(s),$$
then it may be inferred that policy $\pi'$ is at least as good as policy $\pi$, meaning it must obtain at least the same expected return from all states $s \in \mathcal{S}$:
$$v_{\pi'}(s) \geq v_\pi(s).$$
This result is known as the policy improvement theorem.
Repeating this procedure of alternating policy evaluation and policy improvement constitutes the “policy iteration” algorithm, which is presented in Figure 6.
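A compact tabular sketch of this loop is given below; it reuses tabular arrays P (shape S×A×S) and R (shape S×A) as in the previous sketch, alternating iterative policy evaluation with greedy improvement until the policy is stable.

import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                      # deterministic policy: state -> action
    while True:
        v = np.zeros(S)                              # policy evaluation by repeated sweeps
        while True:
            q = R + gamma * P @ v                    # q[s, a] under the current value estimate
            v_new = q[np.arange(S), pi]
            delta, v = np.max(np.abs(v_new - v)), v_new
            if delta < tol:
                break
        pi_new = np.argmax(R + gamma * P @ v, axis=1)   # greedy policy improvement
        if np.array_equal(pi_new, pi):               # stop when the policy no longer changes
            return pi, v
        pi = pi_new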

2.4.3. Value Iteration

The major drawback of policy iteration is that the policy evaluation stage may itself be a lengthy iterative process that scans many states. Another formulation for the policy evaluation stage is given by
$$v_{k+1}(s) = \max_{a} \mathbb{E}\big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\big] = \max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big],$$
for all $s \in \mathcal{S}$. Figure 7 presents the interaction between the evaluation and improvement procedures. Here, each process drives either the policy or the value function closer to one of the lines, which represents a solution to one of the two goals.
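For comparison, the corresponding value iteration sketch folds the evaluation sweep and the maximization into a single update, again assuming the same tabular P, R, and discount factor as in the previous sketches.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    S, A, _ = P.shape
    v = np.zeros(S)
    while True:
        q = R + gamma * P @ v                  # expected return of every (s, a) pair
        v_new = q.max(axis=1)                  # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return q.argmax(axis=1), v_new     # greedy policy and its value function
        v = v_new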

2.4.4. Policy Gradient

In reinforcement learning, policy gradient methods are algorithms designed to directly optimize the policy, represented as $\pi_\theta(a \mid s)$, which defines a probability distribution over actions $a$ given state $s$, as elaborated in [35]. The main goal of these methods is to maximize the expected return, $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, which is the cumulative reward an agent accumulates over time, where $R(\tau)$ represents the finite-horizon undiscounted return of a trajectory $\tau$. To achieve this, policy gradient methods use gradient ascent to adjust the policy parameters $\theta$ in a way that gradually increases the expected return. By iteratively following the gradient of the expected return, the policy parameters are refined to enhance the agent’s performance. These methods are particularly effective for environments characterized by continuous action spaces and also for learning stochastic policies, which are crucial for managing uncertainty and exploration.
Directly optimizing the policy often results in more stable and efficient learning, making policy gradient methods powerful tools for tackling complex reinforcement learning challenges. In this context, there are two approaches, the “policy-based” and “value-based” approaches. Policy-based methods are a class of algorithms that directly optimize the policy π θ ( a | s ) , where θ represents the parameters of the policy, s is the state, and a is the action. These methods differ from value-based methods, which focus on learning a value function and deriving the policy from it. Policy-based methods aim to directly find the optimal policy by maximizing the expected return and are the most commonly used approaches.
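A minimal REINFORCE-style sketch of this idea, using a linear softmax policy over a discrete action set, is shown below. The gym-like reset()/step() interface of env, the (state, reward, done) return signature, and the learning rates are assumptions made purely for illustration.

import numpy as np

def softmax_policy(theta, s):
    logits = theta @ s                              # one logit per discrete action
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce(env, theta, episodes=100, alpha=0.01, gamma=0.99):
    for _ in range(episodes):
        s, trajectory, done = env.reset(), [], False
        while not done:                              # roll out one episode with the current policy
            p = softmax_policy(theta, s)
            a = np.random.choice(len(p), p=p)
            s_next, r, done = env.step(a)            # assumed (state, reward, done) interface
            trajectory.append((s, a, r))
            s = s_next
        G = 0.0
        for s, a, r in reversed(trajectory):         # compute returns backwards through the episode
            G = r + gamma * G
            p = softmax_policy(theta, s)
            grad_log_pi = -np.outer(p, s)            # gradient of log pi(a|s) with respect to theta
            grad_log_pi[a] += s
            theta = theta + alpha * G * grad_log_pi  # gradient ascent on the expected return
    return theta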

2.5. Multi-Agent Reinforcement Learning

Multi-Agent Reinforcement Learning (MARL) is an extension of traditional reinforcement learning that involves multiple interacting agents learning simultaneously within a shared environment. Each agent in a MARL setting can have distinct goals and control different parts of the system, making the learning process significantly more complex compared to single-agent RL. This makes MARL particularly relevant to distributed or recurrent systems, where the behavior of one component affects the overall dynamics, and agents must coordinate or compete to achieve their objectives. In MARL, the interactions between agents introduce elements of time dependency, competition, and cooperation, which closely resemble the dynamics seen in distributed systems. In standard distributed systems, synchronization and coordination are typically designed into the system architecture to ensure correct functioning. However, in MARL, agents are autonomous and can make independent decisions based on their own local, and sometimes partial, observations and objectives. Therefore, the coordination and synchronization mechanisms must be developed through learning, allowing agents to establish strategies that align with global or local objectives dynamically during runtime [36]. This is particularly challenging in systems where actions have long-term dependencies and feedback effects. The mathematical formulation resembles that of the single-agent setting. In a MARL setting, the “Markov game” definition extends the MDP formulation and introduces a tensor of state–action spaces instead of a single one. The setting may differ greatly, depending on the application. In some scenarios, all the agents will receive the same observation of the environment, while in other scenarios, each agent will receive its own partial observation. Mixed scenarios may also occur. Moreover, the actions may be performed in a synchronous or asynchronous manner, introducing additional complexity when they are dependent. The learned joint optimal policy $\pi^*$ is characterized by an equilibrium, often referred to as the Nash equilibrium point, from which none of the agents has an incentive to deviate [23]. A diagram presenting the idea may be seen in Figure 8. As power grids evolve into more decentralized and complex structures, such as smart grids with distributed energy resources, MARL can model the interactions between multiple control agents, such as distributed generators, storage systems, and flexible consumers, each with its own objectives. For instance, in demand response programs, different agents (e.g., residential consumers and industrial participants) may independently decide how to adjust their energy consumption based on local incentives, leading to a dynamic economic game. MARL can help these agents learn to coordinate, compete, or interact in a mixed way in real time to achieve objectives such as minimizing peak load or maximizing renewable integration.
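As a simple illustration of the independent-learner end of this spectrum, the sketch below gives each agent its own tabular Q-function over its local observation. The joint-step environment interface in the commented training outline is an assumption made for illustration and deliberately sidesteps the coordination mechanisms discussed above.

import numpy as np

class IndependentQAgent:
    """Tabular Q-learning agent acting on its own local observation."""
    def __init__(self, n_obs, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_obs, n_actions))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, obs):
        if np.random.rand() < self.eps:              # epsilon-greedy exploration
            return np.random.randint(self.Q.shape[1])
        return int(self.Q[obs].argmax())

    def update(self, obs, action, reward, next_obs):
        target = reward + self.gamma * self.Q[next_obs].max()
        self.Q[obs, action] += self.alpha * (target - self.Q[obs, action])

# Outline of one training step with an assumed joint-step environment interface:
# agents = [IndependentQAgent(n_obs, n_actions) for _ in range(n_agents)]
# joint_action = [agent.act(o) for agent, o in zip(agents, observations)]
# next_observations, rewards = env.step(joint_action)
# for agent, o, a, r, o2 in zip(agents, observations, joint_action, rewards, next_observations):
#     agent.update(o, a, r, o2)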

2.6. Safe Reinforcement Learning

Safe reinforcement learning focuses on ensuring that learning agents operate within predefined safety constraints, even during the exploration phase. This is particularly critical in power system applications, where unsafe actions can lead to severe consequences, such as equipment damage or large-scale blackouts. Traditional RL methods often involve exploratory behaviors that may result in unsafe actions in the early stages of learning, which is unacceptable for power systems that require high reliability and strict adherence to operational constraints. Safe RL aims to incorporate safety considerations into the learning process by using techniques such as reward shaping, constrained optimization, or external safety layers that monitor and correct potentially hazardous actions. One common approach in Safe RL is the use of Constrained Markov Decision Processes (CMDPs), where the agent not only maximizes the expected cumulative reward but also satisfies a set of safety constraints. Mathematically, a CMDP is defined similarly to an MDP, only with the addition of a cost function that ensures that safety constraints are met along the learning process. The objective is to optimize the cumulative expected reward while keeping the cumulative cost below a certain threshold. In power system contexts, these constraints could represent voltage stability limits, thermal capacity constraints on transmission lines, or maximum ramp rates for generators. Techniques such as Lagrangian methods [37] or constrained policy optimization [38] are employed to balance the trade-off between performance and safety. Additionally, model-based Safe RL methods leverage predictive models of the power system to simulate and validate the safety of new policies before deploying them in real-world operations. To demonstrate the importance of this trait, examine the following work [39]. The paper focuses on a networked microgrid system consisting of individual microgrids. The main objective is to reduce economic costs while accounting for possible losses due to storage efficiency and ensuring that safety constraints, such as maximum power limits, are met even during training time. The authors suggest a data-driven method based on a multi-agent reinforcement learning framework which utilizes a soft actor–critic algorithm, and they validate this method by providing several numerical results.
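For reference, the CMDP objective described above can be written in its standard discounted form as
$$\max_{\pi} \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_t\right] \quad \text{s.t.} \quad \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t C_t\right] \leq d,$$
where $C_t$ is the safety cost incurred at time $t$ and $d$ is the allowed safety budget. Lagrangian methods [37] then optimize $L(\pi, \lambda) = \mathbb{E}_\pi\big[\sum_t \gamma^t R_t\big] - \lambda\big(\mathbb{E}_\pi\big[\sum_t \gamma^t C_t\big] - d\big)$, alternating policy updates with an ascent step on the multiplier $\lambda \geq 0$.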

3. Model-Based Paradigm

Model-based reinforcement learning methods can be adapted to address uncertainties in power system dynamics by incorporating probabilistic models or robust optimization techniques. One common approach is to use probabilistic transition models, where the agent learns not just a single deterministic model of the environment but a distribution over possible transition models. This involves estimating the uncertainty in the system’s state transitions and rewards through methods such as Bayesian neural networks [40], Gaussian processes [41], or ensemble models [42]. By sampling from these probabilistic models during planning, the agent can make decisions that account for variability in the system’s response, which depends on the specific knowledge about the model dynamics. This enables more robust control policies. This is particularly relevant for power systems, where stochastic factors such as fluctuating renewable generation, load variability, and network contingencies can significantly impact operational outcomes. Additionally, model-based RL can integrate robust control frameworks to handle worst-case scenarios arising from uncertain dynamics. Techniques like robust dynamic programming or distributionally robust optimization can be employed, where the agent explicitly optimizes its policy to perform well under a range of possible system realizations rather than assuming a fixed model. For example, power systems may use robust RL to manage uncertainties in energy storage dynamics or generation forecasting, ensuring that the control policy is effective even when the real system deviates from the learned model [19]. By leveraging these adaptations, model-based RL may potentially provide more reliable control strategies in the face of uncertainties. When following the model-based paradigm, the domain knowledge allows us to reduce the dimensions of the state and action spaces by removing unreachable states or invalid actions; this process transforms the model from a “full-order” model into a “reduced-order” model. Choosing between a full-order and a reduced-order model involves a trade-off between accuracy and complexity. Full-order models may better capture the detailed dynamics and nonlinearities of the power system, potentially making them more accurate and robust to varying conditions. This level of detail may help RL agents develop policies that are more informed and capable of handling control tasks, such as voltage stability or oscillation damping. However, these models can be computationally intensive, which can limit their application in real-time scenarios. The high dimensionality and complexity of the models also make training RL agents time-consuming, and large state–action spaces can exacerbate issues like the curse of dimensionality, leading to slower convergence or suboptimal policies. In contrast, reduced-order models offer a simplified representation of the power system dynamics, aiming to focus only on key states and modes of operation. This lower complexity makes them computationally efficient, enabling faster training and making them better suited to real-time implementation. Such models may perform better in scenarios where quick decision making is crucial, such as in demand response or frequency regulation. However, the trade-off is a loss in accuracy and robustness since the reduced-order models may not capture all the critical dynamics, especially under transient conditions or unforeseen disturbances.
This can result in policies that perform well under nominal conditions but fail when the system operates near its stability boundaries or when rare events occur. Therefore, while reduced-order models seem to be beneficial for rapid prototyping and real-time applications, care must be taken to ensure that the simplified representation does not compromise the system’s stability and reliability. For extended discussion, refer to [43,44]. We currently see two main approaches for solving large-scale computational problems in power systems and energy market problems. One approach includes classic optimal control methods: dynamic programming, linear–quadratic regulators, model predictive controllers, Pontryagin’s minimum principle, etc. These approaches operate very well when the dynamic model and physics of the problem are well-known and when the involved signals are either given or can be characterized statistically. However, these classic approaches cannot cope efficiently with changing or uncertain dynamic models. As an example, consider the simple problem of managing an energy storage device within a power system. If the total load signal is unknown in advance and cannot be characterized by a simple statistical model, solving this problem using classical methods becomes extremely challenging [45]. A complementary approach is one of data-driven approaches, such as “model-free” reinforcement learning. These approaches excel in processing large datasets, identifying patterns, and making predictions that help optimize system performance without explicit knowledge about the system dynamics. Therefore, at their core, they are designed to cope with uncertainty, and they provide excellent results in a series of power system problems in which some elements are unknown—for example, non-intrusive load monitoring (NILM) problems [46], market design problems [47], and many more. However, these approaches too have considerable limitations; first and foremost, they often require very large datasets for training purposes, which are often not available. While algorithms and techniques to decrease the dimensions of the data are available (for instance, PCA, LDA, or autoencoders), this may not be enough in power system problems, since the dimension of the data is often very large. In addition, data-driven approaches can naturally learn only the patterns present in the available dataset and thus may often misinterpret the correct system structure and underlying physics, rules, or dynamic behavior. As an example, consider the simple problem of managing an energy storage device, which is connected to a single generator and a single load. When the load profile is known or can be characterized statistically, the resulting optimal control problem has many efficient solutions, for instance, the “shortest path” method [48] or the classic stochastic dynamic programming [30]. However, solving this simple problem using a data-driven approach, for instance, by learning the patterns of optimal solutions in many different cases, is extremely challenging, since the time series involved are long and include many interdependencies. Unless one uses a very large database for training, a straightforward data-driven approach will learn some characteristics of the optimal solutions but will probably miss most of them. In this light, it is obvious that a methodological combination of these two fundamental approaches (classical methods and data-driven methods) is called for.
For instance, considering optimal control problems that appear in power systems, we would like to be able to construct optimal control policies which on the one hand do not require full knowledge of the physical world and the underlying processes but on the other hand apply knowledge of the physical world to utilize the available data in an optimal way. It is evident that the energy community is greatly interested in this direction, and in recent years we indeed have seen a growing number of studies that examine such combined solutions, be it in the field of energy markets or in optimal control problems that emerge in power systems.
From here forward, we review how these various concepts and methods are used in power system applications, starting with the important applications that arise in the context of energy markets.

3.1. Energy Market Management

Reinforcement learning (RL) has become increasingly popular in analyzing energy markets, since it effectively addresses the complexity and unpredictability of its related tasks. RL’s strength lies in its ability to learn optimal strategies through interaction with the environment, enabling it to adapt to changing conditions and uncertainties like fluctuating demand, renewable energy variability, and volatile market prices. This adaptability makes RL particularly valuable in energy markets where dynamic decision making is a key attribute. Within RL, model-based approaches offer greater sample efficiency by using predictive models of the environment, allowing for faster convergence, while model-free methods, though more flexible, require more interactions with the environment to learn optimal policies. This balance between efficiency and adaptability makes RL a powerful tool for managing the complexities of modern energy markets.
Among the many applications of RL in the energy market, energy bidding policies are a notable example, where model-based approaches have shown substantial improvements. For instance, in paper [49], the authors applied the MB-A3C algorithm to optimize wind energy bidding strategies, significantly enhancing profit margins in the face of market volatility. Similarly, RL has been applied to the optimization of energy bidding in general market-clearing processes, as seen in [50], where model-based RL approaches reduced training times and streamlined market-clearing decisions under regulatory frameworks. Peer-to-peer (P2P) energy trading is another field where RL excels, enabling efficient energy trades among prosumers. Model-based approaches, such as the MB-A3C3 model developed in [51], have optimized these trades by forecasting prices and energy availability, leading to significant cost reductions. Moreover, study [52] proposes a green power certificate trading (GC-TS) system for China, leveraging Q-learning, smart contracts, and a multi-agent Nash strategy to improve trading efficiency and collaboration. It integrates green certificates, electricity, and carbon markets, using a multi-agent reinforcement learning equilibrium model, resulting in increased trading prices and significantly improved transaction success rates. The proposed system outperforms similar models with higher convergence efficiency in trading quotes. Additionally, in [53], the authors focus on an energy dispatch problem for a wind–solar–thermal power system with nonconvex models. The authors propose a combination of federated reinforcement learning (FRL) with the model-based method. Grid-connected renewable energy and thermal power at a bus are aggregated into a virtual power plant, and in this manner, they ensure private operation with limited but effective information exchange. Numerical studies validate the effectiveness of the proposed framework for handling short time-scale power source operation with nonconvex constraints.
In paper [54], the authors use an average reward reinforcement learning (ARRL) model to optimize bidding strategies for power generators in an electricity market. They use a Constrained Markov Decision Process (CMDP) to simulate market conditions and update reinforcement values based on the rewards received from market interactions, and they incorporate forecasts of system demand and prices as part of the state information used by the RL agent to make informed decisions. Furthermore, study [55] presents an energy management strategy for residential microgrids using a model predictive control (MPC)-based reinforcement learning (RL) approach and the Shapley value method for fair cost distribution. The authors parameterize the MPC model to approximate the optimal policy and use a deterministic policy gradient (DPG) optimizer to adjust these parameters, effectively reducing the monthly collective cost by handling system uncertainties. The paper highlights a reduction of the monthly collective cost by about 17.5% and provides a fair and equitable method for distributing the cost savings among the residents. Continuing this line of thought, work [56] deals with energy management for residential aggregators, specifically in managing uncertainties related to renewable energy and load demand. They use a two-level model predictive control (MPC) framework, integrated with Q-learning, to optimize day-ahead and real-time energy management decisions. The solution demonstrated improved performance in reducing operational costs and maintaining system stability while managing the energy needs of a residential community.
In conclusion, reinforcement learning has emerged as a powerful tool in analyzing problems that arise in energy markets due to its ability to handle the complex and dynamic nature of real-time energy management. Its applications, such as optimizing energy bidding strategies and peer-to-peer trading, have demonstrated substantial improvements in efficiency and profitability, especially when model-based approaches are employed to enhance learning and decision making. A summary of the studies reviewed is presented in Table 2. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 9.

3.2. Power Grid Stability and Control

Many studies cover various aspects of grid stability and control based on advanced reinforcement learning techniques. Although data-driven techniques have been proven useful in many cases, there are a number of advantages to using model-based paradigms or at least incorporating some domain knowledge in the learning process, especially in this application. The analysis of grid stability and control is divided into many subdomains such as voltage control, frequency control, and reactive power control. Each of these has a specific and known structure whose underlying physics may be rigorously modeled. This observation motivates many researchers to leverage this information and guide the learning process to accelerate it, resulting in a more robust and resilient model with increased generalization abilities. For instance, considering voltage control applications, study [63] focuses on examining load shedding control for large-scale grid emergency voltage control. The authors propose a derivative-free deep reinforcement learning algorithm named PARS, which uses domain knowledge of voltage control problems to address the computational inefficiency and poor scalability of existing RL algorithms. For evaluation and comparison, the IEEE 39-bus and IEEE 300-bus systems were used. The proposed method was compared to other algorithms such as model predictive control and proximal policy optimization. From the results it is evident that PARS converges faster and is able to handle more complicated scenarios that require greater generalization ability. Another example concerning this application may be seen in [64], which focuses on a short-term voltage stability problem. In the paper, the authors propose a deep reinforcement learning framework, which acts in a simulation representing the behavior of the power grid for learning the policy. However, to deal with the high dimensionality of the system dynamics arising in a large-scale power system, they incorporate imitation learning at the beginning of the training. The results show a 97.5% reduction in samples and an 87.7% reduction in training time for an application to the IEEE 300-bus test system when compared to the baseline PARS. From a slightly different perspective, work [65] examines the challenges of fast voltage fluctuations in an unbalanced distribution system. This work proposes a model-free approach that incorporates physical domain knowledge, ultimately guiding the training process, thus combining the model-based and model-free approaches. Specifically, they use historical data in a supervised training process of a surrogate model to map the interaction between power injections and voltage fluctuations of each node. Following this, they aim to learn a control strategy based on the experiences acquired by continuous interactions with the surrogate model. The policy is designed by an actor–critic algorithm with slight modifications. Simulation results on an unbalanced IEEE 123-bus system are presented and compared to other methods including double deep Q-learning, stochastic programming, MPC, and deep deterministic policy gradient. On top of that, a different perspective on emergency voltage control is discussed in [66]. This article focuses on the new challenges that arise in offline emergency control schemes in power systems, highlighting adaptiveness and robustness issues. The authors propose a deep reinforcement learning framework, utilizing a Q-learning algorithm, to deal with the growing complexity of the problem.
They model the environment using prior domain knowledge from system theory, helping the training process converge faster and extending the model’s ability to generalize. Furthermore, in this work, an open-source platform named Reinforcement Learning for Grid Control (RLGC) is designed to provide a standardized framework for the development and benchmarking of DRL algorithms for power system control, which is an important step toward a unified evaluation platform. The model was assessed for its robustness, using different scenarios and various noise types in the observations. Finally, a few case studies are discussed, including a four-machine system and the IEEE 39-bus system. A different application concerning grid stability and control is microgrid management. Microgrids can enhance grid stability but can also introduce challenges in grid management. For instance, paper [67] considers a hybrid energy storage system (HESS) control problem in a microgrid, with a photovoltaic energy production source along with a diesel generator. The low inertia inherent to the photovoltaic system may cause power quality disturbances if the charging and discharging process of the energy storage is unregulated. The authors propose an online reinforcement learning method based on policy iteration. In the design of the reward function, they leverage domain knowledge regarding possible voltage deviations, thus introducing stochasticity to the rewards the agent receives, which aids in handling the existing uncertainties caused by the intermittent energy sources. Moreover, a separate artificial neural network is used to estimate the nonlinear dynamics of the storage system based on the input and output measurements. The effectiveness of the method is evaluated through hardware-in-the-loop (HIL) experiments to assess performance on real hardware, where unpredicted challenges such as communication delay and measurement noises may appear. Continuing this line of thinking, another study, presented in [68], examines the optimal control of off-grid microgrids, which satisfy the power demand mainly with renewable energy sources aided by energy storage systems. The researchers propose a model-based reinforcement learning algorithm, utilizing a variation of PPO. A comparison with a rule-based policy and a model predictive controller is made. The benchmark uses empirically measured data from a small village in Bolivia and emphasizes the improved performance this algorithm achieves over the other methods. Examining another aspect of grid stability and control, it is clear that optimal power flow planning plays a crucial role. In this application domain, various approaches were proposed. For instance, [69] analyzes a scenario with high-level penetration of intermittent renewable energy sources, which necessitates a rapid and economical response to the changes in the power system operating state. This study suggests a real-time optimal power flow approach using Lagrangian-based deep reinforcement learning, leveraging deep deterministic policy gradient for policy optimization. The action–value function is designed to include physical constraints to allow a safe and controlled learning process. In addition, the gradients for the quality estimation of the states are derived analytically instead of using a critic network. The evaluation is performed on an IEEE 118-bus system and compared with advanced methods such as the interior-point method, DC optimal power flow, and a supervised learning method.
A different approach may be seen in [70], which addresses the security of a distribution network of interconnected microgrids against false data injection. They propose a reinforcement learning framework with multiple objectives, using an actor–critic algorithm, and incorporate various constraints based on domain knowledge of the system into the training process, such as voltage and frequency stability considerations and power flow limitations. Simulations on open-source data are presented. Finally, studies employing model-based RL approaches for grid stability and control applications demonstrate the potential of these methods to enhance the reliability and resilience of power systems. By modeling the complex dynamics of the grid and predicting the impact of various control actions, model-based RL enables precise regulation of voltage, frequency, and power flows. This results in improved fault tolerance and the ability to handle fluctuations from renewable energy sources more effectively. As the integration of distributed energy resources and smart grids becomes more prevalent, model-based RL will be essential for maintaining grid stability and ensuring the efficient operation of future power networks. A summary of the studies reviewed is presented in Table 3. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 10.

3.3. Building Energy Management

In this subsection, we review several recent studies from the literature that utilize model-based reinforcement learning to manage energy usage in buildings. Managing the electricity consumption in buildings is essential to reduce peak demand and may assist in maintaining grid stability. Efficient power distribution and utilization by this type of consumer, especially when considering large-scale office buildings that sustain hundreds of offices, may substantially aid renewable energy integration and reduce carbon emissions. For example, article [73] develops a method for using DRL for energy management of heating, ventilation, and air conditioning (HVAC) optimal control. The proposed method is demonstrated in a case study on a radiant heating system. The authors use an “EnergyPlus” simulator to create a model of the building, and a soft actor–critic is used to train a DRL agent to develop the optimal control policy for the system’s supply water temperature set-point. Following this, a different study [74] focuses on utilizing a physics-based modeling method for building energy simulation, called a “whole building energy model”, in an HVAC optimal control problem. The authors use a deep deterministic policy gradient (DDPG) algorithm. By analyzing the real-life control deployment data, it was found that the proposed method achieves a 16.7% heating demand reduction with more than 95% probability compared to the old rule-based control. In a different approach, presented in [75], the authors implement a reinforcement learning algorithm for HVAC energy management. The model is trained offline over past data and simulated data by imitation learning from a differential MPC model. Next, they use online transfer learning with real-time data to improve performance utilizing a PPO agent. Results show a reduction in the HVAC system’s energy consumption while maintaining satisfactory human comfort in simulated data. Moreover, they show a reduction of about 16% in energy consumption when applying the proposed model to the aggregated real-world data. One analytic solution approach to optimal energy management in residential buildings is the MPC method, which incorporates prior knowledge about the system’s dynamics to develop a control policy iteratively. Study [76] focuses on mitigating the large overhead required for applying the MPC algorithm by proposing an approximate model utilizing machine learning techniques. They propose an easily implemented scheme of advanced control strategies suitable for low-level hardware by combining multivariate regression methods, such as regression trees, with dimensionality reduction algorithms, such as PCA. The approach is demonstrated on a case study, in which the objective is to optimize temperature control in a six-zone building, modeled using a large state space and various disturbance types. The results indicate a substantial reduction in both implementation costs and computational overhead while preserving satisfactory performance. Taking this idea a step further, in [77], the authors introduce a combination of two control methods, reinforcement learning and MPC, called “RL-MPC”. The proposed algorithm can meet constraints and provide similar performance to MPC while enabling continuous learning and the possibility to deal with highly uncertain environments that the standard MPC cannot handle. When tested on a deterministic environment, the proposed algorithm achieves results as good as a regular MPC and outperforms it in a stochastic environment.
In another related work [78], a new deep learning-based constrained control method inspired by MPC is introduced, called “Differentiable Predictive Control” (DPC). The proposed algorithm begins with system identification using a physics-constrained neural state-space model. Next, a closed-loop dynamics model is obtained. From these learned dynamics, the model can infer the optimal control law. The results show that DPC overcomes the main limitation of imitation learning-based approaches with a lower computational overhead. Looking from a different perspective, work [79] addresses optimal dispatch control in building energy management by coordinating the operation of distributed renewable energy resources to meet economic, reliability, and environmental objectives in a building. The authors use a parameterized Q-learning algorithm to learn the optimal dispatch policy for the various power sources. The agent interacts with an environment that is modeled by an MPC algorithm, which provides the transition dynamics. The efficiency and effectiveness of the policy are demonstrated through simulation.
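To make the approximate-MPC idea discussed above (e.g., [75,76,78]) more tangible, the following is a minimal, hypothetical sketch, not taken from the reviewed papers, of distilling an expensive controller into a cheap supervised surrogate: sampled states and disturbances are compressed with PCA and mapped to the expert’s control action with a regression tree, which can then run on low-level hardware. The expert_mpc function and all dimensions are illustrative placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def expert_mpc(state):
    # Placeholder for an expensive MPC solve: returns a supply-water
    # temperature set-point from zone temperatures and weather inputs.
    return 0.4 * state[:, :6].mean(axis=1) - 0.2 * state[:, 6] + 35.0

# Offline dataset: 6 zone temperatures + outdoor temperature + 5 other disturbances.
X = rng.normal(loc=21.0, scale=3.0, size=(5000, 12))
y = expert_mpc(X)

# Compress the state with PCA, then fit a shallow regression tree that
# imitates the expert control law (cheap to evaluate at run time).
policy = make_pipeline(PCA(n_components=5), DecisionTreeRegressor(max_depth=6))
policy.fit(X, y)

# At deployment, the distilled policy replaces the online MPC solve.
new_state = rng.normal(loc=21.0, scale=3.0, size=(1, 12))
print("set-point [degC]:", float(policy.predict(new_state)[0]))
```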
To summarize, in the domain of energy management for buildings, model-based RL approaches offer several benefits by optimizing heating, cooling, lighting, and other energy-intensive processes. These methods utilize detailed models of building dynamics and environmental conditions to predict energy consumption patterns and adjust control strategies accordingly. By doing so, they can reduce energy costs, improve occupant comfort, and decrease the environmental footprint of buildings. The growing emphasis on sustainable architecture and smart buildings underscores the importance of model-based RL in advancing energy-efficient building management solutions. A summary of the studies reviewed is presented in Table 4. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 11.

3.4. Electric Vehicles

Model-based reinforcement learning methods are increasingly used in electric vehicle (EV) applications due to their ability to incorporate complex system dynamics and optimize long-term decision making. By leveraging detailed models of the environment, these methods can predict the impact of actions more accurately and optimize various aspects of EV operations, such as charging, cost management, and resource allocation. This section discusses several studies that utilize model-based RL approaches to enhance EV-related tasks, including power control, cost savings, pricing strategies, and navigation planning. One study presented in [83] addresses the optimal power control problem for fuel cell EVs, focusing on reducing hydrogen consumption. The vehicle’s speed and power demands are modeled as a discrete-time Markov chain, with parameters learned via Q-learning. Another paper [84] examines a cost-saving charging policy for plug-in EVs using a fitted Q-iteration algorithm to estimate usage and future costs. This approach shows a significant reduction in charging costs, ranging from 10% to 50%. In work [85], price reduction for EV charging is explored, where a long short-term memory network predicts future prices. White Gaussian noise is added to the actions to prevent the model from settling at nonoptimal operating points. A related study [86] focuses on increasing the profitability of fast charging stations (FCSTs) through dynamic pricing. This model predicts traffic flow and demand while scoring based on a user satisfaction model. Dynamic pricing is shown to increase the average number of EVs utilizing the FCSTs, improve user satisfaction, reduce waiting times, and boost overall profits. In the domain of navigation, one study [87] aims to provide an efficient charging scheme for urban electric vehicles. The article proposes a new platform for real-time EV charging navigation based on graph reinforcement learning, which utilizes a deep Q-learning agent. The platform’s main objective is to help EV owners decide when and where to charge, aiming to minimize charging costs and travel time. Case studies are conducted within a practical zone in Nanjing, China, and the authors verify the effectiveness of the developed platform and solution method based on simulation results. Similarly, another study [88] addresses the navigation task using a network that extracts features from simulations and employs a deep Q-learning (DQL) agent to determine the optimal path, demonstrating performance comparable to known optimal solutions. Further, the work introduced in [89] takes a combined approach that involves planning charging needs a day in advance and making real-time charging decisions based on this plan. This model performs well, matching benchmark results. Another paper, [90], deals with scheduling EV charging in a power network that includes PV production. It uses a nodal multi-target (NMT) model to ensure that actions are valid (e.g., not charging an EV that does not need it), resulting in improved charging scheduling, faster convergence, and lower costs, which are assessed through simulation. Finally, work [91] examines charging control from the three perspectives of the user, the power distribution network, and the grid operator. The model’s actions are stochastic, normally distributed with parameters generated by a deep RL model. Each EV is managed by a separate RL model, with a global model coordinating all individual models.
The use of this federated approach yields better results than state-of-the-art models. In summary, these studies illustrate the significant potential of model-based RL methods to enhance various aspects of EV management, offering promising solutions for optimizing power control, reducing costs, and improving overall efficiency. By incorporating detailed models of the environment, these approaches can effectively handle the complexities of EV operations, leading to better outcomes for both users and service providers. A summary of the studies reviewed is presented in Table 5. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 12.
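As a hedged illustration of the fitted Q-iteration ingredient used, for example, for charging cost reduction in [84], the sketch below iterates supervised regressions over synthetic transition tuples; the two-dimensional state (state of charge and price), the reward, and the transition model are simplified placeholders rather than any study’s actual formulation.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(1)
gamma, n_actions, n = 0.95, 3, 5000      # actions: 0 = idle, 1 = slow charge, 2 = fast charge

# Synthetic transitions (state = [state_of_charge, price], action, reward, next_state).
soc, price = rng.uniform(0, 1, n), rng.uniform(20, 80, n)
action = rng.integers(0, n_actions, n)
delta = 0.05 * action
reward = -price * delta + 2.0 * np.minimum(soc + delta, 1.0)   # pay for energy, value stored charge
next_state = np.column_stack([np.minimum(soc + delta, 1.0),
                              np.clip(price + rng.normal(0, 5, n), 20, 80)])
state = np.column_stack([soc, price])

# Fitted Q-iteration: repeatedly regress Q(s, a) onto r + gamma * max_a' Q(s', a').
q = None
for _ in range(10):
    if q is None:
        target = reward
    else:
        q_next = np.column_stack([q.predict(np.column_stack([next_state, np.full(n, a)]))
                                  for a in range(n_actions)])
        target = reward + gamma * q_next.max(axis=1)
    q = ExtraTreesRegressor(n_estimators=50, random_state=0)
    q.fit(np.column_stack([state, action]), target)

# Greedy charging decision for a given state of charge and price.
s = np.array([[0.3, 25.0]])
best = int(np.argmax([q.predict(np.column_stack([s, [[a]]]))[0] for a in range(n_actions)]))
print("greedy action:", best)
```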

3.5. Energy Storage Management

Model-based reinforcement learning methods are increasingly being recognized as essential tools for optimizing energy management in systems that integrate renewable energy and energy storage. These methods enable precise control and decision making by leveraging models that capture the dynamics of the environment, leading to improved efficiency and sustainability. In particular, the application of model-based RL to energy storage management is crucial, as it not only enhances the efficiency of energy usage but also plays a pivotal role in stabilizing the grid and supporting the integration of renewable energy sources. Thus, this section covers various studies that utilize advanced model-based RL approaches to optimize energy management in various storage device applications. The papers collectively present advanced methods for optimizing energy management in various systems, including residential settings, industrial parks, and hybrid vehicles, focusing on integrating renewable energy and energy storage systems. For instance, paper [92] introduces a model-based control algorithm that optimizes photovoltaic power generation and energy storage under dynamic electricity pricing, using convex optimization and reinforcement learning to improve cost savings. Another study discussed in [93] develops an optimization model for energy management in large industrial parks, employing deep deterministic policy gradient, greedy algorithms, and genetic algorithms to manage energy storage and consumption, addressing the variability of renewable energy sources. In the context of hybrid electric vehicles (HEVs), several papers explore reinforcement learning techniques, such as Q-learning, Dyna, and Sarsa, to optimize fuel efficiency and energy management. These approaches integrate transition probability matrices and recursive algorithms to dynamically adjust control policies based on real-time driving data, significantly improving fuel economy and adaptability compared to traditional methods [94,95,96]. Additionally, one study specifically focuses on minimizing battery degradation costs in battery energy storage systems (BESSs) for power system frequency support, using a deep reinforcement learning approach with an actor–critic model and deep deterministic policy gradient [97]. Another paper introduces an adaptive control strategy for managing residential energy storage systems paired with PV modules, enhancing grid stability and reducing electricity costs through advanced forecasting techniques and a two-tier control system [98]. In a related study, a comprehensive energy consumption model for tissue paper machines is developed using machine learning techniques, with XGBoost identified as the most effective method for optimizing electricity and steam consumption [99]. Further, study [100] presents a near-optimal storage control algorithm for households with PV systems integrated into the smart grid, utilizing convex optimization to manage battery charging and discharging based on dynamic pricing. This approach reduces household electricity expenses by up to 36% compared to baseline systems, highlighting significant advancements in smart energy management. These papers demonstrate the potential of integrating advanced control algorithms, machine learning, and optimization techniques to enhance energy efficiency and operational performance across a range of applications. 
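To give a flavor of the convex storage-scheduling formulations reported in works such as [92,100], the following minimal sketch schedules battery charging and discharging to minimize an electricity bill under a dynamic tariff. The price, load, and PV profiles are synthetic, conversion losses and degradation are ignored, and exported energy is assumed to be netted at the retail price, so this is an illustrative toy rather than the formulation used in those papers.

```python
import numpy as np
import cvxpy as cp

T = 24                                                        # hourly horizon
price = 0.10 + 0.15 * np.sin(np.linspace(0, np.pi, T))        # synthetic tariff [$/kWh]
load = 1.0 + 0.5 * np.cos(np.linspace(0, 2 * np.pi, T))       # household demand [kW]
pv = np.maximum(0, 3.0 * np.sin(np.linspace(-np.pi / 2, 3 * np.pi / 2, T)))  # PV output [kW]

charge = cp.Variable(T)        # battery power, positive = charging [kW]
soc = cp.Variable(T + 1)       # battery state of charge [kWh]
grid = load + charge - pv      # net grid exchange [kW]; negative values are exported (net metering)

constraints = [
    soc[0] == 2.0,                      # initial stored energy [kWh]
    soc[1:] == soc[:-1] + charge,       # lossless storage dynamics, 1 h steps
    soc >= 0, soc <= 10.0,              # 10 kWh capacity
    charge >= -3.0, charge <= 3.0,      # 3 kW converter limit
]
problem = cp.Problem(cp.Minimize(price @ grid), constraints)
problem.solve()
print("daily net electricity cost [$]:", round(problem.value, 2))
print("charging schedule [kW]:", np.round(charge.value, 2))
```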
In summary, these studies illustrate the great interest of researchers, who are trying to employ model-based RL approaches in optimizing energy storage management across diverse applications. By effectively incorporating environmental dynamics and leveraging predictive modeling, these methods achieve superior performance in cost reduction, system efficiency, and resource utilization. As the demand for sustainable energy solutions grows, model-based RL will continue to play a crucial role in advancing smart energy management and supporting the integration of renewable energy into modern power systems. A summary of the studies reviewed is presented in Table 6. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 13.

3.6. Prominent Trends

Reviewing Table 2, Table 3, Table 4, Table 5 and Table 6, it is clear that model-based reinforcement learning has shown considerable potential across various domains in energy management, with each application showcasing unique trends and challenges. In power grid stability, particularly voltage control, there is a noticeable trend towards using continuous state spaces and stochastic policies, as presented in Figure 14. Conversely, applications like electric vehicle charging schedules often employ discrete state spaces with deterministic policies due to the unpredictable nature of influencing factors such as weather conditions and driver behavior. These discrepancies underscore the challenge of achieving perfect knowledge of the environmental dynamics, often leading to suboptimal policies produced by RL models. The complexity in modeling these systems highlights the need for adaptive and robust model-based RL methods that can handle such uncertainties. Moreover, the distribution of the different algorithms among the different applications for this paradigm in the reviewed papers is depicted in Figure 15. Another point worth mentioning is the clear standardization observed in the domain of voltage control; this positive example further emphasizes the dire need for similarly standardized tools in other applications to conduct qualitative evaluations and promote extensive, in-depth research of new methods and algorithms for various optimal control problems. In energy management for buildings, one might expect similarities with residential energy management due to apparent similarities in the objectives. However, empirical evidence suggests that even different households may require entirely different policies, highlighting the variability in dynamic models. For example, the energy consumption patterns of a person living in a small apartment can vary dramatically depending on their work schedule. The person’s consumption could differ significantly week to week if their shift times change. This variability makes it nearly impossible to model consumption profiles accurately ahead of time, thereby necessitating RL algorithms capable of dynamically adapting to these changes. Consequently, model-based RL approaches for building energy management need to be flexible enough to account for diverse and evolving consumption patterns. Most research in energy management for buildings focuses on optimizing the same objective: reducing energy costs while maintaining comfort. While these are crucial considerations, several underexplored applications could benefit from RL. For instance, integrating demand response programs into building energy management systems could enhance grid stability by adjusting demand in real time based on grid conditions. Another promising area is the integration of renewable energy sources, such as solar panels, into building energy management systems, requiring RL algorithms to manage intermittent supply while ensuring energy efficiency and stability of the power grid. Additionally, using buildings as energy reservoirs that supply necessary electricity in an emergency event may be another area where model-based RL could provide substantial benefits. These applications highlight the need for further exploration and innovation in applying model-based RL to the multifaceted challenges of energy management.

4. Model-Free Paradigms

We now turn our attention to model-free paradigms. In contrast to model-based paradigms, model-free paradigms utilize only statistical data to shape an optimal control policy without considering any information about the model physics or dynamics. Naturally, such methods rely on aggregated data and are useful when the dynamical model under consideration is too complicated to learn, as is often the case. Therefore, such methods are increasingly used in power system optimal control problems and are gaining popularity in recent years. Moreover, these methods can address high-dimensional state and action spaces in power system control through various strategies that reduce complexity and improve learning efficiency. One approach is to use deep neural networks, e.g., deep Q-networks or policy gradient methods, as function approximators to capture the complex relationships between states and actions, enabling the RL agent to operate in high-dimensional spaces. Techniques such as convolutional neural networks or recurrent neural networks can further exploit spatial and temporal correlations, respectively, thereby reducing the effective dimensionality [105]. Another approach is to leverage dimensionality reduction techniques, such as principal component analysis or autoencoders, to create compact representations of the state space [106]. Similarly, hierarchical reinforcement learning can decompose complex control tasks into subtasks, simplifying the action space by learning high-level policies that coordinate simpler, low-level policies [107]. These methods can help face the challenges of large state–action spaces, improving both training efficiency and policy effectiveness in high-dimensional settings such as those arising in power system control problems. Another important aspect when considering model-free RL solutions is the choice of exploration strategy, which may impact the quality of the learned control policies in power systems. For instance, the widely used epsilon-greedy strategy balances exploration and exploitation by randomly selecting actions with a small probability, thus reducing the chances that the agent gets trapped in a local optimum. However, in high-dimensional action spaces, this approach may lead to inefficient exploration, as random actions are less likely to yield meaningful outcomes. More advanced strategies, such as entropy regularization, promote exploration by encouraging the policy to maintain a certain level of stochasticity, which can be particularly useful in power system control, where diverse strategies are needed to handle varying grid conditions. Other approaches, such as UCB-based methods [108], for example, prioritize actions with high uncertainty, leading to more systematic exploration. The effectiveness of these strategies depends on the specific control task, as overly aggressive exploration can cause instability in sensitive power systems, while insufficient exploration may prevent the agent from discovering optimal control policies. For extended discussion regarding exploration strategies, refer to [109]. A possible drawback of these methods is their need for large amounts of data or simulations to learn effective policies, which can be a limitation in power system control where real-world data may be scarce or costly to obtain. However, certain strategies can improve learning abilities from limited data. 
For example, data augmentation techniques and experience replay can be used to reuse historical data, effectively increasing the dataset size and stabilizing the learning process [19]. Offline RL methods can leverage previously collected data without the need for additional online exploration. Additionally, transfer learning and domain adaptation techniques allow agents to transfer knowledge from related tasks or simulations, reducing the need for extensive new data [110]. These methods are particularly useful in power system control, where simulating all possible grid states is computationally intensive, but successful knowledge transfer can still enable policy learning in new environments [111]. Nonetheless, the challenge remains in ensuring that the learned policies generalize well to unseen conditions without overfitting to the limited available data.
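As a generic, minimal sketch of two of the ingredients mentioned above, epsilon-greedy exploration and experience replay, the snippet below shows how an agent might select actions and reuse stored transitions in mini-batches; the Q-value estimate and the environment interaction are placeholders and are not tied to any specific reviewed study.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

# Example usage with a dummy 4-action Q-estimate and dummy 3-dimensional states.
buffer = ReplayBuffer()
q_estimate = np.array([0.1, 0.7, 0.3, 0.2])
for step in range(100):
    epsilon = max(0.05, 1.0 - step / 50)          # decaying exploration rate
    a = epsilon_greedy(q_estimate, epsilon)
    # ...interaction with a power system simulator would go here...
    buffer.push(np.zeros(3), a, 0.0, np.zeros(3), False)
if len(buffer) >= 32:
    states, actions, rewards, next_states, dones = buffer.sample(32)
```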
The readers may note that we repeat here the same applications as above to provide comparisons and to draw conclusions regarding the differences, advantages, and disadvantages of each approach.

4.1. Energy Market Management

Model-free reinforcement learning methods have gained attention in energy market management due to their ability to handle complex, decentralized systems without requiring a model of the environment. These techniques are particularly valuable in situations where the system dynamics are difficult to model or are constantly changing, allowing the RL agent to learn optimal strategies directly from interactions with the environment. Model-free RL approaches, such as deep deterministic policy gradient and others, have shown promise in various energy market applications. One use case, for example, is in the management of decentralized energy systems. For instance, in paper [112], the authors demonstrate the effectiveness of a model-free RL approach in optimizing multi-energy systems in residential areas, reducing energy costs without the need for a predefined environmental model. Work [113] addresses the optimization of local energy markets using reinforcement learning through the Autonomous Local Energy Exchange (ALEX) framework, which combines multi-agent learning and a double auction mechanism. They found that weak budget balancing and market truthfulness are essential for effective learning and market performance. ALEX-based pricing improved demand response functionality and reduced electricity bills by a median of 38.8% compared to traditional net billing, showcasing the efficiency of this approach. In study [114], a new method for optimizing energy trading is proposed; the authors focus on local markets and perform the optimization using a modified Erev–Roth algorithm for agent bidding strategies. Simulations showed improved self-sufficiency and reduced electricity costs through demand response and peer-to-peer trading. Furthermore, the authors of [115] focus on optimizing real-time bidding and energy management for solar–battery systems to reduce solar curtailment and enhance economic viability. They developed a model-free deep reinforcement learning algorithm (AC-DRL) using an attention mechanism and multi-grained feature convolution to make better-informed bidding decisions. The results showed that AC-DRL significantly outperforms traditional methods, reducing solar curtailments by 76% and increasing revenue by up to 23%. In a different context, the researchers in [116] aim to enable community-based virtual power plants (cVPPs) to quickly deliver ancillary services to the grid by developing a decision-making model that minimizes cVPP operation costs, formulated as a partially observable Markov game. The results from the numerical simulations show that the proposed method allows cVPPs to autonomously formulate energy bidding and management strategies without needing access to other cVPPs’ private information. From a different perspective, paper [117] also focuses on the challenge of integrating distributed energy resources in the grid and their implications on energy trading in local markets. The authors propose a new market model to coordinate between the distributed sources, which represent the environment, and explore a model-free prosumer-centric coordination approach through a multi-agent deep reinforcement learning method. Case studies in a real-world setting validate the effectiveness of the proposed market design and include a comparison to other methods. Another work that considers energy trading applications in a peer-to-peer setting is [118].
In this study, the authors consider a newly emerging trend of consumer-to-consumer trading that redesigns local energy markets. This form of trading diversifies the energy market ecosystem and can be used to further support grid stability, although it introduces additional uncertainty to energy trading strategies. The researchers present a Markov Decision Process formulation for this market model and analyze beneficial trading strategies using multi-agent reinforcement learning methods that rely on a data-driven approach. The proposed model is evaluated using simulation, and the results are discussed to highlight the benefits and disadvantages of the method. Energy markets in microgrids are also an interesting subapplication since microgrids can detach themselves from the grid at any time, making them a highly uncertain environment for planning. For example, in work [119], the objective is to achieve distributed energy scheduling and strategy making in a double auction-based microgrid. To address this problem, a multi-agent reinforcement learning approach is adopted. The authors propose an optimal equilibrium selection mechanism to improve the performance of the model and enhance fairness, execution efficiency, and privacy protection. Simulation results validate the capabilities of the proposed method. To continue this line of thinking, paper [120] uses multi-agent reinforcement learning to control a microgrid in a mixed cooperative and competitive setting. The agents observe energy demand, changing electricity prices, and renewable energy production. Based on this information, they decide upon storage system scheduling to maximize the utilization of the renewables and reduce the energy costs when purchasing from the grid. The evaluation is performed in two settings: single and multi-agent. In the multi-agent setting, the researchers design the individual reward function that each agent receives by leveraging the concept of marginal contribution to better assess how the agents’ actions impact the joint goal of reducing energy costs. Another paper that considers energy trading in microgrids is [121]. Here, they present an online reinforcement learning approach that is based on imitation learning to mimic a mixed-integer linear programming (MILP) solver. The proposed method is compared to an agent that learns the policy from scratch and to a Q-learning agent. Numerical simulations on both simulated and real-world data highlight the performance advantage of the proposed approach as compared to a few other methods. In conclusion, model-free RL approaches offer flexibility and adaptability, making them well suited for managing decentralized and multi-agent systems in the energy market. Their ability to optimize complex systems without a predefined model opens up new possibilities for advancing real-time decision making in dynamic energy environments. A summary of the studies reviewed is presented in Table 7. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 16.

4.2. Power Grid Stability and Control

Despite the remarkable achievements of model-based reinforcement learning paradigms, in many power system applications, particularly in optimal control problems of power grid stability and management, it is impossible to acquire perfect knowledge of the system dynamics. As a result, data-driven methods leverage statistical learning to produce optimal control policies without any knowledge of the system, instead relying solely on aggregated data. For example, examining the same task of voltage control regulation, paper [126] implements an autonomous control framework, “Grid mind”, for voltage control and the secure operation of power grids. The authors use a deep Q-network and deep deterministic policy gradient and feed the model with the current system conditions detected by real-time measurements from supervisory control and data acquisition or phasor measurement units. A case study on a realistic 200-bus test system is demonstrated. Alternatively, work [127] focuses on the management of active distribution networks, which face frequent and rapid voltage violations due to renewable energy integration. In this work, they propose a fast control scheme for PV inverters. To achieve this, they partition the existing network into subnetworks based on voltage–reactive power sensitivity. Next, the scheduling of PV inverters is presented as a Markov game. The policy is learned in a multi-agent setting, where the agents use a soft actor–critic algorithm. Each subnetwork is represented as an agent. The framework in which the agents operate is a cooperative game, so the agents are trained in a centralized manner, but they also rely on local observations for quick response. To show the effectiveness of the method, various comparative tests with different benchmark methods were performed, such as stochastic programming and several others, on IEEE 33- and 123-bus systems and a 342-node low-voltage distribution system. To continue this line of thinking, consider study [128], which discusses autonomous, real-time voltage control for economic and safe grid operation. The paper laid the foundation for “Grid mind”, which was further extended in [126] to include the continuous state space. Here, the researchers proposed a deep Q-learning algorithm to effectively learn the voltage corrections required for grid stabilization. They tested the proposed method on the standard IEEE 140-bus system with various scenarios and voltage perturbations. As was previously mentioned, power grid stability and control has many aspects, and if we continue our analysis of the literature in this domain of research, frequency control is another widely discussed problem. For instance, consider paper [129], which addresses power system stability margins and is especially interested in poorly damped or unstable low-frequency oscillations. The authors aim to propose a reinforcement learning framework to effectively control these oscillations in order to ensure the stability of the system’s operation. They design a network of real-time closed-loop wide-area decentralized power system stabilizers. The measured wide-area data are processed by a set of decentralized “stability” agents, each implementing a variation of a Q-learning algorithm. Finally, a MATLAB simulation is designed to assess the performance of the method.
Moreover, in work [130], they address the problem of frequency regulation in emergency control plans. The authors begin with a model designed for limited emergency scenarios, utilizing a variation of a Q-learning algorithm, and train it offline. Next, they use transfer learning and extend the generalization ability by using a deep deterministic policy gradient (DDPG) algorithm. They employ this system online to learn near-optimal solutions. Using Kundur’s 4-unit 13-bus system and the New England 68-bus system, they verify the capabilities of the proposed schemes. The integration of renewable energy and the latest technological advancements have increased the prevalence of microgrids, which can detach from the main grid at any time, causing malfunctions and jeopardizing its standard operation. This has given rise to a new sort of optimal control problem in the domain of power grid stability and control. The literature contains many studies that investigate this problem, but with different objectives, emphasizing the many variables that need to be considered when managing microgrid formations and the great complexity they introduce into the system. For example, work [131] inspects a network of interconnected microgrids that can share power with each other, called a multi-microgrid formation, and aims to propose a control policy for the power flow to enhance power system resilience. They propose a deep reinforcement learning scheme based on a double deep Q-learning algorithm, with a CNN for effective learning of Q-values. The authors evaluate the performance of the proposed scheme using a 7-bus system and the IEEE 123-bus system with different environmental conditions. Furthermore, in study [132], the researchers are also interested in the control and management of the multi-microgrid setting, only here they adopt the distribution system operator’s perspective, whose target is to reduce the demand-side peak-to-average ratio (PAR) and to maximize the profit from selling energy, along with protecting the users’ privacy. The microgrids are modeled without direct access to users’ information, and the retail pricing strategy is based on prediction via a Monte Carlo method. Consequently, to evaluate and compare the proposed framework, the authors use simulation and run a few conventional methods to assess the behavior of the model under uncertainty conditions, where there is only partial information. Another core subapplication that has to be mentioned in the context of power grid stability and control is the management of the power flow. This involves the balancing of supply and demand to prevent grid overloads and maintain stability. There is a need to optimize the dispatch of power from various sources, including renewable energy and storage systems, to ensure efficient power distribution and minimize losses. Dynamically adjusting the division of the power flow helps enhance system resilience against disturbances, reduce the risk of blackouts, and improve overall grid reliability. In the analysis suggested in [133], the authors highlight the importance of fast and accurate real-time corrective control actions on the power flow to ensure system security and reduce costs. The authors propose a method to derive real-time alternating current (AC) optimal power flow (OPF) solutions when considering the uncertainties induced by the varying incorporation of renewable energy sources in the system and by topological changes.
They suggest a deep reinforcement learning framework, using a PPO agent, to assist grid operators. They validate the proposed scheme on the Illinois 200-bus system with wind generation variation and topology changes to demonstrate its generalization ability. An additional study concerning optimal power flow can be found in [134]. Here, the objective is to analyze the optimal power flow of a distribution network embedded with renewable energy sources and storage devices. The analytical problem is formulated as a Markov Decision Process, and a PPO agent is trained. Using offline statistical learning on historical data and a stochastic policy, the authors aim to reduce prediction error and address the uncertainty of the environment. A comparative evaluation against double deep Q-learning and stochastic programming methods is performed, assessing the capabilities of the proposed framework. In essence, model-free RL methods have proven to be highly effective for grid stability and control applications by learning optimal control policies through direct interaction with the grid environment. These approaches are particularly advantageous in scenarios where system dynamics are too complex or unpredictable to model accurately. By continuously adapting to changing conditions and unforeseen disturbances, model-free RL can enhance grid stability, manage load balancing, and support the integration of intermittent renewable energy sources. As power systems become more decentralized and dynamic, model-free RL will play a crucial role in ensuring reliable and stable grid operation. A summary of the studies reviewed is presented in Table 8. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 17.
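As a hedged sketch of the deep Q-network ingredient common to several of the voltage-control studies above (and not the actual “Grid mind” implementation), the snippet below defines a small Q-network that maps per-bus voltage measurements to discrete control actions (e.g., tap or capacitor switching steps) and performs one standard temporal-difference update against a target network; all dimensions and the reward are illustrative.

```python
import torch
import torch.nn as nn

N_BUSES, N_ACTIONS, GAMMA = 14, 5, 0.99   # illustrative dimensions

class QNetwork(nn.Module):
    """Maps a vector of per-bus voltage magnitudes to Q-values of discrete actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_BUSES, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS))

    def forward(self, v):
        return self.net(v)

q_net, target_net = QNetwork(), QNetwork()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One illustrative update on a random mini-batch of transitions.
v = torch.rand(32, N_BUSES) * 0.1 + 0.95          # voltage magnitudes [p.u.]
actions = torch.randint(0, N_ACTIONS, (32,))
rewards = -((v - 1.0).abs().sum(dim=1))           # penalize deviation from 1.0 p.u.
v_next = torch.rand(32, N_BUSES) * 0.1 + 0.95

q_sa = q_net(v).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    td_target = rewards + GAMMA * target_net(v_next).max(dim=1).values
loss = nn.functional.mse_loss(q_sa, td_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```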

4.3. Building Energy Management

A few factors contribute to the development of statistical learning methods for energy management in buildings. First, the high dimension of the data imposes complicated dynamics, which are hard to model precisely, leading to suboptimal policies produced by model-based algorithms. Moreover, smart grids and smart metering devices, along with other technological advancements for aggregating and measuring data, provide further motivation for the utilization of data-driven methods. In this subsection, we review recent studies from the literature dealing with model-free reinforcement learning solutions to manage energy in buildings. Examining an HVAC application, the study in [137] addresses the problem of optimal control for building HVAC systems. The proposed method is based on a Q-learning algorithm and is validated with measured data from a real central chilled water system. The authors present a comparative evaluation with the baseline controller, showing an 11% reduction in the system’s energy consumption during the first cooling season of deployment compared to the rule-based method that was used until then. Alongside this, the paper in [138] addresses both demand response (DR) and HVAC problems. It focuses on optimizing the demand response problem while maintaining comfort in residential buildings. The authors develop a two-stage global and local policy search method that enhances the policy produced by a PPO algorithm. The method uses zero-order gradient estimation to search for the optimal policy globally, for which the objective must undergo smoothing filtering. The obtained smoothed policy then undergoes the opposite procedure and is fine-tuned locally to bring the first-stage solution closer to that of the original unsmoothed problem. Experiments on simulated data show that the learned control policy outperforms many existing solutions. A different perspective on addressing both demand response and residential comfort objectives in building energy management schemes is discussed in [139]. The authors aim to propose a cost-effective automation system that can be widely adopted to utilize buildings, which may be prominent energy consumers, in demand response programs. Existing optimization-based smart building control algorithms suffer from high costs due to building-specific modeling and computing resources. To tackle these issues, the paper proposes a solution using reinforcement learning, specifically an actor–critic agent. Simulations are performed with different settings, each representing a building with a different occupancy pattern. The results of the study suggest another approach for increasing the return for commercial buildings that participate in DR programs. Another DR and HVAC solution approach is suggested in [140], where the authors address demand-side scheduling in a residential building to reduce electricity costs while maintaining resident comfort. The schedule optimization is performed online, and it is controlled by a deep reinforcement learning model, combining deep Q-learning and deep policy gradient algorithms. The proposed approach was validated on the large-scale Pecan Street Inc. database, and it includes multiple features that hold information about photovoltaic power generation, electric vehicles parked in the building’s parking lot, and various building appliances.
Moving forward, in [141], the authors develop a data-driven approach based on deep reinforcement learning, using a Q-learning agent, to address the scheduling problem of HVAC systems in residential buildings. The main objective of the work is to reduce energy costs. To assess the performance of their algorithm, the authors performed multiple simulations using the “EnergyPlus” tool. Experiments demonstrate that the proposed method shows a noticeable reduction in energy costs when compared with the traditional rule-based technique while still satisfying the required comfort conditions. Similarly, in [142], the main objective is to reduce the energy costs of an HVAC system in a commercial building while considering the changing occupancy pattern, which introduces uncertainty. They describe a Markov game formulation to represent the energy cost minimization, and then they devise an RL-based algorithm in a multi-agent framework that leverages an attention mechanism to avoid the need for any prior knowledge of the thermal dynamics. Experiments on real-world data show the effectiveness, robustness, and scalability of the proposed algorithm. Continuing the same line of thinking, ref. [143] addresses a similar problem. In this case, the authors propose a different technique, which relies on a deep Q-learning method and utilizes weather forecasting data. The proposed method is called “WDQN-temPER”, and it consists of a combination of the deep Q-network algorithm and a gated recurrent unit model that predicts the future outdoor temperature, whose output is used as a state variable in the RL model. They experimentally demonstrate in the “EnergyPlus” software [144] with simulated data that their proposed model outperforms a rule-based baseline model in terms of HVAC control, with energy savings of up to 58.79%. Moreover, paper [145] approaches the HVAC problem by optimizing a cost minimization function that jointly considers the energy consumption of the HVAC and the occupants’ thermal comfort. In the study, the authors propose a model that consists of two submodules. The first one is a deep feed-forward neural network for predicting the occupants’ thermal comfort, and the second one is a deep deterministic policy gradient algorithm for learning the optimal thermal comfort control policy. The environment is simulated under different conditions to evaluate performance across settings, and the proposed method indeed shows improved results when compared to the previous rule-based algorithm. Combining energy management with residential comfort objectives, the researchers who conducted study [146] consider a multi-objective control problem of energy management in a residential building, trying to jointly optimize occupant comfort, energy use, and grid interactivity. They utilize an efficient RL control algorithm based on a PPO agent and address a few major drawbacks associated with RL models, including the need for large amounts of training data, long training times, and unstable behavior during the early training process, where the emphasis lies on exploratory actions, which makes such models infeasible for real-time application in building control tasks. To address this issue, imitation learning is used along with reliance on a policy transferred from a reliable rule-based model to guide the agent in the first crucial stages. This approach showed high performance and fast running time in comparison to some rule-based models under simulated data.
Moreover, this technique successfully prevented the unstable early exploration behavior. Multi-energy system (MES) control applications refer to integrated systems that combine multiple forms of energy (such as electricity, heat, and gas) to optimize their generation, storage, and consumption. By leveraging the synergies between different energy carriers, MESs aim to improve overall efficiency, reliability, and sustainability in energy management. For instance, in work [112], the authors minimize the MES cost by optimizing its scheduling, finding an optimal energy usage for the MES. They introduce a deep deterministic policy gradient (DDPG) agent with a prioritized experience replay strategy for improving sample efficiency, which relies neither on knowledge about the system nor on statistical knowledge from forecasting. A different perspective is presented in paper [147], which addresses both electric water heater management and DR. Here, the authors harness electric water heaters for energy storage in buildings to address the residential demand response control problem. In the work, they propose a reinforcement learning method, using an autoencoder network and a fitted Q-iteration algorithm to handle the stochastic and nonlinear dynamics of the electric water heaters. The experimental results show that the proposed method reduces the total energy cost of the electric water heater by 15% compared to the old rule-based method. In conclusion, for energy management in buildings, model-free RL methods offer a flexible and adaptive solution by learning from real-time energy consumption data and occupant behavior patterns. Without needing a predefined model of the building’s dynamics, these methods can autonomously adjust energy usage to optimize for cost savings and comfort. This makes them particularly suitable for managing the energy demands of smart buildings, which may have complex interactions and varying usage patterns. As the adoption of smart building technologies increases, model-free RL will be key to enabling efficient and responsive energy management strategies. A summary of the studies reviewed is presented in Table 9. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 18.
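To illustrate the forecast-augmented state idea behind approaches such as [143], the hedged sketch below trains a small GRU on a synthetic outdoor-temperature series and appends its one-step-ahead prediction to the RL agent’s state vector; the architecture, data, and state definition are placeholders and do not reproduce the authors’ “WDQN-temPER” model.

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic hourly outdoor temperature with a daily cycle.
t = np.arange(24 * 30, dtype=np.float32)
temp = 15 + 8 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 0.5, t.size).astype(np.float32)

# Windows of 24 past hours -> next-hour temperature.
X = np.stack([temp[i:i + 24] for i in range(len(temp) - 24)])[:, :, None]
y = temp[24:]

class TempForecaster(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        _, h = self.gru(x)             # h: (num_layers, batch, hidden)
        return self.head(h[-1]).squeeze(-1)

model = TempForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
X_t, y_t = torch.from_numpy(X), torch.from_numpy(y)
for _ in range(50):                     # short illustrative training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_t), y_t)
    loss.backward()
    opt.step()

# Augment the HVAC agent's state with the forecast before choosing an action.
indoor_state = np.array([22.5, 0.4], dtype=np.float32)   # e.g., zone temperature, occupancy
with torch.no_grad():
    forecast = model(X_t[-1:]).item()
rl_state = np.concatenate([indoor_state, [forecast]])
print("augmented state:", rl_state)
```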

4.4. Electric Vehicles

Model-free reinforcement learning methods are also increasingly crucial in electric vehicle (EV) applications due to their ability to learn and adapt to dynamic environments without requiring explicit models of the system dynamics. These methods offer promising solutions to various challenges associated with EV charging and energy management, optimizing performance while reducing computational complexity. This review highlights several studies that showcase the effectiveness of different RL algorithms in enhancing EV-related tasks, such as charging time minimization, cost reduction, and real-time scheduling. For example, in study [153], the authors aim to minimize EV charging times using a deep learning method, achieving a charging time reduction of 60% compared to the best classical methods available. In another work [154], the objective is to reduce operating costs. The researchers leverage a deep reinforcement learning algorithm, using a deep Q-learning agent, to optimize the utilization of photovoltaic systems and battery storage. This model proves to be more effective than simpler DQL models by reducing costs by 15% and increasing the utilization rate of PV by 4%. Furthermore, a different perspective on real-time pricing and scheduling control for charging stations is explored in paper [155]. In the work, the authors apply the SARSA algorithm, which demonstrates higher profitability for charging stations while maintaining low average charging prices. This approach also offers computational efficiency compared to other advanced algorithms, which are presented in the evaluation section. The work discussed in [156] also targets cost reduction as its main objective, optimizing costs using a genetic algorithm. The proposed model is utilized to determine optimal charging fees, framing the problem as a bi-level optimization involving both power distribution and urban transportation networks. Although this model performs well in smaller networks, it has some limitations when applied to larger-scale settings. Another approach presented in study [157] introduces a dynamic pricing model that integrates quality-of-service considerations. Using an actor–critic model, this method achieves results comparable to dynamic programming solutions but with significantly faster computation times. In contrast, concerning the optimal control problem that arises from EV charging schedules, work [158] applies a deep Q-learning model to decide which EVs to charge based on known departure times and energy demands. Tested on real-world data, this model outperforms uncontrolled scenarios by up to 37% and approaches optimal performance, though it struggles with scalability over long time frames. In work [159], the authors aim to suggest an optimal battery management policy that allows for longer ranges and battery preservation. They design a simplified car model and utilize a DQL agent for action selection, combined with model predictive control (MPC) for actual control, showing slight improvements in range and battery usage, particularly in challenging driving conditions like uphill driving. A separate study, presented in [160], aims to increase photovoltaic self-consumption and the EV state of charge at departure. The study compares three deep reinforcement learning (DRL) models, including DDQN, DDPG, and parameterized DQN, with model predictive control for simple charging control involving building energy needs and PV utilization.
While these DRL models perform well in terms of time efficiency, their performance is similar to that of rule-based control methods in finding the operating point that balances the PV surplus remaining after meeting the building demands against the available charge capacity of the EV battery. Additionally, a demand response scheme is analyzed in work [161]. Here, the main focus lies on which devices to turn on to stabilize the grid in terms of power conservation while keeping user satisfaction in mind. This model ranks devices based on priority and measures total dissatisfaction scores, balancing demand management with user preferences. Lastly, in work [162], a simple energy management scenario involving appliances like air conditioners, washing machines, and energy storage systems is analyzed. Using weather forecasts and a Q-learning agent, the model achieves a 14% reduction in electricity costs under dynamic pricing scenarios. Overall, these studies highlight the versatility and potential of model-free RL methods in EV applications, offering substantial improvements in operational efficiency, cost-effectiveness, and computational simplicity. A summary of the studies reviewed is presented in Table 10. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 19.

4.5. Energy Storage Management

Model-free reinforcement learning methods are becoming increasingly important for energy storage management due to their ability to learn optimal policies directly from interactions with the environment without requiring explicit models of system dynamics. These approaches offer flexibility and adaptability in complex and uncertain energy management scenarios, making them highly suitable for integrating renewable energy sources and enhancing grid stability. By leveraging real-time data and continuously improving their decision-making strategies, model-free RL techniques can optimize energy usage, reduce costs, and manage storage systems more effectively, addressing the challenges of modern energy systems. The papers collectively explore advanced control and optimization techniques across energy systems, microgrids, and hybrid vehicles, using methods such as deep reinforcement learning and Particle Swarm Optimization (PSO). Consider, for example, paper [97], which introduces an Adaptive Deadbeat (ADB) controller, optimized with PSO, for managing frequency stability in Scottish power systems with high renewable energy penetration, showing superior performance over traditional controllers. Another study [163] presents a DRL-based framework for optimizing battery energy storage arbitrage, incorporating a lithium-ion battery degradation model and a hybrid CNN-LSTM architecture to predict electricity prices. This model improves profitability and manages battery health efficiently. Further research, such as [164], employs double deep Q-learning to optimize community battery energy storage systems within microgrids, reducing bias and stabilizing learning, particularly under fluctuating market conditions. Alternatively, in work [165], a Q-learning-based strategy is also introduced for optimizing community battery storage, utilizing an enhanced epsilon-greedy policy to balance exploration and exploitation, optimizing performance in both grid-connected and islanded modes. In another paper [166], the objective is to propose a trading policy in the energy market to incentivize private producers that use renewable energy sources to participate fairly in the energy trading market. The authors tailor a DRL model for modeling prosumer behavior in local energy trading markets using a modified deep Q-network technique, enhancing learning through experience replay. Examining another application, in the study presented in [167], an actor–critic architecture enhanced by deep deterministic policy gradient is employed to optimize battery energy storage systems (BESSs) for power system frequency support, minimizing operational costs while maintaining stability. Looking at a different perspective, paper [168] introduces a multi-agent DRL method for optimizing energy management in microgrids, where agents share Q-values to achieve correlated equilibrium, significantly enhancing profitability. Additionally, in study [163], the authors propose a DRL with a noisy network approach to optimize battery storage arbitrage, combining CNN-LSTM for market forecasting and balancing financial gains with long-term operational costs. Moreover, study [169] employs a deep Q-network for dynamic energy management, integrating real-time data to minimize operating costs and ensure stable microgrid operation. Finally, work [170] integrates Markov chain models with reinforcement learning to optimize energy management in hybrid vehicles, demonstrating significant improvements in energy efficiency. 
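As a minimal, hypothetical sketch of the tabular Q-learning style storage strategies discussed above (for instance, the epsilon-greedy scheduling of [165]), the snippet below learns when to charge, idle, or discharge a battery under a synthetic two-level price signal with a decaying exploration rate; states, prices, and rewards are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_soc, n_price, n_actions = 11, 2, 3      # SoC levels, {low, high} price, {discharge, idle, charge}
Q = np.zeros((n_soc, n_price, n_actions))
alpha, gamma = 0.1, 0.95

def step(soc, price, action):
    """Toy storage environment: action 0 discharges, 1 idles, 2 charges one SoC level."""
    delta = action - 1
    new_soc = int(np.clip(soc + delta, 0, n_soc - 1))
    price_value = 20.0 if price == 0 else 60.0
    reward = -price_value * (new_soc - soc)       # pay to charge, earn by discharging
    new_price = int(rng.integers(0, n_price))     # prices flip at random in this toy model
    return new_soc, new_price, reward

soc, price = 5, 0
for episode_step in range(50000):
    eps = max(0.05, 1.0 - episode_step / 20000)   # decaying epsilon-greedy exploration
    if rng.random() < eps:
        a = int(rng.integers(0, n_actions))
    else:
        a = int(np.argmax(Q[soc, price]))
    new_soc, new_price, r = step(soc, price, a)
    td = r + gamma * Q[new_soc, new_price].max() - Q[soc, price, a]
    Q[soc, price, a] += alpha * td
    soc, price = new_soc, new_price

print("greedy action at mid SoC, low price :", int(np.argmax(Q[5, 0])))   # typically: charge (2)
print("greedy action at mid SoC, high price:", int(np.argmax(Q[5, 1])))   # typically: discharge (0)
```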
In summary, these studies highlight the use of model-free RL approaches in optimizing energy storage management across various applications, from residential settings to large-scale industrial systems. By learning directly from the environment, these methods can adapt to changing conditions and uncertainties, offering robust solutions for enhancing energy efficiency and cost-effectiveness. As model-free RL techniques continue to evolve, they hold significant potential for advancing sustainable energy management and supporting the integration of renewable energy sources into the grid. A summary of the studies reviewed is presented in Table 11. The distribution of the different algorithms in the reviewed papers for this application is depicted in Figure 20.

4.6. Prominent Trends

Reviewing the summaries presented in Table 7, Table 8, Table 9, Table 10 and Table 11, it is clear that model-free reinforcement learning has shown considerable promise across various domains, with each application showcasing unique trends and challenges. In power grid stability, particularly voltage control, there is a noticeable trend towards using stochastic policies. This approach leverages the simple underlying statistics that are inherent to these control problems. Conversely, applications like electric vehicle charging schedules often employ discrete state spaces with deterministic policies, as may be viewed in Figure 21, due to the complex statistical relations and the unpredictable nature of influencing factors such as weather conditions and driver behavior. In addition, the distribution of the different algorithms among the different applications for this paradigm in the reviewed papers is depicted in Figure 22. These discrepancies underscore the challenge of statistical learning when the uncertainty of the environment is high, often leading to suboptimal policies produced by model-free RL models. This complexity in the underlying statistical information highlights the need for flexible model-free RL methods that can generalize well under such uncertainties. Another point worth mentioning is the clear standardization observed in the domain of voltage control. As above, this positive example further emphasizes the dire need for similarly standardized tools in other applications to conduct qualitative evaluations and promote extensive, in-depth research of new methods and algorithms for various optimal control problems. Furthermore, it is clear that in order to learn such complicated statistical relations, there is a need for large amounts of data. Currently, this is a major challenge in many domains of power systems, since the technological advances that allow data metering and aggregation are relatively new, and incorporating them into existing systems not only poses a major technological challenge but also subjects the power grid to security threats. In energy management for buildings, one might expect similarities with residential energy management due to apparent similarities in objectives. However, empirical evidence suggests that even different households may require entirely different policies, highlighting the variability in dynamic models. For example, the energy consumption patterns of a person living in a small apartment can vary dramatically depending on their work schedule. This variability makes it nearly impossible to learn and predict consumption profiles accurately ahead of time, thereby necessitating model-free RL algorithms capable of dynamically adapting to these changes. Consequently, model-free RL approaches for building energy management need to be flexible enough to account for diverse and evolving consumption patterns. Due to the rising complexity of optimal control problems in the power systems field of research and the high dimensions of these problems, statistical learning emerges as a powerful and promising tool. Hence, there is a need for further exploration and innovation in applying model-free RL to the multifaceted challenges of energy management.

5. Comparison and Discussion

Generally speaking, based on Section 3, model-based reinforcement learning methods have the potential to outperform model-free methods in power system control problems when accurate system models are available and computational resources are sufficient. This is because model-based methods explicitly incorporate system dynamics, enabling more efficient policy learning with fewer interactions. Additionally, for tasks requiring long-term planning, such as energy storage scheduling, model-based methods may utilize model behavior predictions to plan over extended horizons. However, their performance hinges on the accuracy of the system model; thus, if the model is overly simplified or does not sufficiently match reality, the policies may degrade and become unreliable. In these cases, statistical learning may be used, possibly resulting in more accurate solutions, as discussed in Section 4. Furthermore, when considering the complexity aspect of power system control problems, in some scenarios, model-free methods may be more computationally intensive during training due to the need for extensive trial-and-error interactions with the environment, especially when dealing with complex, high-dimensional power system control tasks. However, once trained, they are typically faster to deploy since the learned policy is directly used without further computation. In contrast, model-based methods require significant computational resources for constructing and updating the system model, performing model rollouts, and planning during both training and deployment phases. This can become a bottleneck when scaling to large power systems with extremely large numbers of states and actions or when using complex nonlinear models. Nonetheless, model-based methods often require fewer environment interactions, making them more sample-efficient and scalable when the cost of real-world data acquisition is high. Therefore, the trade-off between model-free and model-based methods often depends on the specific application, the available computational resources, and the feasibility of obtaining an accurate system model. In this light, there is great potential in developing hybrid approaches that combine model-based and model-free reinforcement learning methods, which can offer improved performance and robustness by leveraging the strengths of both paradigms. For example, hybrid methods can use a model-based component to generate synthetic experiences, guiding the policy towards promising regions of the state–action space, while relying on a model-free component for fine-tuning and handling discrepancies between the model and real-world dynamics (a minimal sketch of this idea is given at the end of this section). This can lead to faster convergence and better generalization, particularly in environments where model inaccuracies exist but cannot be completely eliminated. Moreover, hybrid methods can dynamically switch between model-based planning and model-free control depending on the context, achieving a balance between exploration, exploitation, and computational efficiency. This flexibility makes hybrid approaches particularly appealing for power system control tasks that require both strategic long-term planning and real-time decision making under uncertainty, such as adaptive load shedding or voltage stability control in rapidly changing grid conditions. There are many intriguing and important topics that may be discussed based on the presented broad literature review.
One such topic is the design of energy trading strategies in markets with high renewable energy penetration, which poses a challenge due to the high degree of uncertainty resulting from the intermittent and unreliable nature of these energy sources, the variation in demand patterns, and fluctuating energy prices, as discussed in Section 3.1 and Section 4.1. In such scenarios, RL agents can be trained to forecast market prices and renewable generation, allowing them to develop optimal bidding strategies that account for the volatility of supply and demand. For instance, a power producer with significant renewable assets can use RL to maximize profits by learning to schedule energy sales when prices are high or to store energy when prices are low, all while considering grid constraints and penalties for failing to comply with safety constraints. Furthermore, RL methods can handle multi-agent scenarios, in which multiple market participants with competing objectives interact, enabling cooperative or competitive trading strategies. However, the dynamic and stochastic nature of energy markets poses challenges for RL, as the agents need to adjust to rapid changes in market conditions and evolving regulatory policies, which may lack a specific statistical structure and may be hard to learn. Training such agents requires large amounts of historical data or high-fidelity simulations, making data scarcity and model accuracy critical considerations.
Moreover, when considering grid stability and control in distribution networks with distributed energy resources (DERs), reinforcement learning solutions can aid the stability and management of the grid by providing adaptive and decentralized control strategies. RL-based controllers have the potential to help optimize voltage profiles, manage reactive power, and coordinate DER operations to maintain grid stability under fluctuating conditions. RL is designed to learn from the complex interactions between DERs, loads, and the network, making it suitable for managing power flows in active distribution systems, as presented in Section 3.2 and Section 4.2. The possible benefit is a more resilient and self-regulating grid that can dynamically respond to disturbances such as rapid changes in solar or wind output. However, using RL in this context also presents challenges, such as ensuring the stability and safety of the learning process, since poorly trained policies could inadvertently destabilize the grid. Moreover, the high dimensionality and nonlinear nature of distribution networks complicate the learning process, requiring sophisticated exploration strategies and well-defined reward functions to guide the agent toward safe and reliable control behaviors.
Another interesting topic, emerging from Section 3.3 and Section 4.3, concerns buildings with varying occupancy patterns. Reinforcement learning-based energy management strategies can potentially improve energy efficiency in such buildings by dynamically adjusting heating, ventilation, and air conditioning (HVAC) operations. Traditional building energy management systems often rely on preprogrammed schedules or simple rule-based control, which may not effectively handle complex occupancy variations. In contrast, RL agents can learn optimal control policies that minimize energy consumption while maintaining occupant comfort by considering real-time occupancy data, external weather conditions, and energy price signals.
By predicting occupancy patterns and adapting the control strategy accordingly, RL can reduce unnecessary energy usage during periods of low occupancy and shift energy consumption to off-peak hours when possible. However, implementing RL in building management comes with challenges such as ensuring convergence to stable and efficient policies, dealing with the variability in human behavior, which may hinder the statistical learning process, and integrating with existing building management systems. Additionally, the need for a reliable feedback mechanism to capture occupant comfort preferences complicates the reward design.
The analysis presented above reveals a few interesting trends. First, it clearly shows that reinforcement learning methods are gaining increasing interest and are being widely deployed for various optimal control problems in power systems. Since 2010, we have seen steady growth in the number of publications concerning both model-based and model-free approaches for various applications, as shown in Figure 23. It is important to note that, when searching for model-based papers, there is a lack of consensus regarding the definition of this approach. In many cases, researchers use “model-based” to describe traditional optimal control solutions such as MPC. Indeed, the distinction is not entirely clear, since one specific family of reinforcement learning solutions is based on a known and given model, which reduces the problem to a planning problem, as described in Figure 5. In addition, because the terms “model-based” and “reinforcement learning” are often written separately, in different parts of a sentence, the search strings presented in Table 12 were designed to account for this and to ensure an efficient search. The distribution of publications across the various applications is quite similar for the two paradigms, as may be seen in Figure 24. Similarly, Figure 25 shows that variations of the “Q-learning” algorithm are by far the most popular; after these, “actor–critic” variations are the second most used in model-based approaches, while in model-free approaches it is the “policy gradient” algorithms. Moreover, “Q-learning” dominates the algorithms used in data-driven approaches, as may be seen in Table 13 or visually in Figure 26.
To summarize, the comparison highlights the growing popularity of reinforcement learning in power system applications, emphasizing the connection between the various applications, the arising MDP formulations, and the selected reinforcement learning algorithms. These trends underscore the field’s evolving dynamics and the adaptability of reinforcement learning algorithms in addressing complex optimal control challenges in power systems, while also pointing to future possibilities.

6. Challenges and Future Research Work

6.1. Challenges

6.1.1. Limited Real-World Data and Standardized Tools

One major challenge when utilizing reinforcement learning methods in control problems for power systems is the limited availability of real-world data [172,173]. Typically, reinforcement learning problems require vast amounts of data and repeated interactions with the environment to produce high-fidelity predictions and avoid injecting bias during training. Currently, only a limited amount of data may be collected from real-world power systems, which impairs performance and limits the generalizability and robustness of the models. Therefore, it is highly interesting to find solutions that can provide performance guarantees despite the limited amount of data. Alternatively, it is interesting to find ways to generate high-quality data that can be used for benchmarking.
With great caution, we also claim that the power systems area of research somewhat lags behind, in terms of available data and standard datasets, when compared to other disciplines such as computer vision or natural language processing [174,175]. The major challenge is due to the stochastic nature of the environment and the complex dynamics of the physical systems involved. The uncertainty is high, which degrades the performance of the models. Moreover, most existing solutions do not offer a closed, systematic, in-depth formulation of the physical considerations of the network. This lack of standardization hinders the ability to consistently measure and compare the performance of different reinforcement learning approaches, making it difficult to gauge their effectiveness and reliability in real-world applications. Furthermore, the diversity and complexity of power systems, coupled with the need for high levels of safety, stability, and efficiency, require robust testing environments that accurately reflect operational conditions. Without standardized benchmarks, researchers and practitioners face difficulties in replicating results, validating models, and advancing the state of the art in reinforcement learning applications for power systems.

6.1.2. Limited Scalability, Generalization, and the Curse of Dimensionality

The limited generalization ability of current reinforcement learning solutions arises from numerous factors. Model-free configurations, which rely on large-scale datasets, suffer from a lack of data or use data that were validated only on small-scale, simplified systems. Conversely, model-based methods often converge to a dynamic model that does not accurately represent the real-life system, which affects the resilience, robustness, flexibility, and generalization ability of the model. This hurdle degrades the ability to handle new samples and prevents the adoption of such solutions by power system operators. An extensive discussion is presented in [176,177,178].
Extending this idea, the curse of dimensionality in reinforcement learning refers to the exponential growth in computational complexity and data requirements as the number of state or action variables increases. In model-based reinforcement learning, this issue arises because constructing and solving a model that captures all possible state transitions and rewards becomes infeasible with high-dimensional state spaces. On the other hand, model-free reinforcement learning methods, which learn policies or value functions directly from interactions with the environment, also struggle as the amount of experience needed to accurately estimate values or policies scales exponentially with the number of dimensions. This leads to slower learning and the need for more data to achieve reliable performance. Both approaches require sophisticated techniques such as function approximation, dimensionality reduction, or hierarchical learning to manage the complexity and make high-dimensional problems tractable. To understand the extent of the problem, consider any optimal control problem of storage devices. As mentioned in [45], the dimensions of the problem grow exponentially with the number of storage devices, as illustrated in Figure 27.
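To get a feel for the scale of the problem, consider a simple back-of-the-envelope calculation (the numbers below are hypothetical and are not taken from [45]): if the state of charge of each of $N$ storage devices is discretized into $K$ levels, the joint state space contains

$$ |\mathcal{S}| = K^{N} \quad \text{states, e.g.,} \quad K = 10,\; N = 20 \;\;\Rightarrow\;\; |\mathcal{S}| = 10^{20}, $$

so even a modest fleet of devices is far beyond what tabular methods can enumerate, which motivates the approximation and decomposition techniques mentioned above.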

6.1.3. Limited Robustness and Safety for Real-Time Applications

Reinforcement learning methods, both model-based and model-free, often exhibit limited robustness in power system control problems [176,177,178]. Model-based reinforcement learning relies on accurate system models, and any discrepancies between the model and the real system can lead to suboptimal or unsafe control actions. This lack of robustness to modeling errors or system changes can significantly impact performance. Model-free reinforcement learning, which learns directly from interaction with the environment, can struggle with the variability and uncertainty inherent in power systems. It requires extensive and diverse training data to generalize well, but even then, it might not handle unseen scenarios or rare events effectively. Both approaches need mechanisms for robust adaptation to changing conditions, such as continuous learning, domain adaptation, or integrating domain knowledge to enhance reliability and safety in dynamic power system environments.
In addition, implementing learning algorithms in the power systems domain for real-time applications presents numerous challenges. The dynamic and stochastic nature of power systems demands that reinforcement learning models continuously adapt to fluctuating conditions and uncertainties, such as variable renewable energy sources and unexpected equipment failures. These algorithms must make real-time decisions with limited data and computational resources in high-stakes environments where errors can lead to significant financial losses, equipment damage, or blackouts. The need for immediate, reliable decision making requires reinforcement learning models to learn quickly while ensuring high accuracy and robustness. Additionally, integrating reinforcement learning into existing control systems, which rely on deterministic methods, poses significant challenges, especially given the stringent regulatory and safety standards governing power systems operations. Moreover, learning algorithms deployed in real time must continuously learn and adapt as new data become available, requiring advanced techniques to maintain performance and prevent “catastrophic forgetting”. Ensuring the stability and convergence of these algorithms in a dynamic environment is complex, highlighting the need for ongoing research to develop robust, adaptable, and safe reinforcement learning solutions for real-time power system applications. Different works addressing this challenge have emerged in recent years, such as [69,129,133]. A summary of the challenges is presented in Table 14.

6.1.4. Nonstationarity and Time-Variant Environments

Under nonstationary and time-varying dynamics in power systems, the optimal policies may shift over time as system conditions change. To address these challenges, reinforcement learning algorithms may incorporate several concepts from other disciplines, such as signal processing and classical control theory. For example, concepts such as hierarchical decomposition, adaptive learning mechanisms, or domain-specific knowledge may aid in handling the nonstationarity of the environment. One approach is to decompose the complex dynamics into different hierarchical levels, such as varying time scales, frequency scales, or operational modes. By separating the system’s behavior into these distinct levels, RL can learn specialized policies for each subprocess, effectively addressing the varying nature of the problem. Recurrent neural networks and long short-term memory architectures can further enhance this approach by processing sequential data, enabling the RL agent to capture temporal dependencies and better adapt to evolving grid conditions. Additionally, online learning with adaptive parameters of the transition network allows the agent to continuously update its knowledge, making it capable of handling the ever-changing environment of power systems; a minimal example of such an adaptive update is sketched below.
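As a minimal sketch of this adaptive, online-learning idea (an illustrative example rather than a method from the reviewed literature), a constant step size in the Q-learning update acts as an exponential forgetting factor, so that recent transitions gradually dominate the value estimate and outdated dynamics are forgotten:

```python
def adaptive_q_update(Q, s, a, r, s_next, actions, alpha=0.2, gamma=0.9):
    """One online Q-learning step with a constant step size.

    A constant alpha yields an exponential recency-weighted average of the
    targets, so older (possibly outdated) experience is gradually forgotten,
    a simple way to track slowly drifting power system dynamics.
    """
    best_next = max(Q.get((s_next, a_), 0.0) for a_ in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```

Keeping alpha constant rather than decaying it to zero means the estimates never fully converge, but it allows them to track a drifting environment.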
In some cases, domain knowledge such as symmetries, conservation laws, and the periodic behavior of the dynamics can be integrated into the RL framework. These properties may be used to reduce the dimensionality of the state–action space and even to incorporate knowledge about the possible variations of the signal due to known noise types, which can help in coping with the inherent uncertainty typical of power systems. Furthermore, techniques like the dq0 transformation (one common form of which is given at the end of this subsection) may also reduce the dimensionality of the state–action space, so that the RL agent can operate in a simplified domain where the system exhibits linear or time-invariant properties. This domain-informed guidance can help streamline the policy search and hopefully improve both efficiency and robustness. Another idea worth considering is to design the transition function in the RL model to evolve over time, reflecting the changes in the system dynamics. In addition to all of the above, the balance between exploration and exploitation must be considered. Since the dynamics change over time, the model may need to forget outdated information and continue exploring to gather new information; nevertheless, when there is some correlation between the states of the environment, utilizing previously explored information may also aid convergence. Finally, online learning may be used to allow continuous updates to the model, since constant interaction with the environment might be imperative; however, the time scale of this interaction depends on the rate at which the dynamics of the environment evolve.
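For reference, one common (amplitude-invariant) form of the dq0 transformation mentioned above maps the three-phase quantities $x_a, x_b, x_c$ into a frame rotating with angle $\theta$:

$$
\begin{bmatrix} x_d \\ x_q \\ x_0 \end{bmatrix}
= \frac{2}{3}
\begin{bmatrix}
\cos\theta & \cos\left(\theta - \frac{2\pi}{3}\right) & \cos\left(\theta + \frac{2\pi}{3}\right) \\
-\sin\theta & -\sin\left(\theta - \frac{2\pi}{3}\right) & -\sin\left(\theta + \frac{2\pi}{3}\right) \\
\frac{1}{2} & \frac{1}{2} & \frac{1}{2}
\end{bmatrix}
\begin{bmatrix} x_a \\ x_b \\ x_c \end{bmatrix}.
$$

Under this transformation, balanced sinusoidal three-phase signals become approximately constant quantities, giving the RL agent a lower-dimensional, slowly varying state representation.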

6.1.5. Reward Shaping and Global Objectives in Power Systems

The choice of reward functions and optimization objectives in reinforcement learning, in general and particularly for power system control, may significantly influence the effectiveness and efficiency of the learned policies [179]. Different reward functions can lead to diverse behaviors and performance outcomes of the learned policy, and several aspects should be considered when designing proper reward functions. One major drawback is the ambiguous and delayed nature of the training feedback used to evaluate states. Ambiguity stems from the stochastic nature of the problem, where both the next state and the resulting reward are unpredictable; a higher degree of variance in either of them reduces the confidence in the utility estimates. The phenomenon in which there is a time delay between a decision and its true return is called “delayed feedback”, and it often requires intermediate actions before an accurate evaluation of the decision’s utility can be made.
Thus, using local rewards to guide the training process may aid an efficient policy search while also helping preserve safety constraints; one standard construction for doing so is recalled below. It is important to note that local utilities need to align with the global objectives, meaning that local decisions should also be evaluated based on their contribution to the overall system performance. In essence, local utilities should prioritize actions that ultimately improve global utility. In terms of the number of decisions an agent makes before receiving accurate feedback, also referred to as the reward horizon, adding these local utilities shortens the horizon in practice, effectively reducing the time lag between making a decision and receiving feedback on its effectiveness. This immediate feedback may help the agent focus on relevant areas of the state space, potentially ignoring irrelevant parts. Such focused exploration has the potential to reduce learning time and improve the generalization abilities of the agent, as the reward horizon defines a local region within which enough information is available to make near-optimal decisions for the entire process.
As a final remark, an intriguing opportunity lies in utilizing domain knowledge for reward shaping. However, in many cases the available prior knowledge is qualitative or ambiguous, and it is hard to translate the abstract idea into lower decision levels with precise rewards. This often results in a variety of potential reward functions, and the main question is how to find the function, or set of functions, that reliably represents both the high-level objectives and the local utilities. To combine the aggregated experience with the existing prior knowledge, a possible approach is to shape the reward dynamically, allowing the agent to continuously adjust the reward function as it refines its understanding of the desired behaviors.
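One well-established way to add such local guidance without changing which policies are optimal is potential-based reward shaping, in which the environment reward is augmented with the discounted difference of a potential function $\Phi$ defined over states (the particular choices of $\Phi$ suggested here are only illustrative):

$$ r'(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s). $$

Here, $\Phi(s)$ could, for example, encode a local utility such as the negative voltage deviation at a bus or the distance of a state of charge from its target band. Because the shaping terms telescope along any trajectory, the ordering of policies, and hence the optimal policy, is preserved, while the agent receives more immediate feedback.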

6.2. Future Work

6.2.1. Explainability

Reinforcement learning for power system control problems is a rapidly evolving field of research. It provides powerful tools for analyzing complex problems while reducing computation time and is highly promising for future applications. Nonetheless, one major drawback of these algorithms, model-based as well as model-free, is their low interpretability for humans, due to the large scale of the underlying networks. The internal dynamics and the decision-making processes of these algorithms are poorly understood, even by domain experts, since a network may consist of dozens of layers and millions of parameters connected by nonlinear functions. Moreover, the design of the architecture demands multiple experiments and is considered more a form of art than a rigorous, methodical process. Users often perceive the “black-box” nature of reinforcement learning models as unreliable and do not always trust their predictions; therefore, they may be reluctant to use them.
This challenge may be addressed by increasing the interpretability of the models using explainable artificial intelligence, which would allow researchers, developers, and users to better understand the outcomes of reinforcement learning models while preserving their performance and accuracy, as illustrated in Figure 28. This would add transparency to the operational mechanism of the models and would allow them to be improved, even in cases where the analysis procedure was faulty but produced the correct result. This concept has started to emerge in the recent literature, as may be seen in [180,181].

6.2.2. Neural Architecture Search

Regardless of their remarkable performance, the design of neural network architectures is in itself a major challenge. Owing to the complex structure of the network and the need to tune millions of parameters, designing neural networks remains something of an art that often relies on a trial-and-error process. Furthermore, the design and tuning of the network require significant computational power and often come with no tight theoretical bounds on computing time and performance. These facts limit the integration of machine learning in general, and reinforcement learning in particular, into practical systems. In light of this challenge, there is a great need to develop new, efficient methods, requiring low computational resources, for planning and designing neural network architectures. A potential solution may be found in Neural Architecture Search [182,183,184], a family of techniques that aims to discover a well-performing neural network architecture for a specific task with little human intervention.
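As a minimal illustration of the idea (a toy random-search sketch under our own assumptions, with a hypothetical evaluate routine standing in for training and validating an RL agent with a given architecture):

```python
import random

def random_architecture_search(evaluate, trials=50, seed=0):
    """Toy NAS sketch: sample simple MLP architectures and keep the best one.

    evaluate(arch) is a hypothetical user-supplied routine that trains an RL
    agent with the given architecture and returns a validation score.
    """
    rng = random.Random(seed)
    search_space = {
        "n_layers": [1, 2, 3, 4],
        "width": [32, 64, 128, 256],
        "activation": ["relu", "tanh"],
    }
    best_arch, best_score = None, float("-inf")
    for _ in range(trials):
        arch = {k: rng.choice(v) for k, v in search_space.items()}
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```

Practical NAS methods replace this random sampling with gradient-based, evolutionary, or RL-driven search, but the overall loop of proposing, evaluating, and selecting architectures is the same.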

6.2.3. Physics-Informed Neural Networks

Although model-free methods exhibit remarkable performance in various power system applications, they do not take into account the underlying physical constraints and dynamics of real-world systems, which may be crucial for a more accurate and robust solution. To address this gap, an interesting field of study is the incorporation of physics-informed neural networks (PINNs) into reinforcement learning models [185,186]. The difference between a PINN and a fully model-based configuration is that the PINN does not assume a perfect model of the world; instead, it integrates physical laws or dynamic equations into the reinforcement learning framework, guiding agents towards decisions that align with the underlying physics of the system. This approach not only accelerates learning by reducing the sample complexity but also enhances the robustness and interpretability of the learned policies. Implementation involves incorporating physics-based constraints or dynamics equations as additional terms in the reinforcement learning objective function, enabling agents to learn policies that respect the laws of the environment; a schematic example is given below. This method may improve the learning efficiency and generalization ability of the model.
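As a schematic sketch of this idea (all names are hypothetical, and a real implementation would use the actual network equations within an automatic-differentiation framework), a physics-consistency penalty can simply be added to the usual learning loss:

```python
def physics_informed_loss(rl_loss, state, action, next_state_pred,
                          physics_residual_fn, weight=1.0):
    """Augment an RL (or dynamics-model) loss with a physics-consistency term.

    physics_residual_fn is a hypothetical user-supplied function returning a
    scalar residual, e.g., the violation of the active/reactive power balance
    implied by the predicted next state; driving it towards zero keeps the
    learned behavior consistent with the grid physics.
    """
    residual = physics_residual_fn(state, action, next_state_pred)
    return rl_loss + weight * residual ** 2
```

The weight controls how strongly the physics prior constrains the data-driven part of the model; larger values enforce consistency at the possible cost of fitting the observed data less tightly.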

6.2.4. Public Datasets

One of the major challenges in utilizing reinforcement learning methods for power system control problems lies in the lack of real-world, high-quality data [187,188]. Currently, the existing data mostly represent small-scale networks and may be insufficient for qualitative training or validation. Incorporating new technologies based on IoT devices to collect large-scale data from a variety of scenarios may enable reinforcement learning algorithms to converge to real-world dynamic models and to reliably represent the normal behavior of the system of concern. Nowadays, there are multiple techniques for processing information in real time and efficiently storing it in remote databases [189,190,191]. These capabilities may be leveraged to create unified and robust datasets, which would lead to more resilient and scalable models with high generalization abilities.

6.2.5. Safe Reinforcement Learning and Coping with Changing Conditions and Unforeseen Events

Safety considerations in critical power system control using reinforcement learning are paramount due to the high stakes involved in maintaining system stability and preventing failures [19,192,193]. In model-based reinforcement learning, ensuring safety involves accurately modeling the system dynamics and incorporating safety constraints directly into the optimization process to avoid unsafe actions; however, inaccuracies in the model can lead to unsafe decisions. In model-free reinforcement learning, the challenge is even greater, as the system learns directly from interaction with the environment and may explore unsafe actions during training. Safe exploration strategies, such as incorporating safety constraints into the reward function, are therefore crucial. Both approaches require extensive validation and testing in simulated environments before deployment to ensure that safety is not compromised in real-world operation.
As an illustration, one may examine Figure 29, which shows the space spanned by the set of all existing policies. This space naturally includes the optimal policy $\pi^*$ and its approximation $\tilde{\pi}$. Focusing on the subset of all policies that ensure safe system operation, we may denote by $\pi_C^*$ the optimal constrained policy that ensures stable behavior and by $\tilde{\pi}_C$ its approximation. The goal, in our opinion, should be to develop approaches that keep the training process within this safe-operation subspace.
There is growing interest in the research community in the ability of reinforcement learning algorithms to adapt to changing conditions and unforeseen events, such as faults or natural disasters, in order to ensure the robustness and safe operation of the power system under consideration despite variability in its behavior. One approach aims to learn a policy that can generalize across multiple tasks: during training, the agent is exposed to a distribution of tasks and learns a strategy that helps it adapt rapidly to new, unseen tasks with minimal additional training. This is achieved by optimizing not just for performance on a single task but also for the agent’s ability to generalize and improve across a range of similar but distinct tasks. This approach is called “meta-RL”, and it can help agents quickly adjust to new scenarios by leveraging past experience from similar situations, aiding them in handling sudden changes such as component failures or cyber-attacks. Additionally, the incorporation of “safe RL” strategies and risk-sensitive training objectives may help ensure that the learned policies prioritize safety and reliability, even under rare or extreme conditions. To handle real-world disruptions, RL agents can also be integrated with robust optimization and online learning methods, which can help them refine their policies in real time based on streaming data. Furthermore, using ensemble learning or uncertainty quantification techniques can potentially improve the agent’s ability to identify and respond to unforeseen events.
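Formally, the setting illustrated in Figure 29 is often cast as a constrained Markov decision process [37,38], in which the expected return is maximized subject to a bound on an expected cumulative safety cost (the notation below is the standard constrained-MDP formulation and is not tied to any specific reviewed work):

$$
\pi_C^{*} \in \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d,
$$

where $c$ penalizes, for example, voltage or frequency limit violations and $d$ is the allowed safety budget. Lagrangian and constrained policy optimization methods then search for an approximation $\tilde{\pi}_C$ within this feasible set.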

6.2.6. RL Integration with Different AI Techniques

Integrating reinforcement learning with other AI techniques, such as deep learning, transfer learning, or large language models (LLMs), may support the development of improved power system control algorithms by enhancing learning efficiency, scalability, and interpretability. For example, deep learning models, including convolutional and recurrent neural networks, can serve as feature extractors that preprocess high-dimensional data such as grid measurements or forecast time series, providing compact state representations for RL agents. Transfer learning can enable RL agents to leverage knowledge from related tasks, for example by transferring control policies learned on one type of power grid to another with similar dynamics, thereby reducing training time and improving adaptability; a minimal sketch of this idea is given below. Furthermore, LLMs can enhance human–AI collaboration by translating complex control strategies and technical knowledge into natural language, making the learned policies more interpretable to grid operators. In addition, transformer architectures, which underlie LLMs, may be used for context analysis of power signals in power quality monitoring tasks [194]. Finally, hybrid architectures that combine RL with techniques such as Bayesian optimization or imitation learning can be used to tackle specific challenges, such as hyperparameter tuning for robust models or mimicking the behavior of human experts under uncertainty, ultimately leading to more reliable and data-efficient control solutions for power system control problems.
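As a minimal sketch of the transfer-learning idea (all names are hypothetical; in deep RL the analogue is copying pretrained network weights before fine-tuning), a target-task value table can be warm-started from a source task:

```python
def warm_start_q_table(source_Q, state_map, actions, default=0.0):
    """Initialize a target-task Q-table from a source task (illustrative only).

    state_map is a hypothetical mapping from target-task states to the most
    similar source-task states (e.g., matched by bus type or load level).
    Ordinary Q-learning on the target grid then fine-tunes these values.
    """
    target_Q = {}
    for s_target, s_source in state_map.items():
        for a in actions:
            target_Q[(s_target, a)] = source_Q.get((s_source, a), default)
    return target_Q
```

The quality of the transfer naturally depends on how well the state mapping captures the similarity between the two grids; a poor mapping can be worse than starting from scratch.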

6.2.7. RL Applications in Emerging Power System Technologies

Reinforcement learning may aid in the management and optimization of emerging power system technologies such as smart grids and the energy internet. In smart grids, the increasing integration of renewable energy sources like wind and solar introduces additional uncertainty due to their intermittent nature, and traditional power system control methods, which rely on physical models and numerical calculations, struggle to handle the uncertainties of modern grids. At the same time, the widespread deployment of technologies such as advanced metering infrastructure and wide-area monitoring systems generates massive amounts of data, which RL algorithms can leverage to address some of the complexities of smart grids in tasks like renewable generation forecasting or managing flexible energy resources through load forecasting and storage scheduling. For example, RL-based controllers can optimize the charging and discharging schedules of distributed storage devices or electric vehicles in response to fluctuating renewable generation and time-varying energy prices; a simple illustrative reward for such a task is sketched below. Similarly, in the context of the energy internet, where diverse energy assets and services are interconnected across multiple regions, multi-agent RL can help coordinate energy trading among the different sources and manage power flows. This involves developing hierarchical or multi-agent RL frameworks that can handle the complex interactions and negotiations between various stakeholders in a highly dynamic and distributed network. While these applications offer great potential, the scalability of RL algorithms and their ability to handle the complexity of large-scale interconnected systems remain major research challenges.
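As a small, purely illustrative example of such a formulation (the names, limits, and penalty weight below are hypothetical), a per-step reward for price-responsive storage scheduling could reward discharging when prices are high, favor charging when prices are low, and softly penalize leaving the allowed state-of-charge band:

```python
def storage_step_reward(price, power_kw, soc,
                        soc_min=0.1, soc_max=0.9, dt_h=1.0, penalty=10.0):
    """Illustrative per-step reward: market revenue minus a soft SoC penalty.

    power_kw > 0 means discharging (selling energy), power_kw < 0 means
    charging (buying energy); price is the energy price over an interval of
    dt_h hours. All names and numbers here are hypothetical.
    """
    revenue = price * power_kw * dt_h  # positive when selling at high prices
    violation = max(0.0, soc_min - soc) + max(0.0, soc - soc_max)
    return revenue - penalty * violation
```

An agent trained with such a reward would implicitly learn price arbitrage while respecting the storage limits; in practice, degradation costs and grid constraints would be added as further terms.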

Edge AI

Edge AI refers to the deployment of artificial intelligence algorithms and models directly on edge devices, which are located close to, or at, the source of data generation [195,196,197]. Such edge devices include smartphones, IoT devices, sensors, cameras, and other embedded systems. By processing data locally, edge AI reduces the need for constant communication with central cloud servers, leading to several significant benefits. The first is low latency: processing data at the edge of the network minimizes delays, enabling real-time decision making and responses, which are crucial for applications like autonomous vehicles, industrial automation, and healthcare monitoring. Another benefit is reduced bandwidth usage; since data are processed locally, only the most relevant information needs to be transmitted to the cloud, reducing the amount of data sent over networks and easing bandwidth constraints. Moreover, enhanced privacy and security are achieved by keeping data processing local, as sensitive information does not have to be sent to external servers, reducing the risk of data breaches. This also contributes to reliability, since edge AI systems can continue to function even with limited or no internet connectivity, ensuring continuous operation in remote or network-constrained environments. Lastly, scalability is improved, since distributing the processing across numerous edge devices spreads the computational load rather than concentrating it in centralized servers.

7. Conclusions

A comprehensive understanding of the differences between model-based and model-free reinforcement learning approaches is important for effectively addressing optimal control problems in the power systems domain. As is evident from the review above, each approach has unique strengths and limitations, and their applications can vary significantly depending on the specific characteristics and requirements of the problem at hand. Model-based RL, which leverages predefined models of the environmental dynamics, enables efficient learning guided by domain knowledge and faster convergence, making it well suited for tasks like grid stability and control, where continuous state spaces and predictable dynamics are common. Conversely, model-free RL does not require a model of the environment and learns optimal policies through direct interaction, making it well suited to applications characterized by high complexity, dynamic changes, and high uncertainty, such as energy management in buildings, electric vehicle charging, and energy storage management supporting the integration of renewable sources. Understanding these trade-offs may allow researchers and practitioners to tailor RL strategies to specific demands, improving the efficiency and robustness of power systems while facilitating the integration of emerging technologies and renewable energy sources.
In this review, we chose to focus on recent papers that employ both model-based and model-free reinforcement learning paradigms, in order to study methods and techniques for solving optimal control problems in modern power systems. We highlighted five application areas: energy market management, grid stability and control, energy management in buildings, electric vehicles, and energy storage systems. One central conclusion is that model-based solutions are better suited to control problems with a clear mathematical structure, seen mostly in physical applications such as voltage control for grid stability. In contrast, for applications with complicated underlying statistical structures, such as optimizing energy costs in buildings, model-free approaches, which rely solely on aggregated data and do not require any prior knowledge of the system, are better suited; however, they require large-scale datasets to learn from, which often do not exist. Another important aspect covered in this work is the challenges and limitations of both model-based and model-free approaches when applied to optimal control problems in power systems. While this is an emerging and exciting field of study, a few obstacles, including standardization, safety during training, and generalization ability, must be overcome before reinforcement learning can provide feasible and reliable solutions to real-world problems.

Author Contributions

Conceptualization, E.G.-G. and R.M.; software, E.G.-G.; validation, E.G.-G., R.M., Y.L. and S.K.; writing—original draft preparation, E.G.-G., I.S., A.B., E.S. and S.K.N.; writing—review and editing, E.G.-G., R.M., Y.L., J.B. and S.K.; visualization, J.B.; supervision, Y.L. and S.K.; project administration, L.K.; funding acquisition, J.B. All authors have read and agreed to the published version of the manuscript.

Funding

The work of J. Belikov was partly supported by the Estonian Research Council grant PRG1463.

Conflicts of Interest

Author Liran Katzir was employed by the company Advanced Energy Industries. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RL: Reinforcement learning
MDP: Markov decision process
EM: Energy market
GSAC: Grid stability and control
BEM: Building energy management
EV: Electric vehicle
ESS: Energy storage system
PV: Photovoltaic
MG: Microgrid
DR: Demand response
ET: Energy trading
AC: Actor–critic & variations
PG: Policy gradient & variations
QL: Q-learning & variations

References

  1. Schneider, N. Population growth, electricity demand and environmental sustainability in Nigeria: Insights from a vector auto-regressive approach. Int. J. Environ. Stud. 2022, 79, 149–176. [Google Scholar] [CrossRef]
  2. Begum, R.A.; Sohag, K.; Abdullah, S.M.S.; Jaafar, M. CO2 emissions, energy consumption, economic and population growth in Malaysia. Renew. Sustain. Energy Rev. 2015, 41, 594–601. [Google Scholar] [CrossRef]
  3. Rahman, M.M. Exploring the effects of economic growth, population density and international trade on energy consumption and environmental quality in India. Int. J. Energy Sect. Manag. 2020, 14, 1177–1203. [Google Scholar] [CrossRef]
  4. Comello, S.; Reichelstein, S.; Sahoo, A. The road ahead for solar PV power. Renew. Sustain. Energy Rev. 2018, 92, 744–756. [Google Scholar] [CrossRef]
  5. Fathima, A.H.; Palanisamy, K. Energy storage systems for energy management of renewables in distributed generation systems. Energy Manag. Distrib. Gener. Syst. 2016, 157. [Google Scholar] [CrossRef]
  6. Heldeweg, M.A.; Saintier, S. Renewable energy communities as ‘socio-legal institutions’: A normative frame for energy decentralization? Renew. Sustain. Energy Rev. 2020, 119, 109518. [Google Scholar] [CrossRef]
  7. Urishev, B. Decentralized Energy Systems, Based on Renewable Energy Sources. Appl. Sol. Energy 2019, 55, 207–212. [Google Scholar] [CrossRef]
  8. Yaqoot, M.; Diwan, P.; Kandpal, T.C. Review of barriers to the dissemination of decentralized renewable energy systems. Renew. Sustain. Energy Rev. 2016, 58, 477–490. [Google Scholar] [CrossRef]
  9. Avancini, D.B.; Rodrigues, J.J.; Martins, S.G.; Rabêlo, R.A.; Al-Muhtadi, J.; Solic, P. Energy meters evolution in smart grids: A review. J. Clean. Prod. 2019, 217, 702–715. [Google Scholar] [CrossRef]
  10. Alotaibi, I.; Abido, M.A.; Khalid, M.; Savkin, A.V. A Comprehensive Review of Recent Advances in Smart Grids: A Sustainable Future with Renewable Energy Resources. Energies 2020, 13, 6269. [Google Scholar] [CrossRef]
  11. Alimi, O.A.; Ouahada, K.; Abu-Mahfouz, A.M. A Review of Machine Learning Approaches to Power System Security and Stability. IEEE Access 2020, 8, 113512–113531. [Google Scholar] [CrossRef]
  12. Krause, T.; Ernst, R.; Klaer, B.; Hacker, I.; Henze, M. Cybersecurity in Power Grids: Challenges and Opportunities. Sensors 2021, 21, 6225. [Google Scholar] [CrossRef] [PubMed]
  13. Yohanandhan, R.V.; Elavarasan, R.M.; Manoharan, P.; Mihet-Popa, L. Cyber-Physical Power System (CPPS): A Review on Modeling, Simulation, and Analysis with Cyber Security Applications. IEEE Access 2020, 8, 151019–151064. [Google Scholar] [CrossRef]
  14. Guerin, T.F. Evaluating expected and comparing with observed risks on a large-scale solar photovoltaic construction project: A case for reducing the regulatory burden. Renew. Sustain. Energy Rev. 2017, 74, 333–348. [Google Scholar] [CrossRef]
  15. Garcia, A.; Alzate, J.; Barrera, J. Regulatory design and incentives for renewable energy. J. Regul. Econ. 2011, 41, 315–336. [Google Scholar] [CrossRef]
  16. Glavic, M. (Deep) Reinforcement learning for electric power system control and related problems: A short review and perspectives. Annu. Rev. Control 2019, 48, 22–35. [Google Scholar] [CrossRef]
  17. Perera, A.; Kamalaruban, P. Applications of reinforcement learning in energy systems. Renew. Sustain. Energy Rev. 2021, 137, 110618. [Google Scholar] [CrossRef]
  18. Al-Saadi, M.; Al-Greer, M.; Short, M. Reinforcement Learning-Based Intelligent Control Strategies for Optimal Power Management in Advanced Power Distribution Systems: A Survey. Energies 2023, 16, 1608. [Google Scholar] [CrossRef]
  19. Chen, X.; Qu, G.; Tang, Y.; Low, S.; Li, N. Reinforcement Learning for Selective Key Applications in Power Systems: Recent Advances and Future Challenges. IEEE Trans. Smart Grid 2022, 13, 2935–2958. [Google Scholar] [CrossRef]
  20. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  21. Graesser, L.; Keng, W. Foundations of Deep Reinforcement Learning: Theory and Practice in Python; Addison-Wesley Data and Analytics Series; Addison-Wesley: Boston, MA, USA, 2020. [Google Scholar]
  22. Qiang, W.; Zhongli, Z. Reinforcement learning model, algorithms and its application. In Proceedings of the International Conference on Mechatronic Science, Electric Engineering and Computer (MEC), Jilin, China, 19–22 August 2011; pp. 1143–1146. [Google Scholar] [CrossRef]
  23. Zhang, K.; Yang, Z.; Başar, T. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. In Handbook of Reinforcement Learning and Control; Springer International Publishing: Cham, Switzerland, 2021; pp. 321–384. [Google Scholar] [CrossRef]
  24. Moerland, T.M.; Broekens, J.; Plaat, A.; Jonker, C.M. Model-based Reinforcement Learning: A Survey. Found. Trends Mach. Learn. 2023, 16, 1–118. [Google Scholar] [CrossRef]
  25. Huang, Q. Model-based or model-free, a review of approaches in reinforcement learning. In Proceedings of the 2020 International Conference on Computing and Data Science (CDS), Stanford, CA, USA, 1–2 August 2020; pp. 219–221. [Google Scholar] [CrossRef]
  26. Freed, B.; Wei, T.; Calandra, R.; Schneider, J.; Choset, H. Unifying Model-Based and Model-Free Reinforcement Learning with Equivalent Policy Sets. Reinf. Learn. J. 2024, 1, 283–301. [Google Scholar]
  27. Bayón, L.; Grau, J.; Ruiz, M.; Suárez, P. A comparative economic study of two configurations of hydro-wind power plants. Energy 2016, 112, 8–16. [Google Scholar] [CrossRef]
  28. Riffonneau, Y.; Bacha, S.; Barruel, F.; Ploix, S. Optimal power flow management for grid connected PV systems with batteries. IEEE Trans. Sustain. Energy 2011, 2, 309–320. [Google Scholar] [CrossRef]
  29. Powell, W.B. Approximate Dynamic Programming: Solving the Curses of Dimensionality; John Wiley & Sons: Hoboken, NJ, USA, 2007. [Google Scholar]
  30. Zargari, N.; Ofir, R.; Chowdhury, N.R.; Belikov, J.; Levron, Y. An Optimal Control Method for Storage Systems with Ramp Constraints, Based on an On-Going Trimming Process. IEEE Trans. Control Syst. Technol. 2023, 31, 493–496. [Google Scholar] [CrossRef]
  31. García, C.E.; Prett, D.M.; Morari, M. Model predictive control: Theory and practice—A survey. Automatica 1989, 25, 335–348. [Google Scholar] [CrossRef]
  32. Schwenzer, M.; Ay, M.; Bergs, T.; Abel, D. Review on Model Predictive Control: An Engineering Perspective. Int. J. Adv. Manuf. Technol. 2021, 117, 1327–1349. [Google Scholar] [CrossRef]
  33. Morari, M.; Garcia, C.E.; Prett, D.M. Model predictive control: Theory and practice. IFAC Proc. Vol. 1988, 21, 1–12. [Google Scholar] [CrossRef]
  34. Li, Z.; Wu, L.; Xu, Y.; Moazeni, S.; Tang, Z. Multi-Stage Real-Time Operation of a Multi-Energy Microgrid with Electrical and Thermal Energy Storage Assets: A Data-Driven MPC-ADP Approach. IEEE Trans. Smart Grid 2022, 13, 213–226. [Google Scholar] [CrossRef]
  35. Agarwal, A.; Kakade, S.M.; Lee, J.D.; Mahajan, G. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. J. Mach. Learn. Res. 2021, 22, 1–76. [Google Scholar] [CrossRef]
  36. Wooldridge, M. An Introduction to MultiAgent Systems; Wiley: Hoboken, NJ, USA, 2009. [Google Scholar]
  37. Altman, E. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Math. Methods Oper. Res. 1998, 48, 387–417. [Google Scholar] [CrossRef]
  38. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained Policy Optimization. In Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 22–31. [Google Scholar]
  39. Xia, Y.; Xu, Y.; Feng, X. Hierarchical Coordination of Networked-Microgrids Toward Decentralized Operation: A Safe Deep Reinforcement Learning Method. IEEE Trans. Sustain. Energy 2024, 15, 1981–1993. [Google Scholar] [CrossRef]
  40. Yongli, Z.; Limin, H.; Jinling, L. Bayesian networks-based approach for power systems fault diagnosis. IEEE Trans. Power Deliv. 2006, 21, 634–639. [Google Scholar] [CrossRef]
  41. Chen, N.; Qian, Z.; Nabney, I.T.; Meng, X. Wind Power Forecasts Using Gaussian Processes and Numerical Weather Prediction. IEEE Trans. Power Syst. 2014, 29, 656–665. [Google Scholar] [CrossRef]
  42. Wen, S.; Zhang, C.; Lan, H.; Xu, Y.; Tang, Y.; Huang, Y. A Hybrid Ensemble Model for Interval Prediction of Solar Power Output in Ship Onboard Power Systems. IEEE Trans. Sustain. Energy 2021, 12, 14–24. [Google Scholar] [CrossRef]
  43. Chow, J.H.; Sanchez-Gasca, J.J. Power System Coherency and Model Reduction; Wiley-IEEE Press: Hoboken, NJ, USA, 2020. [Google Scholar]
  44. Saxena, S.; Hote, Y.V. Load Frequency Control in Power Systems via Internal Model Control Scheme and Model-Order Reduction. IEEE Trans. Power Syst. 2013, 28, 2749–2757. [Google Scholar] [CrossRef]
  45. Machlev, R.; Zargari, N.; Chowdhury, N.; Belikov, J.; Levron, Y. A review of optimal control methods for energy storage systems - energy trading, energy balancing and electric vehicles. J. Energy Storage 2020, 32, 101787. [Google Scholar] [CrossRef]
  46. Machlev, R.; Tolkachov, D.; Levron, Y.; Beck, Y. Dimension reduction for NILM classification based on principle component analysis. Electr. Power Syst. Res. 2020, 187, 106459. [Google Scholar] [CrossRef]
  47. Chien, I.; Karthikeyan, P.; Hsiung, P.A. Prediction-based peer-to-peer energy transaction market design for smart grids. Eng. Appl. Artif. Intell. 2023, 126, 107190. [Google Scholar] [CrossRef]
  48. Levron, Y.; Shmilovitz, D. Optimal Power Management in Fueled Systems with Finite Storage Capacity. IEEE Trans. Circuits Syst. I Regul. Pap. 2010, 57, 2221–2231. [Google Scholar] [CrossRef]
  49. Sanayha, M.; Vateekul, P. Model-based deep reinforcement learning for wind energy bidding. Int. J. Electr. Power Energy Syst. 2022, 136, 107625. [Google Scholar] [CrossRef]
  50. Wolgast, T.; Nieße, A. Approximating Energy Market Clearing and Bidding with Model-Based Reinforcement Learning. arXiv 2023, arXiv:2303.01772. [Google Scholar] [CrossRef]
  51. Sanayha, M.; Vateekul, P. Model-Based Approach on Multi-Agent Deep Reinforcement Learning with Multiple Clusters for Peer-To-Peer Energy Trading. IEEE Access 2022, 10, 127882–127893. [Google Scholar] [CrossRef]
  52. He, Q.; Wang, J.; Shi, R.; He, Y.; Wu, M. Enhancing renewable energy certificate transactions through reinforcement learning and smart contracts integration. Sci. Rep. 2024, 14, 10838. [Google Scholar] [CrossRef] [PubMed]
  53. Zou, Y.; Wang, Q.; Xia, Q.; Chi, Y.; Lei, C.; Zhou, N. Federated reinforcement learning for Short-Time scale operation of Wind-Solar-Thermal power network with nonconvex models. Int. J. Electr. Power Energy Syst. 2024, 158, 109980. [Google Scholar] [CrossRef]
  54. Nanduri, V.; Das, T.K. A Reinforcement Learning Model to Assess Market Power Under Auction-Based Energy Pricing. IEEE Trans. Power Syst. 2007, 22, 85–95. [Google Scholar] [CrossRef]
  55. Cai, W.; Kordabad, A.B.; Gros, S. Energy management in residential microgrid using model predictive control-based reinforcement learning and Shapley value. Eng. Appl. Artif. Intell. 2023, 119, 105793. [Google Scholar] [CrossRef]
  56. Ojand, K.; Dagdougui, H. Q-Learning-Based Model Predictive Control for Energy Management in Residential Aggregator. IEEE Trans. Autom. Sci. Eng. 2022, 19, 70–81. [Google Scholar] [CrossRef]
  57. Nord Pool. Nord Pool Wholesale Electricity Market Data. 2024. Available online: https://data.nordpoolgroup.com/auction/day-ahead/prices?deliveryDate=latest&currency=EUR&aggregation=Hourly&deliveryAreas=AT (accessed on 19 September 2024).
  58. Australia Grid. Australia Grid Data. 2024. Available online: https://www.ausgrid.com.au/Industry/Our-Research/Data-to-share/Average-electricity-use (accessed on 19 September 2024).
  59. Chinese Listed Companies, CNY. Carbon Emissions Data. 2024. Available online: https://www.nature.com/articles/s41598-024-60527-3/tables/1 (accessed on 19 September 2024).
  60. Hiskens, I. IEEE PES Task Force on Benchmark Systems for Stability Controls; IEEE: Piscataway, NJ, USA, 2013. [Google Scholar]
  61. Elia. Belgium Grid Data. 2024. Available online: https://www.elia.be/en/grid-data/ (accessed on 19 September 2024).
  62. Comed. Chicago Electricity Price Data. 2024. Available online: https://hourlypricing.comed.com/live-prices/ (accessed on 19 September 2024).
  63. Huang, R.; Chen, Y.; Yin, T.; Li, X.; Li, A.; Tan, J.; Yu, W.; Liu, Y.; Huang, Q. Accelerated Derivative-Free Deep Reinforcement Learning for Large-Scale Grid Emergency Voltage Control. IEEE Trans. Power Syst. 2022, 37, 14–25. [Google Scholar] [CrossRef]
  64. Hossain, R.R.; Yin, T.; Du, Y.; Huang, R.; Tan, J.; Yu, W.; Liu, Y.; Huang, Q. Efficient learning of power grid voltage control strategies via model-based Deep Reinforcement Learning. Mach. Learn. 2023, 113, 2675–2700. [Google Scholar] [CrossRef]
  65. Cao, D.; Zhao, J.; Hu, W.; Ding, F.; Yu, N.; Huang, Q.; Chen, Z. Model-free voltage control of active distribution system with PVs using surrogate model-based deep reinforcement learning. Appl. Energy 2022, 306, 117982. [Google Scholar] [CrossRef]
  66. Huang, Q.; Huang, R.; Hao, W.; Tan, J.; Fan, R.; Huang, Z. Adaptive Power System Emergency Control Using Deep Reinforcement Learning. IEEE Trans. Smart Grid 2020, 11, 1171–1182. [Google Scholar] [CrossRef]
  67. Duan, J.; Yi, Z.; Shi, D.; Lin, C.; Lu, X.; Wang, Z. Reinforcement-Learning-Based Optimal Control of Hybrid Energy Storage Systems in Hybrid AC–DC Microgrids. IEEE Trans. Ind. Inform. 2019, 15, 5355–5364. [Google Scholar] [CrossRef]
  68. Totaro, S.; Boukas, I.; Jonsson, A.; Cornélusse, B. Lifelong control of off-grid microgrid with model-based reinforcement learning. Energy 2021, 232, 121035. [Google Scholar] [CrossRef]
  69. Yan, Z.; Xu, Y. Real-Time Optimal Power Flow: A Lagrangian Based Deep Reinforcement Learning Approach. IEEE Trans. Power Syst. 2020, 35, 3270–3273. [Google Scholar] [CrossRef]
  70. Zhang, H.; Yue, D.; Dou, C.; Xie, X.; Li, K.; Hancke, G.P. Resilient Optimal Defensive Strategy of TSK Fuzzy-Model-Based Microgrids’ System via a Novel Reinforcement Learning Approach. IEEE Trans. Neural Networks Learn. Syst. 2023, 34, 1921–1931. [Google Scholar] [CrossRef]
  71. Aghaei, J.; Niknam, T.; Azizipanah-Abarghooee, R.; Arroyo, J.M. Scenario-based dynamic economic emission dispatch considering load and wind power uncertainties. Int. J. Electr. Power Energy Syst. 2013, 47, 351–367. [Google Scholar] [CrossRef]
  72. Zhang, H.; Yue, D.; Xie, X.; Dou, C.; Sun, F. Gradient decent based multi-objective cultural differential evolution for short-term hydrothermal optimal scheduling of economic emission with integrating wind power and photovoltaic power. Energy 2017, 122, 748–766. [Google Scholar] [CrossRef]
  73. Zhang, Z.; Zhang, C.; Lam, K.P. A deep reinforcement learning method for model-based optimal control of HVAC systems. In Proceedings of the SURFACE at Syracuse University, Syracuse, NY, USA, 24 September 2018. [Google Scholar] [CrossRef]
  74. Zhang, Z.; Chong, A.; Pan, Y.; Zhang, C.; Lam, K.P. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build. 2019, 199, 472–490. [Google Scholar] [CrossRef]
  75. Chen, B.; Cai, Z.; Bergés, M. Gnu-rl: A precocial reinforcement learning solution for building hvac control using a differentiable mpc policy. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, Hangzhou, China, 13–14 November 2019; pp. 316–325. [Google Scholar]
  76. Drgoňa, J.; Picard, D.; Kvasnica, M.; Helsen, L. Approximate model predictive building control via machine learning. Appl. Energy 2018, 218, 199–216. [Google Scholar] [CrossRef]
  77. Arroyo, J.; Manna, C.; Spiessens, F.; Helsen, L. Reinforced model predictive control (RL-MPC) for building energy management. Appl. Energy 2022, 309, 118346. [Google Scholar] [CrossRef]
  78. Drgoňa, J.; Tuor, A.; Skomski, E.; Vasisht, S.; Vrabie, D. Deep learning explicit differentiable predictive control laws for buildings. IFAC-PapersOnLine 2021, 54, 14–19. [Google Scholar] [CrossRef]
  79. Kowli, A.; Mayhorn, E.; Kalsi, K.; Meyn, S.P. Coordinating dispatch of distributed energy resources with model predictive control and Q-learning. In Coordinated Science Laboratory Report no. UILU-ENG-12-2204, DC-256; Coordinated Science Laboratory: Urbana, IL, USA, 2012. [Google Scholar]
  80. Bianchi, C.; Fontanini, A. TMY3 Weather Data for ComStock and ResStock. 2021. Available online: https://data.nrel.gov/submissions/156 (accessed on 19 September 2024).
  81. Blum, D.; Arroyo, J.; Huang, S.; Drgoňa, J.; Jorissen, F.; Walnum, H.T.; Chen, Y.; Benne, K.; Vrabie, D.; Wetter, M.; et al. Building optimization testing framework (BOPTEST) for simulation-based benchmarking of control strategies in buildings. J. Build. Perform. Simul. 2021, 14, 586–610. [Google Scholar] [CrossRef]
  82. Wind Data and Tools. Wind Data. 2024. Available online: https://www.nrel.gov/wind/data-tools.html (accessed on 19 September 2024).
  83. Lee, H.; Cha, S.W. Energy management strategy of fuel cell electric vehicles using model-based reinforcement learning with data-driven model update. IEEE Access 2021, 9, 59244–59254. [Google Scholar] [CrossRef]
  84. Chiş, A.; Lundén, J.; Koivunen, V. Reinforcement learning-based plug-in electric vehicle charging with forecasted price. IEEE Trans. Veh. Technol. 2016, 66, 3674–3684. [Google Scholar]
  85. Zhang, F.; Yang, Q.; An, D. CDDPG: A deep-reinforcement-learning-based approach for electric vehicle charging control. IEEE Internet Things J. 2020, 8, 3075–3087. [Google Scholar] [CrossRef]
  86. Cui, L.; Wang, Q.; Qu, H.; Wang, M.; Wu, Y.; Ge, L. Dynamic pricing for fast charging stations with deep reinforcement learning. Appl. Energy 2023, 346, 121334. [Google Scholar] [CrossRef]
  87. Xing, Q.; Xu, Y.; Chen, Z.; Zhang, Z.; Shi, Z. A graph reinforcement learning-based decision-making platform for real-time charging navigation of urban electric vehicles. IEEE Trans. Ind. Inform. 2022, 19, 3284–3295. [Google Scholar] [CrossRef]
  88. Qian, T.; Shao, C.; Wang, X.; Shahidehpour, M. Deep reinforcement learning for EV charging navigation by coordinating smart grid and intelligent transportation system. IEEE Trans. Smart Grid 2019, 11, 1714–1723. [Google Scholar] [CrossRef]
  89. Vandael, S.; Claessens, B.; Ernst, D.; Holvoet, T.; Deconinck, G. Reinforcement learning of heuristic EV fleet charging in a day-ahead electricity market. IEEE Trans. Smart Grid 2015, 6, 1795–1805. [Google Scholar] [CrossRef]
  90. Jin, J.; Xu, Y. Optimal policy characterization enhanced actor-critic approach for electric vehicle charging scheduling in a power distribution network. IEEE Trans. Smart Grid 2020, 12, 1416–1428. [Google Scholar] [CrossRef]
  91. Qian, J.; Jiang, Y.; Liu, X.; Wang, Q.; Wang, T.; Shi, Y.; Chen, W. Federated Reinforcement Learning for Electric Vehicles Charging Control on Distribution Networks. IEEE Internet Things J. 2023, 11, 5511–5525. [Google Scholar] [CrossRef]
  92. Wang, Y.; Lin, X.; Pedram, M. Accurate component model based optimal control for energy storage systems in households with photovoltaic modules. In Proceedings of the 2013 IEEE Green Technologies Conference (GreenTech), Denver, CO, USA, 4–5 April 2013; pp. 28–34. [Google Scholar] [CrossRef]
  93. Gao, Y.; Li, J.; Hong, M. Machine Learning Based Optimization Model for Energy Management of Energy Storage System for Large Industrial Park. Processes 2021, 9, 825. [Google Scholar] [CrossRef]
  94. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement learning of adaptive energy management with transition probability for a hybrid electric tracked vehicle. IEEE Trans. Ind. Electron. 2015, 62, 7837–7846. [Google Scholar] [CrossRef]
  95. Kong, Z.; Zou, Y.; Liu, T. Implementation of real-time energy management strategy based on reinforcement learning for hybrid electric vehicles and simulation validation. PLoS ONE 2017, 12, e0180491. [Google Scholar] [CrossRef]
  96. Hu, X.; Liu, T.; Qi, X.; Barth, M. Reinforcement learning for hybrid and plug-in hybrid electric vehicle energy management: Recent advances and prospects. IEEE Ind. Electron. Mag. 2019, 13, 16–25. [Google Scholar] [CrossRef]
  97. Yan, Z.; Xu, Y.; Wang, Y.; Feng, X. Deep reinforcement learning-based optimal data-driven control of battery energy storage for power system frequency support. IET Gener. Transm. Distrib. 2020, 14, 6071–6078. [Google Scholar] [CrossRef]
  98. Wang, Y.; Lin, X.; Pedram, M. Adaptive control for energy storage systems in households with photovoltaic modules. IEEE Trans. Smart Grid 2014, 5, 992–1001. [Google Scholar] [CrossRef]
  99. Zhang, H.; Li, J.; Hong, M. Machine learning-based energy system model for tissue paper machines. Processes 2021, 9, 655. [Google Scholar] [CrossRef]
  100. Wang, Y.; Lin, X.; Pedram, M. A Near-Optimal Model-Based Control Algorithm for Households Equipped with Residential Photovoltaic Power Generation and Energy Storage Systems. IEEE Trans. Sustain. Energy 2016, 7, 77–86. [Google Scholar] [CrossRef]
  101. NREL. Measurement and Instrumentation Data Center. 2021. Available online: https://midcdmz.nrel.gov/ (accessed on 20 October 2024).
  102. BGE. Baltimore Load Profile Data. 2021. Available online: https://supplier.bge.com/electric/load/profiles.asp (accessed on 19 September 2024).
  103. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement learning–based energy management strategy for a hybrid electric tracked vehicle. Energies 2015, 8, 7243–7260. [Google Scholar] [CrossRef]
  104. Baah, G.K.; Podgurski, A.; Harrold, M.J. The Probabilistic Program Dependence Graph and Its Application to Fault Diagnosis. IEEE Trans. Softw. Eng. 2010, 36, 528–545. [Google Scholar] [CrossRef]
  105. Schaefer, A.M.; Udluft, S.; Zimmermann, H.G. A recurrent control neural network for data efficient reinforcement learning. In Proceedings of the 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, Honolulu, HI, USA, 1–5 April 2007; pp. 151–157. [Google Scholar] [CrossRef]
  106. Bitzer, S.; Howard, M.; Vijayakumar, S. Using dimensionality reduction to exploit constraints in reinforcement learning. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 3219–3225. [Google Scholar] [CrossRef]
  107. Barto, A.G.; Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discret. Event Dyn. Syst. 2003, 13, 341–379. [Google Scholar] [CrossRef]
  108. Cowan, W.; Katehakis, M.N.; Pirutinsky, D. Reinforcement learning: A comparison of UCB versus alternative adaptive policies. In Proceedings of the First Congress of Greek Mathematicians, Athens, Greece, 25–30 June 2018; p. 127. [Google Scholar] [CrossRef]
  109. Ladosz, P.; Weng, L.; Kim, M.; Oh, H. Exploration in deep reinforcement learning: A survey. Inf. Fusion 2022, 85, 1–22. [Google Scholar] [CrossRef]
  110. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef] [PubMed]
  111. Ren, C.; Xu, Y. Transfer Learning-Based Power System Online Dynamic Security Assessment: Using One Model to Assess Many Unlearned Faults. IEEE Trans. Power Syst. 2020, 35, 821–824. [Google Scholar] [CrossRef]
  112. Ye, Y.; Qiu, D.; Wu, X.; Strbac, G.; Ward, J. Model-Free Real-Time Autonomous Control for a Residential Multi-Energy System Using Deep Reinforcement Learning. IEEE Trans. Smart Grid 2020, 11, 3068–3082. [Google Scholar] [CrossRef]
  113. Zhang, S.; May, D.; Gül, M.; Musilek, P. Reinforcement learning-driven local transactive energy market for distributed energy resources. Energy AI 2022, 8, 100150. [Google Scholar] [CrossRef]
  114. Bose, S.; Kremers, E.; Mengelkamp, E.M.; Eberbach, J.; Weinhardt, C. Reinforcement learning in local energy markets. Energy Inform. 2021, 4, 7. [Google Scholar] [CrossRef]
  115. Li, J.; Wang, C.; Wang, H. Attentive Convolutional Deep Reinforcement Learning for Optimizing Solar-Storage Systems in Real-Time Electricity Markets. IEEE Trans. Ind. Inform. 2024, 20, 7205–7215. [Google Scholar] [CrossRef]
  116. Li, X.; Luo, F.; Li, C. Multi-agent deep reinforcement learning-based autonomous decision-making framework for community virtual power plants. Appl. Energy 2024, 360, 122813. [Google Scholar] [CrossRef]
  117. Ye, Y.; Papadaskalopoulos, D.; Yuan, Q.; Tang, Y.; Strbac, G. Multi-Agent Deep Reinforcement Learning for Coordinated Energy Trading and Flexibility Services Provision in Local Electricity Markets. IEEE Trans. Smart Grid 2023, 14, 1541–1554. [Google Scholar] [CrossRef]
  118. Chen, T.; Su, W. Indirect Customer-to-Customer Energy Trading with Reinforcement Learning. IEEE Trans. Smart Grid 2019, 10, 4338–4348. [Google Scholar] [CrossRef]
  119. Fang, X.; Zhao, Q.; Wang, J.; Han, Y.; Li, Y. Multi-agent Deep Reinforcement Learning for Distributed Energy Management and Strategy Optimization of Microgrid Market. Sustain. Cities Soc. 2021, 74, 103163. [Google Scholar] [CrossRef]
  120. Harrold, D.J.; Cao, J.; Fan, Z. Renewable energy integration and microgrid energy trading using multi-agent deep reinforcement learning. Appl. Energy 2022, 318, 119151. [Google Scholar] [CrossRef]
  121. Gao, S.; Xiang, C.; Yu, M.; Tan, K.T.; Lee, T.H. Online Optimal Power Scheduling of a Microgrid via Imitation Learning. IEEE Trans. Smart Grid 2022, 13, 861–876. [Google Scholar] [CrossRef]
  122. Chen, D.; Irwin, D. SunDance: Black-box Behind-the-Meter Solar Disaggregation. In Proceedings of the Eighth International Conference on Future Energy Systems, Shatin, Hong Kong, 16–19 May 2017; e-Energy ’17. pp. 45–55. [Google Scholar] [CrossRef]
  123. Mishra, A.K.; Cecchet, E.; Shenoy, P.J.; Albrecht, J.R. Smart: An Open Data Set and Tools for Enabling Research in Sustainable Homes. 2012. Available online: https://api.semanticscholar.org/CorpusID:6562225 (accessed on 19 September 2024).
  124. AEMO. Electricity Distribution and Prices Data. 2024. Available online: https://aemo.com.au/en/energy-systems/electricity/national-electricity-market-nem/data-nem/data-dashboard-nem (accessed on 19 September 2024).
  125. California ISO. California Electrical Power System Operational Data. 2024. Available online: https://www.caiso.com/ (accessed on 19 September 2024).
  126. Duan, J.; Shi, D.; Diao, R.; Li, H.; Wang, Z.; Zhang, B.; Bian, D.; Yi, Z. Deep-Reinforcement-Learning-Based Autonomous Voltage Control for Power Grid Operations. IEEE Trans. Power Syst. 2020, 35, 814–817. [Google Scholar] [CrossRef]
  127. Cao, D.; Zhao, J.; Hu, W.; Yu, N.; Ding, F.; Huang, Q.; Chen, Z. Deep Reinforcement Learning Enabled Physical-Model-Free Two-Timescale Voltage Control Method for Active Distribution Systems. IEEE Trans. Smart Grid 2022, 13, 149–165. [Google Scholar] [CrossRef]
  128. Diao, R.; Wang, Z.; Shi, D.; Chang, Q.; Duan, J.; Zhang, X. Autonomous Voltage Control for Grid Operation Using Deep Reinforcement Learning. In Proceedings of the 2019 IEEE Power & Energy Society General Meeting (PESGM), Atlanta, GA, USA, 4–8 August 2019; pp. 1–5. [Google Scholar] [CrossRef]
  129. Hadidi, R.; Jeyasurya, B. Reinforcement Learning Based Real-Time Wide-Area Stabilizing Control Agents to Enhance Power System Stability. IEEE Trans. Smart Grid 2013, 4, 489–497. [Google Scholar] [CrossRef]
  130. Chen, C.; Cui, M.; Li, F.; Yin, S.; Wang, X. Model-Free Emergency Frequency Control Based on Reinforcement Learning. IEEE Trans. Ind. Inform. 2021, 17, 2336–2346. [Google Scholar] [CrossRef]
  131. Zhao, J.; Li, F.; Mukherjee, S.; Sticht, C. Deep Reinforcement Learning-Based Model-Free On-Line Dynamic Multi-Microgrid Formation to Enhance Resilience. IEEE Trans. Smart Grid 2022, 13, 2557–2567. [Google Scholar] [CrossRef]
  132. Du, Y.; Li, F. Intelligent Multi-Microgrid Energy Management Based on Deep Neural Network and Model-Free Reinforcement Learning. IEEE Trans. Smart Grid 2020, 11, 1066–1076. [Google Scholar] [CrossRef]
  133. Zhou, Y.; Lee, W.; Diao, R.; Shi, D. Deep Reinforcement Learning Based Real-time AC Optimal Power Flow Considering Uncertainties. J. Mod. Power Syst. Clean Energy 2022, 10, 1098–1109. [Google Scholar] [CrossRef]
  134. Cao, D.; Hu, W.; Xu, X.; Wu, Q.; Huang, Q.; Chen, Z.; Blaabjerg, F. Deep Reinforcement Learning Based Approach for Optimal Power Flow of Distribution Networks Embedded with Renewable Energy and Storage Devices. J. Mod. Power Syst. Clean Energy 2021, 9, 1101–1110. [Google Scholar] [CrossRef]
  135. Birchfield, A.B.; Xu, T.; Gegner, K.M.; Shetye, K.S.; Overbye, T.J. Grid Structural Characteristics as Validation Criteria for Synthetic Networks. IEEE Trans. Power Syst. 2017, 32, 3258–3265. [Google Scholar] [CrossRef]
  136. Chen, C.; Zhang, K.; Yuan, K.; Zhu, L.; Qian, M. Novel Detection Scheme Design Considering Cyber Attacks on Load Frequency Control. IEEE Trans. Ind. Inform. 2018, 14, 1932–1941. [Google Scholar] [CrossRef]
  137. Qiu, S.; Li, Z.; Li, Z.; Li, J.; Long, S.; Li, X. Model-free control method based on reinforcement learning for building cooling water systems: Validation by measured data-based simulation. Energy Build. 2020, 218, 110055. [Google Scholar] [CrossRef]
  138. Zhang, X.; Chen, Y.; Bernstein, A.; Chintala, R.; Graf, P.; Jin, X.; Biagioni, D. Two-stage reinforcement learning policy search for grid-interactive building control. IEEE Trans. Smart Grid 2022, 13, 1976–1987. [Google Scholar] [CrossRef]
  139. Zhang, X.; Biagioni, D.; Cai, M.; Graf, P.; Rahman, S. An Edge-Cloud Integrated Solution for Buildings Demand Response Using Reinforcement Learning. IEEE Trans. Smart Grid 2021, 12, 420–431. [Google Scholar] [CrossRef]
  140. Mocanu, E.; Mocanu, D.C.; Nguyen, P.H.; Liotta, A.; Webber, M.E.; Gibescu, M.; Slootweg, J.G. On-Line Building Energy Optimization Using Deep Reinforcement Learning. IEEE Trans. Smart Grid 2019, 10, 3698–3708. [Google Scholar] [CrossRef]
  141. Wei, T.; Wang, Y.; Zhu, Q. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, Austin, TX, USA, 18–22 June 2017; pp. 1–6. [Google Scholar] [CrossRef]
  142. Yu, L.; Sun, Y.; Xu, Z.; Shen, C.; Yue, D.; Jiang, T.; Guan, X. Multi-Agent Deep Reinforcement Learning for HVAC Control in Commercial Buildings. IEEE Trans. Smart Grid 2021, 12, 407–419. [Google Scholar] [CrossRef]
  143. Shin, M.; Kim, S.; Kim, Y.; Song, A.; Kim, Y.; Kim, H.Y. Development of an HVAC system control method using weather forecasting data with deep reinforcement learning algorithms. Build. Environ. 2024, 248, 111069. [Google Scholar] [CrossRef]
  144. EnergyPlus Whole Building Energy Simulation Program. Available online: https://energyplus.net/ (accessed on 20 October 2024).
  145. Gao, G.; Li, J.; Wen, Y. DeepComfort: Energy-Efficient Thermal Comfort Control in Buildings Via Reinforcement Learning. IEEE Internet Things J. 2020, 7, 8472–8484. [Google Scholar] [CrossRef]
  146. Dey, S.; Marzullo, T.; Zhang, X.; Henze, G. Reinforcement learning building control approach harnessing imitation learning. Energy AI 2023, 14, 100255. [Google Scholar] [CrossRef]
  147. Ruelens, F.; Claessens, B.J.; Quaiyum, S.; De Schutter, B.; Babuška, R.; Belmans, R. Reinforcement Learning Applied to an Electric Water Heater: From Theory to Practice. IEEE Trans. Smart Grid 2018, 9, 3792–3800. [Google Scholar] [CrossRef]
  148. Tutiempo Weather Service. Weather Data. 2024. Available online: https://en.tutiempo.net/climate/ws-486980.html (accessed on 19 September 2024).
  149. Datadryad. Thermal Comfort Field Measurements. 2024. Available online: https://datadryad.org/stash/dataset/doi:10.6078/D1F671 (accessed on 19 September 2024).
  150. Pecan Street. Consumption Data. 2024. Available online: https://www.pecanstreet.org/ (accessed on 19 September 2024).
  151. EIA. Commercial Buildings Energy Consumption Data. 2024. Available online: https://www.eia.gov/consumption/commercial/data/2012/bc/cfm/b6.php (accessed on 19 September 2024).
  152. Jordan, U.; Vajen, K. Hot-Water Profiles. 2001. Available online: https://sel.me.wisc.edu/trnsys/trnlib/iea-shc-task26/iea-shc-task26-load-profiles-description-jordan.pdf (accessed on 19 September 2024).
  153. Zhang, C.; Liu, Y.; Wu, F.; Tang, B.; Fan, W. Effective charging planning based on deep reinforcement learning for electric vehicles. IEEE Trans. Intell. Transp. Syst. 2020, 22, 542–554. [Google Scholar] [CrossRef]
  154. Wang, R.; Chen, Z.; Xing, Q.; Zhang, Z.; Zhang, T. A modified rainbow-based deep reinforcement learning method for optimal scheduling of charging station. Sustainability 2022, 14, 1884. [Google Scholar] [CrossRef]
  155. Wang, S.; Bi, S.; Zhang, Y.A. Reinforcement learning for real-time pricing and scheduling control in EV charging stations. IEEE Trans. Ind. Inform. 2019, 17, 849–859. [Google Scholar] [CrossRef]
  156. Qian, T.; Shao, C.; Li, X.; Wang, X.; Shahidehpour, M. Enhanced coordinated operations of electric power and transportation networks via EV charging services. IEEE Trans. Smart Grid 2020, 11, 3019–3030. [Google Scholar] [CrossRef]
  157. Zhao, Z.; Lee, C.K. Dynamic pricing for EV charging stations: A deep reinforcement learning approach. IEEE Trans. Transp. Electrif. 2021, 8, 2456–2468. [Google Scholar] [CrossRef]
  158. Sadeghianpourhamami, N.; Deleu, J.; Develder, C. Definition and evaluation of model-free coordination of electrical vehicle charging with reinforcement learning. IEEE Trans. Smart Grid 2019, 11, 203–214. [Google Scholar] [CrossRef]
  159. Yeom, K. Model predictive control and deep reinforcement learning based energy efficient eco-driving for battery electric vehicles. Energy Rep. 2022, 8, 34–42. [Google Scholar] [CrossRef]
  160. Dorokhova, M.; Martinson, Y.; Ballif, C.; Wyrsch, N. Deep reinforcement learning control of electric vehicle charging in the presence of photovoltaic generation. Appl. Energy 2021, 301, 117504. [Google Scholar] [CrossRef]
  161. Wen, Z.; O’Neill, D.; Maei, H. Optimal demand response using device-based reinforcement learning. IEEE Trans. Smart Grid 2015, 6, 2312–2324. [Google Scholar] [CrossRef]
  162. Lee, S.; Choi, D.H. Reinforcement learning-based energy management of smart home with rooftop solar photovoltaic system, energy storage system, and home appliances. Sensors 2019, 19, 3937. [Google Scholar] [CrossRef]
  163. Cao, J.; Harrold, D.; Fan, Z.; Morstyn, T.; Healey, D.; Li, K. Deep Reinforcement Learning-Based Energy Storage Arbitrage with Accurate Lithium-Ion Battery Degradation Model. IEEE Trans. Smart Grid 2020, 11, 4513–4521. [Google Scholar] [CrossRef]
  164. Bui, V.H.; Hussain, A.; Kim, H.M. Double Deep Q-Learning-Based Distributed Operation of Battery Energy Storage System Considering Uncertainties. IEEE Trans. Smart Grid 2020, 11, 457–469. [Google Scholar] [CrossRef]
  165. Bui, V.H.; Hussain, A.; Kim, H.M. Q-Learning-Based Operation Strategy for Community Battery Energy Storage System (CBESS) in Microgrid System. Energies 2019, 12, 1789. [Google Scholar] [CrossRef]
  166. Chen, T.; Su, W. Local Energy Trading Behavior Modeling with Deep Reinforcement Learning. IEEE Access 2018, 6, 62806–62814. [Google Scholar] [CrossRef]
  167. Liu, F.; Liu, Q.; Tao, Q.; Huang, Y.; Li, D.; Sidorov, D. Deep reinforcement learning based energy storage management strategy considering prediction intervals of wind power. Int. J. Electr. Power Energy Syst. 2023, 145, 108608. [Google Scholar] [CrossRef]
  168. Zhou, H.; Erol-Kantarci, M. Correlated deep q-learning based microgrid energy management. In Proceedings of the 2020 IEEE 25th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), Pisa, Italy, 14–16 September 2020; pp. 1–6. [Google Scholar] [CrossRef]
  169. Ji, Y.; Wang, J.; Xu, J.; Fang, X.; Zhang, H. Real-time energy management of a microgrid using deep reinforcement learning. Energies 2019, 12, 2291. [Google Scholar] [CrossRef]
  170. Liu, T.; Hu, X. A bi-level control for energy efficiency improvement of a hybrid tracked vehicle. IEEE Trans. Ind. Inform. 2018, 14, 1616–1625. [Google Scholar] [CrossRef]
  171. UK Government. UK Wholesale Electricity Market Prices. 2024. Available online: https://tradingeconomics.com/united-kingdom/electricity-price (accessed on 19 September 2024).
  172. Lopes, J.P.; Hatziargyriou, N.; Mutale, J.; Djapic, P.; Jenkins, N. Integrating distributed generation into electric power systems: A review of drivers, challenges and opportunities. Electr. Power Syst. Res. 2007, 77, 1189–1203. [Google Scholar] [CrossRef]
  173. Pfenninger, S.; Hawkes, A.; Keirstead, J. Energy systems modeling for twenty-first century energy challenges. Renew. Sustain. Energy Rev. 2014, 33, 74–86. [Google Scholar] [CrossRef]
  174. Nafi, N.S.; Ahmed, K.; Gregory, M.A.; Datta, M. A survey of smart grid architectures, applications, benefits and standardization. J. Netw. Comput. Appl. 2016, 76, 23–36. [Google Scholar] [CrossRef]
  175. Ustun, T.S.; Hussain, S.M.S.; Kirchhoff, H.; Ghaddar, B.; Strunz, K.; Lestas, I. Data Standardization for Smart Infrastructure in First-Access Electricity Systems. Proc. IEEE 2019, 107, 1790–1802. [Google Scholar] [CrossRef]
  176. Ren, C.; Xu, Y. Robustness Verification for Machine-Learning-Based Power System Dynamic Security Assessment Models Under Adversarial Examples. IEEE Trans. Control Netw. Syst. 2022, 9, 1645–1654. [Google Scholar] [CrossRef]
  177. Zhang, Z.; Yau, D.K. CoRE: Constrained Robustness Evaluation of Machine Learning-Based Stability Assessment for Power Systems. IEEE/CAA J. Autom. Sin. 2023, 10, 557–559. [Google Scholar] [CrossRef]
  178. Ren, C.; Du, X.; Xu, Y.; Song, Q.; Liu, Y.; Tan, R. Vulnerability Analysis, Robustness Verification, and Mitigation Strategy for Machine Learning-Based Power System Stability Assessment Model Under Adversarial Examples. IEEE Trans. Smart Grid 2022, 13, 1622–1632. [Google Scholar] [CrossRef]
  179. Laud, A.D. Theory and Application of Reward Shaping in Reinforcement Learning; University of Illinois at Urbana-Champaign: Urbana, IL, USA, 2004. [Google Scholar]
  180. Machlev, R.; Heistrene, L.; Perl, M.; Levy, K.; Belikov, J.; Mannor, S.; Levron, Y. Explainable Artificial Intelligence (XAI) techniques for energy and power systems: Review, challenges and opportunities. Energy AI 2022, 9, 100169. [Google Scholar] [CrossRef]
  181. Zhang, K.; Zhang, J.; Xu, P.D.; Gao, T.; Gao, D.W. Explainable AI in Deep Reinforcement Learning Models for Power System Emergency Control. IEEE Trans. Comput. Soc. Syst. 2022, 9, 419–427. [Google Scholar] [CrossRef]
  182. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Chen, X.; Wang, X. A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions. ACM Comput. Surv. 2021, 54, 1–34. [Google Scholar] [CrossRef]
  183. Jalali, S.M.J.; Osório, G.J.; Ahmadian, S.; Lotfi, M.; Campos, V.M.A.; Shafie-khah, M.; Khosravi, A.; Catalão, J.P.S. New Hybrid Deep Neural Architectural Search-Based Ensemble Reinforcement Learning Strategy for Wind Power Forecasting. IEEE Trans. Ind. Appl. 2022, 58, 15–27. [Google Scholar] [CrossRef]
  184. Wang, Q.; Kapuza, I.; Baimel, D.; Belikov, J.; Levron, Y.; Machlev, R. Neural Architecture Search (NAS) for designing optimal power quality disturbance classifiers. Electr. Power Syst. Res. 2023, 223, 109574. [Google Scholar] [CrossRef]
  185. Huang, B.; Wang, J. Applications of Physics-Informed Neural Networks in Power Systems—A Review. IEEE Trans. Power Syst. 2023, 38, 572–588. [Google Scholar] [CrossRef]
  186. Misyris, G.S.; Venzke, A.; Chatzivasileiadis, S. Physics-Informed Neural Networks for Power Systems. In Proceedings of the 2020 IEEE Power & Energy Society General Meeting (PESGM), Montreal, QC, Canada, 2–6 August 2020; pp. 1–5. [Google Scholar] [CrossRef]
  187. Sami, N.M.; Naeini, M. Machine learning applications in cascading failure analysis in power systems: A review. Electr. Power Syst. Res. 2024, 232, 110415. [Google Scholar] [CrossRef]
  188. Miraftabzadeh, S.M.; Foiadelli, F.; Longo, M.; Pasetti, M. A Survey of Machine Learning Applications for Power System Analytics. In Proceedings of the 2019 IEEE International Conference on Environment and Electrical Engineering and 2019 IEEE Industrial and Commercial Power Systems Europe (EEEIC / I&CPS Europe), Genova, Italy, 11–14 June 2019; pp. 1–5. [Google Scholar] [CrossRef]
  189. Bedi, G.; Venayagamoorthy, G.K.; Singh, R.; Brooks, R.R.; Wang, K.C. Review of Internet of Things (IoT) in Electric Power and Energy Systems. IEEE Internet Things J. 2018, 5, 847–870. [Google Scholar] [CrossRef]
  190. Ngo, V.T.; Nguyen Thi, M.S.; Truong, D.N.; Hoang, A.Q.; Tran, P.N.; Bui, N.A. Applying IoT Platform to Design a Data Collection System for Hybrid Power System. In Proceedings of the 2021 International Conference on System Science and Engineering (ICSSE), Ho Chi Minh City, Vietnam, 26–28 August 2021; pp. 181–184. [Google Scholar] [CrossRef]
  191. Sayed, H.A.; Said, A.M.; Ibrahim, A.W. Smart Utilities IoT-Based Data Collection Scheduling. Arab. J. Sci. Eng. 2024, 49, 2909–2923. [Google Scholar] [CrossRef]
  192. Li, H.; He, H. Learning to Operate Distribution Networks with Safe Deep Reinforcement Learning. IEEE Trans. Smart Grid 2022, 13, 1860–1872. [Google Scholar] [CrossRef]
  193. Vu, T.L.; Mukherjee, S.; Yin, T.; Huang, R.; Tan, J.; Huang, Q. Safe Reinforcement Learning for Emergency Load Shedding of Power Systems. In Proceedings of the 2021 IEEE Power & Energy Society General Meeting (PESGM), Washington, DC, USA, 26–29 July 2021; pp. 1–5. [Google Scholar] [CrossRef]
  194. Chiam, D.H.; Lim, K.H. Power quality disturbance classification using transformer network. In Proceedings of the International Conference on Cyber Warfare, Security and Space Research, Jaipur, India, 9–10 December 2021; pp. 272–282. [Google Scholar] [CrossRef]
  195. Gooi, H.B.; Wang, T.; Tang, Y. Edge Intelligence for Smart Grid: A Survey on Application Potentials. CSEE J. Power Energy Syst. 2023, 9, 1623–1640. [Google Scholar] [CrossRef]
  196. Sodhro, A.H.; Pirbhulal, S.; de Albuquerque, V.H.C. Artificial Intelligence-Driven Mechanism for Edge Computing-Based Industrial Applications. IEEE Trans. Ind. Inform. 2019, 15, 4235–4243. [Google Scholar] [CrossRef]
  197. Lv, L.; Wu, Z.; Zhang, L.; Gupta, B.B.; Tian, Z. An Edge-AI Based Forecasting Approach for Improving Smart Microgrid Efficiency. IEEE Trans. Ind. Inform. 2022, 18, 7946–7954. [Google Scholar] [CrossRef]
Figure 1. Reinforcement learning diagram, adapted from [20].
Figure 2. System configuration, consisting of a generator, a load, and a storage device.
Figure 3. Decision-making sequential process for a grid-connected storage system.
Figure 4. Visualization of the difference between model-based and model-free decision making, adapted from [20].
Figure 5. Model-free and model-based approaches.
Figure 6. Generalized policy iteration, where the "greedy" operator means that the policy is greedy with respect to the value function. Value and policy interact until they are optimal and consistent with each other, adapted from [20].
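To connect the evaluation-improvement loop described in the Figure 6 caption to working code, the following minimal Python sketch (illustrative only, not drawn from any of the reviewed papers) implements tabular generalized policy iteration for a generic finite MDP; the transition tensor P, reward matrix R, and discount factor gamma are placeholder assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular generalized policy iteration for a finite MDP (illustrative sketch).

    P[a, s, s'] -- transition probabilities, R[s, a] -- expected rewards.
    Alternates policy evaluation and greedy improvement until value and
    policy are consistent with each other, as in Figure 6.
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)      # start from an arbitrary policy
    V = np.zeros(n_states)

    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence.
        while True:
            V_new = np.array([R[s, policy[s]] + gamma * P[policy[s], s] @ V
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new

        # Policy improvement: act greedily with respect to the current value function.
        Q = R.T + gamma * P @ V                  # Q[a, s]
        new_policy = np.argmax(Q, axis=0)
        if np.array_equal(new_policy, policy):   # value and policy are now consistent
            return policy, V
        policy = new_policy

# Usage on a random illustrative MDP with 4 states and 2 actions.
rng = np.random.default_rng(0)
P = rng.random((2, 4, 4)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((4, 2))
pi_star, V_star = policy_iteration(P, R)
```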
Figure 7. Interaction of evaluation and improvement of the policy, adapted from [20].
Figure 8. Markov game.
Figure 9. Distribution of the different algorithms for energy market application under model-based paradigm.
Figure 10. Distribution of the different algorithms for grid stability and control application under model-based paradigm.
Figure 11. Distribution of the different algorithms for energy management in buildings application under model-based paradigm.
Figure 12. Distribution of the different algorithms for electrical vehicle application under model-based paradigm.
Figure 13. Distribution of the different algorithms for energy storage management application under model-based paradigm.
Figure 14. State-space representation and policy modeling in various power system applications under the model-based paradigm.
Figure 15. Distribution of the different algorithms among the different applications under model-based paradigm.
Figure 16. Distribution of the different algorithms for energy market application under model-free paradigm.
Figure 17. Distribution of the different algorithms for grid stability and control application under model-free paradigm.
Figure 18. Distribution of the different algorithms for energy management in buildings application under model-free paradigm.
Figure 19. Distribution of the different algorithms for electrical vehicle application under model-free paradigm.
Figure 20. Distribution of the different algorithms for energy storage management application under model-free paradigm.
Figure 21. State-space representation and policy modeling in various power system applications under the model-free paradigm.
Figure 22. Distribution of the different algorithms among the different applications under model-free paradigm.
Figure 23. Year-wise distribution of publications on reinforcement learning techniques in energy and power system applications. (a) presents articles regarding both model-based and model-free paradigms, while (b) shows papers focusing on model-free approaches.
Figure 24. Classification of papers based on the type of application and the reinforcement learning approach. (a) presents the application division for the model-based paradigm, and (b) shows the application division for the model-free paradigm.
Figure 25. Classification of papers based on the type of algorithm and the reinforcement learning approach. (a) presents the algorithm division for the model-based paradigm, and (b) shows the algorithm division for the model-free paradigm.
Figure 26. Visualization of the relation between the selected reinforcement learning algorithm and optimal control problems in power systems. (a) presents the relation between the control problem and algorithm for the model-based paradigm, and (b) shows the relation between the control problem and algorithm for the model-free paradigm.
Figure 27. Exponential growth of problem dimensions as a function of number of storage devices controlled.
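To make the trend in Figure 27 concrete, here is a small illustrative Python computation showing how the joint discretized state and action spaces grow exponentially with the number of controlled storage devices; the per-device discretization levels (SOC_BINS, POWER_BINS) are assumptions chosen only for this example.

```python
# Illustrative only: joint discretized state/action space of N identical storage devices.
SOC_BINS = 20      # assumed state-of-charge levels per device
POWER_BINS = 10    # assumed charge/discharge power levels per device

for n_devices in range(1, 7):
    states = SOC_BINS ** n_devices       # joint state space grows as SOC_BINS^N
    actions = POWER_BINS ** n_devices    # joint action space grows as POWER_BINS^N
    print(f"{n_devices} devices: {states:>12,} states x {actions:>10,} actions")
```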
Figure 28. Explainable reinforcement learning models are easier for power experts and other stakeholders to trust.
Figure 29. Safe operational subspace containing robust policies.
Table 1. Examples of MDP formulation for power system control problems.
Problem | States | Actions | Reward
EM | System operator decides upon a power flow distribution | Firms set their bid | The firms' rewards are the net profit achieved
GSAC | Voltage levels at different nodes | Adjusting the output of power generators | Cost of deviating from nominal voltage levels
BEM | Indoor temperature and humidity levels | Adjusting thermostat set-points for heating and cooling | Cost of electricity
EV | Traffic conditions and route information | Selecting a route based on traffic and charging station availability | Cost of charging, considering electricity prices and charging station fees
ESS | Battery state of charge and current consumer power demand | The controller decides how much power to produce using the generator | The power generation cost the controller must pay
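To show how one row of Table 1 maps onto an actual MDP interface, the sketch below encodes the energy storage (ESS) example as a minimal Python environment with reset/step semantics; the battery capacity, demand range, and quadratic generation-cost coefficient are assumptions introduced only for illustration and are not taken from the reviewed works.

```python
import numpy as np

class StorageDispatchMDP:
    """Minimal MDP for the ESS row of Table 1 (illustrative values only).

    State  : (battery state of charge, current consumer power demand)
    Action : generator power set-point chosen by the controller
    Reward : negative generation cost paid by the controller
    """

    def __init__(self, capacity_kwh=10.0, dt_h=1.0, cost_coeff=0.05, seed=0):
        self.capacity = capacity_kwh
        self.dt = dt_h
        self.cost_coeff = cost_coeff          # assumed quadratic fuel-cost coefficient
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.soc = 0.5 * self.capacity        # start half full
        self.demand = self._sample_demand()
        return np.array([self.soc, self.demand])

    def _sample_demand(self):
        return float(self.rng.uniform(1.0, 5.0))   # assumed demand range [kW]

    def step(self, generator_power_kw):
        # Surplus generation charges the battery; a deficit discharges it.
        net_energy = (generator_power_kw - self.demand) * self.dt
        self.soc = float(np.clip(self.soc + net_energy, 0.0, self.capacity))
        reward = -self.cost_coeff * generator_power_kw ** 2   # negative generation cost
        self.demand = self._sample_demand()
        return np.array([self.soc, self.demand]), reward

# One illustrative transition.
env = StorageDispatchMDP()
state = env.reset()
next_state, reward = env.step(generator_power_kw=3.0)
```

Any of the surveyed algorithm families (QL, PG, AC, PPO) could in principle be trained against such an interface once the state and action spaces are discretized or parameterized appropriately.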
Table 2. Model-based approaches for different applications in energy market management and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset & Simulator
[49] | ET | AC | Continuous | Deterministic | [57]
[51] | ET | AC | Discrete | Stochastic | [58]
[52] | ET | QL | Continuous | Deterministic | [59]
[54] | ET | Other | Discrete | Stochastic | Simulated data
[55] | ET | PG, other | Continuous | Deterministic | Real data
[50] | Dispatch | PG | Continuous | Deterministic | Simulated data
[53] | Dispatch | AC, other | Continuous | Deterministic | [60,61]
[56] | DR, microgrid | QL | Continuous | Deterministic | Real data, [62]
Table 3. Model-based approaches in different power systems' grid stability and control applications and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[63] | Voltage control | Other | Continuous | Stochastic | IEEE 300, IEEE 9, [66]
[64] | Voltage control | Other | Continuous | Stochastic | IEEE 300
[65] | Voltage control | AC | Continuous | Deterministic | IEEE 123
[66] | Voltage control | QL | Continuous | Stochastic | IEEE 39, [66]
[67] | Microgrid | Other | Discrete | Deterministic | HIL platform "dSPACE MicroLabBox"
[68] | Microgrid | PPO | Continuous | Stochastic | Empirical measurements
[70] | Power flow, microgrid | AC | Continuous | Stochastic | [71,72]
[69] | Power flow | PG | Continuous | Stochastic | IEEE 118
Table 4. Model-based approaches for different applications in energy management in buildings and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[73] | HVAC | AC | Discrete | Deterministic | [80]
[74] | HVAC | AC | Continuous | Stochastic | Real data [upon request]
[76] | HVAC | Other | Discrete | Deterministic | Simulated data
[78] | HVAC | Other | Discrete | Deterministic | Simulated data
[77] | HVAC | QL, other | Discrete | Deterministic | [81]
[75] | HVAC | PPO | Discrete | Stochastic | "EnergyPlus"
[79] | Dispatch | QL, other | Discrete | Deterministic | [82], simulated data
Table 5. Model-based approaches for different applications in electrical vehicles and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[83] | Power flow | QL | Discrete | Deterministic | MATLAB simulation
[84] | Charge control | QL | Mixed | Deterministic | Historic prices
[85] | Charge control | PG | Continuous | Deterministic | Simulated
[86] | Charge control | AC | Continuous | Deterministic | Simulated
[87] | Charge control | QL | Discrete | Deterministic | Open street map, ChargeBar
[88] | Charge control | QL | Discrete | Deterministic | Simulated
[90] | Charge scheduling | AC | Continuous | Deterministic | Simulated
[91] | Charge control | AC | Continuous | Stochastic | Historic prices
[89] | Load balancing | QL | Continuous | Deterministic | Simulated
Table 6. Model-based approaches for different applications in energy storage management and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[100] | Smart grid | QL | Discrete | Stochastic | Simulated data
[92] | Smart grid | Other | Discrete | Stochastic | Simulated data
[98] | Smart grid | Other | Discrete | Stochastic | [101,102]
[94] | EV | QL | Discrete | Stochastic | Simulated data
[103] | EV | QL, other | Discrete | Deterministic | Simulated data
[95] | EV | QL | Continuous | Deterministic | Simulated data
[96] | EV | QL, other | Discrete | Stochastic | Simulated data
[93] | Renewable energy | Other [104] | Discrete | Stochastic | Simulated data
[97] | Battery ESS, frequency support | PG, AC | Continuous | Deterministic | Simulated data
[99] | Energy system modeling | Other | Discrete | Stochastic | Simulated data
Table 7. Model-free approaches for different applications in energy market management and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[112] | ET | PG | Continuous | Deterministic | Real data
[113] | ET | QL | Discrete | Deterministic | [122,123]
[115] | ET | PG | Continuous | Deterministic | [124]
[116] | ET | AC | Continuous | Stochastic | Simulated data
[117] | ET | QL | Continuous | Deterministic | Real data
[119] | Microgrid, dispatch | QL | Continuous | Stochastic | [125]
[120] | Microgrid | PG | Continuous | Stochastic | Simulated data
[121] | Microgrid | Other | Continuous | Deterministic | Real and simulated data
Table 8. Model-free approaches in different power systems' grid stability and control applications and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[126] | Voltage control | PG | Continuous | Stochastic | Powerflow and Short circuit Assessment Tool (PSAT), 200-bus system [135]
[127] | Voltage control | AC | Continuous | Stochastic | IEEE 33-, 123-, and 342-node systems
[128] | Voltage control | QL | Discrete | Stochastic | IEEE 14-bus system
[129] | Frequency control | QL | Discrete | Deterministic | Simulated data
[130] | Frequency control | PG | Discrete | Deterministic | Kundur's 4-unit-13-bus system, New England 68-bus system, [136]
[131] | Microgrid | QL | Continuous | Stochastic | 7-bus system and the IEEE 123-bus system
[132] | Microgrid | Other | Discrete | Deterministic | Simulated data
[133] | Power flow | PPO | Continuous | Stochastic | Illinois 200-bus system
[134] | Power flow | PPO | Continuous | Stochastic | Simulated data, West Denmark wind data
Table 9. Model-free approaches for different applications in energy management in buildings and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[137] | HVAC | QL | Discrete | Deterministic | Simulated data
[146] | HVAC | PPO | Continuous | Deterministic | "EnergyPlus"
[141] | HVAC | QL | Continuous | Stochastic | "EnergyPlus"
[143] | HVAC | QL | Discrete | Deterministic | Simulated data
[145] | HVAC | PG | Continuous | Stochastic | [148,149]
[142] | HVAC | AC | Continuous | Stochastic | [150]
[138] | HVAC, DR | PPO | Continuous | Deterministic | "EnergyPlus"
[140] | HVAC, DR | QL, PG | Continuous | Stochastic | [150]
[139] | DR | QL, PG | Discrete | Deterministic | [151]
[112] | Dispatch | PG | Continuous | Deterministic | Simulated data
[147] | Dispatch | QL | Continuous | Stochastic | [61,152]
Table 10. Model-free approaches for different applications in electrical vehicles and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[153] | Scheduling | QL | Discrete | Deterministic | "Open street map", "ChargeBar"
[154] | Scheduling | QL | Continuous | Deterministic | Simulated
[155] | Scheduling | Other | Continuous | Deterministic | Historic data
[156] | Scheduling | Other | Continuous | Deterministic | Simulated
[158] | Scheduling | QL | Discrete | Deterministic | "ElaadNL"
[157] | Cost reduction | Other | Mixed | Deterministic | Simulated
[159] | Cost reduction | QL | Continuous | Deterministic | Simulated
[162] | Cost reduction | QL | Discrete | Deterministic | Simulated
[161] | DR | QL | Discrete | Deterministic | Simulated
[160] | SoC control | QL, PG | Continuous | Deterministic | Historic data
Table 11. Model-free approaches for different applications in energy storage management and the MDP setting.
Ref. | Application | Algorithm | State Space | Policy | Dataset
[164] | Microgrids | QL | Continuous | Deterministic | Simulated data
[165] | Microgrids | QL | Discrete | Deterministic | Simulated data
[168] | Microgrids | QL | Continuous | Deterministic | Simulated data
[169] | Microgrids | QL | Continuous | Deterministic | [125]
[167] | Frequency control | Other | Continuous | Deterministic | Simulated data
[97] | Frequency control | PG, AC | Continuous | Deterministic | Simulated data
[163] | Energy trading | QL | Continuous | Deterministic | [171]
[166] | Energy trading | QL | Continuous | Deterministic | Simulated data
[170] | EV | QL | Continuous | Stochastic | Simulated data
Table 12. Keywords for different application areas of the model-based and model-free paradigms.
RL Expressions: "model-based" OR "model learning" OR "model-free" OR "data-driven", AND/OR "reinforcement learning".
Power Systems Application Expressions: "energy market management" OR "voltage control" OR "frequency control" OR "reactive power control" OR "grid stability" OR "microgrid" OR "building energy management" OR "building" OR "electrical vehicles" OR "EV" OR "energy storage control problems" OR "battery energy storage system" OR "local energy trading".
Table 13. Relations [in %] between common reinforcement learning algorithms and optimal control problems, as shown in Figure 26.
Problem | QL (MB) | QL (MF) | PG (MB) | PG (MF) | AC (MB) | AC (MF) | PPO (MB) | PPO (MF) | Other (MB) | Other (MF)
ESS | 5 | 7 | 1 | 1 | 0 | 1 | 0 | 0 | 6 | 1
EV | 5 | 7 | 1 | 1 | 3 | 1 | 0 | 0 | 0 | 2
BEM | 2 | 6 | 0 | 4 | 2 | 1 | 1 | 2 | 3 | 0
GSAC | 1 | 3 | 1 | 2 | 2 | 1 | 1 | 2 | 3 | 1
EM | 2 | 3 | 2 | 3 | 3 | 1 | 0 | 0 | 3 | 2
Table 14. Open challenges for the application of RL algorithms in power system control problems.

Category: Lack of standardization
  • Lack of real-world data for different control tasks in power systems.
  • No high-quality simulator that efficiently integrates accurate physical models of energy systems with reinforcement learning libraries.
  • No standardized benchmark algorithms or datasets that set a quality norm for the various reinforcement learning algorithms.

Category: Lack of generalization
  • Lack of data limits the generalization ability of model-free algorithms.
  • Complex power system models are difficult to learn; model-based algorithms therefore converge to an inaccurate model that does not generalize well.
  • As the number of state or action variables grows, the computational requirements of the model grow exponentially.

Category: Limited safety
  • Model-free methods produce suboptimal policies when the acquired data are limited, and such policies may not perform well when unexpected events occur.
  • The complexity of the environment's dynamics causes model-based algorithms to produce suboptimal policies, jeopardizing system stability when uncertainty is encountered.
  • During training, the models focus on exploration and take mainly random actions; in real-time power system applications this may be catastrophic and lead to blackouts.

Category: Nonstationary environments
  • Decompose the problem into hierarchical levels representing distinct operating modes.
  • Use RNNs/LSTMs to process sequential data and capture temporal dependencies.
  • Use online learning with adaptive parameters of the transition network.
  • Incorporate prior knowledge, such as symmetries, to reduce the state–action space size.
  • Design a transition function that evolves in time.
  • Balance exploration and exploitation and forget outdated information.

Category: Reward shaping
  • Define a global objective and local rewards to adjust the reward horizon.
  • Adjust rewards dynamically to combine gained experience and prior knowledge.
  • Use reward shaping to incorporate physical and safety constraints (an illustrative sketch follows this table).
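As one purely illustrative reading of the reward-shaping entries above, the sketch below augments a task reward with a penalty that encodes a per-unit voltage-band safety constraint and an optional prior-knowledge bonus; the band limits, the penalty weight, and the function signature (shaped_reward, safety_weight, prior_bonus) are assumptions made for this example rather than a method taken from the reviewed papers.

```python
def shaped_reward(task_reward, bus_voltages_pu, prior_bonus=0.0,
                  v_min=0.95, v_max=1.05, safety_weight=10.0):
    """Illustrative reward shaping: task objective + safety penalty + prior knowledge.

    task_reward     -- the original (global) objective, e.g. negative operating cost
    bus_voltages_pu -- per-unit bus voltages used to encode a physical safety constraint
    prior_bonus     -- optional shaping term encoding domain knowledge
    """
    # Sum of voltage-band violations across all buses (zero when fully within limits).
    violation = sum(max(0.0, v_min - v) + max(0.0, v - v_max)
                    for v in bus_voltages_pu)
    return task_reward - safety_weight * violation + prior_bonus

# Example: a small penalty because one bus sits slightly above the assumed 1.05 p.u. limit.
r = shaped_reward(task_reward=-12.3, bus_voltages_pu=[0.98, 1.02, 1.06])
```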