Appendix A
This appendix provides further details on MDPs, the DP value iteration method, and the implementation of the RL Q-Learning algorithm.
Given an MDP with a set of states $\mathcal{S}$ and actions $\mathcal{A}$, a reward function $\mathcal{R}$, a state transition probability distribution $p$ (from all states $s \in \mathcal{S}$ to all their successors $s' \in \mathcal{S}$), and a discount factor $\gamma$; the random variables $R_t$ and $S_t$ depend on the preceding state and action. That is, the probability of those variables occurring at the same time step $t$ is conditioned on the previous state $S_{t-1}$ and the action $A_{t-1}$ taken, as defined in Equation (A1):
$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}, \tag{A1}$$
for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$ and $a \in \mathcal{A}(s)$. Therefore, the function $p$ defines the dynamics of the MDP.
Recall that the objective of an MDP is to maximize the expected discounted cumulative reward in a trajectory. Equation (A2) defines the return (discounted cumulative reward) in an episodic MDP:
$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k, \tag{A2}$$
where $T$ corresponds to the last time step of the episode. Time step $T$ leads to terminal states. Once a terminal state is reached, the episode ends. In this formal context, a policy $\pi$ establishes a mapping from states to actions. Hence, $\pi(a \mid s)$ expresses the probability of selecting an action $a$ given a state $s$ and following the policy $\pi$. The value function $v_\pi$ under a policy $\pi$ indicates how good it is to be in each state $s$, based on the expected discounted cumulative reward obtained by following policy $\pi$ to choose the subsequent actions. The value of a terminal state is always 0. The Bellman equation for the value function $v_\pi$ is presented in Equation (A3). This expression is directly derived from Equations (A1) and (A2). It expresses the recursive relationship between the value of a state and the values of its successor states, and it represents a basis for many more complex derivations in MDPs. These equations are widely applied to solve control and decision-making problems:
$$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right], \tag{A3}$$
for all $s \in \mathcal{S}$.
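For illustration, a one-step Bellman backup of Equation (A3) could be sketched as follows, reusing the tabular representation introduced above; the data structures (`P`, `pi`, `V`) are illustrative assumptions.

```python
# One-step Bellman backup for v_pi (Equation (A3)).
# P maps (state, action) to a list of (next_state, reward, probability) triples,
# pi maps state to {action: probability}, and V maps state to its current value
# estimate (terminal states default to 0, as stated in the text).
def bellman_backup_v(state, P, pi, V, gamma=1.0):
    """Return sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]."""
    value = 0.0
    for action, prob_a in pi[state].items():
        for next_state, reward, prob in P[(state, action)]:
            value += prob_a * prob * (reward + gamma * V.get(next_state, 0.0))
    return value
```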
A state-value function $v_\pi$ is closely dependent on the policy followed. Therefore, an optimal value function $v_*$ is obtained by following an optimal policy $\pi_*$, thus taking the action that maximizes the future rewards. The $v_*$ function is defined in Equation (A4) and it is known as the Bellman optimality equation:
$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right], \tag{A4}$$
for all $s \in \mathcal{S}$.
Similarly, the action-value function $q_\pi(s, a)$ defined for a policy $\pi$ is the expected return of taking action $a$ in a state $s$ and following the policy $\pi$ thereafter. Equations (A5) and (A6) express the action-value function and its Bellman optimality equation, respectively:
$$q_\pi(s, a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right], \tag{A5}$$
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right], \tag{A6}$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.
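A one-step Bellman optimality backup of Equation (A6) could be sketched in the same tabular setting; the `Q` dictionary and its indexing are illustrative assumptions.

```python
# One-step Bellman optimality backup for q_* (Equation (A6)).
# Q maps (state, action) pairs to value estimates; successor states with no
# entries (e.g., terminal states) contribute 0.
def bellman_optimality_backup_q(state, action, P, Q, gamma=1.0):
    """Return sum_{s',r} p(s',r|s,a) [r + gamma * max_{a'} Q(s', a')]."""
    total = 0.0
    for next_state, reward, prob in P[(state, action)]:
        best_next = max((q for (s, a), q in Q.items() if s == next_state), default=0.0)
        total += prob * (reward + gamma * best_next)
    return total
```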
The Bellman equation for the value function [24] gives a recursive decomposition that can be used to iteratively improve this function until it converges to the optimal value function $v_*$. The V-table stores these value estimates so that they can be reused across iterations.
A DP value iteration method [24] is used to approximate the optimal value function $v_*$ in the V-table. Equation (A7) presents the recursive version of the Bellman equation used to improve the value function. In the experiments presented, the environment is episodic and the discount factor $\gamma$ is set to 1. Moreover, the environment is deterministic (non-stochastic), due to the assumption of constant variables. Therefore, the recursive function considered for this task is the simplified version presented in Equation (A8):
$$v_{k+1}(s) \doteq \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right], \tag{A7}$$
$$v_{k+1}(s) \doteq \max_{a}\left[r + v_k(s')\right], \tag{A8}$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$, where $s'$ and $r$ in Equation (A8) denote the successor state and reward deterministically obtained by taking action $a$ in state $s$. When no more improvements can be made (or only very small improvements happen), the value iteration algorithm returns an approximation to the optimal value function $v_*$ that maps states to expected cumulative reward values in a tabular form (V-table). A threshold $\theta$ is set to define the stopping criterion of the algorithm, since infinitely many iterations would be needed to converge exactly to $v_*$. The V-table is used to compute the (near) optimal policy $\pi_*$ for decision-making by choosing the action that maximizes the expected transition values. Equation (A9) presents the policy derived from the value function approximation:
$$\pi_*(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}(s)}\left[r + V(s')\right], \tag{A9}$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.
The pseudo-code of the presented DP value iteration algorithm is illustrated in Algorithm A1.
Algorithm A1: Value iteration
1. Initialize $V(s) \leftarrow 0$ for all $s \in \mathcal{S}$; set the threshold $\theta > 0$
2. Repeat:
3.   $\Delta \leftarrow 0$
4.   For each $s \in \mathcal{S}$ do:
5.     $v \leftarrow V(s)$
6.     $V(s) \leftarrow \max_{a}\left[r + V(s')\right]$ (Equation (A8)); $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
7.   end
8. end (until $\Delta < \theta$)
9. Return $V \approx v_*$
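A possible Python realization of Algorithm A1 is sketched below for a deterministic, episodic environment with $\gamma = 1$, matching the simplified update of Equation (A8). The environment model interface (`states`, `actions(s)`, `step(s, a)`, `terminal(s)`) is an assumption made for this illustration.

```python
# Sketch of tabular value iteration (Algorithm A1) for a deterministic, episodic
# environment with gamma = 1, following the simplified update of Equation (A8).
# `states`, `actions(s)`, `step(s, a) -> (next_state, reward)` and `terminal(s)`
# stand for an environment model and are illustrative assumptions.
def value_iteration(states, actions, step, terminal, theta=1e-6):
    """Return a V-table approximating v_* and the greedy policy of Equation (A9)."""
    V = {s: 0.0 for s in states}                      # initialize the V-table
    while True:
        delta = 0.0
        for s in states:
            if terminal(s):
                continue                              # terminal states keep value 0
            v_old = V[s]
            # Simplified Bellman optimality backup (Equation (A8)): max_a [r + V(s')]
            V[s] = max(r + V[s_next]
                       for s_next, r in (step(s, a) for a in actions(s)))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                             # stopping criterion with threshold theta
            break

    def backup(s, a):                                 # value of taking action a in state s
        s_next, r = step(s, a)
        return r + V[s_next]

    # (Near) optimal policy extracted from the V-table (Equation (A9))
    policy = {s: max(actions(s), key=lambda a: backup(s, a))
              for s in states if not terminal(s)}
    return V, policy
```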
Reinforcement Learning: Q-Learning
Q-Learning is a Temporal Difference (TD) control algorithm used in RL [24]. The learned state-action value function $Q$ directly approximates the optimal state-action value function $q_*$, regardless of the policy being followed. In Q-Learning, a value is learned for each state-action pair and stored in the Q-table. Q-table updates are made after every interaction step by modifying previous beliefs with the new transition information. In this case, the learned action-value table, which directly approximates $q_*$, is obtained by a sequence of $Q$ functions. The state-action value sequential updates, obtained by repeatedly applying the Bellman equation, are presented in Equation (A10). For the sake of comparison between the methodologies, in this work we address an episodic problem, as in the DP case, and $\gamma$ is set to 1:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right], \tag{A10}$$
for all $S_t \in \mathcal{S}$ and $A_t \in \mathcal{A}(S_t)$. The $\alpha$ value is the learning rate or step size. The learning rate weights the importance of new updates in the value function: a factor of 0 would give all the weight to past knowledge, so the agent learns nothing from new experiences, while a factor of 1 gives all the weight to new experiences, forgetting all past knowledge. Any intermediate value balances the importance of past knowledge with the new one.
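For instance, the update of Equation (A10) can be rewritten as a weighted blend of the old estimate and the new TD target, which makes the role of the learning rate explicit; the function below is only an illustrative sketch.

```python
# Single Q-Learning update (Equation (A10)) written as a weighted blend:
# alpha = 0 keeps the old estimate unchanged, alpha = 1 replaces it with the TD target.
def q_update(q_old, reward, max_q_next, alpha, gamma=1.0):
    td_target = reward + gamma * max_q_next
    return (1.0 - alpha) * q_old + alpha * td_target
```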
Q-Learning is an off-policy algorithm that uses two different policies in the state-action value updates: the behaviour policy and the target policy.
The behaviour policy $b(a \mid s)$ is used to manage the trade-off between exploitation and exploration. In this work, the behaviour policy is an $\varepsilon$-greedy decay exploration strategy with respect to $Q$, where the $\varepsilon$ value is decayed after each learning step. Therefore, the behaviour policy converges to the target policy as the $\varepsilon$ value decreases. The behaviour policy is expressed in Equation (A11):
$$b(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{if } a = \operatorname*{arg\,max}_{a'} Q(s, a'), \\ \dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{otherwise,} \end{cases} \tag{A11}$$
for all $s \in \mathcal{S}$.
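A minimal sketch of an $\varepsilon$-greedy behaviour policy, assuming a Q-table indexed by (state, action) pairs, could look as follows.

```python
import random

# Epsilon-greedy behaviour policy (Equation (A11)): explore a random action with
# probability epsilon, otherwise exploit the current Q-table. Names are illustrative.
def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```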
The $\varepsilon$ value is linearly decayed until a minimum value of 0.01 is reached at the last training step. Equations (A12) and (A13) present the $\varepsilon$ updates:
$$N = \mathit{episodes} \times \mathit{batch\_size}, \tag{A12}$$
$$\varepsilon \leftarrow \max\!\left(\varepsilon_{min},\ \varepsilon - \frac{\varepsilon_{0} - \varepsilon_{min}}{N}\right), \tag{A13}$$
where $\varepsilon_{0}$ (the initial exploration value), $\varepsilon_{min} = 0.01$, the number of episodes, and the batch_size are parameters of the process, and $N$ represents the number of training steps.
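Assuming the linear schedule described above, the decay step could be sketched as follows; the initial value and the number of training steps are illustrative assumptions.

```python
# Linear epsilon decay towards a minimum of 0.01 over n_steps training steps
# (cf. Equations (A12) and (A13)); epsilon_init and n_steps are assumed parameters.
def decay_epsilon(epsilon, epsilon_init=1.0, epsilon_min=0.01, n_steps=10_000):
    return max(epsilon_min, epsilon - (epsilon_init - epsilon_min) / n_steps)
```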
The target policy $\pi$ is greedy with respect to $Q$, hence it is deterministic. It is used to select the action with the highest expected cumulative reward. After training, the Q-table is used to compute the (near) optimal policy $\pi_*$ for decision-making by choosing the action that maximizes the expected learned values. Equation (A14) presents the target policy $\pi$ derived from the state-action value function approximation:
$$\pi(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}(s)} Q(s, a), \tag{A14}$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.
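In the tabular setting assumed above, the greedy target policy of Equation (A14) reduces to an argmax over the learned Q-table.

```python
# Greedy target policy (Equation (A14)): select the action with the highest Q-value.
def greedy_policy(Q, state, actions):
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```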
The pseudo-code of the presented Q-Learning algorithm is illustrated in Algorithm A2.
Algorithm A2: Q-Learning
1. Initialize $Q(s, a) \leftarrow 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$; set $\alpha$ and $\varepsilon$
2. For each episode do:
3.   $S \leftarrow$ environment initialization
4.   While $S$ is not terminal do:
5.     Sample action $A$ with the $\varepsilon$-greedy policy from $Q$ (Equation (A11))
6.     Sample transition $(R, S')$
7.     Update $Q(S, A)$ (Equation (A10))
8.     Decay the $\varepsilon$ value (Equations (A12) and (A13))
9.     $S \leftarrow S'$
10.   end
11. end
12. Return $Q \approx q_*$
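A compact Python sketch of Algorithm A2 is given below. The environment interface (`env.reset()` returning the initial state and `env.step(action)` returning `(next_state, reward, done)`), as well as the hyper-parameter values, are illustrative assumptions rather than the configuration used in the experiments.

```python
import random
from collections import defaultdict

# Sketch of tabular Q-Learning (Algorithm A2) with an epsilon-greedy behaviour
# policy and linear epsilon decay. `actions` is a list of available actions,
# assumed to be the same in every state for simplicity.
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=1.0,
               epsilon_init=1.0, epsilon_min=0.01, n_steps=10_000):
    Q = defaultdict(float)                          # 1. initialize the Q-table
    epsilon = epsilon_init
    for _ in range(episodes):                       # 2. for each episode
        state = env.reset()                         # 3. environment initialization
        done = False
        while not done:                             # 4. while S is not terminal
            # 5. sample an action with the epsilon-greedy behaviour policy
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            # 6. sample the transition
            next_state, reward, done = env.step(action)
            # 7. Q-Learning update (Equation (A10)); terminal successors bootstrap to 0
            max_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * max_next - Q[(state, action)])
            # 8. decay epsilon linearly towards epsilon_min
            epsilon = max(epsilon_min, epsilon - (epsilon_init - epsilon_min) / n_steps)
            state = next_state                      # 9. move to the next state
    return Q                                        # 12. return the learned Q-table
```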