In this section, we first provide a detailed introduction to the DSLD hyperparameter optimization model, including the design of the actor network and the DS-POOL. Then, we introduce the implementation details of the RL-AOPP framework.
3.3.1. Markov Decision Process and Construction of Actor Network
The hyperparameter optimization problem refers to selecting a specific value for each hyperparameter of the model to be optimized from its search space and combining these values into a hyperparameter configuration that is applied to model training. For the ECP problem addressed in this paper, the goal of hyperparameter optimization is to find a configuration in the hyperparameter search space that minimizes the error, and thus maximizes the accuracy, of the final trained prediction model. The objective function can be expressed as Equation (7):

$\lambda^{*} = \arg\min_{\lambda \in \Lambda} L\big(M_{\lambda}(D_{\mathrm{train}}), D_{\mathrm{valid}}\big)$ (7)

In this equation, $D_{\mathrm{train}}$ is the training set, $D_{\mathrm{valid}}$ is the validation set, $\Lambda$ is the hyperparameter search space, $M_{\lambda}(D_{\mathrm{train}})$ is the model obtained by training on the training set with hyperparameter configuration $\lambda$, and L is the loss function of M in this task.
We can formulate the hyperparameter optimization problem in ECP as a Markov decision process (MDP). In this process, an optimal value needs to be determined for each hyperparameter of the model, ultimately yielding an optimal hyperparameter vector. Suppose the model M to be optimized has N hyperparameters to be selected. If we regard hyperparameter selection as a multiarmed bandit problem, and the search space of the ith hyperparameter is $S_i$, then the entire search space is $S = S_1 \times S_2 \times \cdots \times S_N$. This search space is high-dimensional and complex, and it grows exponentially with the number of hyperparameters.
To address this excessively large search space, the idea of divide and conquer is adopted to optimize the N hyperparameters separately. In reinforcement learning, different hyperparameters can be selected at different time steps, for example, selecting the t-th hyperparameter at time step t. However, making each decision in isolation may neglect the correlations between the hyperparameters. Therefore, this paper uses the LSTM, which can capture the intrinsic dependencies in sequential data, as the core structure of the actor network. This allows each hyperparameter selection to be conditioned on the previous selection, converting hyperparameter selection into a sequential decision-making process. The effective search space becomes $|S_1| + |S_2| + \cdots + |S_N|$, which grows only linearly with the number of hyperparameters, greatly reducing the search space and improving the search rate.
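As an illustration with hypothetical counts (not taken from the paper), suppose $N = 5$ and each hyperparameter is discretized to 10 candidate values. Joint search must cover

$\prod_{i=1}^{5} |S_i| = 10^{5} = 100{,}000$ combinations, whereas sequential selection only covers $\sum_{i=1}^{5} |S_i| = 50$ candidate values.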
If the model can predict the trend of the next hyperparameter selection, it can avoid values that will not be selected in the future. Consequently, we redesign the structure of the actor network, which is shown in Figure 1 (Actor). The actor consists of two LSTM layers and two fully connected layers (FCLs). The fully connected layers adjust the dimensions of the input and output, while the LSTM layers learn the latent information in the input data and preserve its variation trend in the time dimension. For an optimization problem containing N hyperparameters, at time step t the actor network selects the hyperparameter $a_t$ based on the input current state $s_t$. After N time steps, the network obtains a set of hyperparameter configurations $A = \{a_1, a_2, \ldots, a_N\}$; consequently, a complete epoch at the current moment comprises N time steps. The action sequence then interacts with the environment: the model to be optimized is trained with the hyperparameter combination selected at the current moment, and the accuracy of the model on the validation dataset is used as the reward value at the current moment.
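The following is a minimal sketch of such an actor network. The layer sizes, the assumption that every hyperparameter is discretized to the same number of candidate values, and the one-hot feedback of the previous choice are illustrative choices of ours, not details taken from the paper; exploration noise and the DDPG-specific update are omitted.

```python
# Sketch of a two-LSTM / two-FCL actor that selects N hyperparameters sequentially.
import torch
import torch.nn as nn


class LSTMActor(nn.Module):
    def __init__(self, n_hyperparams, n_candidates, hidden_dim=64):
        super().__init__()
        self.n_hyperparams = n_hyperparams
        self.fc_in = nn.Linear(n_candidates, hidden_dim)   # input FCL: adjusts the state dimension
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, n_candidates)  # output FCL: scores each candidate value

    def forward(self, state):
        """state: one-hot of the previously selected value, shape (batch, n_candidates)."""
        h = None                                   # LSTM hidden state carries earlier choices
        x = state
        actions = []
        for _ in range(self.n_hyperparams):        # one time step per hyperparameter
            z = torch.relu(self.fc_in(x)).unsqueeze(1)       # (batch, 1, hidden_dim)
            out, h = self.lstm(z, h)
            scores = self.fc_out(out.squeeze(1))             # (batch, n_candidates)
            idx = scores.argmax(dim=-1)                      # greedy choice for this hyperparameter
            actions.append(idx)
            x = nn.functional.one_hot(idx, scores.size(-1)).float()  # feed the choice back in
        return torch.stack(actions, dim=1)                   # (batch, N) selected indices


# Example: 4 hyperparameters, each discretized to 10 candidate values.
actor = LSTMActor(n_hyperparams=4, n_candidates=10)
initial_state = torch.zeros(1, 10)                 # "no previous choice" state
print(actor(initial_state))                        # e.g. tensor([[3, 7, 0, 9]])
```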
Therefore, the Markov decision process for the hyperparameter optimization problem is redefined as follows, where the target object is a model containing N hyperparameters to be optimized:
Action space: The actor network uses N time steps, selecting the value $a_i$ of the ith hyperparameter at the ith time step, until the end of the Nth time step, to obtain a set of hyperparameter combinations $A = \{a_1, a_2, \ldots, a_N\}$. For each hyperparameter i within the N time steps, its search space is $S_i$. The overall search space is $|S_1| + |S_2| + \cdots + |S_N|$.
State space: The environment includes the model to be optimized, the dataset, and the hyperparameter combination. During the execution of the optimization task, only the hyperparameter combination to be optimized is dynamically adjusted. Therefore, the hyperparameter setting selected at the previous moment is taken as the state at the current moment.
Reward value: The accuracy on the validation dataset of the model trained with a set of actions A is used as the reward value, $R = \mathrm{Acc}(M_A)$, where $\mathrm{Acc}(M_A)$ is the accuracy of the MTCN model trained with configuration A (a minimal sketch of this environment interaction is given after these definitions).
State transition probability: The transition probability of the environment state is not observable. Therefore, this research method adopts a model-free deterministic policy gradient approach for learning.
Discount factor: This value appears in the calculation of the cumulative return. The closer it is to 1, the further into the future the network considers benefits when evaluating the current decision.
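To make the MDP concrete, the sketch below shows how one environment interaction could be implemented under the definitions above. The callables `train_fn` and `accuracy_fn` are hypothetical stand-ins for the MTCN training and validation code described in the paper, not functions defined by the authors.

```python
# Sketch of one environment interaction: action -> (next state, reward).
def environment_step(action, train_fn, accuracy_fn, train_set, valid_set):
    """action: dict mapping hyperparameter names to the values chosen by the actor."""
    model = train_fn(action, train_set)        # train the MTCN with the chosen hyperparameters
    reward = accuracy_fn(model, valid_set)     # validation accuracy is the instant reward
    next_state = action                        # the chosen configuration becomes the next state
    return next_state, reward
```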
3.3.2. Construction of Differential Value Experience Sample Pool
To reduce the correlation problem among samples in the experience sample pool, we propose a method for constructing a differential value experience sample pool with a high-value priority sampling experience replay mechanism. The experience samples are stored in the experience pool according to their learning value.
During the training process of the DSLD model, the actor network selects the hyperparameter actions, and the critic network estimates the value that the selected actions may produce. After the actor network selects actions and interacts with the environment, the temporal difference error (TD error) of the action value function is first calculated for each obtained sample before it is put into the sample pool. The experience samples are then stored in the experience pool in order of this value. The value calculation formula is shown in Equation (8):

$\delta_j = r_j + \gamma Q'\big(s_{j+1}, \mu'(s_{j+1})\big) - Q(s_j, a_j)$ (8)

In the equation, $r_j$ represents the instant reward and $\mu'(s_{j+1})$ represents the action of the target policy $\mu'$ in state $s_{j+1}$. Our goal is to make $\delta_j$ as small as possible, since it represents the difference between the current Q value and the target Q value of the next step. When $|\delta_j|$ is relatively large, it indicates that the sample has a significant influence on the value network, and it can be understood that this sample has a higher value. The probability of each experience sample being sampled depends on its value. The sampling probability of the jth sample is defined as $P(j)$, as shown in Equation (9):

$P(j) = \dfrac{p_j^{\alpha}}{\sum_{k} p_k^{\alpha}}, \qquad p_j = \dfrac{1}{\mathrm{rank}(j)}$ (9)
In Equation (9), $\mathrm{rank}(j)$ is the rank of sample j in the replay experience pool according to its $|\delta_j|$ value, and $p_j$ is the resulting priority. The parameter $\alpha \in [0, 1]$ controls the degree to which priority is used: $\alpha = 0$ corresponds to uniform random sampling, and $\alpha = 1$ indicates fully greedy sampling based on priority. During the learning process, high-value samples have a positive impact on the network, but low-value samples cannot be ignored either. Defining the sampling probability in this way introduces a random factor when selecting experiences, because even low-value experiences may be replayed, which ensures the diversity of the sampled experiences. This diversity helps prevent overfitting of the neural network. To correct the estimation bias caused by priority sampling, the loss function of the estimation network is multiplied by the importance sampling weight $w_j$, as shown in Equation (10):

$w_j = \left( \dfrac{1}{N} \cdot \dfrac{1}{P(j)} \right)^{\beta}$ (10)

In the equation, N represents the size of the differential value sample pool, and $\beta$ is used to adjust the distribution of the weights. The higher the value of a sample, the higher its priority and the smaller its importance sampling weight; correcting the loss in this way smooths the optimization surface. Therefore, the differential value experience sample pool allows the algorithm to focus more on high-value samples and accelerates the convergence of the algorithm.
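The following is a minimal sketch of such a differential value experience pool, assuming the TD error of Equation (8) is computed by the caller. The values of alpha and beta and the max-weight normalization are illustrative choices, not the paper's exact settings.

```python
# Sketch of a rank-based, priority-sampled experience pool (Equations (8)-(10)).
import random


class DifferentialValuePool:
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.pool = []                                    # list of (|td_error|, transition)

    def store(self, td_error, transition):
        self.pool.append((abs(td_error), transition))
        self.pool.sort(key=lambda x: x[0], reverse=True)  # keep samples ordered by learning value
        self.pool = self.pool[: self.capacity]

    def sample(self, k):
        n = len(self.pool)
        priorities = [(1.0 / (rank + 1)) ** self.alpha for rank in range(n)]  # p_j = 1 / rank(j)
        total = sum(priorities)
        probs = [p / total for p in priorities]                               # Equation (9)
        idxs = random.choices(range(n), weights=probs, k=k)
        weights = [(1.0 / (n * probs[j])) ** self.beta for j in idxs]         # Equation (10)
        max_w = max(weights)
        weights = [w / max_w for w in weights]            # normalize so weights <= 1 (common practice)
        return [self.pool[j][1] for j in idxs], weights
```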
At this point, the DSLD optimization model can search for the optimal hyperparameter configuration of the MTCN prediction model. For a given air conditioning ECP model, all internal parameters of the networks are initialized. The DSLD policy network selects appropriate actions (hyperparameter combinations) and sets them as the hyperparameter values of the MTCN model. The MTCN is trained on the training sample set and then validated on the validation samples to obtain its accuracy. The action and reward are combined into an experience sample for the DSLD model and stored in the sample pool according to the differential value rules. When the sample pool contains enough samples, data samples are drawn to update the internal parameters of the value network and the policy network. After multiple iterations, the RL-AOPP framework can select a hyperparameter configuration that achieves the highest prediction accuracy for the current task and output the prediction model under this configuration.
Algorithm 1 presents the complete training process of the RL-AOPP framework.
Algorithm 1 RL-AOPP process
Input: training dataset D_train, test dataset D_test, sample pool size N, number of samples K, learning rate η
Output: a set of optimal hyperparameters λ*
1: Initialize the actor network μ, the critic network Q and their weights, the target networks μ′ and Q′ and their weights, and the sample pool R
2: for each training episode do
3:   Initialize the policy
4:   for each time step t do
5:     if convergence is reached then
6:       break
7:     end if
8:     Select action A_t based on state s_t
9:     Execute A_t
10:    M_λ = TrainMTCN(A_t)
11:    Calculate the accuracy of M_λ on D_test with the MTCN
12:    Set the instant reward r_t to the accuracy and obtain state s_{t+1}
13:    Calculate the TD error δ via Equation (8)
14:    Store (s_t, A_t, r_t, s_{t+1}) sequentially into R based on the TD error
15:    Draw K samples via Equation (9)
16:    Calculate the weight w_j of each sample via Equation (10)
17:    Update μ with the policy gradient and Q by minimizing the loss
18:    Update the weights of the target networks μ′ and Q′
19:  end for
20: end for
21: function TrainMTCN(A)
22:   Initialize the MTCN parameters θ
23:   for each training epoch do
24:     for t = 1 to T do    ▹ T represents the length of D_train
25:       Extract the data of each dimension
26:       for l = 1 to L do    ▹ L represents the number of layers in the MTCN
27:         Use multidimensional dilated causal convolution to extract features
28:         if the dimensions match then
29:           Calculate the residual connection with the layer input
30:         end if
31:         Set the input for the next layer
32:       end for
33:       Use a fully connected layer to get the prediction ŷ
34:     end for
35:     Compute the loss L between ŷ and y
36:     Update θ by gradient descent with learning rate η
37:   end for
38:   return M_λ
39: end function
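For reference, the sketch below shows one generic dilated causal convolution layer with a residual connection, in the spirit of the inner loop of TrainMTCN. The channel counts, kernel size, and dilation schedule are illustrative assumptions rather than the MTCN's exact design.

```python
# Sketch of a dilated causal convolution layer with a residual connection.
import torch
import torch.nn as nn


class DilatedCausalBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation                 # left padding keeps the conv causal
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        # 1x1 convolution so the residual can be added when the dimensions do not match
        self.match = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                                       # x: (batch, channels, time)
        y = self.conv(nn.functional.pad(x, (self.pad, 0)))      # pad only on the left (causal)
        return torch.relu(y + self.match(x))                    # residual connection


# Example: a stack of layers with exponentially growing dilation, as in a TCN.
layers = nn.Sequential(*[DilatedCausalBlock(8 if l else 4, 8, dilation=2 ** l) for l in range(3)])
out = layers(torch.randn(1, 4, 24))                             # 24 time steps, 4 input dimensions
print(out.shape)                                                # torch.Size([1, 8, 24])
```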