Article

Entropy-Aware Model Initialization for Effective Exploration in Deep Reinforcement Learning

1
Intelligence Convergence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), Daejeon 34129, Korea
2
Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), Daejeon 34129, Korea
*
Author to whom correspondence should be addressed.
Sensors 2022, 22(15), 5845; https://doi.org/10.3390/s22155845
Submission received: 27 June 2022 / Revised: 25 July 2022 / Accepted: 2 August 2022 / Published: 4 August 2022
(This article belongs to the Special Issue Robust and Explainable Neural Intelligence)

Abstract

Effective exploration is one of the critical factors affecting performance in deep reinforcement learning. Agents acquire data to learn the optimal policy through exploration, and if it is not guaranteed, the data quality deteriorates, which leads to performance degradation. This study investigates the effect of initial entropy, which significantly influences exploration, especially in the early learning stage. The results of this study on tasks with discrete action space show that (1) low initial entropy increases the probability of learning failure, (2) the distributions of initial entropy for various tasks are biased towards low values that inhibit exploration, and (3) the initial entropy for discrete action space varies with both the initial weight and task, making it hard to control. We then devise a simple yet powerful learning strategy to deal with these limitations, namely, entropy-aware model initialization. The proposed algorithm aims to provide a model with high initial entropy to a deep reinforcement learning algorithm for effective exploration. Our experiments showed that the devised learning strategy significantly reduces learning failures and enhances performance, stability, and learning speed.

1. Introduction

Reinforcement learning is a commonly used optimization technique for solving sequential decision-making problems [1]. The adoption of deep learning in reinforcement learning (so-called deep reinforcement learning (DRL)) has shown successful performance even with high-dimensional observation and action spaces in fields such as robotic control [2,3,4,5,6], gaming [7,8,9], medical [10,11], and financial [12,13] applications. In the DRL framework, the exploration–exploitation trade-off is a crucial issue that affects the performance of the DRL algorithm [14]. Through exploitation, the agent tries to maximize the expected reward at the current moment, whereas exploration is required to maximize the long-term reward during training [15]. In other words, even if exploitation makes the best decision given the currently available information, the solution obtained by DRL will not be optimal without sufficient exploration. Therefore, several approaches to encourage exploration have been studied. Incorporating an entropy term into the reinforcement learning (RL) optimization problem is a representative approach to encourage exploration. The entropy term in the DRL framework represents the stochasticity of the action selection and is calculated from the output of the policy, i.e., the action selection probabilities. An evenly distributed output yields high entropy; conversely, a biased output yields low entropy. With a biased output, i.e., low entropy, there is a high probability that the agent does not perform various actions and instead repeats only certain actions, inhibiting exploration. Therefore, various studies encourage high entropy [16,17,18,19,20,21].
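To illustrate this relationship with a minimal sketch (not taken from the paper), the entropy of a near-uniform action distribution is close to its maximum, whereas a distribution concentrated on a single action has entropy near zero:

```python
import numpy as np

def action_entropy(p):
    """Entropy of an action-selection probability vector p."""
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p + 1e-12))

# Evenly distributed output over 4 actions: entropy near the maximum ln(4) ~ 1.386.
print(action_entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.386
# Output biased towards one action: entropy near zero, so the agent mostly repeats that action.
print(action_entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.168
```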
In [16], a proximal policy optimization (PPO) algorithm was proposed in which an entropy bonus term was added to ensure sufficient exploration, motivated by [22,23]. A soft actor–critic (SAC) DRL algorithm based on the maximum entropy RL framework was proposed in [17], where an entropy term was incorporated into the expected-reward objective to improve exploration by acquiring diverse behaviors. Ref. [21] also adopted the maximum entropy RL framework because it yields better performance and greater robustness. In addition, the authors in [24] proposed a maximum entropy-regularized multi-goal RL, where the entropy was combined with the multi-goal RL objective to encourage the agent to traverse diverse goal states. In [25], maximum entropy was introduced into a multi-agent RL algorithm to improve training efficiency and guarantee a stronger exploration capability. Furthermore, a soft policy gradient under the maximum entropy RL framework [26] was devised, and maximum entropy diverse exploration [27] was proposed for learning diverse behaviors. However, these approaches, which consider entropy along with other factors (e.g., reward) in the objective, make it difficult to handle low entropy at model initialization. In [20], the impact of entropy on policy optimization was extensively studied; the authors observed that a more stochastic policy (i.e., a policy with high entropy) improved the performance of the DRL. The authors in [28] analyzed the effect of experimental factors in the DRL framework, where the offset in the standard deviation of actions was reported as an important factor affecting DRL performance. These studies dealt with continuous control tasks, where the initial entropy can be easily controlled by adjusting the standard deviation. To the best of our knowledge, for discrete control tasks, there is neither any research reporting the effect of the initial entropy nor a learning strategy that exploits it. One reason may be the difficulty of controlling the entropy in discrete control tasks: the entropy in a discrete control task is determined by the action selection probabilities obtained through the rollout procedure, whereas, in a continuous control task, the standard deviation determines the entropy.
To address the abovementioned concerns, we have conducted experimental studies to investigate the effect of initial entropy, focusing on tasks with a discrete action space. Furthermore, based on the experimental observations, we have devised a learning strategy for DRL algorithms, namely entropy-aware model initialization. The contributions of this study can be summarized as follows:
  • We reveal a cause of frequent learning failures despite the ease of the tasks. Our investigations show that the model with low initial entropy significantly increases the probability of learning failures, and that the initial entropy is biased towards a low value for various tasks. Moreover, we observe that the initial entropy varies depending on the task and initial weight of the model. These dependencies make it difficult to control the initial entropy of the discrete control tasks;
  • We devise entropy-aware model initialization, a simple yet powerful learning strategy that exploits the effect of the initial entropy that we have analyzed. The devised learning strategy repeats the model initialization and entropy measurements until the initial entropy exceeds an entropy threshold. It can be used with any reinforcement learning algorithm because the proposed strategy just provides a well-initialized model to a DRL algorithm. The experimental results show that entropy-aware model initialization significantly reduces learning failures and improves performance, stability, and learning speed.
In Section 2, we present the results of the experimental study on the effect of the initial entropy on DRL performance with discrete control tasks. In Section 3, we describe the devised learning strategy, and discuss the experimental results in Section 4. Finally, we detail the conclusions in Section 5.

2. Effect of Initial Entropy in DRL

To investigate the effect of the initial entropy in the DRL framework, we adopted the policy gradient method (PPO [16]) implementation in RLlib [29]. The network architecture was set to be the same as in [16]. To initialize the network, we adopted the Glorot uniform initializer [30], which is the default initializer in TensorFlow [31] and in representative RL frameworks such as RLlib, TF-Agents [32], and OpenAI Baselines [33]. Unless otherwise stated, PPO and the Glorot uniform initializer are the default settings for the analyses. For this experimental study, we considered eight tasks (see Figure 1) with a discrete action space from the OpenAI Gym [34]. The eight tasks (Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, and Boxing) were selected to cover various action space sizes and task difficulties (easy and hard exploration), referring to [35]. Freeway is a game in which a chicken is moved across a freeway while avoiding oncoming traffic, with an action space size of 3. In Breakout, a paddle is moved to hit a moving ball and destroy a brick wall; its action space size is 4. Pong, with an action space size of 6, is played against the computer (left paddle) by controlling the right paddle to rally the ball, where the paddles move only vertically. Qbert, also with an action space size of 6, is a game in which the player hops around a pyramid of cubes to change the color of the top of each cube. Enduro is a racing game with an action space size of 9 that aims to pass an assigned number of cars each day. KungFuMaster, with an action space size of 14, is a game in which the player fights enemies encountered on the way to rescuing a princess. The two games with the largest action space (size 18) are Alien, in which the player destroys aliens' eggs while avoiding the aliens, and Boxing, in which rewards are earned by defeating the opponent in a boxing ring. As seen in Figure 1 and the descriptions above, the goals and rules of the eight tasks differ. The agent receives rewards according to each task's rules while achieving its goals; therefore, the reward range differs for each task. For example, the range of rewards an agent can acquire in Pong is from −21 to 21, whereas, in Qbert, it can receive from 0 to more than 15,000. Please refer to [36] for detailed explanations (e.g., description, action types, rewards, and observations) of each game.
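For readers who want to reproduce a comparable setup, the following is a minimal sketch of such a run using RLlib's 1.x-era PPOTrainer API; the environment id, config keys, and values are assumptions chosen to approximate the settings described in this section, not the authors' code:

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # RLlib 1.x-era API (assumed)

ray.init()

config = {
    "env": "PongNoFrameskip-v4",  # assumed Gym Atari id; the study covers eight such tasks
    "framework": "tf",            # TensorFlow, whose default initializer is Glorot uniform
    "num_workers": 16,            # 16 rollout workers, as in Section 2
    "train_batch_size": 2048,     # 2048 experiences collected per training iteration
    "num_sgd_iter": 6,            # six SGD epochs per iteration
    "lr": 2.5e-4,                 # learning rate 2.5 × 10^-4
    "seed": 0,                    # varied across runs to obtain differently initialized models
}

trainer = PPOTrainer(config=config)
for _ in range(3000):             # 3000 (or 5000) training iterations, depending on the task
    result = trainer.train()
```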
First, we investigated the effect of the initial entropy on performance (i.e., reward). We generated 50 differently initialized models and measured the rewards after 3000 training iterations for Freeway, Pong, KungFuMaster, and Boxing, and 5000 training iterations for Breakout, Qbert, Enduro, and Alien. For each iteration, 2048 experiences were collected with 16 workers, and six stochastic gradient descent (SGD) epochs were performed with a learning rate of 2.5 × 10^-4. Figure 2 shows the reward as a function of the initial entropy. We can see that the lower the initial entropy, the more frequent the learning failures (e.g., a reward of −21 for Pong, 0 for Breakout, and −100 for Boxing). The low initial entropy leads to learning failures by inhibiting exploration. Recall that the entropy measures the stochasticity of the action selection probability, and low entropy means the probability is biased towards a specific action. This causes the agent to perform that specific action at every step of the episode with high probability, and repeating the same action makes exploration difficult. This reminds us of the importance of exploration, particularly during the earlier training stage.
We then investigated the distribution of initial entropy. For this, we generated 1000 models with different random seeds for each of the eight tasks and measured the initial entropy values. Note that the maximum value of the initial entropy is determined by the action space size of the task, for example, 1.099, 1.386, 1.792, 2.197, and 2.890 for action space sizes of 3, 4, 6, 9, and 18, respectively, which are shown in parentheses in Figure 3. From Figure 3, we can see that the initial entropy is biased towards low values, even if the maximum initial entropy value is high, owing to the large action space size. The average initial entropy values were 0.114, 0.246, 0.189, 0.342, 0.636, 0.345, 0.694, and 0.273 for Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, and Boxing, respectively. We performed additional experiments to analyze this tendency on a different network initializer. Specifically, Figure 4 presents the results with an orthogonal initialization technique [37] instead of the Glorot uniform. Nevertheless, we can observe similar trends as in Figure 3. Our experimental findings (i.e., the high probability of learning failures for low initial entropy, and the low biased initial entropy) explain why DRL often fails for tasks with discrete action spaces and why the performance drastically varies for each experiment.
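These maximum values follow from the entropy of a uniform distribution over the action space A (a standard identity stated here for clarity):

$$ h_{\max} = -\sum_{a \in A} \frac{1}{|A|} \log \frac{1}{|A|} = \log |A|, \qquad \text{e.g., } \log 3 \approx 1.099 \text{ and } \log 18 \approx 2.890. $$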
Finally, we investigated the factors affecting the initial entropy. Table 1 and Table 2 show that both the task and the initial weights significantly affect the initial entropy. In Table 1 and Table 2, "seed" denotes the random seed used to initialize the neural network. For example, in the first row (Seed 01) of Table 1, the same network, i.e., the same initial weights, is used for measuring the values of Pong and Qbert; the same holds for Alien and Boxing. However, the initial weights of Qbert and Alien differ because their neural network structures differ. Note that the network structure varies with the size of the action space; for example, for action space sizes of 6 and 18, the network has 6 and 18 output nodes, respectively. We can see that the initial entropy varies with the task even with the same initial weights (e.g., Seed 02 for Alien and Boxing in Table 1). In addition, the initial entropy differs according to the initial weights of the model even for the same task (e.g., Seeds 03 and 04 for Alien in Table 1). This is because the input image, i.e., the observation, differs significantly for each task. These task and model-initialization dependencies make it difficult to control the initial entropy.
From the above observations, we conclude that DRL algorithms require models with high initial entropy for successful training, and we need a strategy to generate such models.

3. Entropy-Aware Model Initialization

In the previous section, we observed that (1) learning failure frequently occurs with the model with low initial entropy, (2) the initial entropy is biased towards a low value, and (3) even with the same network architecture, the initial entropy greatly varies based on the task and the initial weight of the models. Inspired by the above experimental observations, we propose an entropy-aware model initialization strategy. The learning strategy repeatedly initializes the model until its initial entropy value exceeds the entropy threshold. In other words, the proposed learning strategy encourages DRL algorithms such as PPO [16] to collect a variety of experiences at the initial stage by providing a model with high initial entropy.
Suppose that the task ($E$), the number of actors ($N$), the entropy threshold ($h_{th}$), the initializer ($K$), and the horizon ($T$) are given. First, we initialize the model ($\pi_i$) with $K$. Then, each $n$-th actor performs a rollout with the initialized model ($\pi_i$) for each time step $t \in \{1, \ldots, T\}$. Rollout here means that the agent interacts with the environment and thereby obtains data transitions (i.e., current state, action, reward, and next state) for training. Through the rollout, we store the action selection probabilities ($p_{\pi_i}^{(n,t)}$) for the entropy calculation. Note that the action selection probability over the set of actions in the action space $A$ (e.g., $A = \{\text{NOOP}, \text{FIRE}, \text{UP}\}$ in the case of Freeway, whose action space size is 3) is the softmax of the outputs of $\pi_i$. Then, we compute the entropy of the model ($\pi_i$) for each actor and time step as
$$ h_{\pi_i}(n,t) = -\sum_{a \in A} p_{\pi_i}^{(n,t)}(a) \log p_{\pi_i}^{(n,t)}(a). $$
Next, the mean entropy ($\hat{h}_{\pi_i}$) over all action selection probabilities collected from the $N$ actors over the horizon $T$ is computed, which is defined by
$$ \hat{h}_{\pi_i} = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} h_{\pi_i}(n,t). $$
The mean entropy is compared with the predefined entropy threshold ($h_{th}$). If the mean entropy $\hat{h}_{\pi_i}$ is larger than $h_{th}$, we terminate the entropy-aware model initialization and output the initialized model ($\pi_{init}$) to the DRL algorithm, such as PPO. Otherwise, we set the random seed to a different value and repeat the initialization process until $\hat{h}_{\pi_i}$ exceeds $h_{th}$. The entire entropy-aware model initialization process is summarized in Algorithm 1. Through this learning strategy, the DRL algorithm reduces the probability of learning failure and achieves improved performance and fast convergence to a higher reward (refer to Section 4).
Algorithm 1: Entropy-aware model initialization.
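The following Python sketch illustrates the procedure summarized in Algorithm 1; the helper functions make_model and rollout are hypothetical placeholders for the initializer K and the environment interaction (they are not the authors' implementation), and the max_tries cap is a safeguard added only for this sketch:

```python
import numpy as np

def entropy_aware_init(make_model, rollout, n_actors, horizon, h_th, max_tries=100):
    """Repeatedly initialize a policy until its mean initial entropy exceeds h_th.

    make_model(seed) -> freshly initialized policy pi_i (assumed helper wrapping initializer K)
    rollout(model, n_actors, horizon) -> action-selection probabilities of shape
        (n_actors, horizon, |A|) collected by running the policy in the task (assumed helper)
    """
    for seed in range(max_tries):
        model = make_model(seed)                   # initialize pi_i with a new random seed
        probs = rollout(model, n_actors, horizon)  # p_{pi_i}^{(n,t)}(a) for all n, t
        # Per-step entropy h_{pi_i}(n, t) = -sum_a p * log(p)
        h = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
        h_mean = h.mean()                          # average over the N actors and T steps
        if h_mean > h_th:                          # accept the model once entropy is high enough
            return model
    raise RuntimeError("no initialization exceeded the entropy threshold")
```

In the experiments of Section 4, the threshold is set to 0.5, and the accepted model is handed unchanged to a DRL algorithm such as PPO or A2C.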

4. Experimental Results

In this section, we validate the effectiveness of the proposed learning strategy. For this, we used the experimental settings and tasks described in Section 2. In this experiment, we set the entropy threshold ($h_{th}$) to 0.5.
To validate the effect of the proposed entropy-aware model initialization, we considered 50 models initialized with different random seeds for each task. Figure 5 shows the rewards over the training iterations for the eight tasks. In this figure, the red line represents the result for the conventional DRL (without the entropy-aware model initialization), denoted as "Default", and the blue line denotes the result for the proposed entropy-aware model initialization, denoted as "Proposed". We observed that the DRL with the proposed learning strategy outperformed the conventional DRL on all tasks in five aspects. (1) It restrains learning failures: the numbers of learning failures for "Proposed" are 6, 0, 10, 0, 25, 2, 0, and 0, whereas those for "Default" are 25, 15, 35, 9, 29, 28, 4, and 0, for Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, and Boxing, respectively. (2) It enhances the performance (i.e., the average reward in Table 3) by factors of 1.66 for Freeway, 2.22 for Breakout, 2.35 for Pong, 1.39 for Qbert, 1.41 for Enduro, 2.15 for KungFuMaster, 1.34 for Alien, and 2.17 for Boxing. (3) It reduces the performance variation (i.e., the STD of the reward in Table 3) by 34.22% for Freeway, 29.75% for Breakout, 25.37% for Pong, 65.02% for Qbert, 25.44% for Enduro, 44.63% for KungFuMaster, 53.12% for Alien, and 55.60% for Boxing. (4) It enhances the minimum and maximum rewards, as can be seen in Table 3. (5) It enhances the learning speed, as can be seen from the slopes of the curves in Figure 5. Figure 6 shows the 50 individual learning curves for the above experiments. From the figure, we can easily observe that, with the proposed method, more learning curves are biased towards high rewards, and fewer learning failures occur compared with the default.
Furthermore, for a more thorough analysis, we conducted the experiments with the advantage actor–critic (A2C) algorithm [23] instead of PPO. The A2C results corresponding to the PPO results in Figure 5, Figure 6, and Table 3 are shown in Figure 7, Figure 8, and Table 4, respectively. We can observe the same phenomena and therefore infer that the proposed algorithm can benefit other DRL algorithms as well.
Table 5 shows the overhead of the entropy-aware model initialization in terms of the average number of initializations and the average time spent repeating the initialization until the initial entropy exceeds the entropy threshold. For the 3000 and 5000 training iterations, the average training times were measured as 4792.75 and 8145.01 s, respectively. We can observe that the time overhead of the proposed strategy is negligible compared with the training times. Moreover, the overhead ratio of the repeated initialization decreased as the task became more complex, because the training time increased with task complexity while the overhead of the proposed method is primarily affected by the action space size and the initial entropy distribution, not by the complexity of the task.
Figure 9 presents the number of initializations (solid line) and the time taken (dashed line) for repeated initialization for different entropy thresholds ($h_{th}$). The vertical line in the graph corresponds to $h_{th}$ = 0.5. From Figure 9, we can observe that the time overhead increases with the entropy threshold; however, the extent of the increase differs for each task, the reasons being that (1) tasks with different action space sizes have different maximum initial entropy values, and (2) different tasks have different initial entropy distributions, as shown in Figure 3 in Section 2. In other words, the maximum initial entropy value determines the maximum usable value of $h_{th}$, and the lower the average initial entropy, the faster the overhead increases. For example, the average initial entropy values of KungFuMaster and Boxing were 0.345 and 0.273, respectively, whereas those of Enduro and Alien were 0.636 and 0.694, respectively. According to Figure 9, the tasks with a low average initial entropy (e.g., KungFuMaster) incurred a large overhead as the threshold increased. Based on the results in Figure 2 and Figure 9, we set the entropy threshold to 0.5, since the primary purpose of this study is to analyze the effect of initial entropy in DRL and to propose a task-independent solution, i.e., entropy-aware model initialization. This value effectively restrains learning failures for tasks with large action space sizes or relatively high initial entropy distributions (e.g., Alien and Boxing) while not incurring much overhead for tasks with small action space sizes or initial entropy distributions biased towards low values (e.g., Freeway and KungFuMaster).

5. Conclusions

In this study, we conducted experiments to investigate the effect of the initial entropy in the DRL framework, focusing on tasks with discrete action spaces. The critical observation is that models with low initial entropy lead to frequent learning failures, even on easy tasks, and that the initial entropy values are biased towards low values. Moreover, through experiments on various tasks, we observed that the initial entropy varies significantly depending on the task and the initial model weights. Inspired by these observations, we devised a learning strategy called entropy-aware model initialization, which repeatedly initializes the model and measures its entropy until the initial entropy exceeds a certain threshold. Its purpose is to reduce learning failures and performance variation and to improve the performance and learning speed of a DRL algorithm by providing it with a well-initialized model. Furthermore, it is practical because it is easy to implement and can be applied alongside various DRL algorithms without modifying them.
We believe this research can benefit various fields, since many applications involve discrete control; examples include drone control [5], recommender systems [38], and landmark detection in medical CT scans [10]. Moreover, Ref. [39] suggested that discretizing continuous control tasks may improve performance.
A promising research direction is to develop neural network initialization techniques tailored to deep reinforcement learning with discrete action spaces. Although many studies have proposed initialization techniques for effective deep learning, such as the Glorot uniform and orthogonal initializers, few studies address initialization techniques for effective deep reinforcement learning. As observed in this paper, the network's initial state greatly impacts the algorithm's performance.

Author Contributions

Conceptualization, S.J. and H.-I.K.; methodology, S.J.; software, S.J.; validation, S.J. and H.-I.K.; formal analysis, S.J. and H.-I.K.; investigation, S.J.; resources, S.J. and H.-I.K.; data curation, S.J. and H.-I.K.; writing—original draft preparation, S.J. and H.-I.K.; writing—review and editing, S.J. and H.-I.K.; visualization, S.J. and H.-I.K.; supervision, S.J. and H.-I.K.; project administration, S.J. and H.-I.K.; funding acquisition, S.J. and H.-I.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by an Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korea government (22ZR1100); and an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00124).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef] [Green Version]
  2. Yang, Z.; Merrick, K.; Jin, L.; Abbass, H.A. Hierarchical Deep Reinforcement Learning for Continuous Action Control. IEEE Trans. Neural Networks Learn. Syst. 2018, 29, 5174–5184. [Google Scholar] [CrossRef] [PubMed]
  3. Haarnoja, T.; Pong, V.; Zhou, A.; Dalal, M.; Abbeel, P.; Levine, S. Composable Deep Reinforcement Learning for Robotic Manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6244–6251. [Google Scholar]
  4. Lathuilière, S.; Massé, B.; Mesejo, P.; Horaud, R. Neural network based reinforcement learning for audio–visual gaze control in human–robot interaction. Pattern Recognit. Lett. 2019, 118, 61–71. [Google Scholar] [CrossRef] [Green Version]
  5. Jang, S.; Choi, C. Prioritized Environment Configuration for Drone Control with Deep Reinforcement Learning. Hum. Centric Comput. Inf. Sci. 2022, 12, 1–16. [Google Scholar]
  6. Zhang, Q.; Ma, X.; Yang, Y.; Li, C.; Yang, J.; Liu, Y.; Liang, B. Learning to Discover Task-Relevant Features for Interpretable Reinforcement Learning. IEEE Robot. Autom. Lett. 2021, 6, 6601–6607. [Google Scholar] [CrossRef]
  7. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the Game of Go without Human Knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  8. Patel, D.; Hazan, H.; Saunders, D.J.; Siegelmann, H.T.; Kozma, R. Improved Robustness of Reinforcement Learning Policies upon Conversion to Spiking Neuronal Network Platforms Applied to Atari Breakout Game. Neural Netw. 2019, 120, 108–115. [Google Scholar] [CrossRef] [PubMed]
  9. Nicholaus, I.T.; Kang, D.K. Robust experience replay sampling for multi-agent reinforcement learning. Pattern Recognit. Lett. 2021, 155, 135–142. [Google Scholar] [CrossRef]
  10. Ghesu, F.C.; Georgescu, B.; Zheng, Y.; Grbic, S.; Maier, A.; Hornegger, J.; Comaniciu, D. Multi-scale Deep Reinforcement Learning for Real-time 3D-landmark Detection in CT Scans. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 41, 176–189. [Google Scholar] [CrossRef] [PubMed]
  11. Raghu, A.; Komorowski, M.; Celi, L.A.; Szolovits, P.; Ghassemi, M. Continuous state-space models for optimal sepsis treatment: A deep reinforcement learning approach. In Proceedings of the Machine Learning for Healthcare Conference, Boston, MA, USA, 18–19 August 2017; pp. 147–163. [Google Scholar]
  12. Zarkias, K.S.; Passalis, N.; Tsantekidis, A.; Tefas, A. Deep Reinforcement Learning for Financial Trading using Price Trailing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3067–3071. [Google Scholar]
  13. Tsantekidis, A.; Passalis, N.; Tefas, A. Diversity-driven Knowledge Distillation for Financial Trading using Deep Reinforcement Learning. Neural Netw. 2021, 140, 193–202. [Google Scholar] [CrossRef]
  14. Ishii, S.; Yoshida, W.; Yoshimoto, J. Control of Exploitation–Exploration Meta-parameter in Reinforcement Learning. Neural Netw. 2002, 15, 665–687. [Google Scholar] [CrossRef] [Green Version]
  15. Sun, S.; Wang, H.; Zhang, H.; Li, M.; Xiang, M.; Luo, C.; Ren, P. Underwater Image Enhancement with Reinforcement Learning. IEEE J. Ocean. Eng. 2022, 1–13. [Google Scholar] [CrossRef]
  16. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  17. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-critic: Off-policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  18. Seo, Y.; Chen, L.; Shin, J.; Lee, H.; Abbeel, P.; Lee, K. State Entropy Maximization with Random Encoders for Efficient Exploration. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 9443–9454. [Google Scholar]
  19. Zhang, Y.; Vuong, Q.H.; Song, K.; Gong, X.Y.; Ross, K.W. Efficient Entropy for Policy Gradient with Multidimensional Action Space. arXiv 2018, arXiv:1806.00589. [Google Scholar]
  20. Ahmed, Z.; Le Roux, N.; Norouzi, M.; Schuurmans, D. Understanding the Impact of Entropy on Policy Optimization. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 151–160. [Google Scholar]
  21. Chen, J.; Li, S.E.; Tomizuka, M. Interpretable End-to-End Urban Autonomous Driving with Latent Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5068–5078. [Google Scholar] [CrossRef]
  22. Williams, R.J. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef] [Green Version]
  23. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  24. Zhao, R.; Sun, X.; Tresp, V. Maximum Entropy-regularized Multi-goal Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 7553–7562. [Google Scholar]
  25. Wang, Z.; Zhang, Y.; Yin, C.; Huang, Z. Multi-agent Deep Reinforcement Learning based on Maximum Entropy. In Proceedings of the IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 18–20 June 2021; Volume 4, pp. 1402–1406. [Google Scholar]
  26. Shi, W.; Song, S.; Wu, C. Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning. arXiv 2019, arXiv:1909.03198. [Google Scholar]
  27. Cohen, A.; Yu, L.; Qiao, X.; Tong, X. Maximum Entropy Diverse Exploration: Disentangling Maximum Entropy Reinforcement Learning. arXiv 2019, arXiv:1911.00828. [Google Scholar]
  28. Andrychowicz, M.; Raichuk, A.; Stańczyk, P.; Orsini, M.; Girgin, S.; Marinier, R.; Hussenot, L.; Geist, M.; Pietquin, O.; Michalski, M.; et al. What Matters for On-policy Deep Actor-critic Methods? A Large-scale Study. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  29. Liang, E.; Liaw, R.; Nishihara, R.; Moritz, P.; Fox, R.; Goldberg, K.; Gonzalez, J.; Jordan, M.; Stoica, I. RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 3053–3062. [Google Scholar]
  30. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  31. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 20 July 2022).
  32. Guadarrama, S.; Korattikara, A.; Ramirez, O.; Castro, P.; Holly, E.; Fishman, S.; Wang, K.; Gonina, E.; Wu, N.; Kokiopoulou, E.; et al. TF-Agents: A library for Reinforcement Learning in TensorFlow. 2018. Available online: https://github.com/tensorflow/agents (accessed on 20 July 2022).
  33. Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; Wu, Y.; Zhokhov, P. OpenAI Baselines. 2017. Available online: https://github.com/openai/baselines (accessed on 20 July 2022).
  34. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
  35. Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  36. Gym Documentation. 2022. Available online: https://www.gymlibrary.ml/ (accessed on 20 July 2022).
  37. Saxe, A.M.; McClelland, J.L.; Ganguli, S. Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  38. Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; Coppin, B. Deep Reinforcement Learning in Large Discrete Action Spaces. arXiv 2015, arXiv:1512.07679. [Google Scholar]
  39. Tang, Y.; Agrawal, S. Discretizing Continuous Action Space for On-Policy Optimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5981–5988. [Google Scholar]
Figure 1. Examples of the Atari games (with action space sizes of 3, 4, 6, 6, 9, 14, 18, and 18 for Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, and Boxing, respectively) used for the experimental study. (a) Freeway; (b) Breakout; (c) Pong; (d) Qbert; (e) Enduro; (f) KungFuMaster; (g) Alien; (h) Boxing.
Figure 2. Reward depending on the initial entropy for 8 tasks, where 50 models for each task were generated to investigate the effect of the initial entropy on the performance.
Figure 3. The histograms of the initial entropy for eight tasks. For each task, 1000 models were generated using the Glorot uniform initializer with different random seeds.
Figure 4. The histograms of the initial entropy for eight tasks. For each task, 1000 models were generated using an orthogonal initializer with different random seeds.
Figure 5. Comparison of the entropy-aware model initialization-based PPO (PPO-Proposed) with the conventional PPO (PPO-Default) for eight tasks.
Figure 6. Learning curves for 50 individual experiments of (a) the conventional PPO and (b) the proposed entropy-aware model initialization-based PPO for 8 tasks. (a) default; (b) proposed.
Figure 7. Comparison of the entropy-aware model initialization-based A2C (A2C-Proposed) with the conventional A2C (A2C-Default) for four tasks.
Figure 8. Learning curves for 30 individual experiments of (a) the conventional A2C and (b) the proposed entropy-aware model initialization-based A2C for four tasks. (a) default; (b) proposed.
Figure 9. The number (solid line) and time (dashed line) of initializations performed by the entropy-aware model initialization for different entropy thresholds ($h_{th}$).
Table 1. Initial entropy of the (Pong, Qbert) pair with action space size 6 and the (Alien, Boxing) pair with action space size 18 under different random seeds, where "STD" denotes the standard deviation of the initial entropy values over the 10 random seeds. The action space size is given in parentheses after each task.

Task     | Pong (6)      | Qbert (6)     | Alien (18)    | Boxing (18)
Seed 01  | 1.48 × 10^-3  | 4.74 × 10^-1  | 3.31 × 10^-1  | 2.55 × 10^-3
Seed 02  | 9.68 × 10^-4  | 8.61 × 10^-1  | 8.85 × 10^-2  | 8.05 × 10^-9
Seed 03  | 9.76 × 10^-1  | 2.70 × 10^-1  | 9.13 × 10^-4  | 1.07 × 10^-6
Seed 04  | 7.20 × 10^-4  | 2.23 × 10^-2  | 1.33          | 2.05 × 10^-1
Seed 05  | 8.04 × 10^-1  | 1.58 × 10^-1  | 2.25 × 10^-1  | 6.35 × 10^-1
Seed 06  | 8.98 × 10^-5  | 4.68 × 10^-1  | 2.76 × 10^-1  | 2.11 × 10^-1
Seed 07  | 5.64 × 10^-1  | 1.58 × 10^-1  | 6.05 × 10^-1  | 4.39 × 10^-1
Seed 08  | 1.18 × 10^-1  | 3.42 × 10^-1  | 7.95 × 10^-1  | 2.88 × 10^-2
Seed 09  | 5.73 × 10^-1  | 4.34 × 10^-1  | 7.28 × 10^-2  | 2.30 × 10^-1
Seed 10  | 1.73 × 10^-3  | 2.79 × 10^-1  | 7.91 × 10^-1  | 3.79 × 10^-2
STD      | 3.85 × 10^-1  | 2.33 × 10^-1  | 4.22 × 10^-1  | 2.15 × 10^-1
Table 2. Initial entropy of Freeway, Breakout, Enduro, and KungFuMaster under different random seeds, where "STD" denotes the standard deviation of the initial entropy values over the 10 random seeds. The action space size is given in parentheses after each task.

Task     | Freeway (3)    | Breakout (4)   | Enduro (9)    | KungFuMaster (14)
Seed 01  | 3.78 × 10^-1   | 2.95 × 10^-1   | 1.22          | 5.55 × 10^-4
Seed 02  | 9.93 × 10^-14  | 5.37 × 10^-12  | 5.52 × 10^-1  | 4.30 × 10^-8
Seed 03  | 2.65 × 10^-1   | 3.38 × 10^-1   | 1.52          | 4.31 × 10^-2
Seed 04  | 9.27 × 10^-1   | 8.27 × 10^-5   | 2.05 × 10^-3  | 1.63 × 10^-1
Seed 05  | 9.79 × 10^-5   | 6.65 × 10^-10  | 9.87 × 10^-1  | 3.01 × 10^-1
Seed 06  | 2.18 × 10^-1   | 9.47 × 10^-2   | 1.12          | 6.29 × 10^-2
Seed 07  | 3.75 × 10^-2   | 7.13 × 10^-2   | 1.58          | 1.86 × 10^-4
Seed 08  | 1.23 × 10^-1   | 7.73 × 10^-1   | 4.63 × 10^-1  | 6.79 × 10^-8
Seed 09  | 2.89 × 10^-3   | 2.18 × 10^-2   | 8.16 × 10^-1  | 6.37 × 10^-1
Seed 10  | 7.91 × 10^-4   | 5.79 × 10^-1   | 6.18 × 10^-1  | 8.08 × 10^-2
STD      | 2.90 × 10^-1   | 2.74 × 10^-1   | 4.96 × 10^-1  | 2.03 × 10^-1
Table 3. Statistical results for the experimentation of the entropy-aware model initialization-based PPO and the conventional PPO.

Task          | Method    | Avg. Reward | STD of Reward | Min Reward | Max Reward
Freeway       | Default   | 11.067      | 11.369        | 0          | 31.04
Freeway       | Proposed  | 18.376      | 7.479         | 0          | 31.55
Breakout      | Default   | 81.847      | 97.855        | 0          | 239.27
Breakout      | Proposed  | 181.905     | 68.739        | 2          | 348.67
Pong          | Default   | −11.736     | 16.507        | −21        | 20.82
Pong          | Proposed  | 4.119       | 12.319        | −21        | 20.86
Qbert         | Default   | 9141.865    | 5913.837      | 0          | 14,994.75
Qbert         | Proposed  | 12,671.130  | 2068.368      | 125        | 15,605.00
Enduro        | Default   | 74.247      | 97.230        | 0          | 283.69
Enduro        | Proposed  | 104.804     | 72.493        | 0          | 326.18
KungFuMaster  | Default   | 6926.000    | 8241.017      | 0          | 23,356.00
KungFuMaster  | Proposed  | 14,896.011  | 4562.688      | 0          | 34,334.00
Alien         | Default   | 854.550     | 498.047       | 0          | 1665.00
Alien         | Proposed  | 1148.814    | 233.470       | 693.60     | 1665.30
Boxing        | Default   | −36.100     | 41.182        | −99.94     | 36.55
Boxing        | Proposed  | 6.113       | 18.284        | −99.88     | 42.10
Table 4. Statistical results for the experimentation of the entropy-aware model initialization-based A2C and the conventional A2C.

Task      | Method    | Avg. Reward | STD of Reward | Min Reward | Max Reward
Freeway   | Default   | 29.839      | 4.710         | 18.06      | 33.41
Freeway   | Proposed  | 31.199      | 3.094         | 19.59      | 33.59
Breakout  | Default   | 198.892     | 131.255       | 31.00      | 398.53
Breakout  | Proposed  | 287.870     | 106.686       | 45.36      | 412.72
Enduro    | Default   | 141.083     | 115.656       | 0          | 328.90
Enduro    | Proposed  | 285.711     | 79.364        | 78.26      | 432.87
Boxing    | Default   | 25.184      | 30.662        | −7.51      | 90.07
Boxing    | Proposed  | 78.129      | 15.385        | 48.09      | 99.39
Table 5. The average number and time for initialization, and the overhead ratio to the total training time, for the proposed entropy-aware model initialization.

Task          | Average Number of Initializations (#) | Average Time for Initialization (s) | Time Overhead (%)
Freeway       | 9.86  | 119.993 | 4.000
Breakout      | 5.30  | 72.544  | 1.451
Pong          | 5.62  | 77.335  | 2.578
Qbert         | 3.94  | 54.540  | 1.091
Enduro        | 1.60  | 20.516  | 0.410
KungFuMaster  | 4.10  | 52.141  | 1.738
Alien         | 1.84  | 24.536  | 0.491
Boxing        | 3.86  | 53.115  | 1.771
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


