Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning
Round 1
Reviewer 1 Report
Comments:
This paper addresses the problem of sparse rewards in reinforcement learning. It focuses on a current hot issue and combines it with novel technical means in reinforcement learning. The paper encourages agents to explore deeply to discover more possibilities and introduces a long-term visitation count mechanism to achieve this purpose. However, there are some defects in this paper.
My specific comments are as follows.
- The proposed method uses a long-term visitation count to plan future exploration actions. How is this realized? Is a separate function used to evaluate the value of exploration actions?
- How is the decoupling of exploration and exploitation realized? And what is the separate function? The paper highlights the separate function as one of its contributions, but it lacks a specific description of it. It is recommended to add a description of the principle.
- Notice that the experiments in the text are on grid worlds, and the discrete actions are finite and limited to four: up, down, left, right. Does this mean that the proposed method has some limitations? Can it be extended to more environments and to continuous action spaces?
- Section 2 introduces some relevant concepts and the basic work that has been done. However, it is somewhat redundant. In particular, Thompson sampling is a well-known basic technique and does not need to be covered in detail. Streamlining this section is suggested.
- There are some language, formatting, and spelling errors. The authors are advised to check and proofread carefully; there is still much room for improvement in the language.
- In Figure 21, the bigger the value of the W-function discount, the richer the color differences across the grid cells. It is recommended that the authors provide a detailed explanation.
- In Figure 22, there are some errors and inconsistencies. The green and pink curves are dotted lines, but the legend shows solid lines. The figures need to be checked and proofread carefully.
Author Response
We thank the Reviewer for their comments. Below we address their questions/suggestions.
1. It is done by decoupling exploration and exploitation using a second value function that takes only intrinsic rewards based on visitation counts. Since a value function, by definition, considers the long-term expected sum of rewards, in this way we account for long-term exploration. This is discussed in detail in Section 3.
2. The decoupling is performed using a second value function: the first (the classic Q-function) takes only extrinsic rewards (which are sparse), while the second takes only intrinsic rewards (hence it is an exploration value) and is formally described in detail in Section 3. We propose two versions, defined in Eqs. (7) and (11). A minimal code sketch of this decoupling is given below.
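To make the idea concrete, here is a minimal tabular sketch of the decoupling. The intrinsic reward 1/sqrt(n(s,a)) below is only an illustrative placeholder (the actual W-function rewards are defined in Eqs. (7) and (11) of the paper), and all variable names are ours, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the decoupling (tabular setting). The intrinsic reward used
# here, 1/sqrt(n(s,a)), is illustrative only; the paper's W-function rewards are
# those of Eqs. (7) and (11).

n_states, n_actions = 50, 4
alpha, gamma_q, gamma_w = 0.5, 0.99, 0.9   # gamma_w is the separate exploration discount

Q = np.zeros((n_states, n_actions))        # exploitation value: extrinsic (sparse) rewards only
W = np.zeros((n_states, n_actions))        # exploration value: intrinsic rewards only
N = np.zeros((n_states, n_actions))        # state-action visitation counts

def td_update(s, a, r_ext, s_next):
    """One TD step for both value functions after executing action a in state s."""
    N[s, a] += 1
    r_int = 1.0 / np.sqrt(N[s, a])         # illustrative visitation-based intrinsic reward

    # Q-function learns from extrinsic rewards only
    Q[s, a] += alpha * (r_ext + gamma_q * Q[s_next].max() - Q[s, a])
    # W-function learns from intrinsic rewards only, with its own discount gamma_w
    W[s, a] += alpha * (r_int + gamma_w * W[s_next].max() - W[s, a])
```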
3. Our method is currently tied to state/action counts. However, it can be extended to continuous states/actions using pseudocounts such as #Exploration by Haoran Tang et al. ("#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning", NeurIPS 2017). We have added a paragraph under 'Final Remarks' to discuss this; a sketch of the idea is given below.
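As an illustration of how the counts could be generalized, the following sketch discretizes continuous states with a random-projection hash in the spirit of Tang et al. (2017). The code length, projection matrix, and function names are our illustrative choices, not part of the paper.

```python
import numpy as np

# Illustrative hash-based pseudocounts for continuous states, in the spirit of
# SimHash / #Exploration (Tang et al., 2017). The code length k and the random
# projection A are illustrative choices.

k, state_dim = 16, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((k, state_dim))    # fixed random projection
counts = {}

def pseudo_count(state):
    """Map a continuous state to a binary code and count how often the code was seen."""
    code = tuple((A @ np.asarray(state) > 0).astype(int))
    counts[code] = counts.get(code, 0) + 1
    return counts[code]
```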
4. We believe that Thompson sampling may not be very familiar to everyone in the RL community. Furthermore, the body of work on Thompson sampling is very large and covers different methods. We believe that all the methods discussed in our related work section are worthy of mention, especially considering that there is no page limit for the paper.
6. We have added a comment to Section 4.2.2 discussing the plot in more detail.
7. We apologize for any misunderstanding regarding Figure 22. Some lines are dotted because they overlap (e.g., the green line overlaps with the black line denoting N^2.7), which would otherwise make the plot harder to read. The symbols in the legend are too small to show the breaks of the dotted lines.
We have now mentioned this in the caption.
Please also notice that the colors are unambiguous, i.e., no two lines share the same color.
Reviewer 2 Report
This manuscript is well written. The research motivation, methodology, and evaluations are well explained. The authors first explain the background and then discuss how sparse and distracting rewards cause problems in the search for optimal solutions in RL.
This paper only addresses tabular Q-learning. It is still unknown how the proposed method can scale up to solve complicated problems where the state representation space is large and continuous, e.g., robot arm control or navigation robot control. Despite this limitation, the manuscript is still worthy of publication, as it gives researchers good background and inspiration for understanding and resolving the sparse and distracting reward problems in RL.
This manuscript has been on the Internet for a while, e.g., on arXiv (Jan. 2020) and ResearchGate (Dec. 2019). Google Scholar shows that it has been cited by at least three other papers, all working on deep RL. If possible, could the authors take a look at how other DRL researchers comment on their work? This may help clarify how this work contributes to the deep RL research community.
Author Response
We thank the Reviewer for their positive comment and suggestions.
We have added a paragraph under 'Final Remarks' to discuss possible limitations with continuous states/actions.
The three papers cite us as an example of a sparse-reward exploration method in RL, to highlight that the long-term effect of exploration must be taken into account.
They do not compare to us, as the scope of their work is either different, or they use continuous states/actions (e.g., PPO).
Reviewer 3 Report
The paper is well written and the contribution is clear, in my opinion. The work is ample, and the introductory review of the state of the art is valuable in ensuring readers a smooth transition to the problem formulation and solution. Examples are valuable in clarifying the main points and are therefore highly appreciated. The authors provided the source code for results verification, which is another plus.
However, there are possibilities to improve the presentation and motivation of this work.
1. Please define acronyms, such as "UCB1" on their first occurrence, or provide a smoother introduction.
2. In Example 3.1, starting at line 266, it seems that in state (1,3) the agent makes 3 x right, 2 x down, 1 x up, and 1 x left moves, 7 moves in total, which makes it difficult to motivate the coloring scheme of Fig. 3a: for example, dark blue does not appear, although the 'up' action was performed once. Could you check?
3. In the same Example 3.1, the derivation of the third greedy policy is not clear. If it is based on the already converged optimal Q-function, how is the intrinsic reward r^+_t from (6) incorporated once more? More details would help the readers.
4. Please check for minor typing errors such as "for improving exploration 1with sparse and distracting rewards". They may appear throughout the entire manuscript, which is quite long.
5. Some disadvantages of current solutions to the exploration-exploitation dilemma are mainly related to tabular environments, which fail to generalize to continuous state-action spaces. The proposed long-term visitation value for deep exploration seems to suffer the same drawback, since the visitation reward r^W depends on the state-action count. What would be the means to extend the approach to more generalizable continuous state-action spaces? Could the authors add a discussion about this?
6. There are many parameters to tune with the learning algorithm. How to do this optimally? Please give some rules or heuristics to make the approach more attractive.
The authors' revision is appreciated and will hopefully improve the presentation.
Author Response
We thank the Reviewer for their positive comment and careful review, which helped improve the manuscript. Below are our replies to their questions and suggestions.
1. UCB1 is mentioned in line 187 for the first time as simply "upper confidence bound (UCB)". We have fixed the missing "1".
2. We apologize for any misunderstanding regarding Figure 3a. The figure and the action description are correct. State (1,3) is the top-left corner, and 'up' is cyan (not dark blue) because it was performed only once. Dark blue corresponds to an action count of 0.
3. The 'greedy' policy in Figure 3b does not use r^+. 'Greedy w/ bonus' does, as it uses Eq. (6). The two policies are indeed different: 'greedy' goes to the only reward discovered so far, while 'greedy w/ bonus' prefers actions that have not been executed yet. A small sketch contrasting the two is given below.
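For clarity, here is a minimal sketch of the contrast. The bonus form and the point where it is applied are illustrative only; the actual intrinsic reward r^+ is defined by Eq. (6) in the paper, and the function names are ours.

```python
import numpy as np

# Illustrative contrast between the two policies of Figure 3b. The count-based
# bonus below is a placeholder for the intrinsic reward defined in Eq. (6).

def greedy(Q, s):
    """Exploits only the (sparse) rewards discovered so far."""
    return int(np.argmax(Q[s]))

def greedy_with_bonus(Q, N, s, beta=1.0):
    """Prefers actions that have been executed less often in state s."""
    bonus = beta / np.sqrt(N[s] + 1)       # N[s] holds per-action visit counts
    return int(np.argmax(Q[s] + bonus))
```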
4. Thank you, we have fixed that and double-checked the whole manuscript.
5. We have added a paragraph in 'Final Remarks' regarding this limitation.
6. The parameters to tune are the same as for any RL algorithm (epsilon, learning rate, and discount factor). We only introduce a second discount factor for the intrinsic reward, and we already report an ablation study on it in Section 4.2.2. We have added a discussion at its end giving some guidelines on how to tune it.
Round 2
Reviewer 1 Report
Comments:
The authors have partially addressed my concerns, but there are still other critical issues.
- The method proposed in this paper seems to be only an improvement of the reward function. A long-term visitation count is used as a second value function to weight the intrinsic rewards and decouple exploration and exploitation. The innovation is limited.
- Please comment on Major Contribution of the Paper, Organization and Style, Technical Accuracy, Presentation and Adequacy of Citations (List possible additions if needed).
- Sparse rewards in reinforcement learning have already been addressed via intrinsic rewards by some researchers in previous works. The authors should clarify the originality of this work with a deeper investigation of the problem.
- In Figure 22, the red line has a standard deviation band, but the other lines do not. They should be consistent (all with or all without standard deviation, and all solid or all dotted lines). Please explain and modify it.
Author Response
We thank the Reviewer for their insightful comments.
We want to clarify that our method is not just an improvement of the reward function. We propose the fundamental idea of decoupling exploration and exploitation by learning two value functions. This is a very novel approach in RL, and it is crucial for the following reasons.
* We can control exploration properly. If we simply 'injected' our intrinsic reward as classic algorithms do (e.g., the "Expl. Bonus" baseline in our plots), we would not be able to easily retrieve the greedy Q-function, since it would have been trained with the sum of intrinsic (exploration) and extrinsic (exploitation) rewards. Furthermore, we can better tune exploration using a different discount factor (\gamma_w).
This is discussed in Section 3.4 under "W-function vs auxiliary rewards."
* We can mitigate the overestimation bias of TD learning due to the max operator in the Bellman equation. Overestimating the W-function, in fact, is not as bad as overestimating the Q-function. The former would lead to just more exploration (possibly inefficient), but the latter would lead to premature convergence to local optima.
This is already discussed in Section 3.2 under "W-function as long-term UCB."
* We can combine exploration and exploitation in a principled way, that is, by using UCB to sum the Q- and W-functions. This does not require any coefficient to tune the "importance" of the two components, as opposed to classic intrinsic-reward methods. This is discussed in Section 3.4 under "W-function vs auxiliary rewards." A minimal sketch of this action-selection rule follows below.
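To illustrate the coefficient-free combination, here is a minimal sketch that acts greedily on the sum of the two value functions; the exact rule used by our method is the one given in Section 3, and the tie-breaking choice is only illustrative.

```python
import numpy as np

# Illustrative, coefficient-free action selection: act greedily on the sum of the
# exploitation value Q and the exploration value W (UCB-style). The exact rule
# used in the paper is given in Section 3.

def select_action(Q, W, s):
    """Pick the action maximizing Q(s, .) + W(s, .), breaking ties at random."""
    values = Q[s] + W[s]                   # no weighting coefficient to tune
    best = np.flatnonzero(values == values.max())
    return int(np.random.choice(best))
```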
The results we show, albeit only on tabular MDPs, demonstrate that our method clearly outperforms all baselines. This achievement would not have been possible by merely improving the reward.
For the aforementioned reasons, we argue that our work brings major contributions to the RL community.
Regarding related work, we already compare against the most common intrinsic-reward methods. More recent works base curiosity on model prediction errors, i.e., they learn a model of the transition function and use its prediction error as the intrinsic reward. However, for tabular MDPs this is ineffective because the transition function is trivial to learn; thus, the intrinsic reward (i.e., the exploration) fades to zero very quickly. Indeed, all these methods have been developed for continuous control. Nonetheless, we discuss and cite them in Section 2.2.
We would be happy to evaluate other missing relevant works if the Reviewer has something specific in mind, using the same experimental settings.
In Figure 22, the standard deviation is reported only for filled dots. For empty dots, the algorithm did not converge in some runs (as reported in the caption: "In empty dots connected with dashed lines, the algorithm did not learn within the step limit at least once."). For empty dots, we do not report the confidence interval. This was not clear in the caption, and we have fixed it.
To recap:
* Pink: mostly empty dots, so we do not show the confidence interval.
* Green: all filled dots report the confidence interval. It is very small, but in some cases it can be seen by zooming in.
* Blue and orange: the confidence interval is always reported, but it is extremely small and cannot be seen. Indeed, our algorithms perform significantly better than the others in terms of consistency of the results.
It's worth mentioning that the MDP is deterministic, and the randomness coming from the seeds affects only the action selection in the case of action-value ties.
We apologize for any misunderstanding and have updated the caption.
Round 3
Reviewer 1 Report
The authors have addressed my concerns. I have no further comments.