1. Introduction
In recent years, deep reinforcement learning (DRL) has achieved remarkable successes across a diverse range of domains, including Go [1,2], Atari [3], StarCraft [4,5], and robotics [6]. These results demonstrate the potential of DRL, which is widely regarded as one of the most promising approaches to real-world sequential decision-making problems. However, despite these accomplishments, one pivotal problem persists: DRL methods tend to be highly sample-inefficient [7], requiring millions of interactions even in seemingly simple game scenarios. For instance, although Agent57 [8] was the first DRL algorithm to surpass the average human player on all 57 Atari games, it generally requires orders of magnitude more interactions than a human. The crux of the sample-inefficiency problem lies in striking a balance between exploration, the active search for unvisited states and behaviors that promise higher rewards and long-term gains [9], and exploitation, the use of knowledge acquired so far to maximize immediate returns. The key question is how an agent should navigate the trade-off between trying novel actions and selecting the best action according to its accumulated knowledge. A good exploration strategy can uncover unknown regions of the environment and accumulate informative experience, thereby speeding up learning, accelerating convergence, yielding greater rewards, and improving the agent's overall performance.
However, current approaches struggle when uncontrollable Gaussian noise permeates the visual input, so that only a small fraction of the pixel space carries relevant, useful information. The agent's ability to accurately assess the current state is then compromised, impeding its choice of the appropriate action for exploration. Moreover, existing exploration algorithms continue to explore while the agent is moving toward a target state, causing it to deviate from the original trajectory and ultimately preventing it from reaching the exploration boundary.
The concept of constructing intrinsic rewards from predictive models was originally proposed in 1991. A highly intuitive approach is to use a forward dynamics model, which can be written as

$$\hat{s}_{t+1} = f(s_t, a_t; \theta),$$

where the model takes the current state $s_t$ and the current action $a_t$ as inputs and uses either a linear function or a neural network to approximate the next state $s_{t+1}$ produced by the environment. This model captures the agent's ability to predict the consequences of its actions. Naturally, a prediction model incurs prediction errors in certain states:

$$e_t = \left\| f(s_t, a_t; \theta) - s_{t+1} \right\|^2.$$

These prediction errors $e_t$ can be used to provide the agent with an intrinsic reward for exploration. The magnitude of this reward is inversely correlated with the agent's familiarity with the current state: higher values of $e_t$ indicate less familiarity and therefore merit larger rewards, encouraging exploration of unfamiliar territory.
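As a concrete illustration, the prediction-error bonus above can be sketched in a few lines of NumPy. The linear forward model and the dimensions below are hypothetical stand-ins for the neural network used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear forward model f(s, a): a stand-in for the neural
# network that predicts the next state from the current state and action.
W_s = 0.1 * rng.normal(size=(4, 4))  # state weights (illustrative)
W_a = 0.1 * rng.normal(size=(4, 2))  # action weights (illustrative)

def predict_next_state(s, a):
    return W_s @ s + W_a @ a

def intrinsic_reward(s, a, s_next):
    # e_t = ||f(s_t, a_t) - s_{t+1}||^2: unfamiliar transitions are
    # predicted poorly and therefore earn a larger exploration bonus.
    return float(np.sum((predict_next_state(s, a) - s_next) ** 2))

s = rng.normal(size=4)
a = rng.normal(size=2)
s_next = rng.normal(size=4)
r_int = intrinsic_reward(s, a, s_next)
```

In a full agent this bonus would be added to the environment reward; here the transition is random, so the bonus is simply the model's squared error on it.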
For finite-state Markov decision processes, several rudimentary exploration heuristics are available, such as the epsilon-greedy strategy [10] and entropy regularization [11]. However, with the combination of deep learning and reinforcement learning in recent years, many methods use neural networks to approximate policies for Markov decision processes. While these methods have achieved remarkable results in various domains, they quickly run into the problem of extremely large state spaces: the traditional approach of storing information about the Markov decision process in tables becomes untenable owing to the vast number of states involved. In such settings, conventional exploration strategies fail to yield effective outcomes, and the agent becomes trapped in a limited subset of states. Moreover, reinforcement learning algorithms are typically designed and evaluated in simulated environments with dense rewards, whereas real-world environments are characterized by reward scarcity, with the agent only updating its policy upon receiving a reward [12]. The agent's exploration strategy must therefore also avoid dangerous states, which imposes stricter demands: it must not only discern sparse rewards but also cautiously avoid hazardous situations.
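For reference, the epsilon-greedy heuristic mentioned above can be sketched as follows; the Q-values and the epsilon settings are illustrative, not drawn from any cited method:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon take a uniformly random action (explore);
    # otherwise take the action with the highest Q-value (exploit).
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])
greedy = [epsilon_greedy(q, 0.0, rng) for _ in range(100)]    # pure exploitation
random_ = [epsilon_greedy(q, 1.0, rng) for _ in range(1000)]  # pure exploration
```

In practice epsilon is annealed from a high value toward a small one over the course of training, interpolating between the two extremes shown here.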
The challenges posed by deep reinforcement learning are further compounded as the state–action space grows. Consider, for instance, real-world robots equipped with high-dimensional state inputs such as images or high-frequency radar signals [13], coupled with manipulation tasks requiring many degrees of freedom. Large state–action spaces impair both the efficacy and the robustness of DRL algorithms. In more complex scenarios, the state–action space may have an intricate underlying structure, with causal dependencies between states or a prescribed order of access in which certain states are reached with very different probabilities. Furthermore, unlike the commonly studied continuous or uniformly distributed action spaces, actions may combine discrete and continuous components. These practical difficulties pose even greater challenges to efficient exploration.
Real-world environments often exhibit a high degree of randomness, with unforeseen elements frequently appearing in both their state and action spaces. Consider, for example, the visual observations of a self-driving car [14,15], which may include extraneous details such as the shifting positions and contours of clouds. In some exploration benchmarks, white noise is deliberately injected to generate high-entropy states, infusing the environment with unpredictability [16,17].
After extensive training, it becomes necessary to reduce the novelty attributed to frequently recurring states, along with the exploration rewards assigned to them. However, empirical investigations have uncovered a problem in certain experimental settings: the rapid decay of exploration rewards creates additional difficulties. Consider, for instance, a maze game composed of many small chambers in which the agent always respawns in one particular room. Each episode requires the agent to leave this initial room before exploring the wider map. Once the number of episodes exceeds a certain threshold, the exploration reward along the path leading away from the starting room decays to the point that the agent can no longer leave it.
The Go-Explore algorithm [18] summarizes the above difficulties and identifies two fundamental issues plaguing contemporary curiosity-driven exploration approaches: "separation" and "derailment". Separation means that while an exploration algorithm can incentivize the agent to visit uncharted regions of the state space, it fails to motivate the agent to move beyond the boundary established by prior exploration rewards and push toward new frontiers: a discernible "separation" exists between the current state and unexplored states. Derailment means that the exploratory actions taken along the way knock the agent off the path back to a previously reached promising state. To address separation, an intuitive solution emerges: return the agent to a previously explored state on the boundary before starting to explore anew.
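The "return, then explore" idea can be sketched on a toy problem. The deterministic chain environment, the archive, and the cell-selection rule below are simplified assumptions for illustration, not the full Go-Explore algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20  # hypothetical chain length

# Toy deterministic chain: the state is an integer position in [0, N];
# action 1 moves right, action 0 moves left (clipped at the ends).
def step(state, action):
    return min(max(state + (1 if action == 1 else -1), 0), N)

# Archive: state -> shortest known action sequence that reaches it.
archive = {0: []}

for _ in range(50):
    # 1. Select a promising cell (here simply the farthest state found).
    target = max(archive)
    # 2. Return: replay the stored trajectory deterministically, taking no
    #    exploratory actions on the way back (this avoids derailment).
    state = 0
    for a in archive[target]:
        state = step(state, a)
    traj = list(archive[target])
    # 3. Explore from the frontier with random actions, archiving new or
    #    more cheaply reached states (this bridges the separation).
    for _ in range(5):
        a = int(rng.integers(2))
        state = step(state, a)
        traj.append(a)
        if state not in archive or len(traj) < len(archive[state]):
            archive[state] = list(traj)

frontier = max(archive)
```

Because the return phase is noise-free, every exploration phase starts at the current frontier rather than decaying back toward the start state.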
We aim to solve the above-mentioned white noise, separation, and derailment problems. Using an optimized feature extraction module and adding auxiliary agents to form a new training paradigm, we focus on improving the agent's performance under sparse rewards and pronounced environmental changes. With these improvements, the agent avoids becoming trapped in an exploration dilemma in hard-exploration environments and maintains strong performance.
2. Related Work
In sparse-reward environments, it is often necessary to incorporate intrinsic reward signals [19] to augment the overall reward received by the agent, thereby fostering exploration of the environment and eventual attainment of the final objective. Exploration strategies can be broadly categorized into three types, depending on how the intrinsic rewards are obtained: (1) exploration based on state counting; (2) exploration based on information enhancement; and (3) exploration based on curiosity rewards.
2.1. Exploration Based on State Counting
Exploration strategies based on state counting tally visits to state–action pairs and convert these counts into rewards. The UCB (Upper Confidence Bound) algorithm [20] stimulates exploration by selecting the action that maximizes an optimistic reward estimate; compared with directly using environmental rewards, UCB increases the likelihood of selecting actions with fewer visits. The MBIE-EB method [21] (Model-Based Interval Estimation–Exploration Bonus) uses tables to count state–action pairs and appends a supplementary reward signal to encourage exploration of rarely visited states.
Departing from exact counting, a pseudo-count technique based on a state density model [22] has been proposed within the UCB framework: pseudo-counts are assigned by constructing a density model, which is then used to calculate additional rewards. DQN + SR [23] (Deep Q-Network + successor representation) employs the norm of the successor representation as an intrinsic reward, showing superior performance relative to density models in continuous spaces.
The methods above aid training by adding auxiliary rewards, in the form of state counts, to the primary environmental rewards. However, in a white noise environment, treating Gaussian noise devoid of any learnable information as a novel state spurs the model to explore it, inadvertently leading the agent away from the intended goal. Our proposed approach mitigates the white noise problem through an optimized feature extraction module, making it more effective in hard-exploration environments.
2.2. Exploration Based on Information Enhancement
Information-enhancement-based exploration drives the agent by using the intrinsic reward of information gain while diminishing the allure of random regions. Information gain, a reward granted for reducing environmental uncertainty, serves as the compass: the agent seeks novel insights as it traverses unvisited states, identifying those with greater potential information gain as the most desirable destinations.
VIME [24] (Variational Information Maximizing Exploration) aims for each trajectory to gather as much environmental knowledge as possible, employing variational inference in a Bayesian neural network framework to formalize the learning process. Hierarchical reinforcement learning [25] partitions the policy into two components, a primary policy and a sub-policy, wherein the former selects the latter, which in turn dictates the primitive action. A subsequent approach learns the optimal strategy for the exploration mechanism by solving an alternative Markov decision process [26], ensuring safe exploration while attaining an improved exploration strategy.
The above methods strengthen agent training by incorporating information gain as a supplementary reward during learning, mitigating the impact of random states on the agent. Nevertheless, when confronted with separation and derailment, such methods cannot free the agent from learning stagnation and push it to the frontier of unexplored territory. Our method resolves this by aiding the agent through collaborative training: with the inclusion of auxiliary rewards, the hardest aspects of exploration are significantly eased, allowing the agent to explore more effectively.
2.3. Exploration Based on Curiosity Rewards
Curiosity-based exploration formulates an intrinsic reward by calculating the disparity between the predicted state and the real state, thereby quantifying the prediction error and guiding the agent's exploration of the environment.
The ICM module [27] confers intrinsic rewards based on curiosity, employing a forward model to forecast forthcoming states and using the disparity between predicted and actual states as an additional intrinsic reward. ECR [28] (Episodic Curiosity through Reachability) proposes an intrinsic reward mechanism grounded in episodic reachability, dispensing different intrinsic rewards by comparing the reachability of the current state with previously encountered states stored in memory.
Our method uses the discrepancy between actual and predicted future states as an intrinsic reward, optimizing feature embeddings to enhance the model's predictive capability, and employs an auxiliary agent to keep the agent from becoming stranded in the environment.
2.4. Random Distillation Network
In research on prediction-error-based exploration, some authors have found that prediction tasks unrelated to the environment dynamics can still facilitate the agent's exploration. Among these methods, Random Network Distillation (RND) [29] is the exemplary representative. RND introduces a prediction task independent of the reinforcement learning objective, designing two neural networks with identical structures for this task:

Target network: denoted $\hat{f}(s)$, with randomly initialized and fixed parameters. It receives the current state $s$ as input and outputs a fixed value $\hat{f}(s)$.

Prediction network: denoted $f(s; \theta)$, with randomly initialized, trainable parameters $\theta$. It receives the current state $s$ as input and produces a prediction $f(s; \theta)$ of the fixed value $\hat{f}(s)$.
The procedure unfolds as follows: the prediction network $f$ is trained to predict the output of the target network $\hat{f}$. For any given state $s_t$, the state is fed to the prediction network to obtain $f(s_t; \theta)$, and evaluated by the deterministic target function to obtain $\hat{f}(s_t)$. The discrepancy between the two outputs is taken as the RND exploration reward in state $s_t$, formulated as:

$$r_i(s_t) = \left\| f(s_t; \theta) - \hat{f}(s_t) \right\|^2.$$
This error also serves as the loss function for the prediction task, so the prediction network is updated throughout training. Training an RND model therefore typically consists of two stages, corresponding to the prediction task and the reinforcement learning task: the model learned in the prediction task is employed in training the reinforcement learning model, and the two stages are conducted alternately.
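A minimal NumPy sketch of this two-network scheme, with linear "networks" standing in for the convolutional ones used by RND (dimensions and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4  # illustrative state and embedding dimensions

# Target network f_hat: random and *fixed* -- its parameters are never trained.
W_target = rng.normal(size=(k, d))
# Prediction network f: trained to match the target's output.
W_pred = np.zeros((k, d))

def rnd_reward(s):
    # r_i(s) = ||f(s; theta) - f_hat(s)||^2, the RND exploration reward.
    return float(np.sum((W_pred @ s - W_target @ s) ** 2))

def train_step(s, lr=0.01):
    # One gradient step on the prediction loss ||W_pred s - W_target s||^2.
    global W_pred
    err = W_pred @ s - W_target @ s
    W_pred = W_pred - lr * 2.0 * np.outer(err, s)

s = rng.normal(size=d)
before = rnd_reward(s)   # high novelty: this state has never been seen
for _ in range(200):
    train_step(s)        # revisiting the state trains the predictor on it
after = rnd_reward(s)    # novelty shrinks as the state becomes familiar
```

The key property is visible even in this toy: the reward for a repeatedly visited state decays as the predictor distills the target, while states the predictor has never been trained on keep a high reward.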
Our methodology builds upon the RND approach: it introduces an optimized feature extractor to form an intrinsic reward module, referred to as ESSRND (Enhanced Self-Supervised Random Network Distillation), which aids agent training, and it further leverages an auxiliary agent training framework (E2S2RND) to improve performance.
4. Results
We first verified the effectiveness of the ESSRND algorithm in a random noise maze environment. The random noise maze is a reinforcement learning environment built on the pycolab game engine and is a typical white noise environment. By testing exploration ability in this environment and comparing with other methods, we verify that ESSRND effectively solves the white noise problem and performs better on hard-exploration problems. Subsequent experiments were carried out on the Atari 2600, covering many environments that are difficult to explore. Performance was mainly tested in the Montezuma's Revenge environment [30], together with related hyperparameter experiments. The performance of the E2S2RND algorithm was then compared and analyzed across the Atari 2600 environments, verifying that E2S2RND trained with an auxiliary agent effectively improves the agent's exploration ability and alleviates the impact of the "separation" and "derailment" problems.
4.1. ESSRND Effectiveness Experiment
Compared with the original RND, we introduce a state feature extraction module to solve the white noise problem. To verify its effectiveness, we carried out experiments in the random noise maze environment. Owing to the color transformation applied to the observations, the original state space is not large, but the probability of visiting two identical states is extremely low, which turns the task into a hard-exploration problem. The experimental environment is shown in Figure 4.
In the experiment, the ratio of the explored space to the total size of the state space is used as the evaluation metric for the agent's exploration ability. The results are shown in Figure 5. Owing to the color transformation in the environment, both the randomly embedded features (blue) and the original RND (green) are trapped by white noise and cannot learn an effective policy. The optimized feature embedding (orange) filters out the influence of white noise and increases exploration coverage of the environment as the number of interaction steps grows, allowing useful strategies to be learned gradually. Our method thus effectively alleviates the white noise problem: compared with random feature embedding and the original RND, ESSRND not only filters out environment features irrelevant to the agent, but also reaches convergence.
The background in the maze environment changes constantly, which greatly affects the agent's ability to judge the path, so feature extraction is particularly important. With the optimized feature extraction module added in this paper, the encoder's ability to extract features is enhanced; adding the intrinsic rewards described here on top of environmental rewards helps the agent explore the maze better, and convergence is relatively stable and unaffected by environmental changes.
4.2. E2S2RND Algorithm Performance Experiment
We conducted comparative experiments in the Montezuma's Revenge environment, mainly comparing E2S2RND with the original RND method; the results are shown in Figure 6. As the figure shows, our proposed ESSRND method (green) and the auxiliary agent training framework E2S2RND (orange) yield a significant performance improvement over the original RND (red). Moreover, the convergence speed of our method during training is also slightly better than that of the original RND. E2S2RND obtains higher rewards than ESSRND at most parameter update steps and continues to trend upward past the peak value of ESSRND, so applying the E2S2RND training framework improves the agent's performance further still.
This result shows that training with the auxiliary agent framework (E2S2RND) on top of ESSRND helps the agent cope with the derailment problem, escape its predicament, explore unknown areas more effectively, and improve its learning ability. The framework also raises the agent's learning ceiling, so that it ultimately achieves better performance.
We also conducted comparative experiments with different methods; the results are shown in Table 1. Both the optimized feature embedding ESSRND and the auxiliary agent training framework E2S2RND outperform the original RND baseline, so our method substantially improves the agent's final performance. Training with the auxiliary agent framework yields better results than ESSRND without the framework and achieves the best performance in the Pitfall, PrivateEye, Montezuma, and Venture environments. The proposed auxiliary agent training framework therefore plays a definite role in hard-exploration environments, helping agents collect more diverse data and thereby learn better strategies.
The R2D2 method [29] introduces Behavior Transfer (BT), a technique that uses pre-trained policies for exploration. Our method outperforms R2D2 in every environment except Solaris. The NGU method [30] uses a single neural network to simultaneously learn multiple directed exploration strategies with different exploration–exploitation trade-offs. NGU retains the best performance in the Gravitar environment, but our method outperforms it in the other five environments.
The rewards in the six hard-exploration Atari environments are relatively sparse. Because the auxiliary agent framework is used during training, it can effectively help the main agent escape its current state when exploration stalls and continue with subsequent exploration steps. When the loss optimization falls into a local optimum, the agent must leave the current suboptimal state and try to reach a better one.
4.4. Ablation Experiment
Here we also performed a simple ablation of the optimized feature extraction module and the auxiliary agent training framework in the Montezuma environment; the results are shown in Table 2. Comparison with the baseline RND lets us analyze the two improvements proposed in this paper.
In Table 2, using only the optimized feature extraction module, only the auxiliary agent training framework, or both combined improves the results relative to the RND baseline by 13.3%, 7.0%, and 28.0%, respectively.
Using only the optimized feature extraction module improves the results by 6.3 percentage points over using only the auxiliary agent training framework, from which it can be seen that the former contributes more to our method. Although the auxiliary agent can alleviate the exploration dilemma, in the face of an ever-changing white noise environment the auxiliary agent alone is far from sufficient; the optimized feature extraction module addresses this problem from the encoder stage.
The results of using only one improvement are also lower than those of using both simultaneously. We conclude that although either part alone improves the model, the gain is smaller than that of the combination. Both proposed improvements therefore raise the model's performance in hard-exploration environments, and using both at once works best.
5. Conclusions
We present a novel approach, E2S2RND, which employs an error prediction mechanism built on optimized features and harnesses an auxiliary agent for collaborative training. By using prediction errors to generate intrinsic rewards, we introduce a new paradigm for refining exploration methods plagued by separation and derailment. Through enhancements to the RND method coupled with an optimized feature extraction module, we improve RND's resilience to white noise interference. The experimental section substantiates the efficacy of the proposed method.
At the same time, since our method mainly targets hard exploration in sparse-reward environments, its results in dense-reward environments are not satisfactory: there is still a large gap compared with the current state-of-the-art methods.
In future research, we plan to extend this method to dense-reward environments, broadening its scope of application so that it performs well across different types of environments.