Article

Rapidly Exploring Random Trees Reinforcement Learning (RRT-RL): A New Era in Training Sample Diversity

István Péter, Bálint Kővári and Tamás Bécsi
1 Asura Technologies Ltd., H-1122 Budapest, Hungary
2 Department of Control for Transportation and Vehicle Systems, Faculty of Transportation Engineering and Vehicle Engineering, Budapest University of Technology and Economics, H-1111 Budapest, Hungary
* Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 443; https://doi.org/10.3390/electronics14030443
Submission received: 17 December 2024 / Revised: 20 January 2025 / Accepted: 21 January 2025 / Published: 22 January 2025
(This article belongs to the Special Issue Reinforcement Learning Meets Control: Theories and Applications)

Abstract

Sample efficiency is a crucial problem in Reinforcement Learning, especially when tackling environments with sparse reward signals that make convergence and learning cumbersome. In this work, a novel method is developed that combines Rapidly Exploring Random Trees with Reinforcement Learning to mitigate the inefficiency of the trial-and-error-based experience-gathering concept through the systematic exploration of the state space. The combined approach eliminates the redundancy in irrelevant training samples. Consequently, the pivotal training signals, despite their sparsity, can be further exposed to support the learning process. Experiments are conducted on several OpenAI Gym environments to demonstrate that the proposed method does not have any context-dependent components, and the results show that it can outperform the classic trial-and-error-based training approach.

1. Introduction

1.1. Motivation

Reinforcement Learning (RL) became a popular choice for solving sequential decision-making problems in the late 2010s. Its success primarily stems from the achievements of the Alpha algorithm family [1,2], which solved several decades-old problems. Notably, researchers from DeepMind received the Nobel Prize in Chemistry for addressing the protein folding problem with AlphaFold [3]. More recently, RL has also become relevant to other realms of AI, such as Large Language Models (LLMs), since their final training stage is a specific form of RL called Reinforcement Learning from Human Feedback (RLHF), which helps to align the responses of LLMs with human preferences [4]. Thanks to these achievements, more and more researchers are attracted to this domain, which in turn drives work on the shortcomings of RL itself. One of these shortcomings is sample efficiency. In Reinforcement Learning problems in particular, employing an effective exploration strategy is crucial, as it enables the agent to construct a comprehensive internal representation of the environment. The tension between integrating new knowledge and exploiting existing knowledge (commonly referred to as the exploration vs. exploitation tradeoff) emerges in most Reinforcement Learning methods. Inefficient exploration is highly disadvantageous, as the agent wastes resources collecting redundant or low-value information while also incurring the cost of missed rewards due to unexploited knowledge.
Many popular RL algorithms, including the Deep Q-Network (DQN) approach discussed in detail in this study, implement exploration through trial-and-error in the state space of the environment. Initially, when the agent has no prior information about the environment, this randomness helps it get started. However, as the accumulated knowledge grows, opportunities arise to shift exploration toward a more targeted approach—focusing on neglected areas and aiming to efficiently cover the unexplored regions of the state space. Unfortunately, this sense of purposeful exploration is absent from the trial-and-error-based approach.
Since its introduction [5], the Rapidly-exploring Random Trees (RRT) algorithm has become popular and widely used, primarily due to its simplicity and its ability to provide optimal state space coverage. It is particularly prevalent in robotics, especially for tasks such as motion planning [6]. One of its key advantages is its ability to rapidly reach distant states, making it well suited for the comprehensive exploration of an environment.
In this study, the utility of the RRT algorithm is examined for sample collection in the context of DQN, comparing it to the traditionally used trial-and-error-based approach. Results are presented in three well-known RL environments, demonstrating that, when properly applied, RRT-based sample collection can train agents of equivalent or better performance using significantly fewer samples. This is achieved by leveraging the algorithm’s potential to avoid information redundancy and expose sparse training signals.
The topic is both current and relevant, as intelligent data processing and control methods, including Reinforcement-Learning-based approaches, are gaining increasing prominence in today’s automotive industry applications. However, the vast and complex streams of sensor data must be managed and organized effectively. A method that enables the same information content to be stored in a more compact representation for a learning RL system can be of great assistance in this regard.
Due to the characteristics of the environments studied and the nature of tree construction, the proposed method is well suited for application in simulation environments. In such settings, the reduced number of samples results in lower computational requirements for training intelligent agents.

1.2. Related Work

The RRT algorithm and its trajectory-length-optimized variant, RRT*, are introduced in [5,7]. The application of RRT in Reinforcement Learning has been explored in numerous publications, some of which are reviewed in [8]. For example, ref. [9] implements RRT sample generation under the name Planning for Policy Search (PPS) using classical control methods and a Model Predictive Control (MPC)-based local planner.
In [10], the use of RRT is examined in robotics for motion planning tasks, where a neural network serves as the local planner. Additionally, in [11], the algorithm’s applicability is evaluated for long-term navigation problems. In these cases, however, the distance function used in tree construction is typically designed based on the specific application domain.
A method similar to the one examined in this study is presented in [6,8], where the performance of a DQN network [12] applied to lane-keeping tasks is evaluated under different sample collection strategies. In the construction of the RRT tree, the distance metric used to identify the nearest nodes is the Euclidean distance between the Q-value vectors corresponding to the states.

1.3. Contribution

In every RL problem, the agent aims to maximize the cumulative reward, which requires the development of suitable behavior. Throughout training, the agent encounters different experiences that carry information about how actions pay off in different scenarios and, hence, how to act in different states. This feedback gradually shapes a specific behavior. Although neural networks are great at interpolation, they cannot extrapolate. Consequently, if the agent is not prepared for a given situation, its action and final performance will fall short. This can be tackled through training sample diversity, which means that the state space of the problem is adequately explored: every relevant situation is used for training, and redundancy is mitigated, since redundancy can bias the agent toward understanding specific scenarios better at the expense of others. For more information about the influence of diversity on Machine Learning, see [13]. The basic idea of the concept presented here is derived from [8]. However, in [8], the algorithm is only demonstrated on lane keeping, mostly because the distance metric inside the RRT algorithm depends on the problem itself. Hence, the original approach had to be improved to create a general method.
  • The component that bridged the gap from solving lane keeping to becoming a general method is the cosine-similarity-based distance metric. Consequently, this study introduces a novel perspective on the distance function defined over the set of samples, which is used by the algorithm during the construction of the RRT tree. In this innovative approach, a probability distribution is generated for each state from the Q-vector estimated by the DQN network using the softmax function, and the cosine distance (cosine similarity) between these distributions is considered the relevant metric.
  • While the referenced article uses the softmax-Q vector only in the tree construction’s rollout phase to select the next action, we incorporate it into the nearest neighbor search phase as well, replacing the Euclidean distance.
  • Even though the method is implemented for the DQN algorithm for the experiment’s sake, the concept does not have any component that relies on this specific algorithm. It can be used in any model-free RL method and utilizes a memory buffer for storing the training samples during the learning process.
The paper is organized as follows: First, all the methodologies used for creating the proposed algorithm are introduced. Then, our method is detailed to clarify its operation. After the proposed method, the environments used for evaluation are introduced. Finally, the results are presented for every environment, with additional investigations that try to articulate the proposed method’s inner operation.
An important context for our contribution is its relation to another popular topic in RL, sample prioritization. Our current method focuses on improving exploration, which leads to better state space coverage and, thus, better-performing RL agents. A natural extension for further optimizing convergence during training is defining a suitable sample priority function over the samples already collected, like in [14]. We, however, focus solely on the optimized exploration aspect, with each sample gathered by our method having a uniform probability of selection for the next batch during training.

2. Methodology

2.1. Reinforcement Learning

Reinforcement Learning is one of the main fields of Machine Learning (ML), along with Supervised and Unsupervised Learning (SL, UL). RL has attracted researchers' attention with its potential to solve sequential decision-making problems in a wide variety of fields, such as robot control [15], railway optimization [16], and trajectory tracking [17]. RL has several unique virtues compared to other methods. One is its ability to generate the required training data through the interactions between the agent and the environment. Another crucial one is that the performance of RL is not bound by the training data themselves; it is only bounded by the reward function created for the given control task. Moreover, RL is the only field in ML capable of synthesizing new behavior, since SL and UL can only mimic the decisions hidden in their training data. This phenomenon is well articulated in the results of the Alpha algorithm family from Google DeepMind [18,19].
The training loop of RL is shown in Figure 1. The agent chooses an action a_t, which the environment receives and realizes. The realization of the action triggers a state transition inside the environment. The environment then sends scalar feedback called the reward r_t to the agent, together with the next state s_{t+1}, based on which the agent determines its next action. In every training run, the agent aims to maximize the discounted cumulative reward:
G = \sum_{t=1}^{T} \gamma^{t} r_{t} .
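To make this loop concrete, a minimal sketch of the agent-environment interaction in Gymnasium is given below; the environment name ("CartPole-v1") and the random action choice are placeholders for illustration only and are not part of the proposed method.

```python
import gymnasium as gym

# Minimal sketch of the interaction loop in Figure 1 (illustration only).
# "CartPole-v1" and the random policy are placeholders, not the proposed method.
env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)

G, discount, gamma = 0.0, 1.0, 0.99
done = False
while not done:
    action = env.action_space.sample()                           # the agent chooses a_t
    state, reward, terminated, truncated, _ = env.step(action)   # env returns r_t and s_{t+1}
    G += discount * reward                                       # accumulate the discounted return
    discount *= gamma
    done = terminated or truncated
env.close()
```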

2.2. Deep Q Learning

Value-based RL methods are at the forefront of this field, thanks to the early success of DQN. The main concept of these methods is the estimation of the value of a given state through the cumulative rewards. Consequently, these methods learn a regression task, and their prediction is used for action selection through a policy π. Since then, these algorithms have been applied in many domains with great success, such as autonomous driving [20], traffic signal control [21], and more [22]. The DQN (Deep Q-Network) method is an adaptation of Q-learning to deep neural networks, presented in [12]. The algorithm aims to estimate the optimal Q-function in a given Markov environment (Markov Decision Process, or MDP) using deep networks, i.e.,
Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ \left. \sum_{t=0}^{\infty} \gamma^{t} r_{t} \, \right| \, s_{0} = s,\ a_{0} = a,\ \pi \right]
provided by the optimal policy π*. The method uses the recursive form of the function Q*, transformed from the Bellman equation, for learning:
Q^{*}(s, a) = \mathbb{E}\left[ \left. r + \gamma \max_{a'} Q^{*}(s', a') \, \right| \, s, a \right]
The agent uses the neural network estimator Q(s, a; θ), where θ denotes the parameters of the network, to choose an intervention (action) according to the maximum value of the expected total future reward discounted by γ in each state.
For a given training sample (s, a, r, s') during the training of the neural network, the input of the model is s, while
y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})
is imposed on the output of the estimated Q-vector for the specific action a, while the function penalizing the deviation (loss) is the mean squared error, i.e.,
L(\theta) = \mathbb{E}_{s, a, r, s'}\left[ \left( y - Q(s, a; \theta) \right)^{2} \right]
A significant contribution of [12] is the solution of the stability problems of similar previous attempts at deep Reinforcement Learning. This was achieved, among other things, by using experience replay, i.e., randomly sampling previous state transitions to select a learning batch, because experience has shown that learning from a successive, and thus highly correlated, set of transitions within a trajectory leads to instability. A similar improvement is achieved by using a separate target network (denoted by θ⁻ in the formulas above and used in the definition of the target variable), which is only updated periodically; this also improves the convergence of the network during learning, since it stabilizes the distribution of the expected target variables over time.
The authors demonstrated the effectiveness of the method by evaluating it on 49 different Atari 2600 simulator environments.
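As an illustration of the update described by the equations above, a minimal PyTorch sketch of one DQN loss computation is given below; the networks and the replay batch are assumed to be provided elsewhere and are placeholders here.

```python
import torch
import torch.nn as nn

# Sketch of one DQN update: y = r + gamma * max_a' Q(s', a'; theta^-) and the MSE loss.
# q_net and target_net are assumed to map a batch of states to Q-vectors (placeholders).
def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch             # tensors from a replay buffer
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values       # max_a' Q(s', a'; theta^-)
        y = rewards + gamma * (1.0 - dones) * max_next_q             # terminal states bootstrap to 0
    return nn.functional.mse_loss(q_sa, y)
```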

2.3. Efficient Sample Tree Construction: RRT

RRT (Rapidly Exploring Random Trees) is a simple algorithm for exploring a state space in a tree structure, presented in [5], which exploits the phenomenon called Voronoi bias to develop an asymptotically optimal coverage of the space. Owing to the efficiency of this exploration, it has been widely used for path finding in various robotic tasks since its appearance, especially for motion planning in systems with complex dynamics. An improved version of the algorithm, RRT*, presented in [7], additionally takes the optimality of trajectories into account, aiming to find the shortest paths between points in the state space. In this paper, however, we adapt the original method to the field of Reinforcement Learning, specifically to training sample collection.

2.4. Optimal Training Sample Collection with the RRT Algorithm

As established earlier, one of the main problems with RL is sample inefficiency, which is even more crucial in environments where the non-zero training signals are sparse. This makes the training process even more challenging, since irrelevant or redundant training signals can dominate the memory buffer. By analyzing the training sample collection of model-free RL algorithms, one can see that the process is driven by trial-and-error and an ε-greedy policy. Combined, these two realize a random-walk-like exploration of a multi-dimensional state space. As a consequence, the sparse and vital training samples are underrepresented, while irrelevant training samples pile up in the memory buffer. The question, then, is how diversity in the training samples can be ensured. Diversity in this context means that the multi-dimensional state space is explored evenly. This is where the RRT algorithm is employed, since it offers the efficient exploration of a state space. Unfortunately, the classic RRT algorithm has two pitfalls:
  • The distance metric between two tree nodes (two points of the state space);
  • The policy of the local planner.
Hence, before integrating RRT into the RL training process, these steps of the RRT algorithm have to be interpreted in this domain. In our previous work [8], the distance metric used was the Euclidean distance between the predicted Q-vectors, which worked well in a lane-keeping application but cannot solve traditional RL sandbox problems. In this paper, we propose a novel metric for measuring the distance between two points in the state space: cosine similarity. Cosine similarity is widely used in deep learning for measuring the distance between points in an encoded space [23]. The intuition behind this choice is that the agent expresses the strategic content of a specific part of the state space through the (softmax of the) predicted Q-values, as established in [12]. Thanks to that, the RRT algorithm can explore the relevant regions of the state space and expose vital training samples while mitigating redundant experiences, regardless of the given application domain. Cosine similarity has many uses in Machine Learning, most notably discriminating learned embeddings in a meaningful way, which is very similar to our use of learned Q-vectors. Ref. [24] explores this in visual contrastive learning, while [25] presents a Natural Language Processing application that estimates Semantic Textual Similarity from BERT embeddings. Furthermore, this metric can differentiate the feature vectors more effectively, as softmax normalization tends to reduce the pairwise Euclidean distances between Q-vectors, making these distances more similar. Thus, an alignment metric based on pairwise angles between vectors becomes a better-discriminating feature in high dimensions. Euclidean distances also tend to grow more similar between randomly sampled vectors as dimensionality increases (the curse of dimensionality) [26], which also makes cosine similarity preferable in higher dimensions (hence its use for high-dimensional SimCLR embeddings [24]). Therefore, based on its broad applicability and expected advantages, we have chosen this function as our domain-independent distance metric.
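A minimal sketch of the proposed distance computation follows, assuming access to the Q-vectors predicted by the DQN for two states; the helper names are ours and purely illustrative.

```python
import numpy as np

# Proposed distance: cosine similarity between softmax-normalized Q-vectors.
def softmax(q):
    z = np.exp(q - np.max(q))        # subtract the maximum for numerical stability
    return z / z.sum()

def q_space_similarity(q_a, q_b):
    """Cosine similarity between the softmax distributions of two Q-vectors."""
    p_a, p_b = softmax(q_a), softmax(q_b)
    return float(np.dot(p_a, p_b) / (np.linalg.norm(p_a) * np.linalg.norm(p_b)))
```

During tree construction, the stored node whose softmax-Q vector is most similar to that of the randomly drawn state is selected as the nearest neighbor.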
As far as the local planner is concerned, when the RRT algorithm finds the closest tree node to the randomly generated one by calculating the cosine similarity of the predicted Q values of the nodes’ states, a rollout phase is applied that takes five random steps from the given exploration node.
In a nutshell: instead of trial-and-error-based sample collection, the proposed approach uses the RRT method, which incrementally builds a tree based on cosine similarity distances calculated in Q-space, where every node is a state S and the nodes are connected by edges that correspond to actions A. Thanks to this structure, every pair of connected nodes forms a standard (S, A, R, S') tuple, providing training experience for the RL agent. In this way, the built tree becomes a uniquely structured memory buffer for the agent, which reduces redundant samples.
The exact methodology can be seen in Algorithm 1.
In summary, the cosine-similarity-driven RRT algorithm can reduce the number of redundant states by leveraging its tendency toward fast and broad exploration of the model's latent softmax-Q-space. In our experience with sparse reward environments, the redundancy in softmax-Q-space encompasses most zero-reward states; despite their overwhelming incidence under purely random-walk exploration, these states form a relatively compact cluster in the latent space. Aiming for broad exploration helps escape this redundancy and reach more promising states, facilitating faster convergence during model training. While we substantiate the latter with empirical evidence across multiple environments, we additionally conjecture that our exploration heuristic mitigates the sparse reward problem in RL environments. This, of course, has to be studied rigorously in future work; however, we include state space visualizations as a first step in this regard and to provide context and insight alongside our main point.
Algorithm 1 RRT-based sample collection adapted for this study.
  Input: Q(s) vector-valued function with softmax, MDP environment E, N: number of samples, K: rollout steps
  Output: T: tree of samples

  distance ← cosine similarity
  Initialize T with seed states from E
  while less than N samples in T do
      s_random ← RandomState(E)
      s_closest ← argmax_{s ∈ T} distance(Q(s), Q(s_random))
      for K steps do
          Set state s_closest in E
          q_closest ← Q(s_closest)
          a ← RandomSample(q_closest)
          Execute action a in E; observe reward r and next state s'
          T ← T ∪ {(s_closest, a, r, s')}
          if episode is terminated then
              break
          s_closest ← s'
  Return T
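To make the pseudocode concrete, a possible Python rendering of Algorithm 1 is sketched below. It reuses the softmax and q_space_similarity helpers from the earlier sketch; sample_random_state and set_state are assumed helpers (setting an arbitrary state is not part of the standard Gymnasium API and must be supplied by the simulation environment), so this is a sketch under those assumptions rather than a reference implementation.

```python
import numpy as np

def rrt_sample_collection(env, q_values, n_samples, k_rollout=5, rng=None):
    """Sketch of Algorithm 1. q_values(state) returns the Q-vector of a state;
    sample_random_state(env) and set_state(env, state) are assumed helpers."""
    rng = rng or np.random.default_rng(0)
    tree = [env.reset(seed=0)[0]]                 # seed the tree with an initial state
    samples = []                                  # collected (s, a, r, s') tuples
    while len(samples) < n_samples:
        s_random = sample_random_state(env)
        # nearest node: highest cosine similarity in softmax-Q space
        sims = [q_space_similarity(q_values(s), q_values(s_random)) for s in tree]
        s_closest = tree[int(np.argmax(sims))]
        set_state(env, s_closest)
        for _ in range(k_rollout):                # rollout of K steps from the chosen node
            probs = softmax(q_values(s_closest))
            action = int(rng.choice(len(probs), p=probs))   # action sampled from softmax-Q
            s_next, reward, terminated, truncated, _ = env.step(action)
            samples.append((s_closest, action, reward, s_next))
            tree.append(s_next)
            if terminated or truncated:
                break
            s_closest = s_next
    return samples
```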

2.5. Limitations

Possible shortcomings of the method, when considering generalization to other problem domains, could arise from the tree structure of the sample collection. On the one hand, applying our method to RL problems with a continuous action space is not straightforward, because the method relies on the analogy between nodes with a discrete number of child nodes and states with a discrete number of actions leading to subsequent states. Sampling strategies over the action space that guarantee a finite yet representative set of actions for a given state could be considered, but this remains a substantial problem. On the other hand, for problems where each state allows a large number of actions, the average number of child nodes increases; the tree therefore becomes wider but shallower, resulting in shorter trajectories on average. This can fail to represent the state distribution of some late-game situations of the problem at hand, leading to underperformance during training.

3. Description of Environments

The OpenAI Gym framework (now maintained as Gymnasium) is used for experimentation, since it has a standardized interface that enables the comparison of different RL algorithms in multiple environments. However, this is not its only virtue: it is also open source, which makes it possible to iterate with different reward functions and other modifications to the environments. Furthermore, we use the Gymnasium environments to support our claim that the proposed algorithm is independent of the control problem and can be utilized in any task where value-based RL algorithms are chosen to provide a solution.

3.1. Cartpole

In the Cartpole environment (presented in more detail in [27]), the goal of the agent is to balance a rod mounted on a cart for as long as possible. The cart can move left or right on a horizontal plane, with negligible friction, while the rod moves depending on the inertia of the system. In addition, the environment tries to push the rod out with low-intensity random inputs, which the agent has to counteract with an opposing intervention.
The task is relatively easy and well suited for benchmark testing reinforcement learning algorithms.
The observation of the agent is a four-dimensional vector (x, \dot{x}, \theta, \dot{\theta}): the cart position, the cart velocity, the pole angle, and the pole angular velocity. It has two types of interventions: action 0 pushes the cart to the left with a given force, and action 1 pushes it to the right (thus affecting the pole and its angle only indirectly through the system's coupling and inertia).
During the episode, the agent receives a reward of +1 as long as it can keep the pole upright and centered, more precisely, as long as the angle of the pole and the position of the cart remain within the limits. Excessive sway of the pole causes the episode to end, but no negative (or other than +1) reward is generated, so the agent seeks to maximize the episode duration through the sum of rewards.

3.2. Acrobot

In the Acrobot environment, the agent controls an articulated pendulum system hanging from a suspension, and the goal is to raise the endpoint of the pendulum above a given height (see [28]). In the discrete version of the problem, the agent can act at the hinge point where the two sections of the pendulum meet, applying a fixed amount of positive or negative torque, or 0 (no action). The agent's observation space is a six-dimensional vector (\cos\theta_{1}, \sin\theta_{1}, \cos\theta_{2}, \sin\theta_{2}, \dot{\theta}_{1}, \dot{\theta}_{2}), where \theta_{1} and \theta_{2} are the angles at the suspension point and the hinge point and \dot{\theta}_{1} and \dot{\theta}_{2} are their angular velocities. The height of the endpoint is
h = -\cos(\theta_{1}) - \cos(\theta_{1} + \theta_{2}),
and the system reaches the target state when this height is greater than 1. In the implementation of the environment, the angular velocities are subject to upper and lower limits.
The problem’s difficulty lies in the complexity of the system dynamics and the sparsity of the reward. In the Acrobot environment, nonnegative feedback is sparse, since the agent receives a reward of −1 at every step before reaching the goal state, and the episode terminates with a single 0 reward when the goal position is reached. For this reason, it takes a lot of exploration to acquire even a single nonnegative experience with such a reward function.
In the most recent implementation of Gymnasium, the environment stops after 500 steps even if the goal has not been reached, but configuration parameters can override this, and so did we, for the sake of giving the agent a better chance at finding useful samples.
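For clarity, the goal condition described above can be written as a small, self-contained check (illustration only):

```python
import numpy as np

# Acrobot goal test as described above: the free end is high enough when
# -cos(theta1) - cos(theta1 + theta2) exceeds 1.
def acrobot_goal_reached(theta1: float, theta2: float) -> bool:
    height = -np.cos(theta1) - np.cos(theta1 + theta2)
    return height > 1.0
```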

3.3. Mountain Car

The Mountain Car problem is introduced in [29] as a nonlinear, low-dimensional, challenging MDP environment. The environment has two versions; we chose the one with discrete action space. The vehicle in the environment is located on hilly terrain, and the agent only knows its position (x) and velocity (v), which are projected on the x-axis at each instant.
The system status is updated at every time step according to the following dynamics:
v_{t+1} = v_{t} + (a - 1) \cdot f - \cos(3 x_{t}) \cdot g
x_{t+1} = x_{t} + v_{t+1}
where a ∈ {0, 1, 2} denotes the action (0 and 2 for left and right acceleration, 1 for no acceleration), and f = 0.001 and g = 0.0025 are the applied force and the gravitational acceleration, respectively. The values of x and v are limited in the environment. The challenge posed by the environment is that it only ever gives a reward of −1: unlike in Acrobot, even after reaching the goal, the agent does not receive any different feedback; the episode simply ends. Therefore, the goal is to minimize the penalized duration by reaching the hilltop at x = 0.5 as fast as possible. Following the optimal policy, the car must first swing back and forth between the hill behind it and the hill in front of it several times, continuously gathering momentum (potential and kinetic energy) to reach the hilltop.
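The update above translates directly into code; the force and gravity constants come from the text, while the velocity and position bounds below are those of the standard implementation and are included only for completeness.

```python
import numpy as np

# One step of the discrete Mountain Car dynamics given above, with a in {0, 1, 2}.
FORCE, GRAVITY = 0.001, 0.0025

def mountain_car_step(x: float, v: float, a: int) -> tuple[float, float]:
    v = v + (a - 1) * FORCE - np.cos(3 * x) * GRAVITY
    v = float(np.clip(v, -0.07, 0.07))       # velocity limit
    x = float(np.clip(x + v, -1.2, 0.6))     # position limit; the goal is at x = 0.5
    if x <= -1.2 and v < 0:                  # the left boundary absorbs leftward momentum
        v = 0.0
    return x, v
```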

4. Experiment Details and Rationale

As introduced above, the proposed method investigates the possible benefits of RRT-based exploration and sample selection in the RL training process. In such a setup, a fair comparison is extremely important; hence, we must ensure that both methods can utilize the same number of training samples and tune their neural network with the same number of iterations.
The RRT-RL combined method does not utilize the traditional training loop concept for gathering experiences. Instead, the modified RRT algorithm gathers training samples into the memory buffer. Notably, the experience collection in this method is the same as the tree expansion; every new node in the tree is a new experience stored in the memory buffer. In a nutshell, the proposed method substitutes the trial-and-error-based approach with an iterative tree-building process.
For comparison, two types of baselines are employed:
  • The standard RL approach, where the training experience is gathered during trial-and-error, and the agent is also trained during the experience collection process (the circular train buffer contents change over time).
  • Trial-and-error-based experience gathering, but the agent is not trained during this process, only afterwards, and thus works with a fixed pool of the required number of training samples (train buffer) that does not change in tandem with the progression of the agent training itself.
All of the algorithms are run with the same seeds; hence, each environment starts from the same initial state, and all methods are given the same number of iterations to collect experiences into the memory buffer. After the sample collection phase, the same neural network architecture is initialized for all memory buffers; the seed is also identical for this step to ensure comparability. In the final stage, the neural networks (one for the classic DQN and one for the RRT-RL approach) are given the same number of iterations to tune the network with the same hyperparameters.
This evaluation is conducted for every environment with several different seeds, and the results are averaged to ensure that a reliable conclusion can be drawn.
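The protocol can be summarized schematically as follows; every function name here (make_seeded_env, collect_rrt_samples, collect_random_samples, train_dqn, evaluate) is a hypothetical placeholder for the routines described in the text, and only the fixed-buffer baseline is shown for brevity.

```python
import numpy as np

# Schematic of the seeded comparison protocol (placeholders only, not runnable as-is).
SEEDS = [0, 1, 2, 3, 4]
N_SAMPLES, N_UPDATES = 50_000, 10_000

collectors = {"RRT-RL": collect_rrt_samples,
              "DQN with fixed train buffer": collect_random_samples}
results = {name: [] for name in collectors}

for seed in SEEDS:
    for name, collect in collectors.items():
        env = make_seeded_env(seed)                     # identical initial state
        buffer = collect(env, n_samples=N_SAMPLES)      # identical sample budget
        agent = train_dqn(buffer, seed=seed,
                          n_updates=N_UPDATES)          # identical architecture and hyperparameters
        results[name].append(evaluate(agent, env))

summary = {name: float(np.mean(r)) for name, r in results.items()}  # averaged over seeds
```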

5. Results

Since the Cartpole environment is relatively easy to solve, a thorough investigation is conducted on whether there is a performance gain with the proposed method. The investigation is realized by comparing the performance of all three methods (using their different training sample-gathering strategies), with the same amount of training (same number of batch updates). The results of this evaluation can be seen in Table 1.
Table 1 shows the training iteration at which the agent solved the Cartpole problem (and reached the maximum reward). As can be seen, the RRT-RL algorithm can outperform both the trial-and-error-based standard DQN training and the fixed train buffer approach, which only collects samples and does not train the agent during sample collection. These results suggest that the proposed method can consistently outperform the baselines as the number of training samples increases.
The proposed method was also tested on the Acrobot and Mountain Car environments. These problems are more complex than Cartpole; hence, the final performance is analyzed together with the speed of convergence and the number of training samples required for training the agent. The results are shown in Table 2.
As Table 2 articulates, the proposed RRT-RL method outperforms both baselines in the Acrobot and Mountain Car environments in terms of performance (validation reward). Comparing the baselines to each other, the standard DQN training performs slightly better than the approach where the agent does not learn during the trial-and-error exploration. Analyzing the number of training samples required to reach superior performance in these environments, the proposed algorithm outperforms the baselines with a tenth of the training samples in the case of Acrobot and a quarter of them in the case of Mountain Car. Regarding the speed of convergence, RRT-RL requires slightly fewer training iterations when solving the Acrobot environment, owing to its more diverse training samples, and in the case of the Mountain Car environment, it requires much shorter training to solve the control problem. The convergence chart for the Mountain Car environment is shown in Figure 2; the only baseline included here is the standard DQN, as it is the only method that could roughly approach the performance of RRT-RL.

5.1. State Space Coverage

A state space coverage analysis is conducted on the Mountain Car environment for further evaluation purposes, since this problem is the most challenging of all the test environments. Our goal is to visualize the state space to empirically check our hypothesis that the combination of RRT and RL can yield broader state space coverage with fewer samples.
Due to the nature of the RRT algorithm, it is reasonable to assume that the network learns better with RRT-collected samples because these samples cover the state space more effectively, and the results of the previous section empirically prove this assumption. While the method operates by mapping samples into the network’s Q-space, when these samples are mapped back into the original state space of the problems, they still achieve the optimal or sufficient coverage of the essential regions, even with fewer samples.
Our hypothesis that fewer but more diverse samples yield better performance is best demonstrated in the Mountain Car environment, as shown in Figure 3. Although significantly fewer samples (25%) are collected using the RRT tree, these samples cover the state space of the control problem (observed position and velocity) much more broadly. For this problem, broader exploration ensures the discovery of the optimum, as the car must first be able to climb the hill behind it to gather enough momentum to reach the goal at the top of the hill ahead.
A similar comparison for the Acrobot environment is less striking (Figure 4, where the sample collection is compared using the same number of samples). However, the agent’s results still indicate a better training sample distribution overall.
Another unique feature of the Acrobot environment, among the three problems examined, is that it is the only one where more than one type of reward is received. This allows us to examine the ratio of "winning" samples (0 reward) to "penalized" samples (−1 reward). RRT-based sample collection is also more effective from this point of view. In the reference dataset of 100,000 samples, the agent received a 0 reward in only 22 cases, whereas in the RRT dataset, which is ten times smaller, this number was 20. This means that the proportion of samples that are useful for learning optimal Q-values is roughly ten times higher with RRT-based sampling. Consequently, RRT-based sample collection has an indisputable advantage in performance, and the reason for this gain is also visible in the state space coverage. Furthermore, by comparing the magnitude of the impact between the Acrobot and the Mountain Car problem, the performance gain increases as the rewards of the environment become sparser. This is highly advantageous, since such tasks are the most difficult to tackle with RL.
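Using the counts reported above, the improvement factor works out as
\frac{20/10{,}000}{22/100{,}000} = \frac{2.0 \times 10^{-3}}{2.2 \times 10^{-4}} \approx 9.1 ,
which is consistent with the factor of roughly ten stated above.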

5.2. Impact of Hyperparameters

In RRT tree construction, the rollout parameter controls how many steps of exploration are taken toward the currently selected random point in each iteration. A higher rollout value shifts the process toward depth-first exploration, while a lower value results in a more branching tree but with shorter average trajectories. A high rollout value can compensate for building an RRT forest (starting from multiple root states) instead of a single RRT tree, allowing the average trajectory length to be maintained despite fewer samples being allocated to each tree.
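In terms of the earlier sketch of Algorithm 1, this corresponds to varying only the k_rollout argument; the calls below are hypothetical and reuse the names introduced in that sketch.

```python
# Hypothetical calls to the earlier rrt_sample_collection sketch, varying only the rollout length.
shallow_wide_tree = rrt_sample_collection(env, q_values, n_samples=50_000, k_rollout=1)
default_tree = rrt_sample_collection(env, q_values, n_samples=50_000, k_rollout=5)
deep_narrow_tree = rrt_sample_collection(env, q_values, n_samples=50_000, k_rollout=40)
```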
In the previously demonstrated RRT-based training, the rollout step count was set to 5. The impact of varying this parameter on the results of RRT-based training is shown in Table 3. Based on these results, it can be concluded that a step count of 1 is generally too low for optimal performance. Setting it to 5 typically yields good results, while further increasing the parameter’s value has problem-specific effects (e.g., for the Acrobot environment, increasing the step count to 40 does not degrade performance).
Aside from the rollout steps, the effect of changing several other hyperparameters could be studied extensively to understand the conditions under which the proposed algorithm works optimally. According to the experiment design in our current work, the DQN training configuration (number of training steps, batch size, learning rate, etc.) was set to provide an optimal baseline and was not modified for the RRT-RL method, to ensure a fair comparison. Relaxing this constraint and conducting an extensive hyperparameter sensitivity analysis would allow a better understanding of the interplay between the training configuration and the new kind of sample distribution brought about by the RRT-RL method. However, this is left as a prospect for future investigation.

6. Conclusions

The results clearly demonstrate that RRT-based sample collection significantly enhances the sample-efficient coverage of the state space in the problems examined. Reducing the computational cost comes as a motivating factor for further research in this area. The current method allows for an 80% to 90% reduction in the training sample size without a loss of model performance, which translates to a similar computational resource reduction (CPU or GPU time and memory) considering only the training loop due to its linear scaling concerning the number of training samples. The cost of training sample acquisition is omitted here due to its highly environment-specific nature.
Given that networks trained on RRT-collected samples can achieve better performance with fewer samples under otherwise identical conditions, the possibility of integrating the method presented in this study into classical value-based algorithm training is raised. Such an extension could be viewed as an improved version of the exploration phase or a bridge between exploration and exploitation, leveraging progressively better network versions for data collection.
For instance, additional samples could be collected using a reference network updated every N episodes (e.g., the network achieving the highest validation reward in the previous N episodes), either as a replacement for or in conjunction with exploration steps performed with a probability of ϵ . Based on the findings presented here, it is reasonable to expect that such an extension would result in newly added samples offering greater value over time and promoting faster convergence.
Naturally, this approach raises several methodological and stability-related questions that warrant thorough investigation. In our plans for future work, we also wish to deepen our understanding of the advantage our method provides in RL problems and to systematically dissect its causes, as well as explore related phenomena like the hypothesized relationship between sparse reward problems in RL and our method, as mentioned in a previous section. We also wish to position our work within the context of current advancements in RL, like the topic of Deep Reinforcement Learning from Human Feedback (RLHF) [30], an area of extraordinary interest with the widespread emergence of natural language agents trained with similar methods, and with wide applicability, like the realm of coding tasks, as described in [31]. We are confident that the adaptation of our method to topics on the forefront of current RL research is the best way for us to leverage the results presented in our current work toward a greater impact in the field. In our future endeavors, we will combine our method with other domains that utilize RL, such as the domain of Large Language Models (LLMs), where Reinforcement Learning from Human Feedback is a crucial tool for training these models. As the results of this paper articulated, the proposed method can decrease the required amount of data for the training process by finding the most relevant training samples, which is extremely important in the field of LLMs since their need for training data is enormous.

Author Contributions

Conceptualization, T.B. and B.K.; methodology, B.K. and I.P.; software, I.P.; validation, I.P. and B.K.; investigation, T.B. and B.K.; resources, T.B.; writing—original draft preparation, I.P. and B.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the European Union within the framework of the National Laboratory for Autonomous Systems (RRF-2.3.1-21-2022-00002). The research reported in this paper is part of project no. BME-NVA-02, implemented with the support provided by the Ministry of Innovation and Technology of Hungary from the National Research, Development and Innovation Fund, financed under the TKP2021 funding scheme. B.K. was supported by project no. 2024-2.1.1-EKÖP-2024-00003, implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the EKÖP-24-4-I-BME-150 funding scheme.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and source are available from the authors upon request.

Conflicts of Interest

Authors István Péter and Bálint Kővári were employed by Asura Technologies Ltd. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef]
  2. Fawzi, A.; Balog, M.; Huang, A.; Hubert, T.; Romera-Paredes, B.; Barekatain, M.; Novikov, A.; R Ruiz, F.J.; Schrittwieser, J.; Swirszcz, G.; et al. Discovering faster matrix multiplication algorithms with Reinforcement Learning. Nature 2022, 610, 47–53. [Google Scholar] [CrossRef] [PubMed]
  3. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef] [PubMed]
  4. Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a helpful and harmless assistant with Reinforcement Learning from Human Feedback. arXiv 2022, arXiv:2204.05862. [Google Scholar]
  5. LaValle, S. Rapidly-Exploring Random Trees: A New Tool for Path Planning; Research Report 9811; Iowa State University: Ames, IA, USA, 1998. [Google Scholar]
  6. Bécsi, T. RRT-guided experience generation for Reinforcement Learning in autonomous lane keeping. Sci. Rep. 2024, 14, 24059. [Google Scholar] [CrossRef] [PubMed]
  7. Nasir, J.; Islam, F.; Ayaz, Y. Adaptive Rapidly-Exploring-Random-Tree-Star (RRT*) -Smart: Algorithm Characteristics and Behavior Analysis in Complex Environments. Asia-Pac. J. Inf. Technol. Multimed. 2013, 2, 39–51. [Google Scholar] [CrossRef]
  8. Balint, K.; Gergo, A.B.; Tamas, B. Deep Reinforcement Learning combined with RRT for trajectory tracking of autonomous vehicles. Transp. Res. Procedia 2024, 78, 246–253. [Google Scholar] [CrossRef]
  9. Hollenstein, J.J.; Renaudo, E.; Saveriano, M.; Piater, J. Improving the exploration of deep Reinforcement Learning in continuous domains using Planning for Policy Search. arXiv 2020, arXiv:2010.12974. [Google Scholar]
  10. Chiang, H.T.L.; Hsu, J.; Fiser, M.; Tapia, L.; Faust, A. RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies. arXiv 2019, arXiv:1907.04799. [Google Scholar] [CrossRef]
  11. Faust, A.; Ramirez, O.; Fiser, M.; Oslund, K.; Francis, A.; Davidson, J.; Tapia, L. PRM-RL: Long-Range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-Based Planning. arXiv 2018, arXiv:1710.03937. [Google Scholar]
  12. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  13. Gong, Z.; Zhong, P.; Hu, W. Diversity in Machine Learning. IEEE Access 2019, 7, 64323–64350. [Google Scholar] [CrossRef]
  14. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2016, arXiv:1511.05952. [Google Scholar]
  15. Lillicrap, T. Continuous control with deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  16. Lövétei, I.; Kővári, B.; Bécsi, T.; Aradi, S. Environment representations of railway infrastructure for Reinforcement Learning-based traffic control. Appl. Sci. 2022, 12, 4465. [Google Scholar] [CrossRef]
  17. Mihály, A.; Do, T.T.; Thinh, K.D.; Van Vinh, N.; Gáspár, P. Linear Parameter Varying and Reinforcement Learning Approaches for Trajectory Tracking Controller of Autonomous Vehicles. Period. Polytech. Transp. Eng. 2025, 53, 94–102. [Google Scholar] [CrossRef]
  18. Mankowitz, D.J.; Michi, A.; Zhernov, A.; Gelmi, M.; Selvi, M.; Paduraru, C.; Leurent, E.; Iqbal, S.; Lespiau, J.B.; Ahern, A.; et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature 2023, 618, 257–263. [Google Scholar] [CrossRef]
  19. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef]
  20. Gutiérrez-Moreno, R.; Barea, R.; López-Guillén, E.; Araluce, J.; Bergasa, L.M. Reinforcement Learning-based autonomous driving at intersections in CARLA simulator. Sensors 2022, 22, 8373. [Google Scholar] [CrossRef]
  21. Vieira, M.; Vieira, M.A.; Galvão, G.; Louro, P.; Véstias, M.; Vieira, P. Enhancing urban intersection efficiency: Utilizing visible light communication and learning-driven control for improved traffic signal performance. Vehicles 2024, 6, 666–692. [Google Scholar] [CrossRef]
  22. Sivamayil, K.; Rajasekar, E.; Aljafari, B.; Nikolovski, S.; Vairavasundaram, S.; Vairavasundaram, I. A systematic study on Reinforcement Learning based applications. Energies 2023, 16, 1512. [Google Scholar] [CrossRef]
  23. Ando, T.; Iino, H.; Mori, H.; Torishima, R.; Takahashi, K.; Yamaguchi, S.; Okanohara, D.; Ogata, T. Learning-based collision-free planning on arbitrary optimization criteria in the latent space through cGANs. Adv. Robot. 2023, 37, 621–633. [Google Scholar] [CrossRef]
  24. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the ICML’20, 37th International Conference on Machine Learning, Virtual, 13–18 July 2020. [Google Scholar]
  25. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
  26. Keogh, E.; Mueen, A. Curse of Dimensionality. In Encyclopedia of Machine Learning and Data Mining; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2017; pp. 314–315. [Google Scholar] [CrossRef]
  27. Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 834–846. [Google Scholar] [CrossRef]
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  29. Moore, A.W. Efficient Memory-Based Learning for Robot Control; Technical report; University of Cambridge: Cambridge, UK, 1990. [Google Scholar]
  30. Christiano, P.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep Reinforcement Learning from human preferences. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar] [CrossRef]
  31. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar]
Figure 1. The RL training loop.
Figure 2. The convergence of the RRT-RL and DQN algorithms on the Mountain Car environment. The former shows earlier convergence to a higher validation reward even with 25% of the baseline’s sample size, which can be attributed to the RRT-RL method’s better state space coverage, counteracting the difficulties imposed by the sparsity of the reward.
Figure 3. State space coverage comparison for the Mountain Car problem in terms of density. Using the RRT method, a significantly larger area of the state space is covered with only a quarter of the samples. The method successfully covers regions in the state space that prove helpful during training, in contrast with the baseline method’s samples being densely confined to the close neighborhood of the starting state.
Figure 4. State space coverage for the Acrobot problem. In the figure, samples collected through random exploration are shown in blue, while samples collected using RRT tree-based exploration are shown in red. The results indicate that the slight differences in distribution (e.g., RRT-RL samples being more dense at extreme values of sin(θ₁) and sin(θ₂)) are a sign of prioritizing more impactful samples over redundant ones, providing a broader set of experiences to learn from.
Table 1. Results on the Cartpole environment: the training iteration by which each algorithm first solved the problem (reaching and maintaining 200 points).

Method                          1000 Samples    5000 Samples    10,000 Samples
RRT-RL                          245             231             198
DQN                             309             282             234
DQN with fixed train buffer     367             338             294
Table 2. Results (validation reward) in Acrobot and Mountain Car for different training sample budgets.

Method                          Acrobot 10k    Acrobot 50k    Mountain Car 50k    Mountain Car 100k
RRT-RL                          −62.83         −61.11         −88.86              −83.54
DQN                             −90.04         −78.95         −138.09             −127.34
DQN with fixed train buffer     −98.73         −85.78         −175.34             −142.87
Table 3. This table summarizes the effect of varying the rollout parameter (the best results are in bold).

Environment         n = 1              n = 5             n = 10            n = 20             n = 40
Cartpole 5k         388.00 ± 9.0       437.00 ± 5.0      373.00 ± 15.0     421.00 ± 13.0      74.00 ± 21.0
Acrobot 10k         −107.28 ± 7.23     −62.83 ± 2.07     −73.00 ± 4.85     −62.84 ± 2.89      −61.84 ± 1.87
Mountain Car 50k    −198.85 ± 1.0      −88.86 ± 5.27     −88.86 ± 3.34     −149.62 ± 12.56    −109.38 ± 9.54
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
