Constrained Reinforcement Learning for Vehicle Motion Planning with Topological Reachability Analysis

Gu, Shangding; Chen, Guang; Zhang, Lijun; Hou, Jing; Hu, Yingbai; Knoll, Alois

doi:10.3390/robotics11040081

Open AccessEditor’s ChoiceArticle

Constrained Reinforcement Learning for Vehicle Motion Planning with Topological Reachability Analysis

by

Shangding Gu

^1,*,

Guang Chen

^1,2,

Lijun Zhang

²,

Jing Hou

²,

Yingbai Hu

¹ and

Alois Knoll

¹

Department of Informatics, Technical University of Munich, 80333 Munich, Germany

²

School of Automotive Studies, Tongji University, Shanghai 201804, China

^*

Author to whom correspondence should be addressed.

Robotics 2022, 11(4), 81; https://doi.org/10.3390/robotics11040081

Submission received: 22 June 2022 / Revised: 28 July 2022 / Accepted: 8 August 2022 / Published: 16 August 2022

(This article belongs to the Section Intelligent Robots and Mechatronics)

Download

Browse Figures

Versions Notes

Abstract

:

Rule-based traditional motion planning methods usually perform well with prior knowledge of the macro-scale environments but encounter challenges in unknown and uncertain environments. Deep reinforcement learning (DRL) is a solution that can effectively deal with micro-scale unknown and uncertain environments. Nevertheless, DRL is unstable and lacks interpretability. Therefore, it raises a new challenge: how to combine the effectiveness and overcome the drawbacks of the two methods while guaranteeing stability in uncertain environments. In this study, a multi-constraint and multi-scale motion planning method is proposed for automated driving with the use of constrained reinforcement learning (RL), named RLTT, and comprising RL, a topological reachability analysis used for vehicle path space (TPS), and a trajectory lane model (TLM). First, a dynamic model of vehicles is formulated; then, TLM is developed on the basis of the dynamic model, thus constraining RL action and state space. Second, macro-scale path planning is achieved through TPS, and in the micro-scale range, discrete routing points are achieved via RLTT. Third, the proposed motion planning method is designed by combining sophisticated rules, and a theoretical analysis is provided to guarantee the efficiency of our method. Finally, related experiments are conducted to evaluate the effectiveness of the proposed method; our method can reduce 19.9% of the distance cost in the experiments as compared to the traditional method. Experimental results indicate that the proposed method can help mitigate the gap between data-driven and traditional methods, provide better performance for automated driving, and facilitate the use of RL methods in more fields.

Keywords:

motion planning; automated driving; reinforcement learning; reachability analysis

1. Introduction

The industrial demand for automated driving technology is increasing, and this technology has made remarkable progress in providing promising transportation to our lives [1,2]. However, vehicle motion planning for automated driving needs further development which considers multiple constraints in sparse information environments; in these environments, macro information is known and micro information is unknown and uncertain, especially for safe, effective, and precise motion planning. In automated driving, a real traffic environment is full of uncertainty due to information incompleteness, traffic disturbances, and other factors such as the incomplete information provided by vehicle sensors and the disturbances of obstacles, which lead to complex traffic environments that may be difficult to predict in real time [3].

Various of traditional motion planning methods (non-data-driven methods) use heuristic algorithms [4], sampling-based methods [5], and rule-based methods [6,7] for automated driving. Although these methods perform well via solid mathematical models in terms of robustness, interpretability, and stability, they still need considerable human knowledge in advance; therefore, modeling the complex and uncertain environments using these models is difficult, and these models might not perform well under unknown and uncertain environments. However, reinforcement learning (RL) (data-driven methods) [8,9,10] does not need a lot of human knowledge in advance and may perform better in unknown and uncertain environments via trial-and-error learning, which can model complex environments using data-driven methods [11]. For example, AlphaGo shows superior performance via RL [12] as compared to humans in the Game of Go, which is one of the most challenging games that exist; Gu et al. provided a multi-agent RL method that could teach multiple robots how to run in a multi-agent system while considering the balance between the robots’ rewards and safety, in which an agent not only needs to consider its own reward and other agents’ rewards, but also its safety and other agents’ safety in unstable environments [9].

Indeed, agents using RL methods can learn how to automatically navigate a vehicle via exploration and exploitation [13,14]. Therefore, combining RL with traditional methods is significant as it can achieve better performance in uncertain environments for vehicle motion planning. However, how to combine traditional methods with RL and consider safe, effective, and precise motion planning while leveraging the advantages of these two methods and overcoming their shortcomings is a challenge. In this study, a strategy is proposed in which RL is constrained, thus leveraging traditional methods such as the topological path search (TPS) and the trajectory lane model (TLM) to achieve safe, effective, and precise motion planning (named RLTT).

Several key issues need to be solved in this study: First, the routing points need to be considered in a dynamic model of vehicles, which can make motion planning more reasonable and closer to actual environments. Second, how to transform between RL and the topological path needs to be resolved because RL is a trial-and-error algorithm in which the search for paths for large-scale areas with respect to the search time is difficult. On the contrary, the method of TPS can easily and effectively plan a path for large-scale areas. Third, how to build a dynamic model for automated driving and provide dynamic constraints that can render safer, more effective, and more precise motion planning should be considered.

To settle the mentioned problems, we first propose a planning hierarchy framework with three levels, as shown in Figure 1. The first level is for macro-scale planning via TPS, and the second level is the RL method used for micro-scale planning; the differences between macro-scale and micro-scale planning can be found in reference [15]. The third level is TLM, which can make motion planning more precise and closer to actual environments. Second, we take into account that solving the interpretability and instability problems of deep learning (DL) is difficult because DL’s ability is known, but its operations are unknown. Thus, from the perspective of the traditional methods, TPS and TLM are leveraged to ensure safety and to constrain the RL search space, where motion planning can be stabilized by combining their advantages and overcoming their shortcomings. Third, for uncertain environments such as an uncertain obstacle, RL is used to explore dynamic and uncertain environments, and then multiple constraints are considered and the safety range is set through a safe TPS buffer.

The contributions of our proposed method and model are as follows:

The RLTT method along with a theoretical analysis is proposed for automated driving, in which a novel planning hierarchy framework of three levels and a multi-scale planning based on TPS, RL, and TLM are introduced.
Multiple constraints of vehicles are considered, such as the dynamic constraints of vehicles, smooth constraints, and safety constraints, thereby making motion planning more reasonable and closer to actual scenarios.
Uncertain environments are also considered in the proposed planning method, which achieves superior performance as compared to related works.
Safe and efficient motion planning for automated driving is achieved. The RLTT method, by combining traditional methods and RL, can perform well under sparse information environments and multi-scale planning environments.

The remainder of this paper is organized as follows: Related works are introduced in Section 2; the dynamic model of a vehicle is provided in Section 3; TLM is presented in Section 4; the method of TPS is introduced in Section 5; the RLTT method for automated driving is introduced in detail in Section 6; related experiments are described in Section 7; the conclusion of the paper is given in Section 8.

2. Related Work

The planning for unmanned vehicles has attracted a lot of attention in the automated driving community [16], especially from the perspective of data-driven methods in complex environments which are unknown and uncertain.

For example, in [17], Bernhard and Knoll considered uncertain information using neural networks for automated vehicle planning. Nevertheless, they assumed all information about other vehicles to be known, and that this assumption might not be suitable for real environments. Zhang et al. [18] developed a novel bi-level actor–critic method for multi-agent coordination. Although their method achieved great success in making quick decisions regarding automated driving in highway merge environments, this method could not guarantee the safety of automated vehicles. Nick et al. [19] introduced an algorithm for clustering traffic scenarios, in which they combined convolutional neural networks and recurrent neural networks to predict traffic scenarios for the ego vehicle’s decision-making and planning. However, their method might be unstable for some extreme traffic scenarios because of the uncertain and imperfect information on some traffic situations, such as intersection environments. In [20], a safe motion planning for autonomous vehicles based on RL and Lyapunov functions was proposed, where the reward function was optimized with respect to agent safety, and the balance between reward and safety was analyzed by optimizing the probability of collision. Similarly, uncertain environments were also not considered.

Chen et al. [21] developed an end-to-end method for automated driving using RL and a sequential latent environment model, which is a semantic bird-eye mask; their methods were more interpretable than other machine learning methods to some extent. However, their method may still need to further consider sparse information environments and multiple constraints for automated driving. Tang et al. [22] introduced a motion planning method for automated driving using a soft actor-critic method [23], in which different strategies were balanced via the weights of safety, comfort, and efficiency. Zhu et al. [24] developed a motion planning method to consider the factor of pedestrian distraction based on a rule-based method and a learning-based method. Their experimental results demonstrated that the learning-based method may be better at handling the unsafe action problem at unsignalized mid-block crosswalks than the rule-based method; nonetheless, the learning-based method may generate unreasonable actions. Wen et al. [3] provided a safe RL method for autonomous driving based on constrained policy optimization, in which the risk return was considered as a hard constraint during the learning process, and a trust region constraint was used to optimize the policy. Although they achieved better performance than CPO [25] and PPO [26] in some scenarios, their method considered risk as an optimization objective based on neural network learning. This learning process might encounter violations of the safety constraint, which will not allow for safety to be achieved with high confidence. In contrast to their methods, our method is based on a geometry reachability analysis and vehicle dynamic verification, which can guarantee safety during automated driving.

Shai et al. [27] presented a sophisticated strategy based on RL which considers negotiation strategies when the ego vehicle drives in complex environments. Their method can be divided into two strategies: one strategy that can be learned, and another strategy associated with hard constraints that cannot be learned (e.g., safety constraints). Although they tried their best to ensure the safety of automated driving, their method still required an improvement in adaptability for more complex environments such as a complex intersection with multiple heterogeneous vehicles and pedestrians. Sarah M. Thornton [28] proposed a method that leveraged the Markov decision process (MDP) and dynamic programming to control the vehicle speed for safety, and this method considered uncertain pedestrians at a crosswalk. However, the method might require improvements in order to consider more uncertain environments, and the search space may need to be reduced for efficient planning. Codevilla et al. [29] used condition imitation learning for a high-level command input to achieve automated driving in simulation environments and for a 1/5-scale robotic truck in real environments. Although both experiments evaluated the effectiveness of their method, the method may need to address the need of automated vehicles for human guidance in sparse information environments.

Moreover, there are many alternative methods available for solving robot motion planning problems, such as a probabilistic Chekov method for robots that plans via chance constraints [30], a probabilistic collision constraint planning based on Gaussian probability distribution functions [31], artificial potential fields for motion planning [32], symptotically optimal planning for robots on the basis of a rapidly exploring random tree (RRT) [33], direct sampling for robot path planning derived from RRT [34], a fast marching method for path planning [35], etc.; although the above methods have shown impressive progress in robot motion planning research, the methods may need to be further developed for autonomous driving and consider the features of vehicles and autonomous driving environments.

In this paper, the proposed RLTT is different from the above-mentioned methods because it can achieve safe and efficient automated driving in sparse information environments while considering multi-constraint and multi-scale motion planning.

In RLTT, TLM is developed on the basis of the trajectory unit, which was first proposed for unmanned surface vehicles (USVs) [36,37,38]. However, USVs are different from automated vehicles. The freedom of control, navigation environments, and vehicle shapes are different [15,39]. The trajectory unit may not be suitable for automated vehicles, therefore developing TLM for automated vehicles is necessary. Moreover, TLM is different from a lattice planner[40] because a lattice planner is achieved using sample and fit data, which may require a huge amount of time and more computing power. In contrast, TLM is achieved via a dynamic model of vehicles in advance. In addition, TPS is proposed to constrain the RL search space and provide routing points for RL navigation, which can improve RL efficiency. Finally, the hierarchy framework is proposed by integrating TPS, RL, and TLM to develop the RLTT method, which can make the proposed method and model more unified.

3. Problem Formulation

In this study, a constrained RL problem is considered for motion planning; in particular, uncertain constraints

F_{u n}

, dynamic constraints

F_{d y}

, safety constraints

F_{s a}

, and smooth constraints

F_{s m}

are considered.

3.1. Uncertain Constraints $F_{u n}$

For the convenience of description, we have provided the following definitions for (1):

O_{o b s t a c l e}

denotes the obstacles;

O_{s h a p e}

denotes the shape of the obstacles;

O_{p o s i t i o n}

denotes the position of the obstacles;

O_{p a r t}

denotes partial information about the obstacles’ positions and shapes.

Definition 1.

In uncertain environments, since information about an obstacle shape cannot be fully observed,

O_{o b s e r v e}

is used to denote partial information about the obstacles that the agent observes.

O_{s h a p e} \cup O_{p o s i t i o n} = O_{o b s t a c l e} \supseteq O_{p a r t} \supseteq O_{o b s e r v e} .

(1)

3.2. Dynamic Constraints $F_{d y}$

Dynamic constraints

F_{d y}

can be defined as the dynamic model of vehicles, which is briefly shown in Figure 2. In this paper, a kinematics model, steer command, and heading angle are considered. The kinematics model is briefly introduced in this study [41,42]. In general, the steer command range is set to approximately

- 30 ° \sim 30 °

, or to

- 40 ° \sim 40 °

for other settings, and the setting of the steering command of our method is suitable for and can be applied to real vehicle experiments.

The driving speed at the axle center of the rear axle is

v_{r}

.

(X_{r}, Y_{r})

represents the coordinates at the axis of the rear axle, and

φ

represents the heading angle.

v_{r} = {\dot{X}}_{r} c o s φ + {\dot{Y}}_{r} s i n φ .

(2)

The kinematic constraints of the front and rear axles are as follows:

δ_{f}

represents the front wheel steering angle, which is similar to the steering command, and it is used to name the steering command here.

\{\begin{matrix} {\dot{X}}_{f} s i n (φ + δ_{f}) - {\dot{Y}}_{f} c o s (φ + δ_{f}) = 0 \\ {\dot{X}}_{r} s i n φ - {\dot{Y}}_{r} c o s φ = 0, \end{matrix}

(3)

where l indicates the distance between the rear axle and the front axle (wheelbase), w represents the yaw rate, and R denotes the turning radius of the vehicle. The geometric relationship between the front and rear wheels is as follows:

\{\begin{matrix} X_{f} = X_{r} + l c o s φ \\ Y_{f} = Y_{r} + l s i n φ, \end{matrix}

(4)

\{\begin{matrix} R = v_{r} / w \\ δ_{f} = a r c t a n (l / R) . \end{matrix}

(5)

According to the analysis above, the kinematic constraints can be briefly summarized in Equation (6):

[\begin{matrix} {\dot{X}}_{r} \\ {\dot{Y}}_{r} \\ \dot{φ} \end{matrix}] = [\begin{matrix} c o s φ \\ s i n φ \\ 0 \end{matrix}] * v_{r} + [\begin{matrix} 0 \\ 0 \\ 1 \end{matrix}] * w .

(6)

3.3. Safety Constraints $F_{s a}$

The safety distance is

D_{s a f e t y}

, and the distance between the point of motion planning

P_{p o i n t}

and obstacles

O_{o b s t a c l e}

is

D_{p}

.

Definition 2.

The safety constraints (

F_{s a}

) are represented by Equation (7):

D_{s a f e t y} \leq D_{p} .

(7)

3.4. Smooth Constraints $F_{s m}$

The motion planning points are represented by the set P,

\sum_{n = 1}^{N} p_{n} \subseteq P

. The two adjacent points are

p_{1}

and

p_{2}

. The two-steer angle difference of any two adjacent points is

δ_{n o r m a l}

. The limitation of the two-steer angle difference of any two adjacent points is

δ_{l i m i t a t i o n}

, where

δ_{l i m i t a t i o n}

is set to smoothen the steer angle.

Definition 3.

The smooth constraints

F_{s m}

can be represented by Equation (8):

δ_{n o r m a l} \leq δ_{l i m i t a t i o n} .

(8)

In conclusion, we need to find a proximal optimal motion planning set

P_{o p}

for vehicles that can also simultaneously satisfy the constraints

F_{u n}, F_{d y}, F_{s a}

, and

F_{s m}

.

Definition 4.

The objective needs to satisfy the following constraints, which can be formulated as follows:

P_{o p} \subseteq (F_{u n} \cap F_{d y} \cap F_{s a} \cap F_{s m}) .

(9)

4. Trajectory Lane Model

The trajectory lane model (TLM) can be considered as a bridge that connects the routing planning and the dynamic constraints of a vehicle, thus allowing the motion planning to become closer to actual environments, as shown in Algorithm 1. TLM is constructed according to the dynamic model of the vehicle, and relative TLM rules are introduced to achieve effective and precise motion planning.

Algorithm 1: Generating Trajectory Lane.

4.1. Rules Design

Based on the analysis above, the following rules have been designed:

Rule one: The trajectories of each of the actions are equal in length. This is for the continuous, regularized, and easily spliced trajectories.
Rule two: Each trajectory has only one steer command, except for the start and end steer commands; this is for the smooth steer. The relative angle (steering angle) between two adjacent points is not greater than one degree; this is for generating a smooth trajectory.
Rule three: The trajectory of each action has the same speed; this is for those with equal length. At the start and end of the trajectory, the condition is the same.

4.2. Analysis of TLM

According to the rules above and the dynamic model of a vehicle, TLM can be constructed. The introduction of the construction of the TLM m can be found in Algorithm 1, in which several types of TLMs are constructed on the basis of the constraints of the action and the state space.

(1) TLM—type one: This is the simplest type, which only has eight direction actions, and it can be seen in Figure 3a. This kind of TLM does not have sufficient actions to achieve motion planning, and the model can not achieve curve turning.

(2) TLM—type three: This type of TLM, with 16 direction actions, has more actions than the first type, which has eight direction actions (as shown in Figure 3b). It can comprise the second type, including four direction-action positive semicircles and four direction-action negative semicircles (as shown in Figure 4b). This type of TLM still does not have sufficient actions to achieve motion planning, and sometimes this type of TLM for motion planning is locally optimal.

(3) TLM—type four: This type of TLM, with 24 direction actions, has more actions than the second type and can comprise the second type and 8 direction-action trajectories near the maximum rudder angle along the x and y axes; this can be seen in Figure 4b. This type of TLM is similar to the third type, and sometimes the type of TLM for motion planning is also locally optimal.

(4) TLM—type five: This type of TLM, with 40 direction actions, has more actions than the third type and can comprise the third type and 8 direction-action smooth semicircles (Figure 5a); this can be seen in Figure 5b. To some extent, this type of TLM has sufficient actions to achieve motion planning and can achieve proximal optimal motion planning. Figure 6 shows the 40 direction-action TLM with one example of a smooth circle. A theoretical analysis is provided in Lemma 1, which can prove that type four is the proximal optimal trajectory; in Theorem 1, completeness of trajectory space is proved.

Lemma 1.

The proximal optimal type of TLM for motion planning can be considered to be TLM’s type five, where the TLM has 40 search directions for motion planning.

Proof.

Taking type one of TLM as an example, it can be transformed into the optimal distance problem.

A trajectory is

T = {p_{1}, p_{2}, \dots, p_{i}, \dots, p_{n}}

, a point i is

p_{i} = {t_{i}, x_{i}, y_{i}, δ_{i}, v_{i}, a_{i}}

, the distance between any two adjacent points is

d_{i, i + 1} = \sqrt{(x_{i + 1} - x_{i}) + (y_{i + 1} - y_{i})}

, the set of trajectories in the space is expressed as

T = {T^{1}, T^{2}, \dots, T^{j}, \dots, T^{m}}

.

The trajectory

T_{j}

can be presented as

y_{i}^{j} = A^{j} {(x_{i}^{j})}^{2} + B^{j} x_{i}^{j} + C^{j}

,

A^{j}

,

B^{j}

, and

C^{j}

are the correlation coefficients. According to the law of a triangle, the sum of two sides is greater than the third side. j denotes the trajectory type,

j_{h}

denotes the h trajectory of the j trajectory type,

i^{j_{h}}

denotes the i point of the h trajectory of the j trajectory type.

\begin{matrix} d_{i^{j_{0}}, i^{j_{0}} + 1}^{j_{0}} & = \sqrt{{(x_{i^{j_{0}} + 1}^{j_{0}} - x_{i^{j_{0}}}^{j_{0}})}^{2} + {(y_{i^{j_{0}} + 1}^{j_{0}} - y_{i^{j_{0}}}^{j_{0}})}^{2}} \\ = \sqrt{{(x_{i^{j_{0}} + 1}^{j_{0}} - x_{i^{j_{0}}}^{j_{0}})}^{2} + {[(A^{j_{0}} {(x_{i^{j_{0}} + 1})}^{2} + B^{j_{0}} x_{i^{j_{0}} + 1}^{j_{0}} + C^{j_{0}}) - (A^{j_{0}} {(x_{i^{j_{0}}}^{j_{0}})}^{2} + B^{j_{0}} x_{i^{j_{0}}} + C^{j_{0}})]}^{2}} \\ \forall 0 \leq i^{j_{0}} \leq n^{j_{0}}, 0 \leq j^{h} \leq m^{j_{h}}, \end{matrix}

(10)

\begin{matrix} d_{i^{j_{1}}, i^{j_{1}} + 1}^{j_{1}} & = \sqrt{{(x_{i^{j_{1}} + 1}^{j_{1}} - x_{i^{j_{1}}}^{j_{1}})}^{2} + {(y_{i^{j_{1}} + 1}^{j_{1}} - y_{i^{j_{1}}}^{j_{1}})}^{2}} \\ = \sqrt{{(x_{i^{j_{1}} + 1}^{j_{1}} - x_{i^{j_{1}}}^{j_{1}})}^{2} + {[(A^{j_{1}} {(x_{i^{j_{1}} + 1}^{j_{1}})}^{2} + B^{j_{1}} x_{i^{j_{1}} + 1}^{j_{1}} + C^{j_{1}}) - (A^{j_{1}} {(x_{i^{j_{1}}}^{j_{1}})}^{2} + B^{j_{1}} x_{i^{j_{1}}}^{j_{1}} + C^{j_{1}})]}^{2}} \\ \forall 0 \leq i^{j_{1}} \leq n^{j_{1}}, 0 \leq j^{h} \leq m^{j_{h}}, \end{matrix}

(11)

According to the law of a triangle, the sum of two sides is greater than the third side.

\begin{matrix} d_{i^{j_{0}}, i^{j_{0}} + 1, i^{j_{1}}, i^{j_{1}} + 1}^{j_{0, 1}} & = \sqrt{{(x_{i^{j_{1}} + 1}^{j_{1}} - x_{i^{j_{0}}}^{j_{0}})}^{2} + {(y_{i^{j_{1}} + 1}^{j_{1}} - y_{i^{j_{0}}}^{j_{0}})}^{2}} \\ = \sqrt{{(x_{i^{j_{1}} + 1}^{j_{1}} - x_{i^{j_{0}}}^{j_{0}})}^{2} + {[(A^{j_{1}} {(x_{i^{j_{1}} + 1}^{j_{1}})}^{2} + B^{j_{1}} x_{i^{j_{1}} + 1}^{j_{1}} + C^{j_{1}}) - (A^{j_{0}} {(x_{i^{j_{0}}}^{j_{0}})}^{2} + B^{j_{0}} x_{i^{j_{0}}}^{j_{0}} + C^{j_{0}})]}^{2}} \\ \forall 0 \leq i^{j_{0}} \leq n^{j_{0}}, 0 \leq i^{j_{1}} \leq n^{j_{1}}, 0 \leq j^{h} \leq m^{j_{h}} . \end{matrix}

(12)

\begin{matrix} d_{i^{j_{0}}, i^{j_{0}} + 1, i^{j_{1}}, i^{j_{1}} + 1}^{j_{0, 1}} < d_{i^{j_{1}}, i^{j_{1}} + 1}^{j_{1}} + d_{i^{j_{0}}, i^{j_{0}} + 1}^{j_{0}}, \end{matrix}

(13)

\begin{matrix} D^{j_{0}} = \sum_{i}^{n^{j_{0}}} d_{i^{j_{0}}, i^{j_{0}} + 1}^{j_{0}}, \end{matrix}

(14)

\begin{matrix} D^{j_{1}} = \sum_{i}^{n^{j_{1}}} d_{i^{j_{1}}, i^{j_{1}} + 1}^{j_{1}}, \end{matrix}

(15)

\begin{matrix} D^{j_{0, 1}} = \sum_{i}^{n^{j_{0, 1}}} d_{i^{j_{0, 1}}, i^{j_{0, 1}} + 1}^{j_{0, 1}}, \end{matrix}

(16)

\begin{matrix} D^{j_{0, 1}} < D^{j_{0}} + D^{j_{1}} . \end{matrix}

(17)

In addition, this is similar to other types of TLMs and curve constraints (or steering smooth constraints), and the fifth type of TLM can be proven to be the proximal optimal TLM for motion planning. With an excessive number of trajectories, the search complexity will be large, and the fifth type of TLM will be sufficient for the trajectory probability. Completeness of trajectory space is subsequently proven.

Based on the Minkowski inequality and its sum [43], we can observe that

D^{j_{0, 1}}

is the proximal optimal trajectory, and the fifth type is the proximal optimal TLM. □

Theorem 1.

Completeness of trajectory space: any target point in space

R^{n}

can be reached using TLM. The number of optimal trajectory circles to cover the convex hull

P^{'}

is

N^{{\frac{d}{\sqrt{β_{i}}}}^{2}}

, in which the number of vertices of the convex hull

P^{'}

is N, the diameter of a convex hull

P^{'}

is d, and the radius of any circle of the proximal optimal TLM is

β_{i}

.

Proof.

It can be transformed into the completeness of the quadratic equation problem.

\sum_{i = 1}^{m} λ_{i} z_{i} where λ_{i} \geq 0 and \sum_{i = 1}^{m} λ_{i} = 1,

(18)

\begin{matrix} conv (P^{'}) : = \{convex combinations of z_{1}, \dots, z_{m} \in P^{'} for m \in N\} . \end{matrix}

(19)

Based on Caratheodory’s theorem [44], we can observe that for any point

p_{i}

in a convex hull

P^{'}

,

p_{i}

can be covered with a set of less than

n_{d}

points in the convex hull

P^{'}

,

n_{d}

is the dimension of the convex hull

P^{'}

, and this will depend on the dimension of the convex hull

P^{'}

,

P^{'} \in R^{n}

, whose proof can be seen in reference [44]. Derived from Caratheodory’s theorem, an approximate Caratheodory’s theorem is proposed [45], which does not depend on the dimension of the convex hull

P^{'}

. We prove the space completeness of the trajectories of the TLM based on the approximate Caratheodory’s theorem.

x_{i}

(

i \subseteq N

) is any point that belongs to a convex set

T^{'}

, with the set being bounded by a circle (its diameter is within 1) and an integer k. The proximal optimal TLM, which can be regarded as three different circles, is shown as Figure 7a; these three circles can cover any reachable areas under vehicle dynamic constraints, implying that the proximal optimal TLM can achieve the completeness of trajectory space for motion planning.

Next, we will prove the completeness of the trajectory space by using any circle from our proximal optimal TLM; this is shown in Figure 7b. The set of any convex hull

P^{'}

is

T^{'}

, the number of vertices of the convex hull

P^{'}

is N, the diameter of a convex hull

P^{'}

is d, and the radius of any circle of the proximal optimal TLM is

β_{i}

.

λ_{i}

is the probability of point

p_{i}

, and x is a point within the reachable areas.

\begin{matrix} P \{p = p_{i}\} = λ_{i}, i = 1, \dots, m, \end{matrix}

(20)

E p = \sum_{i = 1}^{m} λ_{i} p_{i} = x .

(21)

Based on the strong law of large numbers, we can obtain the following equation:

\begin{matrix} \frac{1}{k} \sum_{j = 1}^{k} p_{j} \to x . \end{matrix}

(22)

According to the weak law of large numbers, we can compute the variance of

\frac{1}{k} \sum_{j = 1}^{k} p_{j}

,

\begin{matrix} E {∥p_{j} - x∥}_{2}^{2} & = {E ∥ p - E p ∥}_{2}^{2} \\ \leq {E ∥ p ∥}_{2}^{2} \leq d, \end{matrix}

(23)

moreover, we can observe that

\begin{matrix} E {∥x - \frac{1}{k} \sum_{j = 1}^{k} p_{j}∥}_{2}^{2} & = \frac{1}{k^{2}} E {∥\sum_{j = 1}^{k} (p_{j} - x)∥}_{2}^{2} \\ = \frac{1}{k^{2}} \sum_{j = 1}^{k} E {∥p_{j} - x∥}_{2}^{2} . \end{matrix}

(24)

Therefore,

E {∥x - \frac{1}{k} \sum_{j = 1}^{k} p_{j}∥}_{2}^{2} \leq \frac{d}{k} .

(25)

For the random variables, we can have

{∥x - \frac{1}{k} \sum_{j = 1}^{k} p_{j}∥}_{2}^{2} \leq \frac{d}{k} .

(26)

Thus, for any radius

β_{i}

of any TLM circle that satisfies the vehicle constraints, we can observe that

{∥x - \frac{1}{k} \sum_{j = 1}^{k} x_{j}∥}_{2} \leq \frac{d}{\sqrt{k}} \leq β_{i},

(27)

which is derived from [44], and there are

N^{k}

ways to choose k with repetition to cover the cardinality of N. Thus,

N \leq N^{k} = N^{{(\frac{d}{β_{i}})}^{2}} .

(28)

The number of TLM circles to cover reachable areas is

N^{{(\frac{d}{β_{i}})}^{2}}

, implying that we can use the

N^{{(\frac{d}{β_{i}})}^{2}}

TLM circles to cover any convex hull

P^{'}

; the area is also reachable by using the proximal optimal trajectory circle, which finishes the proof. □

5. Topological Path Search for Automated Driving

For a large-scale and efficient path search, a topological map meant for high path search efficiency is constructed [46,47]. During the path searching process, the open geospatial consortium (OGC) is used to make a topological map [48]. OGC defines a simple feature model that is suitable for efficiently storing and accessing geographic features in relational databases.

A spatial map comprises nodes, lines, and a surface, which are three elements for constructing a map: a node is a point that does not have spatial features such as shape and size; a line comprises a series of nodes, whose spatial features include shape, length, etc.; the surface comprises closed rings, whose spatial features include shape, area, etc. In TPS, a node denotes the vehicle, a line denotes the routing path, a surface denotes the obstacles.

Spatial relationship refers to the spatial characteristic relationship between geographical entities, which is the basis of spatial data organization, query, analysis, and reasoning. It includes topological, metric, and direction relationships. Numerous topological relation expression models have been proposed to represent spatial relationships. In this study, the nine-cross model [48] is used to represent the space topological relationship used for the construction of a topological map. This is introduced through Function (29), where A and B represent two different objects,

(ð)

denotes the edge of an object,

(°)

denotes the internal part of an object,

(^{-})

denotes the external part of an object, and

(⋂)

denotes the intersection of two objects.

Based on Function (29), the topological relation predicates are defined, which are Crosses, Disjoint, Within, Contains, Equals, Touches, Intersects, and Overlaps. In this paper, we leverage Intersects, Disjoint, Touches, and Contains to analyze the topological relationship shown in Figure 8. For example,

[F F * F F * * * *]

denotes the Disjoint;

[F T * * * * * * *]

,

[F * * T * * * * *]

, and

[F * * * T * * * *]

denote Touch;

[T * F * * F * * *]

and

[T * * * * * F F *]

denote Contains/Within;

[T * * * * * * * *]

,

[* T * * * * * * *]

,

[* * * T * * * * *]

, and

[* * * * T * * * *]

denote Intersects via the nine-cross model. For two objects in space, a and b, the intersection value

(ϕ)

is denoted by F (False), the value

- (ϕ)

is denoted by T (True); the intersection is represented by 0 when the point is the intersection, by 1 when the line is the intersection, and by 2 when the area is the intersection. * means that f, 0, 1, or 2 can be selected. More specifically, a. Disjoint b means two geometric bodies (a and b) have no common intersection, thus forming a set of disconnected geometric shapes.

After constructing the topological map, the Dijkstra algorithm [49] is used to achieve an optimal routing planning, and the Euclidean metric is used to measure the distance between two points, which is provided by Function (30). The algorithm is described in Algorithm 2, in which Function (29) is used to judge the topological relationship between obstacles

O_{n}

, the start point S, and the end point E; Function (29) is used to judge the topological position relationship between the path point and each point

O_{i j}

of the ith obstacle

O_{i}

.

In addition, the safety buffer is set according to topological relationships. Moreover, the routing path can be achieved by the Dijsktra algorithm using Function (30) based on the constructed topological map; the Dijskra algorithm can achieve a more optimal path than other heuristic algorithms, e.g., the A* algorithm, even though it requires more computing time. However, in this study, the Dijskra algorithm based on TPS could quickly find the optimal path and satisfy motion planning constraints.

(\begin{matrix} A ° ⋂ B ° & A ° ⋂ ð B & A ° ⋂ B^{-} \\ ð A ⋂ B ° & ð A ⋂ ð B & ð A ⋂ B^{-} \\ A^{-} ⋂ B ° & A^{-} ⋂ ð B & A^{-} ⋂ B^{-} \end{matrix}),

(29)

d (u, v) = \sqrt{{(x_{u} - x_{v})}^{2} + {(y_{u} - y_{v})}^{2}} .

(30)

According to the size of the vehicle maneuver, the size of the safety buffer is set at 8 (m), which indicates that the distance between the path and the obstacles is always 8 (m), thus making the route planning safe and suitable for motion planning.

Algorithm 2: The method of TPS for automated driving.

6. A Motion Planning Method for Automated Driving

TLM and TPS were constructed in the previous sections. In this section, we will discuss how to achieve motion planning for automated driving. RL is briefly introduced [8], the RLTT method for automated driving is then presented in detail, and multiple constraints are considered in the RLTT method.

TPS is introduced for large and macro-scale motion planning, in which both driving safety and optimal path planning are achieved via a safety buffer, a topological map, and the Dijkstra algorithm. A representation of TPS constraints is given in Figure 9a. For micro-scale motion planning, in which the environment contains uncertain and imperfect information, building a model and learning experiences via a data-driven method is easier than traditional methods; in contrast, when the environment is uncertain, the environment is difficult to model via traditional methods. Figure 9b shows the representation of constraints through RL. TLM can be used for dynamic constraints and smooth constraints, in which the kinematic model of vehicles and steer constraints as well as the relative angle between any two adjacent points are also limited; the representation of TLM is given in Figure 9c. Considering this type of framework and environment is useful because the perfect environment information cannot be obtained in a real environment even if advanced sensors are used. The proposed method could aid in the advancement of automated driving.

6.1. RL Methods

RL can be typically regarded as MDP, and MDP is presented by a five-object tuple

(S, A, P, R, γ)

, where S denotes the state space, A denotes the action space, P shows the probability of transition, R represents the reward once one action is performed, and

γ

denotes the discounted factor of the reward. In this study, RL is regarded as the partial observation MDP, where RL makes an agent interact with the environments through trial-and-error learning in order to obtain more data about the environments, and the agent can then learn better about the environments as well as perform better. In particular, RL can deal with uncertainty problems better than traditional methods. Notably, the most likely action can be selected in the future environment on the basis of previous exploration. It can reduce the uncertainty about the environment, which means that more knowledge can be acquired about the environment, and more certain information can be obtained for future decision making.

6.2. A Motion Planning Method

First, the TLM and TPS are introduced into the framework, and Q learning is developed based on the TLM and TPS, providing the constrained RL for safe motion planning.

Second, the transition function f of Q learning is integrated into TLM and TPS. More specifically, for the transition function f, the input is action a, the current position is

P_{c}

, the angle is A, and the terminal is

P_{T}

; the output is the next position

P_{n}

, the reward is r, done is T, and the next angle is

A_{n}

. The transition process is as follows: If the current position

P_{c}

has collided with an obstacle

O_{o b s t a c l e}

, then return the state s, reward r, done, and angle A; if the current position

P_{c}

has reached the terminal

P_{T}

, then return state s, reward r, done, and angle A; at the same time, if angle A and action a are equal to angle

A_{t}

and action

a_{t}

of the trajectory, respectively, we can obtain the next position

P_{n}

← current position

P_{c}

, next angle

A_{n}

← angle A, and obtain the related reward r. If the next position

P_{n}

has collided with an obstacle

O_{o b s t a c l e}

, then return the state s, reward r, done T, and angle A; if the next position

P_{n}

has reached the terminal

P_{T}

, then return the state s, reward r, done T, and angle A. Iteratively proceed according to this logic, obtain the related path information, and return each position’s next position

P_{n}

, reward r, done T, and next angle

A_{n}

for further motion planning.

Third, the algorithm of Q learning is presented for the path value, which can be seen in Algorithm 3. Finally, according to the distance of the safety buffer, select the routing points and call Algorithm 4 for the automated driving motion planning.

Algorithm 3: Q learning method for the path value.

Algorithm 4: Motion planning for automated driving via RLTT.

6.3. Motion Planning Examples

A motion planning example of automated driving via RLTT is illustrated in this section. The problem of motion planning in certain and perfect information environments may be easy to solve using the traditional methods such as the sample-based method. However, if uncertain environments and vehicle constraints are considered, the traditional methods may not perform well. For example, in Figure 10a, the red circle point represents the start point, the green circle point represents the end point, and the green arrow represents the dynamic constraints, including the heading constraints and steer constraints, among others. When the environment is

65 %

certain and

35 %

uncertain, for example,

35 %

of the obstacles expand randomly upward or downward by

35 %

. More specifically, the position and shape of 65% of the obstacles are certain, and the position and shape of 35% of the obstacles are uncertain. When the vehicle is navigating, 35% of the obstacles change at random. The RLTT method can perform well in the environments because it can efficiently and rapidly explore the unknown world in micro-scale motion planning (blue box (from the S point to the M point) and purple box (from M point to E point)). More specifically, it leverages TPS for large and macro-scale routing planning (red box (represented by the black arrow from the S point to the E point)) and TLM for micro and precise motion planning (deep pink line). An example is given in Figure 10b, with the motion planning trajectory represented by the green arrow from the S point to the E point. Furthermore, we provide the complexity analysis of RLTT in Remark 1.

Remark 1.

The proximal optimal search time for motion planning with a constrained RL is as follows:

The search time for TPS is

O (M^{2})

, where M represents the number of obstacle points; for the RL search time, the complexity is

O (T)

, where T is the total number of steps [50].

More particularly, the convergence of our method can be guaranteed based on the Banach fixed-point theorem [51]; for the detailed convergence proof of the constrained RL, see [52].

7. Experiments

Several experiments were conducted in the same environments, and the results are analyzed here in detail. First, different RL methods for path search were tested. Second, the RLTT method for motion planning was achieved for automated driving. Third, comparison experiments in certain and uncertain environments have been carried out. In addition, comparison experiments between RLTT and RL algorithms were provided. Fourth, comparison experiments between RLTT and traditional algorithms were conducted to evaluate the effectiveness of our proposed methods. Finally, RLTT for automated driving motion planning in uncertain corridor environments was carried out to demonstrate our method’s applicability.

Moreover, all experiments were run on an Ubuntu 18.04 system (Dell PC), in which the CPU was an Intel Core i7-9750H CPU 2.60 GHz × 12, the GPU was an Intel UHD Graphics 630 (CFL GT2), the Memory was 31.2 GB; all comparison experiments were conducted under the same environment conditions, and the computation time in the experiments included training time and test time. The number of initial iteration steps was 10,000, the learning rate was

0.1

, the epsilon value of the maximizing optimal value was

0.9

. For the reward settings, we set them on the basis of the safety cost, distance cost, and steering cost of the vehicle. If the vehicle collided with obstacles, the reward was

- 10

; if the vehicle found the target position with the correct attitude, the reward was 10; the reward for each step the vehicle took along the straight line was

- 0.1

; for each step the vehicle took along the diagonal direction, the reward was

- 0.14

; for each step the vehicle took along the oblique curve, the reward was

- 0.157

; the reward was

- 0.16

for every turn in the steering action; if the vehicle took each step along the diagonal and straight direction, the reward was

- 0.196

.

7.1. Different RL Methods for Path Search

Related RL methods for conducting the path planning for automated driving have been performed, which can be seen in Figure 11. The figure shows that the dynamic programming (DP) value and policy iterations [8] achieved fewer steps, implying that the two algorithms can obtain the shortest path for automated driving. Nonetheless, DP value and policy iterations may be unsuitable for automated driving in uncertain and unknown environments because the probability and reward transition might not be known, and the above two methods typically need information regarding the model of environments.

In addition, the Monte Carlo (MC) RL method [8] achieved almost the same number of steps as the Q learning [53] for automated driving. However, the MC method sometimes falls into a deadlock during exploration and cannot obtain the desired path due to the complex environments. The Sarsa method [54] obtains more steps than Q learning because of its conservative strategy. Because RLTT is a multi-scale method and the safety buffer constraints of the first level via TPS have been further considered, it is better to develop Q learning and integrate it into the RLTT framework for automated driving. In the next section, the experiments of the RLTT method for automated driving are introduced and discussed.

7.2. A RLTT Method for Motion Planning

7.2.1. Macro-Scale Motion Planning

In this section, macro-scale planning (generally, the planning area can be set from approximately 100 to

2000

(m)) was accomplished via TPS (Figure 12a,b), which represents the convex obstacle and concave-convex obstacle environments, respectively, in which the macro-scale planning is represented by the deep pink line and the obstacles are represented by the red rectangles. The start point and end point are S and

S_{3}

in Figure 12a and

S^{'}

and

S_{3}^{'}

in Figure 12b, respectively. The safety buffer was set at 8 m, as shown in Figure 12a,b (denoted by B and the yellow arrow).

7.2.2. Micro-Scale Motion Planning

The micro-scale motion planning (generally, the planning area can be set at approximately 100 m) was achieved, as shown in Figure 13 and Figure 14, where the environment information is unknown. The start point is the S point and the end point is the E point. The obstacle is denoted by O. RL was constrained by TPS and TLM, and it was used to dynamically explore the suitable point from the S to the E points. TLM was used to splice the RL points, where the smooth and dynamic constraints were considered; the two bottom obstacles randomly changed within a certain range. In particular, obstacles

O_{3}

and

O_{4}

randomly changed within the

O_{31}

and

O_{41}

ranges (as shown in the light blue rectangles); therefore, the motion planning is different in Figure 13 and Figure 14.

7.3. Comparison Experiments in Certain and Uncertain Environments

Related experiments were conducted in different environments, which were either certain (known environment information) or uncertain environments (unknown environment information), as shown in Figure 15. The environments were the same except for the random obstacles. The average distance of five uncertain experiments was

54.87

(m), and the average distance of five certain experiments was

71.98

(m). The experiments showed that the RLTT method could deal with the uncertain environments well.

7.4. Comparison Experiments between RLTT and Q Learning Algorithms

We conducted five experiments using the RLTT method and the Q learning algorithm in unknown environments. The Q learning algorithm is a classic and commonly used algorithm for vehicle navigation. The computation time of each experiment and the averaged computation time of the five experiments using the RLTT method and the Q learning algorithm are shown in Figure 16 and Figure 17, respectively.

More specifically, the averaged computation times for the RLTT method and Q learning for motion planning were 0.325 (s) and 52.279 (s), respectively. In addition, the trajectory distance when using Q learning is usually greater than that when the RLTT method is used. We randomly selected one of the five experiments and computed the trajectory distance. The distance when using the RLTT method was 95.834 (m), whereas that when using the Q learning algorithm was 126.735 (m). The results indicate that the performance of the RLTT method was better than the classic RL algorithm, and the RLTT method achieved a shorter trajectory distance and used less time for the path search as compared to the classic RL algorithm.

7.5. Comparison Experiments between RLTT and Traditional Algorithms

This section shows the experiments comparing the RLTT and traditional algorithms for automated driving motion planning. The adopted traditional algorithm is a hybrid A* algorithm [55], which was proposed by Dolgov et al. It was developed based on the A* algorithm [39] by leveraging continuous coordinates to guarantee the kinematic feasibility of the trajectories and the conjugate gradient descent in order to improve the quality of the trajectories on the geometric workspace. The hybrid A* algorithm has shown better performance than the rapidly exploring random trees algorithms [56,57] and the parameterized curve algorithms [58] in terms of fast global convergence and optimal global values. It is very useful and widely used for automated driving since the hybrid A* algorithm can be smooth and practical for autonomous driving in real-world and unknown environments [55].

The distance of motion planning using the RLTT algorithm was 147.4 (m), which considered dynamic constraints under the environments with sparse information; on the other hand, in the same environments, the distance of motion planning using the hybrid A* algorithm while considering dynamic constraints was 184.0 (m). The experimental results indicate that our proposed method outperforms the hybrid A* algorithm in terms of distance.

7.6. Uncertain Corridor Scenarios

We referred to the experimental settings of reference [59] to carry out an experiment on uncertain corridor scenarios. Figure 18a shows real traffic environments: uncertain corridor scenarios. Figure 18b shows motion planning via RLTT for an automated vehicle in real and uncertain traffic scenarios. The red areas denote uncertain objects, where there is a corridor, some vehicles, and pedestrians crossing the corridor uncertainly. The vehicles on the main road (as indicated by the light blue arrow) can safely cross the uncertain corridor scenarios; the distance of motion planning was 61.8 (m).

8. Conclusions

In this study, a constrained RL method along with a theoretical analysis were developed based on the TLM and TPS methods in order to achieve a multi-constraint and multi-scale safe motion planning for automated driving in sparse information environments. The dynamic constraints of a vehicle as well as the smooth constraints, safety constraints, and distance optimization constraints were taken into account. In addition, related experiments were conducted to evaluate the effectiveness of our proposed method. Experimental results indicate that the proposed method is extendable and can be applied to other types of vehicle navigation and control, such as ground robots for parking maneuvers and logistics environments. We hope that our RLTT method inspires new research in robotics development. In the future, we plan to carry out more experiments and aim to improve the proposed method by performing more complex tasks.

Author Contributions

Methodology, S.G.; formal analysis, S.G.; investigation, S.G.; resources, S.G.; data curation, S.G.; writing—original draft preparation, S.G.; writing—review and editing, J.H., Y.H., and G.C.; visualization, S.G.; supervision, G.C. and A.K.; project administration, L.Z. and G.C.; funding acquisition, G.C. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) within the framework of the Open Access Publishing Program. This work was partially supported by the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 945539 (Human Brain Project SGA3), and by the National Natural Science Foundation of China (No. 61906138).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated during this study are available from the corresponding author upon reasonable request.

Acknowledgments

We would like to thank Jiancheng Long for his very constructive suggestions; we would also like to thank Xiao Wang for the fruitful discussions.

Conflicts of Interest

The authors declare no conflict of interest.

References

Ye, Y.; Zhang, X.; Sun, J. Automated vehicle’s behavior decision making using deep reinforcement learning and high-fidelity simulation environment. Transp. Res. Part C Emerg. Technol. 2019, 107, 155–170. [Google Scholar] [CrossRef]
Chen, G.; Chen, K.; Zhang, L.; Zhang, L.; Knoll, A. VCANet: Vanishing-Point-Guided Context-Aware Network for Small Road Object Detection. Automot. Innov. 2021, 4, 400–412. [Google Scholar] [CrossRef]
Wen, L.; Duan, J.; Li, S.E.; Xu, S.; Peng, H. Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–7. [Google Scholar]
Min, H.; Xiong, X.; Wang, P.; Yu, Y. Autonomous driving path planning algorithm based on improved A* algorithm in unstructured environment. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2021, 235, 513–526. [Google Scholar] [CrossRef]
Williams, G.; Drews, P.; Goldfain, B.; Rehg, J.M.; Theodorou, E.A. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Trans. Robot. 2018, 34, 1603–1622. [Google Scholar] [CrossRef]
Likmeta, A.; Metelli, A.M.; Tirinzoni, A.; Giol, R.; Restelli, M.; Romano, D. Combining reinforcement learning with rule-based controllers for transparent and general decision-making in autonomous driving. Robot. Auton. Syst. 2020, 131, 103568. [Google Scholar] [CrossRef]
Hang, P.; Lv, C.; Huang, C.; Cai, J.; Hu, Z.; Xing, Y. An Integrated Framework of Decision Making and Motion Planning for Autonomous Vehicles Considering Social Behaviors. IEEE Trans. Veh. Technol. 2020, 69, 14458–14469. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Gu, S.; Kuba, J.G.; Wen, M.; Chen, R.; Wang, Z.; Tian, Z.; Wang, J.; Knoll, A.; Yang, Y. Multi-agent constrained policy optimisation. arXiv 2021, arXiv:2110.02793. [Google Scholar]
Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Yang, Y.; Knoll, A. A Review of Safe Reinforcement Learning: Methods, Theory and Applications. arXiv 2022, arXiv:2205.10330. [Google Scholar]
Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control. Robot. Auton. Syst. 2021, 5, 411–444. [Google Scholar] [CrossRef]
Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
Tamar, A.; Wu, Y.; Thomas, G.; Levine, S.; Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
Kaelbling, L.P. The foundation of efficient robot learning. Science 2020, 369, 915–916. [Google Scholar] [CrossRef]
Zhou, C.; Gu, S.; Wen, Y.; Du, Z.; Xiao, C.; Huang, L.; Zhu, M. The review unmanned surface vehicle path planning: Based on multi-modality constraint. Ocean. Eng. 2020, 200, 107043. [Google Scholar] [CrossRef]
Claussmann, L.; Revilloud, M.; Gruyer, D.; Glaser, S. A review of motion planning for highway autonomous driving. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1826–1848. [Google Scholar] [CrossRef]
Bernhard, J.; Knoll, A. Robust stochastic bayesian games for behavior space coverage. In Proceedings of the Robotics: Science and Systems (RSS), Workshop on Interaction and Decision-Making in Autonomous-Driving, Virtual Session, 12–13 July 2020. [Google Scholar]
Zhang, H.; Chen, W.; Huang, Z.; Li, M.; Yang, Y.; Zhang, W.; Wang, J. Bi-level actor-critic for multi-agent coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7325–7332. [Google Scholar]
Harmening, N.; Biloš, M.; Günnemann, S. Deep Representation Learning and Clustering of Traffic Scenarios. arXiv 2020, arXiv:2007.07740. [Google Scholar]
Zhang, L.; Zhang, R.; Wu, T.; Weng, R.; Han, M.; Zhao, Y. Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5435–5444. [Google Scholar] [CrossRef]
Chen, J.; Li, S.E.; Tomizuka, M. Interpretable End-to-End Urban Autonomous Driving With Latent Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5068–5078. [Google Scholar] [CrossRef]
Tang, X.; Huang, B.; Liu, T.; Lin, X. Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic. IEEE Trans. Veh. Technol. 2022, 71, 4706–4717. [Google Scholar] [CrossRef]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning Research, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
Zhu, H.; Han, T.; Alhajyaseen, W.K.; Iryo-Asano, M.; Nakamura, H. Can automated driving prevent crashes with distracted Pedestrians? An exploration of motion planning at unsignalized Mid-block crosswalks. Accid. Anal. Prev. 2022, 173, 106711. [Google Scholar] [CrossRef]
Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–1 August 2017; pp. 22–31. [Google Scholar]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv 2016, arXiv:1610.03295. [Google Scholar]
Thornton, S. Autonomous Vehicle Speed Control for Safe Navigation of Occluded Pedestrian Crosswalk. arXiv 2018, arXiv:1802.06314. [Google Scholar]
Codevilla, F.; Miiller, M.; López, A.; Koltun, V.; Dosovitskiy, A. End-to-end driving via conditional imitation learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1–9. [Google Scholar]
Dai, S.; Schaffert, S.; Jasour, A.; Hofmann, A.; Williams, B. Chance constrained motion planning for high-dimensional robots. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8805–8811. [Google Scholar]
Thomas, A.; Mastrogiovanni, F.; Baglietto, M. Probabilistic Collision Constraint for Motion Planning in Dynamic Environments. arXiv 2021, arXiv:2104.01659. [Google Scholar]
Mohanan, M.; Salgoankar, A. A survey of robotic motion planning in dynamic environments. Robot. Auton. Syst. 2018, 100, 171–185. [Google Scholar] [CrossRef]
Webb, D.J.; Van Den Berg, J. Kinodynamic RRT*: Asymptotically optimal motion planning for robots with linear dynamics. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013; pp. 5054–5061. [Google Scholar]
Gammell, J.D.; Srinivasa, S.S.; Barfoot, T.D. Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; pp. 2997–3004. [Google Scholar]
Janson, L.; Schmerling, E.; Clark, A.; Pavone, M. Fast marching tree: A fast marching sampling-based method for optimal motion planning in many dimensions. Int. J. Robot. Res. 2015, 34, 883–921. [Google Scholar] [CrossRef]
Du, Z.; Wen, Y.; Xiao, C.; Zhang, F.; Huang, L.; Zhou, C. Motion planning for unmanned surface vehicle based on trajectory unit. Ocean. Eng. 2018, 151, 46–56. [Google Scholar] [CrossRef]
Zhu, M.; Xiao, C.; Gu, S.; Du, Z.; Wen, Y. A Circle Grid-based Approach for Obstacle Avoidance Motion Planning of Unmanned Surface Vehicles. arXiv 2022, arXiv:2202.04494. [Google Scholar] [CrossRef]
Gu, S.; Zhou, C.; Wen, Y.; Xiao, C.; Knoll, A. Motion Planning for an Unmanned Surface Vehicle with Wind and Current Effects. J. Mar. Sci. Eng. 2022, 10, 420. [Google Scholar] [CrossRef]
Gu, S.; Zhou, C.; Wen, Y.; Zhong, X.; Zhu, M.; Xiao, C.; Du, Z. A motion planning method for unmanned surface vehicle in restricted waters. Proc. Inst. Mech. Eng. Part M J. Eng. Marit. Environ. 2020, 234, 332–345. [Google Scholar] [CrossRef]
McNaughton, M.; Urmson, C.; Dolan, J.M.; Lee, J.W. Motion planning for autonomous driving with a conformal spatiotemporal lattice. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 4889–4895. [Google Scholar]
Rajamani, R. Vehicle Dynamics and Control; Springer: New York, NY, USA, 2011. [Google Scholar]
Gong, J.; Jiang, Y.; Xu, W. Model Predictive Control For Self-Driving Vehicles; Beijing Institute of Technology Press: Beijing, China, 2014. [Google Scholar]
Gardner, R. The brunn-minkowski inequality. Bull. Am. Math. Soc. 2002, 39, 355–405. [Google Scholar] [CrossRef]
Meurant, G. Handbook of Convex Geometry; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge University Press: Cambridge, UK, 2018; Volume 47. [Google Scholar]
Zhou, C.; Gu, S.; Wen, Y.; Du, Z.; Xiao, C.; Huang, L.; Zhu, M. Motion planning for an unmanned surface vehicle based on topological position maps. Ocean. Eng. 2020, 198, 106798. [Google Scholar] [CrossRef]
Gu, S.; Zhou, C.; Wen, Y.; Xiao, C.; Du, Z.; Huang, L. Path Search of Unmanned Surface Vehicle Based on Topological Location. Navig. China 2019, 42, 52–58. [Google Scholar]
Herring, J.R. OpenGIS Implementation Specification for Geographic Information-Simple Feature Access—Part 1: Common Architecture; Open Geospatial Consortium: Wayland, MA, USA, 2006; p. 95. [Google Scholar]
Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271. [Google Scholar] [CrossRef]
Jin, C.; Allen-Zhu, Z.; Bubeck, S.; Jordan, M.I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 4868–4878. [Google Scholar]
Latif, A. Banach contraction principle and its generalizations. In Topics in Fixed Point Theory; Springer: Cham, Switzerland, 2014; pp. 33–64. [Google Scholar]
Melo, F.S. Convergence of Q-Learning: A Simple Proof; Tech. Rep.; Institute of Systems and Robotics: Lisbon, Portugal, 2001. [Google Scholar]
Greenwald, A.; Hall, K. Correlated-Q learning. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; pp. 242–249. [Google Scholar]
Zhao, D.; Wang, H.; Shao, K.; Zhu, Y. Deep reinforcement learning with experience replay based on SARSA. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; pp. 1–6. [Google Scholar]
Dolgov, D.; Thrun, S.; Montemerlo, M.; Diebel, J. Practical search techniques in path planning for autonomous driving. Ann. Arbor 2008, 1001, 18–80. [Google Scholar]
Kavraki, L.E.; Svestka, P.; Latombe, J.C.; Overmars, M.H. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 1996, 12, 566–580. [Google Scholar] [CrossRef]
Plaku, E.; Kavraki, L.E.; Vardi, M.Y. Discrete search leading continuous exploration for kinodynamic motion planning. In Robotics: Science and Systems III; MIT Press: Cambridge, MA, USA, 2007; pp. 326–333. [Google Scholar]
Cremean, L.B.; Foote, T.B.; Gillula, J.H.; Hines, G.H.; Kogan, D.; Kriechbaum, K.L.; Lamb, J.C.; Leibs, J.; Lindzey, L.; Rasmussen, C.E.; et al. Alice: An information-rich autonomous vehicle for high-speed desert navigation. J. Field Robot. 2006, 23, 777–810. [Google Scholar] [CrossRef]
Wang, C.; Li, F.; Wang, Y.; Wagner, J.R. Haptic Assistive Control with Learning-Based Driver Intent Recognition for Semi-Autonomous Vehicles. IEEE Trans. Intell. Veh. 2021. [Google Scholar] [CrossRef]

Figure 1. A framework of the RLTT method: a planning hierarchy framework of three levels for motion planning with data flow.

Figure 2. A briefly diagram of the vehicle dynamic constraints.

Figure 3. Straight-line trajectory lane and semicircle trajectory lane models. (a) Straight-line trajectory in 8 action directions. (b) Semicircle trajectory in 16 action directions.

Figure 4. Semicircle trajectory lane model. (a) Negative semicircles in 12 directions based on 8 actions; (b) A 24 direction-action TLM.

Figure 5. Semicircle trajectory lane model. (a) A 32 direction-action TLM. (b) A 40 direction-action TLM.

Figure 6. A 40 direction-action TLM, with one example of a smooth circle.

Figure 7. (a) Different circles of proximal optimal TLM that cover different points. (b) Covering areas that satisfy the vehicle’s dynamic constraints using the proximal optimal TLM.

Figure 8. Schematic diagram of the topological path reachability.

Figure 9. TPS, RL, and TLM for automated driving. (a) Representation of TPS constraints. (b) Representation of RL constraints. (c) Representation of TLM constraints.

Figure 10. Certain and perfect information environments for vehicles. (a) does not consider dynamic constraints, (b) considers dynamic constraints.

Figure 11. Different RL methods for a path search.

Figure 12. Macro-scale planning. (a) Convex obstacle environments. (b) Concave-convex obstacle environments.

Figure 13. Micro-scale planning and random obstacle (red rectangle) experiment one. (a) Grid experiment environments and RLTT method for planning (green rectangle), where the two bottom obstacle shapes randomly change within a certain range. (b) TLM motion planning (color curve) according to RLTT planning.

Figure 14. Micro-scale planning and random obstacle experiment two. (a) Grid experiment environments and RLTT method for planning, where the two bottom obstacle shapes randomly change within a certain range. (b) TLM motion planning according to RLTT planning.

Figure 15. Motion planning for automated driving in uncertain and certain environments.

Figure 16. Computation time for each experiment on motion planning using RLTT and the Q learning algorithm.

Figure 17. Averaged computation time for five experiments on motion planning using RLTT and the Q learning algorithm.

Figure 18. RLTT for traffic scenarios: driving corridors. (a) Real traffic scenario: uncertain corridor scenario. (b) Motion planning for an automated vehicle in uncertain corridor scenarios.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gu, S.; Chen, G.; Zhang, L.; Hou, J.; Hu, Y.; Knoll, A. Constrained Reinforcement Learning for Vehicle Motion Planning with Topological Reachability Analysis. Robotics 2022, 11, 81. https://doi.org/10.3390/robotics11040081

AMA Style

Gu S, Chen G, Zhang L, Hou J, Hu Y, Knoll A. Constrained Reinforcement Learning for Vehicle Motion Planning with Topological Reachability Analysis. Robotics. 2022; 11(4):81. https://doi.org/10.3390/robotics11040081

Chicago/Turabian Style

Gu, Shangding, Guang Chen, Lijun Zhang, Jing Hou, Yingbai Hu, and Alois Knoll. 2022. "Constrained Reinforcement Learning for Vehicle Motion Planning with Topological Reachability Analysis" Robotics 11, no. 4: 81. https://doi.org/10.3390/robotics11040081

APA Style

Gu, S., Chen, G., Zhang, L., Hou, J., Hu, Y., & Knoll, A. (2022). Constrained Reinforcement Learning for Vehicle Motion Planning with Topological Reachability Analysis. Robotics, 11(4), 81. https://doi.org/10.3390/robotics11040081

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Constrained Reinforcement Learning for Vehicle Motion Planning with Topological Reachability Analysis

Abstract

1. Introduction

2. Related Work

3. Problem Formulation

3.1. Uncertain Constraints F u n

3.2. Dynamic Constraints F d y

3.3. Safety Constraints F s a

3.4. Smooth Constraints F s m

4. Trajectory Lane Model

4.1. Rules Design

4.2. Analysis of TLM

5. Topological Path Search for Automated Driving

6. A Motion Planning Method for Automated Driving

6.1. RL Methods

6.2. A Motion Planning Method

6.3. Motion Planning Examples

7. Experiments

7.1. Different RL Methods for Path Search

7.2. A RLTT Method for Motion Planning

7.2.1. Macro-Scale Motion Planning

7.2.2. Micro-Scale Motion Planning

7.3. Comparison Experiments in Certain and Uncertain Environments

7.4. Comparison Experiments between RLTT and Q Learning Algorithms

7.5. Comparison Experiments between RLTT and Traditional Algorithms

7.6. Uncertain Corridor Scenarios

8. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.1. Uncertain Constraints $F_{u n}$

3.2. Dynamic Constraints $F_{d y}$

3.3. Safety Constraints $F_{s a}$

3.4. Smooth Constraints $F_{s m}$