1. Introduction
Autonomous driving has made great contributions to many aspects of social life, such as traffic safety, convenient travel, cost reduction, and efficiency improvement. The traditional hierarchical scheme of autonomous driving has been extensively studied during the past decades. The hierarchical scheme is commonly decomposed into a perception module, a decision-making module, a planning module, a motion control module, etc. Nowadays, another scheme, called the end-to-end scheme, has gradually become a trend [1,2]. An end-to-end autonomous driving scheme obtains the control commands (e.g., steering angle and braking) directly from the perception module, without the need for the traditional decision-making and planning modules. This scheme is straightforward and highly efficient since it does not need complex hand-crafted rules within the decision-making and planning modules. Additionally, it can achieve autonomous navigation in more complex environments where high-precision maps, which are essential to the traditional hierarchical scheme, are unavailable.
As early as 1988, the driverless vehicle named ALVINN [3] used a multi-layer perceptron to learn steering control. In 2016, Bojarski et al. [1] used a DNN to map images to three control commands (steering, throttle, and braking) to control the vehicle, which showed that a DNN can learn the control process of driving from raw image data. In 2018, Kendall et al. [4] demonstrated the first application of deep reinforcement learning to end-to-end autonomous driving, using images as the input to learn a lane-following strategy. These efforts have shown the great potential of end-to-end solutions.
Early end-to-end autonomous driving only realized driving along the road, while turning at intersections was handled by manual intervention. Some studies have begun to use high-level commands to guide vehicles to turn [5,6]. At each intersection, the vehicle’s turn instruction (straight, left turn, right turn, etc.) is specified in advance; when the vehicle arrives at the intersection, it executes the corresponding action according to the corresponding high-level command. This method not only requires accurate localization of the vehicle at intersections, but the preset steering commands also cannot be applied to other complex intersections. Other studies tend to use navigation maps to provide steering information [7,8,9]. The deep learning model needs to learn the steering information from the navigation map and fuse it with other sensor information to predict the control variables. However, learning turning information through the model is not effective in all cases [6]. On the one hand, the method relies on an effective feature-fusion strategy; on the other hand, the deep learning model easily overfits during training, especially when the samples are unbalanced. For example, the “junctions” samples in road scenes are far fewer than the “non-junctions” samples, so the model tends to overfit to the majority class (the “non-junctions” samples), which prevents it from learning to turn at intersections.
Although end-to-end autonomous driving has received widespread attention with the rapid development of deep learning, it has always been controversial; an important reason is the lack of interpretability. End-to-end models are largely trained as black boxes, lacking methods to evaluate the confidence of their output and to explain the characteristics they have learned. In the autonomous driving task, random factors, such as the movement of traffic participants and changes in the weather, make the predictions of the model full of uncertainty. Without a reliable confidence measure, the model may make wrong predictions. The development of research on the evaluation of the uncertainty of neural networks [10] has brought impetus to solving this problem. The evidential deep learning method [8,11] places a prior distribution on the category probability, treats the predictions of the neural network as subjective opinions, and learns from data, with a deterministic neural network, the function that collects the evidence leading to these opinions. The Bayesian neural network (BNN) method [12,13,14] places a prior over the network weights and estimates the prediction uncertainty by approximating the moments of the posterior predictive distribution. Because of its large computational cost, this method is difficult to apply to complex networks and resource-constrained environments. The ensemble method [15,16] uses a variety of different models to independently predict the same quantity and evaluates the reliability of the result through the distribution of the predictions; the underlying assumption is that a group of decision makers tends to make better decisions than a single decision maker. Although this method is considered the most effective in some tasks, for autonomous driving, using multiple models at the same time not only requires a lot of computing and memory resources but also cannot meet real-time requirements. Evaluating the confidence of the model in real time for further decision making can achieve greater robustness and ensure driving safety. However, only a few end-to-end autonomous driving studies pay attention to this problem. Amini et al. [
17] proposed a Bayesian neural network for end-to-end control, which estimates uncertainty by using feature map correlations during training. In their other work [2], a variational autoencoder network is designed that simultaneously predicts the control signal of the vehicle and reconstructs the input image; the latent uncertainty propagated through the network and the reconstructed image are used to detect novel images that are not covered by the training distribution. Liu et al. [8] proposed hybrid evidential fusion, which can learn the uncertainty of the prediction directly: another branch of the network is trained to output an evidential distribution for each prediction, designed to capture the evidence (or confidence) associated with any decision made by the network. These works model and analyze uncertainty from the perspective of the input and the network structure, while ignoring an analysis of the plausibility of the prediction results themselves.
In addition, deep models are highly sensitive to perturbations, which easily cause the output to jump. To address this problem, some works adopt a scheme that integrates multiple consecutive frames as input [5,18]. However, this approach is difficult to apply effectively, because it requires recurrent modeling or 4D convolutions to process multiple input frames [8].
In this work, a robust and reliable end-to-end visual navigation scheme (RREV navigation) is proposed, in which images and navigation maps are used as inputs to predict the future waypoints of the vehicle. Specifically, in order to solve the training overfitting problem caused by the imbalance between “junctions” and “non-junctions” samples, a dual-model learning strategy is proposed, which uses two models with the same structure to predict the two kinds of samples independently. Additionally, in order to better integrate image information and navigation information, a Transformer is used for feature fusion. The problems of confidence evaluation and anti-disturbance are also studied in this paper. By modeling and analyzing the output, a confidence evaluation method called “independent prediction-fitting error” (IPFE) is proposed and applied to a multi-frame accumulation of the output to optimize it. IPFE is convenient and fast and can meet the real-time requirements of autonomous driving. On the one hand, multi-frame accumulation improves the anti-disturbance capability of the system; on the other hand, it can compensate for erroneous predictions by using the results of historical frames. Because of this, the robustness of the system is improved. Offline and online experiments in virtual and real environments are carried out. The experimental results show that the dual-model learning strategy can improve the steering ability of vehicles at intersections and that the optimized output is smoother and more stable. Furthermore, the feasibility and effectiveness of IPFE for evaluating model confidence are demonstrated. The innovations and contributions of this paper are as follows:
(1) A dual-model learning strategy is proposed to solve the training overfitting problem caused by the imbalance between “junctions” and “non-junctions” samples.
(2) A rapid evaluation method of model confidence called “independent prediction-fitting error” is proposed. It is applied to multi-frame accumulation to optimize the output of the model and improve the robustness of the system.
The remainder of the paper is organized as follows: In
Section 2, our approach is described in detail. The first is the input and output of the model; then, the three main parts of this paper are introduced, which are the Transformer-based dual-model learning strategy, the evaluation of model confidence, and the multi-frame accumulation method for optimizing model output. Offline and online experiments in virtual and real environments are carried out in
Section 3. Finally,
Section 4 provides the conclusion and prospects.
2. Methods
Given a front-view RGB image and a global-planning local map, our objective is to learn an end-to-end neural network that predicts the future waypoints of the vehicle in the front-view image, as expressed in Equation (1). The prediction consists of the pixel coordinates of the waypoints in the front-view image, and the number of prediction points is k.
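The symbols of Equation (1) are not legible in this version of the text; a plausible form, with the notation (I for the image, M for the local map, F for the network, and (u, v) for pixel coordinates) assumed here rather than taken from the original, is:

```latex
% Hypothetical notation: I (front-view image), M (local map),
% F_theta (end-to-end network), (u_i, v_i) pixel coordinates of waypoint i.
\begin{equation}
  W = F_{\theta}(I, M), \qquad
  W = \{(u_i, v_i)\}_{i=1}^{k}
\end{equation}
```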
The structure of the model is shown in Figure 1 and mainly consists of three parts. (1) The dual-model independent learning strategy. Considering that sample imbalance leads to insufficient training, two models with the same structure are used to predict two different situations. First, an intersection classification network with the local map as input divides the road into the “junctions” and “non-junctions” situations and then activates the corresponding waypoint prediction network to predict n waypoints. (2) IPFE for evaluating model confidence. Evaluating the confidence of the model improves its interpretability on the one hand and can be used to optimize its output on the other hand. A quadratic curve is fitted to the predicted waypoints, and the fitting error Q is used to quantify the confidence C of the model. (3) The multi-frame cumulative optimization module. Considering that the raw prediction of the deep learning model is easily affected by disturbances, the raw network output is optimized via multi-frame accumulation. First, the outputs of a sequence of several frames are projected onto the last frame according to the odometry. Then, a weighted summation is performed with the confidence C as the weight to obtain the accumulated result. Finally, the optimized waypoints are decoded from the accumulated heatmap.
2.1. Input and Output
In this paper, the navigation map is used to guide the vehicle to turn. At present, compared with single-source single-frame input, multi-source fusion of images and lidar point clouds or multi-frame sequential input achieves better results in end-to-end autonomous driving [7]. However, in this paper, we focus on the problems of sample imbalance and the robustness of deep-learning-based systems. To avoid the additional interference caused by too much information during feature fusion and to ensure that the model can learn the steering information, we use a single image source rather than multi-source image-lidar fusion. Referring to previous end-to-end visual navigation work [9], a single-frame RGB image contains enough scene information for visual navigation tasks; therefore, we take the single-frame RGB image of a single front-view camera as the input of the model.
Different from most end-to-end driving schemes, which directly predict vehicle control variables (steering wheel angle, accelerator, braking, etc.), our method outputs predicted waypoints in the front-view image. Directly predicting the control variables ties the prediction results closely to the kinematics of the vehicle [
1,
19], so it is difficult to transfer among different vehicles because, when switching to a different vehicle, data need to be re-collected to train the model. A more efficient approach is to predict intermediate representations, such as future waypoints [
7,
9,
20].
The waypoints of a normal driving vehicle are smooth and orderly. For this reason, some studies use GRUs (Gated Recurrent Units) to predict the waypoints to ensure smooth and orderly output [
7]. In contrast, we use heatmaps [
21] to independently predict waypoints, breaking the constraints of smoothness and order among points. The heatmap is a two-dimensional Gaussian distribution map; the highest point of the Gaussian distribution represents a waypoint, and one heatmap represents one waypoint. Compared with the approach using GRUs, our method has better generalization and can improve the accuracy of prediction (as shown in
Section 3.2). Moreover, the waypoints predicted from the heatmaps are independent of each other, which makes it possible to evaluate the confidence of the model. The specific details are described in
Section 2.3.
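As a concrete illustration of the heatmap representation described above, the following sketch encodes one waypoint as a 2D Gaussian heatmap and decodes it back by taking the arg-max. The heatmap size and the Gaussian spread sigma are illustrative assumptions, not values from the paper.

```python
import numpy as np

def waypoint_to_heatmap(u, v, height, width, sigma=3.0):
    """Encode a waypoint (u, v) as a 2D Gaussian heatmap whose peak is the waypoint."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))

def heatmap_to_waypoint(heatmap):
    """Decode a waypoint as the pixel coordinates of the heatmap maximum."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(u), int(v)

# Example: encode a waypoint at pixel (120, 200) and recover it.
hm = waypoint_to_heatmap(120, 200, height=288, width=512)
print(heatmap_to_waypoint(hm))  # -> (120, 200)
```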
2.2. Transformer for Feature Fusion in Dual Model Structure
In road scenes, the dataset samples collected under natural conditions are often unbalanced; for example, the number of “non-junctions” samples is much larger than the number of “junctions” samples. This sample imbalance affects the fitting performance of the deep learning model. In order to reduce the impact of sample imbalance, two models with the same structure are used to predict the “junctions” and “non-junctions” samples in this paper. The process is shown in the structure diagram in Figure 1. Firstly, a binary classification network is used to distinguish “junctions” from “non-junctions” samples. The input of this network is the navigation map, and ResNet18 is used for feature extraction; the extracted 512-dimensional feature is passed through three fully connected layers to output a two-dimensional vector. Then, the result of the intersection classification activates the corresponding prediction network to predict the waypoints, and the network structure is shown in
Figure 2.
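A minimal sketch of the dual-model dispatch logic described above, assuming hypothetical module names (JunctionClassifier, the two waypoint networks) since the paper's code is not given; the hidden layer sizes of the classification head are also assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet18

class JunctionClassifier(nn.Module):
    """Binary 'junctions' / 'non-junctions' classifier over the local navigation map."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(num_classes=512)   # 512-d feature, as described in the text
        self.head = nn.Sequential(                  # three fully connected layers -> 2-d output
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, local_map):
        return self.head(self.backbone(local_map))

def predict_waypoints(image, local_map, classifier, junction_net, non_junction_net):
    """Activate the waypoint predictor that matches the classified road situation."""
    is_junction = classifier(local_map).argmax(dim=1).item() == 1
    net = junction_net if is_junction else non_junction_net
    return net(image, local_map)   # n heatmaps, one per waypoint
```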
The inputs of the network are the front-view image and the navigation map. The turning information at junctions needs to be learned from the navigation map and then integrated with the road information from the front-view image. To improve the efficiency of feature fusion, a Transformer is used, which is effective for multi-source information fusion [7,22,23]. HRNet [24] is used to extract features from the front-view image. HRNet maintains a high-resolution feature encoding throughout the forward pass of the network and gradually adds sub-networks from high resolution to low resolution; in this process, multi-scale information is repeatedly exchanged across the parallel multi-resolution branches, thereby reducing the loss of information during downsampling to a certain extent. The local map image is comparatively simple, so, in order to reduce the computation of the network as much as possible and referring to the practice in human pose estimation, StemNet [24] is used to extract its features. The extracted feature map is sliced into patches and then sent to a linear unit for dimension compression. The compressed features are added to position encoding vectors of the same dimension and then sent to the Transformer Encoder network.
In the Transformer Encoder, the core calculation is shown in Equation (2), which computes the correlation among the parts of the input vector. The transposed multiplication of the Q vector and the K vector in the equation amounts to an inner product of the original vectors, each of which is first multiplied by a learnable parameter matrix W. The vector inner product reflects the angle between two vectors, i.e., the projection of one vector onto the other; the larger the projection value, the higher the correlation between the two vectors, and the more attention should be paid to vectors with high correlation. After the inner product is calculated, softmax is used to normalize the results so that they sum to 1. The normalized result is then multiplied by the V vector, which is in effect a weighted sum of the original vectors using the attention mechanism. In addition, there is a scaling by d in the equation, where d is the vector length; this is mainly used to decouple the sharpness of the softmax distribution from d, so that the gradient values remain stable during training. The correlations among the input vectors are computed in parallel in the self-attention module.
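Equation (2) itself is not reproduced legibly in this version of the text; the computation described above matches the standard scaled dot-product attention, written here in the usual notation (the symbols are assumed rather than copied from the original):

```latex
% Standard scaled dot-product attention; notation assumed, not copied from the paper.
\begin{equation}
  \mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d}}\right) V
\end{equation}
```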
The calculation of the correlations does not consider the positional relationships of the vectors in the original image, but these positional relationships are extremely important for describing image semantics, so position encoding is needed to preserve the positional relationships among the vectors. In this paper, sinusoidal position encoding is adopted, and its calculation is shown in Equations (3) and (4). In the equations, the position term is the position of the current region in the image, and i is the index of each value in the vector; sine encoding is used at even positions, and cosine encoding is used at odd positions. After the position encoding is added, the front-view image features, the navigation map features, and the randomly initialized path guidance point encoding vectors are spliced together and sent to the Transformer module for fusion.
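Equations (3) and (4) are likewise not legible here; the description (sine at even indices, cosine at odd indices) corresponds to the standard sinusoidal positional encoding, reproduced below with assumed notation (pos for the position, i for the dimension index, d for the vector length):

```latex
% Standard sinusoidal positional encoding; symbols assumed.
\begin{align}
  PE(pos, 2i)   &= \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \\
  PE(pos, 2i+1) &= \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
\end{align}
```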
The Transformer encoder structure learns the feature representation of the input information through multiple stacked submodules. Each submodule includes a multi-headed self-attention (MSA) block, a multi-layer perceptron, and a normalization layer. The structure of the submodules is shown in Figure 3. Multimodal information fusion with the Transformer relies mainly on the self-attention module in this structure, which is expected to find the correlations among the different inputs. The calculation of self-attention (SA) is shown in Equation (5). In this equation, the projection matrices are learnable parameters, the input is the output of the previous layer, and the feature dimension is consistent with d. Multi-head self-attention is an extension of self-attention, introduced mainly because there are many different kinds of correlations among the features. The relationship between multi-headed self-attention and self-attention can be described by Equation (6), where the output projection matrix is a learnable parameter obtained during training.
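A common formulation consistent with the description of Equations (5) and (6), with the learnable projections written explicitly (the exact symbols used in the paper are not recoverable here, so this notation is an assumption):

```latex
% Self-attention and multi-head self-attention in standard form; notation assumed.
% X is the output of the previous layer; W^Q, W^K, W^V, W^O are learnable.
\begin{align}
  \mathrm{SA}(X)  &= \mathrm{softmax}\!\left(
      \frac{(X W^{Q})(X W^{K})^{\mathsf{T}}}{\sqrt{d}}\right) X W^{V} \\
  \mathrm{MSA}(X) &= \mathrm{Concat}\big(\mathrm{SA}_1(X), \ldots, \mathrm{SA}_h(X)\big)\, W^{O}
\end{align}
```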
After the Transformer fusion, n (where n is the number of waypoints) heatmaps are regressed through the multi-layer perceptron network, and the waypoint coordinates to be predicted are finally decoded from the heatmaps.
2.3. Independent Prediction-Fitting Error Evaluation
The uncertainty of random factors, such as the movement of traffic participants and changes in illumination in the autonomous driving scene, makes the predictions of the deep learning model uncertain, and the confidence of the model can provide good assistance for decision making. First, in terms of driving safety, it can trigger automatic warnings: when the model confidence is low, the output may be wrong, and the brakes should be applied in time. Second, it can provide a good reference for fusion in multi-scheme end-to-end autonomous driving. In general, the collaborative decision making of multiple schemes is often better than the decision of a single scheme alone, which requires fusing the results of different schemes, and the confidence of the model provides a good reference for this fusion. Model confidence can also help the progressive optimization of end-to-end visual navigation in the future; for example, during actual use, the model can automatically identify and collect samples with wrong predictions (with low confidence), which are then used to retrain and continuously update the model, so that its performance is gradually optimized.
The idea behind our evaluation module (“independent prediction-fitting error” (IPFE) evaluation) for model confidence is as follows. Our labels are n waypoints on the front-view image, sampled after a quadratic fit to the historical trajectory of the vehicle [9], so these labels are smooth and ordered. The waypoints predicted by a good model should also be smooth and ordered. If there are large jumps between the waypoints that violate this constraint, the result is likely to be wrong. Therefore, the confidence of the model can be reflected by the degree of smoothness and order of the discrete points. This idea is similar to the ensemble methods used for DNN uncertainty evaluation [10], in which multiple different models arbitrate to evaluate uncertainty. In our method, each independently predicted point indicating the guiding direction can be regarded as a referee: if the guiding directions of multiple points are consistent (smooth and ordered), the arbitrated result is reliable.
The output of the model is
n heatmaps, as introduced in
Section 2.1. Among them, one heatmap represents one waypoint, and the predictions of the points are independent of each other. The predicted waypoints are fitted with a quadratic curve, and the fitting error, which measures the degree of disorder of the points, is used to evaluate the confidence of the model. Considering that these waypoints are ordered, and in order to highlight the influence of this order on the result, only the first, middle, and last points are used for the quadratic fitting, and the fitting error of all waypoints to this curve is then calculated. The specific process is as follows:
(1) Calculate Curve Fitting Error
Let the model prediction be the n waypoints whose pixel coordinates are given in the front-view image. A quadratic polynomial is fitted to the data sequence formed by the first, middle, and last waypoints, where the middle index is obtained by rounding n/2 to an integer. Assuming that the fitting function takes the form in Equation (7), the mean square error between the fitted curve and the data sequence is given by Equation (8). According to the extreme-value principle of multivariate functions, the minimum of Equation (8) can be obtained, which yields the fitting function in the minimum mean square error sense. Next, the fitting error between all points and the fitted curve is calculated: all n waypoints are taken, and Equation (8) is used to compute the mean square error between them and the fitted curve, which is the fitting error of all points to the curve.
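A minimal sketch of this fitting-error computation: a quadratic is fitted through the first, middle, and last predicted waypoints, and the mean square error of all n points to that curve is returned as Q. Function names and the choice of the row coordinate as the independent variable are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def fitting_error(waypoints):
    """waypoints: (n, 2) array of predicted pixel coordinates (u, v) in the front-view image."""
    pts = np.asarray(waypoints, dtype=float)
    n = len(pts)
    idx = [0, n // 2, n - 1]                      # first, middle (index rounded down), last points
    # Fit u = a*v^2 + b*v + c through the three selected points; taking the row
    # coordinate v as the independent variable is an assumption of this sketch.
    coeffs = np.polyfit(pts[idx, 1], pts[idx, 0], deg=2)
    # Fitting error Q: mean square error of all n points to the fitted curve.
    residuals = np.polyval(coeffs, pts[:, 1]) - pts[:, 0]
    return float(np.mean(residuals ** 2))
```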
(2) Quantify Model Confidence
The fitting error Q reflects the confidence of the model prediction: the larger Q is, the lower the confidence. When Q exceeds a threshold, the output of the model is considered wrong and the result cannot be adopted, so the confidence is set to 0. The confidence of the model can thus be calculated by Equation (9), where the threshold of the fitting error is obtained by the following method. First, the trained model makes predictions on the training set, the fitting error is calculated for each sample, and each prediction result is labeled with a binary variable indicating whether it is acceptable. Then, candidate threshold values are traversed over a range, and the F1 score of the accepted samples is calculated for each candidate. Finally, the threshold corresponding to the maximum F1 score is taken as the optimal threshold.
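A sketch of this threshold selection, under the assumption that a prediction is accepted when its fitting error Q is below the candidate threshold (consistent with Section 3.3); the candidate range and the linear decay used as a stand-in for Equation (9) are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(q_values, accepted_labels, candidates=np.linspace(0.0, 5000.0, 501)):
    """q_values: fitting error Q per training sample.
    accepted_labels: 1 if the prediction was manually judged acceptable, else 0."""
    best_thr, best_f1 = None, -1.0
    for thr in candidates:
        predicted_ok = (np.asarray(q_values) < thr).astype(int)  # accept when Q is below the threshold
        f1 = f1_score(accepted_labels, predicted_ok)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

def confidence(q, thr):
    """Stand-in for Equation (9): confidence is 0 once Q reaches the threshold,
    otherwise it decays with Q (the exact form in the paper is not recoverable here)."""
    return 0.0 if q >= thr else 1.0 - q / thr
```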
2.4. Multi-Frame Accumulation to Optimize the Output of the Model
The deep model predicts each sample independently, so the paths obtained by this end-to-end scheme in consecutive frames may be inconsistent; in particular, when the model confidence is very low, the prediction result is not reliable. In order to obtain a reliable output for the current frame, we can refer to the results of the previous frames, that is, use the accumulated results of the previous frames as the output of the current frame.
Let the model output at time t be the n waypoints on the front-view image. The 2D-to-3D projection trick [9] is used to project these front-view image points into the vehicle body coordinate system, giving n waypoints in the body frame at time t. The pose transformation matrix from an earlier time to time t is calculated from the vehicle odometry, and the waypoints expressed in the body frame at that earlier time are then transformed into the body frame at time t, as calculated by Equation (10).
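A sketch of the waypoint transformation in Equation (10), assuming a 4×4 homogeneous pose matrix (from the body frame at the earlier time to the body frame at time t) obtained from odometry; variable names are illustrative.

```python
import numpy as np

def transform_waypoints(points_prev, T_prev_to_now):
    """points_prev: (n, 3) waypoints in the body frame at an earlier time.
    T_prev_to_now: 4x4 homogeneous transform from that frame to the body frame at time t."""
    pts_h = np.hstack([points_prev, np.ones((len(points_prev), 1))])  # homogeneous coordinates
    return (T_prev_to_now @ pts_h.T).T[:, :3]                         # waypoints in the current frame
```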
The transformed waypoints are projected back to front-view pixel coordinates to represent the predicted trajectory of the vehicle, a quadratic curve is fitted to them, and a heatmap H is used to represent this curve. H is calculated according to Equation (11), where each entry of H is the value at the corresponding pixel coordinate on the heatmap: for each row, the pixel through which the curve passes is assigned a value of 1, and the values to the left and right of this point decrease according to Equation (11).
After obtaining the trajectory represented by the heatmap, multiple frames can be accumulated. The confidence of the model output differs at each moment; when the model output is wrong (i.e., its confidence is 0), that frame should be discarded. Therefore, a weighted summation is used for the accumulation, with the confidence C of the model as the weight. The cumulative result over the k frames before time t is given by Equation (12), where the accumulated heatmap of the k frames before time t is obtained by weighting the single-frame heatmap of each of these frames by its confidence.
Finally, in each row of the accumulated heatmap, the coordinate of the maximum pixel value greater than 0 is taken as the optimal point, and the optimized waypoints are generated following the label-generation procedure of [9].
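A sketch of the confidence-weighted accumulation of Equation (12) and the per-row decoding described above; names are illustrative and the heatmaps are assumed to be already aligned to the current frame.

```python
import numpy as np

def accumulate_heatmaps(heatmaps, confidences):
    """Weighted sum of the k single-frame trajectory heatmaps (aligned to time t),
    using the confidence C of each frame as its weight (Equation (12))."""
    acc = np.zeros_like(heatmaps[0], dtype=float)
    for H, c in zip(heatmaps, confidences):
        acc += c * H            # frames with confidence 0 contribute nothing
    return acc

def decode_waypoints(acc):
    """For each row, take the column of the maximum value (if > 0) as the optimal point."""
    points = []
    for v, row in enumerate(acc):
        u = int(np.argmax(row))
        if row[u] > 0:
            points.append((u, v))
    return points
```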
3. Experimental Results and Discussion
In this section, firstly, we introduce the evaluation metrics, the experimental system, and the parameters for model training. Then, we compare the effect of GRU output and heatmap output in the simulation environment. In addition, the feasibility and effectiveness of our model confidence assessment method are tested with real off-road environment data. Finally, the performance of our proposed improved model in practical applications is tested.
3.1. Evaluation Metrics and Experimental Environment
This paper uses two metrics to measure the performance of the waypoint prediction algorithm: the Waypoint Average Error (WAE, see Equation (13)) and the Final Waypoint Accuracy (FWA, see Equations (14) and (15)). WAE represents the average error of all predicted waypoints and measures the average accuracy of the model predictions; this indicator is mainly used for fine-grained performance evaluation during model training. FWA stands for the proportion of trajectories whose end point is predicted correctly; it mainly measures whether the trajectory direction is predicted correctly and is a coarse-grained evaluation. Here, the path end point is replaced by the mean of the last three waypoints in the waypoint sequence.
In the formulas, the end-point error is the mean error calculated from the last three waypoints in the waypoint sequence, T represents the number of waypoints, and the predicted waypoints and the labels enter the error terms, respectively. The end-point error threshold is set to 10 in this experiment: when the end-point error is less than the threshold, the waypoint sequence is considered to be correctly predicted, and a prediction score is assigned. The two indicators, WAE and FWA, evaluate the model from two dimensions: the absolute accuracy of the waypoints predicted by the model and the accuracy of the predicted waypoint direction. In practical applications, the direction accuracy mainly evaluates whether the steering behavior of the model at intersections is correct.
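A sketch of the two metrics under the stated conventions (end point taken as the mean of the last three waypoints, end-point error threshold of 10 pixels); the exact forms of Equations (13)-(15) are not legible here, so the error definitions below are assumptions.

```python
import numpy as np

def waypoint_average_error(pred, label):
    """Mean Euclidean error over all predicted waypoints of one sample."""
    return float(np.mean(np.linalg.norm(np.asarray(pred) - np.asarray(label), axis=1)))

def final_waypoint_accuracy(preds, labels, threshold=10.0):
    """Fraction of samples whose end-point error (mean of the last three waypoints)
    is below the threshold."""
    correct = 0
    for p, l in zip(preds, labels):
        end_pred = np.mean(np.asarray(p)[-3:], axis=0)
        end_label = np.mean(np.asarray(l)[-3:], axis=0)
        if np.linalg.norm(end_pred - end_label) < threshold:
            correct += 1
    return correct / len(preds)
```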
The hardware environment for training is a computer equipped with an Intel Xeon(R) Gold 5218R 2.10 GHz CPU and a 24 GB NVIDIA RTX 3090 GPU; the operating system is Ubuntu 18.04, with PyTorch 1.8 and CUDA 11.4. The hardware environment for testing is equipped with an Intel(R) Core(TM) i7-10750H 2.60 GHz CPU and a 6 GB NVIDIA RTX 2060 GPU; the operating system is Ubuntu 18.04. The experimental scenarios are introduced separately below.
3.2. Heatmap and GRU Output Comparison Experiment
The purpose of predicting waypoints with a GRU is to maintain the smoothness of the path, whereas the heatmap-based prediction used in this paper does not enforce this property. Will this make the prediction worse than that of the GRU? To address this concern, we set up the following experiment, carried out in CARLA [25]. The dataset is set up as follows: the training set consists of 13,257 samples collected in Town1, Town5, and Town7, and the test set consists of 3653 samples collected in Town2. This experiment uses only one network for training. The heatmap output of the model in Section 2.2 is replaced with a GRU output [7], with the rest of the structure unchanged, to obtain the GRU-based model.
The performance of two different output modeling schemes is recorded in
Table 1. It is evident from the table that the heatmap output outperforms the GRU output on both metrics, WAE and FWA. To understand the poor performance of the GRU waypoint prediction, we examined the outputs of the two schemes, as shown in Figure 4. Although the overall trend of the near waypoints of the GRU is consistent with the label when turning, there is no strong turning trend at a distance. This is because, when turning, the near waypoints of the label are distributed uniformly, while the lateral distribution of the far waypoints is not uniform; the GRU output, however, forcibly maintains the relative relationship of the points, so both the horizontal and vertical spacings remain uniform, and no obvious steering trend appears in the distance. The heatmap is different: each waypoint is predicted independently, so both near and far points tend toward the labels.
According to this experimental result, it can be concluded that although the heatmap output does not consider or maintain the smoothness of the waypoints, its prediction accuracy is even better than that of the GRU.
3.3. Verification of RREV
To verify the effectiveness and robustness of RREV, we collected experimental data from real off-road scenarios. The road scenes include dirt roads, gravel roads, weed-covered roads, and other complex road conditions, as shown in Figure 5. A total of 33,297 samples were collected, of which 24,608 were used as the training set and 8689 as the test set.
The experimental process is as follows. The model trained on the training set is used to predict the training-set samples, the fitting error Q of each sample is calculated according to the method in Section 2.3, and each prediction result is manually marked as “true” or “false”. Candidate thresholds are traversed over a range, the F1 value of the “true” samples is calculated for each, and the threshold corresponding to the largest F1 value is taken as the best threshold. Finally, the same operation is performed on the test set to obtain the Q values and the “true”/“false” labels, and the selected best threshold is then used to classify the test set.
The result is shown in
Figure 6.
Figure 6b shows that as the threshold increases, the F1 value gradually increases to a maximum and then gradually decreases. This is because, as the threshold increases, more and more “true” samples are correctly classified and the recall gradually increases; after F1 reaches its maximum, further increasing the threshold causes “false” samples to be accepted as “true”, the precision gradually decreases, and the F1 value decreases. According to the results in Figure 6b, the best threshold is 1500; when the test set is evaluated with this threshold, the precision of the test results reaches 93.4%.
We examined the predictions of the model for different Q values, as shown in Figure 7. When the model confidence is high (Q is small), the predicted waypoints are smooth and orderly and agree well with the labels. As the prediction confidence gradually decreases, with Q increasing but still below the threshold, the predicted waypoints become gradually disordered; however, the result does not differ much from the label and can still be accepted. When Q exceeds the threshold, the predicted waypoints are very disordered and the result cannot be accepted; at this point, the prediction of the model is considered to be wrong, and the confidence is 0. These observations conform to the conjecture in Section 2.3. The experiments show that the IPFE method proposed in this paper to quantify model confidence is feasible and effective.
3.4. Ablation Study: Comparing the Effects of Different Improvements
(1) Simulation environment test
The purpose of this experiment is to test the effect of the improved scheme proposed in this paper; we conduct three groups of tests:
One model + one frame: Do not distinguish between “junctions” and “non-junctions”, only use one model to predict all situations, without multi-frame accumulation;
Two models + one frame: Use two models to predict the two cases of “junctions” and “non-junctions”, respectively, without multi-frame accumulation;
Two models + multiple frames: Use two models to predict the two cases of “junctions” and “non-junctions”, respectively, and perform multi-frame accumulation.
All model outputs are heatmaps. The training data for this experiment comprise 13,257 samples in total (the same as in Section 3.2), from Town1, Town5, and Town7 in CARLA. Test data are collected on the two road sections of Town7 shown in Figure 8. Route 1 is mainly composed of “non-junctions”, with few “junctions” data, in order to test the performance of the model along the road. Compared with route 1, more “junctions” data are added to route 2 in order to test the steering performance of the model at intersections. The details of the test data are shown in Table 2. The “junctions” samples account for 5.98% of route 1 but increase to 40.63% of route 2.
The experimental results are shown in
Table 3. In route 1, the FWA of “Two models + one frame” is 2.5% higher than that of “One model + one frame”, but in route 2 the improvement reaches 14.5%. This is because the training-set samples are unbalanced (“non-junctions” data account for the majority), so the model learns the “non-junctions” data much better than the “junctions” data. Since route 1 contains only a small proportion of “junctions” data, the difference between “One model + one frame” and “Two models + one frame” is not obvious. When the proportion of “junctions” in route 2 increases, the total FWA of the single model decreases significantly; however, because the two models can learn the “junctions” data in a targeted manner, the FWA of the dual-model scheme is not greatly affected no matter how much the proportion of “junctions” data increases. For both route 1 and route 2, every metric of the multi-frame scheme is better than that of the single-frame scheme, and this advantage is more obvious in route 2, which has more “junctions” data.
Through observation, we found the results shown in
Figure 9. In the first frame, as the vehicle approaches the intersection, the single-frame prediction of the model turns early, while the multi-frame accumulated result still keeps the same direction as the label; as the vehicle gradually enters the intersection, more and more predictions turn. By the fourth frame, when the vehicle has fully entered the intersection and starts to turn, the multi-frame accumulated result also turns. Multi-frame accumulation can thus solve the problem of single-frame prediction jumps, making the predictions of consecutive frames more coherent and preventing the vehicle from turning early or late at intersections.
(2) Complex road environment test
The above experiments show that the method proposed in this paper yields a clear performance improvement over the previous methods. In order to test its performance on more complex road sections, the following experiments were carried out.
The tests were conducted in a real off-road environment and Town5 in CARLA, respectively, and the complexity of road conditions was reflected through texture changes and light changes. The off-road environment is shown in
Figure 5 of
Section 3.3. The environment in Town5 is shown in
Figure 10, with added variations in weather and lighting. The data settings for training and testing in the two environments are shown in
Table 4; in CARLA, the model was trained in Town1, Town2, and Town7 (where the lighting changes are not obvious) and tested in Town5. The training and testing sites in the off-road environment are the same, but the routes are different. The experimental process is the same as above, and the results are shown in
Table 5.
As shown in
Table 5, in the CARLA environment, “Two models + multiple frames” is still the best and “One model + one frame” the worst. The differences in performance among the three groups are not particularly large because the weather and lighting of Town5 change sharply and did not appear in the training set, so the overall performance decreases. Nevertheless, the dual-model scheme is still better than the single model, and the performance of the model is significantly improved after multi-frame optimization. In the off-road environment, “Two models + one frame” is 10.6% higher than “One model + one frame”: during the mixed training of “junctions” and “non-junctions”, the model overfits to “non-junctions”, but its ability to learn the “junctions” samples improves after separate training. “Two models + multiple frames” is a further 2.3% higher than “Two models + one frame”. The visualization result is shown in
Figure 11. When the prediction of the second frame is wrong, the multi-frame accumulated result can compensate for the erroneous output, so the model is more robust.
The experimental results in the simulation environment and in the more complex environments show that using two models to independently predict the “junctions” and “non-junctions” situations avoids the problem of sample imbalance: the model can be trained more fully, and the steering accuracy of the vehicle at intersections is improved in practical applications. The multi-frame accumulation effectively solves the incoherence and instability of the path caused by jumps in single-frame predictions and prevents the vehicle from turning early or late at intersections.
3.5. Online Testing in Virtual and Real Environments
The above experiments are all offline tests. In order to further verify the reliability and robustness of our improved end-to-end visual navigation scheme, online tests in virtual and real environments are performed here. After decoding the waypoints from the heatmap, the method in [9] is used to control the motion of the vehicle: the waypoints are projected into the vehicle coordinate system to compute the steering value, the vehicle is driven at a fixed speed, and a PID controller is used for lateral control.
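A minimal sketch of such a lateral control loop (heading error toward a look-ahead waypoint in the vehicle frame fed to a PID controller at fixed speed); the gains, time step, and look-ahead choice are assumptions of this sketch, not values from [9].

```python
import numpy as np

class PID:
    def __init__(self, kp=1.0, ki=0.0, kd=0.1, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def steering_from_waypoints(waypoints_vehicle, controller):
    """waypoints_vehicle: (n, 2) points in the vehicle frame (x forward, y left).
    The heading error toward a look-ahead waypoint is fed to the PID controller."""
    look_ahead = waypoints_vehicle[min(4, len(waypoints_vehicle) - 1)]
    heading_err = np.arctan2(look_ahead[1], look_ahead[0])
    return controller.step(heading_err)
```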
We conduct experiments in a virtual environment in CARLA. The real vehicle of the real environment experiment is shown in
Figure 12; its size is 1800 × 1500 × 750 mm (length, width, and height). It is equipped with a wheel encoder and an inertial navigation system for vehicle positioning, and a camera is used to obtain RGB images of the scene. The onboard computer is an embedded NVIDIA Xavier platform, which includes an 8-core NVIDIA Carmel ARMv8.2 64-bit CPU and a 512-core Volta-architecture GPU with 8 streaming multiprocessors.
We train the model on Town1, Town5, and Town7, and test it on Town7. The route is shown in
Figure 13a. We conduct tests in real scenarios on campus, and the road topology and routes are shown in
Figure 13b, and the campus environment is shown in
Figure 13c,d. We conduct comparative experiments on the three schemes of “One model + one frame”, “Two models + one frame”, and “Two models + multiple frames”, test them on the selected routes, respectively, and record the number and location of manual interventions.
The results of the experiment are shown in
Figure 14. First of all, from
Figure 14a,b, it can be seen that, in both the virtual and the real environment, the number of interventions of “One model + one frame” is the largest, and most of them occur at intersections. This is due to the sample imbalance, which leads to insufficient training of the model on intersection data, so the model fails to learn the intersection steering ability well. Secondly, compared with “One model + one frame”, the number of interventions of “Two models + one frame” is greatly reduced, indicating that the dual-model learning strategy can greatly improve the model's ability to turn at intersections. Finally, the numbers of interventions of “Two models + multiple frames” in the virtual and real environments are 0 and 1, respectively, which are fewer than those of “Two models + one frame”, showing that multi-frame accumulation can optimize the steering performance of the model at intersections and improve its reliability. The experimental results are in line with the analysis in
Section 3.4. The results show that, compared with the single model, the dual-model learning strategy enables the vehicle to have better steering ability at intersections, and the accumulation of multiple frames can further optimize the steering performance of the model at intersections.