1. Introduction
Autonomous driving has made great contributions to many aspects of social life, such as traffic safety, convenient travel, cost reduction, and efficiency improvement. The traditional hierarchical scheme of autonomous driving has been extensively studied during the past decades. The hierarchical scheme is commonly decomposed into a perception module, a decision-making module, a planning module, a motion control module, etc. Nowadays, another scheme, called the end-to-end scheme, has gradually become a trend [1,2]. An end-to-end autonomous driving scheme obtains the control commands (e.g., steering angle and braking) directly from the perception module, without the need for the traditional decision-making and planning modules. This scheme is straightforward and highly efficient since it does not need complex hand-crafted rules within the decision-making and planning modules. Additionally, it can achieve autonomous navigation in more complex environments where high-precision maps, which are essential to the traditional hierarchical scheme, are unavailable.
As early as 1988, the driverless vehicle named ALVINN [3] used a multi-layer perceptron to learn steering control. In 2016, Bojarski et al. [1] used a DNN to map images to three control commands (steering, throttle, and braking) to control the vehicle, which showed that a DNN can learn the control process of driving from raw image data. In 2018, Kendall et al. [4] demonstrated the first application of deep reinforcement learning to end-to-end autonomous driving, using images as the input to learn a lane-following strategy. These efforts have shown the great potential of end-to-end solutions.
Early end-to-end autonomous driving only realized driving along the road, while turning at intersections was handled by manual intervention. Some studies have begun to use high-level commands to guide vehicles to turn [5,6]. At each intersection, the vehicle’s turn instruction (straight, left turn, right turn, etc.) is specified in advance; when the vehicle arrives at the intersection, it executes the corresponding action according to the corresponding high-level command. This method not only requires accurate localization of the vehicle at intersections, but the preset steering commands also cannot be applied to other complex intersections. Other studies tend to use navigation maps to provide steering information [7,8,9]. The deep learning model needs to learn the steering information from the navigation map and fuse it with other sensor information to predict the control variables. However, learning turning information through the model is not effective in all cases [6]. On the one hand, the method relies on an effective feature-fusion strategy; on the other hand, the deep learning model easily overfits during training, especially when the samples are unbalanced. For example, the “junctions” samples in road scenes are far fewer than the “non-junctions” samples, so the model tends to overfit to the majority class (the “non-junctions” samples), which prevents it from learning to turn at intersections.
Although end-to-end autonomous driving has received widespread attention with the rapid development of deep learning, it has always been controversial; an important reason is the lack of interpretability. End-to-end models are largely trained as black boxes, lacking methods to evaluate the confidence of their output and to explain the characteristics they have learned. In the autonomous driving task, random factors, such as the movement of traffic participants and changes in the weather, make the predictions of the model full of uncertainty. Without a reliable confidence measure, the model may make wrong predictions. The development of research on the evaluation of the uncertainty of neural networks [10] has brought impetus to solving this problem. The evidential deep learning method [8,11] places a prior distribution on the category probability, treats the predictions of the neural network as subjective opinions, and learns from data, with a deterministic neural network, the function that collects the evidence leading to these opinions. The Bayesian neural network (BNN) method [12,13,14] places a prior over the network weights and estimates the prediction uncertainty by approximating the moments of the posterior predictive distribution. Because of its large computational cost, this method is difficult to apply to complex networks and resource-constrained environments. The ensemble method [15,16] uses a variety of different models to independently predict the same quantity and evaluates the reliability of the result through the distribution of the predictions; the underlying assumption is that a group of decision makers tends to make better decisions than a single decision maker. Although this method is considered the most effective in some tasks, for autonomous driving, using multiple models at the same time not only requires a lot of computing and memory resources but also cannot meet real-time requirements. Evaluating the confidence of the model in real time for further decision making can achieve greater robustness and ensure driving safety. However, only a few end-to-end autonomous driving studies pay attention to this problem. Amini et al. [
17] proposed a Bayesian neural network for end-to-end control, which estimates uncertainty by using feature map correlations during training. In their other work [2], a variational autoencoder network is designed that simultaneously predicts the control signal of the vehicle and reconstructs the input image; the latent uncertainty propagated through the network and the reconstructed image are used to detect novel images that are not covered by the training distribution. Liu et al. [8] proposed hybrid evidential fusion, which can learn the uncertainty of the prediction directly: another branch of the network is trained to output an evidential distribution for each prediction, designed to capture the evidence (or confidence) associated with any decision made by the network. These works model and analyze uncertainty from the perspective of the input and the network structure, while ignoring an analysis of the plausibility of the prediction results themselves.
In addition, deep models are highly sensitive to perturbations, which easily cause the output to jump. To address this problem, some works adopt a scheme that integrates multiple consecutive frames as input [5,18]. However, this approach is difficult to apply effectively, because it requires recurrent modeling or 4D convolutions to process multiple input frames [8].
In this work, a robust and reliable end-to-end visual navigation scheme (RREV navigation) is proposed, in which images and navigation maps are used as inputs to predict the future waypoints of the vehicle. Specifically, in order to solve the training overfitting problem caused by the imbalance between “junctions” and “non-junctions” samples, a dual-model learning strategy is proposed, which uses two models with the same structure to predict the two kinds of samples independently. Additionally, in order to better integrate image information and navigation information, a Transformer is used for feature fusion. The problems of confidence evaluation and anti-disturbance are also studied in this paper. By modeling and analyzing the output, a confidence evaluation method called “independent prediction-fitting error” (IPFE) is proposed and applied to a multi-frame accumulation of the output to optimize it. IPFE is convenient and fast and can meet the real-time requirements of autonomous driving. On the one hand, multi-frame accumulation improves the anti-disturbance capability of the system; on the other hand, it can compensate for erroneous predictions by using the results of historical frames. Because of this, the robustness of the system is improved. Offline and online experiments in virtual and real environments are carried out. The experimental results show that the dual-model learning strategy can improve the steering ability of vehicles at intersections and that the optimized output is smoother and more stable. Furthermore, the feasibility and effectiveness of IPFE for evaluating model confidence are demonstrated. The innovations and contributions of this paper are as follows:
(1) A dual-model learning strategy is proposed to solve the training overfitting problem caused by the imbalance between “junctions” and “non-junctions” samples.
(2) A rapid evaluation method of model confidence called “independent prediction-fitting error” is proposed. It is applied to multi-frame accumulation to optimize the output of the model and improve the robustness of the system.
The remainder of the paper is organized as follows: In
Section 2, our approach is described in detail. The first is the input and output of the model; then, the three main parts of this paper are introduced, which are the Transformer-based dual-model learning strategy, the evaluation of model confidence, and the multi-frame accumulation method for optimizing model output. Offline and online experiments in virtual and real environments are carried out in
Section 3. Finally,
Section 4 provides the conclusion and prospects.
2. Methods
Given a front-view RGB image and a global-planning local map, our objective is to learn an end-to-end neural network that predicts the future waypoints of the vehicle in the front-view image, as expressed in Equation (1). The prediction consists of the pixel coordinates of the waypoints in the front-view image, and the number of prediction points is k.
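The symbols of Equation (1) are not legible in this version of the text; a plausible form, with the notation (I for the image, M for the local map, F for the network, and (u, v) for pixel coordinates) assumed here rather than taken from the original, is:

```latex
% Hypothetical notation: I (front-view image), M (local map),
% F_theta (end-to-end network), (u_i, v_i) pixel coordinates of waypoint i.
\begin{equation}
  W = F_{\theta}(I, M), \qquad
  W = \{(u_i, v_i)\}_{i=1}^{k}
\end{equation}
```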
The structure of the model is shown in Figure 1 and mainly consists of three parts. (1) The dual-model independent learning strategy. Considering that sample imbalance leads to insufficient training, two models with the same structure are used to predict two different situations. First, an intersection classification network with the local map as input divides the road into the “junctions” and “non-junctions” situations and then activates the corresponding waypoint prediction network to predict n waypoints. (2) IPFE for evaluating model confidence. Evaluating the confidence of the model improves its interpretability on the one hand and can be used to optimize its output on the other hand. A quadratic curve is fitted to the predicted waypoints, and the fitting error Q is used to quantify the confidence C of the model. (3) The multi-frame cumulative optimization module. Considering that the raw prediction of the deep learning model is easily affected by disturbances, the raw network output is optimized via multi-frame accumulation. First, the outputs of a sequence of several frames are projected onto the last frame according to the odometry. Then, a weighted summation is performed with the confidence C as the weight to obtain the accumulated result. Finally, the optimized waypoints are decoded from the accumulated heatmap.
2.1. Input and Output
In this paper, the navigation map is used to guide the vehicle to turn. At present, compared with single-source single-frame input, multi-source fusion of images and lidar point clouds or multi-frame sequential input achieves better results in end-to-end autonomous driving [7]. However, in this paper, we focus on the problems of sample imbalance and the robustness of deep-learning-based systems. To avoid the additional interference caused by too much information during feature fusion and to ensure that the model can learn the steering information, we use a single image source rather than multi-source image-lidar fusion. Referring to previous end-to-end visual navigation work [9], a single-frame RGB image contains enough scene information for visual navigation tasks; therefore, we take the single-frame RGB image of a single front-view camera as the input of the model.
Different from most end-to-end driving schemes, which directly predict vehicle control variables (steering wheel angle, accelerator, braking, etc.), our method outputs predicted waypoints in the front-view image. Directly predicting the control variables ties the prediction results closely to the kinematics of the vehicle [
1,
19], so it is difficult to transfer among different vehicles because, when switching to a different vehicle, data need to be re-collected to train the model. A more efficient approach is to predict intermediate representations, such as future waypoints [
7,
9,
20].
The waypoints of a normal driving vehicle are smooth and orderly. For this reason, some studies use GRUs (Gated Recurrent Units) to predict the waypoints to ensure smooth and orderly output [
7]. In contrast, we use heatmaps [
21] to independently predict waypoints, breaking the constraints of smoothness and order among points. The heatmap is a two-dimensional Gaussian distribution map; the highest point of the Gaussian distribution represents a waypoint, and one heatmap represents one waypoint. Compared with the approach using GRUs, our method has better generalization and can improve the accuracy of prediction (as shown in
Section 3.2). Moreover, the waypoints predicted from the heatmaps are independent of each other, which makes it possible to evaluate the confidence of the model. The specific details are described in
Section 2.3.
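As a concrete illustration of the heatmap representation described above, the following sketch encodes one waypoint as a 2D Gaussian heatmap and decodes it back by taking the arg-max. The heatmap size and the Gaussian spread sigma are illustrative assumptions, not values from the paper.

```python
import numpy as np

def waypoint_to_heatmap(u, v, height, width, sigma=3.0):
    """Encode a waypoint (u, v) as a 2D Gaussian heatmap whose peak is the waypoint."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))

def heatmap_to_waypoint(heatmap):
    """Decode a waypoint as the pixel coordinates of the heatmap maximum."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(u), int(v)

# Example: encode a waypoint at pixel (120, 200) and recover it.
hm = waypoint_to_heatmap(120, 200, height=288, width=512)
print(heatmap_to_waypoint(hm))  # -> (120, 200)
```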
2.2. Transformer for Feature Fusion in Dual Model Structure
In road scenes, the dataset samples collected under natural conditions are often unbalanced; for example, the number of “non-junctions” samples is much larger than the number of “junctions” samples. This sample imbalance affects the fitting performance of the deep learning model. In order to reduce the impact of sample imbalance, two models with the same structure are used to predict the “junctions” and “non-junctions” samples in this paper. The process is shown in the structure diagram in Figure 1. Firstly, a binary classification network is used to distinguish “junctions” from “non-junctions” samples. The input of this network is the navigation map, and ResNet18 is used for feature extraction; the extracted 512-dimensional feature is passed through three fully connected layers to output a two-dimensional vector. Then, the result of the intersection classification activates the corresponding prediction network to predict the waypoints, and the network structure is shown in
Figure 2.
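A minimal sketch of the dual-model dispatch logic described above, assuming hypothetical module names (JunctionClassifier, the two waypoint networks) since the paper's code is not given; the hidden layer sizes of the classification head are also assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet18

class JunctionClassifier(nn.Module):
    """Binary 'junctions' / 'non-junctions' classifier over the local navigation map."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(num_classes=512)   # 512-d feature, as described in the text
        self.head = nn.Sequential(                  # three fully connected layers -> 2-d output
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, local_map):
        return self.head(self.backbone(local_map))

def predict_waypoints(image, local_map, classifier, junction_net, non_junction_net):
    """Activate the waypoint predictor that matches the classified road situation."""
    is_junction = classifier(local_map).argmax(dim=1).item() == 1
    net = junction_net if is_junction else non_junction_net
    return net(image, local_map)   # n heatmaps, one per waypoint
```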
The inputs of the network are the front-view image and the navigation map. The turning information at junctions needs to be learned from the navigation map and then integrated with the road information from the front-view image. To improve the efficiency of feature fusion, a Transformer is used, which is effective for multi-source information fusion [7,22,23]. HRNet [24] is used to extract features from the front-view image. HRNet maintains a high-resolution feature encoding throughout the forward pass of the network and gradually adds sub-networks from high resolution to low resolution; in this process, multi-scale information is repeatedly exchanged across the parallel multi-resolution branches, thereby reducing the loss of information during downsampling to a certain extent. The local map image is comparatively simple, so, in order to reduce the computation of the network as much as possible and referring to the practice in human pose estimation, StemNet [24] is used to extract its features. The extracted feature map is sliced into patches and then sent to a linear unit for dimension compression. The compressed features are added to position encoding vectors of the same dimension and then sent to the Transformer Encoder network.
In the Transformer Encoder, the core calculation is shown in Equation (2), which computes the correlation among the parts of the input vector. The transposed multiplication of the Q vector and the K vector in the equation amounts to an inner product of the original vectors, each of which is first multiplied by a learnable parameter matrix W. The vector inner product reflects the angle between two vectors, i.e., the projection of one vector onto the other; the larger the projection value, the higher the correlation between the two vectors, and the more attention should be paid to vectors with high correlation. After the inner product is calculated, softmax is used to normalize the results so that they sum to 1. The normalized result is then multiplied by the V vector, which is in effect a weighted sum of the original vectors using the attention mechanism. In addition, there is a scaling by d in the equation, where d is the vector length; this is mainly used to decouple the sharpness of the softmax distribution from d, so that the gradient values remain stable during training. The correlations among the input vectors are computed in parallel in the self-attention module.
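Equation (2) itself is not reproduced legibly in this version of the text; the computation described above matches the standard scaled dot-product attention, written here in the usual notation (the symbols are assumed rather than copied from the original):

```latex
% Standard scaled dot-product attention; notation assumed, not copied from the paper.
\begin{equation}
  \mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d}}\right) V
\end{equation}
```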
The calculation of the correlations does not consider the positional relationships of the vectors in the original image, but these positional relationships are extremely important for describing image semantics, so position encoding is needed to preserve the positional relationships among the vectors. In this paper, sinusoidal position encoding is adopted, and its calculation is shown in Equations (3) and (4). In the equations, the position term is the position of the current region in the image, and i is the index of each value in the vector; sine encoding is used at even positions, and cosine encoding is used at odd positions. After the position encoding is added, the front-view image features, the navigation map features, and the randomly initialized path guidance point encoding vectors are spliced together and sent to the Transformer module for fusion.
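Equations (3) and (4) are likewise not legible here; the description (sine at even indices, cosine at odd indices) corresponds to the standard sinusoidal positional encoding, reproduced below with assumed notation (pos for the position, i for the dimension index, d for the vector length):

```latex
% Standard sinusoidal positional encoding; symbols assumed.
\begin{align}
  PE(pos, 2i)   &= \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \\
  PE(pos, 2i+1) &= \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
\end{align}
```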
The Transformer encoder structure learns the feature representation of the input information through multiple stacked submodules. Each submodule includes a multi-headed self-attention (MSA) block, a multi-layer perceptron, and a normalization layer. The structure of the submodules is shown in Figure 3. Multimodal information fusion with the Transformer relies mainly on the self-attention module in this structure, which is expected to find the correlations among the different inputs. The calculation of self-attention (SA) is shown in Equation (5). In this equation, the projection matrices are learnable parameters, the input is the output of the previous layer, and the feature dimension is consistent with d. Multi-head self-attention is an extension of self-attention, introduced mainly because there are many different kinds of correlations among the features. The relationship between multi-headed self-attention and self-attention can be described by Equation (6), where the output projection matrix is a learnable parameter obtained during training.
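A common formulation consistent with the description of Equations (5) and (6), with the learnable projections written explicitly (the exact symbols used in the paper are not recoverable here, so this notation is an assumption):

```latex
% Self-attention and multi-head self-attention in standard form; notation assumed.
% X is the output of the previous layer; W^Q, W^K, W^V, W^O are learnable.
\begin{align}
  \mathrm{SA}(X)  &= \mathrm{softmax}\!\left(
      \frac{(X W^{Q})(X W^{K})^{\mathsf{T}}}{\sqrt{d}}\right) X W^{V} \\
  \mathrm{MSA}(X) &= \mathrm{Concat}\big(\mathrm{SA}_1(X), \ldots, \mathrm{SA}_h(X)\big)\, W^{O}
\end{align}
```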
After the Transformer fusion, n (where n is the number of waypoints) heatmaps are regressed through the multi-layer perceptron network, and the waypoint coordinates to be predicted are finally decoded from the heatmaps.
2.3. Independent Prediction-Fitting Error Evaluation
The uncertainty of random factors, such as the movement of traffic participants and changes in illumination in the autonomous driving scene, makes the predictions of the deep learning model uncertain, and the confidence of the model can provide good assistance for decision making. First, in terms of driving safety, it can trigger automatic warnings: when the model confidence is low, the output may be wrong, and the brakes should be applied in time. Second, it can provide a good reference for fusion in multi-scheme end-to-end autonomous driving. In general, the collaborative decision making of multiple schemes is often better than the decision of a single scheme alone, which requires fusing the results of different schemes, and the confidence of the model provides a good reference for this fusion. Model confidence can also help the progressive optimization of end-to-end visual navigation in the future; for example, during actual use, the model can automatically identify and collect samples with wrong predictions (with low confidence), which are then used to retrain and continuously update the model, so that its performance is gradually optimized.
The idea behind our evaluation module (“independent prediction-fitting error” (IPFE) evaluation) for model confidence is as follows. Our labels are n waypoints on the front-view image, sampled after a quadratic fit to the historical trajectory of the vehicle [9], so these labels are smooth and ordered. The waypoints predicted by a good model should also be smooth and ordered. If there are large jumps between the waypoints that violate this constraint, the result is likely to be wrong. Therefore, the confidence of the model can be reflected by the degree of smoothness and order of the discrete points. This idea is similar to the ensemble methods used for DNN uncertainty evaluation [10], in which multiple different models arbitrate to evaluate uncertainty. In our method, each independently predicted point indicating the guiding direction can be regarded as a referee: if the guiding directions of multiple points are consistent (smooth and ordered), the arbitrated result is reliable.
The output of the model is
n heatmaps, as introduced in
Section 2.1. Among them, one heatmap represents one waypoint, and the predictions of the points are independent of each other. The predicted waypoints are fitted with a quadratic curve, and the fitting error, which measures the degree of disorder of the points, is used to evaluate the confidence of the model. Considering that these waypoints are ordered, and in order to highlight the influence of this order on the result, only the first, middle, and last points are used for the quadratic fitting, and the fitting error of all waypoints to this curve is then calculated. The specific process is as follows:
(1) Calculate Curve Fitting Error
Let the model prediction be the n waypoints whose pixel coordinates are given in the front-view image. A quadratic polynomial is fitted to the data sequence formed by the first, middle, and last waypoints, where the middle index is obtained by rounding n/2 to an integer. Assuming that the fitting function takes the form in Equation (7), the mean square error between the fitted curve and the data sequence is given by Equation (8). According to the extreme-value principle of multivariate functions, the minimum of Equation (8) can be obtained, which yields the fitting function in the minimum mean square error sense. Next, the fitting error between all points and the fitted curve is calculated: all n waypoints are taken, and Equation (8) is used to compute the mean square error between them and the fitted curve, which is the fitting error of all points to the curve.
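A minimal sketch of this fitting-error computation: a quadratic is fitted through the first, middle, and last predicted waypoints, and the mean square error of all n points to that curve is returned as Q. Function names and the choice of the row coordinate as the independent variable are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def fitting_error(waypoints):
    """waypoints: (n, 2) array of predicted pixel coordinates (u, v) in the front-view image."""
    pts = np.asarray(waypoints, dtype=float)
    n = len(pts)
    idx = [0, n // 2, n - 1]                      # first, middle (index rounded down), last points
    # Fit u = a*v^2 + b*v + c through the three selected points; taking the row
    # coordinate v as the independent variable is an assumption of this sketch.
    coeffs = np.polyfit(pts[idx, 1], pts[idx, 0], deg=2)
    # Fitting error Q: mean square error of all n points to the fitted curve.
    residuals = np.polyval(coeffs, pts[:, 1]) - pts[:, 0]
    return float(np.mean(residuals ** 2))
```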
(2) Quantify Model Confidence
The fitting error Q reflects the confidence of the model prediction: the larger Q is, the lower the confidence. When Q exceeds a threshold, the output of the model is considered wrong and the result cannot be adopted, so the confidence is set to 0. The confidence of the model can thus be calculated by Equation (9), where the threshold of the fitting error is obtained by the following method. First, the trained model makes predictions on the training set, the fitting error is calculated for each sample, and each prediction result is labeled with a binary variable indicating whether it is acceptable. Then, candidate threshold values are traversed over a range, and the F1 score of the accepted samples is calculated for each candidate. Finally, the threshold corresponding to the maximum F1 score is taken as the optimal threshold.
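A sketch of this threshold selection, under the assumption that a prediction is accepted when its fitting error Q is below the candidate threshold (consistent with Section 3.3); the candidate range and the linear decay used as a stand-in for Equation (9) are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(q_values, accepted_labels, candidates=np.linspace(0.0, 5000.0, 501)):
    """q_values: fitting error Q per training sample.
    accepted_labels: 1 if the prediction was manually judged acceptable, else 0."""
    best_thr, best_f1 = None, -1.0
    for thr in candidates:
        predicted_ok = (np.asarray(q_values) < thr).astype(int)  # accept when Q is below the threshold
        f1 = f1_score(accepted_labels, predicted_ok)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

def confidence(q, thr):
    """Stand-in for Equation (9): confidence is 0 once Q reaches the threshold,
    otherwise it decays with Q (the exact form in the paper is not recoverable here)."""
    return 0.0 if q >= thr else 1.0 - q / thr
```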
2.4. Multi-Frame Accumulation to Optimize the Output of the Model
The deep model predicts each sample independently, so the paths obtained by this end-to-end scheme in consecutive frames may be inconsistent; in particular, when the model confidence is very low, the prediction result is not reliable. In order to obtain a reliable output for the current frame, we can refer to the results of the previous frames, that is, use the accumulated results of the previous frames as the output of the current frame.
Let the model output at time t be the n waypoints on the front-view image. The 2D-to-3D projection trick [9] is used to project these front-view image points into the vehicle body coordinate system, giving n waypoints in the body frame at time t. The pose transformation matrix from an earlier time to time t is calculated from the vehicle odometry, and the waypoints expressed in the body frame at that earlier time are then transformed into the body frame at time t, as calculated by Equation (10).
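A sketch of the waypoint transformation in Equation (10), assuming a 4×4 homogeneous pose matrix (from the body frame at the earlier time to the body frame at time t) obtained from odometry; variable names are illustrative.

```python
import numpy as np

def transform_waypoints(points_prev, T_prev_to_now):
    """points_prev: (n, 3) waypoints in the body frame at an earlier time.
    T_prev_to_now: 4x4 homogeneous transform from that frame to the body frame at time t."""
    pts_h = np.hstack([points_prev, np.ones((len(points_prev), 1))])  # homogeneous coordinates
    return (T_prev_to_now @ pts_h.T).T[:, :3]                         # waypoints in the current frame
```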
The transformed waypoints are projected back to front-view pixel coordinates to represent the predicted trajectory of the vehicle, a quadratic curve is fitted to them, and a heatmap H is used to represent this curve. H is calculated according to Equation (11), where each entry of H is the value at the corresponding pixel coordinate on the heatmap: for each row, the pixel through which the curve passes is assigned a value of 1, and the values to the left and right of this point decrease according to Equation (11).
After obtaining the trajectory represented by the heatmap, multiple frames can be accumulated. The confidence of the model output differs at each moment; when the model output is wrong (i.e., its confidence is 0), that frame should be discarded. Therefore, a weighted summation is used for the accumulation, with the confidence C of the model as the weight. The cumulative result over the k frames before time t is given by Equation (12), where the accumulated heatmap of the k frames before time t is obtained by weighting the single-frame heatmap of each of these frames by its confidence.
Finally, in each row of the accumulated heatmap, the coordinate of the maximum pixel value greater than 0 is taken as the optimal point, and the optimized waypoints are generated following the label-generation procedure of [9].
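A sketch of the confidence-weighted accumulation of Equation (12) and the per-row decoding described above; names are illustrative and the heatmaps are assumed to be already aligned to the current frame.

```python
import numpy as np

def accumulate_heatmaps(heatmaps, confidences):
    """Weighted sum of the k single-frame trajectory heatmaps (aligned to time t),
    using the confidence C of each frame as its weight (Equation (12))."""
    acc = np.zeros_like(heatmaps[0], dtype=float)
    for H, c in zip(heatmaps, confidences):
        acc += c * H            # frames with confidence 0 contribute nothing
    return acc

def decode_waypoints(acc):
    """For each row, take the column of the maximum value (if > 0) as the optimal point."""
    points = []
    for v, row in enumerate(acc):
        u = int(np.argmax(row))
        if row[u] > 0:
            points.append((u, v))
    return points
```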
3. Experimental Results and Discussion
In this section, firstly, we introduce the evaluation metrics, the experimental system, and the parameters for model training. Then, we compare the effect of GRU output and heatmap output in the simulation environment. In addition, the feasibility and effectiveness of our model confidence assessment method are tested with real off-road environment data. Finally, the performance of our proposed improved model in practical applications is tested.
3.1. Evaluation Metrics and Experimental Environment
This paper uses two metrics to measure the performance of the waypoint prediction algorithm: the Waypoint Average Error (WAE, see Equation (13)) and the Final Waypoint Accuracy (FWA, see Equations (14) and (15)). WAE represents the average error of all predicted waypoints and measures the average accuracy of the model predictions; this indicator is mainly used for fine-grained performance evaluation during model training. FWA stands for the proportion of trajectories whose end point is predicted correctly; it mainly measures whether the trajectory direction is predicted correctly and is a coarse-grained evaluation. Here, the path end point is replaced by the mean of the last three waypoints in the waypoint sequence.
In the formulas, the end-point error is the mean error calculated from the last three waypoints in the waypoint sequence, T represents the number of waypoints, and the predicted waypoints and the labels enter the error terms, respectively. The end-point error threshold is set to 10 in this experiment: when the end-point error is less than the threshold, the waypoint sequence is considered to be correctly predicted, and a prediction score is assigned. The two indicators, WAE and FWA, evaluate the model from two dimensions: the absolute accuracy of the waypoints predicted by the model and the accuracy of the predicted waypoint direction. In practical applications, the direction accuracy mainly evaluates whether the steering behavior of the model at intersections is correct.
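A sketch of the two metrics under the stated conventions (end point taken as the mean of the last three waypoints, end-point error threshold of 10 pixels); the exact forms of Equations (13)-(15) are not legible here, so the error definitions below are assumptions.

```python
import numpy as np

def waypoint_average_error(pred, label):
    """Mean Euclidean error over all predicted waypoints of one sample."""
    return float(np.mean(np.linalg.norm(np.asarray(pred) - np.asarray(label), axis=1)))

def final_waypoint_accuracy(preds, labels, threshold=10.0):
    """Fraction of samples whose end-point error (mean of the last three waypoints)
    is below the threshold."""
    correct = 0
    for p, l in zip(preds, labels):
        end_pred = np.mean(np.asarray(p)[-3:], axis=0)
        end_label = np.mean(np.asarray(l)[-3:], axis=0)
        if np.linalg.norm(end_pred - end_label) < threshold:
            correct += 1
    return correct / len(preds)
```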
The hardware environment for training is a computer equipped with an Intel Xeon(R) Gold 5218R 2.10 GHz CPU and a 24 GB NVIDIA RTX 3090 GPU; the operating system is Ubuntu 18.04, with PyTorch 1.8 and CUDA 11.4. The hardware environment for testing is equipped with an Intel(R) Core(TM) i7-10750H 2.60 GHz CPU and a 6 GB NVIDIA RTX 2060 GPU; the operating system is Ubuntu 18.04. The experimental scenarios are introduced separately below.
3.2. Heatmap and GRU Output Comparison Experiment
The purpose of predicting waypoints with a GRU is to maintain the smoothness of the path, whereas the heatmap-based prediction used in this paper does not enforce this property. Will this make the prediction worse than that of the GRU? To address this concern, we set up the following experiment, carried out in CARLA [25]. The dataset is set up as follows: the training set consists of 13,257 samples collected in Town1, Town5, and Town7, and the test set consists of 3653 samples collected in Town2. This experiment uses only one network for training. The heatmap output of the model in Section 2.2 is replaced with a GRU output [7], with the rest of the structure unchanged, to obtain the GRU-based model.
The performance of two different output modeling schemes is recorded in
Table 1. It is evident from the table that the heatmap output outperforms the GRU output on both metrics, WAE and FWA. To understand the poor performance of the GRU waypoint prediction, we examined the outputs of the two schemes, as shown in Figure 4. Although the overall trend of the near waypoints of the GRU is consistent with the label when turning, there is no strong turning trend at a distance. This is because, when turning, the near waypoints of the label are distributed uniformly, while the lateral distribution of the far waypoints is not uniform; the GRU output, however, forcibly maintains the relative relationship of the points, so both the horizontal and vertical spacings remain uniform, and no obvious steering trend appears in the distance. The heatmap is different: each waypoint is predicted independently, so both near and far points tend toward the labels.
According to this experimental result, it can be concluded that although the heatmap output does not consider or maintain the smoothness of the waypoints, its prediction accuracy is even better than that of the GRU.
3.3. Verification of RREV
To verify the effectiveness and robustness of RREV, we collected experimental data from real off-road scenarios. The road scenes include dirt roads, gravel roads, weed-covered roads, and other complex road conditions, as shown in Figure 5. A total of 33,297 samples were collected, of which 24,608 were used as the training set and 8689 as the test set.
The experimental process is as follows. The model trained on the training set is used to predict the training-set samples, the fitting error Q of each sample is calculated according to the method in Section 2.3, and each prediction result is manually marked as “true” or “false”. Candidate thresholds are traversed over a range, the F1 value of the “true” samples is calculated for each, and the threshold corresponding to the largest F1 value is taken as the best threshold. Finally, the same operation is performed on the test set to obtain the Q values and the “true”/“false” labels, and the selected best threshold is then used to classify the test set.
The result is shown in
Figure 6.
Figure 6b shows that as the threshold increases, the F1 value gradually increases to a maximum and then gradually decreases. This is because, as the threshold increases, more and more “true” samples are correctly classified and the recall gradually increases; after F1 reaches its maximum, further increasing the threshold causes “false” samples to be accepted as “true”, the precision gradually decreases, and the F1 value decreases. According to the results in Figure 6b, the best threshold is 1500; when the test set is evaluated with this threshold, the precision of the test results reaches 93.4%.
We examined the predictions of the model for different Q values, as shown in Figure 7. When the model confidence is high (Q is small), the predicted waypoints are smooth and orderly and agree well with the labels. As the prediction confidence gradually decreases, with Q increasing but still below the threshold, the predicted waypoints become gradually disordered; however, the result does not differ much from the label and can still be accepted. When Q exceeds the threshold, the predicted waypoints are very disordered and the result cannot be accepted; at this point, the prediction of the model is considered to be wrong, and the confidence is 0. These observations conform to the conjecture in Section 2.3. The experiments show that the IPFE method proposed in this paper to quantify model confidence is feasible and effective.
3.4. Ablation Study: Comparing the Effects of Different Improvements
(1) Simulation environment test
The purpose of this experiment is to test the effect of the improved scheme proposed in this paper; we conduct three groups of tests:
One model + one frame: Do not distinguish between “junctions” and “non-junctions”, only use one model to predict all situations, without multi-frame accumulation;
Two models + one frame: Use two models to predict the two cases of “junctions” and “non-junctions”, respectively, without multi-frame accumulation;
Two models + multiple frames: Use two models to predict the two cases of “junctions” and “non-junctions”, respectively, and perform multi-frame accumulation.
All model outputs are heatmaps. The training data for this experiment comprise 13,257 samples in total (the same as in Section 3.2), from Town1, Town5, and Town7 in CARLA. Test data are collected on the two road sections of Town7 shown in Figure 8. Route 1 is mainly composed of “non-junctions”, with few “junctions” data, in order to test the performance of the model along the road. Compared with route 1, more “junctions” data are added to route 2 in order to test the steering performance of the model at intersections. The details of the test data are shown in Table 2. The “junctions” samples account for 5.98% of route 1 but increase to 40.63% of route 2.
The experimental results are shown in
Table 3. In route 1, the FWA of “Two models + one frame” is 2.5% higher than that of “One model + one frame”, but in route 2 the improvement reaches 14.5%. This is because the training-set samples are unbalanced (“non-junctions” data account for the majority), so the model learns the “non-junctions” data much better than the “junctions” data. Since route 1 contains only a small proportion of “junctions” data, the difference between “One model + one frame” and “Two models + one frame” is not obvious. When the proportion of “junctions” in route 2 increases, the total FWA of the single model decreases significantly; however, because the two models can learn the “junctions” data in a targeted manner, the FWA of the dual-model scheme is not greatly affected no matter how much the proportion of “junctions” data increases. For both route 1 and route 2, every metric of the multi-frame scheme is better than that of the single-frame scheme, and this advantage is more obvious in route 2, which has more “junctions” data.
Through observation, we found the results shown in
Figure 9. In the first frame, as the vehicle approaches the intersection, the single-frame prediction of the model turns early, while the multi-frame accumulated result still keeps the same direction as the label; as the vehicle gradually enters the intersection, more and more predictions turn. By the fourth frame, when the vehicle has fully entered the intersection and starts to turn, the multi-frame accumulated result also turns. Multi-frame accumulation can thus solve the problem of single-frame prediction jumps, making the predictions of consecutive frames more coherent and preventing the vehicle from turning early or late at intersections.
(2) Complex road environment test
The above experiments show that the method proposed in this paper yields a clear performance improvement over the previous methods. In order to test its performance on more complex road sections, the following experiments were carried out.
The tests were conducted in a real off-road environment and Town5 in CARLA, respectively, and the complexity of road conditions was reflected through texture changes and light changes. The off-road environment is shown in
Figure 5 of
Section 3.3. The environment in Town5 is shown in
Figure 10, with added variations in weather and lighting. The data settings for training and testing in the two environments are shown in
Table 4; in CARLA, the model was trained in Town1, Town2, and Town7 (where the lighting changes are not obvious) and tested in Town5. The training and testing sites in the off-road environment are the same, but the routes are different. The experimental process is the same as above, and the results are shown in
Table 5.
As shown in
Table 5, in the CARLA environment, “Two models + multiple frames” is still the best and “One model + one frame” the worst. The differences in performance among the three groups are not particularly large because the weather and lighting of Town5 change sharply and did not appear in the training set, so the overall performance decreases. Nevertheless, the dual-model scheme is still better than the single model, and the performance of the model is significantly improved after multi-frame optimization. In the off-road environment, “Two models + one frame” is 10.6% higher than “One model + one frame”: during the mixed training of “junctions” and “non-junctions”, the model overfits to “non-junctions”, but its ability to learn the “junctions” samples improves after separate training. “Two models + multiple frames” is a further 2.3% higher than “Two models + one frame”. The visualization result is shown in
Figure 11. When the prediction of the second frame is wrong, the multi-frame accumulated result can compensate for the erroneous output, so the model is more robust.
The experimental results in the simulation environment and in the more complex environments show that using two models to independently predict the “junctions” and “non-junctions” situations avoids the problem of sample imbalance: the model can be trained more fully, and the steering accuracy of the vehicle at intersections is improved in practical applications. The multi-frame accumulation effectively solves the incoherence and instability of the path caused by jumps in single-frame predictions and prevents the vehicle from turning early or late at intersections.
3.5. Online Testing in Virtual and Real Environments
The above experiments are all offline tests. In order to further verify the reliability and robustness of our improved end-to-end visual navigation scheme, online tests in virtual and real environments are performed here. After decoding the waypoints from the heatmap, the method in [9] is used to control the motion of the vehicle: the waypoints are projected into the vehicle coordinate system to compute the steering value, the vehicle is driven at a fixed speed, and a PID controller is used for lateral control.
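A minimal sketch of such a lateral control loop (heading error toward a look-ahead waypoint in the vehicle frame fed to a PID controller at fixed speed); the gains, time step, and look-ahead choice are assumptions of this sketch, not values from [9].

```python
import numpy as np

class PID:
    def __init__(self, kp=1.0, ki=0.0, kd=0.1, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def steering_from_waypoints(waypoints_vehicle, controller):
    """waypoints_vehicle: (n, 2) points in the vehicle frame (x forward, y left).
    The heading error toward a look-ahead waypoint is fed to the PID controller."""
    look_ahead = waypoints_vehicle[min(4, len(waypoints_vehicle) - 1)]
    heading_err = np.arctan2(look_ahead[1], look_ahead[0])
    return controller.step(heading_err)
```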
We conduct experiments in a virtual environment in CARLA. The real vehicle of the real environment experiment is shown in
Figure 12; its size is 1800 × 1500 × 750 mm (length, width, and height). It is equipped with a wheel encoder and an inertial navigation system for vehicle positioning, and a camera is used to obtain RGB images of the scene. The onboard computer is an embedded NVIDIA Xavier platform, which includes an 8-core NVIDIA Carmel ARMv8.2 64-bit CPU and a 512-core Volta-architecture GPU with 8 streaming multiprocessors.
We train the model on Town1, Town5, and Town7, and test it on Town7. The route is shown in
Figure 13a. We conduct tests in real scenarios on campus, and the road topology and routes are shown in
Figure 13b, and the campus environment is shown in
Figure 13c,d. We conduct comparative experiments on the three schemes of “One model + one frame”, “Two models + one frame”, and “Two models + multiple frames”, test them on the selected routes, respectively, and record the number and location of manual interventions.
The results of the experiment are shown in
Figure 14. First of all, from
Figure 14a,b, it can be seen that, in both the virtual and the real environment, the number of interventions of “One model + one frame” is the largest, and most of them occur at intersections. This is due to the sample imbalance, which leads to insufficient training of the model on intersection data, so the model fails to learn the intersection steering ability well. Secondly, compared with “One model + one frame”, the number of interventions of “Two models + one frame” is greatly reduced, indicating that the dual-model learning strategy can greatly improve the model's ability to turn at intersections. Finally, the numbers of interventions of “Two models + multiple frames” in the virtual and real environments are 0 and 1, respectively, which are fewer than those of “Two models + one frame”, showing that multi-frame accumulation can optimize the steering performance of the model at intersections and improve its reliability. The experimental results are in line with the analysis in
Section 3.4. The results show that, compared with the single model, the dual-model learning strategy enables the vehicle to have better steering ability at intersections, and the accumulation of multiple frames can further optimize the steering performance of the model at intersections.