4.3. Surrounding Vehicles’ Motion Prediction
The parameters of the seq2seq motion prediction network structure are shown in Table 1. The 1612 CF scenarios are utilized for both training and testing the prediction model and are divided into 80% for training, 10% for validation, and 10% for testing. At each time step t, the historical feature sequence within the time window is fed to the model, which then predicts the sequence over the horizon of length H required by the TD3 algorithm. This process repeats at each subsequent time step: the input window shifts forward by a fixed interval until the entire data length of the surrounding vehicles is covered.
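As an illustration of this sliding-window, horizon-H prediction scheme, a minimal encoder–decoder sketch is given below (PyTorch is assumed; the layer types, hidden size, window length, and feature dimension are placeholders, with the actual settings being those listed in Table 1):

```python
import torch
import torch.nn as nn

class Seq2SeqPredictor(nn.Module):
    """GRU encoder-decoder sketch for surrounding-vehicle motion prediction."""
    def __init__(self, n_features=4, hidden=64, horizon=10):   # placeholders, see Table 1
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRU(n_features, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_features)

    def forward(self, history):                 # history: (B, T_h, n_features)
        _, h = self.encoder(history)            # summarize the historical time window
        step = history[:, -1:, :]               # start decoding from the last observation
        outputs = []
        for _ in range(self.horizon):           # roll the decoder forward H steps
            out, h = self.decoder(step, h)
            step = self.proj(out)               # predicted kinematic states at t+k
            outputs.append(step)
        return torch.cat(outputs, dim=1)        # (B, H, n_features)

# Sliding-window usage: at each time step the input window shifts forward by a
# fixed interval until the whole trajectory of the surrounding vehicles is covered.
model = Seq2SeqPredictor()
window = torch.randn(1, 30, 4)                  # e.g., 30 historical steps of 4 features
future = model(window)                          # (1, H, 4) sequence passed to the TD3 agent
```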
The prediction performance of the seq2seq model on the test dataset for a specific scenario is shown in Figure 8. During the online operation of the RL-based CF strategy, it is imperative for the seq2seq model to precisely predict the kinematic states of the surrounding vehicles within the prediction time horizon. As evident from Figure 8, the predicted values of velocity, acceleration, relative distance, and relative velocity are close to the real values. This observation is further corroborated by the mean squared error (MSE) presented in Table 2.
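For reference, the horizon-wise MSE of each predicted variable can be understood in the following form (the exact averaging convention used in Table 2 is not restated here, so this definition is an assumption):

$$\mathrm{MSE}_j = \frac{1}{N H}\sum_{i=1}^{N}\sum_{k=1}^{H}\bigl(\hat{x}^{(i)}_{j,\,t+k}-x^{(i)}_{j,\,t+k}\bigr)^{2},$$

where $j$ indexes velocity, acceleration, relative distance, and relative velocity, $N$ is the number of test windows, and $H$ is the prediction horizon.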
4.4. Validation of TD3 CF Control
For the 1612 extracted CF scenarios, a subset comprising 90% (i.e., 1450) was allocated for training, while the remaining 10% (i.e., 162) was reserved for validation. During training, the agent simulates CF scenarios drawn from the shuffled training data. Once a CF scenario concludes, a new one is randomly selected from the training scenarios, and the agent's state is initialized using the empirical data of the selected scenario. Training is run for 1000 episodes, with each episode corresponding to one CF scenario.
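The episode scheduling described above can be sketched as follows (the agent/environment interface, i.e., the act, store, update, reset, and step methods, is hypothetical and only indicates where scenario sampling and state initialization occur):

```python
import random

def train_td3(agent, env, train_scenarios, n_episodes=1000):
    """Episode scheduling: one randomly chosen CF scenario per episode."""
    scenarios = list(train_scenarios)            # 90% of the 1612 extracted CF scenarios
    random.shuffle(scenarios)
    for _ in range(n_episodes):
        scenario = random.choice(scenarios)      # pick a CF scenario at random
        state = env.reset(scenario)              # initialize from the scenario's empirical data
        done = False
        while not done:                          # the episode ends when the scenario concludes
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            agent.store(state, action, reward, next_state, done)
            agent.update()                       # minibatch update from the replay buffer
            state = next_state
```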
The parameters pertinent to the TD3 model are listed in Table 3. The TD3 network parameters were tuned with the Adam optimizer, relying on randomly sampled minibatches to ensure a comprehensive traversal of the solution space. OU noise, characterized by a mean of 0.15 and a variance of 0.2, was applied to the actions to strengthen exploration of the action space and improve robustness against local optima. This was complemented by Gaussian noise with zero mean and a standard deviation of 0.2 added to the target policy, which smooths the targets and further stabilizes the learning process. The learning rates of the actor and critic networks were chosen to balance convergence speed and learning stability, while a replay buffer of 20,000 transitions and a minibatch size of 256 struck a balance between memory efficiency and sufficient training diversity.
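A sketch of the exploration-noise setup and the stated hyperparameters is given below (how the reported OU mean and variance map onto the process parameters, as well as the learning-rate values, are assumptions; the authoritative values are those in Table 3):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise added to the actor's actions."""
    def __init__(self, dim, mu=0.15, sigma=0.2, theta=0.15, dt=1.0):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.x = np.full(dim, mu, dtype=float)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

# Values stated in the text / Table 3; the learning rates are hypothetical examples.
HYPERPARAMS = dict(
    replay_buffer_size=20_000,
    minibatch_size=256,
    gamma=0.99,                  # discount factor
    tau=0.005,                   # soft-update coefficient
    target_policy_noise=0.2,     # std of the Gaussian smoothing noise on the target policy
    actor_lr=1e-4, critic_lr=1e-3,
)
explore_noise = OUNoise(dim=1)   # reported mean 0.15 and variance 0.2 mapped onto (mu, sigma) here
```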
The discount factor of 0.99 places the emphasis on long-term rewards, and soft updates with a coefficient of 0.005 provide gradual target-network adjustments. The weighting parameters of the reward function were fine-tuned to accentuate safety, comfort, energy efficiency, and overall vehicular efficiency. For instance, the weights in the comfort reward term are configured to penalize high jerk, promoting a driving experience that prioritizes passenger comfort. Similarly, the energy consumption reward term weighs the energy expenditure against the vehicular dynamics. The coefficients within the efficiency reward term balance the relationship between fuel consumption and velocity: the chosen values incentivize fuel-efficient behavior while ensuring adherence to optimal time headways, reinforcing the synergy between efficiency and safety. Lastly, the risk threshold was anchored at 0.25, in line with safety standards and validated through a series of simulations to safeguard against hazardous proximities.
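The composition of the reward terms discussed above can be summarized schematically as follows (the state fields, functional forms, and unit weights are illustrative placeholders rather than the exact reward definitions):

```python
# Schematic composition of the reward terms; tuned coefficients are those in Table 3.
RISK_THRESHOLD = 0.25

def total_reward(risk, jerk, energy_rate, time_headway, desired_headway,
                 w_safe=1.0, w_comf=1.0, w_energy=1.0, w_eff=1.0):
    r_safety = -1.0 if risk > RISK_THRESHOLD else 0.0      # penalize exceeding the risk threshold
    r_comfort = -abs(jerk)                                  # high jerk degrades ride comfort
    r_energy = -energy_rate                                 # instantaneous energy expenditure
    r_efficiency = -abs(time_headway - desired_headway)     # track an optimal time headway
    return (w_safe * r_safety + w_comf * r_comfort
            + w_energy * r_energy + w_eff * r_efficiency)
```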
Given the variable length of CF episodes and the actor network's random exploration, a moving average of the episode rewards with a window size of 50 is utilized to discern reward trends and evaluate the algorithm's performance. The entire training process is repeated for 10 rounds to evaluate the algorithm's convergence. To verify the performance of the proposed method, mainstream RL algorithms, namely DDPG, SAC, and PPO, are implemented for comparison.
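The smoothing and aggregation across rounds can be computed as below (assuming each of the 10 rounds produces a per-episode reward sequence of equal length):

```python
import numpy as np

def moving_average(rewards, window=50):
    """Moving average of per-episode rewards with a window of 50."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode="valid")

def summarize_rounds(rounds, window=50):
    """Mean and standard deviation across the 10 training rounds (cf. Figure 9)."""
    curves = np.stack([moving_average(r, window) for r in rounds])  # equal-length rounds assumed
    return curves.mean(axis=0), curves.std(axis=0)
```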
Figure 9 depicts the evolution of the moving average episode reward during training. Solid colored lines indicate the mean over the multiple rounds, while shaded regions denote the standard deviation around these means. The TD3 agent's moving average episode reward outperforms those of the other agents as well as the reward obtained under human driving actions. The TD3 reward converges around 820, whereas the SAC reward stabilizes around 635. Both the PPO and DDPG algorithms consistently underperform compared with human driving.
TD3’s dual Q-learning mechanism aims to mitigate overestimations of Q-values, while its delayed policy updates ensure infrequent policy modifications. Furthermore, TD3 introduces a policy noise clipping technique, preventing excessive variations during policy updates. In continuous action scenarios like CF, vehicles are expected to respond smoothly and consistently across varying driving scenarios. Any abrupt or substantial policy shifts can lead to suboptimal driving behavior. In comparison, DDPG may suffer from overestimation of Q-values, leading to unstable training. While PPO exhibits commendable performance in discrete action space tasks, it may not be as adept as TD3 or SAC in continuous control tasks. On the other hand, while SAC enhances its policy’s exploratory nature using the maximum entropy principle and is generally effective, its stochastic approach can sometimes lead to undue exploration. This might cause SAC to underperform relative to TD3 in certain driving scenarios. Subsequent analysis will thus center on comparing the TD3 agent’s performance with human driving behavior.
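The clipped double-Q target with target-policy smoothing that underlies this comparison can be sketched as follows (network objects are assumed to exist; this is a generic TD3 target computation, not the authors' exact implementation):

```python
import torch

def td3_target(critic1_t, critic2_t, actor_t, reward, next_state, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target with target-policy smoothing (generic TD3)."""
    with torch.no_grad():
        next_action = actor_t(next_state)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        q1 = critic1_t(next_state, next_action)
        q2 = critic2_t(next_state, next_action)
        target_q = torch.min(q1, q2)             # taking the minimum mitigates Q overestimation
        return reward + gamma * (1.0 - done) * target_q

# The actor is updated only every few critic updates (delayed policy updates),
# which keeps policy modifications infrequent and the learning process stable.
```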
To compare the performance of the TD3 policy with human driving behavior, the quantitative evaluation results are presented in Table 4. The safety check index represents the ratio of the CF scenarios in which the risk indicator surpasses a given threshold to the total number of CF scenarios. The energy consumption index of the SV is calculated based on Equation (27). The other indices are averaged over the whole test dataset.
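A minimal sketch of the safety check index computation is given below (assuming a per-scenario trace of the risk indicator is available; the energy consumption index of Equation (27) is not reproduced here):

```python
def safety_check_index(risk_traces, threshold=0.25):
    """Fraction of CF scenarios whose risk indicator ever exceeds the threshold."""
    unsafe = sum(1 for trace in risk_traces if max(trace) > threshold)
    return unsafe / len(risk_traces)

# Example: three hypothetical scenarios, one of which exceeds the threshold.
print(safety_check_index([[0.1, 0.2], [0.05, 0.3], [0.15, 0.2]]))  # 0.333...
```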
As indicated in Table 4, the penalty imposed during TD3 agent training for exceeding the risk threshold results in a smaller proportion of scenarios surpassing it than in the highD dataset. Additionally, Table 4 highlights that the TD3 agent's average velocity surpasses that of the highD dataset. With a focus on safety, the TD3-controlled agent not only maintains a reduced relative distance to the preceding vehicle but also leaves sufficient acceleration room for the following vehicle. This observation underscores the efficacy of the TD3-controlled agent in enhancing the efficiency of CF behaviors.
Figure 10 compares the performance of the TD3 agent and human driving in a specific scenario. It reveals that the TD3-controlled SV excels in maintaining a steady traffic flow, particularly with respect to the relative velocity between the preceding and following vehicles. This is attributable to the seq2seq model's prediction of the motion states of the surrounding vehicles within the prediction horizon. Such anticipation enables the agent to account for the prospective motion trajectory of the SV when determining the acceleration at each step, thereby avoiding compromises in CF safety and efficiency caused by abrupt accelerations or decelerations of the preceding vehicle. Moreover, the jerk exhibited by the TD3-controlled agent is considerably lower than that of human drivers, which not only enhances ride comfort but also results in a better time headway.
4.5. Platoon Analysis
Given the stable highway conditions in the highD dataset, a platoon analysis consisting of nine vehicles led by one head vehicle is performed to fully exploit the advantages of TD3-controlled agents in multivehicle traffic oscillation scenarios. The optimal velocity model (OVM) is utilized to depict the dynamics of the HDVs, with parameters taken from [24]. The first simulation examines a traffic wave scenario induced by applying a sinusoidal velocity perturbation to the head vehicle around an equilibrium velocity of 15 m/s. Within the platoon, two CAVs are positioned as the fourth and seventh vehicles. Model predictive control (MPC) controllers for the CAVs are implemented according to [24] to compare against the TD3-controlled CAVs. The control objective for the platoon is to stabilize the velocity of each vehicle around 15 m/s while maintaining the intervehicle spacing close to 20 m. The acceleration constraints for the CAVs range from −5 m/s² to 2 m/s², and the intervehicle spacing constraints range from 5 m to 40 m. The results are shown in Figure 11, where the first row represents the scenario in which all vehicles in the platoon are HDVs, while the second and third rows illustrate the performance of the platoon that includes two CAVs controlled by the MPC controller and the TD3 algorithm, respectively.
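A toy reproduction of this traffic-wave setup is sketched below (the optimal-velocity function, its gains, and the perturbation amplitude and frequency are generic placeholders; the actual HDV parameters follow [24], and the CAV controllers are omitted):

```python
import numpy as np

def optimal_velocity(s, v_max=30.0, s_st=5.0, s_go=35.0):
    """Generic OVM spacing policy; at s = 20 m it returns the 15 m/s equilibrium speed."""
    s = np.clip(s, s_st, s_go)
    return v_max / 2.0 * (1.0 - np.cos(np.pi * (s - s_st) / (s_go - s_st)))

def simulate_traffic_wave(T=60.0, dt=0.1, n=9, alpha=0.6, v_eq=15.0, d_eq=20.0):
    """Nine-vehicle all-HDV platoon with a sinusoidal perturbation on the head vehicle."""
    steps = int(T / dt)
    pos = -d_eq * np.arange(n, dtype=float)      # head vehicle at x = 0, 20 m equilibrium gaps
    vel = np.full(n, v_eq)
    v_hist = np.zeros((steps, n))
    for k in range(steps):
        t = k * dt
        vel[0] = v_eq + 2.0 * np.sin(0.5 * t)    # sinusoidal velocity perturbation (amplitude assumed)
        spacing = pos[:-1] - pos[1:]             # gaps to the preceding vehicles
        acc = alpha * (optimal_velocity(spacing) - vel[1:])
        acc = np.clip(acc, -5.0, 2.0)            # acceleration limits, applied to all followers here
        vel[1:] += acc * dt
        pos += vel * dt
        v_hist[k] = vel
    return v_hist                                # velocity trajectories, cf. the first row of Figure 11
```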
Figure 11 illustrates that as the sinusoidal perturbation wave propagates backward through the platoon, its amplitude amplifies, leading to a significant increase in overall fuel consumption and collision risk. In contrast, when the platoon incorporates CAVs controlled by either the MPC or the TD3 algorithm, the amplitude of the traffic shockwave is effectively attenuated, which underscores their capability to mitigate disturbances and stabilize the traffic flow. Furthermore, the quantitative comparison between the MPC controller and the TD3 algorithm is shown in Table 5. It is worth noting that MPC employs precise linearized dynamics around the equilibrium state to design the control input and therefore serves as a valuable benchmark, whereas the TD3 algorithm relies directly on raw data obtained from the environment to perform its self-exploration strategy. As Table 5 indicates, the TD3-controlled CAVs slightly outperform MPC in the noise-perturbed nonlinear traffic system, even in the absence of explicit system knowledge. Specifically, although the CAVs controlled by the TD3 algorithm slightly decrease the platoon's overall average speed compared with those controlled by MPC, the relative velocities and accelerations are reduced, the intervehicle distances remain closer to the equilibrium state, and ride comfort is enhanced. This highlights the capability of TD3-controlled CAVs to achieve a more stable and fuel-efficient traffic flow.
In the second scenario, an emergency braking event is simulated to assess the safety efficacy of the TD3-controlled CAVs. The head vehicle in the platoon initially sustains an equilibrium velocity of 15 m/s for the first second. It then abruptly decelerates at a rate of −5 m/s² for the subsequent 2 s, reaching a lower velocity of 5 m/s, which it maintains for 6 s. Afterward, it accelerates back to its original velocity at a rate of 2 m/s² and maintains that velocity for the remaining time. The results are shown in Figure 12.
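The head vehicle's velocity profile in this scenario can be reconstructed directly from the description above:

```python
import numpy as np

def head_velocity(t):
    """Head-vehicle velocity (m/s) over the emergency-braking maneuver."""
    if t < 1.0:
        return 15.0                               # equilibrium velocity for the first second
    if t < 3.0:
        return 15.0 - 5.0 * (t - 1.0)             # brake at -5 m/s^2 for 2 s
    if t < 9.0:
        return 5.0                                # hold the lower velocity for 6 s
    return min(15.0, 5.0 + 2.0 * (t - 9.0))       # accelerate back at 2 m/s^2, then hold 15 m/s

times = np.arange(0.0, 20.0, 0.1)
profile = np.array([head_velocity(t) for t in times])   # reference input for the platoon simulation
```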
In Figure 12, it is evident that with an all-HDV platoon, significant velocity fluctuations arise due to the braking perturbation of the lead vehicle. However, introducing two CAVs controlled by MPC and the TD3 algorithm results in a significant attenuation of the undesired traffic wave. Upon detecting the lead vehicle's braking, these CAVs promptly decelerate, thereby ensuring a safer following distance. When comparing the performances of the MPC controller and the TD3 algorithm, the TD3 algorithm yields more stable velocity and acceleration profiles, while MPC demonstrates some overshoots. The quantitative assessments presented in Table 6 further substantiate these observations.