Carbon Dioxide Emission Reduction-Oriented Optimal Control of Traffic Signals in Mixed Traffic Flow Based on Deep Reinforcement Learning
Abstract
1. Introduction
- (1)
- A CO2-reduction-oriented reward function is proposed, in which CO2 emissions are calculated from the instantaneous fuel consumption of fuel vehicles and the instantaneous energy consumption of electric vehicles.
- (2)
- Two distinct action spaces are designed to accommodate the spatiotemporal differences among intersections, facilitating the multi-objective optimization of CO2 reduction and intersection efficiency.
- (3)
- The dueling double deep Q-network (D3QN) algorithm is employed to optimize intersection signal control. Incorporating vehicle acceleration data into the state space of the signal control model significantly enhances the learning of the reward function.
2. Problem Statement
2.1. Problem Description and Assumptions
2.1.1. Problem Description
- (1)
- State space—S: The state space of an intersection comprises the factors that form its environment, typically encompassing vehicle status and traffic signal phase information, such as the location, speed, turning direction, and queue length of vehicles, as well as the duration and phase of traffic signals.
- (2)
- Action space—A: In traffic signal control optimization, the traffic light is treated as an agent that makes action decisions, namely switching the signal phase, based on the observed state of the intersection.
- (3)
- State transition probability—P: The transition of the intersection state refers to the process in which the signal controller, acting as the agent, observes the intersection state $s_t$ at time $t$, takes action $a_t$, and causes the intersection state to transition to $s_{t+1}$ at time $t+1$. The probability of this transition can be expressed as $P(s_{t+1} \mid s_t, a_t)$, where $s_t$ and $a_t$ represent the intersection state and action at time $t$, respectively, and $s_{t+1}$ represents the intersection state at time $t+1$.
- (4)
- Reward function—R: The reward function in intersection signal optimization is designed to assist the signal light, or the agent, in learning the control policy. Every time the agent executes an action, the environment returns a reward value as feedback to evaluate the effectiveness of the action. Typical reward functions in intersection signal optimization include average queue length, difference in cumulative waiting time between adjacent time steps, difference in cumulative queue length between adjacent time steps, and traffic pressure.
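To make the four MDP components concrete, the following minimal sketch shows how a signal-control environment could expose them to a learning agent. The class and method names (SignalControlMDP, observe, apply_phase, compute_reward) are illustrative assumptions for exposition, not the implementation used in this paper; a concrete implementation would fill the hooks with simulator calls (e.g., SUMO).

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class Transition:
    """One (s, a, r, s', done) tuple of the signal-control MDP."""
    state: np.ndarray       # S: vehicle positions/speeds/accelerations + phase info
    action: int             # A: index of the signal phase to apply next
    reward: float           # R: feedback returned by the environment
    next_state: np.ndarray  # state observed after the phase decision takes effect
    done: bool              # whether the simulation episode has ended


class SignalControlMDP:
    """Skeleton of the agent-environment loop; reset, observe, apply_phase and
    compute_reward are hypothetical hooks to a traffic simulator."""

    def __init__(self, n_phases: int):
        self.n_phases = n_phases

    def reset(self) -> np.ndarray:
        raise NotImplementedError  # start a new episode and return the initial state

    def observe(self) -> np.ndarray:
        raise NotImplementedError  # build the state tensor from detector data

    def apply_phase(self, phase_id: int) -> None:
        raise NotImplementedError  # switch the traffic light to the chosen phase

    def compute_reward(self) -> float:
        raise NotImplementedError  # e.g., negative CO2 emitted during the step

    def step(self, action: int) -> Tuple[np.ndarray, float, bool]:
        self.apply_phase(action)
        next_state = self.observe()
        reward = self.compute_reward()
        done = False  # set True when the episode's last time step is reached
        return next_state, reward, done
```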
2.1.2. Assumptions
- (1)
- Vehicles are categorized into four types based on their driving mechanism and powertrain: human-driven fuel vehicles (HDFV), human-driven electric vehicles (HDEV), connected and automated fuel vehicles (CAFV), and connected and automated electric vehicles (CAEV).
- (2)
- (3)
- V2X communication is assumed to be fully reliable, so congestion, latency, and data loss are not considered.
- (4)
- HDV drivers are assumed to consistently drive in strict adherence to prevailing traffic regulations.
2.2. State Space Definition
- (1)
- Normalized vehicle speed
- (2)
- Normalized vehicle acceleration
- (3)
- Normalized vehicle deceleration
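As a minimal sketch of how these three normalized quantities could be assembled into state-space channels, the snippet below clips speed, acceleration, and deceleration into [0, 1]. The normalization bounds V_MAX, A_MAX, and B_MAX are assumed values for illustration, not the calibrated bounds used in this paper.

```python
import numpy as np

# Assumed normalization bounds (not the paper's calibrated values).
V_MAX = 13.89  # m/s, e.g. a 50 km/h urban speed limit
A_MAX = 3.0    # m/s^2, assumed maximum acceleration
B_MAX = 4.5    # m/s^2, assumed maximum deceleration


def normalize_state(speeds, accels):
    """Return the three normalized channels of the state space:
    speed, acceleration (a > 0), and deceleration (a < 0), each in [0, 1]."""
    speeds = np.asarray(speeds, dtype=float)
    accels = np.asarray(accels, dtype=float)

    speed_ch = np.clip(speeds / V_MAX, 0.0, 1.0)
    accel_ch = np.clip(np.maximum(accels, 0.0) / A_MAX, 0.0, 1.0)
    decel_ch = np.clip(np.maximum(-accels, 0.0) / B_MAX, 0.0, 1.0)
    return np.stack([speed_ch, accel_ch, decel_ch])


# Example with three detected vehicles: one accelerating, one stopped, one braking.
state = normalize_state(speeds=[8.3, 0.0, 12.5], accels=[1.2, 0.0, -2.8])
```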
2.3. Action Space Modeling
- (1)
- Fixed-phase sequence
- (2)
- Adaptive-phase sequence
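The two action-space designs can be summarized by the candidate actions they expose at each decision step. The sketch below assumes a generic four-phase plan for illustration; the paper's exact phase definitions may differ.

```python
# Assumed four-phase plan (NS-through, NS-left, EW-through, EW-left).
PHASES = ["NS_through", "NS_left", "EW_through", "EW_left"]


def fixed_sequence_actions(current_phase: int):
    """Fixed-phase sequence: the agent only decides whether to keep the current
    phase or advance to the next phase in a fixed cyclic order."""
    return [current_phase, (current_phase + 1) % len(PHASES)]


def adaptive_sequence_actions(current_phase: int):
    """Adaptive-phase sequence: the agent may switch to any phase, so the action
    space is the full phase set (current_phase kept only for interface symmetry)."""
    return list(range(len(PHASES)))
```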
2.4. Reward Function Settings
- (1)
- Instantaneous CO2 Emissions from Fuel Vehicles
- (2)
- Instantaneous CO2 Emissions from Electric Vehicles
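A hedged sketch of how such a CO2-oriented reward could be assembled is given below: fuel-vehicle emissions are looked up from VSP-binned rates (in the spirit of the Appendix A table) and electric-vehicle emissions are attributed through a grid electricity emission factor. All numeric constants, bin edges, and the generic VSP coefficients shown are placeholders rather than the calibrated values used in the paper.

```python
import numpy as np

# Placeholder lookup data (NOT the paper's calibrated values).
VSP_BIN_EDGES = np.arange(-30, 31, 5)        # kW/t bin edges (assumed)
CO2_RATE_PER_BIN = np.linspace(0.5, 8.5, 13)  # g/s per VSP bin (assumed)
GRID_CO2_FACTOR = 0.57                        # kg CO2 per kWh of grid electricity (assumed)


def vsp(v, a, grade=0.0):
    """Vehicle specific power (kW/t) with generic light-duty coefficients."""
    return v * (1.1 * a + 9.81 * grade + 0.132) + 0.000302 * v ** 3


def co2_fuel_vehicle(v, a):
    """Instantaneous CO2 (g/s) of a fuel vehicle, read from its VSP bin."""
    idx = np.clip(np.digitize(vsp(v, a), VSP_BIN_EDGES) - 1, 0, len(CO2_RATE_PER_BIN) - 1)
    return float(CO2_RATE_PER_BIN[idx])


def co2_electric_vehicle(power_kw):
    """Instantaneous CO2 (g/s) attributed to an EV via grid electricity."""
    energy_kwh_per_s = max(power_kw, 0.0) / 3600.0
    return energy_kwh_per_s * GRID_CO2_FACTOR * 1000.0


def co2_reward(fuel_states, ev_powers):
    """Negative total CO2 emitted in this control step, so less CO2 = higher reward."""
    total = sum(co2_fuel_vehicle(v, a) for v, a in fuel_states)
    total += sum(co2_electric_vehicle(p) for p in ev_powers)
    return -total
```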
3. Methodology
3.1. D3QN Algorithm
- (1)
- Main Q network parameter update. The update strategy for the main Q network in the D3QN algorithm is similar to that of the DQN, with a target Q network incorporated to mitigate the overestimation issue during training. The main Q network is denoted by $Q(s, a; \theta)$, where $\theta$ represents the corresponding neural network parameters, and the target Q network is denoted by $Q'(s, a; \theta')$, where $\theta'$ denotes the respective neural network parameters. The main Q network is updated via minibatch gradient descent on the temporal-difference (TD) error, and the specific update process can be described as follows:
- (1.1)
- The sample $(s, a, r, s', \mathrm{done})$ is extracted from the experience pool, where $s$ denotes the state, $a$ denotes the action taken based on state $s$, $r$ denotes the reward provided by the environment for taking action $a$, $s'$ denotes the new state resulting from the state transition probabilities after taking action $a$, and $\mathrm{done}$ denotes whether or not the final state has been reached.
- (1.2)
- The double DQN uses the main Q network to calculate the optimal action $a^{\max}$ for the next state $s'$. The formula is as follows:
$$a^{\max} = \arg\max_{a'} Q(s', a'; \theta)$$
- (1.3)
- The target Q network calculates the target value $y$. The formula is as follows:
$$y = r + (1 - \mathrm{done}) \, \gamma \, Q'(s', a^{\max}; \theta')$$
- (1.4)
- Loss function: the squared error between the target value and the main network estimate, $L(\theta) = \left( y - Q(s, a; \theta) \right)^2$.
- (1.5)
- Stochastic gradient descent is applied to this loss with respect to $\theta$.
- (1.6)
- Parameter update
- (2)
- Target Q network parameter update. The target Q network parameters $\theta'$ are periodically updated from the main network parameters $\theta$; with the smoothing coefficient $\tau$, a soft update takes the form $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$.
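The update steps above can be illustrated with a compact PyTorch sketch of a dueling Q network and the double-DQN target computation. The layer sizes and the simple fully connected encoder are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class DuelingQNet(nn.Module):
    """Dueling architecture: shared encoder, then value and advantage streams."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.feature(s)
        v, adv = self.value(h), self.advantage(h)
        # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + adv - adv.mean(dim=1, keepdim=True)


def double_dqn_targets(main_q: nn.Module, target_q: nn.Module,
                       r: torch.Tensor, s_next: torch.Tensor,
                       done: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Steps (1.2)-(1.3): the action is chosen by the main network and its value is
    read from the target network; y = r for terminal transitions."""
    with torch.no_grad():
        a_max = main_q(s_next).argmax(dim=1, keepdim=True)      # argmax_a Q(s', a; theta)
        q_next = target_q(s_next).gather(1, a_max).squeeze(1)   # Q'(s', a_max; theta')
        return r + gamma * q_next * (1.0 - done)
```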
3.2. D3QN Network Model and Algorithmic Process
Algorithm 1: Dueling Double DQN Algorithm
Input: $D$, an empty replay buffer; $\theta$, the initial main Q network parameters; $\theta'$, a copy of $\theta$
Input: $\gamma$, discount factor; $\tau$, smoothing coefficient; main Q network learning rate; $\varepsilon$, exploration rate; replay buffer maximum size; training batch size; $C$, target network update frequency; number of iterations; max training epoch; max time step
for each training epoch do
  Initialize state $s_1$
  for each time step $t$ do
    Observe state $s_t$ and choose action $a_t$ based on the $\varepsilon$-greedy policy
    Execute action $a_t$, observe reward $r_t$ and next state $s_{t+1}$
    Store transition tuple $(s_t, a_t, r_t, s_{t+1}, \mathrm{done}_t)$ in $D$
    Delete the oldest tuple if $D$ exceeds its maximum size
    Sample a minibatch of transitions $(s_j, a_j, r_j, s_{j+1}, \mathrm{done}_j)$ from $D$
    for each sampled transition do
      Define $a^{\max} = \arg\max_{a'} Q(s_{j+1}, a'; \theta)$
      Set $y_j = r_j$ if $\mathrm{done}_j$; otherwise $y_j = r_j + \gamma Q'(s_{j+1}, a^{\max}; \theta')$
    end
    Do a gradient descent step on the loss $\left( y_j - Q(s_j, a_j; \theta) \right)^2$
    Update target network parameters $\theta'$ every $C$ steps
  end
end
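A corresponding training loop, written against the DuelingQNet and double_dqn_targets sketch above and the environment skeleton from Section 2.1.1, could look like the following. The env.reset/env.step interface and all hyperparameter defaults are assumptions for illustration; a soft (Polyak) update with the smoothing coefficient could replace the periodic hard copy shown here.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Assumes DuelingQNet and double_dqn_targets from the previous sketch are in scope.


def train(env, state_dim, n_actions, epochs=100, max_steps=3600,
          gamma=0.99, lr=3e-4, eps=0.1, batch_size=64,
          buffer_size=500_000, target_update=100):
    main_q = DuelingQNet(state_dim, n_actions)
    target_q = DuelingQNet(state_dim, n_actions)
    target_q.load_state_dict(main_q.state_dict())   # theta' <- theta (initial copy)
    optimizer = torch.optim.Adam(main_q.parameters(), lr=lr)
    replay = deque(maxlen=buffer_size)               # oldest tuples dropped when full
    step_count = 0

    for _ in range(epochs):
        s = torch.as_tensor(env.reset(), dtype=torch.float32)
        for _ in range(max_steps):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = int(main_q(s.unsqueeze(0)).argmax(dim=1).item())
            s_next, r, done = env.step(a)
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            replay.append((s, a, r, s_next, float(done)))

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                bs = torch.stack([t[0] for t in batch])
                ba = torch.tensor([t[1] for t in batch]).unsqueeze(1)
                br = torch.tensor([t[2] for t in batch], dtype=torch.float32)
                bs2 = torch.stack([t[3] for t in batch])
                bd = torch.tensor([t[4] for t in batch], dtype=torch.float32)

                y = double_dqn_targets(main_q, target_q, br, bs2, bd, gamma)
                loss = F.mse_loss(main_q(bs).gather(1, ba).squeeze(1), y)  # squared TD error
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            step_count += 1
            if step_count % target_update == 0:
                target_q.load_state_dict(main_q.state_dict())  # hard update every C steps
            s = s_next
            if done:
                break
    return main_q
```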
4. Experiment and Results
4.1. Simulation Environment and Parameter Settings
4.2. Experimental Evaluation and Results Analysis
- 1.
- Considering the increasing penetration rate of connected and automated vehicles, it is essential to investigate the signal control strategies for intersections in a high CAV penetration environment. Therefore, this study sets the CAV penetration rate at and conducts a comparative analysis of the optimization effects of the proposed reward function on signalized intersections under three traffic volume conditions: high, medium, and low.
- 2.
- In real-world scenarios, the penetration rate of CAVs in mixed traffic flow at urban intersections varies continuously over time. Therefore, we set the traffic volume to and compare the optimization effect of the reward function proposed in this paper on signalized intersections under different CAV penetration rates.
- 3.
- Given the spatiotemporal distribution characteristics of traffic flow at signalized intersections, this study conducts a comparative analysis of the optimization effects of the reward function proposed in this paper on signalized intersections under different action space schemes and provides recommendations for the selection of action space schemes.
- 4.
- To validate the feasibility of the algorithm proposed in this paper in real-world intersections, this study tests the robustness of the model in scenarios where traffic accidents occur at intersections.
- 1.
- Scheme 2: The deep reinforcement learning signal timing optimization scheme, referred to as Reward—Wait Time, employs the difference in cumulative waiting time between adjacent time steps as the reward function, formulated as $R_t = W_{t-1} - W_t$, where $W_t$ is the cumulative waiting time of all vehicles at the intersection at time step $t$ (see the sketch after this list).
- 2.
- Scheme 3: The deep reinforcement learning signal timing optimization scheme, referred to as Reward—Queue Length, employs the cumulative queue length as the reward function, formulated as $R_t = -\sum_{l \in L} q_{l,t}$, where $q_{l,t}$ is the queue length on entrance lane $l$ at time step $t$ and $L$ is the set of entrance lanes.
- 3.
- Scheme 4: Fixed signal timing control (FSTC), using the Webster signal timing method to calculate the green light duration of each phase.
- 4.
- Scheme 5: Actuated signal control (ASC), which prolongs the current phase whenever a continuous stream of traffic is detected.
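For comparison with the CO2-oriented reward, the baseline rewards of Schemes 2 and 3 reduce to simple expressions. The sketch below shows one common way to compute them; the sign conventions are assumptions consistent with the formulations above.

```python
def reward_wait_time(prev_total_wait: float, curr_total_wait: float) -> float:
    """Scheme 2 (Reward—Wait Time): difference in cumulative waiting time between
    adjacent time steps; positive when total waiting time decreases."""
    return prev_total_wait - curr_total_wait


def reward_queue_length(queue_lengths) -> float:
    """Scheme 3 (Reward—Queue Length): negative cumulative queue length over all
    entrance lanes, so shorter queues yield a higher reward."""
    return -sum(queue_lengths)
```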
4.2.1. Analysis of Experimental Results under Different Traffic Volume Conditions
4.2.2. Analysis of Experimental Results under Different Penetration Rates of CAVs
4.2.3. Analysis of Experimental Results under Different Action Space Schemes
4.2.4. Analysis of Experimental Results concerning the Robustness of the Algorithm during Traffic Accident Scenarios
5. Conclusions
- (1)
- In various traffic scenarios with high penetration rates of connected and automated vehicles (CAVs), the proposed signal control scheme adopts different control strategies according to the traffic volume, focusing on optimizing CO2 emissions and intersection efficiency. The scheme achieves the best results in the medium-traffic-volume scenario, but its optimization performance diminishes with increasing traffic volume.
- (2)
- Experiment 2 demonstrates that the proposed signal control scheme performs better in scenarios with higher CAV penetration rates. When the penetration rate reaches 90%, the scheme reduces the average waiting time of vehicles at the intersection by 13.72%, 16.68%, 28.88%, and 22.88%, respectively, compared with Schemes 2, 3, 4, and 5. In addition, the optimized average CO2 emissions at the intersection are also reduced by 2.14%, 3.72%, 7.42%, and respectively.
- (3)
- To further reduce CO2 emissions at intersections and improve intersection throughput efficiency, this paper compares fixed-phase sequence and adaptive-phase sequence optimization schemes under different traffic volume conditions and provides recommendations. The fixed-phase sequence action space is suitable for optimizing intersection signals with the Reward—CO2 Reduction scheme in low traffic volume scenarios, while the adaptive-phase sequence action space combined with the Reward—CO2 Reduction scheme performs better in medium to high traffic volume scenarios and maintains good robustness.
- (4)
- Considering the complexity of real-world intersections, this study simulates scenarios of traffic accidents occurring during peak hours to validate the optimization performance and robustness of the model. The experiments indicate that the proposed signal control scheme exhibits desirable optimization capabilities under unexpected conditions. Furthermore, compared with deep reinforcement learning models trained with other reward functions, our proposed model demonstrates enhanced robustness.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Number | ||
---|---|---|
1 | 1.5437 | |
2 | 1.6044 | |
3 | 1.1308 | |
4 | 2.3863 | |
5 | 3.2102 | |
6 | 3.9577 | |
7 | 4.7520 | |
8 | 5.3742 | |
9 | 5.9400 | |
10 | 6.4275 | |
11 | 7.0660 | |
12 | 7.6177 | |
13 | 8.3224 | |
14 | 8.4750 |
References
- Fellendorf, M. VISSIM: A microscopic simulation tool to evaluate actuated signal control including bus priority. In Proceedings of the 64th Institute of Transportation Engineers Annual Meeting, Dallas, TX, USA, 16–19 October 1994; pp. 1–9.
- Mirchandani, P.; Head, L. A real-time traffic signal control system: Architecture, algorithms, and analysis. Transp. Res. Part C Emerg. Technol. 2001, 9, 415–432.
- Lowrie, P. SCATS: A traffic responsive method of controlling urban traffic. In Sales Information Brochure; Roads & Traffic Authority: Sydney, Australia, 1990.
- Mirchandani, P.; Wang, F.-Y. RHODES to intelligent transportation systems. IEEE Intell. Syst. 2005, 20, 10–15.
- Hunt, P.; Robertson, D.; Bretherton, R.; Royle, M.C. The SCOOT on-line traffic signal optimisation technique. Traffic Eng. Control 1982, 23, 190–192.
- Coelho, M.C.; Farias, T.L.; Rouphail, N.M. Impact of speed control traffic signals on pollutant emissions. Transp. Res. Part D Transp. Environ. 2005, 10, 323–340.
- Yao, R.; Sun, L.; Long, M. VSP-based emission factor calibration and signal timing optimisation for arterial streets. IET Intell. Transp. Syst. 2019, 13, 228–241.
- Yao, Z.; Zhao, B.; Yuan, T.; Jiang, H.; Jiang, Y. Reducing gasoline consumption in mixed connected automated vehicles environment: A joint optimization framework for traffic signals and vehicle trajectory. J. Clean. Prod. 2020, 265, 121836.
- Chen, X.; Yuan, Z. Environmentally friendly traffic control strategy—A case study in Xi’an city. J. Clean. Prod. 2020, 249, 119397.
- Lin, H.; Han, Y.; Cai, W.; Jin, B. Traffic signal optimization based on fuzzy control and differential evolution algorithm. IEEE Trans. Intell. Transp. Syst. 2022, 24, 8555–8566.
- Xiao, G.; Lu, Q.; Ni, A.; Zhang, C. Research on carbon emissions of public bikes based on the life cycle theory. Transp. Lett. 2023, 15, 278–295.
- Haitao, H.; Yang, K.; Liang, H.; Menendez, M.; Guler, S.I. Providing public transport priority in the perimeter of urban networks: A bimodal strategy. Transp. Res. Part C Emerg. Technol. 2019, 107, 171–192.
- He, H.; Guler, S.I.; Menendez, M. Adaptive control algorithm to provide bus priority with a pre-signal. Transp. Res. Part C Emerg. Technol. 2016, 64, 28–44.
- Wiering, M.A. Multi-agent reinforcement learning for traffic light control. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, 29 June–2 July 2000; pp. 1151–1158.
- Abdulhai, B.; Kattan, L. Reinforcement learning: Introduction to theory and potential for transport applications. Can. J. Civ. Eng. 2003, 30, 981–991.
- El-Tantawy, S.; Abdulhai, B. An agent-based learning towards decentralized and coordinated traffic signal control. In Proceedings of the 13th International IEEE Conference on Intelligent Transportation Systems, Funchal, Portugal, 19–22 September 2010; pp. 665–670.
- Arel, I.; Liu, C.; Urbanik, T.; Kohls, A.G. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intell. Transp. Syst. 2010, 4, 128–135.
- Genders, W.; Razavi, S. Using a deep reinforcement learning agent for traffic signal control. arXiv 2016.
- Ma, D.; Zhou, B.; Song, X.; Dai, H. A deep reinforcement learning approach to traffic signal control with temporal traffic pattern mining. IEEE Trans. Intell. Transp. Syst. 2021, 23, 11789–11800.
- Li, Z.; Yu, H.; Zhang, G.; Dong, S.; Xu, C.-Z. Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning. Transp. Res. Part C Emerg. Technol. 2021, 125, 103059.
- Lu, L.; Cheng, K.; Chu, D.; Wu, C.; Qiu, Y. Adaptive traffic signal control based on dueling recurrent double Q network. China J. Highw. Transp. 2022, 35, 267.
- Kim, G.; Sohn, K. Area-wide traffic signal control based on a deep graph Q-Network (DGQN) trained in an asynchronous manner. Appl. Soft Comput. 2022, 119, 108497.
- Zhu, Y.; Yin, X.; Chen, C. Extracting decision tree from trained deep reinforcement learning in traffic signal control. IEEE Trans. Comput. Soc. Syst. 2023, 10, 1997–2007.
- Yan, L.; Zhu, L.; Song, K.; Yuan, Z.; Yan, Y.; Tang, Y.; Peng, C. Graph cooperation deep reinforcement learning for ecological urban traffic signal control. Appl. Intell. 2023, 53, 6248–6265.
- Chen, Y.; Zhang, H.; Liu, M.; Ye, M.; Xie, H.; Pan, Y. Traffic signal optimization control method based on adaptive weighted averaged double deep Q network. Appl. Intell. 2023, 53, 18333–18354.
- Ren, A.; Zhou, D.; Feng, J. Attention mechanism based deep reinforcement learning for traffic signal control. Appl. Res. Comput. 2023, 40, 430–434.
- Haddad, T.A.; Hedjazi, D.; Aouag, S. A deep reinforcement learning-based cooperative approach for multi-intersection traffic signal control. Eng. Appl. Artif. Intell. 2022, 114, 105019.
- Kumar, N.; Rahman, S.S.; Dhakad, N. Fuzzy inference enabled deep reinforcement learning-based traffic light control for intelligent transportation system. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4919–4928.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Booth, S.; Knox, W.B.; Shah, J.; Niekum, S.; Stone, P.; Allievi, A. The perils of trial-and-error reward design: Misdesign through overfitting and invalid task specifications. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 5920–5929.
- Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, 27–30 June 1999; pp. 278–287.
- Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by random network distillation. arXiv 2018, arXiv:1810.12894.
- Badia, A.P.; Sprechmann, P.; Vitvitskyi, A.; Guo, D.; Piot, B.; Kapturowski, S.; Tieleman, O.; Arjovsky, M.; Pritzel, A.; Bolt, A.J.; et al. Never give up: Learning directed exploration strategies. arXiv 2020, arXiv:2002.06038.
- Market Analysis Report of China’s Intelligent Connected Passenger Vehicles from January to December 2022; China Industry Innovation Alliance for the Intelligent and Connected Vehicles (CAICV): Beijing, China, 2023.
- New Energy Vehicle Industry Development Plan (2021–2035); China, 2020. Available online: https://www.iea.org/policies/15529-new-energy-vehicle-industry-development-plan-2021-2035 (accessed on 20 October 2020).
- IEA. Global EV Data Explorer; IEA: Paris, France, 2023.
- Genders, W.; Razavi, S. Evaluating reinforcement learning state representations for adaptive traffic signal control. Procedia Comput. Sci. 2018, 130, 26–33.
- Jimenez-Palacios, J.L. Understanding and Quantifying Motor Vehicle Emissions with Vehicle Specific Power and TILDAS Remote Sensing; Massachusetts Institute of Technology: Cambridge, MA, USA, 1998.
- Frey, H.; Unal, A.; Chen, J.; Li, S.; Xuan, C. Methodology for Developing Modal Emission Rates for EPA’s Multi-Scale Motor Vehicle & Equipment Emission System; US Environmental Protection Agency: Ann Arbor, MI, USA, 2002; p. 13.
- Zhao, H. Simulation and Optimization of Vehicle Energy Consumption and Emission at Urban Road Signalized Intersection; Lanzhou Jiaotong University: Lanzhou, China, 2019.
- Yang, S.; Li, M.; Lin, Y.; Tang, T. Electric vehicle’s electricity consumption on a road with different slope. Phys. A Stat. Mech. Its Appl. 2014, 402, 41–48.
- Climate Letter of Approval No. 43; China, 2023. Available online: https://www.mee.gov.cn/xxgk2018/xxgk/xxgk06/202302/t20230207_1015569.html (accessed on 7 February 2023).
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013.
- Christodoulou, P. Soft actor-critic for discrete action settings. arXiv 2019.
- Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1995–2003.
- Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Haydari, A.; Yılmaz, Y. Deep reinforcement learning for intelligent transportation systems: A survey. IEEE Trans. Intell. Transp. Syst. 2020, 23, 11–32.
- Haitao, H.; Menendez, M.; Guler, S.I. Analytical evaluation of flexible-sharing strategies on multimodal arterials. Transp. Res. Part A Policy Pract. 2018, 114, 364–379.
Human-Driven Vehicle (%) | Connected and Automated Vehicle with Level 2 and above Automation Level (%) | |
---|---|---|
Fuel vehicle (%) | 51.4 | 19.6 |
Electric vehicle (%) | 13.7 | 15.3 |
Entrance | Going Straight | Turning Left | Turning Right
---|---|---|---
East entrance | 13.33% | 5.56% | 3.33%
West entrance | 13.33% | 5.56% | 3.33%
South entrance | 16.67% | 6.94% | 4.17%
North entrance | 16.67% | 6.94% | 4.17%
Hyperparameters | Values |
---|---|
Learning rate | 0.0003 |
Discount factor | 0.99 |
Batch size | 64 |
Exploration rate | 0.9 |
Target network update frequency | 100 |
Experience replay buffer | 500,000 |
Smoothing coefficient | 0.05 |
Human-Driven Vehicles (%) | Connected and Automated Vehicles with Level 2 and above Automation Levels (%) | |
---|---|---|
Fuel vehicle (%) | 2.0 | 19.6 |
Electric vehicle (%) | 8.0 | 70.4 |
Human-Driven Vehicle (%) | Connected and Automated Vehicle with Level 2 and above Automation Level (%) | |
---|---|---|
Fuel vehicle (%) | 56.3 | 19.6 |
Electric vehicle (%) | 13.7 | 10.4 |
Human-Driven Vehicle (%) | Connected and Automated Vehicle with Level 2 and above Automation Level (%) | |
---|---|---|
Fuel vehicle (%) | 26.3 | 19.6 |
Electric vehicle (%) | 13.7 | 40.4 |
Scheme | Average CO2 Emissions/g | Average Waiting Time/s |
---|---|---|
Scheme 1—APS | ||
Scheme 1—FPS | ||
Scheme 4 |
Scheme | Average CO2 Emissions/g | Average Waiting Time/s |
---|---|---|
Scheme 1—APS | ||
Scheme 1—FPS | ||
Scheme 4 |
Scheme | Average CO2 Emissions/g | Average Waiting Time/s |
---|---|---|
Scheme 1—APS | ||
Scheme 1—FPS | ||
Scheme 4 |