Article

Multi-Dimensional Double Deep Dynamic Q-Network with Aligned Q-Fusion for Dual-Ring Barrier Traffic Signal Control

Department of Transportation and Logistics, Dalian University of Technology, Dalian 116024, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1118; https://doi.org/10.3390/app15031118
Submission received: 19 December 2024 / Revised: 9 January 2025 / Accepted: 22 January 2025 / Published: 23 January 2025
(This article belongs to the Section Transportation and Future Mobility)

Abstract

Model-free deep reinforcement learning (MFDRL) is well-suited for real-time traffic signal control (RTSC), as RTSC is a sequential decision problem whose environment is difficult to model a priori but whose performance metrics can readily serve as rewards. Previous studies have not adequately applied MFDRL at typical intersections that use a dual-ring barrier phase structure (DRBPS) and second-by-second signal operation. DRBPS allows phases to time flexibly while satisfying the signal timing constraints of engineering practice, making it complicated yet common in real-world applications. This study proposes an MFDRL method, termed MD4-AQF, to address the RTSC problem under DRBPS. The state can be represented as a 4 × 30 × 4 × 4 array. We define the action based on "decision point aligning" to produce a consistent action space that controls dual-ring concurrent phases simultaneously. We develop a training algorithm based on a "multi-dimensional Q-network" that reduces the number of learnable actions from more than 600 to 52. We design action selection based on "aligned Q-fusion" to end two lagging phases simultaneously with a shared compromise sub-action. In simulation experiments, MD4-AQF trains an agent that improves average vehicle delay from 135 s to 48 s. It surpasses an ablated MFDRL baseline by 14% and a conventional fully actuated method by 19%.

1. Introduction

Model-free deep reinforcement learning (MFDRL) is used to solve sequential decision problems. MFDRL does not rely on environment models pre-determined by humans. Instead, it learns through interactive trial and error, guided by rewards from the environment. This makes MFDRL especially suitable for problems with two characteristics: (1) the environment's dynamics cannot be perfectly modeled a priori; and (2) a series of control decisions can be assessed as a whole with a widely accepted performance metric, from which the reward for MFDRL can be readily defined [1].
Real-time traffic signal control (RTSC) at intersections is a typical sequential decision problem suited to MFDRL. First, the behavior of a vehicle and the reactions of other vehicles to it are both highly stochastic. This complexity makes the dynamics of intersections difficult for humans to model without losing detail. Second, metrics such as vehicle throughput and vehicle delay effectively measure intersection performance over a period of time and can be used to define rewards.
Many previous studies of MFDRL-based RTSC have focused on road networks with multiple intersections. From a traffic engineering perspective, unresolved issues remain at individual intersections. In particular, signals often could not operate within a dual-ring barrier phase structure (DRBPS). This structure organizes phases in a fixed sequence to meet engineering constraints while allowing phases to time independently and flexibly, as described in Section 3. DRBPS is complex but remains the most common structure at real-world intersections [2]. RTSC methods incompatible with DRBPS cannot be widely adopted in urban areas. In addition, some studies did not permit second-by-second adjustment of phase green time; the green time could only be a multiple of a specific number of seconds, which limited the performance potential of those methods.
This study proposes an MFDRL method for RTSC at an individual intersection with DRBPS, termed "multi-dimensional double deep dynamic Q-network with aligned Q-fusion (MD4-AQF)". We deploy a single agent for the intersection that chooses a green time from options at one-second intervals for each phase. We address the following three major challenges:
  • How to define an action and when to select an action for two concurrent phases in different rings. For this challenge, we present an action definition based on “decision point aligning” in Section 4.
  • How to effectively learn hundreds of combinatorial actions for two concurrent phases. For this challenge, we present a training algorithm based on a “multi-dimensional Q-network” in Section 6.2.
  • How to ensure that two concurrent lagging phases end simultaneously. For this challenge, we present an action selection mechanism based on “aligned Q-fusion” in the algorithm in Section 6.3.
The remainder of this paper is organized as follows. Section 2 reviews previous studies and points out their shortcomings. Section 3 describes the problems faced by us and the agent, and the difficulties in solving them. The main contents from Section 4, Section 5 and Section 6 and their roles in the reinforcement learning loop are illustrated in Figure 1.

2. Literature Review

Regarding MFDRL-based RTSC, recent research has made significant contributions to (1) multi-agent communication and cooperation in road networks; (2) transfer learning between intersections with different roadway geometries or traffic demands; and (3) training algorithm modifications for better encoding or decoding of traffic state features.
However, those studies generally overlooked some intractable issues at individual intersections. These issues motivated us to conduct this study. We review them in terms of the following two aspects:
  • Phase structure. Nearly all researchers employed a phase structure incompatible with DRBPS. This limitation inhibited the widespread use of their methods in engineering practice.
  • Action space and training algorithm. Many studies restricted a phase to last only a multiple of a specific number of seconds, such as 5 s or 10 s. This constraint prevented further fine-tuning of green times to better match traffic demand. No training algorithm had yet been developed that fully supports DRBPS with second-by-second signal operation.

2.1. Phase Structure: Incompatibility with DRBPS

Almost all studies deployed one agent per intersection, except for a few that used a regional agent managing the signals across a road network [3,4,5,6,7,8]. Researchers have also started to train multiple intersection agents synchronously. From a traffic engineering perspective, however, a remaining issue is that the phase structure used was often incompatible with DRBPS at each intersection. This incompatibility widened the gap between research and practical engineering applications. Two main causes of this incompatibility can be identified in the literature.
(1) Some studies adopted a variable phase sequence [3,4,5,6,9,10,11,12,13,14,15,16,17,18,19], allowing agents to activate any preferred phase combination at every time step, as shown in Figure 2a. However, an unfixed phase sequence is generally not acceptable in traffic engineering for safety reasons, as detailed in Section 3.
(2) Some studies maintained a fixed phase sequence [7,8,20,21,22,23,24,25,26,27,28,29,30,31,32,33] but took a single-ring structure usually with four or two phases, as seen in Figure 2b. Such a structure is less flexible than DRBPS because it forces multiple vehicle movements with imbalanced demand to have the same green time. As a consequence, single-ring phase structures are not common in the real world [2].
Figure 2. Typical signal operations in previous studies. Solid one-way arrows represent vehicle movements, and two-way arrows represent adjacent parallel pedestrian movements. (a) Signal operation with variable phase sequence. An arbitrary phase or phase combination is chosen to be served every time step (usually a few or tens of seconds). (b) Signal operation in a single-ring phase structure. Four phases are served in cycles with a fixed sequence.
To the best of our knowledge, there were only three studies compatible with DRBPS.
(1) Han, Lee, and Kim [26,27] developed two MFDRL methods based on the deep deterministic policy gradient algorithm. To accommodate DRBPS, they defined a five-dimensional continuous action whose dimensions determined, respectively, the proportion of a cycle assigned to the phases between barriers and the splits for pairs of these phases. However, such actions required a pre-set cycle length, which could not change with the variation of total traffic demand at the intersection. Furthermore, such actions allowed the signal timing to be adjusted only once per cycle, making it difficult to handle fluctuating demand on phases.
(2) Ma et al. [33] proposed an MFDRL method based on the advantage actor–critic algorithm. The authors presented a stage-based diagram with eight phases, defining the action as whether to switch to the next stage. However, such actions led to signal operation within a defective DRBPS: the signals were forced to operate in a pre-set stage order. For example, the order might be "east and west left-turn phases, east through and left-turn phases, east and west through phases". Consequently, the east left-turn phase was always longer than the opposing west one in every cycle, which prevented the phases from timing flexibly based on their own demand.

2.2. Action Space and Training Algorithm

Signal controllers at real intersections primarily employ RTSC methods that allow a phase to last for any number of seconds, subject to certain constraints [2,34]. However, many researchers have restricted this flexibility to ease training by creating a coarse-grained action space. Specifically, they allowed the green time for each phase to be only a multiple of 5 s, 7 s, 10 s, 20 s, etc. [5,6,7,28,29,30,31,32,35,36,37,38,39,40,41,42,43]. In other words, the phase green time could only take one of a few values with large intervals between them, e.g., $7 \cdot \delta$, where $\delta \in \{1, 2, 3, 4, 5\}$. This approach limits the potential of MFDRL methods to achieve greater performance advantages over existing methods that operate on a second-by-second basis.
In scenarios where signal operations follow a fixed phase sequence, "phase duration" was a suitable action definition for MFDRL [7,8,9,10,11,20,21,22,23,24,25,26,27,28,29,30]. It means choosing a green time for the current phase. However, combining the "phase duration" definition with second-by-second signal operation is non-trivial. In the case of a single-ring phase structure, such a definition leads to a discrete action space containing only tens of green time choices, for example, 26 choices at one-second intervals if the minimum and maximum greens are 15 s and 40 s, respectively. But under DRBPS, an agent needs to determine the green time for each of the concurrent phases in two rings. As a result, the action becomes a permutation of green times for two phases, and the number of selectable actions grows quadratically to $26^2 = 676$. Exploring such a large action space is very challenging for an agent.
The deep Q-network (DQN) and its variants are typical value-based MFDRL training algorithms [44]. They remain competitive with newer algorithms, such as soft actor–critic, when handling discrete actions [45,46]. DQN variants were very common in previous RTSC studies. To the best of our knowledge, however, this paper is the first to propose an algorithm that expands the Q-dimension of a DQN variant [25] using the technique described in [47], in order to address the large action space caused by DRBPS with one-second green time intervals.

3. RTSC Problem Within DRBPS and Its Complexities

This study focuses on the most typical and complex case of DRBPS at a four-leg intersection with protected left turns and permitted right turns [2]. As shown in Figure 3, it uses two rings and two barriers to organize eight phases in a signal cycle. The barriers separate the phases on two streets. Between the barriers, any phase in ring 1 can be served together with any phase in ring 2. This flexibility enables phases in different rings to adjust their timing to meet imbalanced demand, making DRBPS the most commonly used structure in real-world settings.
DRBPS involves some signal timing constraints that are fundamental in terms of safety in engineering practice [2,34,48]. They include (1) a fixed phase sequence, (2) simultaneous ending of lagging phases, (3) minimum and maximum green times, and (4) yellow change and red clearance intervals. The first two constraints are the most overlooked in previous studies, yet they are central to this study. We elaborate on both as follows.
  • Fixed phase sequence. The sequence is the order of phases in each ring. It should be pre-determined and remain fixed during certain times of the day. Otherwise, the signals for all movements would turn green or red arbitrarily and unpredictably. This can confuse or even surprise drivers and increase the risks of red-light running and rear-end collisions.
  • Simultaneous ending of two concurrent lagging phases. Given that the two lagging phases precede the same barrier, this constraint ensures both rings cross this barrier together. Otherwise, conflicting vehicle movements will occur, leading to an increased risk of angled collisions. This is because each lagging phase conflicts with both phases that follow the barrier on the intersecting street.
The RTSC problem within DRBPS involves how to cyclically determine the green time of each phase in real-time, subject to the signal timing constraints above. In this study, efforts will be made to define an action and develop a training algorithm to solve this problem. Below, we describe the complexities of the problem facing us.
We need to enable the agent to control two concurrent phases in the dual ring, such as phases 1 and 5. They run in parallel; their green times overlap but can differ. They may start green successively and have different minimum greens. The agent should be capable of (1) making fully independent decisions on the green times of concurrent leading phases to accommodate varying demands, and (2) making partially independent decisions on the green times of concurrent lagging phases to balance their demands and end them together at a trade-off moment.

4. Action Definition with Decision Point Aligning

To accommodate DRBPS, we improve the action defined in [25] for a fixed phase sequence, which originally chooses the remaining green time for the current phase at the end of its minimum green.
We begin with the question of when the agent selects an action. This moment is termed the "decision point". The question is non-trivial in DRBPS. If an action is simply selected whenever any phase reaches its minimum green, as in [25], the action space can be inconsistent at different times, making it hard to fix the semantics of the network outputs. This is because the two concurrent phases in the dual ring may reach their minimum greens either simultaneously or successively. Thus, the agent may sometimes need to make a decision for a single phase (corresponding to a small action space), but sometimes two decisions for two phases (a large combinatorial action space).
We propose a "decision point aligning"-based action definition to address this challenge. It aims to enable the agent to control the two phases with a consistent action space. Meanwhile, it postpones the moment of decision-making as much as possible, so that the agent can base the action selection on the latest state. Specifically, we align the decision points for the two concurrent phases in the dual ring at the earliest moment when either or both phases can terminate. At this point (time step t), the agent determines the green times for the two phases at once, represented as a 2-tuple:
$$A_t = \left\langle a_t^1, a_t^2 \right\rangle,$$
$$\forall k \in \{1, 2\}, \quad a_t^k \in \mathbb{Z} \quad \text{and} \quad 0 \le a_t^k \le G_{\max}^k - G_{\min}^k,$$
where $a_t^k$ denotes the remaining green time for the current phase in the k-th ring; and $G_{\max}^k$ and $G_{\min}^k$ are the maximum and minimum greens for the current phase in the k-th ring, respectively.
The aligned decision point (ADP) and action definition can be best understood via the examples below (see Figure 4).
  • A case of two leading phases, $\Phi_1$ and $\Phi_5$. DRBPS allows the independent termination of each phase, subject only to its own minimum green. For these two phases, the ADP is the moment at which either phase's minimum green expires (i.e., the "ADP1" moment in the figure). It is the last possible moment to decide whether to give this phase time beyond its minimum green. At this point, action $A_t$ represents the remaining green times after the respective ends of the minimum greens of the two phases. $a_t^1$ and $a_t^2$ can be unequal; they denote the lengths of time marked $a_4^1$ and $a_4^2$ in the figure, respectively.
  • A case of two lagging phases, $\Phi_2$ and $\Phi_6$. DRBPS requires the simultaneous termination of the two phases, subject to both of their minimum greens. For these two phases, the ADP is the moment when the minimum greens of both phases have expired (i.e., the "ADP2" moment in the figure). It is the earliest possible moment at which the two phases are allowed to terminate simultaneously. Here, action $A_t$ represents the remaining green times for the two phases after this moment, instead of after the ends of their own minimum greens. $a_t^1$ and $a_t^2$ have to be equal; they denote the lengths of time marked $a_{36}^1$ and $a_{36}^2$ in the figure, respectively.
With decision point aligning, the agent selects actions four times every cycle. An action always determines the green times for two phases. According to the typical values of maximum and minimum greens used in engineering, $a_t^1$ and $a_t^2$ can each have 20–40 possible choices. The result is at least $20^2 = 400$ combinatorial choices of $A_t$. This poses a new challenge of how to enable the agent to learn this many actions. We address it in Section 6.2.
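To make the aligned decision point concrete, the following Python sketch (our own illustration, not the authors' implementation; the phase parameters and helper names are hypothetical) computes the ADP and the admissible remaining-green options for a pair of concurrent phases.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    green_start: int  # time step at which the phase turned green
    g_min: int        # minimum green (s)
    g_max: int        # maximum green (s)
    lagging: bool     # True if the phase immediately precedes a barrier

def aligned_decision_point(p1: Phase, p2: Phase) -> int:
    """Earliest time step at which either (leading case) or both (lagging case)
    of the two concurrent phases are allowed to terminate."""
    end_min_1 = p1.green_start + p1.g_min
    end_min_2 = p2.green_start + p2.g_min
    if p1.lagging and p2.lagging:
        # Lagging phases must end together, so wait until both minimum greens expire.
        return max(end_min_1, end_min_2)
    # Leading phases may end independently: decide when the first one may end.
    return min(end_min_1, end_min_2)

def remaining_green_options(p: Phase) -> range:
    """Sub-action choices satisfying the constraint 0 <= a^k <= G_max^k - G_min^k."""
    return range(0, p.g_max - p.g_min + 1)

# Hypothetical leading phases in ring 1 and ring 2 that turned green 3 s apart.
ring1 = Phase(green_start=0, g_min=5, g_max=30, lagging=False)
ring2 = Phase(green_start=3, g_min=15, g_max=40, lagging=False)
print(aligned_decision_point(ring1, ring2))            # 5
print(len(remaining_green_options(ring1)),             # 26 options per ring,
      len(remaining_green_options(ring2)))             # so 26 x 26 combinations
```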

5. Reward and State Definitions

The definitions of (1) reward, (2) state, and (3) neural network structure are similar to those in [25], except for a few modifications. We describe them as follows:
(1) Reward is defined as the normalized total number of vehicles departing from the intersection between two consecutive time steps:
$$R_t = \frac{1}{L_N} \sum_{l=1}^{L_N} VS_{t,t+1}^l,$$
where $R_t$ is the reward for time step t; $VS_{t,t+1}^l$ is the count of vehicles crossing the stop line on the l-th approach lane between time steps t and t + 1; and $L_N$ is the number of approach lanes at the intersection.
The agent follows the standard objective in MFDRL, which is to maximize the expectations of cumulative discounted rewards:
$$\mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \right],$$
where $\gamma \in (0, 1]$ is the discount factor. This objective amounts to encouraging the agent to serve more vehicles as early as possible to lower the vehicle delay at the intersection [25].
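As a concrete illustration (not the authors' code), the minimal Python sketch below computes the reward defined above from per-lane stop-line counts and accumulates a discounted return; the lane count and the detector readings are hypothetical placeholders.

```python
import numpy as np

def reward(stopline_counts: np.ndarray) -> float:
    """Normalized number of vehicles that crossed the stop lines of all L_N
    approach lanes between two consecutive time steps (the reward defined above)."""
    return float(stopline_counts.sum()) / stopline_counts.size

def discounted_return(rewards: list, gamma: float = 0.99375) -> float:
    """Cumulative discounted reward, truncated to the rewards observed so far."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical one-second stop-line counts for a 16-approach-lane intersection.
counts = np.array([0, 1, 0, 2, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 0])
r_t = reward(counts)
print(r_t, discounted_return([r_t] * 10))
```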
(2) To construct the state representation shown in Figure 5, we take a 120-meter-long portion of each intersection leg as the detection zone. It is divided into grids that are 4 m in length and are as wide as the corresponding lane widths. The state feature of a grid is a 4-tuple if it is on the approach lanes for a phase, or a 2-tuple if it is upstream from approach lanes:
$$f = \begin{cases} (o, v, g, c), & \text{for approach lanes}, \\ (o, v), & \text{otherwise}, \end{cases}$$
where o, v, g, and c denote the presence of the vehicle on the grid, the normalized speed of the vehicle, whether the corresponding phase is green, and whether the phase is a lagging phase, respectively.
The state features of the grids on the approach lanes for each phase, and those of the other grids upstream from approach lanes, are separately organized as image-like state arrays, as follows:
$$s = \begin{bmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,W} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,W} \\ \vdots & \vdots & \ddots & \vdots \\ f_{H,1} & f_{H,2} & \cdots & f_{H,W} \end{bmatrix},$$
where $f_{h,w}$ is the state feature in the h-th row and w-th column; and H and W are the numbers of rows and columns in the corresponding state array, respectively.
Lastly, the state at time step t is represented as a multi-element tuple that includes all the state arrays for the intersection:
$$S_t = \left( s_{\mathrm{LT}}^1, s_{\mathrm{TH}}^1, s_{\mathrm{US}}^1, s_{\mathrm{LT}}^2, \ldots, s_{\mathrm{US}}^B \right),$$
where $s_{\mathrm{LT}}^b$ and $s_{\mathrm{TH}}^b$ are the state arrays for the approach lanes on the b-th intersection leg for the left-turn and through phases, respectively; $s_{\mathrm{US}}^b$ is the state array for the lanes upstream of the approach lanes on the b-th intersection leg; and B is the number of legs at the intersection. For the test-bed signalized intersection in our experiments, the state can be represented as a high-dimensional array with $4 \cdot 30 \cdot 4 \cdot 4 = 1920$ elements.
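To make the state encoding concrete, the sketch below (our own illustration; the grid dimensions follow the text, while the occupancy and speed readings are random placeholders) builds one image-like array for a four-lane approach using the 4-tuple feature f = (o, v, g, c).

```python
import numpy as np

GRID_LEN_M, ZONE_LEN_M = 4, 120      # 4 m grids over the 120 m detection zone
H = ZONE_LEN_M // GRID_LEN_M         # 30 grid rows along each lane
W = 4                                # assumed: 4 approach lanes for one phase

def approach_state_array(presence, speed_norm, phase_is_green, phase_is_lagging):
    """Build one H x W x 4 image-like array of features f = (o, v, g, c)
    for the approach lanes of a phase; `presence` and `speed_norm` are
    H x W arrays obtained from vehicle detection."""
    s = np.zeros((H, W, 4), dtype=np.float32)
    s[..., 0] = presence                  # o: a vehicle occupies the grid
    s[..., 1] = speed_norm                # v: normalized vehicle speed
    s[..., 2] = float(phase_is_green)     # g: the corresponding phase is green
    s[..., 3] = float(phase_is_lagging)   # c: the phase is a lagging phase
    return s

# Random placeholder detections for one left-turn approach.
rng = np.random.default_rng(0)
occupancy = rng.integers(0, 2, size=(H, W))
speeds = rng.random((H, W)) * occupancy
s_lt = approach_state_array(occupancy, speeds, phase_is_green=True,
                            phase_is_lagging=False)
print(s_lt.shape)   # (30, 4, 4); S_t gathers one such array per phase approach
```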
(3) Each state array is input into a convolutional neural network with ReLU activations. Its first three layers convolve 32, 64, and 64 filters, respectively, with stride 1 and zero padding. The filter size is 3 × 3, except when the state array corresponds to a single lane, in which case the filter size is 3 × 1. The convolution results for all the state arrays are concatenated into a vector. Unlike in [25], we feed this vector into several branches of fully connected layers (see Figure 5). Each branch consists of a layer with 128 units and a layer outputting multi-dimensional sub-action values. This modification is required by the training algorithm proposed in the next section.
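A minimal PyTorch sketch of this network structure is given below, assuming a 3 × 3 filter, stride 1, and zero padding for every array; the number of state arrays, the input shape, and the class name BranchedQNetwork are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BranchedQNetwork(nn.Module):
    """Convolutional encoder per state array plus one fully connected branch
    per sub-action dimension (a sketch of the structure described above)."""
    def __init__(self, n_arrays: int, in_ch: int = 4, h: int = 30, w: int = 4,
                 n_dims: int = 2, n_sub_actions: int = 26):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Flatten(),
            ) for _ in range(n_arrays)
        ])
        feat_len = n_arrays * 64 * h * w
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_len, 128), nn.ReLU(),
                          nn.Linear(128, n_sub_actions))
            for _ in range(n_dims)
        ])

    def forward(self, arrays):
        z = torch.cat([enc(x) for enc, x in zip(self.encoders, arrays)], dim=1)
        # Returns sub-action values of shape (batch, n_dims, n_sub_actions).
        return torch.stack([branch(z) for branch in self.branches], dim=1)

net = BranchedQNetwork(n_arrays=3)                 # e.g., LT, TH, and US arrays of one leg
xs = [torch.zeros(1, 4, 30, 4) for _ in range(3)]  # dummy state arrays
print(net(xs).shape)                               # torch.Size([1, 2, 26])
```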
Data availability for reward and state: The reward defined in this study relies on monitoring the number of vehicles crossing stop lines. The data can be easily collected using conventional traffic detection devices, such as underground inductive loop detectors. The defined state requires periodic measurement of vehicle locations and speeds near an intersection. These data can be captured through advanced video or radar detection systems, as well as emerging internet-of-vehicle or vehicle-to-infrastructure technologies.

6. Training Algorithm with Multi-Dimensional Aligned Q-Fusion

In this section, we introduce the training algorithm for MD4-AQF. It is modified from the double deep dynamic Q-network (D3ynQN) [25] through four steps to better accommodate the RTSC problem within DRBPS. The first step involves recalculating the D3ynQN’s parameter n for DRBPS. The second involves generalizing D3ynQN from uni-dimensional action cases to multi-dimensional ones. The third involves fusing multi-dimensional Q values for lagging phases. The fourth involves deriving the formula for a bootstrapping-based value update in the algorithm.

6.1. Dynamic N-Step Update

D3ynQN is a DQN-style training algorithm proposed in [25]. It has already been used in solving a single-ring RTSC problem with a fixed phase sequence. In this section, we introduce and improve D3ynQN to render it suitable for our action definition.
We follow D3ynQN's suggestion of setting a constant time-step size of one second. This ensures that the rewards are discounted evenly as time goes on, reflecting the uncertainty arising from the time-varying RTSC environment [25]. Consequently, actions are selected every several time steps rather than every time step, which decouples the action step from the time step. We therefore add another subscript i to the action:
$$A_{t,i} = \left\langle a_{t,i}^1, a_{t,i}^2 \right\rangle,$$
to denote the action step, i.e., the i-th action selection.
Next, we conduct the dynamic n-step temporal-difference bootstrapping derived from D3ynQN. It adjusts the interval parameter n to match the time span between two actions, so that an action value can be bootstrapped and updated using the next action's value.
We begin by recalculating the aforementioned parameter n for DRBPS, which should equal the number of time steps between $A_{t,i}$ and its next action [25]. We add the subscript t, writing $n_t$, as it is dynamic and depends on $A_{t,i}$. According to the definition of the action in Section 4, $n_t$ is given by the following:
$$n_t = \begin{cases} \max\left( G_{\min}^1 - G_{\min}^2 + GTG_t^1,\ GTG_t^2 \right), & \text{if current phases are leading and } G_{\min}^1 \ge G_{\min}^2; \\ \max\left( GTG_t^1,\ G_{\min}^2 - G_{\min}^1 + GTG_t^2 \right), & \text{if current phases are leading and } G_{\min}^1 < G_{\min}^2; \\ \min\left( GTG_t^1,\ GTG_t^2 \right), & \text{if current phases are lagging}, \end{cases}$$
where $GTG_t^k$ is the time from the end of the minimum green of the current phase (at time step t) to the end of the minimum green of the next phase in the k-th ring:
$$GTG_t^k = a_{t,i}^k + YC^k + RC^k + NG_{\min}^k.$$
Note that $G_{\min}^k$ and $NG_{\min}^k$ are the minimum greens for the current phase and its next phase in the k-th ring, respectively; $YC^k$ and $RC^k$ are the yellow change and red clearance intervals for the current phase in the k-th ring, respectively. Some examples of the time lengths represented by $n_t$ are visually shown in Figure 4.
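The following Python sketch (our own illustration; the helper names and the timing values in the example are hypothetical) evaluates the piecewise definition of $n_t$ above together with the GTG quantity.

```python
def gtg(a_k: int, yc_k: int, rc_k: int, ng_min_k: int) -> int:
    """GTG_t^k: time from the end of the current phase's minimum green to the
    end of the next phase's minimum green in ring k."""
    return a_k + yc_k + rc_k + ng_min_k

def n_steps(a1, a2, g_min, yc, rc, ng_min, lagging: bool) -> int:
    """Dynamic bootstrap interval n_t following the piecewise definition above.
    Tuple arguments are indexed (ring 1, ring 2); all timings are in seconds."""
    gtg1 = gtg(a1, yc[0], rc[0], ng_min[0])
    gtg2 = gtg(a2, yc[1], rc[1], ng_min[1])
    if lagging:
        return min(gtg1, gtg2)
    if g_min[0] >= g_min[1]:
        return max(g_min[0] - g_min[1] + gtg1, gtg2)
    return max(gtg1, g_min[1] - g_min[0] + gtg2)

# Hypothetical leading phases with unequal minimum greens and chosen extensions.
print(n_steps(a1=8, a2=12, g_min=(5, 15), yc=(3, 3), rc=(2, 2),
              ng_min=(15, 5), lagging=False))
```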
With $n_t$ known, we can write down the general loss formula used to update the action value $Q_\theta(S_t, A_{t,i})$ at time step t, given by the following:
$$L_t = \left[ RS_t + \gamma^{(n_t)} QB_t - Q_\theta\left( S_t, A_{t,i} \right) \right]^2,$$
where $RS_t$ is the sum of discounted rewards over the subsequent $n_t$ time steps:
$$RS_t = \sum_{k=t}^{t+n_t-1} \gamma^{k-t} R_k,$$
and $QB_t$ is the bootstrap action value after the $n_t$ time steps:
$$QB_t = Q_{\theta^-}\left( S_{t+n_t},\ \underset{A_{t+n_t,i+1}}{\arg\max}\, Q_\theta\left( S_{t+n_t}, A_{t+n_t,i+1} \right) \right).$$
Note that $\theta$ and $\theta^-$ are the weights of the behavior and target Q-networks, respectively. In the following subsections, we propose further improvements to the loss formula (10) to better accommodate DRBPS.
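As a hedged illustration of the n-step return $RS_t$ (Equation (11)) and the double-DQN bootstrap value $QB_t$ (Equation (12)), the PyTorch sketch below processes a small batch of transitions; for simplicity it assumes all samples in the batch share the same $n_t$, which is not required by the algorithm.

```python
import torch

def n_step_return(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """RS_t: discounted sum of the one-second rewards collected during the
    n_t time steps between two consecutive decision points."""
    n_t = rewards.shape[-1]
    discounts = gamma ** torch.arange(n_t, dtype=rewards.dtype)
    return (rewards * discounts).sum(dim=-1)

def double_dqn_bootstrap(q_behavior_next: torch.Tensor,
                         q_target_next: torch.Tensor) -> torch.Tensor:
    """QB_t: the target network evaluates the action that the behavior
    network would select in state S_{t+n_t}."""
    best = q_behavior_next.argmax(dim=-1, keepdim=True)
    return q_target_next.gather(-1, best).squeeze(-1)

# Two hypothetical transitions with n_t = 5 and 676 combinatorial actions.
rs = n_step_return(torch.rand(2, 5), gamma=0.99375)
qb = double_dqn_bootstrap(torch.rand(2, 676), torch.rand(2, 676))
print(rs.shape, qb.shape)   # torch.Size([2]) torch.Size([2])
```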

6.2. Multi-Dimensional Q-Network

D3ynQN and other DQN-style algorithms typically configure the Q-network to output a uni-dimensional vector, with each unit giving the value of a specific action. Under DRBPS, unfortunately, there will be hundreds of such actions (see Section 4), each representing a possible permutation of the remaining green times for the two concurrent phases in the dual ring. It is very difficult for the agent to directly learn this many individual actions [47].
To address this challenge, we propose the "multi-dimensional Q-network". It improves D3ynQN via the dimensionality expansion technique explored in [47], aimed at reducing the number of network outputs by an order of magnitude. We begin by treating each element of action $A_{t,i}$ as a "sub-action" $a_{t,i}^k$, where $k \in \{1, 2\}$. We modify the network from directly outputting action values $Q_\theta(S_t, A_{t,i})$ to a branch structure that outputs multi-dimensional sub-action values (see Figure 5):
$$Q_\theta^d\left( S_t, a_{t,i}^d \right),$$
where $d \in \{1, 2, \ldots, D\}$ denotes the sub-action dimension. Each dimension has E sub-action choices.
In our case with DRBPS, the number of dimensions $D = 2$ corresponds to the two rings, and the number of sub-actions per dimension $E = 26$ corresponds to the remaining green time options for the current phase in each ring. By switching to multi-dimensional outputs, the network only needs to estimate $26 \cdot 2 = 52$ sub-action values across the two dimensions, instead of the original $26^2 = 676$ values of combinatorial actions. We have the agent separately select the sub-action with the maximum value in each dimension:
$$a_{t,i}^1 = \underset{a_{t,i}^{1,e}}{\arg\max}\, Q_\theta^1\left( S_t, a_{t,i}^{1,e} \right), \qquad a_{t,i}^2 = \underset{a_{t,i}^{2,e}}{\arg\max}\, Q_\theta^2\left( S_t, a_{t,i}^{2,e} \right).$$
Then, the selected sub-actions in the two dimensions compose the action $A_{t,i} = \left\langle a_{t,i}^1, a_{t,i}^2 \right\rangle$.
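A minimal sketch of the per-dimension greedy selection in Equation (14) is shown below, assuming the branched network returns a tensor of shape (batch, D, E); the example values are hypothetical.

```python
import torch

def select_leading(q_values: torch.Tensor) -> torch.Tensor:
    """Equation (14): independent argmax in each sub-action dimension.
    q_values has shape (batch, D, E); the result has shape (batch, D) and
    holds the chosen remaining green time (in seconds) for each ring."""
    return q_values.argmax(dim=-1)

q = torch.rand(1, 2, 26)      # hypothetical Q^1 and Q^2 outputs, 26 options each
print(select_leading(q))      # e.g., tensor([[ 7, 19]])
```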

6.3. Aligned Q-Fusion

As proposed in the previous subsection, the multi-dimensional Q-network determines the phase green times for two rings independently. This makes it more challenging to ensure that two concurrent lagging phases end simultaneously, as required by DRBPS. In other words, it raises the challenge of how to ensure that the agent selects the same sub-action for each of the two dimensions in such a situation.
We propose "aligned Q-fusion" to address this challenge. Let $a_{t,i}^{d,e}$ denote a specific choice of sub-action $a_{t,i}^d$ for dimension d, and let $Q_\theta^{d,e}(S_t, a_{t,i}^{d,e})$ denote its value, where $e \in \{0, 1, \ldots, E-1\}$ is the ordinal number of this choice within its dimension. Then, $Q_\theta^{d,e}(S_t, a_{t,i}^{d,e})$ is, by definition, the agent's estimate of the future rewards conditioned on extending the green time of the current phase in the d-th ring by e seconds.
We first align the sub-actions and their values across dimensions, as in Figure 6. If the current phases are lagging, we calculate the "fused" value $FQ_{t,i,e}$ for each aligned sub-action $a_{t,i,e}$. It equals the mean of the values of the sub-actions with the same ordinal across all dimensions:
$$FQ_{t,i,e}\left( S_t, a_{t,i,e} \right) = \frac{1}{D} \sum_{d=1}^{D} Q_\theta^{d,e}\left( S_t, a_{t,i}^{d,e} \right).$$
The resulting $FQ_{t,i,e}(S_t, a_{t,i,e})$ represents the average expected return of extending the same green time for the two concurrent phases. We then take the aligned sub-action $sa_{t,i}$ that maximizes this fused value, as follows:
$$sa_{t,i} = \underset{a_{t,i,e}}{\arg\max}\, FQ_{t,i,e}\left( S_t, a_{t,i,e} \right).$$
For lagging phases, we make $sa_{t,i}$ shared by all sub-action dimensions (i.e., the two rings). Thus, the action becomes $A_{t,i} = \left\langle sa_{t,i}, sa_{t,i} \right\rangle$. Essentially, $sa_{t,i}$ seeks the best compromise between the two lagging phases, given that DRBPS requires them to end simultaneously. For leading phases, we still take the best sub-action in each dimension independently, as usual. In summary, the selected sub-action per dimension is as follows:
$$\begin{cases} a_{t,i}^1 = a_{t,i}^2 = sa_{t,i}, & \text{if current phases are lagging}; \\ a_{t,i}^k = \underset{a_{t,i}^{k,e}}{\arg\max}\, Q_\theta^k\left( S_t, a_{t,i}^{k,e} \right) \ \text{for } k \in \{1, 2\}, & \text{if current phases are leading}. \end{cases}$$
The above Equation (17) in effect rewrites Equation (14) by embedding the aligned Q-fusion into it.
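The sketch below (our own illustration, not the authors' code) combines Equations (14)–(17): it fuses the two rings' sub-action values by averaging when the current phases are lagging, and otherwise takes the per-ring argmax.

```python
import torch

def select_sub_actions(q_values: torch.Tensor, lagging: bool) -> torch.Tensor:
    """Greedy sub-action per ring for leading phases, or one shared compromise
    sub-action via aligned Q-fusion for lagging phases (Equations (14)-(17)).
    q_values: multi-dimensional sub-action values of shape (batch, D, E)."""
    if lagging:
        fused = q_values.mean(dim=1)      # Equation (15): average across rings
        shared = fused.argmax(dim=-1)     # Equation (16): best shared option
        return shared.unsqueeze(-1).expand(-1, q_values.shape[1])
    return q_values.argmax(dim=-1)        # Equation (14): independent per ring

q = torch.rand(1, 2, 26)
print(select_sub_actions(q, lagging=True))    # the same index repeated for both rings
print(select_sub_actions(q, lagging=False))   # possibly different indices
```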

6.4. Bootstrap Action Value

Now, we adapt the bootstrap action value $QB_t$ in Equation (12) for multi-dimensional sub-actions. As recommended in [47], we set a global $QB_t$ used by all sub-action dimensions. It is the mean over dimensions of the bootstrap value of the sub-action in each dimension, given by the following:
$$QB_t = \begin{cases} \dfrac{1}{D} \displaystyle\sum_{d=1}^{D} Q_{\theta^-}^d\left( S_{t+n_t}, sa_{t+n_t,i+1} \right), & \text{if next phases are lagging}; \\ \dfrac{1}{D} \displaystyle\sum_{d=1}^{D} Q_{\theta^-}^d\left( S_{t+n_t},\ \underset{a_{t+n_t,i+1}^{d,e}}{\arg\max}\, Q_\theta^d\left( S_{t+n_t}, a_{t+n_t,i+1}^{d,e} \right) \right), & \text{if next phases are leading}. \end{cases}$$
Then, we have a loss function modified from Equation (10). It takes the mean squared error across sub-action dimensions to form an aggregated scalar loss $L_t$, from which the gradients are backpropagated to all the values of the selected sub-actions:
$$L_t = \frac{1}{D} \sum_{d=1}^{D} \left[ RS_t + \gamma^{(n_t)} QB_t - Q_\theta^d\left( S_t, a_{t,i}^d \right) \right]^2.$$
We need to optimize $\mathcal{L}$, the expectation of the per-sample loss over the replay buffer $\Omega$. It is estimated using the mean of $L_t$ over a mini-batch of M samples drawn uniformly from $\Omega$. This adds another averaging operator to Equation (19), as follows:
$$\mathcal{L} = \mathbb{E}_{\left( S_t, A_{t,i}, RS_t, S_{t+n_t} \right) \sim \Omega}\left[ L_t \right] = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{D} \sum_{d=1}^{D} \left[ RS_t + \gamma^{(n_t)} QB_t - Q_\theta^d\left( S_t, a_{t,i}^d \right) \right]^2,$$
where m denotes the ordinal number of a sample in the mini-batch.
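A hedged PyTorch sketch of the aggregated loss in Equation (20) is shown below; the tensor shapes and the function name md4_loss are assumptions, and the target term is detached, as is customary when the bootstrap value comes from a target network.

```python
import torch

def md4_loss(q_sel: torch.Tensor, rs: torch.Tensor, qb: torch.Tensor,
             gamma: float, n_t: torch.Tensor) -> torch.Tensor:
    """Mini-batch loss of Equation (20).
    q_sel: (M, D) values Q^d(S_t, a^d_{t,i}) of the selected sub-actions;
    rs:    (M,)   n-step returns RS_t;
    qb:    (M,)   global bootstrap values QB_t;
    n_t:   (M,)   dynamic bootstrap intervals."""
    target = (rs + gamma ** n_t * qb).detach()   # one shared target per sample
    td = target.unsqueeze(-1) - q_sel            # broadcast over the D dimensions
    return td.pow(2).mean()                      # mean over D, then over the batch

# Hypothetical mini-batch of M = 64 samples with D = 2 sub-action dimensions.
M, D = 64, 2
q_sel = torch.rand(M, D, requires_grad=True)
loss = md4_loss(q_sel, rs=torch.rand(M), qb=torch.rand(M),
                gamma=0.99375, n_t=torch.randint(5, 40, (M,)).float())
loss.backward()
print(loss.item())
```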
Given the formula above, we can eventually sum up the training algorithm for MD4-AQF in a pseudo-code, as shown in Algorithm 1.  
Algorithm 1: Training algorithm for MD4-AQF
(The pseudo-code of Algorithm 1 is provided as an image in the original publication.)

7. Experiments

We designed three experiments for the proposed MD4-AQF method. First, we investigated the agent's learning process under MD4-AQF. Second, we contrasted the training progress of MD4-AQF with that of another MFDRL method (D3ynQN), which is identical to our method except that its "multi-dimensional Q-network" and "aligned Q-fusion" components are ablated. Third, we compared the performance of the agent trained by MD4-AQF with that of a conventional RTSC method. Table 1 shows the setup of the experiments.
The experiments were conducted via a virtual intersection modeled by the traffic simulation software PTV Vissim 11.00-14. Signal control was implemented through the Vissim COM interface for Python 3.9.13. Neural networks were built using PyTorch 2.0.0.
The detailed geometry of the intersection and its peak-period traffic demand scenarios can be found in [25]. Briefly, a street had three lanes in each direction and was widened to four lanes when approaching the intersection (see Figure 7). The typical DRBPS in Figure 3 was employed. The minimum greens were 15 s for through phases and 5 s for left-turn phases. The maximum greens were 25 s greater than the minimum greens. The yellow change and red clearance intervals were 3 s and 2 s, respectively. Each simulation run lasted for μ = 4200 s, with the first 600 s being the warm-up period.
Table 1. Overview of experiment setup.
Experiment #1 (agent learning). Involved method: MD4-AQF. Analytical tool: line graph (Figure 8). Purpose and expected outcome: to verify whether the proposed method enables the agent to learn and improve its policy; the intersection performance is expected to gradually improve throughout the learning process.
Experiment #2 (versus MFDRL method). Involved methods: MD4-AQF and D3ynQN. Analytical tools: line graph (Figure 8) and box plot (Figure 9). Purpose and expected outcome: to verify whether the proposed method trains the agent more effectively than another MFDRL method, D3ynQN; the proposed method is expected to achieve better intersection performance than D3ynQN, both during training and at the end of training.
Experiment #3 (versus conventional method). Involved methods: MD4-AQF and FASC. Analytical tools: box plot (Figure 9), scatter plot (Figure 10), and statistical test (Table 2). Purpose and expected outcome: to verify whether the policy learned via the proposed method outperforms the conventional method FASC; the learned policy is expected to yield better intersection performance compared to FASC in most of the tested traffic demand scenarios.
Figure 7. Layout of the virtual intersection for experiments.
Figure 8. Learning curves for MD4-AQF and D3ynQN. The solid line represents the median of the average vehicle delay across 100 simulation runs for each evaluation, and the shaded area represents the range between the top and bottom quartiles.
Figure 9. Box plots of the average vehicle delay caused by D3ynQN, MD4-AQF, and FASC.
Figure 10. Scatter plot of the average vehicle delay caused by MD4-AQF and FASC in different traffic demand scenarios. Each dot corresponds to a specific scenario, with the x-coordinate and y-coordinate denoting the results for FASC and MD4-AQF, respectively. The greater the distance of a dot from the anti-diagonal line, the larger the difference in average vehicle delay between the two methods.

8. Discussion

8.1. Experiment 1: Agent Learning

We conducted almost no hyperparameter fine-tuning for MD4-AQF; instead, we adopted the hyperparameter values used in some original MFDRL algorithms. This simple approach already yielded very satisfactory performance, and the minimal need for tuning demonstrates the ease of use of MD4-AQF. Given this, we had little motivation to invest time in fine-tuning hyperparameters, although better performance might be achieved if more effort were devoted to hyperparameter tuning in the future.
Specifically, the optimizer was Adam, a variant of stochastic gradient descent (SGD), with a learning rate of $6 \cdot 10^{-5}$, which had been tuned and used by the authors of "Rainbow" (one of the best-performing DQN-style algorithms) to stabilize training [45]. Training started after 20,000 randomly selected actions, again as tuned and used in Rainbow [45]. A mini-batch included 64 samples, a commonly used value in many previous studies and in the "BDQ" algorithm, from which the Q-dimensionality expansion technique used in this study originates [47]. The target network weights were smoothly updated toward the exponentially moving average of the behavior network weights:
$$\theta^- \leftarrow \tau \theta + (1 - \tau)\, \theta^-,$$
where $\tau = 5 \cdot 10^{-3}$. This practice was applied by the authors of the "SAC" algorithm consistently across various tasks [49]. $\gamma$ was set to 0.99375 such that the agent's time horizon $1/(1 - \gamma)$ equaled the maximum possible cycle length. Such empirical practice was proven effective in prior work on MFDRL-based RTSC with an environment similar to that of this study [25]. The replay buffer held at most 200,000 samples, the maximum we could set given the limitation of memory capacity. The exploration probability $\epsilon$ in the $\epsilon$-greedy behavior policy was 0.1 throughout training and 0 during evaluation. The value of 0.1 is the final value of $\epsilon$ used by DQN and many of its variants in the later stages of training [44].
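The following sketch (an illustration under the stated hyperparameters, not the authors' code) shows the Adam setup and the soft target update θ⁻ ← τθ + (1 − τ)θ⁻ applied to a stand-in network.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical stand-in for the branched Q-network (1920 state elements, 52 outputs).
behavior_net = nn.Sequential(nn.Linear(1920, 128), nn.ReLU(), nn.Linear(128, 52))
target_net = copy.deepcopy(behavior_net)

optimizer = torch.optim.Adam(behavior_net.parameters(), lr=6e-5)  # learning rate from the text

TAU = 5e-3  # soft-update coefficient

@torch.no_grad()
def soft_update(target: nn.Module, behavior: nn.Module, tau: float = TAU) -> None:
    """theta_target <- tau * theta_behavior + (1 - tau) * theta_target."""
    for p_t, p_b in zip(target.parameters(), behavior.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p_b)

soft_update(target_net, behavior_net)   # called after each gradient step
```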
To track the agent's learning progress, we evaluated the behavior network every 10 simulation runs during training. For each evaluation, we performed 100 simulation runs and recorded the quartiles of average vehicle delay at the intersection. In principle, more frequent evaluations allowed for more timely tracking of policy improvements, and more simulation runs per evaluation provided more accurate statistics. The training continued until the median of the average delay had not improved for ρ = 200,000 action steps. This is essentially the "early stopping" tolerance commonly used in deep learning, where training is terminated after waiting a sufficient amount of time without improving the best score. Strictly speaking, ρ is not considered a hyperparameter, as it does not affect the policy changes during training.
As illustrated in Figure 8, the median of delay for MD4-AQF (marked red) oscillated around 135 s in the first 20,000 action steps. This was due to the random action selection, aimed at collecting initial samples in the replay buffer. Thereafter, the policy iteration began and the median showed a sharp drop to nearly 60 s within the next 20,000 steps. This signified that the agent rapidly found an acceptable policy to time signals. Starting from the 40,000th step, the improvement of performance slowed down. The agent spent the next 60,000 steps to lower the median by 5 s, and another 100,000 steps to further reduce the median from 55 s to about 50 s before the 200,000th step. Afterward, the downward curve tended to be gentle. The agent kept on seeking better performance in fluctuating median values. Eventually, it hit a minimum of 47.89 s at the 398,510th step. Since then, the median instead started to deteriorate gradually over time. This was consistent with the known “forgetting” phenomenon in the late period of training with DQN-style algorithms [50]. In general, the pattern of the learning curve was similar to that exhibited in the previous study [25]. We saved the best-performing network weights for the subsequent experiments.

8.2. Experiment 2: Versus D3ynQN Method (Ablation Study)

We contrasted MD4-AQF with a variant of the double DQN method (D3ynQN) in terms of training progress. D3ynQN is the same as MD4-AQF but with the "multi-dimensional Q-network" and the subsequent "aligned Q-fusion" for lagging phases removed. Instead, D3ynQN directly learns each uni-dimensional combinatorial action for the dual ring. This requires 256 · 26² = 160,256 weights in the network output layer (compared to 128 · 2 · 26 = 6656 for MD4-AQF). For lagging phases, D3ynQN masks the invalid actions that would cause the phases to end non-simultaneously. Meanwhile, D3ynQN retains the dynamic n-step update and uses the same hyperparameters as MD4-AQF. In this sense, Experiment 2 is also an ablation study of the proposed method.
D3ynQN [25] is an MFDRL method that has been demonstrated to control signals within a single-ring phase structure while satisfying the same signal timing constraints as in this study. Its success was achieved under the same conditions as in this study, including the same simulation software, road geometry, and traffic demand. Compared to MD4-AQF, D3ynQN is identical in all factors except for the two major innovations proposed in this study for DRBPS. Thus, D3ynQN serves as a competitive and comparable ablation baseline against MD4-AQF.
Figure 8 shows the learning curve for D3ynQN (marked green), compared with that for MD4-AQF. D3ynQN improved the median of the average delay to only about 80 s before the 40,000th action step. The performance exhibited severe fluctuations ranging from 65 s to 95 s until the 120,000th step. This indicated the difficulty in learning hundreds of combinatorial actions directly. As training progressed to around the 150,000th step, the oscillation of the median delay began to alleviate. The median dropped below 60 s at the 320,000th step. During subsequent training, the agent eventually achieved the best performance (55.15 s) at the 465,181st step. This result lagged behind that of MD4-AQF (47.89 s) by 7.26 s or 15.2%. Overall, D3ynQN utilized significantly more network parameters than MD4-AQF but experienced greater instability and consistently underperformed compared to MD4-AQF throughout the training process.
To further compare the agents trained via D3ynQN and MD4-AQF, we followed the manner in [25] to stochastically generate 2000 traffic demand scenarios for the intersection via the D-optimal design. In each scenario, we performed two simulation runs, with signals being timed by D3ynQN and MD4-AQF, respectively. We then recorded and compared their average vehicle delay at the intersection.
As shown in box plots in Figure 9, the medians of the average vehicle delay caused by D3ynQN and MD4-AQF, across the 2000 traffic demand scenarios, were 56.78 s and 49.62 s respectively. In comparison, the performance disadvantage of D3ynQN was 7.16 s or 14.4% in terms of the median value. Additionally, the top and bottom quartiles (as well as the upper and lower whiskers) produced by D3ynQN were larger than those of MD4-AQF. The inter-quartile range (i.e., the difference between the 75th and 25th percentiles) produced by D3ynQN was 66.91 − 50.42 = 16.49 s, whereas that of MD4-AQF was 57.69 − 44.31 = 13.38 s. This indicated that the variance in average vehicle delay caused by D3ynQN was larger than that of MD4-AQF. A similar inference could be drawn by comparing their whisker ranges (i.e., the difference between the upper and lower bounds in the box plots). The whisker range was 91.56 − 38.51 = 53.05 s from D3ynQN, versus 77.70 − 34.14 = 43.56 s from MD4-AQF. These signified that the policy quality of D3ynQN was inferior to that of MD4-AQF across various traffic load scales. Generally, as an ablated version of MD4-AQF, D3ynQN’s performance was worse than that of MD4-AQF. This result validated the significant impact of “multi-dimensional Q-network” and “Aligned Q-fusion” components in the proposed method.

8.3. Experiment 3: Versus Conventional Control Method

We compared MD4-AQF against a conventional fully actuated signal control (FASC) method in terms of vehicle delay. The technical details of FASC can be found in [25]. Briefly, FASC extended current phases on a second-by-second basis, according to the detected time headways of all vehicles crossing a specific point on each approach lane.
FASC typically utilizes data from inductive loop detectors that are widely deployed at real-world intersections. Such an actuated method has long been applied in traffic engineering. FASC has consistently served as a conventional baseline in many previous studies. Its real-time signal timing strategy could produce highly acceptable intersection performance.
We reused the prepared 2000 traffic demand scenarios via the D-optimal design in the previous experiment. We also performed two simulation runs in each scenario, with signals timed by MD4-AQF and FASC, respectively. We recorded the average vehicle delay at the intersection for the following comparative analyses.
As box plots in Figure 9 show, the medians of the average vehicle delay caused by MD4-AQF and FASC, across the 2000 traffic demand scenarios, were 49.62 s and 61.24 s, respectively. The proposed method achieved an 11.62 s (or 19.0%) advantage over FASC in terms of the median value. Moreover, it produced much lower values for both the top and bottom quartiles. Notably, the upper quartile (57.69 s) for MD4-AQF is even lower than the median (61.24 s) for the conventional method. These signified that the proposed method enabled the signals to operate with overall superior performance. On the other hand, the inter-quartile range (72.90 − 53.36 = 19.54 s) and whisker range (102.16 − 39.47 = 62.69 s) produced by FASC were comparatively wide, meaning that the conventional method resulted in less stable performance than MD4-AQF. In particular, there was an obvious contrast in the upper whisker (102.16 s versus 77.70 s). It was inferred that MD4-AQF expanded the intersection capacity and performed well in some severe demand scenarios.
We then paired the delay data produced by the two methods to draw the scatter plot in Figure 10. As shown, the proposed method led to lower average vehicle delay than FASC in the great majority of demand scenarios. This advantage reached approximately 59 s, or 46%, in some scenarios. In contrast, the conventional method outperformed MD4-AQF in very few scenarios, and with a relatively small difference in delay. Admittedly, there were still some traffic patterns in FASC's favor, suggesting its competitiveness as a method that has been widely applied over a long period. A potential direction for future work could be to develop a control mechanism that autonomously switches between the conventional method and MD4-AQF.
We further analyzed the paired average vehicle delay from a statistical perspective. We first performed the Kolmogorov–Smirnov normality test with Lilliefors correction. The resulting statistic (D = 0.073, p < 0.05) showed that the differences in vehicle delay between MD4-AQF and FASC were not normally distributed. Thus, we resorted to the non-parametric substitute for the paired t-test, the Wilcoxon matched-pairs signed-rank test. The resulting statistic Z was −37.729, as shown in Table 2. There was a significant difference in the average vehicle delay at a significance level of 0.05. The numbers of negative and positive ranks were 1951 and 49, respectively, meaning that MD4-AQF surpassed FASC in terms of vehicle delay in 95.35% of the 2000 demand scenarios.
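For reproducibility, the analysis can be sketched as below with SciPy and statsmodels; the delay arrays here are random placeholders rather than the experimental data.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.diagnostic import lilliefors

# Random placeholder delays (seconds) standing in for the 2000 paired scenarios.
rng = np.random.default_rng(1)
delay_md4 = rng.normal(50, 10, 2000)
delay_fasc = delay_md4 + rng.gamma(2.0, 6.0, 2000)   # assumed higher delay for FASC

diff = delay_md4 - delay_fasc
ks_stat, ks_p = lilliefors(diff, dist="norm")        # Lilliefors-corrected KS normality test
w_stat, w_p = wilcoxon(delay_md4, delay_fasc)        # paired, non-parametric comparison
print(f"Lilliefors D={ks_stat:.3f} (p={ks_p:.3g}); Wilcoxon W={w_stat:.1f} (p={w_p:.3g})")
```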
According to the results of the simulation experiments, the proposed MD4-AQF method enabled an agent to acquire signal timing knowledge effectively. The agent improved the average vehicle delay from more than 130 s to less than 50 s during training. The comparison with another MFDRL method served as an ablation study. It showed that the innovative components in MD4-AQF led to faster, more stable, and more effective training with fewer network parameters. The resulting intersection performance improved by over 14% compared to when these components were removed. Furthermore, the agent trained by MD4-AQF achieved significantly better performance compared to a conventional signal control method in a statistical sense. The quantitative advantage in the median of the average vehicle delay was nearly 20%. Such an improvement stands out when compared to the typical performance gains of 10% to 21% seen in analogous MFDRL studies, relative to conventional actuated control methods, under similar conditions of high-resolution simulation software and peak-period traffic demand.
The motivation for this study lies in unlocking the potential of MFDRL for DRBPS, a topic that has scarcely been addressed in previous studies. Working with DRBPS aims to align with key requirements in traffic engineering, making the research more likely to be implemented in practice. At the same time, DRBPS imposes certain constraints on signal control at real-world intersections and increases the difficulty of agent training, because it requires concurrent yet only partially independent decision-making for the two rings. In contrast to previous studies and our prior work [25], the novelty of this study lies in understanding and addressing these emerging challenges.
Overall, we believe that this study achieves its original intent. The proposed MD4-AQF method successfully implements phase control under DRBPS with signal timing constraints in traffic engineering. The trained agent successfully learns how to make good choices for the phases from a set of fine-grained green time options at one-second intervals.
The prior work, D3ynQN, serves as both a representative method from previous studies and the algorithmic foundation for MD4-AQF. The comparative experiments between D3ynQN and MD4-AQF in Section 8.2 show that directly applying an existing method to DRBPS can yield unsatisfactory results. The extensions proposed in MD4-AQF to adapt D3ynQN for DRBPS have proven effective. They are critical to MD4-AQF significantly outperforming conventional methods.
MD4-AQF achieved a median average vehicle delay of 49.62 s in the tested traffic demand scenarios. In contrast, the medians for D3ynQN and FASC baselines were 56.78 s and 61.24 s, respectively. Here, we explored the practical significance of the performance improvement achieved by MD4-AQF. We consulted the level-of-service (LOS) criteria established in the “Highway Capacity Manual” for motorized vehicles at signalized intersections [51]. MD4-AQF improved vehicle delay from LOS grade E (55–80 s) to grade D (35–55 s). This implies a significant reduction in vehicle stoppage and queuing. The benefits are particularly promising in terms of improving driver experience, alleviating traffic congestion, and reducing exhaust emissions.

9. Conclusions

This study proposes an MFDRL method, termed MD4-AQF, to address the RTSC problem within the context of DRBPS. The purpose is to narrow the gap between research and traffic engineering practices since DRBPS is quite commonly used in real-world intersections. To manage several challenges posed by DRBPS, we make three major contributions:
  • We present a “decision point aligning”-based action definition for the concurrent phases in two rings. This approach produces a consistent action space, which is used to determine the green times for both phases simultaneously.
  • We present a "multi-dimensional Q-network"-based training algorithm to handle the large action space containing more than 600 permutations of green times for two phases. It transforms hundreds of combinatorial actions into tens of sub-actions.
  • We present “aligned Q-fusion” embedded in the algorithm to ensure that two concurrent lagging phases end simultaneously. This approach can fuse paired sub-action values and generate one remaining green time shared by the two lagging phases.
The results of simulation experiments indicate that MD4-AQF can help an agent learn useful knowledge about signal control. Its best-performing policy achieves a median average vehicle delay of 47.89 s during training. The trained agent outperforms another MFDRL method by 14.4%, and a conventional method by 19.0%, in terms of the median delay. More interestingly, we believe MD4-AQF, in essence, provides universal solutions to some general control problems as follows.
  • Determining the green time for two concurrent phases is essentially a multi-degree-of-freedom (DOF) control problem, where some DOFs operate in parallel but not strictly simultaneously. Our solution is to align the decision points of DOFs with the latest moment at which any DOF must make a decision. Then, the action should be defined as the move of each DOF after its own decision deadline. Therefore, an agent can control several concurrent DOFs at the same time.
  • Learning too many combinatorial actions for two phases is essentially a problem of the “curse of dimensionality” caused by controlling multiple DOFs at once. In such a case, the number of actions grows exponentially with the number of DOFs. Our solution is to perform dimension expansion on a DQN-style algorithm to produce multi-dimensional sub-action values. Each sub-action should correspond to the choice of a specific DOF. By doing so, the number of Q-network outputs can become only linear with the number of DOFs.
  • Ensuring simultaneous ends of two lagging phases is essentially a problem where all moves of DOFs are required to be identical in some states. Our solution is to align and fuse sub-action values. This is done by averaging the values of the same sub-action over different dimensions. Then, all DOFs should share and take the sub-action that maximizes the average, allowing them to move in sync.
Although this study focuses primarily on MFDRL-based RTSC at a single intersection, the proposed MD4-AQF method can be seamlessly applied to multiple individually operating intersections within a road network. Given that our MD4-AQF targets the most complex eight-phase DRBPS at a four-leg intersection, it can also be adapted to simpler structures with fewer phases at T-intersections or other types of intersections. However, to achieve more effective coordinated control between multiple intersections, developing communication between the states and the actions of intersection agents may be necessary.
Many challenging issues remain to be adequately addressed before MFDRL-based RTSC methods can be successfully applied in practice; for example, the “reality gap” between the dynamics of simulated and real intersections, as well as the reliable generalization of knowledge across varying intersection geometries and traffic demand patterns. Other promising specific directions for future research include designing reward functions to facilitate coordination across multiple intersections; effectively training from the states and rewards based on partially missing sensor data; and learning robust policies to handle edge cases, such as extreme weather or traffic accidents that an agent encounters infrequently.

Author Contributions

Conceptualization, Q.Z. and H.X.; formal analysis, Q.Z.; funding acquisition, H.X. and J.C.; investigation, H.X.; methodology, Q.Z.; project administration, Q.Z. and K.Z.; resources, H.X. and J.C.; software, H.X.; data curation, Q.Z.; supervision, H.X. and J.C.; validation, Q.Z. and H.X.; visualization, Q.Z. and K.Z.; writing—original draft, Q.Z.; writing—review and editing, Q.Z., H.X. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number: 61374193), and the Humanities and Social Science Foundation of the Ministry of Education of China (grant number: 19YJCZH201).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MFDRL: model-free deep reinforcement learning
RTSC: real-time traffic signal control
DRBPS: dual-ring barrier phase structure
MD4-AQF: multi-dimensional double deep dynamic Q-network with aligned Q-fusion
DQN: deep Q-network
ADP: aligned decision point
D3ynQN: double deep dynamic Q-network
SGD: stochastic gradient descent
FASC: fully actuated signal control
DOF: degree-of-freedom

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
  2. Urbanik, T.; Tanaka, A.; Lozner, B.; Lindstrom, E.; Lee, K.; Quayle, S.; Beaird, S.; Tsoi, S.; Ryus, P.; Gettman, D.; et al. NCHRP Report 812: Signal Timing Manual, 2nd ed.; Transportation Research Board: Washington, DC, USA, 2015.
  3. Ma, J.; Wu, F. Learning to coordinate traffic signals with adaptive network partition. IEEE Trans. Intell. Transp. Syst. 2024, 25, 263–274.
  4. Kim, G.; Kang, J.; Sohn, K. A meta-reinforcement learning algorithm for traffic signal control to automatically switch different reward functions according to the saturation level of traffic flows. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 779–798.
  5. Kim, G.; Sohn, K. Area-wide traffic signal control based on a deep graph Q-Network (DGQN) trained in an asynchronous manner. Appl. Soft Comput. 2022, 119, 108497.
  6. Gu, H.; Wang, S.; Ma, X.; Jia, D.; Mao, G.; Lim, E.G.; Wong, C.P.R. Large-scale traffic signal control using constrained network partition and adaptive deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 7619–7632.
  7. Gu, H.; Wang, S.; Ma, X.; Jia, D.; Mao, G.; Lim, E.G.; Wong, C.P.R. Traffic signal optimization for partially observable traffic system and low penetration rate of connected vehicles. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 2070–2092.
  8. Huang, H.; Hu, Z.; Wang, Y.; Lu, Z.; Wen, X. Intersec2vec-TSC: Intersection representation learning for large-scale traffic signal control. IEEE Trans. Intell. Transp. Syst. 2024, 25, 7044–7056.
  9. Yang, T.; Fan, W. Enhancing robustness of deep reinforcement learning based adaptive traffic signal controllers in mixed traffic environments through data fusion and multi-discrete actions. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14196–14208.
  10. Yang, S.; Yang, B.; Zeng, Z.; Kang, Z. Causal inference multi-agent reinforcement learning for traffic signal control. Inf. Fusion 2023, 94, 243–256.
  11. Luo, H.; Bie, Y.; Jin, S. Reinforcement learning for traffic signal control in hybrid action space. IEEE Trans. Intell. Transp. Syst. 2024, 25, 5225–5241.
  12. Mukhtar, H.; Afzal, A.; Alahmari, S.; Yonbawi, S. CCGN: Centralized collaborative graphical transformer multi-agent reinforcement learning for multi-intersection signal free-corridor. Neural Netw. 2023, 166, 396–409.
  13. Liu, J.; Qin, S.; Su, M.; Luo, Y.; Zhang, S.; Wang, Y.; Yang, S. Traffic signal control using reinforcement learning based on the teacher-student framework. Expert Syst. Appl. 2023, 228, 120458.
  14. Yang, S.; Yang, B. An inductive heterogeneous graph attention-based multi-agent deep graph infomax algorithm for adaptive traffic signal control. Inf. Fusion 2022, 88, 249–262.
  15. Haddad, T.A.; Hedjazi, D.; Aouag, S. A deep reinforcement learning-based cooperative approach for multi-intersection traffic signal control. Eng. Appl. Artif. Intell. 2022, 114, 105019.
  16. Liu, Y.; Luo, G.; Yuan, Q.; Li, J.; Jin, L.; Chen, B.; Pan, R. GPLight: Grouped multi-agent reinforcement learning for large-scale traffic signal control. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; Volume 1, pp. 199–207.
  17. Mei, H.; Li, J.; Shi, B.; Wei, H. Reinforcement learning approaches for traffic signal control under missing data. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; Volume 3, pp. 2261–2269.
  18. Fang, Z.; Zhang, F.; Wang, T.; Lian, X.; Chen, M. MonitorLight: Reinforcement learning-based traffic signal control using mixed pressure monitoring. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, New York, NY, USA, 17–21 October 2022; pp. 478–487.
  19. Liang, E.; Su, Z.; Fang, C.; Zhong, R. OAM: An option-action reinforcement learning framework for universal multi-intersection control. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 4550–4558.
  20. Han, G.; Liu, X.; Han, Y.; Peng, X.; Wang, H. CycLight: Learning traffic signal cooperation with a cycle-level strategy. Expert Syst. Appl. 2024, 255, 124543.
  21. Agarwal, A.; Sahu, D.; Mohata, R.; Jeengar, K.; Nautiyal, A.; Saxena, D.K. Dynamic traffic signal control for heterogeneous traffic conditions using max pressure and reinforcement learning. Expert Syst. Appl. 2024, 254, 124416.
  22. Zhu, R.; Ding, W.; Wu, S.; Li, L.; Lv, P.; Xu, M. Auto-learning communication reinforcement learning for multi-intersection traffic light control. Knowl.-Based Syst. 2023, 275, 110696.
  23. Zhang, W.; Yan, C.; Li, X.; Fang, L.; Wu, Y.J.; Li, J. Distributed signal control of arterial corridors using multi-agent deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2023, 24, 178–190.
  24. Guo, X.; Yu, Z.; Wang, P.; Jin, Z.; Huang, J.; Cai, D.; He, X.; Hua, X.S. Urban traffic light control via active multi-agent communication and supply-demand modeling. IEEE Trans. Knowl. Data Eng. 2023, 35, 4346–4356.
  25. Zheng, Q.; Xu, H.; Chen, J.; Zhang, D.; Zhang, K.; Tang, G. Double deep Q-network with dynamic bootstrapping for real-time isolated signal control: A traffic engineering perspective. Appl. Sci. 2022, 12, 8641.
  26. Lee, H.; Han, Y.; Kim, Y. Reinforcement learning for traffic signal control: Incorporating a virtual mesoscopic model for depicting oversaturated traffic conditions. Eng. Appl. Artif. Intell. 2023, 126, 107005.
  27. Han, Y.; Lee, H.; Kim, Y. Extensible prototype learning for real-time traffic signal control. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 1181–1198.
  28. Bie, Y.; Ji, Y.; Ma, D. Multi-agent deep reinforcement learning collaborative traffic signal control method considering intersection heterogeneity. Transp. Res. Part C-Emerg. Technol. 2024, 164, 104663.
  29. Wang, M.; Chen, Y.; Kan, Y.; Xu, C.; Lepech, M.; Pun, M.O.; Xiong, X. Traffic signal cycle control with centralized critic and decentralized actors under varying intervention frequencies. IEEE Trans. Intell. Transp. Syst. 2024, 25, 20085–20104.
  30. Yazdani, M.; Sarvi, M.; Bagloee, S.A.; Nassir, N.; Price, J.; Parineh, H. Intelligent vehicle pedestrian light (IVPL): A deep reinforcement learning approach for traffic signal control. Transp. Res. Part C-Emerg. Technol. 2023, 149, 103991.
  31. Li, D.; Zhu, F.; Wu, J.; Wong, Y.D.; Chen, T. Managing mixed traffic at signalized intersections: An adaptive signal control and CAV coordination system based on deep reinforcement learning. Expert Syst. Appl. 2024, 238, 121959.
  32. Ren, F.; Dong, W.; Zhao, X.; Zhang, F.; Kong, Y.; Yang, Q. Two-layer coordinated reinforcement learning for traffic signal control in traffic network. Expert Syst. Appl. 2024, 235, 121111.
  33. Ma, D.; Zhou, B.; Song, X.; Dai, H. A deep reinforcement learning approach to traffic signal control with temporal traffic pattern mining. IEEE Trans. Intell. Transp. Syst. 2021, 23, 1–12.
  34. NEMA TS 2-2021; Traffic Controller Assemblies with NTCIP Requirements, 03.08 ed.; National Electrical Manufacturers Association: Rosslyn, VA, USA, 2021.
  35. Kang, L.; Huang, H.; Lu, W.; Liu, L. Optimizing gate control coordination signal for urban traffic network boundaries using multi-agent deep reinforcement learning. Expert Syst. Appl. 2024, 255, 124627.
  36. Zhao, Z.; Wang, K.; Wang, Y.; Liang, X. Enhancing traffic signal control with composite deep intelligence. Expert Syst. Appl. 2024, 244, 123020.
  37. Fang, J.; You, Y.; Xu, M.; Wang, J.; Cai, S. Multi-objective traffic signal control using network-wide agent coordinated reinforcement learning. Expert Syst. Appl. 2023, 229, 120535.
  38. Bouktif, S.; Cheniki, A.; Ouni, A.; El-Sayed, H. Deep reinforcement learning for traffic signal control with consistent state and reward design approach. Knowl.-Based Syst. 2023, 267, 110440.
  39. Wang, M.; Wu, L.; Li, M.; Wu, D.; Shi, X.; Ma, C. Meta-learning based spatial-temporal graph attention network for traffic signal control. Knowl.-Based Syst. 2022, 250, 109166.
  40. Ye, Y.; Zhou, Y.; Ding, J.; Wang, T.; Chen, M.; Lian, X. InitLight: Initial model generation for traffic signal control using adversarial inverse reinforcement learning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; Volume 5, pp. 4949–4958.
  41. Lin, J.; Zhu, Y.; Liu, L.; Liu, Y.; Li, G.; Lin, L. DenseLight: Efficient control for large-scale traffic signals with dense feedback. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; Volume 6, pp. 6058–6066.
  42. Han, X.; Zhao, X.; Zhang, L.; Wang, W. Mitigating action hysteresis in traffic signal control with traffic predictive reinforcement learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 673–684.
  43. Zhang, L.; Wu, Q.; Shen, J.; Lü, L.; Du, B.; Wu, J. Expression might be enough: Representing pressure and demand for reinforcement learning based traffic signal control. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 26645–26654.
  44. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  45. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 3215–3222.
  46. Christodoulou, P. Soft actor-critic for discrete action settings. arXiv 2019, arXiv:1910.07207.
  47. Tavakoli, A.; Pardo, F.; Kormushev, P. Action branching architectures for deep reinforcement learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 4131–4138.
  48. FHWA. Manual on Uniform Traffic Control Devices for Streets and Highways (MUTCD), 11th ed.; Federal Highway Administration: Washington, DC, USA, 2023.
  49. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1861–1870.
  50. Roderick, M.; MacGlashan, J.; Tellex, S. Implementing the deep Q-network. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2017; pp. 1–9.
  51. National Academies of Sciences, Engineering, and Medicine. Highway Capacity Manual: A Guide for Multimodal Mobility Analysis, 7th ed.; National Academies Press: Washington, DC, USA, 2022.
Figure 1. Reinforcement learning loop of the agent to solve the problem described in Section 3. It depicts the roles of the key components proposed in Section 4, Section 5 and Section 6.
Figure 3. Typical DRBPS at a four-leg intersection. Φ represents the phase number. During a phase, the traffic flows corresponding to the arrows shown in the figure are served. For example, when phase 2 is activated, the northbound ‘through’ vehicles and their adjacent parallel north–south pedestrians enjoy green lights. Phases (1 and 5) comprise a pair of concurrent leading phases, and so do phases (3 and 7). Phases (2 and 6) comprise a pair of concurrent lagging phases, and so do phases (4 and 8).
Figure 4. Relations between action definition, aligned decision point, interval parameter for dynamic bootstrapping, and DRBPS. Time-step size is one second. The colored blocks share the same explanations as in Figure 3. During a phase, the traffic flows corresponding to the arrows shown in the figure are served. For example, when phase 2 is activated, the northbound ‘through’ vehicles and their adjacent parallel north–south pedestrians enjoy green lights.
Figure 5. State representation and neural network structure. Information about vehicles and signals on each intersection leg is encoded in three-dimensional arrays and used as the input to the network.
Figure 6. Aligned Q-fusion. The two-dimensional Q values output from the network are marked blue. The values of the same sub-action in different dimensions are horizontally aligned and are averaged to calculate the fused values in the center purple column. The sub-action maximizing the fused value (marked with a bold and underlined number) will be selected and shared by all dimensions.
Table 2. Wilcoxon matched-pair signed rank test on the average vehicle delay caused by MD4-AQF and FASC.
FASC − MD4-AQF | N | Mean Rank | Sum of Ranks
Negative Ranks | 49 a | 530.35 | 25,987.00
Positive Ranks | 1951 b | 1012.31 | 1,975,013.00
Ties | 0 c
Z | −37.729
Sig. (two-tailed) | 0.000 **
a FASC < MD4-AQF, b FASC > MD4-AQF, c FASC = MD4-AQF. ** at 0.05 significance level.
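For readers who wish to reproduce this kind of paired comparison, the sketch below runs a Wilcoxon matched-pair signed rank test on per-run average delays with scipy.stats.wilcoxon. The two delay arrays are hypothetical placeholders generated for illustration; they are not the roughly 2000 paired simulation runs summarized in Table 2.

```python
# Minimal sketch of a paired Wilcoxon signed rank test like the one in Table 2.
# The delay arrays below are hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_runs = 2000                                     # matches the rank counts in Table 2
delay_fasc = rng.normal(60.0, 10.0, size=n_runs)  # assumed FASC average delays (s)
delay_md4aqf = delay_fasc - rng.normal(12.0, 5.0, size=n_runs)  # assumed MD4-AQF delays (s)

res = wilcoxon(delay_fasc, delay_md4aqf)          # two-sided test on paired differences
print(f"W = {res.statistic:.1f}, p = {res.pvalue:.3g}")
```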
