Estimation of Arterial Path Flow Considering Flow Distribution Consistency: A Data-Driven Semi-Supervised Method

Zhang, Zhe; Cao, Qi; Lin, Wenxie; Song, Jianhua; Chen, Weihan; Ren, Gang

doi:10.3390/systems12110507


by Zhe Zhang, Qi Cao, Wenxie Lin, Jianhua Song, Weihan Chen and Gang Ren *

School of Transportation, Southeast University, Jiulonghu Campus, Nanjing 211100, China

* Author to whom correspondence should be addressed.

Systems 2024, 12(11), 507; https://doi.org/10.3390/systems12110507

Submission received: 10 October 2024 / Revised: 16 November 2024 / Accepted: 18 November 2024 / Published: 19 November 2024


Abstract:

To implement fine-grained progression signal control on an arterial, it is essential to have access to the time-varying distribution of the origin–destination (OD) flow of the arterial. However, due to the sparsity of automatic vehicle identification (AVI) devices and the low penetration of connected vehicles (CVs), it is difficult to directly obtain the distribution pattern of arterial OD flow (i.e., path flow). To solve this problem, this paper develops a semi-supervised arterial path flow estimation method that considers the consistency of the path flow distribution by combining sparse AVI data with low-penetration CV data. Firstly, this paper proposes a semi-supervised arterial path flow estimation model based on multi-knowledge graphs. It utilizes graph neural networks to combine the AVI flow observations available for some arterial OD pairs with CV trajectory information to infer the path flow of OD pairs without AVI observations. Further, to ensure that the estimation results of the multi-knowledge graph path flow estimation model are consistent with the distribution of path flow in real situations, we introduce a generative adversarial network (GAN) architecture to correct the estimation results. The proposed model is extensively tested on a real signalized arterial. The results show that the proposed model still achieves reliable estimation results under low connected vehicle penetration and with few observed labels.

Keywords:

urban signalized arterial; path flow; semi-supervised learning; multi-knowledge graphs; generative adversarial network

1. Introduction

Estimating the origin–destination (OD) flows of urban arterials helps traffic managers understand arterial traffic operation patterns and supports arterial traffic management. For instance, most traditional arterial progression models do not incorporate arterial OD flow information and only provide a progression band for the through traffic in the arterial [1]. This often leads to overflow of turning vehicles, resulting in traffic congestion. To address this, a number of multi-path signal control models that incorporate arterial OD flow information have been put forth [2,3,4]. Yang, Cheng, and Chang [2] demonstrated that providing a progression band for both through traffic and multiple other critical paths can enhance the efficiency of vehicle operation under high traffic flow. However, one of the key issues is how to identify critical paths, which are defined as paths with high traffic flow. Therefore, the most basic and important information for such arterial control systems is the time-varying distribution of OD flows along the signalized arterial.

Conventional loop data are limited in their ability to provide information on arterial OD flow patterns, because they contain only link flow information rather than path flow information. The advent of modern information and communication technologies (ICTs) has led to promising data sources from the latest generation of traffic detection equipment, which hold significant potential for transportation professionals seeking a deeper understanding of arterial flow patterns.

Automatic vehicle identification (AVI) technology uses fixed sensors to passively record vehicle movement patterns [5,6]. For each vehicle approaching an AVI sensor, information about its movement is captured. As a result, the AVI system can record the movement of almost all vehicles passing its sensors, so AVI data can provide a full sample of vehicle travel information. However, to capture the full OD flow pattern of an arterial, pairs of AVI sensors must be set up at each intersection. This is a costly undertaking, both in terms of construction and maintenance. In reality, HD cameras are available at only some of the intersections in the arterial, so only some of the ODs in the arterial can be observed. In the arterial scenario shown in Figure 1, since three HD cameras are included only at the first and third intersections, we are able to obtain the flow of some of the ODs in the arterial (e.g., Path1-6 and Path8-3), while we are unable to obtain the path flow for other ODs that cannot be recorded by the cameras (e.g., Path4-8 and Path1-4).

Recently, with the rise of intelligent connected vehicles, it has become possible to obtain high-resolution probe vehicle trajectory data. Connected vehicles (CVs) can cover almost the entire arterial and have the potential to provide detailed travel information. In contrast to AVI data, CV trajectories provide direct but sampled observations of arterial OD flows: through the trajectories of the connected vehicles, we can determine only the connected vehicle flow of each OD in the arterial. In real traffic conditions, the penetration rate of connected vehicles cannot reach 100% [7], so the arterial OD flow information we obtain is representative only of the sampled connected vehicle flows and not of all vehicle flows in the arterial. As illustrated in Figure 1, we are able to obtain the connected vehicle flow of all the ODs in the arterial through the trajectories of the connected vehicles. For instance, we are not able to obtain the total vehicle flow of Path4-8 using the AVI HD cameras, but we are able to obtain the connected vehicle flow of Path4-8 using connected vehicle trajectories. At the same time, because the penetration rate is limited, the OD flow of connected vehicles represents only a portion of the actual traffic flow (as illustrated in Figure 1, the connected vehicle flow of Path1-6 is a subset of the actual traffic flow).

Considering the respective potential and shortcomings of AVI data and CV data, we aim to propose a semi-supervised learning approach to fuse these two types of data to obtain an arterial path flow estimation model based on multi-knowledge graphs and GAN architecture. In recent years, graph neural networks (GNN) have achieved a series of results in the field of semi-supervised tasks (e.g., recommender systems, molecular inference, chemistry, traffic flow prediction) [8,9]. Therefore, in this paper, we construct a semi-supervised arterial path flow estimation framework based on graph neural networks and generative adversarial network architectures to achieve reliable estimation of arterial OD flows.

The rest of the paper is organized as follows. We briefly review the arterial OD flow estimation problem and give the research gaps and our contributions in Section 2. Section 3 provides a formal description of the path flow estimation problem. The proposed semi-supervised path flow estimation model and the generative adversarial network estimation framework that considers the consistency of the flow distribution are presented in Section 4. The empirical study section shows the estimation results and evaluation analysis. Finally, we summarize the paper in the last section.

2. Literature Review

OD flow estimation is a classical but still challenging research topic [10]. Currently, there is very limited research on arterial OD flow estimation. Traditionally, the most common data source for estimating arterial OD flows is link counts. Arterial OD flows are estimated based on several traffic conservation laws. Specifically, the flow of an arterial section should be the sum of all path flows passing through the section [11]. However, this approach can only be applied to simple highway roads [12]. Lou and Yin [13] proposed a decomposition scheme for estimating arterial OD flows. They first inferred the turning movements at each intersection based on link counts, which were then used to construct measurement equations to infer arterial-level OD flows. However, their work does not account for the effect of signal plans on time-varying link counts, which reduces the accuracy of the arterial OD flow estimates. To assess the impact of signal plans on turning movements at intersections more accurately, Chang and Tao [14] introduced additional constraints on signal timing information into the model, which can produce more accurate estimates. Yang and Chang [15] proposed three dynamic arterial OD flow estimation models with the objective of improving estimation accuracy. These models analyze the relationship between link counts, intersection turning movements, signal timing plans, and OD flow patterns, respectively. However, the large number of unknown parameters limits the accuracy of the above link count-based methods. Furthermore, the computational efficiency and tractability of these methods face significant challenges.

As data availability continues to increase with the development of sensing and connected vehicle technologies, data-driven OD flow estimation has been investigated to address this problem, using, for example, AVI data [16,17,18,19,20] and probe vehicle data [21,22,23,24,25,26]. The basic idea of such methods is to improve the estimation accuracy by supplementing previously unavailable vehicle trip information. Although there is a paucity of literature on arterial OD estimation, there are a large number of models for estimating OD demand on general road networks [27]. Previous studies on OD flow estimation for road networks can be categorized as the Generalized Least Squares (GLS) model [28,29], the Bi-level Programming Model [30,31,32], models based on Bayesian Theory [33], the MaxEnt model [34], and the State Space model [35]. Most of the above methods use optimization methods to minimize the difference between the observed and assigned flows under various constraints. Since the temporal relationships among OD flows are difficult to compute in real time, it is difficult for these methods to produce timely and effective OD estimates. In addition, each OD pair has only one path along an arterial. Therefore, the existing OD estimation methods for general road networks may not be applicable to arterial OD estimation. Some scholars have attempted to better incorporate novel detection data to solve the arterial OD estimation problem. A recent study by Wang et al. [36] explores the use of self-supervised learning on probe vehicle data to estimate OD flows in signalized arterials. However, the absence of genuine AVI data for calibration during training compromises the reliability of the model in real traffic conditions. In addition, their method assumes that the defects of the generated OD matrix are zero-mean, which is not consistent with the real situation.

In summary, there are two main limitations in existing work on OD flow estimation in signalized arterials:

(1) Path OD flow estimation methods based on link counts are less accurate. On the one hand, because arterial OD flow patterns vary over time, it is difficult for link count-based methods to accurately infer the turning movements at each intersection, which leads to a mismatch with the actual situation and introduces unavoidable errors. On the other hand, the large number of unknown parameters makes the above link count-based methods less accurate, and their computational efficiency and tractability face great challenges.

(2) There is a lack of arterial OD estimation methods that integrate CV data and AVI data. Most of the existing studies focus on OD flow estimation at the network level, and there is a dearth of research on OD flow estimation in signalized arterials. The AVI system records the full-sample flow of some OD pairs in the arterial, which provides the opportunity to calibrate the estimation error. The CV data track the trajectories of the connected vehicles in the arterial, which helps us grasp the connected vehicle OD flows in the arterial. However, no study has yet combined the strengths of both to provide a data-driven approach specific to OD estimation on arterials.

To solve the above problems, a new semi-supervised framework is developed to estimate the OD flow patterns of signalized arterials. Specifically, the contributions of this research are as follows:

(1) A semi-supervised method for arterial OD estimation using AVI data and CV data is proposed. The multi-knowledge graph arterial path flow estimation model combines the observed flow information of the OD pairs with AVI observations and the CV trajectory information to infer the path flow of OD pairs without AVI observations. Multiple knowledge-based GCN graphs are utilized to capture the spatial correlation between the paths of different OD pairs, and the path flow data of observed OD pairs are used as labeled data to build a semi-supervised learning model. The experimental results show that the proposed semi-supervised method can achieve an effective estimation of the path flows of unobserved OD pairs using only a small amount of AVI label data.

(2) A GAN flow estimation architecture that can guarantee the consistency of the path flow distribution is developed. We embed the established multi-knowledge graph arterial path flow estimation model into GAN as the generator component and constitute the discriminator component through several layers of fully connected networks. The generator network and the discriminator network are allowed to train against each other, and the distribution of the generator-generated path flow is examined to ensure that the proportion of the generator-generated path flow obeys the same distribution as the proportion of the path flows in the real case, which further improves the accuracy of the arterial OD path flow estimation.

3. Problem Statement

The research object of this paper is a signalized arterial with multiple intersections (shown in Figure 2), where a total of $N$ entrances are available for vehicles to enter and exit the arterial, denoted as the set $\mathit{En} = \{1, 2, \dots, n, \dots, N\}$. These entrances can be regarded as origins and destinations, so each path in the arterial can be represented by an OD pair. The full set of OD pairs (i.e., all paths present in the arterial) is obtained by connecting each pair of entrances and is denoted as the set $\mathit{Paths} = \{1, 2, \dots, j, \dots, J\}$, where $J$ is the number of OD pairs. $J$ and $N$ satisfy the following equation:

$$J = \frac{N!}{(N - 2)!} \tag{1}$$
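As a quick check of Equation (1), the following minimal sketch enumerates the ordered OD pairs for a hypothetical arterial. The function name `enumerate_od_pairs` and the value $N = 8$ are illustrative assumptions, not taken from the paper.

```python
from itertools import permutations


def enumerate_od_pairs(num_entrances):
    """Return every ordered (origin, destination) pair with origin != destination."""
    entrances = range(1, num_entrances + 1)
    return list(permutations(entrances, 2))


# Hypothetical arterial with N = 8 entrances (illustrative value only).
pairs = enumerate_od_pairs(8)
print(len(pairs))  # 56, i.e., 8! / 6! = 8 * 7
```

Because the pairs are ordered, the path from entrance 1 to entrance 6 and the path from entrance 6 to entrance 1 are counted separately, which is why $J = N(N - 1)$.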

Based on the above notational definitions, the flow of all paths in the arterial in time period $t$ can be represented as a matrix $\mathit{Flow}_{t}$, which is of size $1 \times J$ and denoted as:

$$\mathit{Flow}_{t} = [f_{1,t}, f_{2,t}, \dots, f_{j,t}, \dots, f_{J,t}] \tag{2}$$

where $f_{j,t}$ denotes the number of vehicles traveling between OD pair $j$ within time period $t$.

In this study, it is assumed that there are two types of vehicles (regular vehicles and connected vehicles) traveling along this signalized arterial. Since the connected vehicles broadcast real-time locations, it is easy to obtain the number of connected vehicles traveling between each OD pair in time period $t$, which is the path flow matrix $\mathit{Flow}_{t}^{CV}$ of connected vehicles with size $1 \times J$, denoted as:

$$\mathit{Flow}_{t}^{CV} = [f_{1,t}^{CV}, f_{2,t}^{CV}, \dots, f_{j,t}^{CV}, \dots, f_{J,t}^{CV}] \tag{3}$$

For the problem under study in this paper, not every intersection in a signalized arterial is equipped with high-definition cameras (automatic vehicle identification devices) in every direction. Therefore, in time period $t$, we cannot obtain the flow of all paths in the arterial, but only that of the paths covered by the limited number of automatic vehicle identification devices. That is, we cannot obtain the complete $\mathit{Flow}_{t}$, but only a defective $\mathit{Flow}_{t}$, denoted as $\mathit{Flow}_{t}^{Flaw}$:

$$\mathit{Flow}_{t}^{Flaw} = [f_{1,t}, *, *, *, *, f_{6,t}, *, *, *, \dots, f_{j,t}, \dots, *, *, *, \dots, f_{J,t}] \tag{4}$$

where $*$ marks a path whose flow is not observed by the AVI devices.

We divide the path samples according to whether the paths have real traffic information in time period $t$. Paths with real traffic information are denoted as labeled samples, and paths without real traffic information are denoted as unlabeled samples. We use the traffic flow of the connected vehicles in time period $t$ and the previous $k$ time periods as input features for the path samples.

Table 1 presents specific examples of labeled and unlabeled path samples. Path $m$ and path $n$ represent two distinct arterial paths. Path $m$ is a labeled path, as both its origin and destination are equipped with AVI devices, enabling the acquisition of real traffic flow for this path. In contrast, path $n$ is an unlabeled path, where the absence of AVI devices at either the origin or the destination precludes the direct measurement of its real traffic flow. However, through the use of connected vehicle data, we can acquire the connected vehicle traffic flow for both path $m$ and path $n$. For path $m$, the connected vehicle traffic flow during time period $t$ and the preceding $k$ time periods can be denoted as $[f_{m,t-k}^{CV}, \dots, f_{m,t-2}^{CV}, f_{m,t-1}^{CV}, f_{m,t}^{CV}]$. For path $n$, the connected vehicle traffic flow during time period $t$ and the preceding $k$ time periods can be denoted as $[f_{n,t-k}^{CV}, \dots, f_{n,t-2}^{CV}, f_{n,t-1}^{CV}, f_{n,t}^{CV}]$. These two vectors represent the sample input features for path $m$ and path $n$, respectively. The real traffic flow for path $m$ at time period $t$, obtained via the AVI devices, is denoted as $f_{m,t}$, which serves as the sample label for path $m$. However, for path $n$, due to the absence of AVI observations, there is no actual sample label available.
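The sample construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the array layout (time periods by paths), the use of NaN to mark paths without AVI observations, and the names `build_samples`, `cv_flow`, and `avi_flow` are all hypothetical and not taken from the paper.

```python
import numpy as np


def build_samples(cv_flow, avi_flow, t, k):
    """Build per-path features and labels for time period t.

    cv_flow:  (T, J) connected vehicle counts per period and path.
    avi_flow: (T, J) AVI counts, NaN where a path has no AVI observation
              (assumed encoding of the '*' entries in the defective matrix).
    Returns features (J, k + 1), labels (J,), and a boolean labeled mask (J,).
    """
    if t - k < 0:
        raise ValueError("need at least k periods before t")
    features = cv_flow[t - k : t + 1].T   # oldest period first, period t last
    labels = avi_flow[t]
    labeled = ~np.isnan(labels)           # True for paths with an AVI label
    return features, labels, labeled


# Toy data: 4 periods, 3 paths; only path 0 is covered by AVI devices.
cv = np.array([[1, 0, 2], [2, 1, 3], [3, 1, 4], [4, 2, 5]], dtype=float)
avi = np.full((4, 3), np.nan)
avi[:, 0] = [10, 20, 30, 40]

X, y, labeled = build_samples(cv, avi, t=3, k=2)
print(X[0], y[0], labeled)
```

In this toy case, path 0 plays the role of the labeled path $m$ and paths 1 and 2 play the role of the unlabeled path $n$: all three have connected vehicle features, but only path 0 has a label.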

As shown in Figure 3, we train a semi-supervised learning framework using both labeled and unlabeled path samples. The trained framework is then used to estimate the true flow of the unlabeled path samples. Finally, the flow matrix of all the paths in the arterial is obtained as follows:

$$\mathit{Flow}'_{t} = [f_{1,t}, f'_{2,t}, f'_{3,t}, \dots, f_{6,t}, f'_{7,t}, f'_{8,t}, \dots, f_{j,t}, f'_{j+1,t}, \dots, f'_{j+k,t}, \dots, f_{J,t}] \tag{5}$$

where $f'_{j,t}$ denotes the estimated number of vehicles traveling on unobserved path $j$ within time period $t$.

4. Materials and Methods

In this study, we develop a data-driven approach for signalized arterial path flow estimation using connected vehicle trajectory data and AVI data. To cope with the challenges of low penetration of connected vehicles and large sparsity of AVI data, we develop a semi-supervised path flow estimation framework that considers the consistency of the path flow distribution to approximate the real path flow.

The framework of the proposed method is shown in Figure 4. The connected vehicles path flow matrix and defective path flow matrix are used as input data. Firstly, we develop a semi-supervised path flow estimation model that incorporates multi-knowledge graphs. We incorporate multiple knowledge-based graph structures into the model to more broadly construct complex dependencies between arterial paths. We use graph neural networks to capture the complex mapping relationship between connected traffic demand and the real traffic demand, from which we derive the real flow of the unobserved paths in the arterial. Furthermore, to ensure that the estimation results of the multi-knowledge graph estimation model are consistent with the real path flow distribution, we introduce a generative adversarial network (GAN) architecture to correct the estimation results of the multi-knowledge graph estimation model. We establish a path flow estimation framework that incorporates flow distribution consistency. We embed the established semi-supervised path flow estimation model with multiple knowledge graphs into GAN as the generator component and constitute the discriminator component through several layers of fully connected networks. The generator network and the discriminator network are allowed to train against each other to ensure that the proportions of the path flow generated by the generator obey the same distribution as in the real case.

In the following, we will specify the details of the semi-supervised path flow estimation model based on multiple knowledge graphs and the GAN path flow estimation framework, considering the consistency of path flow distribution.

4.1. Path Flow Estimation Model Based on Multiple Knowledge Graphs

4.1.1. Semi-Supervised Road Flow Estimation Model Based on Graph Convolutional Network

In recent years, graph-based semi-supervised learning has found a wide range of applications in areas such as social network analysis, recommender systems, and cybersecurity. It uses the structure of relationships between data (represented as graphs) to improve the performance of learning algorithms, especially when labeled data are scarce. Kipf and Welling [37] proposed a simple and effective graph convolutional network (GCN) for semi-supervised representation learning. GCN can extract representative data features by automatically learning the feature information and the structural information of graph data simultaneously [38]. It can achieve high accuracy with only a small amount of labeled data, which makes it one of the best current choices for semi-supervised learning tasks. In this paper, a semi-supervised path flow estimation model based on GCN is developed, as shown in Figure 5. We abstract each path in the arterial as a node. The yellow nodes in Figure 5 indicate path samples with real traffic flow labels, and the blue nodes indicate path samples without real traffic flow labels. The edges between nodes represent the correlations between paths, expressed as an adjacency matrix $A$ (i.e., the structural information of the graph). In the input layer, we use the connected vehicle flow of each path in the current and historical periods (for both labeled and unlabeled path samples) as input features. In the hidden layer, based on the structure of the graph, the information about the path itself and the information about the neighboring path nodes are aggregated to generate a new representation of each path node. Finally, at the output layer, we obtain the estimated traffic flow for each path. By comparing the estimated and true flows of the labeled path samples, we can calculate the estimation loss of the model. The model parameters are optimized by backpropagation based on this loss, and the optimized model can be used to estimate the traffic flow of unlabeled path samples. In the following, we specify the details of the GCN-based semi-supervised path flow estimation model.

The graph structure of an arterial with $J$ paths can be denoted as $G = (A, X)$. $A \in \mathbb{R}^{J \times J}$ and $X \in \mathbb{R}^{J \times d}$ denote the adjacency matrix of the graph structure information and the feature matrix of dimension $J \times d$, respectively. Here, the graph structure information of adjacency matrix $A$ is used to reflect the relationship between arterial paths. We denote the adjacency matrix $A$ as follows:

$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,j} & \cdots & a_{1,J} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,j} & \cdots & a_{2,J} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{j,1} & a_{j,2} & \cdots & a_{j,j} & \cdots & a_{j,J} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{J,1} & a_{J,2} & \cdots & a_{J,j} & \cdots & a_{J,J} \end{bmatrix} \tag{6}$$

where

$$a_{m,n} = \begin{cases} 1 & \text{path } m \text{ and path } n \text{ have a certain relationship} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

Since the graph is undirected, the adjacency matrix $A$ is a symmetric matrix. The feature matrix $X$ is denoted as:

$$X = [x_{1}, x_{2}, \dots, x_{j}, \dots, x_{J}] \tag{8}$$

$x_{j} = [f_{j,t-k}^{CV}, f_{j,t-k+1}^{CV}, \dots, f_{j,t-1}^{CV}, f_{j,t}^{CV}]$ denotes the connected vehicle flow of path $j$ observed in the current time period $t$ and the previous $k$ time periods. Here, the feature dimension is $d = k + 1$. $A$ and $X$ are the data in the input layer of the graph convolutional network.

According to the propagation rule for graph convolutional networks, the propagation rule for the $p$th layer is as follows:

$$h(X^{(p)}, A) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X^{(p)} W^{(p)}\right) \tag{9}$$

where $\sigma(\cdot)$ is an activation function. The mapping function $h(X^{(p)}, A)$ learned by the graph convolutional network is the output feature representation of the $p$th layer, $X^{(p+1)} \in \mathbb{R}^{J \times d_{p+1}}$. $\hat{A} = A + I$, where $I$ is the identity matrix, is the adjacency matrix with self-loops added, indicating that the feature information of the paths themselves is taken into account in the graph convolution. $\hat{D}$ denotes the degree matrix of $\hat{A}$, with $\hat{D}_{mm} = \sum_{n} \hat{A}_{m,n}$. $W^{(p)} \in \mathbb{R}^{d_{p} \times d_{p+1}}$ is the weight matrix of the $p$th layer of the graph convolutional network. By propagating through the layers, each node gains access to higher-order information. Taking a two-layer graph convolutional network as an example, the model can be represented as:

$$Z = h(X, A) = \bar{A}\, \mathrm{ReLU}\left(\bar{A} X W^{(0)}\right) W^{(1)} \tag{10}$$

where $\bar{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$. Neighborhood aggregation and information propagation occur via the two-layer graph convolution, and $Z$ is the estimated output of the model (i.e., its $j$th entry $z_{j}$ is an estimate of the real flow of path $j$ in the arterial). As shown in Figure 5, due to the sparsity of AVI detection devices, we only have the real observed flows for some of the paths (the paths represented by nodes 2, $J$, and $J - 1$ in Figure 5). Therefore, we only use the labeled path samples to compute the model loss. The loss of the GCN model is denoted as:

$$l_{GCN} = \sum_{j \in \mathit{paths}'} \mathit{loss}(z_{j}, y_{j}) \tag{11}$$

where $\mathit{paths}'$ is the set of labeled path samples, and $\mathit{loss}(z_{j}, y_{j})$ is the loss computed between the estimated value $z_{j}$ and the true value $y_{j}$ of the path flow.
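Equations (9) to (11) can be illustrated with a short NumPy sketch of the two-layer forward pass and the loss restricted to labeled paths. This is not the authors' implementation. The squared error used for $\mathit{loss}(z_j, y_j)$, the NaN encoding of missing labels, the toy graph, the random weights, and all function names are assumptions made for illustration, since the text does not specify them.

```python
import numpy as np


def normalize_adjacency(A):
    """Compute A_bar = D_hat^{-1/2} (A + I) D_hat^{-1/2} for non-negative A."""
    A_hat = A + np.eye(A.shape[0])        # add self-loops
    deg = A_hat.sum(axis=1)               # diagonal of D_hat (>= 1 here)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: Z = A_bar ReLU(A_bar X W0) W1, as in Equation (10)."""
    A_bar = normalize_adjacency(A)
    hidden = np.maximum(A_bar @ X @ W0, 0.0)   # ReLU
    return A_bar @ hidden @ W1


def labeled_loss(Z, y, labeled):
    """Sum of squared errors over labeled paths only (assumed loss form)."""
    z = Z[:, 0]
    diff = z[labeled] - y[labeled]
    return float(np.sum(diff ** 2))


# Toy graph with J = 4 paths, d = 3 features, 5 hidden units (all hypothetical).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = rng.poisson(5.0, size=(4, 3)).astype(float)
W0 = rng.normal(scale=0.1, size=(3, 5))
W1 = rng.normal(scale=0.1, size=(5, 1))

Z = gcn_forward(A, X, W0, W1)
y = np.array([30.0, np.nan, 25.0, np.nan])    # NaN: no AVI label
labeled = ~np.isnan(y)
loss = labeled_loss(Z, y, labeled)
print(Z.shape, loss)
```

Unlabeled paths still contribute through the aggregation step, because their connected vehicle features are mixed into the representations of their labeled neighbors. Only the loss is restricted to labeled paths.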

4.1.2. Multi-Knowledge Graphs Construction

There are specific correlations between different paths in arterials [36], and, depending on the transportation and domain knowledge considered, these correlations can be multifaceted (e.g., topological connectivity relationships, temporal pattern similarity, non-functional dependencies, etc.). It is difficult to adequately represent these complex associations using a single graph structure. Therefore, we build multiple knowledge-based graph structures to express these associations. Multi-graph-based learning methods can mine more potential information to model reasonable contextual relationships [39]. Because they consider more potential information, multi-graph-based learning approaches are usually superior to single-graph-based approaches.

We define three knowledge-based graphs to represent different types of correlations between paths in the arterial and fuse them using a relational graph convolutional network (RGCN). The nodes in the three graphs represent the paths in the arterial, and the edges encoded by the adjacency matrix $A$ of each graph represent the topological connectivity, temporal similarity, and potential correlation among the paths, respectively. The representation and specific details of each knowledge graph are given below.

(1) Topology Connectivity Graph

Figure 6 shows a simple example of a signalized arterial with two intersections, which we use to illustrate the problem. To model the physical topology of paths in a signalized arterial, consider two paths $m$ and $n$ that originate from the same entrance. Since they have the same origin, we define a departure topology connectivity between them and set the weight of the edge between them to 1, i.e., $a_{m,n} = 1$. For example, for the paths starting from entrance 2 in Figure 6, the weights of the edges between them are all 1. Similarly, for paths $m$ and $n$ arriving at the same entrance, since their destinations are the same, we define an arrival topology connectivity between them, in which case the weight of the edge between them is also 1. For example, for the paths arriving at entrance 2 in Figure 6, the weights of the adjacent edges between them are both 1. For paths $m$ and $n$ that have neither a departure topology connection nor an arrival topology connection, the weight of the edge between them is 0, i.e., $a_{m,n} = 0$. To summarize, in the topology connectivity graph, the edge weight between any two paths in the arterial is defined as:

$$a_{m,n} = \begin{cases} 1 & \text{path } m \text{ and path } n \text{ have a departure/arrival topology connectivity} \\ 0 & \text{otherwise} \end{cases} \tag{12}$$
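The rule in Equation (12) can be sketched in a few lines of Python. The representation of a path as an `(origin, destination)` tuple and the function name `topology_adjacency` are assumptions for illustration. The diagonal is left at 0 here, on the assumption that self-loops are added later through $\hat{A} = A + I$.

```python
import numpy as np
from itertools import permutations


def topology_adjacency(paths):
    """Build the 0/1 topology connectivity matrix for a list of (origin, destination) paths.

    Two distinct paths are connected when they share an origin (departure
    connectivity) or share a destination (arrival connectivity).
    """
    J = len(paths)
    A = np.zeros((J, J), dtype=int)
    for m, (o_m, d_m) in enumerate(paths):
        for n, (o_n, d_n) in enumerate(paths):
            if m != n and (o_m == o_n or d_m == d_n):
                A[m, n] = 1
    return A


# Hypothetical arterial with 3 entrances, giving 6 ordered OD pairs:
# (1,2), (1,3), (2,1), (2,3), (3,1), (3,2)
paths = list(permutations([1, 2, 3], 2))
A_topo = topology_adjacency(paths)
print(A_topo)
```

In this toy case, path (1, 2) is connected to (1, 3) through the shared origin and to (3, 2) through the shared destination, but not to (2, 1), which shares neither.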

(2) Temporal Similarity Graph

There may be similar time series patterns between different paths in an arterial. To better capture the temporal similarity between paths, we use a similarity assessment algorithm to calculate it. The traditional Euclidean distance only reflects the pointwise numerical distance between two time series and cannot accurately describe how similar the shapes of two time series curves are. The dynamic time warping (DTW) method [40] can describe the similarity of time series curves more accurately by warping the time axis for matching. Consider the historical flow time series $S1 = \{s_{1}^{1}, s_{2}^{1}, \dots, s_{q1}^{1}\}$ and $S2 = \{s_{1}^{2}, s_{2}^{2}, \dots, s_{q2}^{2}\}$ of any two paths in the arterial, of lengths $q1$ and $q2$, respectively. The distance between any two points of the time series $S1$ and $S2$ is calculated as the squared Euclidean distance, i.e.,

$$b_{(i,j)} = \left(s_{i}^{1} - s_{j}^{2}\right)^{2} \tag{13}$$

These pointwise distances form the distance matrix $B_{q1 \times q2}$:

$$B_{q1 \times q2} = \begin{bmatrix} b_{(1,1)} & b_{(1,2)} & \cdots & b_{(1,q2)} \\ b_{(2,1)} & b_{(2,2)} & \cdots & b_{(2,q2)} \\ \vdots & \vdots & \ddots & \vdots \\ b_{(q1,1)} & b_{(q1,2)} & \cdots & b_{(q1,q2)} \end{bmatrix} \tag{14}$$

The DTW algorithm finds a single continuous warping path through the distance matrix that minimizes the cumulative distance from the upper left corner to the lower right corner; this minimum cumulative distance is the DTW value. A smaller DTW value indicates a higher similarity between the historical flow time series of the two paths. Therefore, we construct a temporal similarity graph using the DTW values between paths as their edge weights. In the temporal similarity graph, for any two paths $m$ and $n$ in the arterial, the edge weight between them is defined as:

$$a_{m,n} = \mathit{DTW}(m, n) \tag{15}$$

The adjacency matrix requires that the higher the temporal similarity between paths, the higher the weight of the adjacent edge (the closer to 1). Therefore, to make the adjacency matrix

A

more computationally tractable, we introduce an exponential function

\frac{1}{e^{x}}

to map the DTW values, which are non-negative, to the interval (0, 1]. We define this as follows:

D T W {(m, n)}^{'} = \frac{1}{e^{D T W (m, n)}}

(16)

When the temporal similarity of the two paths is higher, the value of

D T W (m, n)

is smaller and the value of

D T W {(m, n)}^{'}

tends to 1. When the temporal similarity of the two paths is lower, the value of

D T W (m, n)

is larger and the value of

D T W {(m, n)}^{'}

tends to 0. We use

D T W {(m, n)}^{'}

as the new adjacent edge weight, i.e.,

a_{m, n} = D T W {(m, n)}^{'}

(17)
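The construction in Equations (13)–(17) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the classic DTW recursion (steps to the right, down, and diagonally), and the function names `dtw_distance` and `temporal_similarity_weight` are hypothetical.

```python
import math


def dtw_distance(s1, s2):
    """Cumulative DTW cost between two sequences.

    Uses the squared pointwise distance b(i, j) = (s1[i] - s2[j]) ** 2
    from Equation (13) and the classic three-neighbor recursion (assumed).
    """
    q1, q2 = len(s1), len(s2)
    inf = float("inf")
    # acc[i][j] = minimal cumulative cost of aligning s1[:i] with s2[:j]
    acc = [[inf] * (q2 + 1) for _ in range(q1 + 1)]
    acc[0][0] = 0.0
    for i in range(1, q1 + 1):
        for j in range(1, q2 + 1):
            b = (s1[i - 1] - s2[j - 1]) ** 2
            acc[i][j] = b + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[q1][q2]


def temporal_similarity_weight(s1, s2):
    """Edge weight of Equations (16)-(17): exp(-DTW), which lies in (0, 1]."""
    return math.exp(-dtw_distance(s1, s2))


if __name__ == "__main__":
    # A time-shifted copy of the same shape still aligns with zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(temporal_similarity_weight([0, 0], [1, 1]))
```

Because the weight decays exponentially, large raw DTW values push it very close to 0. Whether the flow series are rescaled before this step is not stated in this section, so any normalization would be an additional assumption.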

(3) Potential Correlation Graph

The two predefined knowledge-based graphs may not be enough to capture all possible correlations between paths, and there could be hidden correlations that cannot be explicitly represented. Therefore, we further extract the potential correlation between signalized arterial paths. Traditional correlation measures such as Pearson and Spearman capture only linear (Pearson) or monotonic (Spearman) relationships between paths and place additional requirements on the data distribution (e.g., Pearson’s algorithm requires that the data obey a normal distribution). The potential correlation between paths is complex and is not limited to linear correlation, so these measures cannot meet our requirements. In contrast, the maximal information coefficient (MIC) algorithm can measure a wide range of dependency relationships between variables, including linear and non-linear relationships and even non-functional dependencies that cannot be represented by a single function (such as dependencies consisting of more than one function). Therefore, we use

M I C

to define the potential correlation between two arterial paths, which takes values in the range [0, 1]. A larger

M I C

value indicates a higher potential correlation between two paths: the value of

M I C (m, n)

tends to 1 when the potential correlation between the two paths is high and tends to 0 when it is low. Therefore, we construct the potential correlation graph based on the

M I C

values between paths as their adjacent edge weight. In the potential correlation graph, for any two paths

m

and

n

in the arterial, the adjacent edge weight between them is defined as:

a_{m, n} = M I C (m, n)

(18)

4.1.3. Multi-Knowledge Graph Integration Based on Relational Graph Convolutional Network

In the previous section, we built multiple knowledge-based graph structures based on multiple associative relationships. In this section, we use a relational graph convolutional network (RGCN) [41] to integrate these knowledge-based graphs and learn a unified feature representation. In RGCN, each node first aggregates neighboring node features within each single graph and then fuses the features aggregated across the multiple graphs into a single representation.

For different relation types of multiple graphs (e.g., Figure 7), RGCN does the following:

In the

l

th layer of the convolution, we use

W_{r}^{(l)}

to denote the linear transformation function of the knowledge graph

r

. Each of the three knowledge graphs we propose has its own linear transformation function, which transforms the features of the neighboring nodes along the edges of the corresponding relation graph:

h_{m}^{(l + 1)} = σ (\sum_{r \in R} \sum_{n \in ℑ_{m}^{r}} \frac{1}{c_{m, r}} W_{r}^{(l)} h_{n}^{(l)} + W_{0}^{(l)} h_{m}^{(l)})

(19)

where

h_{m}^{(l)}

denotes the state features of the

l

th layer of path

m

.

R

is the set of knowledge graphs.

c_{m, r}

is the normalization coefficient, which weights the contribution of each knowledge graph and is set to the number of knowledge graphs.

ℑ_{m}^{r}

denotes the set of neighboring nodes of path

m

in the knowledge graph

r

, and

σ

is the activation function.

According to Equation (19), after aggregating the node features of the different knowledge graphs, RGCN also adds the features of the node itself, and the node features fused across multiple knowledge graphs are then obtained through an activation function. We established a path flow estimation model based on a multi-knowledge graph, and the model structure is shown in Figure 8. Firstly, we feed the input features into the topological connectivity graph, temporal similarity graph, and potential correlation graph to extract features. Then, we fuse the features of the multiple graphs with a multi-layer RGCN. The fused features are fed into a fully connected network to obtain the estimated traffic flow of each path. When training the model, we compute the estimation loss only from the real traffic labels of the observed paths and back-propagate this error. Thus, the model training process is semi-supervised.
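One layer of Equation (19) can be written compactly with matrix products. The sketch below is illustrative only and rests on several assumptions: neighbors in each graph are taken to be the nonzero entries of that graph's adjacency matrix, the normalization coefficient defaults to the number of graphs as described above, ReLU stands in for the unspecified activation, and features are stored as row vectors (so the weight matrices multiply on the right). The name `rgcn_layer` is hypothetical.

```python
import numpy as np


def rgcn_layer(h, adjs, rel_weights, self_weight, c=None):
    """One relational graph convolution step in the spirit of Equation (19).

    h            : (P, F_in) array, one feature row per path
    adjs         : list of (P, P) adjacency matrices, one per knowledge graph
    rel_weights  : list of (F_in, F_out) matrices, one W_r per graph
    self_weight  : (F_in, F_out) matrix W_0 for the self-connection
    c            : normalization coefficient (assumed: number of graphs)
    """
    if c is None:
        c = float(len(adjs))
    out = h @ self_weight  # self-connection term W_0 h_m
    for adj, w_r in zip(adjs, rel_weights):
        neighbors = (np.asarray(adj) != 0).astype(float)  # membership in N_m^r
        out = out + (neighbors @ (h @ w_r)) / c  # sum over neighbors, scaled by 1/c
    return np.maximum(out, 0.0)  # ReLU as a stand-in activation


if __name__ == "__main__":
    h = np.array([[1.0], [2.0]])
    adj = np.array([[0, 1], [1, 0]])
    print(rgcn_layer(h, [adj], [np.array([[1.0]])], np.array([[1.0]])))
```

Note that Equation (19) as written sums neighbor features without edge weights; using the DTW-based or MIC-based weights inside the aggregation would be a variation not described here.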

4.2. A Path Flow Estimation Framework Based on Generative Adversarial Networks for Incorporating Flow Distribution Consistency

The path flow matrix

F l o w_{t}^{C V}

of the arterial connected vehicles can be regarded as a valid sampling of the total path flow matrix

F l o w_{t}

. The study by Ma et al. [42] states that if the probe vehicles are independent and are not influenced or monitored by the center, then the probe vehicles and regular vehicles choose the same route at the same rate; that is, probe vehicles and regular vehicles share the same route choice behavior. Our study follows the same assumption. Therefore, the distribution of the path flow proportion in the connected vehicle flow matrix

F l o w_{t}^{C V}

should be consistent with the distribution of the path flow proportion in the total flow matrix

F l o w_{t}

.

The connected path flow proportion matrix

F l o w P r o p_{t}^{C V}

is as follows:

F l o w P r o p_{t}^{C V} = \frac{F l o w_{t}^{C V}}{S u m_{t}^{C V}} = [\frac{f_{1, t}^{C V}}{S u m_{t}^{C V}}, \frac{f_{2, t}^{C V}}{S u m_{t}^{C V}}, \dots \dots \frac{f_{j, t}^{C V}}{S u m_{t}^{C V}}, \dots \dots, \frac{f_{J, t}^{C V}}{S u m_{t}^{C V}}]

(20)

where

S u m_{t}^{C V}

is the sum of the connected vehicles path flow

S u m_{t}^{C V} = f_{1, t}^{C V} + f_{2, t}^{C V} + \dots \dots + f_{j, t}^{C V} + \dots \dots + f_{J, t}^{C V}

.

The total flow path proportion matrix

F l o w P r o p_{t}

is as follows:

F l o w P r o p_{t} = \frac{F l o w_{t}}{S u m_{t}} = [\frac{f_{1, t}}{S u m_{t}}, \frac{f_{2, t}}{S u m_{t}}, \dots \dots \frac{f_{j, t}}{S u m_{t}}, \dots \dots, \frac{f_{J, t}}{S u m_{t}}]

(21)

where

S u m_{t}

is the sum of the path flows of all vehicles,

S u m_{t} = f_{1, t} + f_{2, t} + \dots \dots + f_{j, t} + \dots \dots + f_{J, t}

.

F l o w P r o p_{t}^{C V}

and

F l o w P r o p_{t}

should follow the same distribution. Using the path flow estimation model based on a multi-knowledge graph, we can complement the defective path flow matrix

F l o w_{t}^{F l a w}

and finally obtain the flow matrix estimation result

F l o w_{t}^{'}

for all the paths in the arterial.

Similarly, for the path estimation flow matrix

F l o w_{t}^{'}

, we also require that its distribution of the path flow proportion is consistent with the path flow proportion distribution of the total flow matrix

F l o w_{t}

and the connected vehicle flow matrix

F l o w_{t}^{C V}

.

The estimated path flow proportion matrix

F l o w P r o p_{t}^{'}

is as follows:

F l o w P r o p_{t}^{'} = \frac{F l o w_{t}^{'}}{S u m_{t}^{'}} = [\frac{f_{1, t}^{'}}{S u m_{t}^{'}}, \frac{f_{2, t}^{'}}{S u m_{t}^{'}}, \dots \dots \frac{f_{j, t}^{'}}{S u m_{t}^{'}}, \dots \dots, \frac{f_{J, t}^{'}}{S u m_{t}^{'}}]

(22)

Therefore,

F l o w P r o p_{t}^{'}

,

F l o w P r o p_{t}^{C V}

, and

F l o w P r o p_{t}

should all obey the same distribution. To fulfill this requirement, we introduce a generative adversarial network framework.
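A small numerical example may make the consistency requirement in Equations (20)–(22) concrete. The flows below are invented purely for illustration, and `flow_proportions` is a hypothetical helper name.

```python
def flow_proportions(flows):
    """Divide each path flow by the sum over all paths (Equations (20)-(22))."""
    total = sum(flows)
    if total == 0:
        raise ValueError("path flows sum to zero; proportions are undefined")
    return [f / total for f in flows]


if __name__ == "__main__":
    total_flows = [40, 120, 40]  # illustrative total flows on three paths
    cv_flows = [10, 30, 10]      # illustrative connected vehicle flows (25% of each)
    print(flow_proportions(total_flows))
    print(flow_proportions(cv_flows))
```

When connected vehicles are an unbiased sample of all vehicles, both calls return the same proportion vector, which is the property the discriminator is later asked to enforce on the estimated flows.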

4.2.1. Generative Adversarial Network Architecture

The generative adversarial network (GAN) model was first proposed by Goodfellow et al. and is reviewed by Creswell et al. [43]. The GAN plays an important role in fields such as image filling and text generation. The GAN has also been introduced to solve transportation problems, such as traffic state estimation [44] and traffic data interpolation [45]. Owing to the flexibility of the framework, the GAN can combine different network structures to accomplish specific traffic tasks. The core idea of the GAN is derived from the two-player game model. The network structure of the GAN is shown in Figure 9 and consists of two basic components, the generator and the discriminator. The purpose of the generator

G e n

is to learn the distribution of real data, where the input is a random noise vector

τ

, and then generate fake data that looks real. The discriminator

D i s

is a component specifically designed to compete against the generator: it tries to determine whether the input data are real or generated, which pushes the generator to continuously improve and generate the best-fitting data.

4.2.2. Multi-Knowledge Graph GAN Model (MKG-GAN)

The goal of the multi-knowledge graph GAN model (MKG-GAN) is to approximate the true path flow of all unobserved paths in an arterial by building a semi-supervised path flow estimation framework that uses both connected vehicle trajectory data and defective AVI sensor data, while requiring that the estimation results are consistent with the proportional distribution of the connected vehicle observations. The MKG-GAN model still adopts the classic generator and discriminator structure of the GAN, as shown in Figure 10. We use the previously established multi-knowledge graph flow estimation model as the generator component. The generator takes as inputs the path flow matrices

[F l o w_{t - k}^{C V}, F l o w_{t - k + 1}^{C V}, \dots \dots F l o w_{t - 1}^{C V}, F l o w_{t}^{C V}]

of the connected vehicles in time period

t

and the previous

k

time periods. We extract the potential features through the topology connectivity graph, temporal similarity graph, and potential correlation graph, respectively, and fuse the features extracted from the three graphs through the RGCN. Then, the unobserved path flows in the defective flow matrix

F l o w_{t}^{F l a w} = [f_{1, t}, *, *, *, *, f_{6, t}, *, *, *, \dots \dots f_{j, t}, \dots \dots *, *, *, \dots \dots, f_{J, t}]

are estimated, and the complemented path flow matrix

F l o w_{t}^{'} = [f_{1, t}, f_{2, t}^{'}, f_{3, t}^{'}, \dots \dots, f_{6, t}, f_{7, t}^{'}, f_{8, t}^{'}, \dots \dots f_{j, t}, f_{j + 1, t}^{'} \dots \dots, f_{j + k, t}^{'}, \dots \dots, f_{J, t}]

is obtained.

In addition, based on the discussion in the previous section, we require that the path flow proportion distribution of the supplemented path flow matrix

F l o w_{t}^{'}

is consistent with the total flow matrix

F l o w_{t}

and the connected vehicle flow matrix

F l o w_{t}^{C V}

.

Therefore, we compute the estimated path flow proportion matrix

F l o w P r o p_{t}^{'}

as the fake sample input in the discriminator. The connected path flow proportion matrix

F l o w P r o p_{t}^{C V}

is used as the real sample input in the discriminator as a way to ensure that the proportion of each path flow in the estimation path flow matrix

F l o w_{t}^{'}

obtained by the generator conforms to the same distribution as the proportion of the path flow in the real case. Our discriminator component consists of several layers of fully connected networks.

As shown in Figure 10, the generator minimization loss function

L_{G e n_B C E}

in the GAN can be written in the form of minimizing the binary cross-entropy function:

L_{G e n_B C E} = \frac{1}{N} \sum_{i = 1}^{N} B C E (D i s (G e n (τ_{i})), 1)

(23)

The discriminator loss function

L o s s_{D i s c r i m i n a t o r}

in the GAN, originally a maximization objective, can be written in the form of minimizing the mean of two binary cross-entropy functions

L o s s_{D i s_B C E}

:

L o s s_{D i s c r i m i n a t o r} = L o s s_{D i s_B C E} = \frac{1}{N} \sum_{i = 1}^{N} [\frac{BCE (D i s (σ_{i}), 1) + BCE (D i s ({\hat{σ}}_{i}), 0)}{2}]

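(24)

where

σ_{i}

and

{\hat{σ}}_{i}

denote the real and generated (fake) samples fed to the discriminator, respectively, and

N

is the number of samples.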

As shown in Figure 10, the loss function of the generator should not only consider the loss

L_{G e n_B C E}

caused by the path flow proportion distribution error but also take into account the estimation loss

l_{M - G C N}

of the multi-knowledge graph path flow estimation model itself.

According to the discussion in Section 4.1.1, we can calculate the multi-knowledge graph model loss by using the real flow of the observed paths. The loss of the multi-knowledge graph model is denoted as:

l_{M - G C N} = \sum_{j \in p a t h s^{'}} l o s s (z_{j}, y_{j})

(25)

where

p a t h s^{'}

is the set of observed paths and

l o s s (z_{j}, y_{j})

is the calculated loss value between the estimated and true values of the path flow. The generator loss function is updated by adding the loss of the multi-knowledge graph model to the existing generator loss function:

L_{G e n e r a t o r} = w_{G e n_B C E} L_{G e n_B C E} + w_{M - G C N} l_{M - G C N}

(26)

In the equation,

w_{G e n_B C E}

and

w_{M - G C N}

are the weight parameters of each loss in the loss function, respectively.
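The loss terms in Equations (23)–(26) can be sketched on plain lists of discriminator outputs. This is a hedged illustration, not the training code used in the paper: the per-path loss in Equation (25) is not specified in this section, so squared error is assumed, and all function names and default weights are hypothetical.

```python
import math


def bce(p, target, eps=1e-12):
    """Binary cross-entropy of one predicted probability p against a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def generator_bce_loss(dis_on_fake):
    """Equation (23): mean BCE between Dis(fake sample) and the label 1."""
    return sum(bce(p, 1) for p in dis_on_fake) / len(dis_on_fake)


def discriminator_loss(dis_on_real, dis_on_fake):
    """Equation (24): mean of [BCE(Dis(real), 1) + BCE(Dis(fake), 0)] / 2."""
    pairs = list(zip(dis_on_real, dis_on_fake))
    return sum((bce(r, 1) + bce(f, 0)) / 2 for r, f in pairs) / len(pairs)


def supervised_loss(estimated, observed):
    """Equation (25) over observed paths, with squared error assumed per path."""
    return sum((z - y) ** 2 for z, y in zip(estimated, observed))


def generator_loss(dis_on_fake, estimated, observed, w_bce=1.0, w_mgcn=1.0):
    """Equation (26): weighted sum of the adversarial and supervised terms."""
    return (w_bce * generator_bce_loss(dis_on_fake)
            + w_mgcn * supervised_loss(estimated, observed))


if __name__ == "__main__":
    # A discriminator that outputs 0.5 everywhere cannot tell real from fake.
    print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
    print(generator_loss([0.5], estimated=[3.0], observed=[1.0], w_mgcn=0.5))
```

The two weights trade off distribution consistency against pointwise accuracy on the observed paths, which is relevant to the per-path results discussed in Section 5.4.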

5. Experiment

5.1. Description of Research Object and Data Set

To evaluate the validity of the proposed model and its potential for field application, a section of roadway in Shangyu City, Zhejiang Province, was selected for this study. Figure 11 shows its geometric layout and topology. The arterial consists of 2 intersections and 20 OD pairs, so it contains a total of 20 arterial paths. Both intersections in the arterial are equipped with full HD cameras (automatic vehicle identification devices). Therefore, during the study period, we could obtain the vehicle flows for all 20 paths in the arterial, which allows us to evaluate the accuracy of the model precisely.

To better evaluate the performance of our proposed model, we adopt the following experimental design. We use the travel data of the arterial from 10 July to 20 July 2021 as the experimental data. In the experiment, we define two types of vehicles: connected vehicles and regular vehicles. We can directly obtain the connected vehicle flow of each path in the arterial by aggregating the trajectories of the connected vehicles. Meanwhile, we categorize the paths in the arterial into two types: observed paths and unobserved paths. For the observed paths, we know their total path flow (connected vehicle flow + regular vehicle flow), and for the unobserved paths, we know only their connected vehicle flow but not their total path flow. We set the time interval to 10 min, which gives 144 samples per day and thus a total of 1584 samples for the 11-day study period.

5.2. Evaluation Methods and Indicators

To analyze and compare the estimation performance of each experiment, we divide the data set: the data from 10–18 July 2021 are used as the training set, and the data from 19–20 July 2021 are used as the test set. The number of training samples is 1296, and the number of test samples is 288. To verify the estimation performance of the proposed model, three error evaluation metrics are used in this paper: mean absolute error (MAE), mean square error (MSE), and the coefficient of determination (R²). The formulas for these metrics are as follows:

MAE = \frac{1}{N_{s a m p l e}} \sum_{i = 1}^{N_{s a m p l e}} |y_{i} - {\hat{y}}_{i}|

(27)

MSE = \frac{1}{N_{s a m p l e}} \sum_{i = 1}^{N_{s a m p l e}} {(y_{i} - {\hat{y}}_{i})}^{2}

(28)

R^{2} = 1 - \frac{\sum_{i = 1}^{N_{s a m p l e}} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{N_{s a m p l e}} {(y_{i} - \bar{y})}^{2}}

(29)

where

y_{i}

and

{\hat{y}}_{i}

denote the true and estimated values, respectively,

\bar{y}

denotes the mean of the true values, and

N_{s a m p l e}

denotes the number of samples. The lower the values of MAE and MSE, the higher the estimation accuracy of the model. The closer R² is to 1, the better the model fits the data.
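For reference, the three metrics can be computed as follows, using their standard definitions (squared residuals in the MSE, and the mean of the true values in the R² denominator). The function names are illustrative.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean square error (mean of squared residuals)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    y_true = [1.0, 2.0, 3.0]
    print(mae(y_true, [1.0, 2.0, 4.0]), mse(y_true, [1.0, 2.0, 4.0]))
    print(r2(y_true, [1.0, 2.0, 4.0]))
    print(r2(y_true, [3.0, 3.0, 3.0]))  # a poor estimator gives a negative R²
```

R² becomes negative whenever the residual sum of squares exceeds the variance of the true values around their mean, which is how an estimator can score below 0 on individual paths.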

5.3. Parameter Settings

In the experimental design, we set the penetration rate of connected vehicles to 25% and randomly divided the vehicles into connected and regular vehicles, giving a ratio of connected to regular vehicles of 1:3. As shown in Figure 11, a total of 20 paths exist in the studied region (Path1-2, Path1-3, Path1-4, Path1-5, Path2-1, Path2-3, Path2-4, Path2-5, Path3-1, Path3-2, Path3-4, Path3-5, Path4-1, Path4-2, Path4-3, Path4-5, Path5-1, Path5-2, Path5-3, Path5-4). The flow of some of these paths is very small, close to 0. Therefore, we ignore the five paths with tiny flows (Path1-3, Path3-1, Path3-2, Path4-1, and Path4-3), leaving 15 paths. We set the percentage of observed paths to about 50% and randomly divide the paths into observed and unobserved paths, so the ratio of observed to unobserved paths is approximately 1:1. The seven observed paths are Path1-2, Path1-4, Path2-1, Path2-4, Path3-4, Path5-1, and Path5-3. For these observed paths, we can obtain the real flow and use it to calculate the loss of semi-supervised learning. The eight unobserved paths are Path1-5, Path2-3, Path2-5, Path3-5, Path4-2, Path4-5, Path5-2, and Path5-4. For these unobserved paths, we only know the connected vehicle flow. Our objective is to estimate the real flow (including all vehicles) of these unobserved paths. To further evaluate the performance of the proposed model, we compare the following models.

(1) The simple scaling model (SSM): estimating the total path flow by scaling the path’s connected flow with the corresponding penetration rate of connected vehicles.

(2) The single-graph GCN model (S-GCN): uses only the topological connectivity graph as the graph structure, based on the GCN model for path flow estimation.

(3) The multi-knowledge graph-based GCN model (M-GCN): uses a multi-knowledge graph fused GCN model for path flow estimation.

(4) The GAN framework incorporating multiple knowledge graphs (MKG-GAN): the generator of the GAN network is a multi-knowledge graph GCN, and the discriminator is a multi-layer fully connected network.
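As a point of reference for the baseline in item (1), the simple scaling model amounts to dividing the connected vehicle flow by the penetration rate. The sketch below assumes a single known penetration rate shared by all paths; `ssm_estimate` is a hypothetical name.

```python
def ssm_estimate(cv_flows, penetration_rate):
    """Scale connected vehicle flows up to total flows (simple scaling model)."""
    if not 0.0 < penetration_rate <= 1.0:
        raise ValueError("penetration_rate must be in (0, 1]")
    return [f / penetration_rate for f in cv_flows]


if __name__ == "__main__":
    # Connected vehicle counts per path in one 10 min interval (illustrative).
    print(ssm_estimate([5, 0, 12], penetration_rate=0.25))
```

At a 25% penetration rate, each additional connected vehicle changes the estimate by four vehicles, so the baseline is coarse on paths that carry only a few vehicles per interval.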

To identify the optimal hyperparameters for the S-GCN, M-GCN, and MKG-GAN neural network models, we employed grid search combined with cross-validation techniques for hyperparameter tuning. The parameters to be optimized for each model, along with their respective search ranges, are summarized in Table 2. Within the predefined parameter search space, we systematically explored all possible combinations to identify the optimal configuration. Model performance under each parameter combination was evaluated using 10-fold cross-validation, with the MSE serving as the evaluation metric. The final optimal hyperparameter configurations for each model are presented in Table 3.

5.4. Analysis of Experimental Results

The training process of the MKG-GAN model was performed over 300 iterations and took 6580 s on a workstation equipped with NVIDIA GeForce RTX 3080 Ti GPUs (NVIDIA company, Santa Clara, CA, USA). In the testing phase, the estimation time of the model was about 0.6 s.

The time distribution patterns of some path flows in the studied arterial for one day are shown in Figure 12. As can be seen in Figure 12, there are significant differences in the traffic distribution patterns among the paths during the day, such as bimodal patterns (e.g., Path4-2, Path3-5), single-peak patterns (e.g., Path5-4), and other patterns (e.g., Path5-2). The temporal similarity graph and potential correlation graph between paths were computed based on the historical flow sequences of paths in the arterials, as shown in Figure 13b and Figure 13c, respectively. Additionally, the topological connectivity graph between paths was computed based on the geometric layout of the arterial, as shown in Figure 13a.

Firstly, this paper compares the overall performance of the above four models on all unobserved paths. In Table 4, the performance of the four models is evaluated using the three error metrics MAE, MSE, and R². The MAE measures the average magnitude of the errors in a set of estimations without considering their direction. It is calculated as the average of the absolute differences between the estimated values and the actual values. A lower MAE indicates better model performance. The MSE measures the average squared difference between the estimated values and the actual values. It penalizes larger errors more heavily than smaller ones, making it sensitive to outliers. A lower MSE indicates better model performance. The R² is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model, with a value closer to 1 indicating a better fit. As shown in Table 4, the MKG-GAN model performs best when the penetration rate of connected vehicles is 25% and the percentage of unobserved paths is about 50%. Over the eight unobserved paths, the MKG-GAN model produces an MAE of 2.33/10 min and an MSE of 11.53/10 min. Compared with the SSM model, these are improvements of 47.99% and 75.58%, respectively. The MKG-GAN model also outperforms S-GCN and M-GCN in these metrics. Moreover, the R² of the MKG-GAN model reaches 0.88, indicating that 88% of the variance in the actual flow rates can be explained by the MKG-GAN model, which suggests a strong fit. The proposed MKG-GAN structure captures more temporal and spatial information and ensures consistency in the distribution of path flows, significantly improving the accuracy of path flow estimation.

Referring to the first and second rows of Table 4, the MAE of the S-GCN model is reduced by 33.03% and the MSE by 50.74% in comparison to the SSM model. Moreover, the R² of the S-GCN model is improved by 55.55% to 0.77 compared to the SSM model, which indicates that effective estimation of unobserved path flow can be achieved through the semi-supervised learning mode of GCN by using a small number of observed paths as labeled data.

Referring to the second and third rows of Table 4, the MAE of the M-GCN model is reduced by 9% and the MSE by 31.77% compared with the S-GCN model. This indicates that by incorporating the multi-knowledge graph structure, the complex associative relationships of arterial paths can be better captured, and the estimation performance of the model can be significantly improved.

Referring to the third and fourth rows of Table 4, the MAE of the MKG-GAN model is reduced by 14.65% and the MSE by 27.35% compared with the M-GCN model. This shows that introducing the GAN framework keeps the proportional distribution of the estimated flow across paths consistent with the proportional distribution of the real flow, which improves the reasonableness of the estimation results and further improves the estimation performance of the model.
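The stepwise reductions quoted from Table 4 can be cross-checked against the overall improvement of MKG-GAN over SSM, because successive relative reductions compound multiplicatively. The helper `compound_reduction` below is a hypothetical name used only for this check, and the inputs are the rounded percentages quoted in the text.

```python
def compound_reduction(*step_reductions_pct):
    """Overall percentage reduction after applying successive relative reductions."""
    remaining = 1.0
    for r in step_reductions_pct:
        remaining *= 1.0 - r / 100.0
    return (1.0 - remaining) * 100.0


if __name__ == "__main__":
    # SSM -> S-GCN -> M-GCN -> MKG-GAN, using the percentages quoted in the text.
    print(round(compound_reduction(33.03, 9.0, 14.65), 2))   # MAE chain
    print(round(compound_reduction(50.74, 31.77, 27.35), 2))  # MSE chain
```

With these inputs the chains come out at roughly 47.99% for the MAE and 75.58% for the MSE, matching the overall improvements reported for MKG-GAN relative to SSM.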

To better evaluate the performance of the proposed MKG-GAN model in estimating the path flow, we further analyzed its flow estimation performance on each unobserved path (Path1-5, Path2-3, Path2-5, Path3-5, Path4-2, Path4-5, Path5-2, Path5-4). As shown in Table 5, the estimation performance of the proposed MKG-GAN model on each path is significantly better than that of the SSM model in each metric. In the last row of Table 5, except for Path3-5 and Path4-2, the R² score of the SSM model on the remaining paths is less than 0, which indicates that the SSM model is unable to explain the variations of the variables in these paths and cannot fit the data effectively. Comparing the estimation performance of the M-GCN and S-GCN models on each path, we find that the M-GCN outperforms the S-GCN model. This shows that by incorporating the multi-knowledge graph through RGCN, the correlation between arterial paths can be better captured, and the estimation performance of the flow on each unobserved path can be improved. Comparing the estimation performance of the MKG-GAN and M-GCN models on each path, we find that except for Path1-5 and Path4-2, the flow estimation performance of the MKG-GAN model on the rest of the paths is better than that of the M-GCN model. The slightly lower estimation performance of the MKG-GAN model on Path1-5 and Path4-2 may arise because the MKG-GAN model requires both that the individual path flow estimates be accurate and that the path flow proportions satisfy the principle of overall consistent distribution. Because both requirements must be satisfied, the proposed MKG-GAN model may lose some accuracy on certain paths in order to satisfy the overall consistent distribution principle of the path flow proportions.

Figure 14 and Figure 15 show the deviation plot of the estimated flow and the real path flow for each path.

Here, Figure 14 shows the four paths with the best estimation performance (in terms of R²) and Figure 15 shows the four with the worst. From Figure 14, it can be seen that the estimated flow of the MKG-GAN model agrees very well with the real path flow, and the proposed MKG-GAN model can produce sufficiently reliable estimates. Figure 15 shows that, even in these worst cases, the MKG-GAN model still produces acceptable estimates compared to the SSM model.

5.5. Critical Path Recognition Reliability Analysis

In the design of the multi-path progression traffic signal system, it is crucial to identify the arterial critical path flows. Therefore, we further test the reliability of the proposed MKG-GAN model for identifying the critical path flows. For each time interval, we first sort these paths in descending order based on the flow results estimated by the MKG-GAN model, which allows us to obtain a sequence of estimated paths. Then, we sort the paths in descending order based on the real flow of the paths, and we can obtain the real path sequence. By comparing the estimated path sequence with the real path sequence, we can determine how many critical paths can be accurately recognized by the proposed model and calculate the number of recognized critical paths with the correct positional ordering.
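The comparison of the two sorted sequences for one time interval can be sketched as follows. The exact counting rules behind the recognition rate and the position accuracy are described only in words here, so the definitions in this sketch are assumptions: a critical path counts as recognized if it appears among the top k estimated paths, and as correctly positioned if it occupies the same rank in both sequences. The function name and the flows are illustrative.

```python
def critical_path_recognition(estimated, real, k=4):
    """Compare the top-k paths by estimated flow with the top-k by real flow.

    estimated, real : dicts mapping path name -> flow for one time interval
    Returns (number of real critical paths recognized,
             number of top-k positions where both rankings agree).
    """
    est_rank = sorted(estimated, key=estimated.get, reverse=True)[:k]
    real_rank = sorted(real, key=real.get, reverse=True)[:k]
    real_set = set(real_rank)
    recognized = sum(1 for p in est_rank if p in real_set)
    same_position = sum(1 for e, r in zip(est_rank, real_rank) if e == r)
    return recognized, same_position


if __name__ == "__main__":
    real = {"A": 10, "B": 8, "C": 6, "D": 4, "E": 2}
    estimated = {"A": 9, "B": 5, "C": 7, "D": 1, "E": 3}
    print(critical_path_recognition(estimated, real))
```

Aggregating these per-interval counts over all test intervals would give rates of the kind reported in Figure 16, under the assumed definitions. Ties in flow are broken by dictionary order in this sketch.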

Figure 16 shows the success rate of the SSM model and the proposed MKG-GAN model in identifying the four critical paths (the number of critical paths was determined by the signal control model developed based on the geometry of the arterial [46]). From the figure, it can be seen that the MKG-GAN model recognizes all four critical paths with a recognition rate of 62%, and for the identified critical paths, their position accuracy reaches 82%, which is significantly better than the SSM model. The recognition rate of identifying three of the four actual critical paths is 97%, and the position accuracy is 79%. The recognition rate of identifying two of the four actual critical paths is 100%, and the position accuracy is 75%. The success rate of identifying one critical path out of four actual critical paths is also 100%, and the position accuracy is 72%. Thus, it is shown that the proposed model can produce satisfactory estimates of the correct ranking of the path flows, which are much higher than the SSM model both in terms of recognition rate and position accuracy.

5.6. Sensitivity Analysis

In this section, we perform a sensitivity analysis to evaluate the effectiveness of the proposed model for flow estimation under different penetration rates of connected vehicles and different shares of observed paths.

(1) First, we hold the share of observed paths constant (seven observed paths and eight unobserved paths) and compare the estimation performance of the MKG-GAN model for various connected vehicle penetration rates (ranging from 5% to 30%). The results are shown in Table 6. All three performance metrics of the SSM model and MKG-GAN improve as the penetration rate of connected vehicles increases, because the connected vehicle flow data become more representative of the total flow data characteristics. As shown in Table 6, when the penetration rate of connected vehicles is only 5%, the R² of MKG-GAN still reaches about 0.76, and the estimation result remains reliable. This indicates that the MKG-GAN model proposed in this paper is still trustworthy under a low penetration rate of connected vehicles.

(2) Secondly, we fix the penetration rate of connected vehicles at 25% and compare the estimation performance of the MKG-GAN model with different numbers of observed paths (ranging from four to ten). The results are shown in Table 7. The estimation accuracy of the SSM model remains unchanged as the number of observed paths rises, because the SSM model only estimates based on the penetration rate of connected vehicles and does not use other information. However, Table 7 shows that all three performance metrics of the MKG-GAN model improve as the number of observed paths rises. As the number of observed paths rises, more labeled data are available to train the model, which improves its estimation performance. As shown in Table 7, when there are only four observed paths, the MKG-GAN can still achieve an MAE of 2.80/10 min and an MSE of 20.94/10 min, which is a reliable estimation result. This indicates that the MKG-GAN model proposed in this paper can achieve reliable estimation results by using only a small amount of observation label data.

By analyzing the path flow estimation performance of the proposed MKG-GAN model under different connected vehicle penetration rates and different numbers of observed paths, we can see that the proposed model still achieves reliable estimates under low connected vehicle penetration rates and with few observed labels. This shows that the MKG-GAN model proposed in this paper remains applicable under different observation environments.

5.7. Reliability Analysis for Long-Distance Arterial

To verify the reliability of the proposed model in estimating the path flows of a long-distance arterial, this paper employs a simulation model based on a real arterial scenario containing eight consecutive intersections on Liangzhu Avenue in Shangyu City, Zhejiang Province. We used the traffic simulation software SUMO to build a simulation model of the long-distance arterial shown in Figure 17 for path flow estimation analysis. The simulation model is calibrated using field data. The calibrated simulated arterial can be used to generate simulation data for method evaluation.

In the simulation, two types of vehicles were similarly defined: connected vehicles and regular vehicles. The simulation was set to run for 120 h and was divided into 720 identical intervals (10 min each). The initial 96 h (576 simulation periods) were used to generate historical data to train the estimation model, while the subsequent 24 h (144 periods) were used to validate the estimation model. The penetration rate of connected vehicles in the simulation is 25%. For the arterial in Figure 17, which contains a total of 272 OD paths, we set the ratio of observed paths to unobserved paths to 1:1, randomly dividing the paths into observed and unobserved paths. To improve the reliability of the simulation results, we conducted 10 independent runs of the simulation, each with different initial conditions and random seed values, to ensure variability in the data.

We analyze the overall performance of the proposed MKG-GAN model on all 136 unobserved paths. As shown in Table 8, the MKG-GAN model achieves the most accurate estimates among the four models, with an MAE of 2.64/10 min and an MSE of 12.59/10 min, substantially outperforming the other three models. The standard deviations further show that the SSM model has the poorest estimation stability, while the MKG-GAN model has the best. Across the ten simulation runs, the performance of MKG-GAN fluctuates only slightly, with an MAE standard deviation of 0.11, an MSE standard deviation of 1.21, and an R² standard deviation of just 0.02. These results indicate that the proposed MKG-GAN model is statistically robust, which supports its use in real-world scenarios. In addition, as shown in Figure 18, the R² exceeds 0.7 for more than 90% of the unobserved paths and exceeds 0.85 for more than 80% of them. The proposed MKG-GAN model can therefore fit the actual flow of unobserved paths effectively, even when the arterial is long and the number of unobserved paths is large.
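The quantities discussed above can be computed along the following lines. This is a minimal sketch under stated assumptions: R² is taken to be the usual coefficient of determination, the standard deviation over runs is taken to be the sample standard deviation (`ddof=1`), and all function names and toy numbers are illustrative rather than taken from the paper.

```python
import numpy as np


def mae(y, yhat):
    """Mean absolute error between observed and estimated flows."""
    return float(np.mean(np.abs(y - yhat)))


def mse(y, yhat):
    """Mean squared error between observed and estimated flows."""
    return float(np.mean((y - yhat) ** 2))


def r2(y, yhat):
    """Coefficient of determination; negative when worse than the mean."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def share_above(r2_per_path, threshold):
    """Fraction of paths whose R² exceeds the threshold (as in Figure 18)."""
    return float(np.mean(np.asarray(r2_per_path) > threshold))


def summarize_runs(values):
    """Mean and sample standard deviation of one metric over repeated runs."""
    v = np.asarray(values, dtype=float)
    return float(v.mean()), float(v.std(ddof=1))


# Toy example with made-up flows (vehicles per 10 min), not paper data.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_est = np.array([11.0, 12.0, 13.0, 17.0])
print(mae(y_true, y_est), mse(y_true, y_est), r2(y_true, y_est))
print(share_above([0.90, 0.80, 0.60, 0.95], 0.7))
print(summarize_runs([1.0, 2.0, 3.0]))
```

Whether the paper reports the sample or the population standard deviation is not stated, so the `ddof=1` choice should be read as an assumption.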

To further evaluate the flow estimation performance of the MKG-GAN model under different traffic conditions and connected vehicle penetration rates, we designed four simulation scenarios representing different levels of congestion. Congestion scenarios one, two, three, and four correspond to increases in traffic flow on the original arterial of 5%, 10%, 15%, and 20%, respectively. As shown in Figure 19, the performance of the MKG-GAN model improves with the penetration rate of connected vehicles in all four scenarios, which is consistent with the observations in Section 5.6. As the penetration rate rises, the connected vehicle flow data reflect the characteristics of the total traffic flow more faithfully, which improves the estimates. Comparing the scenarios at the same penetration rate, the model's performance also improves as the level of congestion increases. A possible explanation is that, at a constant penetration rate, the number of connected vehicles grows with the overall arterial flow. Once the number of connected vehicles is sufficiently large, the connected vehicle flow data can capture the characteristics of the total traffic flow even at low penetration rates. This suggests that the proposed MKG-GAN model is likely to perform even better on busy arterial roads in large urban areas.
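One way to make the last argument concrete is a simple sampling model, which is an assumption introduced here for illustration and not part of the paper: if each vehicle on a path is connected independently with probability p, the connected vehicle count for a flow f is binomial, and the scaled-up estimate (count divided by p) has a relative standard deviation of sqrt((1 - p) / (p f)). The sketch below evaluates this for a hypothetical base flow under the four flow increases.

```python
import math


def relative_sampling_error(flow, penetration):
    """Coefficient of variation of the scaled-up estimate (CV count / penetration).

    Assumes, for illustration only, that each of `flow` vehicles is connected
    independently with probability `penetration` (binomial sampling).
    """
    if flow <= 0 or not 0 < penetration <= 1:
        raise ValueError("flow must be positive and penetration in (0, 1]")
    return math.sqrt((1 - penetration) / (penetration * flow))


BASE_FLOW = 40      # hypothetical path flow in vehicles per 10 min
PENETRATION = 0.10  # one of the penetration rates listed in Table 6

for increase in (0.0, 0.05, 0.10, 0.15, 0.20):
    flow = BASE_FLOW * (1 + increase)
    err = relative_sampling_error(flow, PENETRATION)
    print(f"flow +{increase:.0%}: relative sampling error = {err:.3f}")
```

Under this simplified model the relative error shrinks with the square root of the flow, so the gain from a 5% to 20% flow increase is modest; it is offered as one plausible reading of the trend in Figure 19, not as the authors' explanation.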

5.8. Research Limitations

While the proposed semi-supervised method for estimating arterial path flow demonstrates significant improvements in accuracy and reliability, it is important to consider several limitations and the specific conditions under which the model is most effective. Regarding model assumptions, the proposed method assumes stable traffic conditions. In the event of unexpected occurrences, such as accidents or special events, the traffic demand patterns on arterial roads may change, rendering the model no longer applicable. In terms of scalability, the model can be extended to other arterial roads for path flow estimation. However, its application in large-scale networks may face limitations. This is because, under complex and extensive road network conditions, the computational complexity of the model may increase significantly, requiring the decomposition of large networks to reduce computational complexity. As for infrastructure requirements, the proposed method is designed for use on arterial roads equipped with a sufficient number of AVI devices, and these devices need to have high availability and reliability. The application of this model in areas with limited sensor coverage or outdated infrastructure may pose challenges. In such cases, the quality of the data can directly impact the performance of the model.

6. Conclusions

Real-time arterial OD flows are critical for traffic control, but they are difficult to observe directly. To address this problem, we developed a data-driven model for signalized arterial path flow estimation that uses both connected vehicle trajectory data and AVI sensor data. To cope with the high sparsity of AVI data, we built a semi-supervised path flow estimation framework that approximates the real path flow by combining a GCN that fuses multiple knowledge graphs with a generative adversarial network. The adversarial component also encourages the generated path flow to follow the same distribution as the real path flow.

To evaluate the effectiveness of the proposed model, a practical validation was carried out based on the real flow data of an arterial in Shangyu City, Zhejiang Province. The experiments showed that the proposed MKG-GAN model has an MAE of 2.33/10 min, an MSE of 11.53/10 min, and an R² of 0.88 when the penetration rate of connected vehicles is 25%. Compared with the SSM model, the MAE, MSE, and R² are improved by 47.99%, 75.58%, and 62.96%, respectively.
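The improvement percentages can be checked directly against the overall results in Table 4. The short sketch below assumes that improvement is measured as the relative change with respect to the SSM baseline, with the sign flipped for R² because higher is better; the function name is illustrative.

```python
def relative_improvement(baseline, proposed, higher_is_better=False):
    """Percentage improvement of `proposed` over `baseline`."""
    change = proposed - baseline if higher_is_better else baseline - proposed
    return 100.0 * change / abs(baseline)


# Overall results on the unobserved paths (Table 4).
ssm = {"MAE": 4.48, "MSE": 47.22, "R2": 0.54}
mkg_gan = {"MAE": 2.33, "MSE": 11.53, "R2": 0.88}

improvements = {
    name: round(
        relative_improvement(ssm[name], mkg_gan[name],
                             higher_is_better=(name == "R2")),
        2,
    )
    for name in ssm
}
print(improvements)
```

Under this definition the computed values are 47.99% for MAE, 75.58% for MSE, and 62.96% for R², matching the figures quoted in the text.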

The results also showed that the proposed model can effectively identify the critical paths in the arterial. More specifically, the model correctly identified all four actual critical paths in 62% of cases, at least three of the four in 97% of cases, and at least two (and therefore at least one) in 100% of cases. In addition, the sensitivity analysis showed that the proposed model performs well across different connected vehicle penetration rates and different ratios of observed paths. Its performance improves as the penetration rate increases, and all three performance metrics improve as the share of observed paths increases. The proposed MKG-GAN model therefore remains applicable in different observation environments.
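The identification rates above can be read as a top-k overlap between actual and estimated flows. The following sketch assumes, as an illustration only, that the critical paths in a period are the four paths with the highest flow and that a rate is the share of periods with at least a given number of correct identifications; the function names and toy flows are hypothetical.

```python
def top_k_paths(flows, k=4):
    """Return the ids of the k paths with the largest flow."""
    return set(sorted(flows, key=flows.get, reverse=True)[:k])


def critical_hits(actual, estimated, k=4):
    """Number of actual critical paths that the estimate also ranks in the top k."""
    return len(top_k_paths(actual, k) & top_k_paths(estimated, k))


def hit_rate(periods, min_hits, k=4):
    """Share of periods in which at least `min_hits` critical paths are found."""
    hits = [critical_hits(a, e, k) >= min_hits for a, e in periods]
    return sum(hits) / len(hits)


# Made-up flows for one period (vehicles per 10 min), not paper data.
actual = {"1-5": 30, "2-3": 25, "2-5": 20, "3-5": 15, "4-2": 10}
estimated = {"1-5": 28, "2-3": 26, "2-5": 12, "3-5": 16, "4-2": 14}

print(critical_hits(actual, estimated))
print(hit_rate([(actual, estimated)], min_hits=3))
```

In the toy period, three of the four actual critical paths are recovered, so the period counts toward the "at least three" rate but not toward the "all four" rate.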

To further verify the reliability of the proposed model for long-distance arterials, a simulation model containing eight consecutive intersections based on a real scenario was developed for method validation. When the MKG-GAN model was used to estimate the flow of 136 unobserved paths, over 80% of the unobserved paths had an R² exceeding 0.85. The proposed MKG-GAN model was able to fit the real flow of unobserved paths accurately, even on a long-distance arterial with numerous unobserved paths. Extensive numerical studies have shown that the proposed MKG-GAN model estimates arterial path flows with reliable accuracy. This can support more efficient arterial management, such as multi-path signal coordination, which improves the efficiency of vehicles traveling along the arterial.

The proposed method is designed to be flexible and adaptable to changes in data availability and quality. As the penetration rate of connected vehicles (CVs) and the number of automatic vehicle identification (AVI) data points increase, more real-time traffic data will become available for analysis, which can significantly improve the performance of the MKG-GAN model. Higher CV penetration rates will provide more frequent and accurate traffic data, reducing the uncertainty of the estimates and improving overall accuracy. Additional AVI data points will provide more labeled data for training, leading to better generalization and more reliable estimates. By leveraging this growing supply of high-quality data, the proposed method can continue to provide insight into traffic flow dynamics and support more effective traffic management strategies.

In future research, we will deepen the study and increase the practical impact of the proposed path flow estimation method by integrating more advanced heuristic parameter selection algorithms, expanding the sensitivity analysis scenarios, and reducing the deployment cost of the model.

Author Contributions

Study conception and design: Z.Z. and G.R.; data collection and processing: Z.Z., W.L. and W.C.; modeling and interpretation of results: Z.Z., J.S. and Q.C.; draft manuscript preparation: Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (grant nos. 52202399 and 52372314), the China Postdoctoral Science Foundation (grant no. 2022M710679), and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX22_0286).

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to Zhe Zhang.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Schematic of arterial scenario equipped with AVI devices.

Figure 2. A signalized arterial structure.

Figure 3. Data-driven semi-supervised arterial path flow estimation problem description.

Figure 4. Framework of the proposed method.

Figure 5. Semi-supervised arterial path flow estimation based on GCN.

Figure 6. Topology connection diagram.

Figure 7. Multi-knowledge graph fusion based on RGCN.

Figure 8. The structure of the path flow estimation based on multiple knowledge graphs.

Figure 9. The structure of the typical GAN.

Figure 10. The structure of the multi-knowledge graph GAN model.

Figure 11. The geometric layout of the studied site.

Figure 12. The time distribution patterns of path flows in the studied arterial. (This figure serves as the foundation for calculating the temporal similarity and potential correlations between different paths. Based on these calculations, the temporal similarity graph and potential correlation graph within the multi-knowledge graph structure are constructed. We utilized the dynamic time warping (DTW) algorithm and the maximal information coefficient (MIC) algorithm to compute the temporal similarity and potential correlations based on the flow information of each path. These correlations are crucial for identifying patterns and dependencies that can inform the model’s output).

Figure 13. The corresponding adjacency matrices of the three knowledge graphs. (a) Topological connectivity graph. Each cell in the matrix represents the connectivity between two paths, with darker colors indicating stronger connections and reflecting higher topological proximity. This graph helps to capture the structural relationships between different paths in the arterial. (b) Temporal similarity graph. Each cell represents the temporal similarity between two paths, with darker colors indicating higher similarity. This graph captures the dynamic nature of traffic flow over time, providing insights into how different paths behave similarly during specific time intervals. (c) Potential correlation graph. Each cell represents the potential correlation between two paths, with darker colors indicating stronger correlations. This graph highlights the statistical dependencies and interactions between different paths. During the estimation process, the model utilizes RGCN to extract feature information from the topological connectivity graph, temporal similarity graph, and potential correlation graph. By deeply fusing these features, the model can leverage the characteristics of other paths that have strong associations with the target path, thereby enhancing the estimation accuracy.

Figure 14. The four paths with the best estimation performance. (a) Path1-5, (b) Path2-3, (c) Path5-2, and (d) Path5-4.

Figure 15. The four paths with the worst estimation performance. (a) Path3-5, (b) Path2-5, (c) Path4-2, and (d) Path4-5.

Figure 16. Critical path recognition reliability analysis. (a) SSM model, and (b) MKG-GAN model.

Figure 17. Schematic of long-distance arterial scenario.

Figure 18. Percentage of unobserved paths whose estimates satisfy different R² values.

Figure 19. Estimated performance of MKG-GAN model with different CV penetration rates for different traffic conditions.

Table 1. Examples of labeled and unlabeled path samples.

Path	Input Features	Label	Category
$m$	$[f_{m, t - k}^{C V}, \dots \dots, f_{m, t - 2}^{C V}, f_{m, t - 1}^{C V}, f_{m, t}^{C V}]$	$f_{m, t}$	Labeled path sample
$n$	$[f_{n, t - k}^{C V}, \dots \dots, f_{n, t - 2}^{C V}, f_{n, t - 1}^{C V}, f_{n, t}^{C V}]$	-	Unlabeled path sample
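The sample layout in Table 1 can be sketched as follows. This is a minimal illustration assuming that the connected vehicle flows and the total (AVI-observed) flows are stored as arrays of shape (paths, periods); the function name, argument names, and toy data are hypothetical and do not come from the authors' code.

```python
import numpy as np


def build_samples(cv_flow, total_flow, observed_mask, t, k):
    """Build the inputs and labels of Table 1 for period t.

    cv_flow:       (n_paths, n_periods) connected vehicle flows f^CV
    total_flow:    (n_paths, n_periods) total path flows f (used only where observed)
    observed_mask: (n_paths,) boolean, True for labeled paths
    Returns features of shape (n_paths, k + 1) covering periods t-k..t, and
    labels f_{m,t} with NaN for unlabeled paths.
    """
    if t - k < 0:
        raise ValueError("not enough history for the requested window")
    features = cv_flow[:, t - k:t + 1]
    labels = np.where(observed_mask, total_flow[:, t], np.nan)
    return features, labels


# Toy data: 2 paths, 6 periods; path 0 is labeled, path 1 is unlabeled.
cv = np.arange(12, dtype=float).reshape(2, 6)
total = cv * 4.0  # made-up total flows
mask = np.array([True, False])

X, y = build_samples(cv, total, mask, t=5, k=3)
print(X)
print(y)
```

Labeled and unlabeled paths share the same input window; only the label column distinguishes them, which is what allows the semi-supervised model to train on both.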

Table 2. Parameters to be optimized and their ranges.

Parameter	Range
Number of GCN hidden layers	2–4
Number of RGCN hidden layers	2–4
Number of fully connected hidden layers	2–4
Number of neurons in each hidden layer	8–256

Table 3. The optimal parameter combinations for each model.

Model	Parameter Settings
SSM	Expansion ratio is 1/0.25
S-GCN	GCN_Hidden_layer_sizes = (128, 128, 64, 64), Fully_connected_layer_sizes = (64, 32, 16)
M-GCN	RGCN_Hidden_layer_sizes = (128, 128, 64, 64), Fully_connected_layer_sizes = (64, 32, 16)
MKG-GAN	RGCN_Hidden_layer_sizes = (128, 128, 64, 64), Generator_Fully_connected_layer_sizes = (64, 32, 16), Discriminator_Fully_connected_layer_sizes = (64, 64, 16)

Table 4. The overall estimated performance of each model on the unobserved paths.

Model	MAE	MSE	R²
SSM	4.48	47.22	0.54
S-GCN	3.00	23.26	0.77
M-GCN	2.73	15.87	0.84
MKG-GAN	2.33	11.53	0.88

Table 5. The estimated performance of each model on each unobserved path.

Model	Metric	Path (Origin–Destination)
		1-5	2-3	2-5	3-5	4-2	4-5	5-2	5-4
MKG-GAN	MAE	1.70	1.66	2.42	2.72	4.19	2.34	1.61	1.97
MKG-GAN	MSE	5.06	5.33	9.39	14.46	35.66	11.13	4.53	6.70
MKG-GAN	R²	0.50	0.72	0.69	0.81	0.89	0.70	0.49	0.72
M-GCN	MAE	1.49	1.90	3.07	3.86	4.10	2.91	1.62	1.83
M-GCN	MSE	4.84	7.37	16.76	27.70	30.67	16.11	5.12	6.58
M-GCN	R²	0.51	0.61	0.44	0.65	0.90	0.56	0.42	0.72
S-GCN	MAE	4.56	2.13	3.17	4.06	6.73	3.40	1.76	2.23
S-GCN	MSE	5.65	9.26	18.10	30.87	95.29	23.40	5.63	9.74
S-GCN	R²	0.44	0.51	0.40	0.61	0.70	0.38	0.37	0.59
SSM	MAE	4.19	3.40	4.30	5.73	8.60	4.21	3.45	3.43
SSM	MSE	38.90	22.86	36.87	66.52	145.72	40.62	21.69	25.83
SSM	R²	−0.59	−0.11	−0.15	0.19	0.56	−0.02	−1.16	0.01

Table 6. Sensitivity analysis of various penetration rates.

Indicator	Model	Penetration Rate
		5%	10%	15%	20%	25%	30%
MAE	SSM	7.97	6.27	5.52	4.91	4.48	4.22
MAE	MKG-GAN	3.34	2.91	2.69	2.53	2.33	2.17
MSE	SSM	150.49	89.66	67.22	54.72	47.22	41.82
MSE	MKG-GAN	24.31	17.58	14.51	12.83	11.53	9.49
R²	SSM	−0.44	0.13	0.35	0.47	0.54	0.59
R²	MKG-GAN	0.76	0.83	0.85	0.87	0.88	0.91

Table 7. Sensitivity analysis under various numbers of observed paths.

Indicator	Model	Number of Paths with Observations
		4	5	6	7	8	9
MAE	SSM	4.48	4.48	4.48	4.48	4.48	4.48
MAE	MKG-GAN	2.80	2.57	2.33	2.33	2.16	2.07
MSE	SSM	47.22	47.22	47.22	47.22	47.22	47.22
MSE	MKG-GAN	20.94	13.78	13.03	11.53	10.37	8.69
R²	SSM	0.54	0.54	0.54	0.54	0.54	0.54
R²	MKG-GAN	0.78	0.82	0.85	0.88	0.90	0.92

Table 8. The overall estimated performance of each model on the unobserved paths in the long-distance arterial (x ± σ, listed mean x and standard deviation σ are computed from 10 simulation runs).

Model	MAE	MSE	R²
SSM	5.11 ± 0.68	67.24 ± 9.36	0.51 ± 0.10
S-GCN	3.57 ± 0.17	31.41 ± 4.51	0.78 ± 0.05
M-GCN	3.05 ± 0.14	16.37 ± 2.42	0.87 ± 0.02
MKG-GAN	2.64 ± 0.11	12.59 ± 1.21	0.93 ± 0.02

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
