1. Introduction
Urbanization drives the continuous expansion of cities and, with it, significant increases in traffic flow. Although intelligent transportation systems and computer technology have matured, their contribution to alleviating traffic congestion and reducing traffic accidents remains insufficient. Traffic flow prediction plays a vital role in intelligent traffic control and management systems and is therefore central to both goals. Traffic flow data are generally collected at predetermined intervals by sensors installed on roads, so they exhibit time-series characteristics.
Lane occupancy is a crucial indicator within traffic flow analysis. It refers to the percentage of lanes vehicles occupy on a roadway at any given time, and its prediction indicates the degree of roadway congestion and traffic flow density. Lane occupancy prediction differs from traffic flow prediction by focusing specifically on traffic congestion and accidents resulting from lane occupancy [1]. Accurate predictions enable traffic management departments to better understand road congestion and implement timely measures to mitigate it [2]. Real-time monitoring of lane occupancy allows traffic managers to adjust traffic signals, guide flows, and employ strategies to alleviate congestion and enhance road efficiency. Lane occupancy prediction also serves as a vital reference for transportation planners and designers: accurate forecasts help them assess road network capacity and traffic demand and identify bottleneck locations for informed planning, design, and improvement decisions. The outcomes of lane occupancy prediction facilitate rational planning of new roads, expansion of existing ones, and adjustment of traffic flow distribution strategies to meet future demand.
Furthermore, it is crucial for traffic safety management. Accurate lane occupancy predictions enable traffic management and emergency response agencies to identify potential crash risk areas and congestion points proactively. Accurate prediction of lane occupancy facilitates the implementation of appropriate measures, including increasing police patrols, installing temporary traffic signs, and adjusting traffic routes to prevent accidents or mitigate adverse impacts on traffic flow.
Recent advancements in intelligent transportation systems and computer technology have significantly enhanced lane occupancy prediction methods. However, owing to the openness and complexity of urban transportation systems, external factors such as weather conditions, traffic accidents, and traffic control influence lane occupancy rates [3]. Accurate lane occupancy prediction requires high-quality datasets, and the rapid advancement of sensor technology has made data collection more accessible, thereby providing ample data for prediction. Data collectors installed on roadsides or beneath the road surface gather vehicle-related information at predefined intervals, so the resulting records form a time series. However, holidays, weather, and other uncertain factors may affect the collected data, complicating lane occupancy rate prediction.

Early traffic flow prediction methods, including the Kalman filter and Markov models, were simple but ineffective under complex road conditions. As urbanization intensified, the increasing complexity and variability of urban roads led to the gradual abandonment of these methods. Subsequently, the Autoregressive Integrated Moving Average (ARIMA) model became a commonly used method for lane occupancy prediction. ARIMA attempts to uncover time-series patterns through autocorrelation and differencing and uses these patterns for future predictions [4]. Vector Autoregressive (VAR) models, akin to ARIMA, are favored in traffic flow forecasting for their high flexibility. However, these models have limited capability to capture the nonlinear features of traffic flow data, leading to suboptimal prediction performance. Classical models such as k-nearest neighbors (KNN) and support vector regression (SVR) share these drawbacks.

Recently, the surge in machine learning and deep learning has led to the development of various neural networks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), and models based on them have found widespread application in traffic flow prediction. These deep learning models outperform traditional traffic flow prediction models by effectively extracting spatial and temporal features from time-series data and by modeling nonlinear, non-stationary time series to accommodate the complex dynamics of traffic flow. In recent years, researchers have also increasingly studied Transformer-based models, which have demonstrated higher long-range traffic flow prediction accuracy than prior models.
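The two operations that ARIMA relies on, differencing and autocorrelation, can be illustrated with a minimal NumPy sketch. The synthetic series (a linear trend plus a 24-step cycle) and the helper names `difference` and `autocorrelation` are assumptions made for illustration; they are not taken from the cited work.

```python
import numpy as np

def difference(series, order=1):
    """Apply first-order differencing `order` times (the 'I' in ARIMA)."""
    out = np.asarray(series, dtype=float)
    for _ in range(order):
        out = np.diff(out)
    return out

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Hypothetical occupancy-like series: linear trend plus a 24-step cycle.
t = np.arange(240)
occupancy = 0.05 * t + np.sin(2 * np.pi * t / 24)

stationary = difference(occupancy)        # differencing removes the linear trend
acf_24 = autocorrelation(stationary, 24)  # one full cycle apart: strongly positive
acf_12 = autocorrelation(stationary, 12)  # half a cycle apart: strongly negative
```

After differencing, the trend disappears and the autocorrelation at the cycle length stands out, which is the kind of pattern an ARIMA-type model exploits.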
Long Short-Term Memory (LSTM) [5], a Recurrent Neural Network (RNN) variant, effectively addresses the long-term dependency issue, utilizing past information for future predictions. Central to LSTM is a cell state governed by three gates that selectively store, update, and output information. It excels in processing and predicting significant events within traffic flows characterized by time series, especially those with lengthy intervals and delays. However, LSTM is not ideal for long-term prediction due to its computational intensity.
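The gating described above can be made concrete with a single LSTM time step written in NumPy. This is a sketch of the standard LSTM cell equations, not the cited implementation; the weight layout (one stacked matrix `W` with forget, input, output, and candidate blocks) and the random inputs are assumptions chosen for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, input + hidden) and b has shape (4 * hidden,).
    Row blocks are ordered: forget gate, input gate, output gate, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:hidden])               # forget gate: what to keep in the cell state
    i = sigmoid(z[hidden:2 * hidden])      # input gate: what to write to the cell state
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: what to expose as hidden state
    g = np.tanh(z[3 * hidden:])            # candidate values to write
    c_t = f * c_prev + i * g               # updated cell state
    h_t = o * np.tanh(c_t)                 # updated hidden state
    return h_t, c_t

# Run the cell over a short random sequence (illustrative sizes only).
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(10, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)
```

Because each step depends on the previous hidden and cell states, the loop cannot be parallelized over time, which is one source of the computational cost noted above.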
The Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) [
6] model combines CNN and LSTM network structures to extract spatio-temporal features and accomplish sequence modeling. This model preprocesses original traffic flow data, inputs it into the CNN to extract spatial features, then inputs this CNN-processed data into the LSTM layer for next-moment time-series prediction, and finally applies inverse normalization to obtain the predicted value. However, akin to LSTM, this model requires substantial computational resources, and its training process is susceptible to overfitting.
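The normalization and inverse-normalization steps that bracket such a pipeline can be sketched as follows. Min-max scaling is an assumption here, since the text does not name the scheme, and the persistence "prediction" is only a stand-in for the CNN-LSTM output.

```python
import numpy as np

def minmax_fit(x):
    """Return the (min, max) pair used for scaling."""
    x = np.asarray(x, dtype=float)
    return float(x.min()), float(x.max())

def minmax_transform(x, lo, hi):
    """Scale values into [0, 1] using the fitted range."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

def minmax_inverse(x_scaled, lo, hi):
    """Map scaled values back to the original units."""
    return np.asarray(x_scaled, dtype=float) * (hi - lo) + lo

raw = np.array([12.0, 30.0, 45.0, 18.0])  # hypothetical flow readings
lo, hi = minmax_fit(raw)
scaled = minmax_transform(raw, lo, hi)

# Stand-in for the model: predict that the next scaled value equals the last one.
pred_scaled = scaled[-1]
pred = minmax_inverse(pred_scaled, lo, hi)  # prediction in original units
```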
The advent of large-scale artificial intelligence models has spurred extensive research into the Transformer architecture. This deep learning model, predicated on the self-attention mechanism, demonstrates remarkable proficiency in processing sequential data, including text, speech, and images. The Transformer architecture differs from traditional RNN and CNN structures, relying solely on the self-attention mechanism to discern sequence dependencies. The Transformer offers significant advantages, including parallel computation capabilities that enhance training speed and the ability to learn long-distance dependencies, thereby yielding superior outcomes for extensive time-series data.
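As an illustration of the self-attention mechanism, the sketch below computes single-head scaled dot-product attention in NumPy. The projection matrices are random placeholders and the sizes are arbitrary; it is meant only to show that every position attends to every other position in one matrix product, which is what permits parallel computation and direct long-distance dependencies.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X of shape (L, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (L, L): similarity of every pair of positions
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d = 6, 8
X = rng.normal(size=(L, d))
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

The (L, L) score matrix is also the reason the memory and compute cost of standard attention grows quadratically with sequence length.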
In recent years, extensive scholarly attention has been directed toward the Transformer architecture, resulting in the proposition of numerous enhanced models. These include the Autoformer [
7], ETSformer [
8], and FEDformer [
9], among others. However, these Transformer model variants are computationally intensive and characterized by complex parameters, leading to significant memory usage and computational resource consumption. The GBT (Two-stage Transformer) [
10] addresses the critical issue of overfitting associated with improper decoder input initialization by innovatively dividing the traditional Transformer architecture into two segments: the Auto-Regression and Self-Regression stages. This approach enables the model to enhance its predictive capacity while maintaining low temporal and spatial complexity.
The quality and characteristics of input data significantly influence the accuracy and efficiency of predictive models, which are crucial for the models’ performance. This influence stems from the models’ learning and inferential capabilities relying on the statistical regularities within the input data. In making forecasts, researchers often overlook the inherent patterns of changes in traffic flow data that demonstrate the characteristics of a time-series distribution; however, these laws significantly impact the accuracy of traffic flow predictions. Consequently, numerous academics have integrated various data decomposition methods into traffic flow prediction to address this issue. The data decomposition method, a signal processing technique, decomposes complex signals into simpler ones, thereby facilitating the analysis and comprehension of the signal’s characteristics and structure. Data decomposition methods have extensive applications in time-series forecasting. They enable forecasting models to delve deeply into the intrinsic patterns governing traffic flow data changes, thereby augmenting their capability to extract data characteristics.
Researchers have employed data decomposition methods to preprocess traffic flow data, inputting the decomposed signals into various prediction models and synthesizing each model’s outputs. This process significantly enhances the models’ capability to extract data features, improving the prediction accuracy and robustness. Commonly employed data decomposition methods, including the Fourier Transform (FT), Wavelet Transform (WT) [
11], and Empirical Mode Decomposition (EMD) [
12], decompose signals within the frequency, time–frequency, and time domains, respectively, unveiling the signal’s structure and characteristics from diverse perspectives. Applying these decomposition methods to traffic flow prediction yielded superior outcomes compared with a singular predictive model. However, the Fourier Transform struggles with non-stationary data and fails to analyze signals’ local characteristics adequately. At the same time, the Wavelet Transform is constrained by Heisenberg’s uncertainty principle, leading to pseudo-spectra and spectral aliasing in signals with mutations. Empirical Mode Decomposition, conversely, is subject to mode aliasing and endpoint effects.
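To make the idea of frequency-domain decomposition concrete, the following sketch splits a signal into low- and high-frequency parts with the FFT and checks that the parts sum back to the original. The synthetic signal, the function name `fft_band_split`, and the cutoff are assumptions for illustration only.

```python
import numpy as np

def fft_band_split(signal, cutoff_bin):
    """Split a real signal into low- and high-frequency parts using the FFT."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    low_spec = spectrum.copy()
    low_spec[cutoff_bin:] = 0.0      # keep only bins below the cutoff
    high_spec = spectrum - low_spec  # the remaining bins
    low = np.fft.irfft(low_spec, n=signal.size)
    high = np.fft.irfft(high_spec, n=signal.size)
    return low, high

t = np.arange(256)
slow = np.sin(2 * np.pi * 2 * t / 256)         # 2 cycles over the window
fast = 0.3 * np.sin(2 * np.pi * 40 * t / 256)  # 40 cycles over the window
low, high = fft_band_split(slow + fast, cutoff_bin=10)
```

The split is exact here because both components are stationary and fall on exact frequency bins; for non-stationary data the global Fourier basis cannot localize changes in time, which is the limitation noted above.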
Section 2 reviews related research on traffic flow prediction.
Section 3 outlines the fusion model’s overall architecture, including Variational Mode Decomposition [13], CSPNet-Attention [14], the Two-stage Transformer, and the Temporal Convolutional Neural Network-Long Short-Term Memory Network (TCN-LSTM) [15].
Section 4 describes the experimental data processing and experimental validation.
Section 5 summarizes the research conducted in this paper and offers insights into the future development of traffic flow prediction.
First, this study employs Variational Mode Decomposition (VMD), an adaptive and fully non-recursive method for mode decomposition and signal processing, to mitigate the complexity and nonlinearity of the time series. Secondly, it adopts a new attention mechanism, CSPNet-Attention, to replace the traditional attention mechanism; by sparsifying the attention matrix, CSPNet-Attention reduces computational complexity and memory consumption while achieving similar or even higher prediction accuracy. Then, taking full account of the seasonal characteristics of the experimental data, we adopt a model that integrates the Two-stage Transformer and TCN-LSTM to conduct the prediction experiments. The overall fusion model is named CT+. Finally, the predictions for the individual VMD components are recombined to obtain the final result.
2. Related Works
Traffic flow forecasting is a crucial foundation for traffic management and transportation planning, and it matters to both traffic managers and road users. It optimizes traffic signal control, mitigates congestion, enhances road capacity, and provides precise road condition information that guides travel mode and path selection and improves the travel experience. Consequently, traffic flow prediction significantly enhances urban transportation efficiency and service quality. Deep learning is now applied extensively in this domain, and numerous scholars have proposed deep learning-based traffic flow prediction models (see Supplementary Materials Table S1). Wumaier and colleagues [
16] employed the random forest algorithm to extract multiple bootstrap samples from the original dataset, constructing a decision tree for each to develop a short-term traffic flow prediction model. This model effectively minimizes prediction error and variance, enhancing the accuracy and efficiency of short-term flow predictions, thereby surmounting traditional methodological constraints. Zhang et al. [
17] applied the spatio-temporal feature selection algorithm (STFSA) to identify the optimal time lag and spatial road section count for the input data through correlation analysis, subsequently extracting relevant spatio-temporal traffic flow features and converting them into two-dimensional matrices embedded with spatio-temporal information. Subsequently, CNNs are utilized to extract and learn features from these two-dimensional matrices, facilitating the construction of a prediction model. This approach leads to the proposal of a short-term traffic flow prediction model that integrates spatio-temporal analysis within a CNN-based deep learning framework. Du et al. [
18] leveraged a deep learning framework combining irregular convolutional residual networks with LSTM units to propose an urban traffic flow prediction model based on a Deep Irregular Convolutional Residual LSTM Network (DST-ICRL). Chen et al. [
19] introduced a trainable location matrix within the graph convolutional network framework to facilitate adaptive learning of spatial patterns in graph structures. They proposed a Location-Based Graph Convolutional Network (Location-GCN) and amalgamated Long Short-Term Memory (LSTM) with the Location-GCN to devise an end-to-end traffic prediction model.
As deep learning technology advances, numerous innovative frameworks have emerged, with the Transformer model proposed by Vaswani et al. [
20] of Google standing out as particularly notable. This model eschews traditional CNN and RNN techniques, adopting Attention as its core method for representation learning. Furthermore, it integrates techniques like Positional Encoding, Feed-Forward Networks, Residual Networks, and Regularization to enhance its generalization capabilities and computational efficiency. The model performs excellently across various Natural Language Processing (NLP) tasks.
Given the Transformer architecture’s significant impact on deep learning, numerous researchers have focused on its enhancement and expansion. Their objectives include improving computational efficiency and generalization capacity, mitigating overfitting or underfitting risks, and boosting adaptability to diverse tasks and datasets. Innovations such as Autoformer, FEDformer, BERT [
21], and Pyraformer [
22] exemplify the continual evolution and optimization of the Transformer model, underscoring the research community’s dedication to advancing model performance. Throughout this evolution, the attention mechanism—a technique for flexible information selection and synthesis—has emerged as a focal point of research. Its development has spurred the creation of numerous variants, such as multi-head Attention [
20], bottleneck attention [
23], cross-modality Attention [
24], and graph attention [
25]. These variants showcase the attention mechanisms’ robust adaptability and innovation potential across various application contexts. The advent of attention mechanisms signifies their transition from supplementary functions to fundamental modeling tools, establishing them as essential technologies in fields including image processing, speech recognition, knowledge graphs, and recommender systems. Specifically, within the crucial domain of traffic flow prediction, integrating deep learning techniques with attention mechanisms has led researchers to propose various innovative models to enhance prediction accuracy and reliability.
For instance, Zhang et al. [
26] introduced the innovative GGCN-SA, a deep learning-based traffic flow prediction model. It leverages a Graph Convolutional Network (GCN) for spatial correlation and Gated Recurrent Units (GRUs) for temporal dependence, further employing a soft attention mechanism to aggregate spatiotemporal information selectively, thereby improving its characterization of traffic flow’s spatio-temporal features. Lin et al. [
27] proposed STAtt-Net, a novel Spatial and Temporal Attention-based Convolutional Network framework, to accurately predict citywide traffic flow. The system first represents traffic data as a two-dimensional, two-channel matrix in which each cell corresponds to the traffic flow of a specific area. Acknowledging the transportation system’s temporal correlation and dependence, the authors decompose the periodic patterns of the traffic data into three principal components: weekly trends, daily periodicity, and hourly proximity. STAtt-Net, which uses the attention mechanism with STBlock as its foundational unit, then learns the traffic flow’s temporal and spatial dependencies. This approach achieves superior prediction performance, particularly in addressing issues like spatial non-stationarity that challenge traditional methods.
Given the nonlinear and stochastic characteristics inherent to time-series data, including temperature changes, rainfall, interest rate fluctuations, exchange rate fluctuations, and traffic flow, accurate analysis and forecasting of these data are paramount. Such characteristics indicate that simple linear models are insufficient for capturing the complex interrelationships among data, necessitating advanced nonlinear models like neural networks and support vector machines to uncover underlying connections. Randomness implies that data values or their occurrence probabilities are indeterminate, adhering to certain probability distributions or stochastic processes, necessitating the application of mathematical tools like probability theory, mathematical statistics, and stochastic methods for their analysis and processing. Additionally, data instability, reflected in the variability of statistical properties like mean, variance, and covariance over time or space, necessitates employing techniques such as smoothing, differencing, and covariance analysis for effective management.
Considering the earlier analysis, accurately analyzing and predicting time-series data—including temperature changes, rainfall, interest rate fluctuations, exchange rate fluctuations, and traffic flow—presents a formidable research challenge with significant practical application value. Consequently, numerous scholars have investigated the pre-model training decomposition of data to improve its smoothness, thereby augmenting the model’s capability to discern the data’s intrinsic structure. These decomposition techniques transform complex, nonlinear, and nonsmooth raw data into stable subsequences that are more amenable to analysis and prediction, thereby revealing more profound data properties and laying a solid foundation for subsequent model training and prediction. In this endeavor, scholars persist in refining and innovating decomposition techniques to suit various data types, further enhancing the accuracy and efficiency of time-series analysis and prediction.
Li et al. [
28] initially employed wavelet decomposition to categorize original traffic flow data into high-frequency and low-frequency subsets, enhancing data stability. Subsequently, they applied a fusion model of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to predict the subsequent day’s traffic flow. Zhao et al. [
29] initially applied a fast Fourier Transform and its inverse within the time-series correlation module to decompose traffic flow data. They then used an attention-based encoder–decoder graph convolutional model to predict multi-step future traffic flow. Tian et al. [
30] used the Empirical Mode Decomposition method to segment traffic flow into components of varied frequencies, leveraging the self-similarity of each to choose appropriate prediction models for short-term forecasting.
Compared with wavelet decomposition, the fast Fourier Transform, and Empirical Mode Decomposition, Variational Mode Decomposition autonomously identifies the requisite number of modal decompositions, preventing modal aliasing and thus potentially enhancing post-processing prediction accuracy. Numerous studies have demonstrated significant improvements in model prediction accuracy after applying Variational Mode Decomposition to time-series data. For instance, Zhao et al. [31] first smoothed traffic flow data using Variational Mode Decomposition, fed the resulting subsequences into an LSTM prediction model whose parameters were optimized with an improved dung beetle optimizer (IDBO), and aggregated the subsequences’ predicted values to derive the final predictions. Yu et al. [
32] applied Variational Mode Decomposition to segment historical traffic flow data into K components, determined by varying K sample entropy values. They then smoothed each component using wavelet thresholding to obtain denoised subsequences, predicted each using an LSTM network, and aggregated these predictions for final results. Huang et al. [
33] addressed the non-stationarity and complexity of traffic flow data by decomposing it into K subsequences using Variational Mode Decomposition and constructing a CNN-GRU network within the Keras framework to extract temporal features. Given the uncertainty of traffic flow data influenced by external factors like travel demand and traffic signals, Kim et al. [
13] proposed a hybrid model incorporating Variational Mode Decomposition for traffic flow prediction.
Building on the preceding analysis, combining suitable data decomposition techniques with appropriate prediction methods can substantially improve a prediction model’s performance and forecasting accuracy. Data decomposition simplifies the original dataset and reduces its noise by segmenting it into several sub-signals, facilitating more efficient extraction of essential features and patterns. Furthermore, selecting algorithms or models that align with the data’s characteristics is crucial to establishing a mapping between the dataset and the predictive target. Using deep learning algorithms to optimize and enhance the model further improves the accuracy and applicability of the prediction methods. This holistic approach offers a more accurate and reliable method for forecasting future events in complex data scenarios.
3. Model Structure
This section primarily introduces the overall architecture of the model utilized in this study, the components of the model, and the processing of experimental data.
3.1. Overall Model Architecture
This study first segments the time-series data into several subseries using the Variational Mode Decomposition (VMD) technique, which is designed to uncover the data’s intrinsic structure and dynamic properties. The autocorrelation and partial autocorrelation coefficients of each subseries are then calculated, a critical step in identifying temporal dependencies within the data. Depending on whether a subseries exhibits a seasonal trend, one of two models is trained on it: the CT-Transformer model for subseries with seasonal trends, to capture cyclical variations, and the TCN-LSTM model for those without, focusing on long-term dependency extraction and learning. This differential approach allows each subseries to be analyzed and predicted by the model best suited to it.
Finally, the subseries predictions produced by the models are recombined, following the VMD reconstruction, to synthesize the final predicted sequence. This integration yields a composite prediction framework that leverages the multidimensional nature of time-series data to enhance prediction accuracy.
Figure 1 illustrates the detailed model architecture design, delineating the complete process from data decomposition through model training to sequence reconstruction.
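The decompose, route, predict, and recombine flow can be sketched in a few lines of Python. Everything model-specific here is a placeholder: the two "modes" are synthetic rather than VMD outputs, the seasonality rule (autocorrelation at the seasonal lag above a threshold) is an assumed criterion, and the seasonal-naive and last-value forecasters merely stand in for the Two-stage Transformer and TCN-LSTM branches.

```python
import numpy as np

def autocorrelation(x, lag):
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def is_seasonal(x, period, threshold=0.5):
    """Assumed rule: a mode is seasonal if its ACF at the seasonal lag is high."""
    return autocorrelation(x, period) > threshold

def seasonal_naive(x, horizon, period):
    """Stand-in for the seasonal branch: repeat the last full cycle."""
    reps = int(np.ceil(horizon / period))
    return np.tile(x[-period:], reps)[:horizon]

def last_value(x, horizon):
    """Stand-in for the non-seasonal branch: hold the last observed value."""
    return np.full(horizon, x[-1])

def forecast(modes, horizon, period):
    """Route each mode to a branch, then sum the per-mode predictions."""
    total = np.zeros(horizon)
    routes = []
    for mode in modes:
        if is_seasonal(mode, period):
            routes.append("seasonal")
            total += seasonal_naive(mode, horizon, period)
        else:
            routes.append("non-seasonal")
            total += last_value(mode, horizon)
    return total, routes

# Two synthetic "modes": a 24-step cycle and low-amplitude noise.
t = np.arange(240)
rng = np.random.default_rng(0)
modes = [np.sin(2 * np.pi * t / 24), 0.1 * rng.normal(size=t.size)]
prediction, routes = forecast(modes, horizon=12, period=24)
```

Summing the per-mode predictions mirrors the VMD constraint that the modes add up to the original signal.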
3.2. Variational Mode Decomposition
The key advantage of VMD [34] is its ability to adaptively determine the number of modal decompositions based on the signal’s actual conditions. An adaptive search mechanism matches the modes’ center frequencies and bandwidths, efficiently separating the intrinsic mode functions (IMFs) and partitioning the signal in the frequency domain, thereby identifying key signal components effectively [35].
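The original time series $f(t)$ is assumed to decompose into $K$ modal components, each with a finite bandwidth and a center frequency. VMD minimizes the sum of the estimated bandwidths subject to the constraint that the modes sum to the original signal. In its standard form, the constrained variational model is given in Equation (1):

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t) \tag{1}$$

Here, $\{u_k\} = \{u_1, u_2, \ldots, u_K\}$ represents the modal functions and $\{\omega_k\} = \{\omega_1, \omega_2, \ldots, \omega_K\}$ the center frequency of each mode; $\delta(t)$ is the Dirac delta and $*$ denotes convolution. In the term $j/(\pi t)$, $j$ denotes the imaginary unit and $1/(\pi t)$ is the decaying kernel of the Hilbert transform. The factor $e^{-j\omega_k t}$ is a complex exponential that shifts each mode’s spectrum to baseband around its center frequency $\omega_k$. The constrained optimization problem is converted into an unconstrained variational problem by applying a quadratic penalty term and the Lagrangian multiplier method, which gives the augmented Lagrangian function in Equation (2):

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle \tag{2}$$

Here, $\alpha$ represents the penalty parameter, and $\lambda$ signifies the Lagrange multiplier.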
The complete flowchart of the Variational Mode Decomposition (VMD) algorithm is illustrated in Figure 2. As Figure 2 shows, the constrained optimization problem is first transformed into an unconstrained variational problem using a quadratic penalty term and the Lagrange multiplier method; the functional is then updated iteratively until the specified stopping condition is satisfied.
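For reference, the iterative updates that such a flowchart summarizes are commonly written in the frequency domain as below. This is the standard alternating-direction form of VMD rather than a transcription of the figure: hats denote Fourier transforms, $\hat{f}$ is the spectrum of the original signal, and the dual step size $\tau$ and tolerance $\epsilon$ are symbols introduced here that do not appear in the text. In the first update, modes with index $i < k$ use their latest values and modes with $i > k$ use their values from iteration $n$.

```latex
\begin{align}
\hat{u}_k^{n+1}(\omega) &= \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}^{n}(\omega)/2}{1 + 2\alpha\left(\omega - \omega_k^{n}\right)^2} \\
\omega_k^{n+1} &= \frac{\int_0^\infty \omega \left|\hat{u}_k^{n+1}(\omega)\right|^2 d\omega}{\int_0^\infty \left|\hat{u}_k^{n+1}(\omega)\right|^2 d\omega} \\
\hat{\lambda}^{n+1}(\omega) &= \hat{\lambda}^{n}(\omega) + \tau\left(\hat{f}(\omega) - \sum_{k=1}^{K} \hat{u}_k^{n+1}(\omega)\right) \\
\text{stop when}\quad & \sum_{k=1}^{K} \frac{\left\|\hat{u}_k^{n+1} - \hat{u}_k^{n}\right\|_2^2}{\left\|\hat{u}_k^{n}\right\|_2^2} < \epsilon
\end{align}
```

Each mode update acts as a Wiener-type filter centered on the current $\omega_k$, and each center frequency is re-estimated as the center of gravity of its mode's power spectrum.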
3.3. CSPNet-Attention
The CSPNet-Attention mechanism, building upon the foundation of the Cross Stage Partial Network (CSPNet) [14], was developed to mitigate the computational demands imposed by extensive inference calculations observed in prior works while also addressing the redundancy of gradient information during the network optimization process. The inception of the Cross Stage Partial Network aimed to harmonize the first and final feature maps of network stages, thereby accommodating gradient variability. CSPNet streamlines the optimization of the F-function (the mapping from the input to the target y) by predominantly integrating feature maps from the initial and terminal stages of the network, thus curtailing computational load. The optimization formula is delineated as follows in Equation (3):
In the CSPNet-Attention framework, the input is bifurcated into two segments along the channel dimension. Here, T signifies the transition function employed to truncate the gradient flow across $H_1, H_2, \ldots, H_k$, where $H_k$ refers to the operating function of the kth layer in a Convolutional Neural Network (CNN). This function usually consists of a set of convolutional layers and a nonlinear activation function; in essence, $H_k$ represents the computation performed within the kth layer of the network. M represents the transition function that merges the two segments. An input $X \in \mathbb{R}^{L \times d}$, where $L$ is the input length and $d$ the input dimension, is partitioned along the feature dimension into two halves, $X_1$ and $X_2$. $X_1$ passes through a one-dimensional convolutional layer A and is carried to the end of the block, while $X_2$ serves as the input to the self-attention block B. The outputs of A and B are concatenated along the feature dimension to form the output of the entire block. The output matrix of the first level in the CSPNet-Attention architecture is articulated as follows in Equation (4):
In the CSPNet-Attention framework, $A(X_2^h)$ signifies the scaled dot-product attention within the $h$th self-attention block, where $W_h$ represents the $d_h \times d_h$ linear projection matrix. Here, $H$ denotes the number of heads, $d_h$ specifies the dimension of each head, and $W_c$ is the weight matrix of the one-dimensional convolutional layer, with dimensions $d/2 \times d/2$. The weights within CSPNet-Attention are updated as follows in Equation (5):
In this framework, f represents the weight update function, while g signifies the gradient propagated along the ith path. Notably, the network integrates the gradient of the segmented part independently, illustrating a distinct approach to gradient processing within the network.
The overall network architecture of CSPNet-Attention is shown in Figure 3. In this architecture, the input is split in two along its feature dimension. A self-attention block operating at half the input’s dimensionality processes the left segment, while a 1 × 1 convolutional layer routes the right segment directly to the end of the block.
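The split, process, and concatenate structure of the block can be sketched in NumPy as follows. This is a simplified single-head illustration under stated assumptions: the weights are random placeholders, the kernel-size-1 convolution is written as a per-position linear map with a hypothetical matrix `W_c`, and the function name `csp_attention` is not from the cited work.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def csp_attention(X, W_c, W_q, W_k, W_v):
    """CSP-style attention block on X of shape (L, d), with d even.

    One half of the features bypasses attention through a kernel-size-1
    convolution; the other half goes through self-attention. The two
    results are concatenated along the feature dimension.
    """
    half = X.shape[1] // 2
    X1, X2 = X[:, :half], X[:, half:]
    branch_a = X1 @ W_c                         # 1x1 conv == same linear map at every position
    Q, K, V = X2 @ W_q, X2 @ W_k, X2 @ W_v      # projections act on d/2 features only
    weights = softmax(Q @ K.T / np.sqrt(half))  # (L, L) attention weights
    branch_b = weights @ V
    return np.concatenate([branch_a, branch_b], axis=1)

rng = np.random.default_rng(0)
L, d = 10, 8
half = d // 2
X = rng.normal(size=(L, d))
W_c, W_q, W_k, W_v = (rng.normal(scale=half ** -0.5, size=(half, half)) for _ in range(4))
Y = csp_attention(X, W_c, W_q, W_k, W_v)
```

Because the attention branch sees only half of the feature dimension, its projection matrices are a quarter of the size they would be on the full input, which is where the saving in this sketch comes from.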
3.4. Two-Stage Transformer
This study adopts the Two-stage Transformer [10], an innovative framework for time-series forecasting. It restructures the traditional Transformer forecasting method by segmenting it into an Auto-Regression Stage and a Self-Regression Stage. This design addresses the discrepancies in statistical properties between the input and predicted sequences, and the decoupling of the two stages significantly reduces time and space complexity.
To further enhance model performance, this study substitutes the original self-attention mechanism in the Encoder with the CSPNet-Attention mechanism, augmenting the model’s capacity to capture sequence features. Moreover, the Two-stage Transformer incorporates an Error Score Modification (ESM) module, which employs a learnable Gaussian distribution $N(0, \sigma_i^2)$, where $i \in [1, l_{out}]$, to improve the model’s prediction accuracy and robustness. Within ESM, the distribution’s center is set to zero, which biases the self-attention scores towards earlier, more reliable predicted elements. Consequently, earlier predicted elements exert more influence on subsequent ones in the autoregressive process, improving the model’s overall performance in time-series prediction.
Algorithm 1 illustrates the flow of the Error Score Modification module, with its formulation provided as follows in Equation (6):
Algorithm 1 Masked self-attention with ESM
Input: Tensor
Layer params: Linear(): Linear projection layer of
1:
2:
3: Calculate self-attention scores
4: × 10−5
5:
6:
7: Generate Gaussian distribution matrix
8: Add self-attention scores with ESM:
9:
Output: Masked self-attention scores A
In $N(0, \sigma_i^2)$, $\sigma_i$ is learnable and adapts continuously throughout model training, which makes the Error Score Modification module highly adaptable.
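One plausible reading of this idea is sketched below: a zero-centered Gaussian bias, with one standard deviation per query row, is added to causally masked attention scores so that earlier positions receive more weight. This is an assumption-laden illustration and not the formulation of the cited work; in particular, where the bias is added (here, to the raw scores before the softmax), the fixed `sigmas` values standing in for learnable parameters, and the function names are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)  # exp(-inf) evaluates to 0, so masked entries get zero weight
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_bias(sigmas, length):
    """Row i holds the N(0, sigma_i^2) density evaluated at positions 0..length-1."""
    pos = np.arange(length, dtype=float)[None, :]
    s = np.asarray(sigmas, dtype=float)[:, None]
    return np.exp(-0.5 * (pos / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

def masked_attention_with_bias(Q, K, sigmas):
    """Causal self-attention weights with an additive zero-centered Gaussian bias."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores = scores + gaussian_bias(sigmas, L)          # assumed: bias added to raw scores
    future = np.triu(np.ones((L, L), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # forbid attending to later positions
    return softmax(scores, axis=-1)

# With identical queries and keys, the raw scores are uniform, so any
# preference among positions comes from the Gaussian bias alone.
L, d = 5, 4
Q = np.zeros((L, d))
K = np.zeros((L, d))
sigmas = np.full(L, 2.0)  # stand-ins for the learnable sigma_i
A = masked_attention_with_bias(Q, K, sigmas)
```

In the last row of `A`, the weights decrease monotonically with position, showing how a zero-centered distribution shifts attention toward earlier elements.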
We replace the self-attention mechanism in the first stage with CSPNet-Attention; the improved model architecture is shown in Figure 4. In the first stage, the input variables are fed into the Encoder, followed by a convolution operation; the output is then flattened into a one-dimensional tensor and passed through a fully connected layer to realize a linear transformation. In the second stage, the output of the first stage is used as the input to a decoder, and a feed-forward neural network (consisting of two FC layers and an activation function) performs non-linear transformation and feature extraction. Finally, the result is mapped back to the input dimension by linear projection.
3.5. TCN-LSTM
The Temporal Convolutional Network-Long Short-Term Memory (TCN-LSTM) model is a hybrid neural network architecture that combines the benefits of Temporal Convolutional Networks (TCNs) and Long Short-Term Memory (LSTM). TCNs use a convolutional structure to process data series in parallel, which strengthens a model’s ability to capture long-term dependencies in time-series prediction. A TCN consists of a one-dimensional fully convolutional network and several stacked dilated causal convolutional layers, which makes it well suited to analyzing and predicting sequence data with intricate time-dependent structures. LSTM, a class of Recurrent Neural Networks (RNNs), addresses the short-term memory problem that traditional RNNs face with lengthy data sequences. Through a gating mechanism, LSTM manages the storage and retrieval of long-term information, sustaining historical memory while still capturing short-term sequence dependencies. This capability is important for modeling and forecasting time series of variable lengths and intricate patterns.
By combining the TCN’s ability to capture long-term dependencies with the LSTM’s handling of short-term dependencies, the TCN-LSTM model offers a versatile solution for time-series prediction, improving both adaptability and precision on intricate sequence prediction tasks.
The TCN algorithm [36] comprises two primary components: convolution operations and residual connections. Its convolution operations use dilated convolution, which expands the receptive field by inserting gaps (equivalent to zeros) between the elements of the convolution kernel. The dilated convolution employed in the TCN model is given by Equation (7):

$$Y[t] = \sum_{i=0}^{k-1} W[i] \cdot X[t - d \cdot i] \tag{7}$$

where $Y[t]$ represents the output at time $t$, $X[t]$ is the input signal, $W[i]$ denotes the weight at the $i$th position of the convolution kernel, $k$ is the size of the kernel, and $d$ represents the dilation factor, which determines the spacing between the weights in the convolution operation.
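The dilated convolution can be sketched in a few lines. The sketch below assumes the causal form, in which each output sums $W[i] \cdot X[t - d \cdot i]$ over the kernel positions and inputs before the start of the series are treated as zero; the function name and the zero-padding choice are illustrative assumptions.

```python
def dilated_causal_conv(x, w, d):
    """Compute Y[t] = sum_{i=0}^{k-1} W[i] * X[t - d*i] for every time step t.

    Inputs before t = 0 are treated as zero (implicit left padding), so the
    output has the same length as the input and never looks into the future.
    """
    k = len(w)
    y = []
    for t in range(len(x)):
        total = 0.0
        for i in range(k):
            idx = t - d * i
            if idx >= 0:
                total += w[i] * x[idx]
        y.append(total)
    return y

# Kernel size 2 with dilation 2: each output adds the value two steps earlier.
print(dilated_causal_conv([1, 2, 3, 4, 5], [1, 1], d=2))  # [1.0, 2.0, 4.0, 6.0, 8.0]
```

A single layer with kernel size $k$ and dilation $d$ covers $1 + (k - 1) \cdot d$ time steps, which is why stacking layers with growing dilation enlarges the receptive field quickly.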
The residual block in the TCN model comprises two convolutional layers coupled with a residual connection. Here, the 1D-FCN is a one-dimensional fully convolutional network that processes the input data by sliding the convolution kernel along the time steps of the input sequence to extract temporal features. The computation within the residual block proceeds as follows:
The input $x$ is processed through a dilated convolutional layer, resulting in the output $y$.
The output $y$ is then added to the input $x$ to form the residual block’s output.
The residual block is expressed in Equation (8), where $o$ denotes the block’s output:

$$o = x + y, \qquad y = \mathrm{DilatedConv}(x) \tag{8}$$
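The two steps of the residual block can be sketched as follows. This is a simplified illustration with a single dilated causal convolution and no activation or normalization; the function name and the zero-padding convention are assumptions.

```python
def residual_block(x, w, d):
    """Return x + y, where y is a dilated causal convolution of x.

    w is the convolution kernel and d the dilation factor; inputs before the
    start of the series are treated as zero.
    """
    k = len(w)
    # Step 1: dilated convolution of the input, giving y.
    y = [
        sum(w[i] * x[t - d * i] for i in range(k) if t - d * i >= 0)
        for t in range(len(x))
    ]
    # Step 2: the residual connection adds the input back onto y.
    return [xi + yi for xi, yi in zip(x, y)]

# With an all-zero kernel the block reduces to the identity mapping.
print(residual_block([1, 2, 3], [0, 0], d=1))  # [1, 2, 3]
```

The identity behaviour shown in the usage line is the point of the residual connection: the block only has to learn a correction to its input.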
Long Short-Term Memory (LSTM) is a specialized Recurrent Neural Network (RNN) variant designed to learn long-term temporal dependencies. The core of LSTM is its cell state, akin to a conveyor belt running through the network, which undergoes only a few linear operations and therefore lets information flow through largely unchanged. LSTM adds information to, or removes it from, the cell state through a gate mechanism. The core update of the LSTM is given in Equation (9):

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{9}$$

Here, $\odot$ denotes element-wise multiplication. $\tilde{c}_t$ denotes the candidate cell state, which represents the result of processing the new information at the current moment $t$. $\tanh$ is the hyperbolic tangent activation function, which compresses its input into the range [−1, 1]. $W_c$ is the weight matrix connecting the input $x_t$ to the candidate cell state $\tilde{c}_t$, $U_c$ is the weight matrix connecting the hidden state $h_{t-1}$ of the previous moment to the candidate cell state, and $b_c$ is the bias vector. $f_t$ is the forget gate, which determines which parts of the cell state are forgotten or retained; its output lies within [0, 1]. $c_{t-1}$ is the cell state at the previous moment, and $i_t$ is the input gate, which determines how much of the current moment’s input is written to the cell state; its output also lies within [0, 1]. The output gate $o_t$ and the cell state $c_t$ jointly determine the hidden state $h_t$ of the LSTM, as given in Equation (10):

$$h_t = o_t \odot \tanh(c_t) \tag{10}$$
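A single LSTM step can be sketched for the scalar case. The candidate state, cell state, and hidden state follow the description above; the sigmoid form of the three gates and the parameter names (`Wf`, `Uf`, `bf`, and so on) are the standard LSTM convention and are assumptions here, since the text does not spell them out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps parameter names such as 'Wf' to floats."""
    f_t = sigmoid(p["Wf"] * x_t + p["Uf"] * h_prev + p["bf"])        # forget gate in [0, 1]
    i_t = sigmoid(p["Wi"] * x_t + p["Ui"] * h_prev + p["bi"])        # input gate in [0, 1]
    o_t = sigmoid(p["Wo"] * x_t + p["Uo"] * h_prev + p["bo"])        # output gate in [0, 1]
    c_tilde = math.tanh(p["Wc"] * x_t + p["Uc"] * h_prev + p["bc"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde   # keep part of the old state, write part of the new
    h_t = o_t * math.tanh(c_t)           # hidden state exposed to the next layer
    return h_t, c_t

# With all parameters at zero every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved.
names = [g + s for g in ("W", "U", "b") for s in ("f", "i", "o", "c")]
params = {name: 0.0 for name in names}
print(lstm_step(1.0, 0.0, 2.0, params))
```

In the vector case the products become matrix-vector products and the gate multiplications become element-wise, but the data flow is the same.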
In LSTM, a connection exists not only between the hidden units $h_{t-1}$ and $h_t$ but also, as a linear self-loop, between the cell states $c_{t-1}$ and $c_t$. This linear self-loop carries information forward through time: it retains past information when the gating unit is activated and discards it upon deactivation, which enables LSTM to address the long-term dependency problem of RNNs. To combine the strengths of TCN and LSTM, the preprocessed time-series data are first input into the TCN, processed by its convolution and residual blocks, and then passed to the LSTM; the final output is produced by a fully connected layer. In the TCN-LSTM model, the TCN layer is responsible for feature extraction and the LSTM layer for sequence modeling. This hybrid approach enhances the model’s capacity to learn from data with complex temporal dynamics [37]. The overall network architecture of the TCN-LSTM model is shown in Figure 5.
4. Experiments and Analysis
4.1. Experimental Data Processing
The traffic flow dataset (PeMS) used in this study originates from a comprehensive data collection initiative by the California Department of Transportation (Caltrans) across the state’s freeway system, supplemented by contributions from other state transportation agencies and partners. The dataset spans 48 months from 2016 to 2018 and concentrates on roadway occupancy along San Francisco’s freeway. Aggregating and organizing this data enabled researchers to offer a unified, comprehensive analytical framework for assessing the freeway system’s operational performance. The framework builds on a thorough analysis of the freeway network’s current state, facilitating the development of effective operational strategies and identifying and examining critical bottleneck zones for potential improvements, thereby offering robust data support for traffic management decisions.
Collecting and analyzing this dataset is crucial for optimizing traffic flow management and enhancing road usage efficiency. The data were split into training, validation, and testing sets in proportions of 70%, 10%, and 20%, respectively. This allocation aims to ensure effective training and good generalization while leaving ample data to validate and test the model’s performance. Details of the dataset are provided in Table 1.
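A 70%/10%/20% split can be sketched as below. The sketch assumes a chronological split (earliest observations for training, latest for testing), which is the usual choice for time series but is not stated explicitly in the text; the function name is illustrative.

```python
def chronological_split(series, train_pct=70, val_pct=10):
    """Split a sequence into train/validation/test parts without shuffling."""
    n = len(series)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]   # the remainder, 20% by default
    return train, val, test

train, val, test = chronological_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 10 20
```

Keeping the time order intact prevents observations from the test period leaking into training.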
This study employs a seasonal characterization approach to explore seasonal trends within the PeMS Traffic dataset. As depicted in Figure 6, the data were first segmented by season, and then average values were calculated for each season. Comparative analysis reveals that the dataset exhibits more pronounced trend fluctuations during summer and winter than during spring and fall.
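The segment-then-average step can be sketched as follows. The month-to-season mapping (Northern Hemisphere meteorological seasons) and the `(month, value)` record layout are assumptions made for illustration; the text does not specify how season boundaries were drawn.

```python
from collections import defaultdict

# Assumed mapping: meteorological seasons for the Northern Hemisphere.
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def seasonal_means(records):
    """records: iterable of (month, value) pairs; returns {season: mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, value in records:
        season = SEASON_OF_MONTH[month]
        sums[season] += value
        counts[season] += 1
    return {season: sums[season] / counts[season] for season in sums}

# Hypothetical occupancy readings, for illustration only.
print(seasonal_means([(1, 0.2), (2, 0.4), (7, 0.1)]))
```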
The analysis is further extended by computing the autocorrelation coefficients (ACF) and partial autocorrelation coefficients (PACF) of each subsequence obtained from Variational Mode Decomposition (VMD), in order to assess the temporal properties of each subsequence. The ACF and PACF values of each subseries are used to determine whether it carries a seasonal trend; higher ACF and PACF values indicate stronger seasonal correlation. This analysis deepens the understanding of seasonal variations within the dataset and, by combining VMD with time-series characterization, yields a more precise and systematic framework for the seasonal characterization of the PeMS Traffic dataset.
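Assuming that the time series is $\{x_t\}$ with mean $\mu$, the autocovariance function (ACVF) at lag order $k$ is given in Equation (11):

$$\gamma_k = \mathrm{Cov}(x_t, x_{t-k}) = E[(x_t - \mu)(x_{t-k} - \mu)] \tag{11}$$

The autocorrelation coefficients are defined in Equation (12):

$$\rho_k = \frac{\gamma_k}{\gamma_0} \tag{12}$$

The partial autocorrelation function (PACF) measures the linear relationship between the series $\{x_t\}$ and its lag $\{x_{t-k}\}$ of order $k$, excluding the linear influence of all intervening lags $\{x_{t-1}, x_{t-2}, \ldots, x_{t-(k-1)}\}$, and is calculated as in Equation (13):

$$\phi_{kk} = \mathrm{Corr}(x_t, x_{t-k} \mid x_{t-1}, x_{t-2}, \ldots, x_{t-(k-1)}) \tag{13}$$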
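Sample estimates of the ACF and PACF can be sketched in plain Python. The ACF below divides the lag-$k$ sample autocovariance by the lag-0 value; the PACF uses the Durbin-Levinson recursion, which is one common estimator and not necessarily the one used in this study. Function names are illustrative.

```python
def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator, divisor n)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    gamma0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / n / gamma0
        for k in range(max_lag + 1)
    ]

def pacf(x, max_lag):
    """Partial autocorrelation for lags 0..max_lag via Durbin-Levinson."""
    rho = acf(x, max_lag)
    phi_prev = []          # coefficients of the AR(k-1) fit from the previous step
    out = [1.0]
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new last one.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out

series = [1, 2, 3, 4, 5]
print(acf(series, 2))
print(pacf(series, 2))
```

At lag 1 the two functions agree by construction, since there are no intervening lags to remove.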
4.2. Evaluation of Performance Indicators
In this study, we use the Mean Absolute Error (MAE) and Mean Squared Error (MSE) as evaluation indices to assess the experimental outcomes. The MSE reflects model performance well when the error distribution is approximately normal. Furthermore, employing MAE as a loss function can enhance the model’s generalization in scenarios with substantial outliers or noise in the data. MAE and MSE are defined in Equations (14) and (15):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{14}$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{15}$$

where $y_i$ is the target value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples. The MAE is the average of the absolute differences between the target and predicted values. The MSE, the most commonly used regression loss function, is the average of the squared differences between predicted and actual values. Smaller values of both metrics indicate superior model performance. Although the Mean Absolute Percentage Error (MAPE) provides an intuitive percentage error in some cases, it is undefined for zero values and over-sensitive to small values, which limits it in practical applications; we therefore use MAE and MSE, which are more stable and versatile.
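Both metrics are straightforward to compute. The sketch below uses the standard definitions, the mean of absolute errors and the mean of squared errors; the function names and the sample values are illustrative.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_i - y_hat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 2.0, 5.0]
print(mae(y_true, y_pred))  # 1.0
print(mse(y_true, y_pred))
```

The single error of 2 contributes twice as much as the error of 1 to the MAE but four times as much to the MSE, which illustrates why the MSE is the more outlier-sensitive of the two.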
4.3. Comparative Experiment
Our experiments were run with Python 3.11.5, PyTorch 2.11.0, and CUDA 12.1. The hardware comprised an Intel Core i5-13490F CPU, a GeForce RTX 4060 Ti GPU (8 GB), a 2048 GB hard disk, and 32 GB of RAM, running Windows 11 Professional Edition.
Table 2 presents the configuration of the parameters in the CT-Transformer model.
In CSPNet-Attention, the “factor” parameter governs the sparsity of sparse self-attention, determining the maximum relative distance considered for each position. “Scale” is an optional factor that adjusts the attention scores and is commonly employed in attention mechanisms to modulate the attention distribution range. “Dropout rate” refers to the probability of omitting units during the attention calculation, a regularization strategy designed to prevent model overfitting. In the TCN-LSTM hybrid model, we configured the prediction window to 96/192/336/720, set the learning rate at 0.001 and the dropout rate at 0.05, and conducted 20 training epochs. We chose a batch size of 32, allocated 64 units to the hidden layer, and applied a convolutional layer with a kernel size of 3.
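The TCN-LSTM settings listed above can be collected in a small configuration object, one per prediction window. The class and field names are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TCNLSTMConfig:
    """Hyperparameters of the TCN-LSTM hybrid model (field names are illustrative)."""
    pred_len: int                 # prediction window: 96, 192, 336, or 720
    learning_rate: float = 0.001
    dropout: float = 0.05
    epochs: int = 20
    batch_size: int = 32
    hidden_size: int = 64         # units in the hidden layer
    kernel_size: int = 3          # convolutional kernel size

# One configuration per prediction window used in the experiments.
CONFIGS = [TCNLSTMConfig(pred_len=h) for h in (96, 192, 336, 720)]
print(CONFIGS[0])
```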
Our proposed CT-Transformer model addresses the limitations of traditional forecasting methods in handling non-stationary time series and their significant computational resource consumption. We benchmarked it against two state-of-the-art models, NHiTS and SCINet. NHiTS (Neural Hierarchical Interpolation for Time Series) employs multi-rate data sampling, linear projection, and temporal interpolation of the input signal to tackle long-term forecasting challenges. SCINet (Sample Convolution and Interaction Network) constructs a binary tree-structured temporal convolutional network that improves the efficiency of temporal modeling and forecasting. Additionally, we selected four Transformer variants for comparison: Autoformer, ETSformer, FEDformer, and GBT. Autoformer employs series decomposition for long-term forecasting, splitting the input series into seasonal and trend components, and models cyclical dependencies using an autocorrelation mechanism. ETSformer integrates exponential smoothing with a level-growth-seasonal decomposition. FEDformer employs frequency-enhanced decomposition, utilizing Fourier and wavelet transformations. GBT (the Two-stage Transformer) breaks the forecasting process into two stages, tackling the disparity in statistical characteristics between input and predicted sequences. The hyperparameters of all models involved, including the proposed model, are aligned with those of the baseline models.
Table 3 presents the experimental results of the proposed model and the baseline models for four prediction window sizes: 96, 192, 336, and 720. We conducted three experiments for each size to ensure stable outcomes. As the table shows, the CT-Transformer outperforms the baseline models in overall prediction performance. At a prediction window of 96, the CT-Transformer achieved relative reductions in mean absolute error (MAE) of 45.9%, 26.9%, 51.3%, 26.4%, 49.6%, and 8.70% compared with Autoformer, ETSformer, FEDformer, NHiTS, SCINet, and GBT, respectively. It also reduced the mean squared error (MSE) by 65.2%, 37.2%, 70.2%, 38.3%, 67.6%, and 14.4%, respectively. At prediction windows of 192, 336, and 720, the CT-Transformer similarly achieved significant reductions in MAE and MSE, demonstrating superior accuracy and robustness over advanced time-series forecasting models. Figure 7 and Figure 8 show bar charts comparing the MAE and MSE values of the proposed fusion model with those of the baseline models across the four prediction lengths.
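A relative reduction of this kind is conventionally the drop in error expressed as a percentage of the baseline error. The sketch below assumes that convention; the function name and the numbers are hypothetical and are not taken from Table 3.

```python
def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return (baseline_error - model_error) / baseline_error * 100.0

# Hypothetical errors: a baseline MAE of 0.50 against a model MAE of 0.40
# corresponds to a reduction of roughly 20%.
print(round(relative_reduction(0.50, 0.40), 1))
```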
To illustrate the superiority of the proposed model more clearly, Figure 9 presents comparison plots between our model and the baseline models, showcasing the valid values across various prediction window sizes.
4.4. Ablation Study
In this subsection, we conduct ablation studies to assess the contribution of the CT-Transformer, Variational Mode Decomposition, CSPNet-Attention, and TCN-LSTM within the fusion model presented in this paper. We evaluate the complete model and five ablation variants: (1) Ours: the complete CT-Transformer model incorporating Variational Mode Decomposition fused with TCN-LSTM; (2) no-V: the complete CT-Transformer model excluding Variational Mode Decomposition, fused with TCN-LSTM; (3) no-C: the CT-Transformer model without CSPNet-Attention, coupled with TCN-LSTM in the fusion model; (4) no-T: the complete CT-Transformer model without TCN-LSTM integration; (5) no-VT: the complete CT-Transformer model without Variational Mode Decomposition and TCN-LSTM; and (6) no-VCT: the model excluding Variational Mode Decomposition, CSPNet-Attention, and TCN-LSTM.
Figure 10 illustrates the results of the ablation studies. Panels A, B, C, and D of Figure 10 show the ablation results for prediction window sizes of 96, 192, 336, and 720, respectively.
Table 4 below illustrates that the “Ours” method outperforms the alternative methods in predictions using the PeMS Traffic dataset. Upon removal of the Variational Mode Decomposition, CSPNet-Attention, and TCN-LSTM, individually or collectively, we observed increases in MAE values of 8.7%, 14.7%, 19.5%, and 34.2% and MSE values of 17.5%, 18.9%, 30.7%, and 134%, respectively.
The findings show that removing any of the three components (Variational Mode Decomposition, CSPNet-Attention, or TCN-LSTM) from the fusion model substantially reduces its effectiveness. This evidence underscores the importance of each component to the model’s performance in this research.
5. Conclusion and Future Directions
This study introduces a model architecture integrating Variational Mode Decomposition, the CT-Transformer, and TCN-LSTM for analyzing non-stationary traffic flow data with time-series characteristics. Variational Mode Decomposition is first applied to segment the data into multiple subsequences. Each subsequence’s autocorrelation and partial autocorrelation coefficients are then analyzed to assess its seasonal characteristics. We propose replacing the attention mechanism in the Encoder module of the two-stage Transformer’s first stage with CSPNet-Attention to reduce computational costs and memory usage while enhancing model prediction accuracy. We designate the refined model as the CT-Transformer.
Furthermore, given the TCN-LSTM model’s robust long-term dependency modeling capabilities, subsequences lacking seasonal characteristics after modal decomposition are fed into it, whereas those with seasonal characteristics are directed to the CT-Transformer. Finally, the output predictions from both models undergo variational mode reconstruction to produce the final predictions. Validation on the dataset demonstrates the fusion model’s substantial predictive capability and robustness.
Additionally, we recognize the importance of explainable AI (XAI) in achieving the interpretability of the proposed technique. Following recent trends in other vertical domains, incorporating XAI can provide transparent and interpretable insights into the model’s decision-making process. This approach not only enhances the trustworthiness of the model but also facilitates its adoption in practical applications. Future work will explore the integration of XAI methods to further improve the interpretability and explainability of our traffic flow prediction model.
Future traffic flow prediction work will increasingly incorporate a broader array of external factors, extending beyond weather and temperature to include holidays, the surge in new energy vehicles, and the implementation of new road traffic regulations. Recent studies reveal that changes in traffic flow are driven not only by conditions at the current location but also, significantly, by flows at adjacent locations. This interplay among adjacent locations suggests a spatial correlation or dependence within traffic flows. Consequently, future research in traffic flow prediction may evolve by examining inter-location traffic flows within a defined radius. We anticipate that this avenue of research will enhance the accuracy and promptness of traffic flow predictions, thereby providing more dependable support for transportation planning and management decisions. A comprehensive analysis of the impact of spatial correlation on traffic flow will help clarify the intricacies of urban transportation systems and support effective strategies for alleviating traffic congestion and optimizing transportation networks.