Social-STGMLP: A Social Spatio-Temporal Graph Multi-Layer Perceptron for Pedestrian Trajectory Prediction
Abstract
1. Introduction
- We demonstrate that pedestrian trajectory prediction can be modeled more simply and introduce Social-STGMLP, the first pedestrian trajectory prediction approach based on multi-layer perceptrons.
- We design an efficient architecture consisting solely of fully connected layers and layer normalization, which keeps both the parameter count and the inference time of Social-STGMLP low (an illustrative code sketch of such a block follows this list).
- Extensive experiments and analysis show that Social-STGMLP is more accurate than competing approaches, confirming its effectiveness.
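The contributions above describe the building block only at a high level: fully connected layers plus layer normalization applied to a spatio-temporal pedestrian graph. The PyTorch sketch below shows one plausible form such a block could take on a (pedestrians × time steps × features) tensor. The class name STGMLPBlock, the hidden width, the GELU nonlinearity, and the residual connections are our illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class STGMLPBlock(nn.Module):
    """Illustrative spatio-temporal MLP block (hypothetical, not the paper's code):
    fully connected layers plus layer normalization, mixing information first
    along the time axis and then along the pedestrian (spatial) axis."""

    def __init__(self, num_peds: int, obs_len: int, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.norm_t = nn.LayerNorm(feat_dim)
        self.mix_time = nn.Sequential(            # shares information across time steps
            nn.Linear(obs_len, hidden), nn.GELU(), nn.Linear(hidden, obs_len))
        self.norm_s = nn.LayerNorm(feat_dim)
        self.mix_space = nn.Sequential(           # shares information across pedestrians
            nn.Linear(num_peds, hidden), nn.GELU(), nn.Linear(hidden, num_peds))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_peds, obs_len, feat_dim), e.g. observed (x, y) offsets
        h = self.norm_t(x).transpose(2, 3)             # (B, N, F, T): time axis last
        x = x + self.mix_time(h).transpose(2, 3)       # temporal mixing + residual
        h = self.norm_s(x).permute(0, 2, 3, 1)         # (B, T, F, N): pedestrian axis last
        x = x + self.mix_space(h).permute(0, 3, 1, 2)  # spatial mixing + residual
        return x

# Toy usage: 5 pedestrians observed for 8 frames, 2-D (x, y) features.
block = STGMLPBlock(num_peds=5, obs_len=8, feat_dim=2)
print(block(torch.randn(1, 5, 8, 2)).shape)  # torch.Size([1, 5, 8, 2])
```

Stacking several such blocks and decoding the result with a final linear layer into future (x, y) offsets would yield an all-MLP trajectory predictor in the spirit described above.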
2. Related Works
3. Our Method: Social-STGMLP
3.1. Problem Definition
3.2. Main Architecture
3.3. Feature Extraction
3.4. Feature Fusion and Trajectory Prediction
4. Experiment
4.1. Datasets and Metrics
4.2. Experimental Settings
4.3. Brief Introduction to Comparison Methods
4.4. Quantitative Analysis
4.5. Ablation Study
4.6. Qualitative Analysis
4.7. Comparison of Experimental Processes, Model Parameters and Inference Time
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Large, F.; Vasquez, D.; Fraichard, T.; Laugier, C. Avoiding cars and pedestrians using velocity obstacles and motion prediction. In Proceedings of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 14–17 June 2004; pp. 375–379.
2. Luo, Y.; Cai, P.; Bera, A.; Hsu, D.; Lee, W.S.; Manocha, D. Porca: Modeling and planning for autonomous driving among many pedestrians. IEEE Robot. Autom. Lett. 2018, 3, 3418–3425.
3. Wu, P.; Chen, S.; Metaxas, D.N. Motionnet: Joint perception and motion prediction for autonomous driving based on bird’s eye view maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11385–11395.
4. Rudenko, A.; Palmieri, L.; Herman, M.; Kitani, K.M.; Gavrila, D.M.; Arras, K.O. Human motion trajectory prediction: A survey. Int. J. Robot. Res. 2020, 39, 895–935.
5. DeSouza, G.N.; Kak, A.C. Vision for mobile robot navigation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 237–267.
6. Xiao, G.; Juan, Z.; Gao, J. Travel mode detection based on neural networks and particle swarm optimization. Information 2015, 6, 522–535.
7. Alghodhaifi, H.; Lakshmanan, S. Holistic Spatio-Temporal Graph Attention for Trajectory Prediction in Vehicle–Pedestrian Interactions. Sensors 2023, 23, 7361.
8. Korbmacher, R.; Tordeux, A. Review of pedestrian trajectory prediction methods: Comparing deep learning and knowledge-based approaches. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24126–24144.
9. Lian, J.; Ren, W.; Li, L.; Zhou, Y.; Zhou, B. Ptp-stgcn: Pedestrian trajectory prediction based on a spatio-temporal graph convolutional neural network. Appl. Intell. 2023, 53, 2862–2878.
10. Sharma, N.; Dhiman, C.; Indu, S. Pedestrian intention prediction for autonomous vehicles: A comprehensive survey. Neurocomputing 2022, 508, 120–152.
11. Huang, Y.; Du, J.; Yang, Z.; Zhou, Z.; Zhang, L.; Chen, H. A survey on trajectory-prediction methods for autonomous driving. IEEE Trans. Intell. Veh. 2022, 7, 652–674.
12. Zhao, D.; Chen, Y.; Lv, L. Deep reinforcement learning with visual attention for vehicle classification. IEEE Trans. Cogn. Dev. Syst. 2016, 9, 356–367.
13. Jozefowicz, R.; Zaremba, W.; Sutskever, I. An empirical exploration of recurrent network architectures. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 2342–2350.
14. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2017; Volume 30.
16. Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971.
17. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
18. Shi, L.; Wang, L.; Long, C.; Zhou, S.; Zhou, M.; Niu, Z.; Hua, G. SGCN: Sparse graph convolution network for pedestrian trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8994–9003.
19. Yu, C.; Ma, X.; Ren, J.; Zhao, H.; Yi, S. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XII 16. pp. 507–523.
20. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
21. Pellegrini, S.; Ess, A.; Schindler, K.; Van Gool, L. You’ll never walk alone: Modeling social behavior for multi-target tracking. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 261–268.
22. Lerner, A.; Chrysanthou, Y.; Lischinski, D. Crowds by example. Comput. Graph. Forum 2007, 26, 655–664.
23. Robicquet, A.; Sadeghian, A.; Alahi, A.; Savarese, S. Learning social etiquette: Human trajectory understanding in crowded scenes. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part VIII 14. pp. 549–565.
24. Liu, Z.; Zhang, Z.; Lei, Z.; Omura, M.; Wang, R.L.; Gao, S. Dendritic Deep Learning for Medical Segmentation. IEEE/CAA J. Autom. Sin. 2024, 11, 803–805.
25. Zhang, P.; Ouyang, W.; Zhang, P.; Xue, J.; Zheng, N. Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12085–12094.
26. Mohamed, A.; Qian, K.; Elhoseiny, M.; Claudel, C. Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14424–14432.
27. Sang, H.; Chen, W.; Wang, J.; Zhao, Z. RDGCN: Reasonably dense graph convolution network for pedestrian trajectory prediction. Measurement 2023, 213, 112675.
28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
29. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100.
30. Liu, Y.; Yao, L.; Li, B.; Wang, X.; Sammut, C. Social graph transformer networks for pedestrian trajectory prediction in complex social scenarios. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 1339–1349.
31. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272.
32. Bouazizi, A.; Holzbock, A.; Kressel, U.; Dietmayer, K.; Belagiannis, V. Motionmixer: Mlp-based 3d human body pose forecasting. arXiv 2022, arXiv:2207.00499.
33. Guo, W.; Du, Y.; Shen, X.; Lepetit, V.; Alameda-Pineda, X.; Moreno-Noguer, F. Back to mlp: A simple baseline for human motion prediction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4809–4819.
34. Sun, J.; Jiang, Q.; Lu, C. Recursive social behavior graph for trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 660–669.
35. Raksincharoensak, P.; Hasegawa, T.; Nagai, M. Motion planning and control of autonomous driving intelligence system based on risk potential optimization framework. Int. J. Automot. Eng. 2016, 7, 53–60.
36. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2255–2264.
37. Sadeghian, A.; Kosaraju, V.; Sadeghian, A.; Hirose, N.; Rezatofighi, H.; Savarese, S. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1349–1358.
38. Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, H.; Savarese, S. Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2019; Volume 32.
39. Liang, J.; Jiang, L.; Niebles, J.C.; Hauptmann, A.G.; Fei-Fei, L. Peeking into the future: Predicting future person activities and locations in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5725–5734.
40. Zhou, L.; Zhao, Y.; Yang, D.; Liu, J. Gchgat: Pedestrian trajectory prediction using group constrained hierarchical graph attention networks. Appl. Intell. 2022, 52, 11434–11447.
41. Zhang, X.; Angeloudis, P.; Demiris, Y. Dual-branch spatio-temporal graph neural networks for pedestrian trajectory prediction. Pattern Recognit. 2023, 142, 109633.
42. Yang, X.; Fan, J.; Xing, S. IST-PTEPN: An improved pedestrian trajectory and endpoint prediction network based on spatio-temporal information. Int. J. Mach. Learn. Cybern. 2023, 14, 4193–4206.
43. Zhu, W.; Liu, Y.; Wang, P.; Zhang, M.; Wang, T.; Yi, Y. Tri-HGNN: Learning triple policies fused hierarchical graph neural networks for pedestrian trajectory prediction. Pattern Recognit. 2023, 143, 109772.
44. Lv, K.; Yuan, L. SKGACN: Social knowledge-guided graph attention convolutional network for human trajectory prediction. IEEE Trans. Instrum. Meas. 2023, 72, 2517111.
45. Huang, Y.; Bi, H.; Li, Z.; Mao, T.; Wang, Z. Stgat: Modeling spatial-temporal interactions for human trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6272–6281.
46. Amirian, J.; Hayet, J.B.; Pettré, J. Social ways: Learning multi-modal distributions of pedestrian trajectories with gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
47. Monti, A.; Bertugli, A.; Calderara, S.; Cucchiara, R. Dag-net: Double attentive graph neural network for trajectory forecasting. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 2551–2558.
48. Mohamed, A.; Zhu, D.; Vu, W.; Elhoseiny, M.; Claudel, C. Social-implicit: Rethinking trajectory prediction evaluation and the effectiveness of implicit maximum likelihood estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 463–479.
ADE/FDE results (in meters; lower is better) on the ETH and UCY benchmarks; AVG is the average over the five scenes.

Model | Year | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
---|---|---|---|---|---|---|---|
Social-LSTM [16] | 2016 | 1.09/2.35 | 0.79/1.76 | 0.67/1.40 | 0.47/1.00 | 0.56/1.17 | 0.72/1.54 |
Social-GAN [36] | 2018 | 0.81/1.52 | 0.72/1.61 | 0.60/1.26 | 0.34/0.69 | 0.42/0.84 | 0.58/1.18 |
Sophie [37] | 2019 | 0.70/1.43 | 0.76/1.67 | 0.54/1.24 | 0.30/0.63 | 0.38/0.78 | 0.54/1.15 |
Social-BiGAT [38] | 2019 | 0.69/1.29 | 0.49/1.01 | 0.55/1.32 | 0.30/0.62 | 0.36/0.75 | 0.48/1.00 |
PIF [39] | 2019 | 0.73/1.65 | 0.30/0.59 | 0.60/1.27 | 0.38/0.81 | 0.31/0.68 | 0.46/1.00 |
SR-LSTM [25] | 2019 | 0.64/1.28 | 0.39/0.78 | 0.52/1.13 | 0.42/0.92 | 0.34/0.74 | 0.46/0.97 |
RSBG [34] | 2020 | 0.80/1.53 | 0.33/0.64 | 0.59/1.25 | 0.40/0.86 | 0.30/0.65 | 0.48/0.99 |
Social-STGCNN [26] | 2020 | 0.64/1.11 | 0.49/0.85 | 0.44/0.79 | 0.34/0.53 | 0.30/0.48 | 0.44/0.75 |
SGCN [18] | 2021 | 0.63/1.03 | 0.32/0.55 | 0.37/0.70 | 0.29/0.53 | 0.25/0.45 | 0.37/0.65 |
GCHGAT [40] | 2022 | 0.63/1.10 | 0.38/0.73 | 0.55/1.16 | 0.33/0.66 | 0.30/0.64 | 0.44/0.86 |
PTP-STGCN [9] | 2022 | 0.63/1.04 | 0.34/0.45 | 0.48/0.87 | 0.37/0.61 | 0.30/0.46 | 0.42/0.68 |
Social TAG [41] | 2023 | 0.61/1.00 | 0.37/0.56 | 0.51/0.87 | 0.33/0.50 | 0.30/0.49 | 0.42/0.68 |
IST-PTEPN [42] | 2023 | 0.46/0.70 | 0.44/0.47 | 0.54/0.92 | 0.35/0.62 | 0.31/0.59 | 0.42/0.66 |
Tri-HGNN [43] | 2023 | 0.62/0.86 | 0.38/0.65 | 0.49/0.88 | 0.27/0.44 | 0.25/0.40 | 0.40/0.65 |
SKGACN [44] | 2023 | 0.55/0.83 | 0.30/0.50 | 0.39/0.75 | 0.30/0.51 | 0.26/0.45 | 0.36/0.61 |
RDGCN [27] | 2023 | 0.58/0.94 | 0.30/0.45 | 0.35/0.65 | 0.28/0.48 | 0.25/0.44 | 0.35/0.59 |
Social-STGMLP | / | 0.60/0.94 | 0.29/0.38 | 0.36/0.59 | 0.27/0.44 | 0.24/0.38 | 0.35/0.54 |
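Each cell in the table above (and in the result tables that follow) reports two numbers, ADE/FDE: the average displacement error, i.e., the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and the final displacement error, i.e., the same distance at the last predicted step. The NumPy sketch below shows the single-trajectory form of the two metrics, assuming trajectories are arrays of (x, y) positions; the function and array names are illustrative. On these benchmarks the reported values typically follow a best-of-N protocol, taking the minimum error over a set of sampled trajectories (commonly 20).

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """pred, gt: (num_peds, pred_len, 2) arrays of (x, y) positions.
    Returns (ADE, FDE): mean L2 error over all predicted steps, and
    mean L2 error at the final predicted step."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # (num_peds, pred_len) per-step errors
    ade = dists.mean()                          # average over pedestrians and time steps
    fde = dists[:, -1].mean()                   # average over pedestrians, last step only
    return float(ade), float(fde)

# Toy check: a constant 0.1 m offset in x gives ADE = FDE = 0.1.
gt = np.zeros((3, 12, 2))
pred = gt + np.array([0.1, 0.0])
print(ade_fde(pred, gt))  # (0.1, 0.1)
```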
Comparison with additional baselines; values are ADE and FDE (lower is better).

Metric | STGAT [45] | Social Ways [46] | DAG-Net [47] | Social-Implicit [48] | Social-STGMLP |
---|---|---|---|---|---|
ADE | 0.58 | 0.62 | 0.53 | 0.47 | 0.47 |
FDE | 1.11 | 1.16 | 1.04 | 0.89 | 0.75 |
Ablation of the spatial (Spa) and temporal (Tem) components; values are ADE/FDE (lower is better).

Model | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
---|---|---|---|---|---|---|
Social-STGMLP w/o Spa | 0.66/1.06 | 0.31/0.45 | 0.35/0.62 | 0.28/0.48 | 0.23/0.41 | 0.37/0.60 |
Social-STGMLP w/o Tem | 0.63/0.91 | 0.35/0.50 | 0.39/0.64 | 0.30/0.44 | 0.23/0.40 | 0.38/0.58 |
Social-STGMLP (Ours) | 0.60/0.94 | 0.29/0.38 | 0.36/0.59 | 0.27/0.44 | 0.24/0.38 | 0.35/0.54 |
Ablation over Social-STGMLP-N variants; values are ADE/FDE, and N = 16 is the configuration used in the main results.

Model | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
---|---|---|---|---|---|---|
Social-STGMLP-2 | 0.65/1.08 | 0.30/0.42 | 0.35/0.60 | 0.29/0.44 | 0.25/0.42 | 0.37/0.59 |
Social-STGMLP-4 | 0.64/0.99 | 0.29/0.41 | 0.36/0.58 | 0.28/0.44 | 0.23/0.41 | 0.36/0.57 |
Social-STGMLP-8 | 0.63/0.95 | 0.30/0.39 | 0.37/0.64 | 0.28/0.45 | 0.23/0.41 | 0.36/0.57 |
Social-STGMLP-16 (Ours) | 0.60/0.94 | 0.29/0.38 | 0.36/0.59 | 0.27/0.44 | 0.24/0.38 | 0.35/0.54 |
Social-STGMLP-24 | 0.66/1.01 | 0.31/0.44 | 0.67/0.82 | 0.28/0.46 | 0.23/0.39 | 0.43/0.62 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Meng, D.; Zhao, G.; Yan, F. Social-STGMLP: A Social Spatio-Temporal Graph Multi-Layer Perceptron for Pedestrian Trajectory Prediction. Information 2024, 15, 341. https://doi.org/10.3390/info15060341