Short-Term Demand Prediction for On-Demand Food Delivery with Attention-Based Convolutional LSTM

Yu, Xinlian; Lan, Ailun; Mao, Haijun

doi:10.3390/systems11100485

by Xinlian Yu *, Ailun Lan and Haijun Mao

School of Transportation, Southeast University, Nanjing 211189, China

* Author to whom correspondence should be addressed.

Systems 2023, 11(10), 485; https://doi.org/10.3390/systems11100485

Submission received: 10 August 2023 / Revised: 18 September 2023 / Accepted: 19 September 2023 / Published: 22 September 2023

Abstract

Demand prediction for on-demand food delivery (ODFD) is of great importance to the operation and transportation resource utilization of ODFD platforms. This paper addresses short-term ODFD demand prediction using an end-to-end deep learning architecture. The problem is formulated as a spatial–temporal prediction. The proposed model is composed of convolutional long short-term memory (ConvLSTM) and convolutional neural network (CNN) units with an encoder–decoder structure. Specifically, long short-term memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. The convolution unit is responsible for capturing spatial attributes, while the LSTM part is adopted to learn temporal attributes. Additionally, an attentional model is designed and integrated to enhance the prediction performance by addressing the spatial variation in demand. The proposed approach is compared to several baseline models using a historical ODFD dataset from Shenzhen, China. Results indicate that the proposed model obtains the highest prediction accuracy by capturing both spatial and temporal correlations with attention information focusing on different parts of the input series.

Keywords:

on-demand food delivery; demand prediction; deep learning; convolutional LSTM; attention mechanism

1. Introduction

Over the past decade, the rapid expansion of the internet has brought unprecedented convenience to people’s daily lives. One area that has experienced remarkable growth is on-demand food delivery (ODFD). For instance, in 2020, China’s online ODFD market size reached 664.62 billion Renminbi, with a year-over-year rise of 15% (EqualOcean, 2021). However, efficiently satisfying such a large ODFD demand remains a major challenge for the current service platforms. To address this challenge, numerous operating strategies have been developed, including deliverer dispatching, re-allocation, and surge pricing, all aimed at managing the high demand and improving system efficiencies for ODFD platforms [1]. These strategies can help to reduce the mismatch between demand and supply, as well as establish efficient delivery routes and resource allocation, which enables the platform to provide a better customer experience by ensuring timely deliveries and reducing waiting times. The effectiveness of these strategies, however, is heavily dependent on short-term predictions of ODFD demand [2]. Therefore, the ability to predict demand accurately becomes crucial for successful ODFD operations.

On a daily basis, the ODFD platform needs to select couriers to serve dynamic customer orders to reduce the logistics cost and the customer inconvenience cost. After an order is placed, the merchant is notified to prepare the food, and the platform estimates when the food will be ready for pickup, so that the system can make better planning decisions such as courier assignment for serving orders. Meanwhile, the estimated delivery time is also presented to the customer and can be considered a service promise that the platform needs to fulfill. To support these decisions, this study divides the ODFD demand into two classes: the demand sent out from a region and the demand received within a region. Accurate ODFD demand prediction for the near future (i.e., one hour) across the city would enable the platform to provide a better customer experience by ensuring timely deliveries and avoiding local courier shortages.

However, demand prediction for ODFD is very difficult mainly due to the following complicated challenges. First, the ODFD demand may have different spatial–temporal patterns. The spatial distribution of the ODFD demand could be affected by multiple factors, such as the regional economic agglomeration and population density, spatial distributions of restaurants [3], sociodemographic attributes [4], and personal factors of consumers [5]. Different customers may have different meal preferences. Second, the ODFD usage within a given region also varies with time. For instance, the ODFD demand may rise sharply during meal times on a daily basis. Moreover, consecutive weekdays often exhibit recurring demand patterns that unfold every 24 h, while weekends may follow a dissimilar pattern. Furthermore, other factors, such as weather and morning traffic peak, also affect the demand as couriers may not be able to deliver the meal package on time. Finally, the sent demand and received demand within a region may affect each other in the short term due to weather/traffic conditions, as well as geographical information about the origin destination pair and the travel route.

The prediction of ODFD demand belongs to the family of spatial–temporal predictions. Previous studies are mainly based on statistical models and machine learning, including the time series ARIMA approach, regressions, Bayesian network (BN) models [6], and so on. Although these approaches have alleviated the prediction difficulties, most of them do not consider spatial–temporal correlations in the demand. With traditional model structures and estimation algorithms, it can be difficult to incorporate such spatial information into predictions. In recent years, deep-learning-based approaches have been widely used for demand predictions, including bike usage prediction, ride-hailing demand–supply prediction [7], and so on. Specifically, convolutional neural networks are capable of capturing spatial–temporal correlations in transportation prediction problems. Recurrent neural networks and their extensions such as long short-term memory are well fit for processing time series data streams.

To tackle these challenges, this paper proposes an attention-based convolutional long short-term memory (At-ConvLSTM) method to perform short-term forecasting of ODFD demand at the city scale. The main contributions are three-fold. First, the spatial–temporal correlations between different regions for sent demand and received demand are captured by a combination of convolutional units and LSTM layers. Specifically, convolutional neural network (CNN) layers are utilized to enhance the extraction of spatial features, while LSTM layers are adopted to capture the short- and long-term sequential pattern information. Second, an attention model is designed and incorporated to further improve prediction accuracy. Specifically, it addresses spatial variation in demand by assigning weights to demand in different regions for each forecast step. Third, the proposed At-ConvLSTM is illustrated using a historical ODFD dataset from Shenzhen, China. Results show that it outperforms several baseline approaches, and discussions are also provided.

The remainder of the paper is organized as follows. In Section 2, related works are reviewed. Section 3 first describes the problem formally and then introduces the At-ConvLSTM model. In Section 4, we analyze our model’s performance over real datasets and compare it with several baseline methods. In the same section, we also provide some exploratory data analysis with our dataset. Lastly, we conclude the paper in Section 5.

2. Literature Review

2.1. Studies on ODFD Service

The rise of ODFD has received attention from transportation researchers. Existing research on ODFD has primarily focused on operational strategies such as delivery order assignments, courier route planning [8], traffic safety for meal delivery couriers [9], demand–supply balance [10], and heuristic order batching and assignment algorithms [11]. However, the effectiveness of such strategies is highly dependent on accurate short-term ODFD demand prediction.

Several studies focus on investigating the spatial–temporal patterns of ODFD usage. As [12] found in their research on food accessibility and built environment, the utilization of ODFD is primarily observed in densely populated urban regions, particularly in city centers and sub-centers. Moreover, a greater number of ODFD orders are observed in areas where walking for food access is less convenient but cycling for food access is more convenient. Later on, ref. [3] found that food delivery demand could also change the built environment in the long term. Ref. [13] mapped the distribution of takeaway food demand across China based on the analysis of more than 35 million takeaway food orders. Their results also indicate that ODFD demand is higher in densely populated or economically developed cities, and that demand varies greatly and regularly across different time intervals. Ref. [14] used an enhanced two-step floating catchment area (E2SFCA) approach to quantify the accessibility of ODFD in a city. The results imply that ODFD demand is more concentrated in the core area of the city, and the farther away from the city center, the less service ODFD can provide. These studies indicate that the ODFD usage is not randomly distributed across the city; instead, it exhibits latent spatial–temporal patterns.

Other studies try to identify potential factors that affect the usage of ODFD, such as the sociodemographic attributes of the locally aggregated population, service pricing strategy [15], household attributes and individual characteristics [16,17,18], mostly applying regression models. Moreover, considering spatial variation (e.g., land use in a city), ref. [19] found that the neighborhood-built environment could affect individuals’ efforts and willingness to leave their home and participate in outdoor activities, thus impacting their ODFD usage. These studies mainly focus on exploring the explanatory values in the long term instead of using short-term demand prediction across the city.

There are also a few studies concentrating on predicting the time consumed during different stages of the ODFD process. For instance, ref. [6] utilized a deep neural network (DNN), which further incorporates representations of couriers, restaurants, and delivery destinations, to predict the amount of time elapsed between a customer placing an order and receiving the meal. Ref. [20] applied probabilistic forecasting for food preparation time (FPT) for the first time and proposed a non-parametric method based on deep learning. Whether predicting the time consumed or making other logistical decisions, researchers have mostly pointed out that the uncertainty of ODFD demand is a major obstacle to better planning decisions.

During the COVID-19 pandemic, more and more people started ordering food using ODFD apps, such as Ele.me, Uber Eats, and DoorDash, for safety reasons. The number of studies on ODFD services has also increased significantly. Most of these studies focused on investigating the factors that affected the intention of using ODFD apps during the COVID-19 outbreak period in different countries, such as India [21], the USA [22], China [23], Brazil [24], and Mexico [25]. These studies adopt different methods such as UTAUT2 and UGT; apply quantitative, qualitative, and mixed research designs; and offer interesting insights. They found that delivery, subjective norms, attitudes, behavioral control, and social isolation [26] positively affect the consumers’ intention to use mobile food delivery apps. Another group of studies attempted to understand food delivery drivers’ conditions during COVID-19 in China and India [27,28]. For instance, ref. [28] analyzed the challenges faced by last-mile food delivery riders in India during the COVID-19 pandemic and grouped them into operational, customer-related, organizational, and technological categories.

2.2. Prediction for ODFD Demand

Currently, there are few studies focusing on forecasting ODFD demand. Ref. [2] applied the classical time series methods (moving average, exponential smoothing, auto-regressive moving averages (ARMA), and seasonal decomposition), as well as machine learning (ML) models (random forest and support vector regressor) to predict short-term ODFD demand on a grid in French cities. The results reveal that ML models could yield more accurate prediction results than classical methods with limited demand history. Ref. [29] first used a susceptible–infected–recovered (SIR) model to forecast future COVID-19 infected cases in a given region and then constructed an ARMA model to predict food-ordering demand. While the world no longer views the COVID-19 outbreak as an exceptional event, this prediction–action combined approach demonstrates the value of ODFD demand forecasting in specific applications. However, these studies do not take the complex spatial–temporal correlation between adjacent regions into account.

Note that any service that performs ad hoc requested point-to-point transportation, like ride-hailing, at scale in an urban area benefits from a robust demand forecasting system. The challenges identified in the introduction are not limited to ODFD. Deep-learning models have been adopted to capture spatial and temporal correlations for many systems [30,31,32]. Existing short-term predictions in the transportation field, such as crowd flow prediction [33], traffic flow prediction [34], and ride-hailing demand prediction [7], have achieved higher prediction accuracies than traditional and ML methods using such deep-learning-based approaches. Recently, ref. [35] applied a CNN-LSTM regressor to predict single-step hourly food delivery demand distribution over multiple urban areas simultaneously. The results show better performance than traditional statistical approaches (moving averages and univariate time series forecasting), indicating the solid implementation potential of deep learning methods for ODFD demand forecasting. However, the inter-location correlations and spatial variation in ODFD usage were not fully taken into account in their model.

3. Methodology

3.1. Problem Description

Unlike traditional urban logistics based on known demand, a customer’s request may arrive at any time and any place, while the status and location of riders also change with time. In some cases, no delivery person may be available in the vicinity of a request, creating high waiting times and, consequently, order cancellations. Minimizing delays and improving user satisfaction for ODFD service requires effective assignments between orders and riders. Moreover, the amount of time elapsed between the order being picked up and the receipt of the food could vary due to numerous random elements. Therefore, even if the send-out demand is known, the platform does not know exactly when the order could be delivered.

In this study, we predict the send-out demand and the received demand separately, which is helpful for ODFD platforms to respond to immediate requests from customers and hedge the uncertainty in demand prediction. For instance, with future send-out demand, it is possible for the platform to bundle multiple orders to a single rider nearby or guide idle riders waiting near locations where new requests are more likely to occur. Meanwhile, with predicted received demand, the platform could make better assignment decisions based on the status of orders and riders. New emerging requests can be assigned to those riders that could finish delivery within a short time.

A city is partitioned into a set of grids $G^{I \times J}$. A day is divided into multiple time intervals (e.g., one hour per interval), each of them indexed by $t \in T$. Each ODFD order is represented by a tuple $(p_{send}, p_{receive}, t_{send}, t_{receive})$, where $t_{send}$/$t_{receive}$ denotes the delivery start/end time, and $p_{send}$/$p_{receive}$ represents the merchant/customer location. Two types of ODFD demand are defined and predicted, i.e., the send-out demand from a region and the received demand within a region. At each time step $t$, ODFD demand across all regions is denoted as a 3D tensor $x_{t} \in R^{I \times J \times 2}$, where ${(x_{t})}_{i, j, 0}$ and ${(x_{t})}_{i, j, 1}$ denote the send-out demand and received demand in grid $g_{i j}$ at time interval $t$, respectively. The problem is considered a spatial–temporal prediction task. That is, given a series of historical demand $\{x_{t} | t = 1, \dots, N\}$, this study aims to collectively predict $\{{\hat{x}}_{t} | t = N + 1, \dots, N + B\}$ at each time interval $t$.
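The construction of the demand tensors can be sketched as follows. This is a minimal illustration, assuming that each order has already been mapped to grid indices and integer time-interval indices; the function name `build_demand_tensors` and the tuple layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def build_demand_tensors(orders, n_rows, n_cols, n_steps):
    """Aggregate orders into demand tensors of shape (n_steps, n_rows, n_cols, 2).

    Each order is a hypothetical tuple
    (send_cell, receive_cell, t_send, t_receive), where a cell is an (i, j)
    grid index and t_send / t_receive are integer time-interval indices.
    Channel 0 counts send-out demand, channel 1 counts received demand.
    """
    x = np.zeros((n_steps, n_rows, n_cols, 2), dtype=np.int64)
    for (si, sj), (ri, rj), t_send, t_receive in orders:
        if 0 <= t_send < n_steps:
            x[t_send, si, sj, 0] += 1      # order leaves grid (si, sj) at t_send
        if 0 <= t_receive < n_steps:
            x[t_receive, ri, rj, 1] += 1   # order arrives in grid (ri, rj) at t_receive
    return x

# Three toy orders on a 4 x 4 grid over two time intervals.
orders = [
    ((0, 1), (2, 2), 0, 0),
    ((0, 1), (1, 1), 0, 1),
    ((3, 3), (0, 1), 1, 1),
]
x = build_demand_tensors(orders, n_rows=4, n_cols=4, n_steps=2)
```

Here `x[t]` plays the role of $x_{t}$; an order picked up in one interval and delivered in the next contributes to two different time slices.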

ConvLSTM is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions. Specifically, the convolution unit is responsible for capturing spatial attributes, while the LSTM part is adopted to learn temporal attributes. Designed for research tasks such as image recognition, convolutional neural networks are capable of capturing spatial–temporal correlations in transportation prediction problems. At each time step, the model receives a vector of |G| values; vectors across consecutive time steps associate the same grid to the same position along the array, heading towards the same neuron of the input layer. ODFD demands of different grids are therefore analyzed simultaneously but acquired through separate entries, hence combining two processing perspectives: the sequential evolution of urban demand over time and its geographic distribution across multiple grids of the city.

3.2. Attention-Based ConvLSTM

Figure 1 illustrates the architecture of the proposed At-ConvLSTM model for predicting short-term ODFD demand. In the encoder block, the historical demand is encoded into a sequence of tensors with specified dimensional features. Then, the attention model incorporates the attention mechanism to quantify spatial–temporal regularity based on historical demand. Finally, the decoder performs predictions based on spatial–temporal characteristics and attention information.

3.2.1. Encoder Structure

The encoder utilizes convolution units and ConvLSTM cells to extract intricate spatial–temporal connections from $\{x_{t} | t = 1, \dots, N\}$. Specifically, each $x_{t}$ undergoes a sequence of convolutions through convolution layers to capture the spatial interdependence:

I_{t} = COV_{L} (x_{t}), t = 1, \dots, N

(1)

where $L$ and $COV$ represent the number of convolutional layers and the convolution operation, respectively. To avoid down-sampling, the convolution layers do not apply pooling. Consequently, $I_{t}$ remains a 3D tensor, where the first two dimensions correspond to spatial coordinates and the third dimension holds the extracted features.

Each ConvLSTM layer comprises a ConvLSTM cell, which is characterized by the hidden state and the cell state. Specifically, the hidden state is used to extract the input information at the most recent time step, and the cell state is used to save the long-term information [36]. The cell state and hidden state of the cell $n$ ($n \in [1, \dots, n]$) at encoding step $t$ are denoted as $C_{t}^{e (n)}$ and $H_{t}^{e (n)}$, which retain temporally distant and recent features, respectively. $C_{0}^{e}$ and $H_{0}^{e}$ are initialized with zero. When encoding, each cell possesses two internal inputs, i.e., $C_{t - 1}^{e (n)}$ and $H_{t - 1}^{e (n)}$, and an external input. The update of $C_{t}^{e (n)}$ and $H_{t}^{e (n)}$ is controlled by three types of gates, i.e., the input gate $i_{t}^{e (n)}$, forgetting gate $f_{t}^{e (n)}$, and output gate $o_{t}^{e (n)}$. Specifically, $i_{t}^{e (n)}$ controls how much information from the external input can be incorporated in $C_{t}^{e (n)}$; $f_{t}^{e (n)}$ is responsible for erasing useless information from $C_{t - 1}^{e (n)}$; $o_{t}^{e (n)}$ determines how much information from $C_{t}^{e (n)}$ can be passed to $H_{t}^{e (n)}$. For the lowest ConvLSTM layer, $I_{t}$ is taken as the external input, as reported in Equations (2)–(6).

i_{t}^{e (1)} = σ (W_{x i}^{e (1)} * I_{t} + W_{h i}^{e (1)} * H_{t - 1}^{e (1)} + b_{i}^{e (1)})

(2)

f_{t}^{e (1)} = σ (W_{x f}^{e (1)} * I_{t} + W_{h f}^{e (1)} * H_{t - 1}^{e (1)} + b_{f}^{e (1)})

(3)

o_{t}^{e (1)} = σ (W_{x o}^{e (1)} * I_{t} + W_{h o}^{e (1)} * H_{t - 1}^{e (1)} + b_{o}^{e (1)})

(4)

C_{t}^{e (1)} = f_{t}^{e (1)} \circ C_{t - 1}^{e (1)} + i_{t}^{e (1)} \circ \tanh (W_{x c}^{e (1)} * I_{t} + W_{h c}^{e (1)} * H_{t - 1}^{e (1)} + b_{c}^{e (1)})

(5)

H_{t}^{e (1)} = o_{t}^{e (1)} \circ \tanh (C_{t}^{e (1)})

(6)

where ∗ denotes the convolution operator, ∘ denotes the Hadamard product, and σ (·) denotes the sigmoid function. $W$ represents convolution kernel weights and $b$ denotes biases of each neural network (e.g., $b_{f}^{e (n)}$ is a bias of the $n^{th}$ cell’s forgetting gate).
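The gate updates in Equations (2)–(6) can be illustrated with a small NumPy sketch of a single ConvLSTM step. It is not the authors' implementation: the helper names, the 3 × 3 kernel size, the channel counts, and the use of one bias per output channel are all assumptions made for the example, and the convolution is written as a zero-padded cross-correlation, as is common in deep learning libraries.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 'same' 2D convolution (cross-correlation form).

    x: (H, W, C_in); w: (k, k, C_in, C_out) with odd k. No pooling or
    striding is applied, so the spatial size of x is preserved.
    """
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="constant")
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for di in range(k):
        for dj in range(k):
            # Contract the input-channel axis for this kernel offset.
            out += xp[di:di + H, dj:dj + W, :] @ w[di, dj]
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def convlstm_step(I_t, H_prev, C_prev, params):
    """One ConvLSTM update following Equations (2)-(6)."""
    def pre_activation(name):
        W_x, W_h, b = params[name]
        return conv2d_same(I_t, W_x) + conv2d_same(H_prev, W_h) + b

    i = sigmoid(pre_activation("i"))                   # input gate, Eq. (2)
    f = sigmoid(pre_activation("f"))                   # forgetting gate, Eq. (3)
    o = sigmoid(pre_activation("o"))                   # output gate, Eq. (4)
    C = f * C_prev + i * np.tanh(pre_activation("c"))  # cell state, Eq. (5)
    H = o * np.tanh(C)                                 # hidden state, Eq. (6)
    return H, C

rng = np.random.default_rng(0)
c_in, c_hid, k = 8, 4, 3          # illustrative sizes, not from the paper
params = {
    name: (0.1 * rng.standard_normal((k, k, c_in, c_hid)),
           0.1 * rng.standard_normal((k, k, c_hid, c_hid)),
           np.zeros(c_hid))
    for name in "ifoc"
}
I_t = rng.standard_normal((16, 16, c_in))
H0 = np.zeros((16, 16, c_hid))    # zero-initialized hidden state
C0 = np.zeros((16, 16, c_hid))    # zero-initialized cell state
H1, C1 = convlstm_step(I_t, H0, C0, params)
```

Because every operation keeps the 16 × 16 spatial layout, the hidden and cell states remain 3D tensors that can be passed to a higher layer or to the next time step.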

For higher ConvLSTM layers, $H_{t}^{e (n - 1)}$ is taken as the external input for the $n^{th}$ layer. By recursively and sequentially applying the ConvLSTM layers to ${\{I_{t}\}}_{t = 1}^{N}$, the most recent cell and hidden states, $C_{t = N}^{e (n)}$ and $H_{t = N}^{e (n)}$, are obtained and then transmitted to the decoder.

3.2.2. Attention Model

The attention block is adopted to address the spatial patterns by assigning weights to different patterns based on the extracted spatial information, as shown in Figure 2. The ODFD demand distributions present certain spatio-temporal regularities, which may be caused by latent citywide patterns. For example, the demands around business centers during weekday peak hours may be high, while those at midnight are quite low. In this study, we cluster the historical demand tensors to capture such demand patterns using K-means++, which initializes the cluster centers before proceeding with the standard k-means optimization iterations [37]. With the K-means++ initialization, the algorithm finds a solution that is, in expectation, O(log k)-competitive with the optimal k-means solution. The resultant K representative demand tensors (i.e., clusters) are then incorporated into the attention model.
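The seeding step of K-means++ can be sketched as below. This is a generic illustration of the standard seeding rule rather than the authors' code; the synthetic Poisson data, the choice of four clusters, and the function name `kmeans_pp_init` are assumptions. The standard k-means iterations would then start from these seeds.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: choose k initial centers from the rows of X.

    The first center is drawn uniformly at random; each later center is
    drawn with probability proportional to the squared distance to the
    nearest center chosen so far.
    """
    n = X.shape[0]
    first = rng.integers(n)
    centers = [X[first]]
    d2 = np.sum((X - X[first]) ** 2, axis=1)
    for _ in range(1, k):
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.stack(centers)

rng = np.random.default_rng(0)
# 48 synthetic hourly demand tensors on a 16 x 16 grid with 2 channels.
demand = rng.poisson(3.0, size=(48, 16, 16, 2)).astype(float)
X = demand.reshape(len(demand), -1)     # flatten each tensor to a vector
centers = kmeans_pp_init(X, k=4, rng=rng)
A = centers.reshape(4, 16, 16, 2)       # seeds in the same layout as x_t
```

Flattening each demand tensor before clustering and reshaping the centers afterwards keeps the representative tensors in the same layout as the model input.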

The representative demand tensor for cluster $k$, $A_{k}$, shares the same data structure with $x_{t}$. ${\{A_{k}\}}_{k = 1}^{K}$ is then fed to the convolution layer, the structure of which is identical to that used in the encoder. Spatial features from convolved ${\{A_{k}\}}_{k = 1}^{K}$ result in a set of attention tensors, denoted as ${\{a_{k}\}}_{k = 1}^{K}$, representing the attention information on spatial characteristics. Note that $a_{k}$ possesses the form of a 3D tensor, as does $H_{t}^{d (n)}$. The future demand trend is derived with the extracted attention information and the most recent cell and hidden states from the encoder. When predicting demand for time interval $t$, the subsequent step entails acquiring a weight vector, denoted as $⟨α_{t 1}, \dots, α_{t K}⟩$. Specifically, $α_{t k}$ denotes the similarity between $a_{k}$ and ${\hat{x}}_{t}$ and is computed using a multi-layer perceptron (MLP). Following [38], the demand trend $z_{t}$ is calculated through Equations (7)–(10).

h_{t k}^{a} = F (W_{h} {\bar{H}}_{t - 1}^{d (n)} + W_{a} {\bar{a}}_{k} + b_{h}), \forall k \in  [1, \dots, K]

(7)

s_{t k}^{a} = f (W_{s} h_{t k}^{a}), \forall k \in  [1, \dots, K]

(8)

α_{t k} = \frac{\exp (s_{t k}^{a})}{\sum_{k = 1}^{K} \exp (s_{t k}^{a})}, \forall k \in  [1, \dots, K]

(9)

z_{t} = \sum_{k = 1}^{K} α_{t k} a_{k}, \forall t \in  [N + 1, \dots, N + B]

(10)

where $F (\cdot)$ is the neuronal activation function, and ${\bar{H}}_{t - 1}^{d (n)}$ and ${\{{\bar{a}}_{k}\}}_{k = 1}^{K}$ are the flattened vectors of $H_{t - 1}^{d (n)}$ and ${\{a_{k}\}}_{k = 1}^{K}$. In this way, the outputs of the MLP are ensured to be one-dimensional variables to enable the subsequent SoftMax calculation, which results in the weight vector $⟨α_{t 1}, \dots, α_{t K}⟩$. $W_{h}$ and $W_{a}$ are the weights of the MLP neurons that process ${\bar{H}}_{t - 1}^{d (n)}$ and ${\bar{a}}_{k}$ as input, respectively. $b_{h}$ is the bias for the MLP’s neurons. $s_{t k}^{a}$ is the output of the attention model’s MLP for $α_{t k}$, where $h_{t k}^{a}$ is the hidden state of the MLP for $α_{t k}$ and $W_{s}$ is the weight set for output. Note that $z_{t}$ is a 3D tensor.
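Equations (7)–(10) can be traced with a short NumPy sketch. Several details are assumptions made only for illustration: tanh stands in for the unspecified activation $F$, the output function $f$ in Equation (8) is taken as the identity, and the tensor sizes and the hidden width of the MLP are arbitrary.

```python
import numpy as np

def attention_trend(H_prev, a, W_h, W_a, b_h, W_s):
    """Compute the weights alpha_{tk} and the demand trend z_t.

    H_prev: (I, J, C) previous decoder hidden state.
    a:      (K, I, J, C) attention tensors a_k.
    W_h, W_a: (M, D) with D = I * J * C; b_h, W_s: (M,).
    """
    K = a.shape[0]
    H_bar = H_prev.reshape(-1)                        # flattened hidden state
    a_bar = a.reshape(K, -1)                          # flattened a_k, one row each
    h = np.tanh(W_h @ H_bar + a_bar @ W_a.T + b_h)    # Eq. (7), shape (K, M)
    s = h @ W_s                                       # Eq. (8), one score per k
    e = np.exp(s - s.max())                           # shift for numerical stability
    alpha = e / e.sum()                               # Eq. (9), softmax over k
    z = np.tensordot(alpha, a, axes=1)                # Eq. (10), weighted sum of a_k
    return alpha, z

rng = np.random.default_rng(0)
n_i, n_j, n_c, K, M = 4, 4, 3, 5, 6                   # illustrative sizes
D = n_i * n_j * n_c
H_prev = rng.standard_normal((n_i, n_j, n_c))
a = rng.standard_normal((K, n_i, n_j, n_c))
W_h = rng.standard_normal((M, D))
W_a = rng.standard_normal((M, D))
b_h = np.zeros(M)
W_s = rng.standard_normal(M)
alpha, z = attention_trend(H_prev, a, W_h, W_a, b_h, W_s)
```

The weights sum to one across the K clusters, so the trend tensor is a convex combination of the attention tensors and keeps their 3D shape.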

3.2.3. Decoder Structure

The decoder block addresses the translation of the final implicit vector representation from the encoder and attention blocks into the explicit ODFD demand distribution across the city. Similar to the encoder, the decoder consists of ConvLSTM cells. The cell state and hidden state for the $n^{th}$ cell are denoted as $C_{t}^{d (n)}$ and $H_{t}^{d (n)}$, respectively. Initially, $C_{t = N}^{d} = C_{t = N}^{e}$ and $H_{t = N}^{d} = H_{t = N}^{e}$. Similar to the encoder, the update of $C_{t}^{d (n)}$ and $H_{t}^{d (n)}$ is also controlled by three types of gates, i.e., the input gate $i_{t}^{d (n)}$, forgetting gate $f_{t}^{d (n)}$, and output gate $o_{t}^{d (n)}$. Each layer possesses two internal inputs, $C_{t - 1}^{d (n)}$ and $H_{t - 1}^{d (n)}$, and an external input. Specifically, the lowest ConvLSTM layer takes $z_{t}$ as external input and $C_{t - 1}^{d (1)}$ and $H_{t - 1}^{d (1)}$ as internal inputs, as reported in Equations (11)–(15).

i_{t}^{d (1)} = σ (W_{z i}^{d (1)} * z_{t} + W_{h i}^{d (1)} * H_{t - 1}^{d (1)} + b_{i}^{d (1)})

(11)

f_{t}^{d (1)} = σ (W_{z f}^{d (1)} * z_{t} + W_{h f}^{d (1)} * H_{t - 1}^{d (1)} + b_{f}^{d (1)})

(12)

o_{t}^{d (1)} = σ (W_{z o}^{d (1)} * z_{t} + W_{h o}^{d (1)} * H_{t - 1}^{d (1)} + b_{o}^{d (1)})

(13)

C_{t}^{d (1)} = f_{t}^{d (1)} \circ C_{t - 1}^{d (1)} + i_{t}^{d (1)} \circ \tanh (W_{z c}^{d (1)} * z_{t} + W_{h c}^{d (1)} * H_{t - 1}^{d (1)} + b_{c}^{d (1)})

(14)

H_{t}^{d (1)} = o_{t}^{d (1)} \circ \tanh (C_{t}^{d (1)})

(15)

where $W$ denotes convolution kernel weights and $b$ denotes the biases (e.g., $b_{i}^{d (n)}$ is a bias of the $n^{th}$ cell’s input gate).

For the higher ConvLSTM layers, $H_{t}^{d (n - 1)}$ is taken as the external input for the $n^{th}$ layer. After all the ConvLSTM cells have completed processing, the abstracted prediction values are encapsulated by $H_{t}^{d (n)}$. Note that $H_{t}^{d (n)}$ represents a three-dimensional tensor that includes a highly semantic representation of ODFD demand of time interval $t$. Due to the presence of convolutions, the demand is not intuitively comprehensible. Therefore, it is passed through deconvolutional units to recover the corresponding 3D demand tensor ${\hat{x}}_{t}$. This procedure is represented as:

{\hat{x}}_{t} = DeCOV_{L} (H_{t}^{d (n)}), t \in [N + 1, \dots, N + B]

(16)

where $DeCOV$ denotes the deconvolution operation, and $L$ is the number of deconvolution layers, which is the same as the number of convolution layers in the encoder.

3.3. Model Training

As can be seen, the proposed model consists of three major components, i.e., the encoder, the attention module, and the decoder. In particular, the encoder is composed of convolutional units and ConvLSTM units that encode the input data sequence into dimensional representations. The attention module computes weights based on spatial information. The decoder leverages the attention information and decodes the encoded representations to generate future ODFD demands.

Algorithm 1 outlines the overall training process. Historical demand is transformed into grid maps $\{x_{t} \in R^{I \times J \times 2} | t = 1, \dots, N\}$, which are the input of the model. In the training phase, each $x_{t}$ is fed into the convolution layers of the encoder, producing a 3D output, $I_{t}$, which is then utilized by encoder ConvLSTM cells hierarchically to generate two 3D historical demand representations $H_{t}^{e (n)}$ and $C_{t}^{e (n)}$. Afterwards, the attention model converts $H_{N}^{d (n)}$ to generate $z_{t}$. Then, the decoder ConvLSTM cells are initialized by $H_{N}^{e (n)}$ and $C_{N}^{e (n)}$ and produce future demand representations $H_{N + 1}^{d (n)}$ based on $z_{N + 1}$. Then, the deconvolution layers deconvolve $H_{N + 1}^{d (n)}$ to derive ${\hat{x}}_{N + 1}$. This process is repeated to obtain the explicit ODFD demand for the following period $\{{\hat{x}}_{t} \in R^{I \times J \times 2} | t = N + 1, \dots, N + B\}$. Finally, the model is trained via backpropagation with mini-batches using the Adam optimizer [39].

Algorithm 1: Training Algorithm

Input: Historical demand observations $\{x_{1}, \dots, x_{N}\}$

Output: Learned attention-based ConvLSTM model

$For each t = 1, \dots, N$
$I_{t} = C O V_{L} (x_{t})$
$H_{t}^{e (n)}$ , $C_{t}^{e (n)}$ = ConvLSTM ( $I_{t}$ ), $n \in [1, \dots, n]$
Compute $z_{t}$ with Equations (7)–(10)
$H_{t}^{d (n)}$ , $C_{t}^{d (n)}$ = ConvLSTM ( $I_{t}, z_{t}$ )
${\hat{x}}_{t} = D e C O V_{L} (H_{t}^{d (n)})$
Randomly initialize all learnable parameters W in the model
Train the model by updating weights W by minimizing the cross-entropy loss using the Adam optimizer
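The forward pass described in the text (encode the history, then repeatedly attend, decode, and deconvolve for each forecast step) can be sketched as a plain control-flow skeleton. The four callables are hypothetical stand-ins for the model components, only the top-layer states are tracked for brevity, and the toy scalar lambdas exist only to make the example runnable.

```python
def forecast(history, horizon, encode, attend, decode_step, deconv):
    """Roll the encoder-attention-decoder pipeline forward `horizon` steps.

    All four callables are hypothetical stand-ins:
      encode(history)       -> (H, C), final encoder states at t = N
      attend(H)             -> z, the demand trend of Equations (7)-(10)
      decode_step(z, H, C)  -> (H, C), updated decoder states
      deconv(H)             -> predicted demand tensor
    """
    H, C = encode(history)              # decoder starts from the encoder states
    predictions = []
    for _ in range(horizon):
        z = attend(H)                   # attention uses the latest hidden state
        H, C = decode_step(z, H, C)     # one decoder ConvLSTM update
        predictions.append(deconv(H))   # map the hidden state back to demand
    return predictions

# Toy scalar stand-ins, only to show the control flow.
preds = forecast(
    history=[1.0, 2.0, 3.0],
    horizon=3,
    encode=lambda xs: (xs[-1], 0.0),
    attend=lambda H: 0.5 * H,
    decode_step=lambda z, H, C: (H + z, C + 1.0),
    deconv=lambda H: 2.0 * H,
)
```

In a real model, each prediction would be compared with the observed demand, and the resulting loss would be minimized with the optimizer named in Algorithm 1.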

In the testing phase, predicted demand is obtained based on the model’s parameter configuration, which was set up by learning historical patterns during training. The most likely demand volume estimation is obtained according to the past automatically-learned sequential patterns of food delivery demand variations over space and time.

4. Experiment and Result Analysis

This section compares the performance of At-ConvLSTM with some classical forecast models based on a real-world data set. All runs were implemented on a computer with 16 GB of RAM and an NVIDIA GTX 1660 Ti GPU. All deep learning prediction methods were implemented in TensorFlow 1.15.

4.1. Study Area and Dataset

The dataset encompasses 21 days of spatial–temporal data on ODFD orders on the Ele.me platform in Shenzhen, China, as shown in Figure 3. In total, it contains 1,048,576 delivery records. Each record contains the starting/ending time and location, as well as the number of orders that the couriers served simultaneously. Orders with coordinates outside the city boundary, excessively short delivery times (e.g., <1 min), unreasonable delivery speeds, or identical sender and receiver coordinates are removed as outliers. After data filtering, 879,947 records were kept for subsequent analysis. The filtered dataset still has an average of about 42,000 records per day, which is sufficient to support the subsequent research.
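The four filtering rules can be sketched as a single predicate. The record layout, the bounding box, and the 60 km/h speed cap below are illustrative assumptions; the text only gives the one-minute example for short deliveries and does not state the other thresholds.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def is_valid(rec, bounds, min_minutes=1.0, max_speed_kmh=60.0):
    """Return True if a delivery record passes all four filters.

    rec is a hypothetical dict with keys send_lat, send_lon, recv_lat,
    recv_lon and minutes (delivery duration). bounds is
    (lat_min, lat_max, lon_min, lon_max). The speed cap is an
    illustrative assumption, not a value from the paper.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    pts = [(rec["send_lat"], rec["send_lon"]), (rec["recv_lat"], rec["recv_lon"])]
    if any(not (lat_min <= la <= lat_max and lon_min <= lo <= lon_max) for la, lo in pts):
        return False                  # coordinates outside the study area
    if pts[0] == pts[1]:
        return False                  # identical sender and receiver
    if rec["minutes"] < min_minutes:
        return False                  # implausibly short delivery
    dist = haversine_km(*pts[0], *pts[1])
    return dist / (rec["minutes"] / 60.0) <= max_speed_kmh

bounds = (22.4, 22.9, 113.7, 114.7)   # rough illustrative box, not the paper's
ok = {"send_lat": 22.54, "send_lon": 114.05,
      "recv_lat": 22.55, "recv_lon": 114.06, "minutes": 20.0}
too_short = dict(ok, minutes=0.5)
kept = [r for r in (ok, too_short) if is_valid(r, bounds)]
```

Straight-line distance is used here as a lower bound on travel distance, so a record flagged as too fast by this rule would be at least as fast along any real route.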

Like most studies on spatio-temporal data analysis, we divided the city into regular equal grids so that it is natural to adopt a convolutional neural network for the spatial–temporal prediction tasks. In particular, the whole study area is divided into 16 × 16 grids, each grid with a size of precisely 5 km × 2.5 km. The size of the grid could indeed affect the prediction results. If it is set too small, there may not be enough data to represent reliable demand patterns. However, if it is set too big, the underlying correlation between grids may not be captured. As far as we know, there have been no studies that systematically investigate how to segment the city. In the future, different granularities with semantic meanings could be explored for the demand prediction.
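Assigning an order location to one of the 16 × 16 cells can be sketched as a simple binning step. The bounding box is a hypothetical input, since the exact extent of the study area is not given in the text, and the function name `to_cell` is likewise an assumption.

```python
def to_cell(lat, lon, bounds, n_rows=16, n_cols=16):
    """Map a coordinate to its (row, col) grid cell, or None if outside.

    bounds is a hypothetical (lat_min, lat_max, lon_min, lon_max) box
    covering the study area; cells are equal-sized in degrees.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
        return None
    # Points on the upper edges are clamped into the last row/column.
    i = min(int((lat - lat_min) / (lat_max - lat_min) * n_rows), n_rows - 1)
    j = min(int((lon - lon_min) / (lon_max - lon_min) * n_cols), n_cols - 1)
    return i, j

toy_bounds = (0.0, 16.0, 0.0, 32.0)   # toy box with 1 x 2 degree cells
cell = to_cell(8.5, 3.0, toy_bounds)
```

With this toy box each cell is twice as wide as it is tall, mirroring the rectangular cell shape described above.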

Figure 4 shows the spatial distribution of the merchants and the customers, respectively. In comparison to the geographical map in Figure 3, areas with high density distribution are typically characterized by specific buildings or regional functions, such as university towns, high-speed railway stations, government buildings, office areas, parks, etc. This observation reveals the existence of underlying spatial distribution patterns of ODFD demand in the city, which can be used by the attention model to enhance the accuracy of real-time demand prediction. The ODFD usage demonstrates a scattered or uniform distribution in the remaining area.

Figure 5 shows the average hourly order counts. There is a repetitive day-to-day demand pattern, with demand rising from 6:00 a.m., peaking around 11:00 a.m.–12:00 p.m., and then dropping sharply. In particular, almost all demand is concentrated between 6:00 a.m. and 12:00 noon, and demand at other times is only a fraction of the peak demand. The reason could be that people usually do not have sufficient time to eat in the morning and at noon, whereas they have considerably more time for dinner, which is consistent with the widely adopted work schedule. We also observe that, in a seven-day cycle, the sixth and seventh days always have lower peak demand volumes than the first five days, which matches a weekday/weekend pattern. Although the dataset does not contain any information about the day of the week, the temporal pattern is clear enough to identify the period of the entire dataset as three successive complete weeks, starting on a Monday and ending on a Sunday. We also observe that the number of received orders is larger than the number of send-out orders between 9:00 and 10:00 a.m. A possible reason is that customers who eat brunch place their orders in advance and request delivery during this hour.

Figure 6 plots the distribution of delivery times. As shown in the figure, most deliveries are quite quick (e.g., over 60% of the deliveries took less than 20 min) and rarely exceed an hour. In addition to a long distance between merchants and customers, there are other reasons for long delivery times (more than 45 min). For instance, there may not be enough couriers during meal times, and the selected courier may already be en route to execute other orders or may be serving multiple orders simultaneously. Another possible reason is that traffic congestion prevents the courier from taking the fastest route.

4.2. Experiment Setup

For the train–validation–test division of the data set, the first fourteen of the twenty-one days were selected as the training set, days fifteen to eighteen as the validation set, and the last three days as the test set. The validation set was applied during training epochs to avoid over-fitting. According to Figure 5, demand at the hourly granularity shows apparent periodicity. Therefore, the length of a time step is set to one hour in the following implementations.
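The chronological split can be written down in a few lines. The sketch below assumes that the data have already been aggregated into one frame per hour (21 × 24 = 504 frames); the function name and the integer stand-ins for frames are hypothetical.

```python
HOURS_PER_DAY = 24
TRAIN_DAYS, VAL_DAYS, TEST_DAYS = 14, 4, 3   # from the text: days 1-14, 15-18, 19-21


def split_by_day(frames):
    """Split an hourly sequence chronologically, without shuffling."""
    expected = (TRAIN_DAYS + VAL_DAYS + TEST_DAYS) * HOURS_PER_DAY
    if len(frames) != expected:
        raise ValueError(f"expected {expected} hourly frames, got {len(frames)}")
    a = TRAIN_DAYS * HOURS_PER_DAY
    b = a + VAL_DAYS * HOURS_PER_DAY
    return frames[:a], frames[a:b], frames[b:]


hourly = list(range(21 * 24))   # stand-in for 504 hourly demand frames
train, val, test = split_by_day(hourly)
```

Keeping the three subsets in temporal order ensures that the validation and test periods lie strictly after the training period.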

As can be seen from Figure 5, ODFD demand fluctuates periodically every day, peaking around 11:00 a.m. Based on this temporal trend, we assume that the next value depends on at most the ten most recent time steps at 1 h frequency. To this end, we used 10 time steps of input data to predict 10 time steps ahead. That is, $N$ and $B$ were both set to 10 in the training and testing sessions.
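One possible way to build training samples with these settings is a sliding window over the hourly sequence. The sketch below is an assumption about the sample construction rather than a description of the actual implementation; `n_in` and `n_out` are hypothetical parameter names for the input and output lengths, both set to 10 here.

```python
def make_windows(series, n_in=10, n_out=10):
    """Return (inputs, targets) pairs cut from a chronological sequence.

    Each input holds n_in consecutive frames and each target holds the
    n_out frames that immediately follow it.
    """
    samples = []
    last_start = len(series) - n_in - n_out
    for s in range(last_start + 1):
        x = series[s:s + n_in]
        y = series[s + n_in:s + n_in + n_out]
        samples.append((x, y))
    return samples


# Stand-in for 14 days of hourly frames; integers replace the demand maps.
train_series = list(range(14 * 24))
samples = make_windows(train_series)
first_x, first_y = samples[0]
```

With 336 hourly frames and a stride of one hour, this construction yields 317 overlapping samples.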

The training dataset was then clustered using the K-means++ method, and the results are shown in Figure 7. Specifically, the distortion is the sum of squared distances between each centroid and the tensors assigned to it, while the silhouette value measures how similar a tensor is to the cluster it belongs to. A higher silhouette value indicates a better match with its own cluster and a weaker match with neighboring clusters, and vice versa. As can be seen, the distortion generally decreases as the number of clusters $K$ increases. Meanwhile, the silhouette value also decreases. The number of clusters used by the attention mechanism should strike a balance between the distortion value and the silhouette value. It is also desirable to compress the data volume by using as few clusters as possible while preserving the accuracy of each cluster’s characteristics. Therefore, $K$ is set to 12, which balances the loss in the silhouette coefficient against the convergence of the mean distortion.
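To make the two criteria concrete, the following sketch implements K-means with K-means++ seeding in plain Python and reports the distortion and the mean silhouette value for several values of $K$. It is a toy illustration on synthetic two-dimensional points, not the clustering of demand tensors used in this study; the data, the range of $K$, and all function names are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def kmeans_pp(points, k, rng, iters=50):
    """K-means with K-means++ seeding; returns (labels, distortion)."""
    # Seeding: each new centroid is drawn with probability proportional to
    # its squared distance from the nearest centroid chosen so far.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [sq_dist(p, centroids[nearest(p, centroids)]) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    # Lloyd iterations: assign points, then move centroids to cluster means.
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    distortion = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, distortion


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        a = sum(sq_dist(p, q) ** 0.5 for q in own) / (len(own) - 1)
        b = min(sum(sq_dist(p, q) ** 0.5 for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic data: three well-separated 3 x 3 lattices of points.
centers = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers
          for dx in (-1.0, 0.0, 1.0) for dy in (-1.0, 0.0, 1.0)]
rng = random.Random(0)
scores = {}
for k in range(2, 6):
    labels, distortion = kmeans_pp(points, k, rng)
    scores[k] = (distortion, mean_silhouette(points, labels))
best_k = max(scores, key=lambda k: scores[k][1])
```

On this toy data the silhouette value singles out the three true groups. On real demand tensors the two curves are less clear-cut, which is why the choice of $K$ involves a trade-off between them.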

4.3. Baseline Models

At-ConvLSTM was compared against seven baseline models, and the specific details of the baselines are provided below:

ARIMA (auto-regressive integrated moving average): the prediction at time $t$ is obtained by averaging values of the input spatio-temporal series within $k$ periods of $t$ where $k$ is the window length.
SARIMA: seasonal ARIMA, which takes seasonality patterns into account for data series containing cycles.
LASSO (least absolute shrinkage and selection operator): this model employs an L1-norm regularization term as a penalty to regulate the absolute size of regression coefficients. The parameter $α$ balances empirical errors and the complexity of the linear model. In this study, $α$ is tuned from 0.5 to 6 in increments of 0.5.
XGBoost: this is an end-to-end tree-boosting system, primarily employing the gradient boosted decision tree (GBDT) algorithm [40].
RF (random forest): this is an ensemble learning method that combines multiple decision trees to make a final prediction. The maximum number of decision trees in the forest is set to one thousand to ensure that the model is not undertrained.
ResNet (residual neural network): a convolutional neural network architecture that enables the network to learn residual mappings and ease the training of deep models. In particular, it introduces skip connections, allowing information to flow directly from one layer to another. The hyperparameters of ResNet, closeness, period, and trend, are set as 3, 1, and 1, respectively [41].
ConvLSTM: All the elements of the model are identical to the At-ConvLSTM except for the absence of the attention model. The selection of parameters is also the same as the At-ConvLSTM parameters provided below.

ARIMA, LASSO, RF, and XGBoost are classical one-dimensional sequence models for time series prediction problems [42]. They predict the send-out and received demand of each grid separately, based on that grid’s historical demand data. At-ConvLSTM, ConvLSTM, and ResNet perform multi-step prediction, in which each single-step prediction output is used as an input for the subsequent prediction step. For ResNet, multi-step prediction is implemented by iterating single-step predictions of the whole grid map.
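The iterative scheme can be sketched as follows. The single-step model used here is a toy placeholder (the mean of the window), not ResNet or any of the models compared in this study, and the function names are hypothetical; the point is only to show how each forecast is fed back as input.

```python
def rollout(step_fn, history, n_steps):
    """Predict n_steps ahead by feeding each prediction back as input.

    step_fn maps a window of frames to the next frame; history holds the
    most recent observed frames, and the window length is kept fixed.
    """
    window = list(history)
    predictions = []
    for _ in range(n_steps):
        nxt = step_fn(window)
        predictions.append(nxt)
        window = window[1:] + [nxt]   # drop the oldest frame, append the forecast
    return predictions


def mean_of_window(window):
    """Toy single-step model (a placeholder): the average of the window."""
    return sum(window) / len(window)


preds = rollout(mean_of_window, [2.0, 4.0, 6.0], n_steps=3)
```

Because later steps are computed from earlier forecasts rather than from observations, any error in an early step propagates to all subsequent steps.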

4.4. At-ConvLSTM Settings

Training is performed using mini-batch gradient descent (MBGD) with a batch size of 16. The model is trained for 20 epochs and validated after each epoch. The optimizer used in the network is Adam. The initial learning rate and the keep probability are set to 0.0002 and 0.9, respectively. Mean square error (MSE) is used as the loss function. The network settings are presented in Table 1 [43].
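The training procedure can be illustrated on a toy problem. The sketch below runs mini-batch training with Adam updates and an MSE loss using the batch size, number of epochs, and learning rate stated above. The linear model, the synthetic data, and the Adam moment parameters (set to the common defaults) are assumptions made for illustration, and dropout is omitted.

```python
import math
import random

LEARNING_RATE = 0.0002   # from the text
BATCH_SIZE = 16          # from the text
EPOCHS = 20              # from the text
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8   # assumed: common Adam defaults


def mse(params, batch):
    """Mean square error of the toy model y = w * x + b on a batch."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)


def mse_grad(params, batch):
    """Gradient of the MSE loss with respect to (w, b)."""
    w, b = params
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [gw, gb]


def adam_step(params, grads, m, v, t):
    """One in-place Adam update with bias-corrected moment estimates."""
    for i, g in enumerate(grads):
        m[i] = BETA1 * m[i] + (1 - BETA1) * g
        v[i] = BETA2 * v[i] + (1 - BETA2) * g * g
        m_hat = m[i] / (1 - BETA1 ** t)
        v_hat = v[i] / (1 - BETA2 ** t)
        params[i] -= LEARNING_RATE * m_hat / (math.sqrt(v_hat) + EPS)


# Synthetic data from y = 2x + 1, split chronologically for the example.
data = [(i / 64, 2 * (i / 64) + 1) for i in range(64)]
train, val = data[:48], data[48:]
params, m, v, t = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], 0
rng = random.Random(0)
val_history = [mse(params, val)]
for epoch in range(EPOCHS):
    rng.shuffle(train)
    for start in range(0, len(train), BATCH_SIZE):
        t += 1
        batch = train[start:start + BATCH_SIZE]
        adam_step(params, mse_grad(params, batch), m, v, t)
    val_history.append(mse(params, val))   # validate once per epoch
```

With such a small learning rate the toy model is far from converged after 20 epochs, but the validation loss recorded after each epoch decreases steadily, which is the signal that would be monitored to detect over-fitting.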

4.5. Results and Discussion

The following sections will first elucidate the overall prediction accuracies of all the models, and then analyze them in terms of hourly accuracies and step-wise prediction accuracies. Finally, we discuss the region-wide prediction accuracies of At-ConvLSTM.

Table 2 presents the comparison results at the aggregate prediction level. RMSE (root mean square error) and MAE (mean absolute error) are normalized and fall between 0 and 1. The three deep-learning-based models substantially outperform the statistical and machine learning models. For example, compared to XGBoost, At-ConvLSTM reduces the MAE/RMSE by 96.9%/90.9% for send-out demand prediction and by 95.8%/90.7% for received demand prediction. This indicates that the spatial correlation between adjacent and more distant regions provides important information for spatial–temporal ODFD demand prediction, and that the convolution layers and convolution operations in the At-ConvLSTM framework characterize this spatial correlation well. Note that the results of SARIMA are very close to those of ARIMA, since both rely on past demand values in the temporal dimension. Both perform worse than the proposed model, possibly because the underlying spatial correlation is not taken into account.
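The relative reductions quoted above can be checked directly against the entries of Table 2. The short calculation below uses the rounded table values, so small deviations from the reported percentages are to be expected; the variable names are illustrative.

```python
# (MAE, RMSE) pairs copied from Table 2.
xgboost = {"send_out": (0.0523, 0.0364), "received": (0.0459, 0.0302)}
at_convlstm = {"send_out": (0.0016, 0.0033), "received": (0.0019, 0.0028)}


def reduction_pct(baseline, proposed):
    """Relative error reduction of the proposed model, in percent."""
    return 100.0 * (baseline - proposed) / baseline


reductions = {
    kind: tuple(round(reduction_pct(b, p), 1)
                for b, p in zip(xgboost[kind], at_convlstm[kind]))
    for kind in xgboost
}
```

From the rounded entries, the send-out reductions come out as 96.9% (MAE) and 90.9% (RMSE) and the received RMSE reduction as 90.7%, in agreement with the text. The received MAE reduction evaluates to 95.9%, within 0.1 percentage point of the reported 95.8%, a gap that is plausibly due to rounding in the table.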

Among the three deep-learning-based models, ResNet performs slightly worse than the other two. There are two possible reasons. One is that ResNet’s residual blocks are less capable of spatial–temporal feature extraction than ConvLSTM. The other is that ResNet’s multi-step prediction is achieved by iteratively performing single-step predictions, so errors accumulate progressively and degrade the results. Furthermore, the comparison between At-ConvLSTM and ConvLSTM shows that the attention model improves the prediction accuracy. This finding suggests that the attention model captures additional spatial–temporal feature information when processing the spatial data, thereby enhancing the decoding capability of the decoder. We also observe that At-ConvLSTM’s prediction accuracy at certain steps is lower than that of RF and XGBoost. A possible reason is that At-ConvLSTM sacrifices accuracy at certain times in order to minimize the overall prediction loss. Overall, At-ConvLSTM, which makes the fullest use of temporal and spatial features, delivers the most stable and reliable predictions.

As seen in Table 2, the prediction accuracy for send-out demand is slightly lower than that for received demand. In the following, we further conduct analyses for send-out demand.

Figure 8 further illustrates the performance for predicting send-out demand across different time intervals within a day. The test set spans three days, and all models show almost the same accuracy pattern across time intervals on each day, so we present the average RMSE over the three days. Both At-ConvLSTM and ResNet exhibit consistently reliable predictive capabilities over the 24 h. However, At-ConvLSTM shows lower accuracy during periods of rising demand (e.g., 11:00 a.m.–12:00 p.m.) and higher accuracy during the subsequent decline. On the other hand, LASSO, XGBoost, and RF show large prediction errors before and after the peak, consistently underestimating the actual demand values. This indicates that these methods learn conservatively in peak demand scenarios. During non-peak periods, both XGBoost and RF exhibit higher prediction accuracy than ConvLSTM and ResNet. Moreover, At-ConvLSTM additionally recognizes 2:00, 17:00, and 20:00 as flat peaks and overestimates the corresponding demand. By adjusting the values of $N$ and $B$, such false flat peaks, although they may still exist, could be reduced or shifted to a relatively uncritical time period (e.g., early morning) to minimize the loss.

Figure 9 shows the prediction results over grids. As observed in the figure, the prediction accuracy is relatively low in grids with large demand volumes. This is intuitive, as greater demand implies more uncertainty and is thus more difficult to predict. The prediction accuracy could be further improved by partitioning the city based on additional information (such as geographical, environmental, and event-specific knowledge) instead of dividing it into regular grids.

5. Conclusions

In this paper, a deep-learning-based encoder–decoder architecture, At-ConvLSTM, is introduced to address the short-term prediction of on-demand food delivery demand at the city scale. It employs convolutional units and ConvLSTM units to extract spatial–temporal features from the demand data, and an attention model is adopted to learn the different degrees of influence of each representative citywide demand pattern at each time step. Using a real-world dataset, we compare the At-ConvLSTM model with several baseline models. Results indicate that the proposed At-ConvLSTM model has a reliable and stable prediction capability for the short-term multi-step demand forecasting problem. Furthermore, the inclusion of the attention model improves the accuracy of multi-step forecasting.

Compared to traditional statistical models, the deep-learning-based prediction model proposed in this study can uncover non-linear relationships in data that would be difficult to detect through traditional methods. Moreover, deep learning can handle large and complex data and has been used to achieve state-of-the-art performance on a wide range of problems. However, deep learning models can only make predictions based on the data they have been trained on. They may not be able to generalize to new situations or contexts that were not represented in the training data. Another limitation is that some deep learning models are considered “black-box” models, as it is difficult to understand how the model makes its predictions and to identify the factors that influence them. Such models are also computationally expensive and require a large amount of data and computational resources to train, including powerful GPUs and large amounts of memory. This can be costly and time-consuming.

There are some possible directions that can be addressed in the future. First, in addition to the historical demand, more environmental information (e.g., weather conditions, POI, land use, etc.) could be incorporated into the model to further improve prediction performance. Second, a deeper analysis of the prediction results to improve quality of operational decisions could be the next step to take. Rather than quantifying statistical errors, the prediction outputs could be assessed from a business perspective. For instance, it is interesting to take the uncertainty of the prediction into account when assigning couriers to a batch of orders.

Author Contributions

Conceptualization, X.Y.; methodology, X.Y. and A.L.; validation, X.Y. and A.L.; formal analysis, X.Y. and A.L.; data curation, A.L.; writing—original draft preparation, A.L.; writing—review and editing, X.Y.; visualization, A.L.; supervision, H.M.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China [No. 72201056, 71901059], and the Natural Science Foundation of Jiangsu Province in China [No. BK20210250].

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Liu, S.; Luo, Z. On-Demand Delivery from Stores: Dynamic Dispatching and Routing with Random Demand. Manuf. Serv. Oper. Manag. 2023, 25, 595–612.
2. Hess, A.; Spinler, S.; Winkenbach, M. Real-time demand forecasting for an urban delivery platform. Transp. Res. E Logist. Transp. Rev. 2020, 145, 102147.
3. Loo, B.P.Y.; Wang, B. Factors associated with home-based e-working and e-shopping in Nanjing, China. Transportation 2017, 45, 365–384.
4. Tsai, P.-H.; Chen, C.-J.; Hsiao, W.-H.; Lin, C.-T. Factors influencing the consumers’ behavioural intention to use online food delivery service: Empirical evidence from Taiwan. J. Retail. Consum. Serv. 2023, 73, 103329.
5. Zhu, L.; Yu, W.; Zhou, K.; Wang, X.; Feng, W.; Wang, P.; Chen, N.; Lee, P. Order Fulfillment Cycle Time Estimation for On-Demand Food Delivery. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, Virtual Event, 6–10 July 2020.
6. Ke, J.; Yang, H.; Zheng, H.; Chen, X.; Jia, Y.; Gong, P.; Ye, J. Hexagon-Based Convolutional Neural Network for Supply-Demand Forecasting of Ride-Sourcing Services. IEEE Trans. Intell. Transp. Syst. 2018, 20, 4160–4173.
7. Liu, S.; Jiang, H.; Chen, S.; Ye, J.; He, R.; Sun, Z. Integrating Dijkstra’s algorithm into deep inverse reinforcement learning for food delivery route planning. Transp. Res. E Logist. Transp. Rev. 2020, 142, 102070.
8. Qin, H.; Wei, Y.; Zhang, Q.; Ma, L. An observational study on the risk behaviors of electric bicycle riders performing meal delivery at urban intersections in China. Transp. Res. F Traffic Psychol. Behav. 2021, 79, 107–117.
9. Zheng, J.; Wang, L.; Chen, J.-F.; Wang, X.; Liang, Y.; Duan, H.; Li, Z.; Ding, X. Dynamic multi-objective balancing for online food delivery via fuzzy logic system-based supply–demand relationship identification. Comput. Ind. Eng. 2022, 172, 108609.
10. Simoni, M.D.; Winkenbach, M. Crowdsourced on-demand food delivery: An order batching and assignment algorithm. Transp. Res. C Emerg. Technol. 2023, 149, 104055.
11. Wang, Z.; He, S.Y. Impacts of food accessibility and built environment on on-demand food delivery usage. Transp. Res. D Transp. Environ. 2021, 100, 103017.
12. Talamini, G.; Li, W.; Li, X. From brick-and-mortar to location-less restaurant: The spatial fixing of on-demand food delivery platformization. Cities 2022, 128, 103820.
13. Zhang, Y.; Wen, Z. Mapping the environmental impacts and policy effectiveness of takeaway food industry in China. Sci. Total Environ. 2021, 808, 152023.
14. Zhang, S.; Luan, H.; Zhen, F.; Kong, Y.; Xi, G. Does online food delivery improve the equity of food accessibility? A case study of Nanjing, China. J. Transp. Geogr. 2023, 107, 103516.
15. Li, H.-C.; Liang, J.-K. Service pricing strategy of food delivery platform operators: A demand-supply interaction model. Res. Transp. Bus. Manag. 2022, 45, 100904.
16. Dias, F.F.; Lavieri, P.S.; Sharda, S.; Khoeini, S.; Bhat, C.R.; Pendyala, R.M.; Pinjari, A.R.; Ramadurai, G.; Srinivasan, K.K. A comparison of online and in-person activity engagement: The case of shopping and eating meals. Transp. Res. C Emerg. Technol. 2020, 114, 643–656.
17. Kim, W.; Wang, X. To be online or in-store: Analysis of retail, grocery, and food shopping in New York city. Transp. Res. C Emerg. Technol. 2021, 126, 103052.
18. Spurlock, C.A.; Todd-Blick, A.; Wong-Parodi, G.; Walker, V. Children, Income, and the Impact of Home Delivery on Household Shopping Trips. Transp. Res. Rec. J. Transp. Res. Board 2020, 2674, 335–350.
19. Gehrke, S.R.; Wang, L. Operationalizing the neighborhood effects of the built environment on travel behavior. J. Transp. Geogr. 2019, 82, 102561.
20. Gao, C.; Zhang, F.; Zhou, Y.; Feng, R.; Ru, Q.; Bian, K.; He, R.; Sun, Z. Applying Deep Learning Based Probabilistic Forecasting to Food Preparation Time for on-Demand Delivery Service. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022.
21. Mehrolia, S.; Alagarsamy, S.; Solaikutty, V.M. Customers response to online food delivery services during COVID-19 outbreak using binary logistic regression. Int. J. Consum. Stud. 2020, 45, 396–408.
22. Hong, C.; Choi, H.H.; Choi, E.-K.C.; Joung, H.-W.D. Factors affecting customer intention to use online food delivery services before and during the COVID-19 pandemic. J. Hosp. Tour. Manag. 2021, 48, 509–518.
23. Zhao, Y.; Bacao, F. What factors determining customer continuingly using food delivery apps during 2019 novel coronavirus pandemic period? Int. J. Hosp. Manag. 2020, 91, 102683.
24. Zanetta, L.D.; Hakim, M.P.; Gastaldi, G.B.; Seabra, L.M.J.; Rolim, P.M.; Nascimento, L.G.P.; Medeiros, C.O.; da Cunha, D.T. The use of food delivery apps during the COVID-19 pandemic in Brazil: The role of solidarity, perceived risk, and regional aspects. Food Res. Int. 2021, 149, 110671.
25. Ramos, E.A.; Kiszka, J.J.; Pouey-Santalou, V.; Barragán, R.R.; Chávez, A.J.G.; Audley, K. Food sharing in rough-toothed dolphins off southwestern Mexico. Mar. Mammal Sci. 2020, 37, 352–360.
26. Yeo, S.F.; Tan, C.L.; Teo, S.L.; Tan, K.H. The role of food apps servitization on repurchase intention: A study of FoodPanda. Int. J. Prod. Econ. 2021, 234, 108063.
27. Parwez, S.; Ranjan, R. The platform economy and the precarisation of food delivery work in the COVID-19 pandemic: Evidence from India. Work. Organ. Labour Glob. 2021, 15, 11–30.
28. Puram, P.; Gurumurthy, A.; Narmetta, M.; Mor, R.S. Last-mile challenges in on-demand food delivery during COVID-19: Understanding the riders’ perspective using a grounded theory approach. Int. J. Logist. Manag. 2022, 33, 901–925.
29. Jia, H.; Shen, S.; García, J.A.R.; Shi, C. Partner with a Third-Party Delivery Service or Not? A Prediction-and-Decision Tool for Restaurants Facing Takeout Demand Surges during a Pandemic. Serv. Sci. 2022, 14, 139–155.
30. Dong, S.; Zhang, Y.; Zhou, X. Intelligent Hybrid Modeling of Complex Leaching System Based on LSTM Neural Network. Systems 2023, 11, 78.
31. Du, X.; Wang, Z.; Wang, Y. The Spatial Mechanism and Predication of Rural Tourism Development in China: A Random Forest Regression Analysis. ISPRS Int. J. Geo-Inf. 2023, 12, 321.
32. Hou, M.; Hu, X.; Cai, J.; Han, X.; Yuan, S. An Integrated Graph Model for Spatial–Temporal Urban Crime Prediction Based on Attention Mechanism. ISPRS Int. J. Geo-Inf. 2022, 11, 294.
33. Zhang, J.; Zheng, Y.; Qi, D. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
34. Li, Z.; Xu, H.; Gao, X.; Wang, Z.; Xu, W. Fusion attention mechanism bidirectional LSTM for short-term traffic flow prediction. J. Intell. Transp. Syst. 2022, 1–14.
35. Crivellari, A.; Beinat, E.; Caetano, S.; Seydoux, A.; Cardoso, T. Multi-target CNN-LSTM regressor for predicting urban distribution of short-term food delivery demand. J. Bus. Res. 2022, 144, 844–853.
36. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; Woo, W.-C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): New Orleans, LA, USA, 2015; Volume 28.
37. Arthur, D.; Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007.
38. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015.
39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
40. Cho, H. Xgboost. 2016. Available online: https://github.com/dmlc/xgboost (accessed on 18 September 2023).
41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
42. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
43. Tekin, S.F.; Karaahmetoglu, O.; Ilhan, F.; Balaban, I.; Kozat, S.S. Spatio-temporal weather forecasting and attention mechanism on convolutional lstms. arXiv 2021, arXiv:2102.00696.

Figure 1. Overview of the At-ConvLSTM.

Figure 2. The attention model.

Figure 3. Boundary of Shenzhen, China.

Figure 4. Spatial distribution of ODFD usage in Shenzhen. (a) Merchant, (b) Customer.

Figure 5. Average number of ODFD orders over time.

Figure 6. Distribution of delivery time.

Figure 7. Clustering results with K-means++.

Figure 8. Prediction results for send-out demand over time.

Figure 9. Prediction results over grids. (a) Average number of send-out orders. (b) Average number of received orders. (c) RMSE for send-out orders. (d) RMSE for received orders.

Table 1. Network settings.

Blocks/Layers	(Size, Stride, No. of Filter Kernels)/Layer Setting
Encoder
Convolution layer	3 × 3, 1, 8
Convolution layer	3 × 3, 1, 16
ConvLSTM cell × 2	3 × 3, 1, 16
Decoder
Deconvolution layer	3 × 3, 1, 8
Deconvolution layer	3 × 3, 1, 2
Attention model
Convolution layer	3 × 3, 1, 8
Convolution layer	3 × 3, 1, 16
MLP	1 hidden layer with 1024 Neurons

Table 2. Comparison results on considered approaches.

Model	Send-Out Demand		Received Demand		Overall
	MAE	RMSE	MAE	RMSE	MAE	RMSE
ARIMA	0.1771	0.1055	0.1566	0.0919	0.1672	0.1913
SARIMA	0.1761	0.1045	0.1523	0.0901	0.1573	0.1802
LASSO	0.1462	0.1009	0.1254	0.0862	0.1391	0.1794
RF	0.0742	0.0328	0.0654	0.0272	0.0707	0.0570
XGBoost	0.0523	0.0364	0.0459	0.0302	0.0496	0.0648
ResNet	0.0053	0.0075	0.0046	0.0066	0.0050	0.0142
ConvLSTM	0.0028	0.0041	0.0024	0.0035	0.0026	0.0115
At-ConvLSTM	0.0016	0.0033	0.0019	0.0028	0.0018	0.0106

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
