1. Introduction
Over the past decade, the rapid expansion of the internet has brought unprecedented convenience to people’s daily lives. One area that has experienced remarkable growth is on-demand food delivery (ODFD). For instance, in 2020, China’s online ODFD market size reached 664.62 billion Renminbi, with a year-over-year rise of 15% (EqualOcean, 2021). However, efficiently satisfying such a large ODFD demand remains a major challenge for current service platforms. To address this challenge, numerous operating strategies have been developed, including deliverer dispatching, re-allocation, and surge pricing, all aimed at managing the high demand and improving system efficiency for ODFD platforms [1]. These strategies can help to reduce the mismatch between demand and supply and to establish efficient delivery routes and resource allocation, which enables the platform to provide a better customer experience by ensuring timely deliveries and reducing waiting times. The effectiveness of these strategies, however, depends heavily on short-term predictions of ODFD demand [2]. Therefore, the ability to predict demand accurately is crucial for successful ODFD operations.
On a daily basis, the ODFD platform needs to select couriers to serve dynamic customer orders so as to reduce the logistics cost and the customer inconvenience cost. After an order is placed, the merchant is notified to prepare the food, and the platform estimates when the food will be ready for pickup, so that the system can make better planning decisions such as courier assignment. Meanwhile, the estimated delivery time is also presented to the customer and can be considered a service promise that the platform needs to fulfill. To support these decisions, this study further divides the ODFD demand into two classes: the demand sent out from a region and the demand received within a region. Accurate ODFD demand prediction for the near future (i.e., one hour) across the city would enable the platform to provide a better customer experience by ensuring timely deliveries and avoiding local courier shortages.
However, demand prediction for ODFD is very difficult, mainly due to the following challenges. First, the ODFD demand may have different spatial–temporal patterns. The spatial distribution of the ODFD demand could be affected by multiple factors, such as the regional economic agglomeration and population density, spatial distributions of restaurants [3], sociodemographic attributes [4], and personal factors of consumers [5]. Different customers may have different meal preferences. Second, the ODFD usage within a given region also varies with time. For instance, the ODFD demand may rise sharply during meal times on a daily basis. Moreover, consecutive weekdays often exhibit recurring demand patterns that unfold every 24 h, while weekends may follow a dissimilar pattern. Furthermore, other factors, such as weather and the morning traffic peak, also affect the demand, as couriers may not be able to deliver the meal package on time. Finally, the sent demand and received demand within a region may affect each other in the short term due to weather/traffic conditions, as well as geographical information about the origin–destination pair and the travel route.
The prediction of ODFD demand belongs to the family of spatial–temporal predictions. Previous studies are mainly based on statistical models and machine learning, including the time series ARIMA approach, regressions, Bayesian network (BN) models [6], and so on. Although these approaches have alleviated the prediction difficulties, most of them do not consider spatial–temporal correlations in the demand. With traditional model structures and estimation algorithms, it can be difficult to incorporate such spatial information into predictions. In recent years, deep-learning-based approaches have been widely used for demand predictions, including bike usage prediction, ride-hailing demand–supply prediction [7], and so on. Specifically, convolutional neural networks are capable of capturing spatial–temporal correlations in transportation prediction problems. Recurrent neural networks and their extensions such as long short-term memory are well suited to processing time series data streams.
To tackle these challenges, this paper proposes an attention-based convolutional long short-term memory (At-ConvLSTM) method to perform short-term forecasting of ODFD demand at the city scale. The main contributions are three-fold. First, the spatial–temporal correlations between different regions for sent demand and received demand are captured by a combination of convolutional units and LSTM layers. Specifically, convolutional neural network (CNN) layers are utilized to enhance the extraction of spatial features, while LSTM layers are adopted to capture the short- and long-term sequential pattern information. Second, an attention model is designed and incorporated to further improve prediction accuracy. Specifically, it addresses spatial variation in demand by assigning weights to demand in different regions for each forecast step. Third, the proposed At-ConvLSTM is illustrated using a historical ODFD dataset from Shenzhen, China. Results show that it outperforms several baseline approaches, and discussions are also provided.
The remainder of the paper is organized as follows. In Section 2, related works are reviewed. Section 3 first describes the problem formally and then introduces the At-ConvLSTM model. In Section 4, we analyze our model’s performance over real datasets and compare it with several baseline methods. In the same section, we also provide some exploratory data analysis with our dataset. Lastly, we conclude the paper in Section 5.
3. Methodology
3.1. Problem Description
Unlike traditional urban logistics based on known demand, a customer’s request may arrive at any time and any place, while the status and location of riders also change with time. In some cases, no delivery person may be available in the vicinity of a request, creating long waiting times and, consequently, order cancellations. Minimizing delays and improving user satisfaction for ODFD services requires effective assignments between orders and riders. Moreover, the amount of time elapsed between the order being picked up and the receipt of the food could vary due to numerous random elements. Therefore, even if the send-out demand is known, the platform does not know exactly when the order will be delivered.
In this study, we predict the send-out demand and the received demand separately, which is helpful for ODFD platforms to respond to immediate requests from customers and hedge the uncertainty in demand prediction. For instance, with future send-out demand, it is possible for the platform to bundle multiple orders to a single rider nearby or guide idle riders waiting near locations where new requests are more likely to occur. Meanwhile, with predicted received demand, the platform could make better assignment decisions based on the status of orders and riders. New emerging requests can be assigned to those riders that could finish delivery within a short time.
A city is partitioned into a set of grids . A day is divided into multiple time intervals (e.g., one hour per interval), each of them indexed by . Each ODFD order is represented by a tuple , where / denotes the delivery start/end time, and represents the merchant and customer location. Two types of ODFD demand are defined and predicted, i.e., the send-out demand from a region and the received demand within a region. At each time step , ODFD demand across all regions is denoted as a 3D tensor , where denote the send-out demand and received demand in grid at time interval , respectively. The problem is considered a spatial–temporal prediction task. That is, given a series of historical demand , this study aims to collectively predict at each time interval .
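The construction of the demand tensors described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `build_demand_tensors`, the order layout, and the convention that a send-out order is counted in the merchant's grid at its start interval while a received order is counted in the customer's grid at its end interval are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical order layout: (start_interval, end_interval, merchant_cell, customer_cell),
# where each cell is a (row, col) index into the city grid.
Order = Tuple[int, int, Tuple[int, int], Tuple[int, int]]


def build_demand_tensors(orders: List[Order], n_intervals: int,
                         n_rows: int, n_cols: int):
    """Return demand[t][row][col] = [send_out_count, received_count]."""
    demand = [[[[0, 0] for _ in range(n_cols)] for _ in range(n_rows)]
              for _ in range(n_intervals)]
    for t_start, t_end, (m_row, m_col), (c_row, c_col) in orders:
        # Assumed convention: send-out demand belongs to the merchant's grid
        # in the interval in which the delivery starts.
        if 0 <= t_start < n_intervals:
            demand[t_start][m_row][m_col][0] += 1
        # Assumed convention: received demand belongs to the customer's grid
        # in the interval in which the delivery ends.
        if 0 <= t_end < n_intervals:
            demand[t_end][c_row][c_col][1] += 1
    return demand


orders = [(0, 1, (0, 0), (1, 1)), (0, 0, (0, 0), (0, 1))]
demand = build_demand_tensors(orders, n_intervals=2, n_rows=2, n_cols=2)
```

Each `demand[t]` then plays the role of the 3D tensor for one time interval, with the last axis holding the two demand types.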
ConvLSTM is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions. Specifically, the convolution unit is responsible for capturing spatial attributes, while the LSTM part is adopted to learn temporal attributes. Designed for research tasks such as image recognition, convolutional neural networks are capable of capturing spatial–temporal correlations in transportation prediction problems. At each time step, the model receives a vector of |G| values; vectors across consecutive time steps associate the same grid to the same position along the array, heading towards the same neuron of the input layer. ODFD demands of different grids are therefore analyzed simultaneously but acquired through separate entries, hence combining two processing perspectives: the sequential evolution of urban demand over time and its geographic distribution across multiple grids of the city.
3.2. Attention-Based ConvLSTM
Figure 1 illustrates the architecture of the proposed At-ConvLSTM model for predicting short-term ODFD demand. In the encoder block, the historical demand is encoded into a sequence of tensors with specified dimensional features. Then, the attention model incorporates the attention mechanism to quantify spatial–temporal regularity based on historical demand. Finally, the decoder performs predictions based on spatial–temporal characteristics and attention information.
3.2.1. Encoder Structure
The encoder utilizes convolution units and ConvLSTM cells to extract intricate spatial–temporal connections from
. Specifically, each
undergoes a sequence of convolutions through convolution layers to engender the spatial interdependence:
where
and
represent the number of convolutional layers and convolution operations, respectively. In order to circumvent down-sampling, the convolution layer does not engage in pooling. Consequently,
remains a 3D tensor, where the first two dimensions correspond to spatial coordinates and the third dimension encompasses the extracted features.
Each ConvLSTM layer comprises a ConvLSTM cell, which is characterized by its hidden state and cell state. Specifically, the hidden state carries the information extracted from the input at the previous time step, and the cell state stores the long-term information [36]. The cell state and hidden state of the cell
(
) at encoding step
are denoted as
and
, which retain temporally distant and recent features, respectively.
&
are initialized with zero. When encoding, each cell possesses two internal inputs, i.e.,
and
, and an external input. The update of
and
is controlled by three types of gates, i.e., the input gate
, forgetting gate
, and output gate
. Specifically,
controls how much information from external input can be incorporated in
;
is responsible for erasing useless information from
;
determines how much information from
can be leaked to
. For the lowest ConvLSTM,
is taken as the external input, as reported in Equations (2)–(6).
where ∗ denotes the convolution operator, ∘ denotes the Hadamard product, and σ (·) denotes the sigmoid function.
represents convolution kernel weights and
denotes biases of each neural network (e.g.,
is a bias of the
cell’s forgetting gate).
For higher ConvLSTM layers, is taken as the external input for the layer. By recursively and sequentially applying the ConvLSTM layers to , the most recent cell and hidden states, and , are obtained and then transmitted to the decoder.
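For reference, the gate mechanism described in this subsection matches the widely used ConvLSTM cell update, which can be written as below. The symbols are assumed notation chosen for this sketch ($\mathcal{X}_t$ for the external input, $\mathcal{H}_t$ and $\mathcal{C}_t$ for the hidden and cell states, $W$ and $b$ for kernel weights and biases) and may differ from those used in the paper's own equations; the variant shown omits peephole connections.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * \mathcal{X}_t + W_{hf} * \mathcal{H}_{t-1} + b_f\right) \\
\mathcal{C}_t &= f_t \circ \mathcal{C}_{t-1} + i_t \circ \tanh\left(W_{xc} * \mathcal{X}_t + W_{hc} * \mathcal{H}_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * \mathcal{X}_t + W_{ho} * \mathcal{H}_{t-1} + b_o\right) \\
\mathcal{H}_t &= o_t \circ \tanh\left(\mathcal{C}_t\right)
\end{align}
```

Here the input gate $i_t$ scales how much new information enters the cell state, the forgetting gate $f_t$ scales how much of the previous cell state is kept, and the output gate $o_t$ scales how much of the cell state is exposed through the hidden state.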
3.2.2. Attention Model
The attention block is adopted to address the spatial patterns by assigning weights to different patterns based on the extracted spatial information, as shown in Figure 2. The ODFD demand distributions present certain spatio-temporal regularities, which may be caused by latent citywide patterns. For example, the demands around business centers during weekday peak hours may be high, while those at midnight are quite low. In this study, we cluster the historical demand tensors with K-means++ to capture such demand patterns; K-means++ selects the initial cluster centers before proceeding with the standard k-means optimization iterations [37]. With the K-means++ initialization, the expected cost of the solution is O(log k)-competitive with the optimal k-means solution. The resultant K representative demand tensors (i.e., cluster centroids) are then incorporated into the attention model.
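The seeding step of K-means++ can be sketched in a few lines. This is a minimal standard-library illustration of the general technique, assuming each demand tensor has been flattened into a plain list of numbers; the helper names `sq_dist` and `kmeans_pp_init` are hypothetical and are not taken from the paper.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened tensors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers with K-means++ seeding."""
    rng = random.Random(seed)
    # The first center is drawn uniformly at random.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            break  # every point coincides with a center already chosen
        # The next center is drawn with probability proportional to d2,
        # so points far from the current centers are favored.
        centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = kmeans_pp_init(points, k=2, seed=1)
```

The standard k-means iterations (assign each tensor to its nearest center, then recompute the centers) would start from `centers`.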
The representative demand tensor for cluster
,
, shares the same data structure with
.
is then fed to the convolution layer, the structure of which is identical to that used in the encoder. Spatial features from convolved
result in a set of attention tensors, denoted as
, representing the attention information on spatial characteristics. Note that
possesses the form of a 3D tensor, as does
. The future demand trend is derived with the extracted attention information and the most recent cell and hidden states from the encoder. When predicting demand for time interval
, the subsequent step entails acquiring a collection of weight vectors, denoted as
. Specifically,
denotes the similarity between
and
and is computed using a multi-layer perceptron (MLP). Following [
38], the demand trend
is calculated through Equations (7)–(10).
where
is the neuronal activation function, and
and
are the flattened vectors of
and
. In this way, the outputs of MLP are ensured to be one-dimensional variables to enable subsequent SoftMax calculation, which results in the weight vectors
.
and
are weights set for MLP neuron processing
and
as input, respectively.
is the bias for MLP’s neurons.
is the output of the attention model’s MLP for
, where
is the hidden state of the MLP for
and
is the weight set for output. Note that
is a 3D tensor.
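The weighting step of the attention model can be illustrated with a small sketch. It assumes an additive (MLP-based) score with a tanh activation followed by a SoftMax over the K patterns, and it assumes that the demand trend is the weighted sum of the attention tensors; the paper's exact Equations (7)–(10) may differ. All names (`mlp_score`, `demand_trend`, `W_a`, `W_h`, `w_out`) are hypothetical, and tensors are represented as flattened lists.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def mlp_score(a, h, W_a, W_h, b, w_out):
    """One-hidden-layer MLP that maps (pattern a, state h) to a scalar score."""
    hidden = [math.tanh(x + y + z)
              for x, y, z in zip(matvec(W_a, a), matvec(W_h, h), b)]
    return sum(w * u for w, u in zip(w_out, hidden))


def softmax(scores):
    """Numerically stable SoftMax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def demand_trend(patterns, h, W_a, W_h, b, w_out):
    """Return the attention weights and the weighted sum of the patterns."""
    weights = softmax([mlp_score(a, h, W_a, W_h, b, w_out) for a in patterns])
    trend = [sum(w * a[i] for w, a in zip(weights, patterns))
             for i in range(len(patterns[0]))]
    return weights, trend


identity = [[1.0, 0.0], [0.0, 1.0]]
patterns = [[1.0, 0.0], [0.0, 1.0]]   # two flattened attention tensors
hidden_state = [1.0, 0.0]             # flattened encoder hidden state
weights, trend = demand_trend(patterns, hidden_state, identity, identity,
                              b=[0.0, 0.0], w_out=[1.0, 1.0])
```

Because the scores pass through SoftMax, the weights are positive and sum to one, so the trend is a convex combination of the representative patterns.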
3.2.3. Decoder Structure
The decoder block addresses the translation of the final implicit vector representation from the encoder and attention blocks into the explicit ODFD demand distribution across the city. Similar to the encoder, the decoder consists of ConvLSTM cells. The cell state and hidden state for the
cell is denoted as
and
, respectively. Initially,
=
and
=
. Similar to the encoder, the update of
and
are also controlled by three types of gates, i.e., the input gate
, forgetting gate
, and output gate
. Each layer possesses two internal inputs,
and
, and an external input. Specifically, the lowest ConvLSTM layer takes
as external input and
and
as internal inputs, which are reported in Equations (11)–(15).
where
denotes convolution kernel weights and
denotes the biases (e.g.,
is a bias of
cell’s input gate).
For the higher ConvLSTM layers,
is taken as the external input for the
layer. After all the ConvLSTM cells have completed processing, the abstracted prediction values are encapsulated by
. Note that
represents a three-dimensional tensor that includes a highly semantic representation of ODFD demand of time interval
. Because of the convolutions, this representation is not directly interpretable as demand. Therefore, it is passed through deconvolution units to recover the corresponding 3D demand tensor
. This procedure is represented as:
where
denotes the deconvolution operation, and
is the number of deconvolution layers, which is the same as the number of convolution layers in the encoder.
3.3. Model Training
As can be seen, the proposed model consists of three major components, i.e., the encoder, the attention module, and the decoder. In particular, the encoder is composed of convolutional units and ConvLSTM units that encode the input data sequence into dimensional representations. The attention module computes weights based on spatial information. The decoder leverages the attention information and decodes the encoded representations to generate future ODFD demands.
Algorithm 1 outlines the overall training process. Historical demand is transformed into grid maps
, which are the input of the model. In the training phase, each
is fed into the convolution layers of the encoder, producing a 3D output,
, which is then utilized by encoder ConvLSTM cells hierarchically to generate two 3D historical demand representations
and
. Afterwards, the attention model converts
to generate
. Then, the decoder ConvLSTM cells are initialized by
and
and produce future demand representations
based on
. Then, the deconvolution layers deconvolve the
to derive
. This process is repeated to obtain the explicit ODFD demand for the following period
. Finally, the model is trained via backpropagation with mini-batches using the Adam optimizer [39].
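Algorithm 1: Training Algorithm
Input: historical demand observations
Output: learned attention-based ConvLSTM model
Encoder: obtain the most recent cell and hidden states with the encoder ConvLSTM
Attention: compute the demand trend with Equations (7)–(10)
Decoder: obtain the future demand representation with the decoder ConvLSTM
Randomly initialize all learnable parameters W in the model
Train the model by updating the weights W to minimize the loss function (mean square error, as specified in Section 4.4) using the Adam optimizer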
In the testing phase, predicted demand is obtained based on the model’s parameter configuration, which was set up by learning historical patterns during training. The most likely demand volume estimation is obtained according to the past automatically-learned sequential patterns of food delivery demand variations over space and time.
4. Experiment and Result Analysis
This section compares the performance of At-ConvLSTM with some classical forecast models based on a real-world dataset. All experiments were run on a computer with 16 GB of RAM and an NVIDIA GeForce GTX 1660 Ti GPU. All deep learning prediction methods were implemented in TensorFlow 1.15.
4.1. Study Area and Dataset
The dataset encompasses 21 days of spatial–temporal data on ODFD orders on the Ele.me platform in Shenzhen, China, as shown in Figure 3. In total, it contains 1,048,576 delivery records. Each record contains the starting/ending time and location, as well as the number of orders that the courier served simultaneously. Orders with coordinates outside the city boundary, excessively short delivery times (e.g., <1 min), unreasonable delivery speeds, or identical sender and receiver coordinates are removed as outliers. After data filtering, 879,947 records were kept for subsequent analysis. The filtered dataset still has an average of about 42,000 records per day, which is sufficient to support the subsequent research.
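The filtering rules listed above can be sketched as a single predicate. This is a minimal illustration under stated assumptions: the record layout, the bounding box values, and the 60 km/h speed threshold are hypothetical (the paper only gives the 1 min example for the minimum delivery time), and the great-circle distance is used as a stand-in for the travelled distance.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def keep_record(rec, bbox, min_minutes=1.0, max_speed_kmh=60.0):
    """Return True if a delivery record passes all outlier filters."""
    lon_min, lat_min, lon_max, lat_max = bbox
    # Rule 1: both endpoints must lie inside the study area.
    for lon, lat in (rec["sender"], rec["receiver"]):
        if not (lon_min <= lon <= lon_max and lat_min <= lat <= lat_max):
            return False
    # Rule 2: sender and receiver coordinates must differ.
    if rec["sender"] == rec["receiver"]:
        return False
    # Rule 3: the delivery must not be implausibly short.
    minutes = rec["end_min"] - rec["start_min"]
    if minutes < min_minutes:
        return False
    # Rule 4: the implied straight-line speed must be plausible.
    speed = haversine_km(rec["sender"], rec["receiver"]) / (minutes / 60.0)
    return speed <= max_speed_kmh


bbox = (113.7, 22.4, 114.7, 22.9)  # hypothetical bounding box, not from the paper
ok = {"start_min": 0, "end_min": 20, "sender": (114.0, 22.50), "receiver": (114.0, 22.51)}
same = dict(ok, receiver=(114.0, 22.50))
short = dict(ok, end_min=0.5)
fast = dict(ok, receiver=(114.6, 22.50))
outside = dict(ok, sender=(120.0, 22.50))
```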
Like most studies on spatio-temporal data analysis, we divided the city into regular equal grids so that it is natural to adopt a convolutional neural network for the spatial–temporal prediction tasks. In particular, the whole study area is divided into 16 × 16 grids, each with a size of 5 km × 2.5 km. The size of the grid could indeed affect the prediction results. If it is set too small, there may not be enough data to represent reliable demand patterns. However, if it is set too large, the underlying correlation between grids may not be captured. As far as we know, no studies have systematically investigated how to segment the city. In the future, different granularities with semantic meanings could be explored for the demand prediction.
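Assigning a coordinate to one of the 16 × 16 grids amounts to a simple index computation. The sketch below assumes an axis-aligned bounding box for the study area; the function name `to_cell` and the toy bounding box are hypothetical and serve only to illustrate the mapping.

```python
def to_cell(lon, lat, bbox, n_rows=16, n_cols=16):
    """Map a (lon, lat) point to a (row, col) grid index, or None if outside."""
    lon_min, lat_min, lon_max, lat_max = bbox
    if not (lon_min <= lon < lon_max and lat_min <= lat < lat_max):
        return None
    # Fraction of the way across the bounding box, scaled to the grid size.
    col = int((lon - lon_min) / (lon_max - lon_min) * n_cols)
    row = int((lat - lat_min) / (lat_max - lat_min) * n_rows)
    # Guard against floating-point rounding at the upper edge.
    return min(row, n_rows - 1), min(col, n_cols - 1)


toy_bbox = (0.0, 0.0, 16.0, 16.0)  # toy coordinates: one unit per grid cell
cell = to_cell(3.2, 7.9, toy_bbox)
```

With a finer or coarser partition, only `n_rows` and `n_cols` change, which makes it easy to experiment with the grid size discussed above.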
Figure 4 shows the spatial distribution of the merchants and the customers, respectively. In comparison to the geographical map in
Figure 3, areas with high density distribution are typically characterized by specific buildings or regional functions, such as university towns, high-speed railway stations, government buildings, office areas, parks, etc. This observation reveals the existence of underlying spatial distribution patterns of ODFD demand in the city, which can be used by the attention model to enhance the accuracy of real-time demand prediction. The ODFD usage demonstrates a scattered or uniform distribution in the remaining area.
Figure 5 shows the average hourly order count statistics. There is a repetitive demand pattern on a day-to-day basis, with demand rising from 6:00 a.m., peaking between 11:00 a.m. and 12:00 p.m., and then dropping sharply. In particular, almost all demand is concentrated between 6:00 a.m. and 12:00 noon, and demand during other times is only a fraction of the peak demand. One possible reason is that people usually have limited time for meals in the morning and at noon, whereas they have much more time for dinner, which is consistent with the widely adopted work schedule. We also observe that, in a seven-day cycle, the sixth and seventh days always have lower peak demand volumes than the first five days, which matches the difference between weekdays and weekends. Although the dataset does not contain any information about the day of the week, the temporal pattern is clear enough to identify the period of the entire dataset as three successive complete weeks, starting on a Monday and ending on a Sunday. We also observe that the number of received orders is larger than the number of send-out orders during 9–10 a.m. A possible reason is that people tend to eat brunch, ordering in advance and requesting delivery during this time.
Figure 6 plots the average delivery times. As summarized in
Table 1, most deliveries are quite quick (e.g., over 60% of the deliveries took less than 20 min), and rarely exceed an hour. In addition to the long distance between merchants and customers, there are other reasons for long delivery times (more than 45 min). For instance, there are not sufficient couriers during the meal time and the selected courier may be already en route to execute other orders or serve multiple orders simultaneously. Another possible reason could be that the courier cannot serve the order by taking the fastest route due to traffic congestion.
4.2. Experiment Setup
For the train–validation–test division of the data set, the first fourteen of the twenty-one days were selected as the training set, days fifteen to eighteen as the validation set, and the last three days as the test set. The validation set was applied during training epochs to avoid over-fitting. According to
Figure 5, demand at the hourly granularity shows apparent periodicity. Therefore, the length of a time step is set to one hour in the following implementations.
As can be seen from Figure 5, the ODFD demand fluctuates periodically every day, peaking around 11:00 a.m. Based on this temporal trend, we assume that the next value depends on at most the ten most recent hourly time steps. To this end, we use 10 time steps of input data to predict 10 time steps ahead. That is,
and
were set to 10 in the training and testing sessions.
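The chronological split and the 10-in/10-out sample construction described in this subsection can be sketched as follows. The function names are hypothetical, and the assumption that samples are built with a stride of one hour within each subset is ours; the paper does not state the stride.

```python
def split_by_day(frames, hours_per_day=24, train_days=14, val_days=4):
    """Split an hourly sequence chronologically into train/validation/test."""
    train_end = train_days * hours_per_day
    val_end = train_end + val_days * hours_per_day
    return frames[:train_end], frames[train_end:val_end], frames[val_end:]


def make_windows(frames, n_in=10, n_out=10):
    """Build (input, target) pairs with a sliding window of stride one."""
    samples = []
    for start in range(len(frames) - n_in - n_out + 1):
        inputs = frames[start:start + n_in]
        targets = frames[start + n_in:start + n_in + n_out]
        samples.append((inputs, targets))
    return samples


# Stand-in for 21 days of hourly demand tensors: one integer per hour.
frames = list(range(21 * 24))
train, val, test = split_by_day(frames)
train_samples = make_windows(train)
```

Splitting before windowing keeps every training sample strictly earlier than the validation and test periods, so no future information leaks into training.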
The training dataset was then clustered using the K-means++ method, and the results are shown in Figure 7. Specifically, the distortion measures the sum of squared distances between each centroid and the tensors assigned to it, while the silhouette value measures how similar a tensor is to its own cluster compared with the other clusters. A higher silhouette value indicates a better match with its own cluster and a weaker match with neighboring clusters, and vice versa. As can be seen, the distortion generally decreases monotonically as the number of clusters increases. Meanwhile, the silhouette value also decreases monotonically. The number of clusters used by the attention mechanism should balance the distortion value and the silhouette value. It is also desirable to compress the data volume by using as few clusters as possible while ensuring the accuracy of each cluster’s characteristics. Therefore, the number of clusters K is set to 12, which balances the loss in the silhouette coefficient against the convergence of the mean distortion.
4.3. Baseline Models
At-ConvLSTM was compared against seven baseline models, and the specific details of the baselines are provided below:
ARIMA (auto-regressive integrated moving average): the prediction at time is obtained by averaging values of the input spatio-temporal series within periods of where is the window length.
SARIMA: seasonal ARIMA, which takes into account seasonality patterns for data series containing cycles.
LASSO (least absolute shrinkage and selection operator): this model employs an L1-norm regularization term as a penalty to regulate the absolute size of regression coefficients. The parameter balances empirical errors and the complexity of the linear model. In this study, is tuned from 0.5 to 6 in increments of 0.5.
XGBoost: this is an end-to-end tree-boosting system, primarily employing the gradient boosted decision tree (GBDT) algorithm [
40].
RF (random forest): this is an ensemble learning method that combines multiple decision trees to make a final prediction. The maximum number of decision trees in the forest is set to one thousand to ensure that the model is not undertrained.
ResNet (residual neural network): a convolutional neural network architecture that enables the network to learn residual mappings and ease the training of deep models. In particular, it introduces skip connections, allowing information to flow directly from one layer to another. The hyperparameters of ResNet, closeness, period, and trend, are set as 3, 1, and 1, respectively [
41].
ConvLSTM: All the elements of the model are identical to the At-ConvLSTM except for the absence of the attention model. The selection of parameters is also the same as the At-ConvLSTM parameters provided below.
ARIMA, LASSO, RF, and XGBoost belong to the classical one-dimensional sequence models for time series prediction problems [42]. They predict the send-out and received demand for each grid separately based on each grid’s historical demand data. At-ConvLSTM, ConvLSTM, and ResNet perform multi-step prediction, where each single-step prediction output is used as an input for the subsequent prediction step, enabling the model to achieve multi-step prediction through iterations. The multi-step prediction that ResNet performs is implemented by iterating the single-step prediction of a whole grid map.
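The iterative multi-step scheme, in which each single-step output is fed back as input for the next step, can be sketched generically. The helper `iterative_forecast` and the toy one-step predictor are hypothetical; in practice `step_fn` would be a trained single-step model operating on whole grid maps.

```python
def iterative_forecast(step_fn, history, n_steps):
    """Roll a single-step predictor forward n_steps by feeding back its outputs."""
    window = list(history)
    predictions = []
    for _ in range(n_steps):
        nxt = step_fn(window)
        predictions.append(nxt)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [nxt]
    return predictions


def mean_last_two(window):
    """Toy one-step predictor: average of the last two observations."""
    return (window[-2] + window[-1]) / 2


forecast = iterative_forecast(mean_last_two, [1.0, 3.0], n_steps=2)
```

Because later steps are computed from earlier predictions rather than from observations, any error made at one step propagates to all following steps.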
4.4. At-ConvLSTM Settings
Training is performed using mini-batch gradient descent (MBGD) with a batch size of 16. The model is trained for 20 epochs and validated after each epoch. The optimizer used in the network is the Adam optimizer. The initial learning rate and the keep probability are set to 0.0002 and 0.9, respectively. Mean square error (MSE) is used as the loss function. The network settings are presented in Table 1 [43].
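For readers unfamiliar with the optimizer, a single Adam update for one scalar parameter can be sketched as below. Only the learning rate of 0.0002 comes from the settings above; the values of `beta1`, `beta2`, and `eps` are the commonly used defaults and are an assumption here, as is the toy squared-error objective.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective: squared error (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for step in range(1, 101):
    w, m, v = adam_step(w, 2 * (w - 3.0), m, v, step)
```

With such a small learning rate each step moves the parameter by roughly 0.0002, which illustrates why many mini-batch updates are needed per epoch.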
4.5. Results and Discussion
The following sections will first elucidate the overall prediction accuracies of all the models, and then analyze them in terms of hourly accuracies and step-wise prediction accuracies. Finally, we discuss the region-wide prediction accuracies of At-ConvLSTM.
Table 2 presents the comparison results at the aggregate prediction level. RMSE (root mean square error) and MAE (mean absolute error) are normalized and fall between 0 and 1. The three deep-learning-based models significantly outperform the statistical and machine learning models. For example, compared to XGBoost, At-ConvLSTM reduces the MAE/RMSE by 96.9%/90.9% for send-out demand prediction and by 95.8%/90.7% for received demand prediction. This indicates that the spatial correlation between adjacent and more distant regions provides important information for spatial–temporal ODFD demand prediction. Moreover, the convolution layers and the convolution operation in the At-ConvLSTM modeling framework can characterize the spatial correlation well. Note that the result of SARIMA is very close to that of ARIMA since both use past demand values in the temporal dimension in a similar way. However, both of them perform worse than the proposed model. A possible reason is that the underlying spatial correlation is not taken into account.
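The error measures and the relative reductions quoted above can be computed as in the following sketch. It assumes that the normalization is a min-max scaling of the demand values to the range 0 to 1, which the paper does not state explicitly; the function names are hypothetical.

```python
import math


def min_max_scale(values, lo, hi):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, model_error):
    """Percentage by which the model reduces the baseline's error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


scaled = min_max_scale([0.0, 5.0, 10.0], lo=0.0, hi=10.0)
reduction = relative_reduction(2.0, 0.5)
```

A reduction of 96.9%, for instance, means that the model's error is about 3.1% of the baseline's error.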
Among the three deep-learning-based models, ResNet performs slightly worse than the other two. There are two possible reasons. One is due to the weaker capability of ResNet’s residual block for spatial–temporal feature extraction compared to that of ConvLSTM. Another possible reason is that ResNet’s multi-step prediction is achieved by iteratively performing a single-step prediction. The errors accumulate progressively, leading to worse prediction results. Furthermore, based on the comparison between At-ConvLSTM and ConvLSTM, the attention model improves the prediction accuracy. This finding confirms that the attention model effectively captures additional precise spatial–temporal feature information during the processing of spatial data, thereby enhancing the decoding capability of the decoder. We also observe that At-ConvLSTM’s prediction accuracy at certain steps is lower than that of RF and XGBoost. The reason could be that At-ConvLSTM sacrifices its prediction accuracy at certain times in order to ensure that the overall prediction loss is minimized. Overall, At-ConvLSTM, which maximizes the utilization of temporal and spatial features, demonstrates the most stable and reliable predictions.
As seen in
Table 2, the prediction accuracy for send-out demand is slightly lower than that for received demand. In the following, we further conduct analyses for send-out demand.
Figure 8 further illustrates the performance for predicting send-out demand across different time intervals within a day. The test set covers three days, and all models show almost the same accuracy pattern for each time interval on each day, so we present the average RMSE across the three days. Both At-ConvLSTM and ResNet consistently exhibit reliable predictive capabilities over the 24 h. However, At-ConvLSTM shows lower accuracy during periods of demand increase (e.g., 11:00 a.m.–12:00 p.m.) and higher accuracy during the subsequent decline. On the other hand, LASSO, XGBoost, and RF show large prediction errors before and after the peak, consistently underestimating the actual demand values. This indicates a conservative learning ability of these methods when it comes to peak demand scenarios. During non-peak periods, both XGBoost and RF exhibit higher prediction accuracy than ConvLSTM and ResNet. Moreover, At-ConvLSTM additionally recognizes 2:00, 17:00, and 20:00 as flat peaks and overestimates the corresponding demand. By adjusting the
and
values, although false flat peaks may still exist, they could be reduced or adjusted to a relatively uncritical time period (e.g., early morning) to minimize the loss.
Figure 9 shows the prediction results over grids. As observed in the figure, the prediction accuracy is relatively low in grids with large demand volumes. This is intuitive, as greater demand implies more uncertainty and is thus more difficult to predict. The prediction accuracy could be further improved by partitioning the city based on additional information (such as geographical, environmental, and event-specific knowledge) instead of dividing it into regular grids.
5. Conclusions
In this paper, a deep-learning-based encoder–decoder architecture, At-ConvLSTM, is introduced to address the short-term prediction of on-demand food delivery demand at the city scale. It employs convolutional units and ConvLSTM units to extract spatial–temporal features from the demand data. An attention model is adopted to learn the different degrees of influence of each representative citywide demand pattern for each time step. Using a real-world dataset, we compare the At-ConvLSTM model with several baseline models. Results indicate that the proposed At-ConvLSTM model has a reliable and stable prediction capability for the short-term multi-step demand forecasting problem. Furthermore, the inclusion of the attention model improves the accuracy of multi-step forecasting.
Compared to traditional statistical models, the deep-learning-based prediction model proposed in this study could uncover non-linear relationships in data that would be difficult to detect through traditional methods. Moreover, it also has the ability to handle large and complex data and has been used to achieve state-of-the-art performance on a wide range of problems. However, deep learning models can only make predictions based on the data they have been trained on. They may not be able to generalize to new situations or contexts that were not represented in the training data. Another limitation is that some deep learning models are considered “black-box” models, as it is difficult to understand how the model is making predictions and identifying the factors that influence the predictions. Such models are computationally expensive and require a large amount of data and computational resources to train, including powerful GPUs and large amounts of memory. This can be costly and time-consuming.
Several directions could be addressed in the future. First, in addition to the historical demand, more environmental information (e.g., weather conditions, POI, land use, etc.) could be incorporated into the model to further improve prediction performance. Second, a deeper analysis of the prediction results aimed at improving the quality of operational decisions could be the next step. Rather than quantifying statistical errors only, the prediction outputs could be assessed from a business perspective. For instance, it would be interesting to take the uncertainty of the prediction into account when assigning couriers to a batch of orders.