1. Introduction
Container shipping is an important form of international trade logistics, and a great deal of goods are transported across the ocean from origin to destination by container [1]. However, the spread of COVID-19 has had a profound impact on container shipping and may even reshape its future development [2]. Since the third quarter of 2020, there has been a global shortage of empty containers, and major shipping companies have been short of shipping space [3]. The advance booking period on Sino-European routes is about two weeks, and space on Sino-American routes is frequently sold out. The shortage of empty containers and capacity has led directly to a rapid rise in container service charges, and the Shanghai Containerised Freight Index (SCFI) and the Freightos Baltic Index (FBX) have risen markedly. For example, the price of shipping a container from China to Europe has risen from $2000 to $15,000, roughly 7.5 times the previous transportation cost [4]. The rapidly rising price of container transportation has placed a heavy burden on international trade, and the prices of all kinds of goods transported by container have also risen sharply. Therefore, improving the efficiency of container shipping is an important way to improve the performance of international trade and reduce trade costs. Practitioners and scholars have studied how to improve the efficiency of container shipping from many angles [5,6,7,8]. This study focuses on forecasting port container throughput, because accurate forecasts of port container throughput can provide decision support for shipping companies, port owners, freight forwarders, and other container shipping participants. With the development of machine learning, a variety of sophisticated forecasting models have been proposed based on machine learning algorithms. However, many up-to-date forecasting methods have not yet been applied to container throughput, so it is necessary to compare the performance of advanced machine learning methods and conventional methods on this task. Therefore, the research question of this study is which of the existing forecasting methods is more accurate in forecasting container throughput.
The main contributions of this study are as follows. First, the performance of nine different time series forecasting methods, including conventional methods and machine learning methods, is compared on the same time series. Second, the comparison identifies GRU as the method that produces accurate results on short time series, which provides guidance for future forecasting research. Third, it is found that machine learning algorithms do not necessarily outperform conventional methods on short time series, and that more complex models tend to produce less satisfactory forecasting results.
2. Literature Review
From the perspective of learning mechanisms, forecasting models can be divided into two categories: conventional forecasting models and machine learning forecasting models. Conventional forecasting models are those that use simple rules or methods to forecast future values, such as the naïve method (NM), moving average (MA), autoregressive (AR), and autoregressive integrated moving average (ARIMA). Machine learning forecasting models are those that employ more complex computational methods and model structures to extract underlying patterns from the data, such as multilayer perceptron (MLP), recurrent neural network (RNN), convolutional neural network (CNN), and Transformer. The summary of the literature review is presented in Table 1.
Among the conventional forecasting models, the naïve method is the simplest but most effective time series forecasting method [9]. It takes the actual value at time $t-1$ as the forecast for time $t$. In practice, many enterprises use the naïve method as the basic forecasting method to guide their operations plans. The naïve method is also used as a benchmark for evaluating the performance of other forecasting methods [10]: a designed forecasting model is considered valid only if its accuracy is higher than that of the naïve method, much as random guessing serves as a baseline in classification problems. The moving average is another method commonly used to forecast future values [11]. It uses the average of a group of recent actual values to forecast future values, such as demand and capacity. However, this method is only appropriate when demand is neither growing nor declining rapidly and there is no seasonal factor. Previous studies investigated the optimal MA length for forecasting future demand; their findings suggest that the optimal MA length is related to the frequency of structural change [12]. The autoregressive model was developed from linear regression in regression analysis and is used to deal with time series [13]. It uses the historical values of the same variable ($y_{t-p}$ to $y_{t-1}$) to forecast the current value $y_t$. Because an autoregressive model uses only the historical values of a variable, and no other variables, to forecast its future value, it is called autoregressive. Many studies have analysed and improved AR [13,14,15,16]. Furthermore, Box and Jenkins integrated the AR and MA methods and added an integration step to put forward the ARIMA time series forecasting model [17]. On this basis, ARIMAX and SARIMA were designed to handle multivariate input data and seasonal input data, respectively. Many studies use ARIMA and its derived models to forecast future values of a target and obtain acceptable forecasting accuracy [18,19,20]. These traditional methods are used by many enterprises because of their simple deployment and fast computing speed. However, they have difficulty extracting complex relationships from a large number of influencing factors, so scholars have put forward more complex and effective forecasting models based on machine learning (ML) [21].
MLP is a kind of neural network machine learning model that has attracted a great deal of attention [22]. It is a fully connected feedforward artificial neural network and has been employed as a benchmark to test the forecasting performance of other models [23,24,25]. MLP has also been improved by integrating other forecasting models [26,27,28,29]. The concept of deep learning originates from the development of the artificial neural network [30]: an MLP with multiple hidden layers can be considered a deep learning structure [31]. By combining low-level features, deep learning can form more abstract high-level attributes or features to discover distributed feature representations of data [32]. There are many architectures for deep learning, among which RNN is a common one, and many complex and well-performing deep learning architectures are based on RNN [33]. RNN handles sequentially structured data well and is often used in language-processing problems. Gated recurrent unit (GRU) and long short-term memory (LSTM) are two representative RNN architectures. For instance, Noman et al. proposed a GRU-based model to forecast the estimated time of arrival of vessels; their experimental results show that the GRU-based model produces the best forecasting accuracy compared to other methods [34]. Moreover, Chen and Huang employed an Adam-optimised GRU (Adam-GRU) to forecast port throughput and found that it produces relatively accurate forecasting results [35]. Shankar et al. built a container throughput forecasting model using LSTM; their experiments showed that LSTM can also generate accurate forecasts [36]. CNN is another commonly used deep learning architecture. It was originally used to solve computer vision problems, such as image recognition, and later some scholars applied CNN to the analysis and forecasting of sequence data. For instance, Chen et al. proposed a temporal CNN to estimate the probability density of time series [37]. Many studies have employed CNN to build time series forecasting models [38,39,40,41]. More recently, Transformer, another deep learning architecture, was proposed by Google Brain in 2017 to solve sequential data problems such as natural language processing (NLP) [42]. It feeds all input data into the model at once and uses positional encodings, attention, and self-attention mechanisms to capture patterns in the data. Based on Transformer, scholars have also put forward powerful NLP models such as GPT-3 [43], BERT [44], and T5 [45]. Later, some scholars applied Transformer to time series forecasting, because time series data and text data are both sequential [46]. Experimental results show that Transformer can produce more accurate time series forecasts than previous work, and a number of recent studies using Transformer for forecasting suggest that it performs well in time series forecasting [46,47,48,49].
However, these studies each assessed only some of these methods, and no research has investigated the performance of all of them on the same time series simultaneously. Thus, which method performs better on the same container throughput time series remains unclear. In this context, the aim of this study is to compare several existing forecasting methods on the container throughput of the same ports, so that insights for selecting an appropriate method can be suggested.
Table 1.
The summary of the literature.
| Literature | Methods | Data | Main Finding |
|---|---|---|---|
| [18] | ARIMA, ANN | Wolf's sunspot data, the Canadian lynx data, and the British pound/US dollar exchange rate data | The combined model can be an effective way to improve forecasting accuracy over either model used separately. |
| [19] | ARIMA | Spanish electricity market, Californian electricity market | The Spanish model needs 5 h to predict future prices, as opposed to the 2 h needed by the Californian model. |
| [21] | SARIMA, SVR | Aviation factors of China | The SARIMA-SVR provides the best forecasting results. |
| [24] | Particle-swarm-optimised multilayer perceptron (PSO-MLP) model | Landslides of Shicheng County in Jiangxi Province of China | The proposed PSO-MLP model addresses the drawbacks of the MLP-only model and performs better than conventional artificial neural networks (ANNs) and statistical models. |
| [25] | MLP, linear regression (LR) | COVID-19 positive cases from March to mid-August 2020 in West Java | MLP is optimal with 13 hidden layers and a learning rate and momentum of 0.1; the MLP had a smaller error than LR. |
| [26] | Random forest, MLP | Six years of electrical load data from a university campus | The hybrid forecast model performs better than other popular single forecast models. |
| [27] | MLP, whale optimisation algorithm | Real gold price | The proposed WOA-NN model improves on the forecasting accuracy of the classic NN, PSO-NN, GA-NN, GWO-NN, and ARIMA models. |
| [28] | Dynamic regional combined short-term rainfall forecasting approach (DRCF), MLP | Actual height, temperature, temperature dew point difference, wind direction, and wind speed at 500 hPa height | DRCF outperforms existing approaches in both threat score (TS) and root mean square error (RMSE). |
| [29] | Local MLP | Simulated data | A greater degree of decomposition leads to a greater reduction in forecast errors. |
| [34] | GRU | Vessels that travel on the inland waterway | GRU provides the best prediction accuracy. |
| [35] | Adam-GRU | Guangzhou Port | Adam-GRU outperformed all other methods. |
| [36] | LSTM | Port of Singapore | LSTM outperformed all other benchmark methods. |
| [37] | DeepTCN | JD-demand, JD-shipment, electricity, traffic and parts | The framework compares favourably to the state of the art in both point and probabilistic forecasting. |
| [38] | CNN | Bid and ask data | CNNs are better suited for this kind of task. |
| [39] | LSTM, CNN | Electric load dataset in the Italy-North area | The proposed model achieves better and more stable performance in short-term load forecasting (STLF). |
| [40] | CNN | Australian solar PV power data | Convolutional and multilayer perceptron neural networks performed similarly in terms of accuracy and training time, and outperformed the other models. |
| [41] | Non-pooling CNN | Simulated data, daily visits to a website | Convolutional layers tend to improve performance, while pooling layers tend to introduce too many negative effects. |
| [46] | Transformer | ILI data from the CDC | The Transformer-based approach can model observed time series data as well as the phase space of state variables through time delay embeddings. |
| [47] | Transformer with enhanced locality and reduced memory bottleneck | Electricity-f (fine), electricity-c (coarse), traffic-f (fine), traffic-c (coarse), wind | It compares favourably to the state of the art. |
| [48] | Informer | Electricity transformer temperature, electricity consuming load, weather | The experiments demonstrate the effectiveness of Informer for enhancing prediction capacity in the LSTF problem. |
| [49] | Customised Transformer neural network | Electricity consumption dataset, traffic dataset | Compared to other well-known methods, the model is up to eight times more robust in long-term estimation and about 20 percent more accurate. |
3. Materials and Methods
This study compares the performance of nine different time series forecasting methods on the same time series: conventional methods, namely the naïve method (NM), moving average (MA), autoregressive (AR), and autoregressive integrated moving average (ARIMA); and machine learning methods, namely multilayer perceptron (MLP), gated recurrent unit (GRU), long short-term memory (LSTM), convolutional neural network (CNN), and Transformer. This section explains the technical details of these nine methods, such as calculation methods, flow charts, and parameter definitions.
3.1. Conventional Approaches
Conventional forecasting approaches mainly refer to methods with a simple calculation process, few adjustable parameters, fast calculation speed, and limited ability to learn complex nonlinear relations, such as NM, MA, AR, and ARIMA. This subsection explains the technical details of these conventional approaches.
3.1.1. Naïve Method
The expression of NM is shown in Equation (1):

$\hat{y}_t = y_{t-1}$ (1)

where $\hat{y}_t$ is the forecast of the target variable at time $t$, and $y_{t-1}$ is the real value of the target variable at time $t-1$.
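As a minimal sketch (with a hypothetical toy series standing in for real throughput data), the naïve forecast and its error can be computed as follows:

```python
import numpy as np

# Hypothetical toy series standing in for annual container throughput (million TEU).
y = np.array([21.9, 23.2, 26.2, 27.9, 28.0, 25.0, 29.1, 31.7, 32.5, 33.6])

# Naive method: the forecast for time t is the actual value at time t-1.
y_hat = y[:-1]
actual = y[1:]

mape = np.mean(np.abs((actual - y_hat) / actual)) * 100
print(f"Naive-method MAPE: {mape:.2f}%")
```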
3.1.2. Moving Average
The expression of MA is shown in Equation (2):

$\hat{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_{t-i}$ (2)

where $\hat{y}_t$ is the forecast at time $t$, $y_{t-i}$ is the real observation at time $t-i$, and $n$ is the size of the moving window.
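A corresponding sketch for the moving average, again on a hypothetical series and with an illustrative window size of $n = 3$:

```python
import numpy as np

y = np.array([21.9, 23.2, 26.2, 27.9, 28.0, 25.0, 29.1, 31.7, 32.5, 33.6])
n = 3  # illustrative size of the moving window

# MA forecast for time t: the mean of the n most recent observations.
y_hat = np.array([y[t - n:t].mean() for t in range(n, len(y))])
actual = y[n:]

mape = np.mean(np.abs((actual - y_hat) / actual)) * 100
print(f"MA({n}) MAPE: {mape:.2f}%")
```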
3.1.3. Autoregressive
The expression of the autoregressive method is shown in Equation (3) [50]:

$\phi(B)\,y_t = \varepsilon_t, \quad \phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p$ (3)

where $\phi(B)$ is the autoregressive operator, $B$ is the back-shift operator, $p$ is the autoregressive order, $y_t$ is the real time series at time $t$, and $\varepsilon_t$ is Gaussian white noise with zero mean and variance $\sigma^2$.
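In practice, an AR($p$) model can be fitted with an off-the-shelf library. The sketch below uses statsmodels' AutoReg on a hypothetical 17-observation series, with $p = 2$ as a purely illustrative order:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Hypothetical series of 17 annual observations (as in this study's data).
y = np.array([21.9, 23.2, 26.2, 27.9, 28.0, 25.0, 29.1, 31.7, 32.5,
              33.6, 35.3, 36.5, 37.1, 40.2, 42.0, 43.3, 43.5])

train, test = y[:13], y[13:]               # 13 training, 4 testing observations
model = AutoReg(train, lags=2).fit()       # AR(2), illustrative order
forecast = model.predict(start=len(train), end=len(y) - 1)  # out-of-sample
print(forecast, test)
```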
3.1.4. AutoRegressive Integrated Moving Average
ARIMA consists of three parts: AR, integration (I), and MA, with corresponding parameters $p$, $d$, and $q$, respectively; the general model is written as ARIMA($p$, $d$, $q$). The expression of ARIMA is shown in Equation (4) [21]:

$\phi(B)(1 - B)^d\,y_t = \theta(B)\,\varepsilon_t$ (4)

where $B$ is the back-shift operator, $\phi(B)$ and $\theta(B)$ are the autoregressive and moving average operators of orders $p$ and $q$, $d$ is the degree of differencing, and $\varepsilon_t$ is Gaussian white noise with zero mean and variance $\sigma^2$. The expression of each parameter is shown in Table 2 [21].
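The same hypothetical series can be modelled with statsmodels' ARIMA implementation; the order (1, 1, 1) below is illustrative, not the order selected in this study:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.array([21.9, 23.2, 26.2, 27.9, 28.0, 25.0, 29.1, 31.7, 32.5,
              33.6, 35.3, 36.5, 37.1, 40.2, 42.0, 43.3, 43.5])

train, test = y[:13], y[13:]
model = ARIMA(train, order=(1, 1, 1)).fit()   # ARIMA(p=1, d=1, q=1), illustrative
forecast = model.forecast(steps=len(test))    # 4-step-ahead forecast
print(forecast, test)
```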
3.2. Machine Learning
Machine learning forecasting methods mainly refer to methods with a complex calculation process, many adjustable parameters, slow calculation speed, and strong learning ability for complex nonlinear relations. These methods, such as MLP, RNN, CNN, and Transformer, can obtain better fitting results by adjusting a large number of parameters.
3.2.1. MLP
MLP is an interconnected network composed of many simple neurons. When the input signal to a neuron exceeds its threshold, the neuron enters an excitatory state and sends information to downstream neurons, where the process repeats. The basic structure of MLP is shown in Figure 1. The input data are connected to the neurons in the input layer ($x_i$), and there is a full-connection architecture between the neurons in the input layer ($x_i$) and the neurons in the hidden layer ($h_j$). Each connection to a downstream neuron is weighted. Similarly, the neurons in the hidden layer ($h_j$) and the neurons in the output layer ($y_k$) are fully connected with weighted links [51].

First, the values in each layer are vectorised as the input vector $x$, the hidden vector $h$, and the output vector $y$. The output of the hidden layer is

$h = f(W_1 x + b_1)$

where $f$ is the activation function, $W_1$ holds the weights of the links between the input layer and the hidden layer, and $b_1$ is the vector of threshold values of the neurons in the hidden layer. The output of the output layer is

$y = f(W_2 h + b_2)$

where $W_2$ holds the weights of the links between the hidden layer and the output layer, and $b_2$ is the vector of threshold values of the neurons in the output layer.
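A minimal PyTorch sketch of such a one-hidden-layer MLP for one-step-ahead forecasting; the window size, hidden size, and random training data are illustrative assumptions, not the study's actual configuration:

```python
import torch
import torch.nn as nn

window, hidden = 4, 8                    # illustrative sizes
model = nn.Sequential(
    nn.Linear(window, hidden),           # input layer -> hidden layer (W1, b1)
    nn.Sigmoid(),                        # activation function f
    nn.Linear(hidden, 1),                # hidden layer -> output layer (W2, b2)
)

x = torch.rand(16, window)               # 16 hypothetical lagged-value windows
y = torch.rand(16, 1)                    # their one-step-ahead targets

optimiser = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
for _ in range(200):                     # gradient-descent training loop
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()
print(loss.item())
```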
3.2.2. GRU
As mentioned earlier, a GRU is an RNN structure, and the recurrent model of a common RNN is shown in Figure 2. An RNN is commonly composed of one or more units (the green rectangle A in Figure 2), and the learning model is constructed by iteratively updating the parameters in the units. The basic structure of a GRU unit is shown in Figure 3. The calculation expressions of the parameters are shown in Equations (10)–(13) [52]:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$ (10)

$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$ (11)

$\hat{h}_t = \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$ (12)

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t$ (13)

where $x_t$ is the input vector, $h_t$ is the output vector, $\hat{h}_t$ is the candidate activation vector, $z_t$ is the update gate vector, $r_t$ is the reset gate vector, $W$, $U$, and $b$ are parameter matrices and vectors, $\odot$ denotes the element-wise product, and $\sigma$ and $\phi$ are the activation functions.
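A minimal PyTorch sketch of a GRU forecaster follows; nn.GRU implements Equations (10)–(13) internally, and the sizes below are illustrative:

```python
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    """Minimal GRU model: the last hidden state feeds a linear output layer."""
    def __init__(self, hidden_size=8):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):                # x: (batch, seq_len, 1)
        _, h_n = self.gru(x)             # h_n: (1, batch, hidden_size)
        return self.out(h_n[-1])         # one-step-ahead forecast

model = GRUForecaster()
x = torch.rand(16, 4, 1)                 # 16 hypothetical windows of length 4
print(model(x).shape)                    # torch.Size([16, 1])
```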
3.2.3. LSTM
LSTM is another type of RNN with the same recurrent model as in Figure 2. Figure 4 presents the common structure of an LSTM unit. There are three types of gates in the unit: the input gate, the forget gate, and the output gate. The calculation expressions of the parameters of LSTM are shown in Equations (14)–(19) [53]:

$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$ (14)

$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$ (15)

$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$ (16)

$\tilde{c}_t = \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$ (17)

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (18)

$h_t = o_t \odot \sigma_h(c_t)$ (19)

where $x_t$ is the input vector, $f_t$ is the forget gate's activation vector, $i_t$ is the update gate's activation vector, $o_t$ is the output gate's activation vector, $h_t$ is the output vector, $\tilde{c}_t$ is the cell input activation vector, $c_t$ is the cell state vector, $W$, $U$, and $b$ are parameter matrices and vectors, and $\sigma_g$, $\sigma_c$, and $\sigma_h$ are activation functions.
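The corresponding LSTM sketch differs from the GRU one only in the recurrent unit; nn.LSTM implements Equations (14)–(19) internally and additionally returns the cell state:

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Minimal LSTM model: the last hidden state feeds a linear output layer."""
    def __init__(self, hidden_size=8):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, seq_len, 1)
        _, (h_n, _c_n) = self.lstm(x)     # h_n: (1, batch, hidden_size)
        return self.out(h_n[-1])

model = LSTMForecaster()
x = torch.rand(16, 4, 1)
print(model(x).shape)                     # torch.Size([16, 1])
```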
3.2.4. CNN
A CNN is constructed from an input layer, convolution layers, pooling layers, fully connected layers, and an output layer. The input data are first convolved with a convolution kernel to form a convolution layer. The pooling layer then applies a pooling method, such as max pooling or average pooling, to effectively reduce the size of the parameter matrix, thereby reducing the number of parameters in the fully connected layer; adding the pooling layer therefore speeds up computation and helps prevent overfitting. After pooling, the data are fed into the fully connected layer, which can be treated as a traditional multilayer perceptron whose input is the features extracted by the convolution and pooling layers. The final output layer can use logistic regression, softmax regression, or even a support vector machine to generate the output. The network adopts gradient descent to minimise the loss function, adjusting the weight parameters layer by layer in reverse, and improves its accuracy through repeated iterative training.

CNN was originally designed to deal with computer vision problems, where the default input is an RGB image. This type of CNN is called 3D-CNN, because an RGB image can be separated into three sub-images in the RGB colour channels. If the input data is a time series, the CNN is called a 1D-CNN. The basic structure of the 1D-CNN is shown in Figure 5 [54].
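A minimal 1D-CNN sketch with one convolution layer, one pooling layer, and one fully connected layer; the kernel size, channel count, and window length are illustrative:

```python
import torch
import torch.nn as nn

window = 8                                   # illustrative input window length
model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3),  # convolution layer
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),                              # pooling layer
    nn.Flatten(),                                             # into the fully connected layer
    nn.Linear(4 * 3, 1),                     # (8 - 3 + 1) // 2 = 3 time steps remain
)

x = torch.rand(16, 1, window)                # 16 hypothetical windows, 1 channel
print(model(x).shape)                        # torch.Size([16, 1])
```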
3.2.5. Transformer
Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output without using recurrence or convolution. Self-attention is sometimes called intra-attention. When a dataset is fed into the Transformer, the data first pass through the encoder module to be encoded, and the encoded data are then sent to the decoder module for decoding; after decoding, the processed result is obtained. The basic structure of Transformer is shown in Figure 6 [46]. The encoder input is fed into the input layer of the encoder, and positional encoding is used to inject information about the relative or absolute position of the tokens in the sequence [42]. Encoder layer 1 and encoder layer 2 are then used to encode the data; the number of encoder layers in the encoder can be defined by the user. After the encoding process, the encoder output is fed into decoder layer 1 in the decoder. At the same time, the decoder input is fed into the input layer of the decoder, whose output is also fed into decoder layer 1. After processing by decoder layer 2 and a linear mapping, the final output is obtained. Similarly, the number of decoder layers in the decoder can also be defined by the user.
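A minimal encoder–decoder sketch using PyTorch's nn.Transformer; positional encoding is omitted for brevity, and d_model, nhead, and the two-layer depth are illustrative choices mirroring Figure 6:

```python
import torch
import torch.nn as nn

d_model = 16
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
embed = nn.Linear(1, d_model)       # input layers (value embedding)
out = nn.Linear(d_model, 1)         # final linear mapping

src = torch.rand(16, 8, 1)          # encoder input: 16 windows of length 8
tgt = torch.rand(16, 4, 1)          # decoder input: a 4-step horizon
y_hat = out(model(embed(src), embed(tgt)))
print(y_hat.shape)                  # torch.Size([16, 4, 1])
```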
3.3. Process of Comparison
The comparison process is shown in Figure 7. The first step is to feed the top 20 container ports' throughput into the methods to be compared. The forecasting results are then analysed from the intra-method and inter-method perspectives, respectively. As an example, the pseudocode of the learning and forecasting processes of MLP is presented in Algorithm 1. In line 1, a range of hidden-layer sizes for the MLP is predefined. Then a variable named results, initialised to an empty value, is predefined to hold the results generated by MLP models with different hidden-layer sizes. From line 3 to line 15, two for-loops produce the forecasting results. More details about the search range of each method can be found in Table 3. The source code of each forecasting method and the comparative plots can be found at: https://github.com/tdjuly?tab=repositories (accessed on 20 September 2022).
Algorithm 1 The learning and forecasting processes of MLP.

1: hz_list ← predefined range of hidden-layer sizes
2: results ← ∅ ▹ to hold the model output
3: for hz in hz_list do ▹ hz is the size of the hidden layer of MLP
4:  set model parameters ▹ epoch number, hz, learning rate, optimiser, etc.
5:  for port in port_list do ▹ top 20 container ports' throughput
6:   data processing ▹ train/test partition, min-max normalisation, etc.
7:   define training model
8:   training
9:   load fitted model
10:   testing
11:   calculate assessment criteria ▹ test_MAPE, test_RMSE, etc.
12:   append criteria to results
13:  end for
14:  save results
15: end for
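A runnable Python sketch of the same grid search, using scikit-learn's MLPRegressor and randomly generated series as stand-ins for the real port data; all names and sizes are illustrative and do not reproduce the repository's actual code:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical stand-ins for the ports' 17 annual observations.
rng = np.random.default_rng(0)
ports = {f"Port{i}": np.linspace(20, 36, 17) + rng.random(17) for i in range(3)}

window, results = 3, []
for hz in [2, 4, 8, 16]:                           # candidate hidden-layer sizes
    for name, y in ports.items():
        # Lagged-window pairs; the last 4 targets (2017-2020) form the test set.
        X = np.array([y[i:i + window] for i in range(len(y) - window)])
        t = y[window:]
        X_tr, X_te, t_tr, t_te = X[:-4], X[-4:], t[:-4], t[-4:]
        mlp = MLPRegressor(hidden_layer_sizes=(hz,), max_iter=2000).fit(X_tr, t_tr)
        results.append((name, hz, mean_absolute_percentage_error(t_te, mlp.predict(X_te))))
print(min(results, key=lambda r: r[2]))            # best (port, hz, MAPE)
```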
3.4. Data Description
In this study, annual container throughput from 2004 to 2020 was obtained from the official websites of the world's top 20 container ports. For each port, there are 17 observations. The statistical description of the data is shown in Table 4, and the time plots of the container throughput of the world's top 20 container ports are shown in Figure 8. It can be seen from the figure that the annual container throughput of most ports, such as Antwerp, Guangzhou, Qingdao, Ningbo, and Busan, shows a trend of gradual increase. However, some, such as Hong Kong, show a downward trend, and others, such as Dalian and Dubai, first increase and then decrease.
Before the experiment, the obtained data should be divided into a training set and a testing set. The training set is used to tune the parameters of the model so that its forecasts come closer to the real values; the testing set is used to assess the accuracy of the trained model on new data. According to Al-Musaylh et al. (2018), 80/20 is a common ratio of training to testing sets [55]. Therefore, the training set includes 13 observations (about 76%) from 2004 to 2016, and the testing set includes four observations (about 24%) from 2017 to 2020.
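A short sketch of this partition and of the min-max normalisation mentioned in Algorithm 1 (the series is hypothetical, and the normalisation statistics are taken from the training set only to avoid leakage):

```python
import numpy as np

# Hypothetical 17-observation annual series (2004-2020).
y = np.linspace(20.0, 36.0, 17)

train, test = y[:13], y[13:]            # 2004-2016 for training, 2017-2020 for testing

# Min-max normalisation fitted on the training set only.
y_min, y_max = train.min(), train.max()
train_norm = (train - y_min) / (y_max - y_min)
test_norm = (test - y_min) / (y_max - y_min)
print(train_norm.round(3), test_norm.round(3))
```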
5. Conclusions
This research compares nine forecasting methods on container throughput time series, four of which are traditional regression-based methods and five of which are machine learning-based methods. The main finding is that GRU is the method most likely to produce accurate results when constructing container throughput forecasting models. Another finding is that NM can be used for rapid, simple container throughput estimation when computing equipment and services are not available. The study also confirms that machine learning methods remain a better choice than some traditional methods. An important conclusion drawn from the analysis of the experimental results is that machine learning methods are useful for training forecasting models, but the characteristics of the data can affect their performance; machine learning methods are therefore not necessarily better than traditional forecasting methods, and one should be cautious about using them to build forecasting models. This study compares the performance of different methods on multiple time series characterised by a short observation period and a small number of observations, so its conclusions apply to time series with the same temporal characteristics.
Although this study explores the performance of nine different methods in forecasting the throughput of the world's top 20 container ports, it still has limitations. As hubs of world trade, ports see their throughput determined not only by the port city but also by the operating conditions of ports around the world and the development of the world trade market, whereas this study uses only historical port throughput data as its data source. Therefore, a future research direction is to add influencing factors such as the development of port facilities, economic data of port cities, and transportation links between ports into the forecasting model, and to analyse their impact on port container throughput.