1. Introduction
Global monetary policies developed in response to the COVID-19 crisis and the 2022 Russia–Ukraine war have resulted in high crude oil prices, causing economic inflation and a bear market rally in 2022.
The relationship between crude oil prices and the stock market is a major research topic in economics and finance. In particular, forecasting stock returns and analyzing volatilities using oil prices has been studied in [
1,
2]. Since oil-sensitive stocks have strong forecasting power on crude oil prices [
3], we want to examine how oil prices can be affected by the most influential stock prices in this study. Some interesting statistical methods to predict oil prices have been proposed, such as investor attention (constructed by the Google search volume index [
4]), the LASSO machine learning method [
5], and copula dependence structures between oil prices, exchange rates, and interest rates [
6]. In this study, we want to employ deep learning, Gaussian process modeling, and vine copula regression methods to predict the oil prices with the most influential stock prices.
Deep learning has been utilized to forecast stock prices in [
7], and the comparison of stock-price prediction models using pre-trained neural networks has been performed in [
8]. The LSTM and ARIMAX algorithms were employed in [
9] to analyze the impact of sentiment analysis in stock market prediction. Gaussian process regression methods and extensions for stock market prediction have been studied in [
10]. Stock prediction using Gaussian process regression has been studied in [
11]. The large amount of training data that deep learning requires [11] and the need to choose hyperparameters can make the method difficult to use. A Gaussian process model assumes that, for fixed hyperparameters, the response is normally distributed. We therefore propose an alternative forecasting model (vine copula regression) to predict oil-price returns using US stock returns. The copula method does not require assumptions such as normality, linearity, and independence of errors. Additionally, a vine copula can capture a flexible multivariate dependence structure. To assess whether our proposed method outperforms the deep learning and Gaussian process models, we apply the same accuracy measures to the deep learning, Gaussian process, and vine copula regression models. This study also examines whether there are firms that are highly influential to oil prices. To do this, we use Bayesian variable selection and nonlinear principal component analysis for forecasting crude oil prices (Brent and WTI).
This paper is organized as follows. Section 2 presents the data description and summary. Section 3 gives an overview of the statistical models for forecasting. Section 4 presents the comparison study of the proposed methods in terms of the measures of errors. Section 5 presents the discussion.
2. Summary Statistics
The sample contains the monthly log returns of Brent Crude, West Texas Intermediate (WTI), and 74 major S&P 500 stock prices from February 2003 to October 2019. (Variable names for our sample data can be found in
Appendix A.) The reason for choosing monthly log returns over daily log returns in this paper is our attempt to eliminate the noise from small economic factors, such as political news. The 74 stocks were selected based on the size of their market capitalization.
Figure 1 plots the log returns for the 2 oil prices and 74 stock prices, along with a functional mean equation line of sample log returns. We observed a co-movement among oil and stock returns in
Figure 1. This is already a well-known phenomenon. Our sample was collected from the Yahoo Finance website (
https://finance.yahoo.com/) (accessed on 7 November 2020). We converted prices to log returns throughout our analyses.
Table 1 displays the summary statistics for the oil and stock price monthly log returns. We observed that Brent and WTI have similar distributional properties: the log returns of Brent and WTI prices are positively skewed with fat tails, while the average log returns of the Brent and WTI prices are close to zero.
We expected a prominent relationship between crude oil prices and major S&P 500 stock prices. The monthly log-return series begins on 1 February 2003 because the first log return is computed as a difference from the base price of 3 January 2003.
Let $P_t$ be a price time series at time $t$. The log return series is $r_t = \log(P_t) - \log(P_{t-1})$. Each of the datasets was given a new variable known as “log returns”. We summarized the descriptive statistics for the BRENT and WTI log return data, such as mean, skewness, and kurtosis, as well as 5 summary statistics in Table 1.
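As a minimal sketch of the log-return transformation described above (the difference of consecutive log prices), the following Python snippet converts a price series into log returns. The function name `log_returns` and the example prices are hypothetical and chosen only for illustration; they are not taken from the sample.

```python
import math


def log_returns(prices):
    """Return the differences of consecutive log prices.

    For prices P_0, ..., P_n the result has n entries,
    log(P_t) - log(P_{t-1}) for t = 1, ..., n.
    """
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive")
    return [math.log(curr) - math.log(prev)
            for prev, curr in zip(prices, prices[1:])]


# Hypothetical month-end prices (illustrative values only).
example_prices = [30.0, 33.0, 31.5, 31.5]
example_returns = log_returns(example_prices)
```

Because one observation is consumed as the base price, a series of n + 1 prices yields n log returns, and an unchanged price gives a log return of zero.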
In Table 1, it can be observed that the standard deviations of the log returns of BRENT and WTI are about the same. The skewness values of the log returns of BRENT and WTI in this period are positive, meaning that the return distributions have a longer right tail than left tail over this period. In addition, the kurtosis values of the log returns of BRENT and WTI are greater than 3, meaning that they have heavy tails compared to a normal distribution.
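The skewness and kurtosis statistics discussed above can be computed from sample moments. The sketch below assumes the simple moment-based (population) estimators, for which a normal distribution has kurtosis 3; the estimator actually used for Table 1 is not stated in the text, and the function name and toy data are hypothetical.

```python
def skewness_kurtosis(xs):
    """Moment-based skewness and (non-excess) kurtosis of a sample.

    Uses central moments m_k = mean((x - mean)^k):
    skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Toy data (not from the sample): one large positive value creates a right tail.
right_tailed = [0.0, 0.0, 0.0, 0.0, 10.0]
skew, kurt = skewness_kurtosis(right_tailed)

# A symmetric toy sample has zero skewness.
sym_skew, sym_kurt = skewness_kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0])
```

A positive skewness value reflects a longer right tail in the observed returns, and a kurtosis above 3 reflects tails heavier than those of a normal distribution.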
Figure 1 shows the time trend of the Brent and WTI monthly log returns over the given period. If we look at the 2008 economic crisis period, as shown in
Figure 1, the log returns of Brent and WTI were very high in the first half of 2008, and then they suddenly dropped to very low values in the second half of 2008. In December 2018, the S&P 500 fell more than 9% as investors feared the tightening of monetary policy, a slowing economy, and an intensifying trade war between the U.S. and China, and the log returns of Brent and WTI were also low.
4. Data Analysis
First, we want to visualize the relationship between the crude oil and stock data. Functional data analysis is a popular big data dimension-reduction method for time-course data. Functional principal component analysis (FPCA) is an effective clustering visualization analysis for time-course data. It provides a much more informative way of examining the sample covariance structure than does PCA, and it is an effective statistical method for explaining the variance of components because of the use of nonlinear eigenfunctions. The PCA only shows the clustering pattern of the whole data at a certain year or certain time, but FPCA is the more suitable method for showing the clustering pattern of the time series oil data over the given period.
Figure 2 plots the log returns of the 2 oil prices and 74 stock prices, along with a functional mean equation line of sample log returns. We observed a co-movement among oil and stock returns, as shown in
Figure 2. We then investigated the relationship in predicting oil price returns using stock returns.
We also performed a functional principal component analysis (FPCA) by using the fda R package to determine factors (i.e., principal components) that explain the relationship between crude oil and stock prices.
Figure 3 shows the proportion of the total variation in individual stock and oil price returns explained by each principal component. The first principal component accounts for 24.6%, the second for 18.5%, and the third for 13.4% of the total variance of the FPCA. Together, the first 3 principal components account for 56.5% of the total variability.
Through visualizations, we illustrated the relationship among the 2 major crude oil prices and the major S&P 500 stock prices. From the 2D FPCA plot in
Figure 4, we were able to classify the 2 crude oil and 74 major S&P 500 stock prices into 4 groups.
The 2D FPCA plot captures a limited view of the clusters among major stock price and oil price returns. For a detailed visualization of the relationship between the 2 crude oil and 74 stock prices, we have provided a 3D FPCA plot with the first 3 main harmonics (principal components) in
Figure 5. From
Figure 5, we can observe that most of the major stock returns are clustered together, implying a co-movement of those return series in our sample period.
The observations presented in
Figure 4 and
Figure 5 motivated our research to investigate whether there is a more prominent relationship between crude oil prices and major S&P 500 stock prices.
We also selected the most influential stocks in relation to crude oil price returns using the objective Bayesian variable selection method proposed by [26]. Our empirical analysis was restricted to the stock prices of 74 firms that are considered major stocks in the S&P 500. We performed Bayesian variable selection for the BRENT and WTI oil price returns separately. Five covariates were selected for each of the Brent and WTI oil price returns based on inclusion probabilities. The selection considered both the highest posterior probability model and the median probability model. CB, HD, HON, LIN, and PG returns were critical factors in determining Brent oil price returns. Similarly, HD, HON, LIN, PG, and UNP returns were critical factors in determining WTI oil price returns. The inclusion probabilities of all of the covariates for the Brent and WTI oil price returns are displayed in Table 2 and Table 3.
Following ref. [28] (see Section 3), we extracted the first five principal components using kernel PCA to examine their power in forecasting oil price returns. The same kernel function was used in training and in prediction; it can be any function of class kernel, which computes a dot product between two vector arguments in the feature space. The corresponding component eigenvalues were 0.034684495, 0.009930120, 0.007794364, 0.006680240, and 0.005580485. The eigenvalues are used to find the proportion of the total variance explained by each component.
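To illustrate how eigenvalues translate into variance proportions, the sketch below normalizes the five reported eigenvalues by their sum. This is an assumption-laden illustration: the true proportion of total variance requires the sum of all eigenvalues, which is not reported here, so the shares below are relative to the five retained components only. The function names are hypothetical.

```python
# Eigenvalues of the first five kernel principal components, as reported in the text.
eigenvalues = [0.034684495, 0.009930120, 0.007794364, 0.006680240, 0.005580485]


def variance_shares(eigs):
    """Share of each eigenvalue in the sum of the supplied eigenvalues."""
    total = sum(eigs)
    return [e / total for e in eigs]


def cumulative(values):
    """Running totals of a list of values."""
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running)
    return out


shares = variance_shares(eigenvalues)
cumulative_shares = cumulative(shares)

for i, (s, c) in enumerate(zip(shares, cumulative_shares), start=1):
    print(f"PC{i}: share = {s:.3f}, cumulative = {c:.3f}")
```

Among the five retained components, the first one carries a little more than half of the retained variance, which is consistent with its eigenvalue being several times larger than the others.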
We also performed Gaussian process (GP) modeling to forecast oil prices at time t+1 using S&P 500 stock prices at time t. To perform the analysis, we split our sample into training and test sets: February 2001–March 2017 was the training set for the stock data, and March 2001–April 2017 was the training set for the oil data. The test data for the stocks ran from April 2017 to September 2019, and the test data for the Brent and WTI prices ran from May 2017 to October 2019. First, we considered all 74 major S&P 500 stock prices in our model and performed the forecasting for BRENT and WTI separately. Next, we used the Bayesian variable selection method to select the five most influential stocks in relation to the Brent and WTI oil price returns, and we performed oil price forecasting using the Gaussian process model with these 5 covariates, as discussed in
Section 3 and as seen in
Table 2 and
Table 3.
Table 4 and Table 5 show the GP for Brent with covariates (CB, HD, HON, LIN, PG) and the GP for WTI with covariates (HD, HON, LIN, PG, UNP), respectively.
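The one-month-ahead design described above, in which stock returns at time t predict oil returns at time t+1 and the sample is split chronologically, can be sketched as follows. The helper names `lagged_pairs` and `chronological_split` and the toy numbers are hypothetical, and the sketch only illustrates the alignment of the series, not the GP model itself.

```python
def lagged_pairs(stock_rows, oil_returns):
    """Pair stock returns at time t with the oil return at time t + 1.

    stock_rows[t] is the list of stock returns in month t and
    oil_returns[t] is the oil return in the same month.
    """
    if len(stock_rows) != len(oil_returns):
        raise ValueError("both series must cover the same months")
    predictors = stock_rows[:-1]   # months 0, ..., T-2
    responses = oil_returns[1:]    # months 1, ..., T-1
    return predictors, responses


def chronological_split(predictors, responses, n_train):
    """Use the first n_train pairs for training and the rest for testing."""
    train = (predictors[:n_train], responses[:n_train])
    test = (predictors[n_train:], responses[n_train:])
    return train, test


# Toy example: 5 months, 2 stocks (illustrative values only).
stock_rows = [[0.01, 0.02], [0.03, -0.01], [0.00, 0.04], [-0.02, 0.01], [0.05, 0.00]]
oil_returns = [0.10, -0.05, 0.02, 0.07, -0.03]

X, y = lagged_pairs(stock_rows, oil_returns)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y, n_train=3)
```

Shifting the response by one month is why the oil training and test windows start and end one month later than the corresponding stock windows. A chronological split, rather than a random one, keeps future observations out of the training set.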
We also compared the forecasting accuracy of the GP, deep learning, and vine copula methods by employing two measures for predictive accuracy.
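We denote the predicted values by $\hat{y}_t$ and the actual values by $y_t$, for $t = 1, 2, \ldots, n$, where $n$ is the total number of observations in the test dataset.
Root-mean-square (prediction) error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} \ell_2(y_t, \hat{y}_t)},$$
where we define the quadratic loss function to be
$$\ell_2(y_t, \hat{y}_t) = (y_t - \hat{y}_t)^2.$$
Mean absolute error deviation (MAD):
$$\mathrm{MAD} = \frac{1}{n}\sum_{t=1}^{n} \ell_1(y_t, \hat{y}_t),$$
where we define the L1 loss function to be
$$\ell_1(y_t, \hat{y}_t) = |y_t - \hat{y}_t|.$$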
The error metrics MAD and RMSE were used to analyze the performance of the methods. The mean absolute error is less sensitive to outliers than the RMSE because it weights each error in proportion to its magnitude, whereas the RMSE squares the errors and therefore gives large errors more weight. The root-mean-square error takes both bias and variance into account and is expressed in the same units as the response. Each method also produces plots based on the actual and predicted price returns for visualization purposes.
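The two accuracy measures can be sketched in a few lines of Python. The function names `rmse` and `mad` and the toy forecasts are hypothetical; the example is constructed so that both forecasts have the same MAD but different RMSE, which shows how a single large error affects the two measures differently.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error: square root of the mean quadratic loss."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mad(actual, predicted):
    """Mean absolute deviation: mean of the L1 loss."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


# Toy values (not from the sample).
actual = [0.0, 0.0, 0.0, 0.0]
even_errors = [1.0, -1.0, 1.0, -1.0]   # four errors of size 1
one_outlier = [0.0, 0.0, 0.0, 4.0]     # a single error of size 4

rmse_even, mad_even = rmse(actual, even_errors), mad(actual, even_errors)
rmse_out, mad_out = rmse(actual, one_outlier), mad(actual, one_outlier)
```

In this toy case the MAD is identical for the two forecasts, while the RMSE doubles for the forecast with the single large error, because squaring gives large errors more weight.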
Table 6 shows the RMSE and MAD for forecasting BRENT and WTI log return prices. From
Table 6, we can see that the RMSE and MAD decrease when we use the 5 selected covariates as compared to when we include all 74 covariates. We can also observe that forecasting Brent log return prices with major stock data using vine copula regression with NLPCA is superior to other methods, and that forecasting WTI log return prices with major stock data using Gaussian process and vine copula regression with NLPCA is superior to other methods in terms of the RMSE and MAD.
Table 7 shows 95% confidence intervals for the loss functions (LOSS1 and LOSS2) of the BRENT and WTI log return prices. From
Table 7, we can observe that the width and center of the 95% confidence intervals for the loss functions (LOSS1 and LOSS2) of the BRENT and WTI log return prices using vine copula regression with NLPCA are smaller than those of the other methods. This supports the conclusion that forecasting BRENT and WTI log return prices with major stock data using vine copula regression with NLPCA is superior to the other methods.
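The text does not state how the 95% confidence intervals for the loss functions were constructed. As one possible sketch, assuming a normal approximation for the mean of the per-observation losses, an interval can be computed as the sample mean plus or minus 1.96 standard errors. The function name `mean_loss_ci` and the toy losses are hypothetical, and this is not necessarily the construction used for Table 7.

```python
import math
import statistics


def mean_loss_ci(losses, z=1.96):
    """Normal-approximation confidence interval for the mean loss.

    Returns (lower, upper) = mean -/+ z * s / sqrt(n), where s is the
    sample standard deviation (n - 1 in the denominator).
    """
    n = len(losses)
    if n < 2:
        raise ValueError("at least two losses are needed")
    center = statistics.fmean(losses)
    half_width = z * statistics.stdev(losses) / math.sqrt(n)
    return center - half_width, center + half_width


# Toy per-observation losses (illustrative values only).
toy_losses = [1.0, 2.0, 3.0, 4.0, 5.0]
lower, upper = mean_loss_ci(toy_losses)
```

Under this construction, a smaller center indicates a lower average loss and a narrower interval indicates less variable losses. With a small test set, a t-based quantile in place of 1.96 would give a somewhat wider interval.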
5. Discussion
To our knowledge, this study is the first to investigate the predictability of oil prices from S&P 500 stock prices using vine copula regression. Bayesian variable selection (BVS) suggests that five stocks have the largest impact on each oil price. The selected companies are related to the energy/chemical industry or to the large retail industry. The important, selected companies for the Brent log returns are Chubb Limited (CB) (a Switzerland-based holding insurance company), Home Depot (HD) (a home improvement retailer), Honeywell International Inc. (HON) (a software–industrial company that operates through four segments: Aerospace, Honeywell Building Technologies, Performance Materials and Technologies, and Safety and Productivity Solutions), Linde plc (LIN) (an industrial gas company), and the Procter & Gamble Company (PG) (focused on providing branded consumer packaged goods to consumers across the world). The important, selected companies for the WTI log returns are HD, HON, LIN, PG, and Union Pacific Corporation (UNP), which is a railroad operating company in the United States. We also found that vine copula regression with NLPCA is superior to the other methods for forecasting Brent log return prices with major stock data in terms of the RMSE and MAD, and that both GP and vine copula regression with NLPCA are superior to the other methods for forecasting WTI log return prices in terms of the RMSE and MAD. In conclusion, the stock prices of both the energy/chemical industry and the large retail industry are effective in forecasting oil prices. This study contributes to forecasting oil prices by using a vine copula regression with selected stock prices. In future research, we will consider the prices of commodities, such as gold, silver, copper, platinum, natural gas, wheat, corn, and soybeans, in relation to important financial indices, including the Consumer Price Index, inflation, and cryptocurrency prices, using the GP, deep learning, and vine copula methods.