1. Introduction
In this paper we consider statistical arbitrage strategies. Such strategies presume that patterns observed in historical data will be repeated in the future. That is, statistical arbitrage is a purely descriptive, data-driven approach designed to exploit market inefficiencies.
Khandani and Lo [1] consider a specific strategy—first proposed by Lehmann [2] and Lo and MacKinlay [3]—that can be analyzed directly using individual equity returns. Given a collection of securities, they consider a long/short market-neutral equity strategy consisting of an equal dollar amount of long and short positions, where at each rebalancing interval the long positions are made up of “losers” (underperforming stocks, relative to some market average) and the short positions are made up of “winners” (outperforming stocks, relative to the same market average). By buying yesterday’s losers and selling yesterday’s winners at each date, such a strategy actively bets on mean reversion across all stocks, profiting from reversals that occur within the rebalancing interval. For this reason, such strategies have been called “contrarian” trading strategies that benefit from market overreaction, i.e., when underperformance is followed by positive returns and vice versa for outperformance. The same key idea underlies pairs trading strategies, which constitute another form of statistical arbitrage.
The idea of pairs trading relies on a long-term equilibrium between a pair of stocks. If such an equilibrium exists, then it is presumed that a specific linear combination of the prices reverts to zero. A trading rule can be set up to exploit the temporary deviations (the spread) from this equilibrium to generate profit. When the spread between the two assets is positive, it is sold; that is, the outperforming stock is shorted and a long position is opened in the underperforming stock. In the opposite case, when the spread is negative, the spread is bought. Gatev et al. [4] investigate the performance of this arbitrage rule over a period of 40 years and find strong empirical evidence in favor of it. It is fundamental for the pairs trading strategy to precisely estimate the current and expected spread between the stock prices.
In this paper we interpret the spread as the temporary deviation from the equilibrium in a cointegration model. Equilibrium in a cointegration model is interpreted as time series behavior characterized by stable (stationary) long-run relations to which the actual series return after temporary deviations. This approach differs from Gatev et al. [4], who implement a nonparametric framework. These authors choose a matching partner for each stock by finding the security that minimizes the sum of squared deviations between the two normalized price series; pairs are thus formed by exhaustive matching between normalized daily prices, where the price includes reinvested dividends. However, as argued above, in the cointegration analysis that we perform, the spread between two assets is modeled as the temporary deviation from the long-run stable relations among the time series of asset prices. This deviation is computed as a linear combination of stock prices, where the weights in the linear combination are given by the cointegrating vector. Long-run stability also implies that there is finite uncertainty in the predictability of stock prices, which can be used in devising trading strategies. Therefore, pairs trading strategies depend strongly on the stability of ratios of pairs of stocks.
The estimated and predicted spreads are both computed from the estimated cointegration model. We introduce a simulation-based Bayesian estimation procedure that allows us to combine estimation and model uncertainty in a natural way with the decision uncertainty associated with a decision process such as a trading strategy. For the Bayesian estimation of the cointegration model, we work with a Metropolis-Hastings (M-H) type of sampler derived under an encompassing prior, where we show that the encompassing prior is, under certain conditions, equivalent to the well-known Jeffreys’ or information matrix prior. This sampling algorithm was derived by Kleibergen and Van Dijk [5] for the Simultaneous Equations Model and extended by Kleibergen and Paap [6] to the cointegration model. The latter authors specify a linear normalization to identify the parameters in the model. However, Strachan and Van Dijk [7] point out possible distortions of prior beliefs associated with the linear normalization. Moreover, in our application we find that the distribution of the spread is particularly sensitive to the choice of normalization.
Therefore we make use of an alternative normalization, the orthogonal normalization, in order to identify the parameters in the cointegration model. Given that one is usually only interested in a linear combination of price series, this normalization is a natural one, since it treats the variables in the series in a symmetric way. More details are given in Section 3. Hence, we implement the M-H sampler for the cointegration model under this normalization. We compare the performance of the pairs trading strategy under the orthogonal normalization with the performance of its counterpart under the linear normalization and find that, for our set of data, the orthogonal normalization is highly favored over the linear normalization with respect to the profitability and risk of the trading strategies.
The results imply that, within the statistical arbitrage approach of pairs trading based on the cointegration model, the normalization is not merely a useful device for easing parameter identification; it primarily becomes an important part of the model.
To take into account the non-normality of the conditional distribution of daily returns, we extend our approach from the normal distribution to the Student-t distribution.
The outline of the paper is as follows. In Section 2 the conditional and implicit statistical arbitrage approaches are discussed. In Section 3 our Bayesian analysis of the cointegration model under the encompassing prior is explained. In Section 4 we consider an empirical application using stocks in the Dow Jones Composite Average index. Section 5 concludes. The appendices contain technical derivations and additional tables with detailed results from our empirical application.
2. Pairs Trading: Implicit and Conditional Statistical Arbitrage
Suppose that there exists a statistical fair price relationship [8] between the prices y₁,ₜ and y₂,ₜ of two stocks, where the spread

zₜ = β₁ y₁,ₜ + β₂ y₂,ₜ,   (1)

for certain coefficients β₁ and β₂, is the deviation from this statistical fair price relationship, or “statistical mispricing”, at the end of day t. In this paper we consider two types of trading strategies that are based upon the existence of such a long-run equilibrium relationship: conditional statistical arbitrage (CSA) and implicit statistical arbitrage (ISA), where we use the classification of Burgess [8]. We will implement these strategies in such a way that at the end of each day the holding is updated, after which the holding is kept constant for a day. In the CSA strategy the desired holding hₜ at the end of day t is given by

hₜ = sgn( E[zₜ₊₁ − zₜ | Iₜ] ) · | E[zₜ₊₁ − zₜ | Iₜ] |^k,   (2)

where Iₜ is the information set at the end of day t, and where we consider k = 1 and k = 0. A positive value of hₜ means that we buy hₜ spreads and a negative value of hₜ means that we short |hₜ| spreads. That is, if β₁ > 0 and β₂ < 0, then a positive value of hₜ means that we buy hₜβ₁ units of stock 1 and short hₜ|β₂| units of stock 2. For k = 1 the obvious intuition of the CSA strategy is that we want to invest more in periods with larger expected profits; in this way we consider the accuracy of the used method. In the case of k = 0 we only look at the sign of the expected change in the next day; in this way we consider the directional accuracy of the used method. Note that the expectation in (2) is taken over the distribution of zₜ₊₁ − zₜ (given the information set Iₜ and the “fixed” values of the model parameters). In the sequel of this paper, we use the posterior median to obtain estimates of model parameters, where the expectation in (2) will still be taken given these “fixed” estimated values. We use the posterior median, since the posterior distribution has Cauchy-type tails in one of the model specifications that we investigate, and these Cauchy-type tails imply that the coefficients have no posterior means.
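As an illustration, the following minimal sketch (in Python, with variable and function names of our own choosing, not taken from the paper) computes the CSA holding from a point estimate of the expected one-day change of the spread, for k = 1 and k = 0.

```python
import numpy as np

def csa_holding(expected_spread_change, k=1):
    """Desired number of spreads under the CSA rule (illustrative sketch).

    expected_spread_change : point estimate of E[z_{t+1} - z_t | I_t], e.g.
                             evaluated at the posterior medians of the parameters.
    k = 1 : position size scales with the expected change (accuracy);
    k = 0 : only the sign of the expected change is used (directional accuracy).
    """
    return np.sign(expected_spread_change) * np.abs(expected_spread_change) ** k

# A positive expected change -> buy spreads; a negative one -> short spreads.
print(csa_holding(0.8, k=1), csa_holding(-0.3, k=0))
```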
In the ISA strategy the desired holding hₜ at the end of day t is given by:

hₜ = −zₜ.

A positive value of hₜ means that we buy hₜ spreads and a negative value of hₜ means that we short |hₜ| spreads. Or equivalently, a negative value of the spread zₜ means that we buy |zₜ| spreads and a positive value of zₜ means that we short zₜ spreads. That is, if β₁ > 0 and β₂ < 0, then a positive value of hₜ means that we buy hₜβ₁ units of stock 1 and short hₜ|β₂| units of stock 2. In the sequel of this paper, we will substitute the posterior medians of β₁ and β₂ to obtain an estimate of the spread in (1).
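A corresponding sketch of the ISA rule under the same illustrative conventions: the holding moves against the current estimated spread, so a positive spread is sold and a negative spread is bought.

```python
def isa_holding(current_spread):
    """Desired number of spreads under the ISA rule (illustrative sketch):
    short the spread when it is positive, buy it when it is negative."""
    return -current_spread

print(isa_holding(0.5), isa_holding(-0.2))   # -0.5 (short), 0.2 (buy)
```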
The CSA and ISA strategies raise several questions. First, how do we define such long-run equilibrium relationships, and how are the coefficients β₁ and β₂ estimated? Second, how do we find pairs of stocks that satisfy such a long-run equilibrium relationship? Third, how do we estimate how the stock prices adjust towards their long-run equilibrium relationship? In the next section we consider how our Bayesian analysis of the cointegration model (under linear or orthogonal normalization) provides answers to all these questions. In order to answer the first and third questions we use the posterior distribution (more precisely, the posterior median) of the parameters in the cointegration model. In order to answer the second question we compute the Bayes factor of a model with a cointegration relationship versus a model without a cointegration relationship for a large number of pairs of stocks.
At this point, we stress why we make use of the CSA and ISA strategies, rather than the approach of Gatev et al. [4]. In the strategy of Gatev et al. [4] a holding is taken as soon as it is found that a pair of prices has substantially diverged. After that, the holding remains constant until the prices have completely converged to the equilibrium relationship. A disadvantage of that trading strategy is that there is not much trading going on (i.e., in most periods there is no trading at all), which makes it more difficult to investigate the difference in quality between different models within a finite period; equivalently, a very long period may be required to find substantially credible differences in trading results between models.
3. Bayesian Analysis of the Cointegration Model Under Linear and Orthogonal Normalization
Consider a vector autoregressive model of order 1 (VAR(1)) for an n-dimensional vector of time series yₜ:

yₜ′ = yₜ₋₁′ Φ + εₜ′,   t = 1, …, T,   (4)

where εₜ is an independent n-dimensional vector normal process with zero mean and positive definite symmetric (PDS) covariance matrix Σ. We will consider two alternative distributions for εₜ: a multivariate normal distribution and a multivariate Student’s t distribution. Φ is an n × n matrix with autoregressive coefficients. The initial values in y₀ are assumed fixed. The VAR model in (4) can be written in error correction form

Δyₜ′ = yₜ₋₁′ Π + εₜ′,   (5)

where Π = Φ − Iₙ (with Iₙ the n × n identity matrix) is the long-run multiplier matrix; see, e.g., Johansen [9] and Kleibergen and Paap [6].
If Π is a zero matrix, the series yₜ contains n unit roots and there is no opportunity for long-term predictability with finite uncertainty. If the matrix Π has full rank, the univariate series in yₜ are stationary and long-run equilibrium relations are assumed to hold. Cointegration appears if the rank of Π equals r with 0 < r < n. The matrix Π can then be written as the outer product of two full rank matrices β (of dimension n × r) and α (of dimension r × n):

Π = βα.

The matrix β contains the cointegration vectors, which reflect the stationary long-run (equilibrium) relations between the univariate series in yₜ; that is, each element of yₜ′β can be interpreted as a temporary deviation from a long-run (equilibrium) relation. The matrix α contains the adjustment parameters, which indicate the speed of adjustment to the long-run (equilibrium) relations.
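To illustrate the roles of β and α, the following sketch (a toy simulation of our own, not the data or code of the paper) generates a bivariate cointegrated system Δyₜ′ = yₜ₋₁′βα + εₜ′ and computes the spread yₜ′β, which is stationary while the individual series are not.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.array([[1.0], [-0.8]])        # cointegration vector (n x r, here 2 x 1) -- toy values
alpha = np.array([[-0.05, 0.05]])       # adjustment parameters (r x n): speed of error correction

y = np.zeros((T, 2))
for t in range(1, T):
    eps = rng.normal(scale=0.1, size=2)
    # error-correction step: Delta y_t' = y_{t-1}' beta alpha + eps'
    y[t] = y[t - 1] + y[t - 1] @ beta @ alpha + eps

spread = y @ beta                        # temporary deviations from the long-run relation
print(spread.std(), y[:, 0].std())       # the spread varies far less than the nonstationary series
```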
To save on notation, we write (5) in matrix notation

ΔY = Y₋₁ Π + ε,

with T × n matrices ΔY = (Δy₁, …, Δy_T)′, Y₋₁ = (y₀, …, y_{T−1})′ and ε = (ε₁, …, ε_T)′. Under the cointegration restriction Π = βα, this model is given by:

ΔY = Y₋₁ βα + ε.

The individual parameters in β and α are non-identified, as βα = (βB)(B⁻¹α) for any nonsingular r × r matrix B. That is, postmultiplying β by an invertible matrix B and premultiplying α by its inverse leaves the matrix Π unchanged. Therefore, identification restrictions are required to identify the elements of β and α, so that these become estimable. In this paper we will consider two different normalization restrictions for identification purposes. The first normalization is the linear normalization, which is commonly used, where we have

β = (I_r, β₂′)′.

That is, the elements of the first r rows of β must form an identity matrix. The intuition behind this normalization is that, for the case of two series, the second series is assumed to have an effect on the first series similar to the linear regression model, where one measures the effect of a right-hand-side explanatory variable on a left-hand-side dependent variable. The second normalization is the orthogonal normalization, where we have

β′β = I_r.

Here the interpretation is that the series are treated symmetrically and only the linear combination matters. This normalization and interpretation come naturally for a set of time series of different, symmetrically treated prices, where one is mainly interested in stable linear combinations.
In this paper we consider the case of n = 2 time series (of stock prices) in yₜ, where the rank of Π is equal to r = 1:

Π = βα = (β₁, β₂)′ (α₁, α₂),   (10)

with spread

zₜ = yₜ′β = β₁ y₁,ₜ + β₂ y₂,ₜ,

and

E[zₜ₊₁ − zₜ | Iₜ] = (α₁β₁ + α₂β₂) zₜ,   (12)

so that the expected one-day-ahead change of the spread is proportional to the current spread. From (10) and (12) it is clear that our ISA trading strategy depends on β₁ and β₂, whereas our CSA trading strategy also depends on α₁ and α₂.

Under the linear normalization we have β₁ = 1:

β = (1, β₂)′,

whereas under the orthogonal normalization we have

β′β = β₁² + β₂² = 1,

which is (under the further identification restriction β₁ ≥ 0) equivalent with

β₁ = (1 − β₂²)^(1/2) with β₂ in [−1, 1].

Since the adjustment coefficients α₁ and α₂ may be close to 0, there may be substantial uncertainty about the equilibrium relationship. The linear normalization allows β₂ to take values in (−∞, ∞), whereas the orthogonal normalization allows β₁ to take values in [0, 1] and β₂ in [−1, 1]. One may argue that the spread under the linear normalization is just a re-scaled version of the spread under the orthogonal normalization (where the spread under the linear normalization would result from dividing the spread under the orthogonal normalization by β₁). However, we will consider a moving window, where the parameters are updated every day, so that the re-scaling factor is not constant over time. Therefore, the profit/loss of the ISA strategy under the linear normalization is not just a re-scaled version of the profit/loss of the ISA strategy under the orthogonal normalization. Further, we estimate the parameters using their posterior median, where the posterior median of β₂ under the linear normalization will typically differ from the ratio of the posterior medians of β₂ and β₁ under the orthogonal normalization. The profit/loss of the strategies under the linear normalization may be much affected by a small number of days at which β₂ is estimated very large (in an absolute sense), whereas under the orthogonal normalization the profit/loss may be more evenly affected by the different days, as (the estimates of) β₁ and β₂ cannot ‘escape’ to extreme values outside [−1, 1].
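The following small sketch (with illustrative numerical values of our own) shows how one and the same cointegration direction is expressed under the two normalizations: the orthogonal normalization rescales the vector to unit length (with a sign restriction on the first element), while the linear normalization rescales it so that its first element equals 1.

```python
import numpy as np

beta_raw = np.array([0.6, -0.45])                  # some estimated cointegrating direction (toy values)

beta_orth = beta_raw / np.linalg.norm(beta_raw)    # orthogonal normalization: beta' beta = 1
if beta_orth[0] < 0:                               # additional sign restriction pinning down the direction
    beta_orth = -beta_orth

beta_lin = beta_raw / beta_raw[0]                  # linear normalization: first element equals 1

# The two spreads differ only by the scale factor beta_orth[0] at a given point in time,
# but in a rolling-window strategy this factor changes every day.
print(beta_orth, beta_lin, beta_lin * beta_orth[0])
```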
3.1. The Encompassing and Jeffreys’ Framework for Prior Specification and Posterior Simulation
As mentioned above, we consider the case of n = 2 time series (of stock prices), where the rank of Π is equal to r = 1. That is, the matrix Π needs to satisfy a reduced rank restriction. A natural way to specify a prior for α and β is given by the encompassing framework, in which one first specifies a prior on Π without imposing a reduced rank restriction and then obtains the prior in our model as the conditional prior of Π given that the rank of Π is equal to 1.
As singular values are generalized eigenvalues of non-symmetric matrices, they are a natural way to represent the rank of a matrix. Using singular values we can artificially construct the full rank specification of Π via an auxiliary parameter given by the (n − r) × (n − r) matrix λ; i.e., λ is a scalar in our case with n = 2 and r = 1. The reduced rank matrix Π = βα is extended into the full rank specification:

Π = βα + β⊥ λ α⊥,   (26)

where β⊥ and α⊥ are n × (n − r) and (n − r) × n matrices that are specified such that β′β⊥ = 0, α α⊥′ = 0, and β⊥′β⊥ and α⊥ α⊥′ both equal the (n − r) × (n − r) identity matrix. The full rank specification encompasses the reduced rank case given by λ = 0. In this framework the probability that λ = 0 can be interpreted as a measure quantifying the likelihood of reduced rank. The specification in (26) is obtained using the singular value decomposition Π = U S V′ of Π, where the n × n matrices U and V are orthogonal such that U′U = Iₙ and V′V = Iₙ, and the n × n matrix S is diagonal and has the singular values of Π on its diagonal in decreasing order.
To derive the elements of equation (26) in terms of the parameters in Π, we partition Π according to the specifics of the chosen normalization. Under the linear normalization, we partition the matrices U, S and V into blocks conformable with the rank r; the matrices β, α, β⊥, α⊥ and λ in decomposition (26) then follow from these blocks of U, S and V, see Kleibergen and Paap [6] for the explicit expressions.
Under the orthogonal normalization, the matrices U, S and V are partitioned in the same way and analogous relations hold. Under the orthogonal normalization λ is directly equal to the smallest singular values of Π, whereas under the linear normalization it is a rotation thereof. In both cases the restriction λ = 0 is equivalent with restricting the smallest singular values of Π to 0.
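As a numerical illustration (with a toy matrix of our own), the singular value decomposition can be used to read off the quantity that plays the role of λ: the smallest singular value of Π, which is (close to) zero exactly when Π has (nearly) reduced rank.

```python
import numpy as np

Pi_reduced = np.array([[0.10, -0.08],
                       [-0.05, 0.04]])        # rank-1 matrix: second row is -0.5 times the first
Pi_full = Pi_reduced + 0.02 * np.eye(2)       # perturbation that restores full rank

for Pi in (Pi_reduced, Pi_full):
    U, s, Vt = np.linalg.svd(Pi)              # singular values returned in decreasing order
    print(s)                                  # the smallest singular value plays the role of lambda:
                                              # (close to) zero under (near) reduced rank
```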
The prior on (β, α) is equal to the conditional prior of the parameters (β, α, λ) given that λ = 0, which is proportional to the joint prior for (β, α, λ) evaluated at λ = 0:

p(β, α) ∝ p(β, α, λ = 0) ∝ p(Π)|_{Π = βα} |J(β, α, λ)|_{λ = 0},

where p(Π)|_{Π = βα} stands for the prior of Π evaluated in Π = βα, and where J(β, α, λ) denotes the Jacobian of the transformation from Π to (β, α, λ). Kleibergen and Paap [6] derive the closed-form expression for the determinant of the Jacobian for the general case of n variables and reduced rank r under the linear normalization. In Appendix B the Jacobian is derived under the orthogonal normalization of β.
Bastürk et al. [10] prove that under certain conditions the encompassing prior is equivalent to Jeffreys’ prior in the cointegration model with normally distributed innovations, irrespective of the normalization applied. We emphasize this equivalence, since the information matrix or Jeffreys’ prior is better known than the encompassing approach. Since the information matrix prior may yield certain desirable properties of the posterior, we conclude that an encompassing approach may also serve this purpose.
In a similar fashion, the posterior of (β, α) is equal to the conditional posterior of the parameters (β, α, λ) given that λ = 0, which is proportional to the joint posterior for (β, α, λ) evaluated at λ = 0:

p(β, α | Y) ∝ p(β, α, λ = 0 | Y) ∝ p(Π | Y)|_{Π = βα} |J(β, α, λ)|_{λ = 0},   (18)

where the detailed expression for the unrestricted posterior density p(Π | Y) of (19) is given by Kleibergen and Paap [6], and where Y denotes the observed data.
For Bayesian estimation of the cointegration model we need an algorithm to sample from the posterior density in (18). However, this posterior density does not belong to any known class of distributions (see Kleibergen and Paap [6]) and as such cannot be sampled from directly. The idea of the Metropolis-Hastings (M-H) algorithm is to generate draws from the target density by constructing a Markov chain whose distribution converges to the target distribution, using draws from a candidate density and an acceptance-rejection scheme. Kleibergen and Paap [6] present the M-H algorithm to sample from (18) for the cointegration model with normally distributed disturbances under the linear normalization. In this algorithm (19) is used to form a candidate density. The general outline of this sampling algorithm is presented in Appendix A. Appendix B presents the approach to evaluate the acceptance-rejection weights under the orthogonal normalization. The posteriors of the coefficients under the linear normalization have Cauchy-type tails, so that there exist no posterior means for the coefficients. Therefore, we estimate the coefficients using the posterior median (which we do under both normalizations to keep the comparison between the normalizations as fair as possible).
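The following is a generic sketch of an independence-chain M-H sampler of the kind outlined above; the functions log_target, log_candidate and draw_candidate are placeholders for the (normalization-specific) posterior and candidate densities of the cointegration model and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_independence_chain(log_target, log_candidate, draw_candidate, n_draws, theta0):
    """Generic independence-chain M-H sampler (sketch).

    Candidates are drawn independently of the current state; a candidate is
    accepted with probability min(1, w(candidate)/w(current)), where
    w(theta) = target(theta) / candidate_density(theta).
    """
    draws = [theta0]
    log_w_current = log_target(theta0) - log_candidate(theta0)
    for _ in range(n_draws):
        cand = draw_candidate()
        log_w_cand = log_target(cand) - log_candidate(cand)
        if np.log(rng.uniform()) < log_w_cand - log_w_current:
            draws.append(cand)
            log_w_current = log_w_cand
        else:
            draws.append(draws[-1])
    return np.array(draws)

# Toy usage: sample a standard normal target using a wider normal candidate
# (normalizing constants can be omitted, as they cancel in the acceptance ratio).
log_target = lambda x: -0.5 * x**2
log_candidate = lambda x: -0.5 * (x / 2.0)**2
draw_candidate = lambda: rng.normal(0.0, 2.0)
out = mh_independence_chain(log_target, log_candidate, draw_candidate, 5000, 0.0)
print(out.mean(), out.std())   # should be close to 0 and 1
```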
Given that the time series considered display non-normal features, we also consider the model under a multivariate Student’s t distribution for the innovations εₜ. The M-H algorithms are then straightforwardly extended; see Geweke [11].
Since we make use of the independence-chain Metropolis-Hastings algorithm, the simulation of candidate draws and the evaluation of the importance weights (used in the probability of accepting a candidate draw) can easily be performed in a parallel fashion, which would greatly increase the speed of our computations. Only the final step of the method, the actual acceptance or rejection of candidate draws, cannot be performed in parallel, but this step takes relatively little computing time. As an alternative, one can make use of importance sampling, where the whole method can be performed in a parallel fashion.
3.2. Bayes Factors
We evaluate the Bayes factor of rank 1 versus rank 2 and the Bayes factor of rank 0 versus rank 2. The Bayes factor of rank 1 versus rank 0 is then given by the ratio of these Bayes factors. For the evaluation of these Bayes factors we extend the method of Kleibergen and Paap [6], who evaluate the Bayes factor as the Savage-Dickey density ratio (see Dickey [12] and Verdinelli and Wasserman [13]), to the case of the orthogonal normalization. The Bayes factor for the restricted model with λ = 0 (where Π has rank 0 or 1) versus the unrestricted model with unrestricted λ (where Π has rank 2) equals the ratio of the marginal posterior density of λ and the marginal prior density of λ, both evaluated in λ = 0. However, in the case of our diffuse prior specification this Bayes factor for rank reduction is not defined, as the marginal prior density of λ is improper. Therefore, we follow Chao and Phillips [14], who use a specific prior height to construct their posterior information criterion (PIC). We assume equal prior probabilities for the ranks 0, 1 and 2, so that the Bayes factor is equal to the posterior odds, the ratio of posterior model probabilities. For pairs of stock prices we will mostly observe that the estimated posterior model probability is highest for rank 0, the case of two random walk processes without cointegration. Only for a small fraction of pairs will we observe that the estimated posterior model probability is highest for rank 1, the case of two cointegrated random walk processes.
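As an illustration of this last step, the following sketch (with made-up Bayes factor values) converts Bayes factors against the full-rank model into posterior model probabilities under equal prior probabilities for the three ranks.

```python
import numpy as np

# Made-up Bayes factors of rank 0 and rank 1 versus the full-rank (rank 2) model.
bf_0_vs_2 = 4.0
bf_1_vs_2 = 6.0

# With equal prior probabilities, posterior model probabilities are
# proportional to the Bayes factors against a common reference model.
weights = np.array([bf_0_vs_2, bf_1_vs_2, 1.0])     # ranks 0, 1, 2 (reference model has weight 1)
post_prob = weights / weights.sum()
bf_1_vs_0 = bf_1_vs_2 / bf_0_vs_2                   # Bayes factor of rank 1 versus rank 0
print(post_prob, bf_1_vs_0)
```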
4. Empirical Application
The CSA and ISA strategies are applied to components of the Dow Jones Composite Average index. We work with daily closing prices recorded over the period of one year, from 1 January 2009 until 31 December 2009. We consider the 65 stocks with the highest liquidity. First, we identify cointegrated pairs based on the estimated posterior probability of cointegration (i.e., Π having rank 1) computed for the first half year of the data. That is, among the 2080 possible pairs we select the 10 pairs with the highest Bayes factor of rank 1 versus rank 0 (where these Bayes factors are larger than 1) for both the linear and the orthogonal normalization. The 10 pairs are identical for both normalizations; these pairs are given in Table 1. Second, those pairs are used in the CSA and ISA trading strategies during the last 6 months of 2009. We use a rolling window, where the parameter estimates are updated at the end of each trading day, after which the positions are updated and kept constant until the end of the next trading day. We will analyze the profits from these trading strategies, where we take into account the common level of transaction costs of 0.1%.
Next to the “standard” CSA approach described before, we also perform a more cautious, more conservative CSA strategy that takes into account parameter uncertainty. Here we only take a position if we are more certain about the sign of the current spread (and hence about the sign of the expected change of the spread, which is the opposite sign). Specifically, we only take a position if the (50 − ξ)% percentile and the (50 + ξ)% percentile of the posterior distribution of the current spread have the same sign, where we consider several increasingly large values of ξ between 0 and 50. The largest value of ξ yields the most cautious strategy, since the two posterior percentiles of the spread must then have the same sign over the widest interval. Note that for ξ = 0 this strategy reduces to the original CSA strategy.
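A sketch of this cautiousness filter, applied to a set of posterior draws of the current spread; the variable names and the exact percentile convention are ours and merely illustrate the idea.

```python
import numpy as np

def cautious_position_allowed(spread_draws, xi=40.0):
    """Only allow a position if the (50 - xi)% and (50 + xi)% posterior
    percentiles of the current spread have the same sign (sketch);
    xi = 0 reproduces the original CSA rule (median sign only)."""
    lo, hi = np.percentile(spread_draws, [50.0 - xi, 50.0 + xi])
    return np.sign(lo) == np.sign(hi) and np.sign(lo) != 0

draws = np.random.default_rng(2).normal(loc=0.3, scale=0.5, size=10_000)
print(cautious_position_allowed(draws, xi=10), cautious_position_allowed(draws, xi=45))
```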
In order to evaluate the CSA and ISA strategies in the cointegration models under the linear and the orthogonal normalization and under a normal and a Student’s t distribution for the innovations, we compute two measures. First of all, the strategy cannot be evaluated in terms of the percentage return on initial capital investment, as we are not only buying stocks but also shorting stocks. Suppose that we perform our strategies for T consecutive trading days (where in our case T is the number of trading days in the last 6 months of 2009). Then the average daily capital engagement is given by the average, over these T days, of the capital tied up in the long and short positions, where the sizes of the positions held during day t follow from the posterior medians of β₁ and β₂ computed at the end of day t − 1 of the trading period; that is, the positions for the first trading day follow from the posterior medians computed at the end of the last trading day before the trading period. Our first performance measure is a profitability measure that is given by the total return of the strategy divided by the average daily capital engagement:

profitability = (total return over the trading period) / (average daily capital engagement).   (22)

Our second performance measure concerns the risk of the strategies. In order to estimate risk we use paths of cumulative return. When the cumulative return at time t + 1 is often lower than the cumulative return at time t, then a strategy can be considered risky. On the other hand, if the cumulative return is growing or remains steady over most periods, the strategy can be considered as having low risk; in the latter case the signals generated by the trading rule are accurate and yield (mostly) profit. We define our risk measure as the percentage of trading days with a decrease of the cumulative return:

risk = (number of trading days with a decrease of the cumulative return) / T,   (23)

for each strategy.
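As an illustration, the following sketch (with our own variable names and simulated toy inputs) computes the two evaluation measures from a series of daily profits/losses and daily capital engagements.

```python
import numpy as np

def performance_measures(daily_pnl, daily_capital_engagement):
    """Sketch of the two evaluation measures (illustrative variable names).

    daily_pnl                : profit/loss realised on each trading day
    daily_capital_engagement : capital tied up in the long and short legs each day
    """
    profitability = daily_pnl.sum() / daily_capital_engagement.mean()
    cum_return = np.cumsum(daily_pnl)
    risk = np.mean(np.diff(cum_return) < 0)   # fraction of days with a decrease of cumulative return
    return profitability, risk

rng = np.random.default_rng(3)
pnl = rng.normal(0.02, 0.1, size=125)          # toy daily profit/loss for ~6 months of trading days
capital = rng.uniform(0.8, 1.2, size=125)      # toy daily capital engagement
print(performance_measures(pnl, capital))
```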
Table 2 presents the averages of the measures in (22)–(23) over the ten selected (cointegrated) pairs of stocks. Detailed results, for every pair of stocks, are presented in Appendix C. The hypothesis that the normalization plays an important role in pairs trading strategies is confirmed by the empirical findings.
Table 2 confirms that the orthogonal normalization substantially outperforms the linear counterpart, irrespective of the assumed distribution for the innovations. This is particularly pronounced when the profitability of the CSA strategies under k = 0 (directional accuracy) is compared with the counterpart under k = 1 (accuracy). For the linear normalization the increase from k = 0 to k = 1 is linked with a substantial decrease in profitability, which means that predictions of the change of the spread under the linear normalization are relatively poor compared with the orthogonal case. For k = 1, where not only the direction but also the size of the predicted change of the spread plays an important role, the linear normalization performs relatively poorly. By contrast, the orthogonal normalization shows an appreciable increase in profitability for k = 1 compared to k = 0.
As expected, the risk measure for the CSA strategies in Table 3 decreases for the more cautious, more conservative strategies with larger values of ξ. The risk measure under the linear normalization is similar to or worse than its counterpart under the orthogonal normalization. Obviously, this only means that the percentage of trading days with a decrease of the cumulative return is similar; the size of these decreases may be larger. In further research, we will take a closer look at the riskiness of the alternative strategies in the different models.
Now we compare the performance under the normal distribution with that under the Student’s t distribution. The profitability measure is slightly better under the Student’s t distribution; for the risk measure the difference (in favor of the Student’s t distribution) seems somewhat more clearly present. However, it should be noted that the difference between the orthogonal and linear normalizations is much larger than the difference between the Student’s t and the normal distribution. The normalization is clearly the key factor for the profitability of the trading strategies for our set of data. One possible reason for this result is that the profit/loss of the strategies under the linear normalization may be much affected by a small number of days at which β₂ is estimated very large (in an absolute sense), whereas under the orthogonal normalization the profit/loss may be more evenly affected by the different days, as (the estimates of) β₁ and β₂ cannot “escape” to extreme values far outside [−1, 1]. Such extreme estimates may occur under the linear normalization if the adjustment coefficients α₁ and α₂ are close to 0, which may be the case for certain pairs of empirical time series of stock prices (in certain periods), where the error correction is rather slow. In future research, we will take a closer look at the reasons for the substantial difference in profitability between the normalizations.
5. Conclusions and Topics for Further Research
In this paper we explored the connection between the well-known cointegration model and decision strategies for selecting and trading pairs of stocks, using a simulation-based Bayesian procedure. We considered two types of pairs trading strategies: a conditional statistical arbitrage method and an implicit statistical arbitrage method. We used a simulation-based Bayesian procedure for predicting stable ratios, defined in a cointegration model, of pairs of stock prices. We showed the effect that using an encompassing or Jeffreys’ prior under an orthogonal normalization has on the selection of pairs of cointegrated stock prices and on the estimation and prediction of the spread between cointegrated stock prices and its uncertainty. An empirical application was carried out using stocks that are components of the Dow Jones Composite Average index. The results showed that the normalization has little effect on the selection of pairs of cointegrated stocks on the basis of Bayes factors. However, the results stressed the importance of the orthogonal normalization for the estimation and prediction of the spread, which leads to better results in terms of profit per unit of capital engagement and risk than the standard linear normalization.
An important issue for future research is to investigate the robustness of our empirical results. Here we list several topics for further research. First, the results may be sensitive to the specific data set; we already indicated that our results may be sensitive to particular data in a sub-period. Second, and more generally, if one considers the percentiles of the predictive distribution of the future spread during the trading strategy, taking into account the uncertainty in future innovations, then it is important to specify the distribution of the innovations carefully. In future research, we will consider a finite mixture of Gaussian distributions for the innovations. However, the algorithm for the posterior simulation will not be a straightforward extension of the algorithm under the normal distribution, as was the case for the Student’s t distribution. We will extend the partial and permutation-augmented MitISEM (Mixture of t by Importance Sampling weighted Expectation Maximization) approaches of Hoogerheide et al. [15] to perform the posterior simulation for the cointegration model with errors obeying a finite mixture distribution. Third, a learning strategy could be introduced for the decision maker. Fourth, the economic performance of econometric predictions can be evaluated using a utility-based metric to obtain a certainty equivalent of the strategies; this penalizes the excess variation in predictions perceived as “risk” of the strategy, see West et al. [16]. Fifth, the methods can be compared with a strategy in which one invests 50% in the risk-free rate and 50% in a risky asset, which is quite a successful and robust strategy, see Marquering and Verbeek [17]. Sixth, the performance of the methods can be compared with alternative Bayesian cointegration approaches, see Furmston et al. [18] and Bracegirdle and Barber [19].