1. Introduction
The conventional method for determining causality in financial time series was developed by Clive Granger, who was awarded the Nobel Memorial Prize in 2003. The essence of the method is that a subset of an information set is excluded from the analysis, and probability distributions are evaluated on the smaller information set (
Diebold 2006;
Diks and Panchenko 2006). Generating the probability distributions usually requires fitting a vector autoregressive (VAR) model to the time series and excluding some of the explanatory variables.
Nonparametric versions of the Granger causality tests were developed later, especially in the papers by
Baek and Brock (
1992), and extended by many authors (
Diks and Panchenko 2006, and op. cit.). The Baek and Brock test and its variants are computationally very intensive for the large datasets prevailing in modern securities studies. They require covering the state space with cells of size ε > 0 and computing correlations between the cells as ε decreases.
In the 2010s, stock market transactions began to carry nanosecond time stamps. This change requires new methods of analysis adapted to the new realities.
The emergence of the big data framework and attempts to use deep learning methods created the following challenge: regressions became nonlinear and may contain hundreds of thousands of parameters (in the case of this paper) or trillions (in the case of Google datasets). Furthermore, deep learning algorithms usually present a “black box”, and it is hard to attribute output differences to input changes.
The capacity of the human mind to analyze multidimensional time series consisting of billions of market events has remained largely unchanged. Because of our evolution in three-dimensional space, humans have the best grasp of two-dimensional information. Consequently, the methods of image analysis are among the best developed in the whole discipline of signal processing.
My paper adapts deep learning methods developed for image processing to the study of causality in financial time series. When two datasets are compared, the one that requires more information to produce a deepfake using a neural network is considered more informative. A precise formulation of these criteria is provided in
Section 4.
C-GANs (convolutional generative adversarial networks) appeared in 2015. The original purpose of the method was image analysis and/or the generation of deepfakes (
Goodfellow 2017). The essence of the C-GAN is that the network is divided into two parts: generator and discriminator, or critic (
Rivas 2020). The generator net produces fake images from random noise and learns to improve them with respect to the training file. The discriminator tries to distinguish fake from real images based on training statistics.
To demonstrate this method’s utility, I use it to analyze trading imbalances on the New York Stock Exchange (NYSE), which the Securities and Exchange Commission (SEC) requires to be stored with nanosecond time stamps. An imbalance event arises when a counterparty does not instantly deliver the stock to close its position. The number of these events per day runs to several million. The time series are preprocessed into two-dimensional images of a realistic size that can be analyzed on a PC, and the images for different days are standardized to ensure their comparability.
Why is this problem important? Given the instances of “flash crashes” in the market, the first and largest of those reported being the Flash Crash of 2010 on the NYSE, the question of whether exchange-traded funds stabilize or destabilize the market has become increasingly important. In particular, the Flash Crash was attributed to the toxicity of the order flow in the S&P minis (
Easley et al. 2013). Because of the explosive growth in ETF markets (more information in
Section 2), the traded volume in the compound portfolios representing ETF shares can easily exceed the trading volume in the underlying securities. Intuitively, this can cause a problem in the delivery of the underlying, which can propagate through the system and, in rare cases, cause a crash. Similarly, the Mini-Flash Crash of 24 August 2015 demonstrated a significant deviation between the market index price, which was subject to circuit breakers several times, and the weighted ETF prices (for a detailed description, see
Moise 2023, especially
Figure 1).
In the current paper, I explore whether market imbalance events drive ETF imbalances or vice versa. This problem is a good testbed for the proposed deep learning methodology.
The paper is structured as follows. In
Section 2, I briefly outline commonly referenced background information on the ETF market. In
Section 3, I describe the structure of the database.
Section 4 describes preprocessing of SEC data into two-dimensional images. In
Section 5, we establish the causality (information link) between ETF and market transaction imbalances.
Section 6 provides an additional robustness check of the results.
Section 7 summarizes the results.
Appendix A describes the different kinds of imbalance messages according to the NYSE. I review a possible theoretical justification for ETF informational dominance in
Appendix B.
2. Formulation of the Problem and the Literature Review
The ETF market has exhibited explosive growth in recent years (
NYSE 2021). Invented to securitize stock inventories of large asset managers, it developed into a major asset class in its own right (
Gastineau 2010). Different anomalies in the markets, particularly the Flash Crash in May 2010, were partly attributed to the instability of exchange-traded products (ETPs), especially the S&P minis (
Easley et al. 2013).
The question of whether the introduction of ETFs increases or decreases market quality has been discussed in many authoritative publications (
Bhattacharya and O’Hara 2018, and op. cit.).
Israeli et al. (
2017) mentioned that there are two main opposing reasons for expecting a change in market quality. On the positive side, exposure to ETFs provides more information on the underlying stocks, particularly less liquid stocks with little analyst coverage. The question of whether the use of ETFs increases or decreases the information efficiency of markets was also posed by
Glosten et al. (
2020). They suggested that the inclusion of ETFs increases information efficiency for low-liquidity stocks.
On the negative side, uninformed traders tend to exit the market in the underlying securities in favor of ETFs, thus depressing liquidity. Furthermore, much ETF activity happens at market close (
Shum et al. 2016). Shum et al. noticed that ETFs typically have larger mispricing and wider spreads near the end of trading, especially on the most volatile trading days.
If the influence of ETFs is so prominent, can they be a catalyst for extreme events in the market? Some authors, e.g., A. Madhavan, have answered positively (
Madhavan and Sobczyk 2019). At a minimum, there is recognition that new kinds of risk are inherent in the proliferation of ETFs (
Pagano et al. 2019). If that is true, can big data analyses and deep learning instruments provide some warning about whether extreme events may be coming? And what is the direction of the information flow: from ETFs to the broader market or vice versa? (
Glosten et al. 2020).
To answer this question, we develop a structured methodology, which allows us to determine with some certainty whether ETF illiquidity results from market fluctuations or the other way around. The hypothetical mechanism is as follows: ETF trading initiates the delivery of ETF constituents (an “in-kind” transaction) or of a cash equivalent (an “in-cash” transaction) if the underlying assets are easily tradable (
Bhattacharya and O’Hara 2018). If the aggregate volume of ETF executions were large and/or unevenly spread in time, this would introduce friction into orderly market execution.
And indeed, there are a few inherent problems. First, it is often unclear whether a trade imbalance results from actual economic events in one or more underlying stocks or from the effects of stock aggregation by the ETFs, for instance, changes in the credit risk of the ETF swap counterparty.
Second, because ETF transactions are prominent in hedges, they are highly nonuniform throughout the day (
Shum et al. 2016). A company that misses a delivery can wait until the end of the day, when the market is more stable, to close the deal. This paper does not judge whether ETFs are a “good” or “bad” influence on market liquidity. It strives to clarify the enormous influence ETFs have on the stock market, particularly the direction of information transmission.
According to the foundational models of market microstructure, the principal driver of price changes is the imbalance between supply and demand for a given security (
Kyle 1985;
Glosten and Milgrom 1985). Currently, imbalance messages can be followed with nanosecond precision. There is probably little added value in further increasing precision because, at already achievable latencies, signals can only propagate a few meters, roughly the size of a trading room (
Bartlett and MacCrory 2019). One of the first studies going up to nanosecond granularity was “Price discovery in high resolution” (
Hasbrouck 2021). This wealth of available data creates many problems of its own. The human mind is poorly adapted to rationalize such amounts of data. Furthermore, our senses evolved in 3D space and have difficulty comprehending multidimensional datasets.
One of the principal channels of influence of the ETF market on overall market stability is the use of ETF shares for shorting and hedging. ETF shares “fail to deliver” on settlement for different reasons. A fail-to-deliver could signal actual economic trouble in a company, shorting in expectation of a real price movement by the AP (authorized participant, the ETF analog of the market maker (MM) for a stock), or “operational shorting” (
Evans et al. 2018). The description of operational shorting in the above-cited paper by Evans, Moussawi, Pagano, and Sedunov is so exhaustive that I provide a somewhat long quote.
Before a security is classified as “fail-to-deliver”, an imbalance record is created. Usually, the imbalance is cleared by end-of-day trading or the next day before trading hours. The reputation penalty for being cited in imbalances is typically small (
Evans et al. 2018).
The reason markets and the SEC do not regulate intraday deliveries with harsher penalties is obscure. We hypothesize that the inherent optionality involved in payment for order flow (PFOF) is partially responsible for this market feature (for a PFOF analysis, see, e.g., Lynch 2022). If there were substantial fines or a negative reputation associated with a failure to deliver, the PFOF mechanism would suffer disruptions. The expected loss from a penalty could overcome the price improvement offered by the wholesaler. Because the non-delivery probability is nonzero, only relatively large price improvements would justify the counterparty risk, and a large volume of trade would bypass the wholesaler. A detailed discussion of this issue is outside the scope of this paper.
This work is dedicated to researching methods to rationalize imbalance datasets with nanosecond time stamps. We compress them into two-dimensional “fingerprints”, for which a rich array of algorithms developed for analyzing visual images is already available. The dataset we use is the list of imbalance messages provided by NYSE Arca. “NYSE Arca is the world-leading ETF exchange in terms of volumes and listings. In November 2021, the exchange had a commanding 17.3% of the ETF market share in the US” (
Hayes 2022). The special significance of the data for our problem setting is illustrated by the fact that a glitch in the NYSE Arca trading system influenced hundreds of ETF funds in March 2017 (
Loder 2017).
Messages in our database have the following types: type “3”, type “34”, and type “105”. Message type 3 is a symbol index mapping (reset) message. Message 34 is a security status message, which can indicate “opening delay”, “trading halt”, and “no open/no resume status”. Finally, message type 105 is an imbalance message. More information about the format and content of the messages and the datasets can be found in
Appendix A and (
NYSE Technologies 2014).
To make use of the large statistics of the nanosecond time stamps of the type 105 messages, we selected them for our analysis. This choice is justified because the daily stream of type 105 messages numbers in the millions, while type 3 and 34 messages number in the tens of thousands.
The number of imbalance messages (type 105) for each trading day is around four million, each comprising 15–20 standardized fields. TAQ NYSE Arca equities—TAQ NYSE imbalance files provide “buy and sell imbalances sent at specified intervals during auctions throughout the trading day for all listed securities” (
NYSE Technologies 2014).
3. Preprocessing—Formation of the State Variables Database
We selected the following variables: (1) the number of messages per unit of time and price, (2) the dollar imbalance at the exception message, and (3) the remaining imbalance at settlement. The last is rarely different from zero because a failure to rectify stock imbalances by the close of a trading session indicates a significant breakdown of market discipline and may entail legal consequences.
Our data can be divided into two unequal datasets: the totality of market messages and the ETF-related messages, with the first group encompassing the second. Because of the large volume of the data, we used an algorithmic selection of data for the ETF group. The messages in the datasets contain the identifier “E” for exchange-traded funds, but “E” can also indicate corporate bonds elsewhere and is too common a letter to filter for in a text file. Instead, we used separate data on market participants provided by the NYSE, from which we filtered the names explicitly containing the words “exchange-traded” or “fund”. This identification is not 100% accurate because some closed-end mutual funds, which are not ETFs, could have entered our list, but the list is expected to be dominated by ETFs.
We were left with 1061 names automatically selected from the messages file. The number of daily events related to our list can be half a million or more, so sorting by hand would be difficult, if possible at all.
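As an illustration of this selection step, a minimal sketch is given below; the file names and column labels ("msg_type", "symbol", "issuer_name") are assumptions about the data layout, not the actual NYSE field names.

```python
# Sketch of the algorithmic selection described above; file names and column
# labels are assumptions, not the actual NYSE field names.
import pandas as pd

KEYWORDS = "exchange-traded|fund"   # name fragments used to flag likely ETFs

participants = pd.read_csv("nyse_participants.csv")      # hypothetical list of listed names
etf_symbols = set(
    participants.loc[
        participants["issuer_name"].str.lower().str.contains(KEYWORDS, na=False),
        "symbol",
    ]
)

messages = pd.read_csv("taq_arca_messages.csv")           # hypothetical daily message dump
imbalances = messages[messages["msg_type"] == 105]         # keep only type 105 (imbalance) messages
etf_imbalances = imbalances[imbalances["symbol"].isin(etf_symbols)]
```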
We further grouped our data as follows: first, the number of type 105 messages per 100 s; second, the cumulative imbalance every 100 s over the 12½ h trading day. The number of price bins was chosen to be approximately equal to the number of time intervals. Dollar imbalances are calculated by the following formula:
$$ \mathrm{Imb}^{\$}_{t} = p \left( \mathrm{Imb}_{t} - \mathrm{Settle}_{4:00} \right), \qquad (1) $$
where $p$ is the last market price, $\mathrm{Imb}_{t}$ is the undelivered number of shares, and $\mathrm{Settle}_{4:00}$ is the number of shares unsettled by the end of the trading session, usually at 4:00 p.m.
The 100 s intervals were chosen somewhat arbitrarily, with the intent of producing a two-dimensional data tensor that can be processed on a laptop while retaining acceptable statistics. The imbalances are distributed quite irregularly over the roughly 45,000 s trading day and can be visually grouped into the “beginning-of-day settlement”, “midday settlement”, and “end-of-day settlement” (see
Figure 1).
As expected, most of the dollar imbalances are small. To prevent the data from collapsing into a trivial distribution (a gigantic zeroth bin and a uniformly small right tail), we applied a logarithmic transformation, $x \rightarrow \log(1 + x)$, to all variables, including time. Unity was added to deal with the zeroes in the database. A nonlinear transformation distorts probability distributions, but given their sharply concentrated, peaked shape, we did not expect it to influence the results much.
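A minimal sketch of this binning and transformation, assuming each message has been reduced to a time stamp (in seconds from the start of the session) and a dollar imbalance, is given below; the bin count is an assumption roughly consistent with 450 intervals of 100 s.

```python
# Sketch of the binning and the log(1 + x) transform described above; the bin
# count is an assumption roughly consistent with a 12.5 h (45,000 s) session.
import numpy as np

def daily_grid(t_seconds, dollar_imbalance, n_bins=450):
    """Bin messages on a 2-D grid after a log(1 + x) transform of every variable."""
    t = np.log1p(np.asarray(t_seconds, dtype=float))        # unity added to handle zeros
    d = np.log1p(np.asarray(dollar_imbalance, dtype=float))
    grid, _, _ = np.histogram2d(t, d, bins=n_bins)
    return np.log1p(grid)                                   # counts are also sharply peaked at zero
```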
The plot of the summary statistics for a typical day is shown in
Figure 2. We observed that the maximum number of imbalance messages created by ETFs for each 100 s during a trading day is about one-eighth of the total number of exceptions (~25,000:200,000) in the market, but the cumulative value of imbalances created by the ETFs is about 60% of the total. This disparity suggests that average imbalances are much higher when ETF shares are involved.
The final step was making our data amenable to deep learning algorithms, which are mostly designed to deal with visual data (“fake images”). We further compressed our nonuniform rectangular matrices (in some cases, messaging did not begin at exactly 3:30 a.m., and one data file included only NYSE-listed stocks) into 96 × 96 squares, which we call “fingerprints” of daily trading (
Figure 3). The fingerprints do not have an obvious interpretation; rather, they take the form of a machine-readable image, like a histogram. The transaction rate is plotted on the vertical scale, and the accompanying dollar value of the imbalance on the horizontal scale. This compression method allows daily imbalances to be treated as uniform images, which can be processed using deep learning algorithms.
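As an illustration, the compression of a nonuniform daily grid into a 96 × 96 fingerprint could be performed with simple resampling, as sketched below; the bilinear interpolation and min-max normalization are assumptions about the implementation.

```python
# Sketch of compressing a daily grid into a 96 x 96 "fingerprint"; the use of
# bilinear resampling and min-max normalization is an assumption.
import numpy as np
from scipy.ndimage import zoom

def to_fingerprint(grid, size=96):
    """Rescale a 2-D array to size x size and normalize its values to [0, 1]."""
    img = zoom(grid, (size / grid.shape[0], size / grid.shape[1]), order=1)
    span = img.max() - img.min()
    return (img - img.min()) / span if span > 0 else np.zeros_like(img)
```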
Five randomly selected trading days (7–8 October 2019, 9 September 2020, and 4–5 October 2020) produced ten daily samples: five with total-market imbalance messages and five with ETF-only data. We constructed five testing and five training samples from them according to the protocol exhibited in
Figure 4.
The relative proportions of all-market and ETF-only samples, according to
Figure 3, is provided in
Table 1.
4. Distances on a State Space
Because of their “black box” nature, the outputs of neural networks are hard to rationalize. First, the human mind has evolved to analyze two- or three-dimensional images in three-dimensional space. Most humans cannot directly comprehend the tensor inputs, intermediate results, and outputs typical of neural networks. Second, the results of neural network analyses are necessarily stochastic and depend on a large number of estimated intrinsic parameters, which are frequently inaccessible and, in any case, too numerous to rationalize. Third, deep learning results can depend on how the training and testing samples are organized, even if they represent identical datasets. All of this can indicate the failure of a deep learning procedure (
Brownlee 2021), but it can also reveal additional information that we fail to recognize.
Because neural networks are “black boxes”, instead of interpreting hundreds of thousands of parameters (in the case of this paper) or trillions (in the case of Google and Microsoft deep learning networks), one has to design numerical experiments and analyze the output of a deep learning algorithm in its entirety.
To systematize the results, we propose two measures of divergence between images. After the C-GAN generated fake images (“fingerprints”) of the session, we considered these images as (1) matrices and (2) nonnormalized probability distributions.
The first approach is to treat the arrays as matrices (tensors). We computed the pseudo-metric cosine between the image arrays X and Y according to the following formula:
$$ C_{X,Y} = \frac{\langle X, Y \rangle}{\lVert X \rVert \, \lVert Y \rVert}, \qquad (2) $$
where $\langle \cdot , \cdot \rangle$ is the elementwise (Frobenius) inner product and the norm $\lVert \cdot \rVert$ is the Frobenius matrix norm of each image array. In the first stage, we computed the distance as the average over every twentieth of the last 400 images in the sequence. Because the fake image is sometimes an empty list with zero norm, we modified this formula according to the following prescription:
Equation (3) provides answers close to the correct geometric Formula (2), but it does not fail in the case of an empty fake image. The pseudo-metric measure, calculated according to Equation (3), provides a fair picture of the affinity of the fake visual images to the originals (see
Figure 5), but it is still unstable with respect to different stochastic realizations of the simulated images.
So, we applied a second-stage averaging according to the formula for the mutual information:
$$ I = \frac{1}{N} \sum_{i=1}^{N} \log_2 \frac{C^{(i)}_{\mathrm{train,fake}}}{C^{(i)}_{\mathrm{test,fake}}}. \qquad (4) $$
In Equation (4), N is the number of independent network runs. Note that this formula does not depend on whether we use the “geometrically correct” Equation (2) or the computationally convenient Equation (3). Unlike the separate $C_{\mathrm{train,fake}}$ and $C_{\mathrm{test,fake}}$ values, which may vary widely between consecutive runs of the C-GAN, their ratio is reasonably stable for a given training and testing sample. Furthermore, if one exchanges the training and test samples, the argument of the summation in Equation (4) only changes its sign.
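A compact sketch of the Frobenius-cosine measure and the averaging of Equation (4) is given below; the small guard term eps is an assumption standing in for the exact zero-norm prescription of Equation (3).

```python
# Sketch of the Frobenius-cosine measure and the averaged log-ratio of Equation (4);
# the guard term eps is an assumption standing in for the zero-norm fix of Equation (3).
import numpy as np

def frob_cosine(x, y, eps=1e-12):
    """Cosine between two image arrays under the Frobenius inner product."""
    num = float(np.sum(x * y))
    den = np.linalg.norm(x) * np.linalg.norm(y)   # Frobenius norms of the arrays
    return (num + eps) / (den + eps)              # eps keeps empty fakes from producing 0/0

def mutual_information(train, test, fakes):
    """Average of log2(C_train,fake / C_test,fake) over independent fake fingerprints."""
    return float(np.mean([
        np.log2(frob_cosine(train, f) / frob_cosine(test, f)) for f in fakes
    ]))
```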
The second option is to treat the output arrays as quasiprobability distributions. We used the Kullback–Leibler (KL) divergence $D_{KL}(P \,\|\, Q)$ between two distributions (
Burnham and Anderson 2002). As is well known, the Kullback–Leibler divergence is not a true metric and is asymmetric with respect to its arguments. The intuitive meaning of $D_{KL}(P \,\|\, Q)$ is the information gain achieved when one replaces distribution Q (usually meaning the sample) with distribution P (usually meaning the model). In our context, distribution P is a training or test dataset, and the distribution(s)
Q are the fakes generated using the neural net. A final information criterion is:
$$ r = \frac{D_{KL}\left(P_{\mathrm{test}} \,\|\, Q_{\mathrm{fake}}\right)}{D_{KL}\left(P_{\mathrm{train}} \,\|\, Q_{\mathrm{fake}}\right)}. \qquad (5) $$
Values of r > 1 suggest that the test file is difficult to reproduce by training the net. Values of r < 1 indicate that reproducing the dataset is relatively easy (even though the neural network knows nothing about the trading that generates the observable data).
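A sketch of the KL-based criterion is given below; the flattening, renormalization, and smoothing constant are assumptions, and the ratio shown is one plausible reading of Equation (5).

```python
# Sketch of the Kullback-Leibler criterion; the renormalization and smoothing
# constant eps are assumptions, as is the exact form of the ratio r.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two non-negative arrays renormalized to sum to one."""
    p = np.asarray(p, dtype=float).ravel() + eps
    q = np.asarray(q, dtype=float).ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def r_criterion(train, test, fake):
    """r > 1: the test file is harder to reproduce from the fake than the training file."""
    return kl_divergence(test, fake) / kl_divergence(train, fake)
```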
There are also several variations in the design of the samples we analyzed. In the first, preliminary design, we compared data for the entire market, which includes ETFs, against data files containing only our sample(s) of ETFs. In that case, we expected a positive sign of the mutual information when we trained the network using the entire-market file, because the information from the ETFs is already contained in the training file. We display these results in the next section.
In the second design, we separated the data files into single-issue stocks and ETFs. A positive sign of the mutual information should appear if the training occurs on the more informative of the two subsets. We identified the more informative data file as causally determining the less informative one.
Despite the complexity of the described procedure, the intuition behind it is quite simple. A convolutional generative adversarial network generates many fake images inspired by a training sample. These images are compared with a test sample. If the fake images fit both the training and the test samples perfectly, the mutual information between them is exactly zero. Conversely, very divergent training and test distributions indicate that much additional information must be known to reproduce a dataset.
Positive mutual information, i.e., a higher correlation between the fake distribution and the training one than between the fake and the test distribution, means that the training file is easier to fake than the test file. On the contrary, small or negative mutual information suggests that the C-GAN produces fakes that are relatively easy to distinguish from the training file.
5. Preliminary Results of C-GAN Analysis
The results of the fingerprint analysis using the C-GAN are displayed in
Table 2 and
Table 3. In that analysis, we used our vocabulary of 1060 funds automatically selected by the words in their names. To check the robustness of this choice, we also tested a list of 1753 US ETF funds from the database:
https://www.etf.com/channels/us-etfs (accessed on 20 May 2021).
We display five runs of the C-GAN network, with samples constructed according to the first line of
Table 1. The comparison with the first cell of
Table 3 suggests that, with this second sample of ETFs, the mutual information is only strengthened in the direction of “training on the overall market” and “testing on the ETF-only file”.
Each cell in
Table 2 is the binary logarithm of the ratio of the distances from the generator-devised fingerprint to the training and testing samples, respectively.
Table 3 shows that the mutual information generally decreases as the fraction of market samples diminishes and the fraction of ETF samples increases. The fraction of market vs. ETF samples in the testing sets shows no visible tendency, with one exception: when the testing file is composed almost entirely of market samples, the mutual information becomes zero irrespective of the training samples (
Table 2 and
Table 3).
The column and row averages are provided in
Table 3. Testing sample 2 is an outlier. We tentatively attribute this to one day’s data being exceptional; indeed, the file for 9 September 2020 probably contained only stocks listed on the NYSE, not all traded stocks. All averages are positive. This suggests that all-market files are more easily approximated by the generator-produced fakes than the ETF-only files. We consider this evidence that ETF-only files have more distinguishing features than the all-market files and, consequently, are more distant from the fakes than the training files; i.e., if one used ETF data as the training set, it was relatively easy to train the network so that the fakes had a relatively high correlation with the all-market samples. On the contrary, the all-market training set was insufficient for training the network to produce fakes that closely approximate the original ETF data.
In a further robustness check, we tested, in the same setup, another sample of 1753 ETFs selected using an online database. The results are similar to those of our ad hoc tests of the 1060 computer-selected funds (
Table 4).
6. The Test of Single-Issue Stocks against ETF Data
We conducted a test wherein we excluded the data for 1753 ETFs from the market data. The results of the test are displayed in
Table 5.
We observed that one can teach the network by feeding it ETF-only data. Our network successfully interpolated the single-issue stock data by the “learned fakes”, but not vice versa. We tentatively conclude that ETF trading provides more information for traders in single-issue stocks, and hence, the direction of causality is from ETFs to single issues. A possible economic explanation for this phenomenon is provided in
Appendix B.
The treatment of the neural network outputs as probability distributions allows the use of another measure on the state space, namely the Kullback–Leibler divergence (see
Section 4). The results of computing the parameter
r from Equation (5) are shown in
Table 6.
The layout of this table reflects the difficulty of expressing relations between tensors in a human-readable format.
In each of the cells except one (the second from the left in the upper row), a lower asymmetry between the proportions of ETF and single-issue samples in the test and training files corresponds to a higher value of r. The average values of r in the cells are largest for a low (25% and 50%) fraction of ETF samples in the test file. Hence, the GAN network can falsify the single-issue data more successfully.
7. Conclusions
In this paper, I presented a new econometric methodology for investigating causality in financial time series. In contrast to the original Granger causality, this methodology does not rely on any explicit model of the stochastic process by which the input data were generated. It is preferable to nonparametric Granger causality techniques for extra-large or multidimensional datasets because it does not rely on the computation of correlations between multiple subsets of the original data.
The proposed method was applied to an important question: whether individual stocks or ETFs drive market liquidity. To measure liquidity, I chose the information content of the imbalance counts, which indicate an inability to instantly fill a quote at a given price, and of the dollar value of the incomplete transactions. The information content was measured as a pseudodistance between the time series in a two-dimensional state space (the number of a price bucket and its dollar imbalance).
The preliminary answer is that both the rate of imbalance arrivals and the dollar value of the resulting imbalances for ETFs are more informative, in the sense of having finer features that fakes cannot reproduce, than those for individual stocks (with ETFs counted as separate stocks). The higher information content of ETF imbalances is not surprising. Indeed, the imbalance messages produced by the 1000+ ETFs constitute, on average, about one-eighth of the totality of exceptions in the entire database, but the dollar value of their imbalances is about two-thirds of the entire dollar value of market imbalances. Theoretically, this is not surprising because, as pointed out by Shum et al. (2016) and Evans et al. (2018), ETF securities are used for hedging much more frequently than individual stocks. An explanation of this phenomenon based on extant economic theory is provided in
Appendix B.