1. Introduction
The construction and design of powerful statistical tests are crucial elements for both theoretical and applied scientists. The utility of a test generally depends on its degree of applicability, which is usually related to the assumptions contained in the design of the test, and the restrictions of the scientific field in which the test will be used. Nowadays, the utility of statistical tests also depends on efficiency, that is, on low computational cost and high speed, which are vital for real-time monitoring and control applications. Taking applicability and efficiency into account, in this paper we propose a new general, flexible statistical methodology to design and test central hypotheses, and we establish an asymptotic distribution theory for a wide range of tests by using the new proposed approach.
The new framework is based on symbolic analysis, which is a field of increasing interest for several scientific disciplines (see [1]). Symbolic analysis studies dynamical systems on the basis of the sequences of symbols obtained for a suitable partition of the state space, generally selected by the user. In other words, the idea behind the symbolic approach is to split the phase space into a finite number of regions and to label each region with a symbol. From this point of view, the symbolic approach is a coarse-grained description of dynamics. Like other coarse-grained methods, which are usually used to provide some description of the data generating process, symbolic analysis focuses on some essential features of the generating dynamics which are frequently of interest to the researcher, for example, (in)dependence, cycles and nonlinear structure. In general terms, it can be said that symbolic analysis allows for designing tests that only focus on the relevant information required for the problem at hand.
This approach is not new in science. In the particular case of time series analysis, the symbolic approach implies transforming raw time series into a sequence of symbols. Although seemingly counter-intuitive, symbolic analysis is rooted in information theory and also in dynamics theory. For example, properties of symbols or codes are central to the theory of communication [2]. Not in vain, there is a well-established mathematical discipline, namely, symbolic dynamics, that studies the behavior of dynamical systems. The name “symbolic dynamics” was first coined by [3], although the discipline started in 1898 with the pioneering work by Hadamard, who developed a symbolic description of sequences of geodesic flows. Interestingly, ref. [4] highlighted the power of the symbolic approach by showing that a complete description of the behavior of a dynamical system can be captured in terms of symbols. Notice that this property is crucial for the understanding of this paper, since important characteristics of a random variable can also be captured by studying the symbols derived from it.
The symbolic approach has been useful in many areas of scientific research. In the experimentalist realm, relevant contributions have been made in several fields: astrophysics; biology and medicine; chemistry, mechanical systems and fluid flow; artificial intelligence, control and communication; and data mining, classification and rule discovery ([
5,
6,
7], for an overview). Symbolic analysis has also been used in the non-experimentalist realm. In economics and finance, data are transformed and analyzed in terms of particular symbols [
8]. Two examples are the recession indicators used to study and to determine the business cycle, and the indicators used to characterize bull and bear periods in the stock market. In geography, works like that of [
9] show how qualitative variables (symbolic analysis) can be used to map descriptions. In spatial econometrics, economic spatial dependence has recently been studied by transforming data into symbols [
10,
11]. Other interesting applications are [
12,
13].
Despite all these interesting applications and the scientifically founded roots of the symbolic approach, there is no systematic body of statistical tools for conducting inference based on symbolic sequences. There are some notable exceptions: [
14,
15,
16,
17,
18,
19,
20,
21]. A common factor to all of these statistical approaches is that they are centered on ordinal patterns, which are one type of symbol. In this paper we present a novel, systematic and general framework for any potential symbol in order to test for a wide range of potential null hypotheses that include, as particular cases, most of the previously indicated multidisciplinary situations, and ordinal patterns in particular. We also provide a general asymptotic distribution theory for symbolic analysis. Particularly, this paper shows how, by means of symbols, it is possible to design nonparametric tests for a wide class of null hypotheses with special attention to limitations (restrictions) that typically appear in economics and finance. Therefore, this paper aims also to provide the theoretical basis for hypothesis testing by means of symbols.
An appealing advantage of symbolic analysis is that it requires very few assumptions about the data generating process in order to conduct statistical inference. This advantage is promising as the tools based on this method will share the model-free property, which avoids making unnecessary assumptions and provides more general results. Most of the econometric and statistical tests typically used in some of the mentioned disciplines cannot deal with potential nonlinear forms of dependence. By construction, nonlinear structures are not a limitation for symbolic analysis.
The capability of this approach is clearly illustrated by the scope of what we label “the symbolic main theorem” (SMT). Given a null hypothesis H, for example, the null of serial independence, the SMT will give us four nonparametric asymptotic tests for that null, which are distribution free. The transformation of data into symbols is done by means of a symbolization map. Some of its properties are also studied in this paper. These symbolic-based tests have to deal with ordinary statistical problems that usually appear in economics and finance, such as data scarcity and suboptimal empirical power of the test. Given the flexibility of the symbols, we provide theoretical results and strategies to overcome such difficulties.
A clear example of the power of the new tool is illustrated by the spatio-temporal data modeling issues occupying a prominent role in spatial econometrics, geography and regional science, about which we can find a vast amount of literature ([
3], and references therein). We constructed several symbolic-based tests by using the SMT. These tests also constitute an added-value of the paper, because there are currently very few available tests designed to deal with spatiotemporal dependence. The problem becomes more difficult if potential nonlinear dependence is considered. A notable exception is [
9] who has treated nonlinearity in a spatial framework.
Finally, the results of this paper might be of interest to fields of research where information theory plays a relevant role. Particularly, nonparametric entropy measures and tests for serial dependence have drawn the attention of econometricians (see [
22] and references therein). The clearest link between our results and information theory is through the concept of symbolic entropy. In the context of time series analysis, permutation entropy, which is a type of symbolic entropy, uses the probabilities of length-m ordinal patterns in the definition of Shannon entropy. An ordinal pattern is a particular type of symbolization map. Given the characteristics of this map, the SMT allows us to obtain an asymptotic distribution theory for a permutation entropy-based test. Providing the statistical foundation for permutation entropy is especially relevant because: (a) there are very few asymptotic distribution theories available for entropy, in general; and (b) permutation entropy is currently used in computer science due to its relation to “incompressibility”, and is also useful in the study of dynamical systems because of its connection to complexity.
From another point of view, some well-established nonparametric tests can be understood as particular types of symbolic analysis. For example, the nonparametric runs test for randomness by Wald-Wolfowitz (see [
23]); joint-counting procedures for spatial association [
24]; and in general, categorical data techniques [
25] are simple examples that use the very general procedures of translating information into symbols. In this regard, symbolic analysis can be understood as a method related to this literature.
The paper is organized as follows: In
Section 2, we provide the main notation and relevant concepts that will be used in the paper. Among them we highlight: symbolization maps, standard or non-standard maps and decomposable maps. Due to the generality of the method, we require the potential tests to be adaptable to different contexts, that is, to be able to deal with a wide range of null hypotheses. To this end we introduce the notion of perfect and non-perfect sets of subindexes in
Section 3. This allows us to give general theoretical results to tackle practical situations that might otherwise be intractable because of the problem and/or of the type of hypothesis. Therefore we distinguish between two main classes of theoretical situations that lead us to different statistical solutions. In
Section 4 we show how to construct symbolic-based tests via likelihood ratio statistics and via asymptotic normality.
Section 5 considers the theoretical case that the null hypothesis cannot be treated under perfect situations, and hence other results are applicable.
Section 6 puts forward the main theorem of this paper. Under the general conditions of this theorem, we introduce four tests for serial independence, four tests for spatial independence and four new tests for spatiotemporal independence, in
Section 7. These tests are based on different symbolization maps, according to those given in
Section 2 and
Section 3. Finally, in
Section 8, we outline a Monte Carlo simulation experiment to show the capabilities of the spatiotemporal test for independence under linear and nonlinear settings. The paper ends with some conclusions.
2. Notation and Definitions
As indicated in the previous section, we give some definitions and introduce the basic notation that will be used throughout the rest of the paper.
Let be a stationary real-valued process, where I is a set of indexes.
Let
be a set of
elements that we label as symbols. Now assume that there exists a map
for some subset of indexes
. We will say that
is of
-type if and only if
. We will call the map
f a symbolization map for
.
Notice that it is possible to expand the definition of a symbolization map to the
dimensional case by introducing the concept of decomposable maps: if
,
, are
k symbolization maps, then the product
is a symbolization map for the
k-dimensional variable
. We will call
F a
k-decomposable symbolization map.
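The product construction can be sketched in a few lines. The helper `product_map` and the two toy component maps below are hypothetical names introduced only for illustration; the sketch assumes that each component map is applied to its own coordinate and that the resulting symbol is the tuple of component symbols.

```python
def product_map(*maps):
    """Combine k symbolization maps into one map on k-dimensional observations."""
    def F(components):
        if len(components) != len(maps):
            raise ValueError("one component per symbolization map is required")
        # The product symbol is the tuple of the component symbols.
        return tuple(f(x) for f, x in zip(maps, components))
    return F


# Two toy symbolization maps (illustrative assumptions, not taken from the text).
def sign_symbol(x):
    return "+" if x > 0 else "-"


def parity_symbol(x):
    return "even" if x % 2 == 0 else "odd"


F = product_map(sign_symbol, parity_symbol)
symbol = F((0.3, 4))  # symbol of the 2-dimensional observation (0.3, 4)
```

If the component maps have n_1, ..., n_k symbols, the product map has at most n_1 × ... × n_k symbols, which is worth keeping in mind when data are scarce.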
Given a symbolization map and a symbol , we denote by the probability of occurrence of symbol . Symbolization maps can be classified according to their behavior under the null hypothesis. If the symbolization map f is such that under a given null hypothesis (H) all the symbols have the same probability to occur, we will say that f is a standard symbolization map. On the contrary, we will refer to f as a non-standard symbolization map.
The symbolic entropy of a process
is defined as the Shannon’s entropy of the
n distinct symbols as follows:
with the convention
Symbolic entropy, , can be understood as the information in terms of symbols of the process . Notice that . Notice also that the lower bound is attained when only one symbol occurs, and the upper bound when all n possible symbols appear with the same probability.
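As a minimal numerical sketch of this definition, the function below estimates the symbolic entropy of an observed sequence of symbols by plugging relative frequencies into Shannon's formula. The natural logarithm and the function name `symbolic_entropy` are assumptions made here for illustration; with that choice the bounds are 0 and ln n.

```python
import math
from collections import Counter


def symbolic_entropy(symbols):
    """Shannon entropy (natural log) of the empirical distribution of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one symbol is required")
    # Symbols that never occur are absent from `counts`, which implements
    # the usual convention that 0 * ln(0) contributes 0.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


h_single = symbolic_entropy(["a"] * 5)                 # only one symbol occurs
h_uniform = symbolic_entropy(["a", "b", "c", "d"] * 2)  # four equiprobable symbols
```

The first call attains the lower bound and the second attains the upper bound ln 4, matching the two extreme cases described above.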
Consider the following index
for which we define an indicator random variable
as follows:
that is, we have that
if and only if
i is of
-type, and
otherwise.
Then
is a Bernoulli variable with probability of “success”
, where “success” means that
i is of
-type. It is straightforward to see that
Our interest is in knowing how many
indexes i are of
-type for each symbol
. In order to answer the question, we construct the following counting variable:
The variable can take the values where .
To complete the notation, we will denote by
the cardinality of the subset of symbolized indexes
formed by all the elements of
-type.
Then, under the conditions above, one could easily compute the relative frequency of a symbol
by:
which is the maximum likelihood estimator of
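The indicator, counting and relative-frequency constructions of this section can be illustrated as follows. The names `indicators`, `count` and `relative_frequencies` are hypothetical, and the sketch assumes that the symbolized indexes are simply the positions of an already symbolized sequence.

```python
def indicators(symbol_sequence, symbol):
    """Bernoulli indicators: 1 where the index is of the given symbol type."""
    return [1 if s == symbol else 0 for s in symbol_sequence]


def count(symbol_sequence, symbol):
    """Counting variable: number of indexes of the given symbol type."""
    return sum(indicators(symbol_sequence, symbol))


def relative_frequencies(symbol_sequence, alphabet):
    """Relative frequency of every symbol of the alphabet (zero if unseen)."""
    total = len(symbol_sequence)
    if total == 0:
        raise ValueError("at least one symbolized index is required")
    return {s: count(symbol_sequence, s) / total for s in alphabet}


sequence = ["u", "d", "u", "u", "d", "f"]
freqs = relative_frequencies(sequence, alphabet=["u", "d", "f", "x"])
```

Each count lies between 0 and the number of symbolized indexes, and the counts over the whole alphabet add up to that number, so the relative frequencies sum to one.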
7. Different Symbolizations for Different Nulls Related with Independence
According to the general symbolic theorem, in this section we show how it is possible to test interesting null hypotheses by using symbolic analysis. Specifically, we focus on testing for different nulls of independence, as it is a well-known field of research and because recently published articles can be generally understood and extended under this new theoretical framework. Given that each null hypothesis (step 1) will require a particular symbolization map, in this section we present different symbolization procedures (step 2) to test for serial dependence, spatial dependence, and spatiotemporal dependence, respectively. Then we present the results of step 3 and step 4 depending on the statistical technique the researcher wants to use according to Theorem 2, i.e., likelihood ratio statistics and/or asymptotically normal statistics. Given a null hypothesis, the behavior of the tests obtained from this approach will strongly depend on the expertise of the researcher in constructing the symbolization map. We emphasize that both the power analysis of this class of tests and power comparisons with alternative nonparametric tests were already given in previous work [10,16]; therefore, we are not going to replicate them here.
As we have indicated, the crucial component of the symbolic procedure is to choose a symbolic mapping which ensures that the distribution of the symbols can detect deviations from the null. The null hypotheses considered in this section are related to the important topic of “statistical independence”. This is a very well-studied topic in time series analysis and therefore there is a generous number of available tests. In contrast, spatial independence is less well studied, and it is non-trivial to test for it. As we will show, a different symbolization map is needed to detect spatial patterns. Similar comments can be made for spatio-temporal independence. Needless to say, there are other hypotheses of interest in econometrics, and the researcher will have to design suitable symbolic maps for testing them. For example, in [29], the authors dealt with the opposite problem: how to test for a pure deterministic chaotic process. In these and other cases, the power of the tests will centrally depend on the ability of the researcher to design the symbolization map for the desired null hypothesis.
7.1. Serial Independence Tests
In the case of time series, refs. [
15,
16] used the following symbolization procedure to test for serial dependence: Let
be a real-valued time series (in this case the subindex
t refers to time) for which we are interested in testing the null of serial independence (step 1). In order to complete step 2, we denote by
the symmetric group of order
, that is, the group formed by all the permutations of length
m (for a positive integer
. Let
. The positive integer
m is usually known as the embedding dimension.
An ordinal pattern for a symbol is defined as
at a given time
. The time series can be embedded in an
m-dimensional space:
It is said that
t is of
type if and only if
is the unique symbol in the group
satisfying the two following conditions:
Notice that condition guarantees uniqueness of the symbol . This is justified if the values of have a continuous distribution so that equal values are very uncommon, with a theoretical probability of occurrence of 0.
In this case, the symbolization map is defined as
given by
where
is such that
t is of
-type. Now the design of the symbolization map (step 2) is completed.
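The ordinal-pattern symbolization can be sketched as below. Several conventions are assumptions made for this illustration: positions inside a window are numbered from 0, the pattern lists positions from the smallest to the largest value, and ties are resolved in favor of the earlier observation so that every window receives exactly one symbol. The function names are hypothetical.

```python
def ordinal_pattern(window):
    """Permutation of positions that sorts the window in increasing order.

    Python's sort is stable, so equal values keep their time order, which
    makes the symbol unique even when ties occur.
    """
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))


def symbolize(series, m):
    """Ordinal pattern of every window of m consecutive observations."""
    if m < 2:
        raise ValueError("the embedding dimension must be at least 2")
    return [ordinal_pattern(series[t:t + m]) for t in range(len(series) - m + 1)]


patterns = symbolize([1.0, 2.0, 3.0, 2.0], m=3)
```

For embedding dimension m there are m! possible patterns, so m should be kept small relative to the length of the series.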
Moreover, under the null of independence the distribution of the symbols is uniform and therefore the map is a standard symbolization map. Additionally, the set of symbolized indexes is which is not perfect.
Notice that in order to have a perfect set and therefore ensure the independence of the indicator variables
, it is enough to consider as a set of symbolized indexes
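One index set with this property, assumed here for illustration, keeps only windows that do not share observations: when the observations are independent, symbols computed from disjoint windows are independent as well. The sketch uses 0-based time indexes and hypothetical function names.

```python
def overlapping_starts(T, m):
    """Start indexes of all windows of length m in a series of length T."""
    return list(range(0, T - m + 1))


def non_overlapping_starts(T, m):
    """Start indexes of disjoint windows of length m (step m instead of 1)."""
    return list(range(0, T - m + 1, m))


all_starts = overlapping_starts(T=10, m=3)
perfect_starts = non_overlapping_starts(T=10, m=3)
```

The price of independence is sample size: the number of symbolized indexes falls from T − m + 1 to roughly T / m.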
Accordingly, using this symbolization map, the next corollary straightforwardly follows from Theorem 2:
Corollary 1. Let be the symbolization map defined in (6) with . Denote by the permutation entropy defined in (1). If the time series is independent, then
These results for permutation entropy relate to a relatively recent line of research based on order patterns for analyzing time series. Ordinal patterns can be used, per se, for descriptive purposes, like autocorrelation, with the added advantage that they require no assumptions such as Gaussianity or linearity; only mild stationarity conditions are required of the underlying process. The above corollary is a further step in the development of statistical inference for ordinal time series. Naturally, it is possible to obtain other kinds of statistical results by adding more assumptions to the generating process. In fact, notable results can be found in [4] if Gaussianity and ergodicity are assumed. In this regard, our asymptotic results for order patterns keep assumptions at a minimum. Additionally, by maintaining general applicability at minimum cost (in terms of assumptions) for serial independence tests, some bootstrap-based statistics for ordinal patterns have been put forward in [29].
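For a standard map, an entropy-based likelihood ratio statistic can be computed directly from the symbol counts. The sketch below uses the classical multinomial identity 2 Σ n_σ ln(n_σ / (N/n)) = 2N(ln n − ĥ), where N is the number of symbolized indexes, n the number of symbols and ĥ the estimated symbolic entropy; for independent indicators this quantity is asymptotically chi-square with n − 1 degrees of freedom under a uniform null. The function name `g_statistic` is an assumption, and the sketch is not claimed to reproduce the exact statistics of the corollary.

```python
import math
from collections import Counter


def g_statistic(symbols, n_symbols):
    """Likelihood ratio statistic against a uniform symbol distribution.

    Returns the statistic and its asymptotic chi-square degrees of freedom.
    """
    counts = Counter(symbols)
    N = sum(counts.values())
    if N == 0:
        raise ValueError("at least one symbol is required")
    h_hat = -sum((c / N) * math.log(c / N) for c in counts.values())
    return 2.0 * N * (math.log(n_symbols) - h_hat), n_symbols - 1


# Perfectly balanced counts give a statistic of (numerically) zero,
# while a single repeated symbol gives the largest possible value.
g_balanced, df = g_statistic(["a", "b", "c"] * 10, n_symbols=3)
g_extreme, _ = g_statistic(["a"] * 10, n_symbols=6)
```

The null of independence is rejected when the statistic exceeds the chosen chi-square critical value, i.e., when the estimated entropy is too far below its maximum ln n.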
An interesting property of the symbolization procedure presented in this section is that it can also be used for discrete distributions. To do so it is necessary to consider a non-standard version of the map. Under such circumstances, the likelihood ratio (
2) can be directly used once the behavior of
is known under the null of serial independence.
7.2. Spatial Independence Tests
In the case of spatial processes, ref. [
10] gave a symbolization procedure to test for spatial independence as follows: Let
be a real-valued spatial process, where
S is a set of coordinates. Given a location
, we will denote by
the polar coordinates of location
taking as origin
.
Let
with
. Consider now that the spatial process
is embedded in a different
m-dimensional space as follows:
where
are the
nearest neighbors to
, which are ordered from smallest to largest Euclidean distance with respect to location
. Notice that in the case of two or more locations being equidistant to
, we will choose them in an anticlockwise manner. In formal terms,
are the
nearest neighbors to
satisfying the following two conditions:
Notice that conditions and ensure the uniqueness of for all .
The proposed standard symbolization map
f is defined as follows: denote by
the median of the spatial process
and let
Now, define the indicator function
Then, the standard symbolization map
is defined as:
where
stands for the set of symbols defined by
Notice that under the null of spatial independence, the distribution of the symbols is uniform and therefore the map is a standard symbolization map.
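A possible implementation of a median-based spatial symbolization is sketched below. Several details are assumptions made only for this illustration: a location is coded 1 when its value exceeds the sample median, the symbol records for each of the m − 1 nearest neighbors whether it lies on the same side of the median as the location itself, and equidistant neighbors are ordered by increasing polar angle (anticlockwise, starting from the positive horizontal axis). All function names are hypothetical.

```python
import math
from statistics import median


def nearest_neighbours(coords, s, k):
    """Indexes of the k locations closest to location s.

    Ties in distance are broken anticlockwise by polar angle around s.
    """
    x0, y0 = coords[s]

    def key(j):
        dx, dy = coords[j][0] - x0, coords[j][1] - y0
        return (math.hypot(dx, dy), math.atan2(dy, dx) % (2 * math.pi))

    others = [j for j in range(len(coords)) if j != s]
    return sorted(others, key=key)[:k]


def spatial_symbol(values, coords, s, m):
    """Binary symbol of length m - 1 for location s."""
    med = median(values)

    def side(j):
        return 1 if values[j] > med else 0

    neighbours = nearest_neighbours(coords, s, m - 1)
    return tuple(1 if side(j) == side(s) else 0 for j in neighbours)


# 3 x 3 regular lattice; location 4 is the center.
coords = [(x, y) for y in range(3) for x in range(3)]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
symbol = spatial_symbol(values, coords, s=4, m=4)
```

With this coding there are 2^(m−1) possible symbols, so the embedding dimension again trades resolution against the number of observations available per symbol.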
Moreover, in this case is not a perfect symbolized set. To construct a perfect symbolized set , one can proceed as follows. Take a location at random. Let be the set of nearest neighbors to s. Now select the following element in by taking such that . Then construct recursively the set by taking satisfying for all with .
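The recursive construction can be approximated by a greedy pass over the locations in random order. The sketch assumes that the selection criterion is that the neighborhood of a new location (the location together with its nearest neighbors) shares no element with the neighborhoods already selected; the function name and the dictionary-based input are hypothetical.

```python
import random


def perfect_index_set(neighbourhoods, rng=None):
    """Greedy selection of locations with pairwise disjoint neighbourhoods.

    `neighbourhoods` maps each location to the collection of its neighbours.
    """
    if rng is None:
        rng = random.Random(0)
    order = list(neighbourhoods)
    rng.shuffle(order)  # the starting location is taken at random
    used, selected = set(), []
    for s in order:
        surrounding = {s, *neighbourhoods[s]}
        if used.isdisjoint(surrounding):
            selected.append(s)
            used |= surrounding
    return selected


# Toy example: six locations on a ring, each with one neighbour.
ring = {i: [(i + 1) % 6] for i in range(6)}
selected = perfect_index_set(ring, random.Random(1))
```

Because the result depends on the random starting point, different runs may produce index sets of different sizes; a larger set gives more symbolized indexes for the test.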
As is evident, the method is flexible enough to allow researchers to select their own set and map of symbols for a given null. For example, if under the previous symbolization procedure, the power (or size) of the test is not satisfactory, one can always consider other possible symbolization procedures for the same null and for the same spatial process
. Let
. Again, let
be the set of nearest neighbors to
s and let
be its cardinality. Denote by
. Denote by
and
the
i-th quantile of the variables
X and
respectively, for
. We will denote by
(resp
) and
(resp.
). Then we define the symbolization map
if and only if
and
.
Again, under the null of independence the distribution of the symbols is uniform and therefore the map is a standard symbolization map.
Again, the same set of recursively constructed symbolized indexes ensures the independence of the indicator variables . Accordingly, using this symbolization map, the next corollary straightforwardly follows from Theorem 2:
Corollary 2. Let , be the symbolization maps defined in (8) and (10) with . Denote by the symbolic entropy defined in (1). If the spatial process is independent, it follows that:
In Section 2 we indicated that there is a class of symbolization maps that are non-standard. Consider a situation in which a reduction in the number of possible symbols under study will benefit the behavior and properties of the test. In this, and other potential situations, non-standard maps might be useful. As an example, we now construct a non-standard symbolization map to test for independence in the spatial context. The following symbolization is an example of the more general procedure that we give in Appendix A.3.
Consider again the set
of symbols defined in (
9) for a fixed embedding dimension
m. Now we will denote by
the remainder of the division of
a by
.
Now define the following equivalence relation ∼:
if and only if there exists an integer
k such that
for all
.
Now we consider as a set of symbols the set of classes in modulo the equivalence relation ∼.
Notice that, in general, in this case not all the symbols in have the same probability of occurring, and therefore the symbolization map is non-standard.
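The loss of uniformity can be checked by enumeration. The sketch below assumes that the original symbols are binary vectors and that two symbols are equivalent when one is a cyclic shift of the other; both the interpretation and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def rotation_class(symbol):
    """Canonical representative: the smallest cyclic rotation of the symbol."""
    symbol = tuple(symbol)
    m = len(symbol)
    return min(tuple(symbol[(i + k) % m] for i in range(m)) for k in range(m))


def class_probabilities(length):
    """Class probabilities when all binary symbols of this length are equiprobable."""
    counts = Counter(rotation_class(s) for s in product((0, 1), repeat=length))
    total = 2 ** length
    return {cls: n / total for cls, n in counts.items()}


probs = class_probabilities(3)
```

For length 3 the eight original symbols collapse into four classes with probabilities 1/8, 3/8, 3/8 and 1/8, so the reduced map is non-standard even though the original one is standard.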
7.3. Spatiotemporal Independence Tests
The issues related to spatiotemporal data modeling occupy a prominent role in current econometrics, where we can find recent literature devoted to this topic (see [
9,
30]). Spatiotemporal dependence introduces considerable difficulties with respect to modeling, computation and statistical theory. If independence can be taken for granted, and likewise the common assumption of cross-sectional independence, then computations and the application of inference rules simplify significantly. It seems reasonable therefore to test first for spatiotemporal independence, and if the evidence for independence is strong, then proceed with the well-known methods. Unfortunately, tests for spatiotemporal independence are scarce. The aim of this section is twofold: to contribute to this rather scarce literature, and to highlight the usefulness of the novel general method presented in this paper. To this end we consider the relevant null of spatiotemporal independence. Of particular interest for our tests is that dependence is not taken as synonymous with correlation, and therefore nonlinearities are not restrictions for our test.
Consider the process . As in the previous cases, one can define several standard and non-standard symbolization maps. For simplicity, we adapt the previous symbolizations to the spatiotemporal case as follows:
For a fixed location
define
as the time series
. Similarly for a fixed period
we define
as the spatial process
Let
with
be the time and space embedding dimensions respectively. Then under this setting we define the following decomposable symbolization maps
for
and 4 defined by:
where
and
for
are defined as above.
Notice that, when testing for spatiotemporal independence, when the symbolization map is standard, while for is non-standard.
It is also possible to define an extension of the symbolization map
in a spatiotemporal context. Indeed, consider the following map:
defined by
where
for all
and the indicator function
is defined as in (
7).
Accordingly, using this symbolization map, the next corollary straightforwardly follows from Theorem 2:
Corollary 3. Let and be the standard symbolization maps defined in (11) with and in (12) respectively. Denote by and the symbolic entropy defined in (1). If the spatiotemporal process is independent, then:
8. Empirical Behavior of the Tests for Spatiotemporal Independence
In this section we evaluate the empirical behavior of the STG test with different configurations for the subset . The first aim of this section is to show the flexibility of Corollary 3 to cope with different scenarios. The second is to evaluate the empirical behavior of the new test. The third is to evaluate the effect of the selection of on the empirical size and power of the test.
To those ends we designed a Monte Carlo experiment as follows. First, we consider the problem of testing for independence on regular lattices of several orders: R = 64 (8 × 8) with T = 150, and R = 100 (10 × 10), for which we consider two possible temporal scenarios depending on data availability, T = 200 and T = 800. We also simulated richer regular lattices of order R = 400 (20 × 20), although on this occasion we only considered T = 200. The symbolization map follows from (
12) with
and
. The test under study was generated from Corollary 3 under Expression (
13). Therefore, we used a perfect indexes subset. This subset was constructed recursively, as indicated in
Section 7:
, where
is the set formed by
the three nearest neighbors of
in
and the four spatial locations in the next time period. The power of the test is evaluated with the following DGPs:
where
, which was also used for evaluating the empirical size of the test. Parameters
intensified temporal and spatial dependencies, respectively, and
was fixed at five in all simulations. The weighting matrix W has been specified as a binary type using a contiguity criterion and rook-type movements.
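A binary rook-contiguity matrix on a regular lattice can be built as follows. The row-major numbering of the cells and the function name `rook_contiguity` are assumptions made for this sketch.

```python
def rook_contiguity(rows, cols):
    """Binary weighting matrix: W[i][j] = 1 if cells i and j share an edge."""
    n = rows * cols
    W = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            # Rook moves: one step up, down, left or right.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[i][rr * cols + cc] = 1
    return W


W = rook_contiguity(8, 8)  # the R = 64 lattice
```

Corner cells have two neighbors, edge cells three and interior cells four, and the matrix is symmetric with a zero diagonal.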
Table 1 collects the empirical size and power of the STG test over 1000 replications. It is straightforward to observe that the size is controlled, and the test is powerful. For a low intensity level of the parameter (
the test is absolutely powerful. We have to set
(or below) to lose power. This holds regardless of the DGP under consideration.
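Empirical size and power are rejection frequencies over simulated samples. The sketch below shows the mechanics on a deliberately simplified case: a serial independence test based on ordinal patterns with m = 3 and non-overlapping windows, not the STG experiment of this section. The AR(1) alternative, the sample sizes, the seed and all function names are assumptions chosen for illustration; 11.0705 is the 0.95 quantile of the chi-square distribution with 5 degrees of freedom.

```python
import math
import random
from collections import Counter

CHI2_5DF_95 = 11.0705  # 5% critical value, chi-square with 3! - 1 = 5 df


def g_stat(series, m=3):
    """Entropy-based likelihood ratio statistic on disjoint windows."""
    windows = [series[t:t + m] for t in range(0, len(series) - m + 1, m)]
    patterns = [tuple(sorted(range(m), key=w.__getitem__)) for w in windows]
    counts = Counter(patterns)
    N = len(patterns)
    h = -sum((c / N) * math.log(c / N) for c in counts.values())
    return 2.0 * N * (math.log(math.factorial(m)) - h)


def ar1(T, phi, rng, burn=50):
    """Gaussian AR(1) series; phi = 0 gives independent observations."""
    x, out = 0.0, []
    for t in range(T + burn):
        x = phi * x + rng.gauss(0.0, 1.0)
        if t >= burn:
            out.append(x)
    return out


def rejection_rate(simulate, reps, critical):
    """Share of simulated samples for which the null is rejected."""
    return sum(g_stat(simulate()) > critical for _ in range(reps)) / reps


rng = random.Random(123)
size = rejection_rate(lambda: ar1(600, 0.0, rng), 400, CHI2_5DF_95)
power = rejection_rate(lambda: ar1(600, 0.8, rng), 400, CHI2_5DF_95)
```

The first rate estimates the empirical size (it should be close to the nominal 5%) and the second estimates the power against this particular alternative; more replications reduce the Monte Carlo error of both.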
Regular spatiotemporal configurations are interesting because (1) time series posit a natural order for observations, (2) lattice data provide the simplest extension of time series and (3) some scientific methods are compatible with this spatiotemporal configuration. However, irregular patterns occur frequently with spatial data. In geographical settings, data are liable to be recorded across heterogeneously-sized administrative regions, while economic distances do not correspond to regular spacing. Therefore, it is also useful to adapt the STG symbolic test to irregular spatiotemporal settings. In terms of our general methodology (see Corollary 3) this problem is tractable by considering the symbolization map where we control the dependence among the indicators by controlling on average the cardinality of the sets . Particularly, we will select the set of indexes such that i.e., the average of the cardinality of the sets is less than half of the number of spatiotemporal neighbors.
Therefore, to complete the experiment (in the case of nonperfect lattices) we evaluate the STG-version for irregular lattices where the coordinates of each spatial location are drawn from an N(0,1) distribution. We have considered the three nearest neighbors for irregular lattices. Afterwards, the resulting matrix was row-standardised in the usual way.
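The irregular configuration can be generated along the following lines. The number of locations (36), the seed and the function name `knn_weights` are assumptions made for this sketch; the three nearest neighbors and the row-standardisation follow the description above.

```python
import math
import random


def knn_weights(coords, k=3):
    """Row-standardised weighting matrix based on the k nearest neighbours."""
    n = len(coords)
    W = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi),
        )
        for j in others[:k]:
            W[i][j] = 1.0 / k  # each row sums to one after standardisation
    return W


rng = random.Random(42)
coords = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(36)]
W = knn_weights(coords, k=3)
```

Unlike the rook-contiguity matrix of a regular lattice, this matrix is generally not symmetric: location j can be among the nearest neighbors of i without i being among those of j.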
Table 2 collects the size and power for models constructed from DGP1 and DGP2. The introduction of irregular lattices has led us to introduce non-perfect indexes, and accordingly the size of the test slightly increased, although the levels seem acceptable, particularly for large samples. Power is similar to that obtained with perfect indexes, and therefore the same comments apply (similar results are obtained when using the multivariate normal approximation).
Comparison with Another Spatiotemporal Test for Independence
We now confront our test with an unfavorable scenario characterized by a small amount of available data on irregular lattices, in both linear and nonlinear setups. To this end we consider the following pairs of sample sizes: (36 × 10), (64 × 10), (100 × 10), (100 × 30) and (200 × 10). According to our theoretical discussion, given data scarcity and irregular spatial configuration, we use the non-perfect subset of indexes. Additionally, we consider the symbolization map based on equivalence relations
as depicted in
Appendix A.3 for nonstandard maps.
To complete the empirical study, we compare our test with another nonparametric spatiotemporal test [
31] which is described in
Appendix A.4 and we refer to it as STBP. Notice, however, that the STBP test requires one to correctly specify the weighting matrix, W; this is not a requirement for the symbolic test.
In terms of empirical size, both tests behave similarly well for linear processes (
Table 3). On the contrary, for the nonlinear processes (
Table 4), the size of the STBP test is poor, while the symbolic-based test performs as expected. In terms of empirical power, the STBP test outperforms the STG test, especially for low intensity levels of dependence in the case of the linear process. However, under a nonlinear spatiotemporal configuration, the STG clearly presents a better balance between size and power and outperforms the STBP in all cases.
9. Conclusions
Central null hypotheses in experimental and non-experimental branches of science can be easily tested by means of symbolized information. This paper provides the analytical tools to construct nonparametric hypothesis tests based on symbols. These tools are able to cope with different null hypotheses and with distinct scenarios in which some realistic limitations might be imposed on test designs.
A shared characteristic of all these symbolic test families is that few assumptions are needed to obtain asymptotic results. Therefore, general applicability of this method is guaranteed. In particular, in this paper we have shown that two well-known symbolic-based tests are particular cases of the main symbolic theorem (Theorem 2), which is stated in this paper for the first time. Furthermore, a set of new symbolic-based tests for spatiotemporal independence is put forward by using the main results of this paper collected under the main symbolic theorem (Theorem 2). Monte Carlo simulations provide evidence of the high power of the proposed test. Currently, there are circumstances where speed, robustness to noise or computational cost are paramount, and these favor fruitful applications of symbolic analysis.
Several lines of further research are worth pursuing. We now indicate some of them on which we and other scholars are currently working: (i) One of the appealing properties of symbolic-based testing is that it requires few assumptions. In this paper we have assumed stationarity; however, it would be interesting to study whether it is possible to be less restrictive. (ii) In the context of time series analysis, most available techniques require the existence of second moments; however, by using certain symbolizations, it might be possible to waive this requirement. This will allow time series researchers to consider a wider variety of model classes. (iii) One of the main contributions of the paper is that it suggests that researchers can design a symbolization procedure (map) to test null hypotheses. It would be interesting to study what types of null hypotheses are more suitable to analysis using symbolic maps.