Article

Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method

Department of Business Administration and Economics, Bielefeld University, Universitaetsstrasse 25, 33615 Bielefeld, Germany
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(4), 436; https://doi.org/10.3390/e23040436
Submission received: 19 February 2021 / Revised: 26 March 2021 / Accepted: 31 March 2021 / Published: 8 April 2021
(This article belongs to the Special Issue Time Series Modelling)

Abstract: This paper investigates the asymptotic properties of estimators obtained from the so-called CVA (canonical variate analysis) subspace algorithm proposed by Larimore (1983) in the case when the data is generated using a minimal state space system containing unit roots at the seasonal frequencies such that the yearly difference is a stationary vector autoregressive moving average (VARMA) process. The empirically most important special cases of such data generating processes are the I(1) case as well as the case of seasonally integrated quarterly or monthly data. However, datasets with a higher sampling rate, such as hourly, daily or weekly observations, are increasingly available, for example for electricity consumption. In these cases the vector error correction representation (VECM) of the vector autoregressive (VAR) model is not very helpful, as it demands the parameterization of one matrix per seasonal unit root. Even for weekly series this amounts to 52 matrices using yearly periodicity; for hourly data this is prohibitive. For such processes quasi maximum likelihood estimation is extremely hard, since the Gaussian likelihood typically has many local maxima while the parameter space often is high-dimensional. Additionally, estimating a large number of models to test hypotheses on the cointegrating rank at the various unit roots becomes practically impossible, for example for weekly data. This paper shows that in this setting CVA provides consistent estimators of the transfer function generating the data, making it a valuable initial estimator for subsequent quasi-likelihood maximization. Furthermore, the paper proposes new tests for the cointegrating rank at the seasonal frequencies, which are easy to compute and numerically robust, making the method suitable for automatic modeling. A simulation study demonstrates by example that for processes of moderate to large dimension the new tests may outperform traditional tests based on long VAR approximations in sample sizes typically found in quarterly macroeconomic data. Further simulations show that the unit root tests are robust with respect to different distributions for the innovations as well as with respect to GARCH-type conditional heteroskedasticity. Moreover, an application to Kaggle data on hourly electricity consumption by different American providers demonstrates the usefulness of the method for applications. Therefore the CVA algorithm provides a very useful initial guess for subsequent quasi maximum likelihood estimation and also delivers relevant information on the cointegrating ranks at the different unit root frequencies. It is thus a useful tool, for example (but not only) in automatic modeling applications where a large number of time series involving a substantial number of variables need to be modeled in parallel.
JEL Classification:
C13; C32

1. Introduction

Many time series show seasonal patterns that, according to [1] for example, cannot be modeled appropriately using seasonal dummies because they exhibit a slowly trending behavior typical for unit root processes.
To model such processes in the vector autoregressive (VAR) framework, Ref. [2] (abbreviated as JS in the following) extend the error correction representation for seasonally integrated autoregressive processes pioneered by [3] to the multivariate case. This vector error correction formulation (VECM) models the yearly differences of a process observed $S$ times per year. The model includes systems having unit roots at some or all of the possible locations $z_j = \exp(2\pi i j/S)$, $j = 0, \ldots, S-1$, of seasonal unit roots. In JS all unit roots are assumed to be simple such that the process of yearly differences is stationary.
In this setting JS propose an estimator for the autoregressive polynomial subject to restrictions on its rank (the so-called cointegrating rank) at the unit roots $z_j$, based on an iterative scheme focusing on one pair of complex-conjugated unit roots (or the real unit roots $z_j = 1$ or $z_j = -1$, respectively) at a time. The main idea here is the reformulation of the model using the so-called vector error correction representation. Besides estimators, JS also derive likelihood ratio tests for the cointegrating rank at the various unit roots.
Refs. [4,5] propose simpler estimation schemes based on complex reduced rank regression (cRRR in the following). They also show that their numerically simpler algorithm leads to test statistics for the cointegrating rank that are asymptotically equivalent to the quasi maximum likelihood tests of JS. These schemes still typically alternate between cRRR problems corresponding to different unit roots until convergence, although a one-step version estimating only once at each unit root exists. Ref. [6] provides updating equations for quasi maximum likelihood estimation in situations where constraints on the parameters prohibit focusing on one unit root at a time.
The leading case here is that of quarterly data ($S = 4$), where potential unit roots are located at $\pm 1$ and $\pm i$, implying that the VECM representation contains four potentially rank restricted matrices. However, increasingly time series of much higher sampling frequency, such as hourly, daily or weekly observations, are available. In such cases it is unrealistic that all unit roots are present. If a unit root is not present, the corresponding matrix in the VECM is of full rank. Therefore, in situations with only a few unit roots present, the VECM requires a large number of parameters to be estimated. Also, in cases with a long period length (such as, for example, hourly data with yearly cycles), the VECM involves the estimation of the coefficient matrices at all lags covering at least one year.
In general, for processes of moderate to large dimension the VAR framework involves estimation of a large number of parameters which potentially can be avoided by using the more flexible vector autoregressive moving average (VARMA) or the—in a sense—equivalent state space framework. This setting has been used in empirical research for the modeling of electricity markets, see the survey [7] for a long list of contributions. In particular, ref. [8] use the model described below without formal verification of the asymptotic theory for the quasi maximum likelihood estimation.
Recently, ref. [9] show that in the setting of dynamic factor models, typically used for observation processes of high dimension, the common assumption that the factors are generated using a vector autoregression jointly with the assumption that the idiosyncratic component is white noise (or more generally generated using a VAR or VARMA model independent of the factors) leads to a VARMA process. Also a number of papers (see for example [10,11,12]) show that in their empirical application the usage of VARMA models instead of approximations using the VAR model leads to superior prediction performance. This, jointly with the fact that the linearization of dynamic stochastic general equilibrium models (DSGE) leads to state space models, see e.g., [13], has fuelled recent interest in VARMA—and thus state space—modeling in particular in macroeconomics, see for example [14].
In this respect, quasi maximum likelihood estimation is the most often used approach for inference. Due to the typically highly non-convex nature of the quasi likelihood function (using the Gaussian density) in the VARMA setting, the criterion function shows many local maxima where the optimization can easily get stuck. Randomization alone does not solve the problem efficiently, as the parameter space typically is high-dimensional, causing problems of the curse-of-dimensionality type.
Moreover, VARMA modeling requires a full specification of the state space unit root structure of the process, see [15]. The state space unit root structure specifies the number of common trends at each seasonal frequency (see below for definitions). For data of weekly or higher sampling frequency it is unlikely that the state space unit root structure is known prior to estimation. Testing all possible combinations is numerically infeasible in many cases.
As an attractive alternative in this respect, the class of subspace algorithms is investigated in this paper. One particular member of this class, the so-called canonical variate analysis (CVA) introduced by [16] (in the literature the algorithm is often called canonical correlation analysis; CCA), has been shown to provide system estimators which (under the assumption of known system order) are asymptotically equivalent to quasi maximum likelihood estimation (using the Gaussian likelihood) in the stationary case [17]. CVA shares a number of robustness properties in the stationary case with VAR estimators: [18] shows that CVA produces consistent estimators of the underlying transfer function in situations where the innovations are conditionally heteroskedastic processes of considerable generality. Ref. [19] shows that CVA provides consistent estimators of the transfer function even for stationary fractionally integrated processes, if the order of the system tends to infinity as a function of the sample size at a sufficient rate.
In the I(1) case, [20] introduce a heuristic adaptation of the algorithm using the assumption of known cointegrating rank in order to show consistency for the corresponding transfer function estimators. However, the specification of the cointegrating rank is no easy task in itself. In case of misspecification of the cointegrating rank the properties of this approach are unclear. Ref. [21] states without proof that the original CVA algorithm also delivers consistent estimates in the I(1) case, without the need to impose the true cointegrating rank.
Furthermore, for I(1) processes [20] proposed various tests for the cointegrating rank and compared them to tests in the Johansen framework, showing superior finite sample performance, in particular for multivariate data sets with a large dimension of the modeled process.
This paper builds on these results and shows that CVA can also be used in the seasonally integrated case. The main contributions of the paper are:
(i)
It is shown that the original CVA algorithm in the seasonally integrated case provides strongly consistent system estimators under the assumption of known system order (thus delivering the currently unpublished proof of the claim in the I(1) case in [21]).
(ii)
Upper bounds for the order of convergence for the estimated system matrices are given, establishing the familiar superconsistency for the estimation of the cointegrating spaces at all unit roots.
(iii)
Several tests for the separate determination of the seasonal cointegrating ranks (that is, for each unit root irrespective of the specification at the other potential unit roots) are proposed, which are based on the estimated systems and are simple to implement.
The derivation of the asymptotic properties of the estimators is complemented by a simulation study and an application, both demonstrating the potential of CVA and one of the suggested tests. Jointly our results imply that CVA constitutes a very reasonable initial estimate for subsequent quasi likelihood maximization in the VARMA case. Moreover the method provides valuable information on the number of unit roots present in the process, which can be used for subsequent investigation at the very least by providing upper bounds on the number of common trends present at each unit root frequency. Contrary to the JS approach in the VAR framework these tests can be performed in parallel for all unit roots, eliminating the interdependence of the results inherent in the VECM representation. Moreover, they do not use the VECM representation involving a large number of parameters in the case of high sampling rates.
These properties make CVA a useful tool in automatic modeling of multivariate (with a substantial number of variables) seasonally (co-)integrated processes.
The paper is organized as follows: in the next section the model set and the main assumptions of the paper are presented. The estimation methods are described in Section 3. Section 4 states the consistency results. Inference on the cointegrating ranks is proposed in Section 5. Data preprocessing is discussed in Section 6. The simulations are contained in Section 7, while Section 8 discusses an application to real world data. Section 9 concludes the paper. Appendix A contains supporting material, Appendix C provides the proofs of the main results of this paper, which are based on preliminary results presented in Appendix B.
Throughout the paper we will use the symbols $o(g_T)$ and $O(g_T)$ to denote orders of almost sure convergence, where $T$ denotes the sample size, i.e., $x_T = o(g_T)$ if $x_T/g_T \to 0$ almost surely and $x_T = O(g_T)$ if $x_T/g_T$ is bounded almost surely for large enough $T$ (that is, there exists a constant $M < \infty$ such that $\limsup_{T\to\infty} x_T/g_T \le M$ a.s.). Furthermore, $o_P(g_T)$ and $O_P(g_T)$ denote the corresponding in-probability versions.

2. Model Set and Assumptions

In this paper state space processes $(y_t)_{t\in\mathbb{Z}}$, $y_t \in \mathbb{R}^s$, are considered which are defined as the solutions to the following equations for given white noise $(\varepsilon_t)_{t\in\mathbb{Z}}$, $\varepsilon_t \in \mathbb{R}^s$, $E\varepsilon_t = 0$, $E\varepsilon_t\varepsilon_t' = \Omega > 0$:
$$x_{t+1} = A x_t + K\varepsilon_t, \qquad y_t = C x_t + \varepsilon_t. \tag{1}$$
Here $x_t \in \mathbb{R}^n$ denotes the unobserved state and $A \in \mathbb{R}^{n\times n}$, $C \in \mathbb{R}^{s\times n}$ and $K \in \mathbb{R}^{n\times s}$ define the state space system, typically written as the tuple $(A, C, K)$.
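To fix ideas, the following minimal sketch simulates a trajectory of (1); the Gaussian innovations with identity covariance and the zero initial state are illustrative assumptions made here, not requirements of the model.

```python
import numpy as np

def simulate_state_space(A, C, K, T, rng=None):
    """Simulate x_{t+1} = A x_t + K eps_t, y_t = C x_t + eps_t (system (1)).

    Illustrative assumptions: Gaussian innovations with identity covariance
    and a zero (deterministic) initial state."""
    rng = np.random.default_rng(rng)
    n, s = A.shape[0], C.shape[0]
    x = np.zeros(n)                          # x_1 = 0
    y = np.empty((T, s))
    for t in range(T):
        eps = rng.standard_normal(s)
        y[t] = C @ x + eps                   # observation equation
        x = A @ x + K @ eps                  # state transition
    return y

# Example: a bivariate process driven by a single random-walk state.
A = np.array([[1.0]]); C = np.array([[1.0], [1.0]]); K = np.array([[0.5, 0.5]])
y = simulate_state_space(A, C, K, T=200, rng=0)
```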
In this paper we consider without restriction of generality only minimal state space systems in innovations representation. For a minimal system the integer $n$ is called the order of the system. As is well known (cf. e.g., [22]), minimal systems are only identified up to the choice of the basis of the state space. Two minimal systems $(A, C, K)$ and $(\tilde A, \tilde C, \tilde K)$ are observationally equivalent if and only if there exists a nonsingular matrix $T \in \mathbb{R}^{n\times n}$ such that $A = T\tilde A T^{-1}$, $C = \tilde C T^{-1}$, $K = T\tilde K$. For two observationally equivalent systems the impulse response sequences $k_0 = I_s$, $k_{j+1} = C A^j K = \tilde C\tilde A^j\tilde K$, $j = 0, 1, \ldots$, coincide.
Ref. [15] shows that the structure of the Jordan normal form of the matrix $A$ determines the properties (such as stationarity) of the solutions to (1) for $t\in\mathbb{Z}$. Eigenvalues of $A$ on the unit circle lead to unit root processes in the sense of [15], who also defines a state space unit root structure indicating the location and multiplicity of unit roots. A process $(y_t)_{t\in\mathbb{Z}}$ with state space unit root structure $\Omega_S = \{(0, (c_0)), (2\pi/S, (c_1)), \ldots, (\pi, (c_{S/2}))\}$ for some even integer $S$ is called multi frequency I(1) (in short MFI(1)). (Even $S$ is chosen because it simplifies the notation by implying that $S/2$ also is an integer and $z = -1$ is a seasonal unit root. By adjusting the notation appropriately all results hold true for odd $S$ as well.)
If, moreover, such a process is observed for S periods per year, it is called seasonal MFI(1). In this case the canonical form in [15] takes the following form:
$$A = \operatorname{diag}(A_0, A_1, \ldots, A_{S/2}, A_\bullet), \quad A_0 = I_{c_0}, \quad A_j = \begin{bmatrix} \cos(\omega_j) I_{c_j} & \sin(\omega_j) I_{c_j} \\ -\sin(\omega_j) I_{c_j} & \cos(\omega_j) I_{c_j} \end{bmatrix},\ 0 < j < S/2, \quad A_{S/2} = -I_{c_{S/2}},$$
$$C = \big[\,C_{0,R} \mid C_{1,R}, C_{1,I} \mid \cdots \mid C_{S/2-1,R}, C_{S/2-1,I} \mid C_{S/2} \mid C_\bullet\,\big] = \big[\,C_0 \mid C_1 \mid \cdots \mid C_{S/2-1} \mid C_{S/2} \mid C_\bullet\,\big],$$
$$K = \big[\,K_{0,R}' \mid K_{1,R}', K_{1,I}' \mid \cdots \mid K_{S/2-1,R}', K_{S/2-1,I}' \mid K_{S/2}' \mid K_\bullet'\,\big]' \tag{2}$$
where $\omega_j = 2\pi j/S$, $j = 0, \ldots, S/2$, denote the unit root frequencies and $C_{j,R} \in \mathbb{R}^{s\times c_j}$, $C_{j,I} \in \mathbb{R}^{s\times c_j}$, $K_{j,R} \in \mathbb{R}^{c_j\times s}$, $K_{j,I} \in \mathbb{R}^{c_j\times s}$, where $0 \le c_j \le s$, $0 \le j \le S/2$. Furthermore, for $C_{j,C} := C_{j,R} - i\,C_{j,I}$ it holds that $C_{j,C}^{*} C_{j,C} = I_{c_j}$, and $K_{j,C} = K_{j,R} + i\,K_{j,I}$ is of full row rank and positive upper triangular ($C_{0,I} = C_{S/2,I} = 0$, $K_{0,I} = K_{S/2,I} = 0$), see [15] for details. Finally $|\lambda_{max}(A_\bullet)| < 1$, where $\lambda_{max}(A_\bullet)$ denotes an eigenvalue of the matrix $A_\bullet$ with maximal modulus. The stable subsystem $(A_\bullet, C_\bullet, K_\bullet)$ is assumed to be in echelon canonical form (see [22]).
Using this notation the assumptions on the data generating process (dgp) in this paper can be stated as follows:
Assumption 1. $(y_t)_{t\in\mathbb{Z}}$ has a minimal state space representation $(A, C, K)$, $A \in \mathbb{R}^{n\times n}$, of the form (2) with minimal $(A_\bullet, C_\bullet, K_\bullet)$, $A_\bullet \in \mathbb{R}^{n_\bullet\times n_\bullet}$, in echelon canonical form, where $c = n - n_\bullet > 0$.
Furthermore the stability assumption $|\lambda_{max}(A_\bullet)| < 1$ and the strict minimum-phase condition $\rho_0 := |\lambda_{max}(A - KC)| < 1$ hold.
The state at time $t = 1$ is given by $x_1 = [x_{1,0}', \ldots, x_{1,S/2}', x_{1,\bullet}']'$, where $x_{1,j} \in \mathbb{R}^{\delta_j c_j}$ (for $\delta_j = 2$, $0 < j < S/2$, and $\delta_j = 1$ else) is deterministic and $x_{1,\bullet} = \sum_{j=1}^{\infty} A_\bullet^{j-1} K_\bullet \varepsilon_{1-j}$ is such that $(x_{t,\bullet})_{t\in\mathbb{Z}}$ is stationary.
The noise process $(\varepsilon_t)_{t\in\mathbb{Z}}$ is assumed to be a strictly stationary ergodic martingale difference sequence with respect to the filtration $\mathcal{F}_t$ with zero conditional mean $E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0$, deterministic conditional variance $E(\varepsilon_t\varepsilon_t' \mid \mathcal{F}_{t-1}) = \Omega > 0$ and finite fourth moments.
Due to the block diagonal form of A the state equations are in a convenient form such that partitioning the state vector accordingly as
$$x_t = \big[x_{t,0}',\ x_{t,1}',\ \ldots,\ x_{t,S/2}',\ x_{t,\bullet}'\big]',$$
the blocks $(x_{t,j})_{t\in\mathbb{Z}}$, $x_{t,j} \in \mathbb{R}^{\delta_j c_j}$, for $c_j > 0$ are unit root processes with state space unit root structure $\{(\omega_j, (c_j))\}$. Finally, $(x_{t,\bullet})_{t\in\mathbb{Z}}$ is stationary due to the assumptions on $x_{1,\bullet}$. If $(\tilde y_t)_{t\in\mathbb{N}}$ denotes a different solution to the state space equations corresponding to $\tilde x_1$, then (for $t > 1$)
$$\tilde y_t - y_t = C A^{t-1}(\tilde x_1 - x_1) = \sum_{j=0}^{S/2} C_j A_j^{t-1}(\tilde x_{1,j} - x_{1,j}) + C_\bullet A_\bullet^{t-1}(\tilde x_{1,\bullet} - x_{1,\bullet}).$$
Note that $C_j A_j^{t-1} z_{12} = \cos(\omega_j t)\, z_1 + \sin(\omega_j t)\, z_2$, $0 < j < S/2$ (for appropriate vectors $z_{12}, z_1, z_2$), while
$$C_0 A_0^{t-1} = C_0, \qquad C_{S/2} A_{S/2}^{t-1} = (-1)^{t-1} C_{S/2}.$$
Therefore the sum $\sum_{j=0}^{S/2} C_j A_j^{t-1}(\tilde x_{1,j} - x_{1,j})$ can be modeled using a constant and seasonal dummies. The term $C_\bullet A_\bullet^{t-1}(\tilde x_{1,\bullet} - x_{1,\bullet})$ tends to zero at an exponential rate as $t\to\infty$ and hence does not influence the asymptotics.
Assumption 1 implies that the yearly difference
$$y_t - y_{t-S} = C A^S x_{t-S} + \varepsilon_t + \sum_{i=1}^{S} C A^{i-1} K \varepsilon_{t-i} - C x_{t-S} - \varepsilon_{t-S} = (C A^S - C)\,x_{t-S} + v_t = (C_\bullet A_\bullet^S - C_\bullet)\,x_{t-S,\bullet} + v_t$$
is a stationary VARMA process, where $v_t = \varepsilon_t + \sum_{i=1}^{S} C A^{i-1} K\varepsilon_{t-i} - \varepsilon_{t-S}$; the last equality holds since $A_j^S = I_{\delta_j c_j}$, so the unit root blocks cancel. Thus the process according to Assumption 1 is a unit root process in the sense of [15]. Note that we do not assume that all unit roots are contained, such that the spectral density of the stationary process $(y_t - y_{t-S})_{t\in\mathbb{Z}}$ may contain zeros due to overdifferencing, and hence the process potentially is not stably invertible. The special form of $A_0$ implies that I(1) processes are a special case of our dgp while I(d), $d > 1$, $d\in\mathbb{N}$, processes are not allowed for.

3. Canonical Variate Analysis

The main idea of CVA is that, given the state, the system equations (1) are linear in the system matrices. Therefore, based on an estimate of the state sequence, the system can be estimated using least squares regression. The estimate of the state is based on the following equation (for details see for example [17]):
Let $Y_{t,f}^+ := [y_t', y_{t+1}', \ldots, y_{t+f-1}']'$ denote the vector of stacked observations for some integer $f \ge n$ and let $E_{t,f}^+ := [\varepsilon_t', \varepsilon_{t+1}', \ldots, \varepsilon_{t+f-1}']'$. Further define $Y_{t,p}^- := [y_{t-1}', \ldots, y_{t-p}']'$. Then (for $t > p$)
$$Y_{t,f}^+ = \mathcal{O}_f x_t + \mathcal{E}_f E_{t,f}^+ = \mathcal{O}_f\mathcal{K}_p Y_{t,p}^- + \mathcal{O}_f(A - KC)^p x_{t-p} + \mathcal{E}_f E_{t,f}^+ = \beta_1 Y_{t,p}^- + N_{t,f}^+$$
where $\mathcal{K}_p := [K, \bar A K, \bar A^2 K, \ldots, \bar A^{p-1} K]$ for $\bar A := A - KC$ and $\mathcal{O}_f := [C', A'C', \ldots, (A^{f-1})'C']'$. The strict minimum-phase assumption implies $\bar A^p \to 0$ for $p\to\infty$.
Let $\langle a_t, b_t\rangle := T^{-1}\sum_{t=p+1}^{T-f+1} a_t b_t'$ for sequences $(a_t)_{t\in\mathbb{N}}$ and $(b_t)_{t\in\mathbb{N}}$. Then an estimate of $\beta_1$ is obtained from the reduced rank regression (RRR) $Y_{t,f}^+ = \beta_1 Y_{t,p}^- + N_{t,f}^+$ under the rank constraint $\operatorname{rank}(\beta_1) = n$. This results in the estimate $\hat{\mathcal{O}}_f\hat{\mathcal{K}}_p := [(\hat\Xi_f^+)^{-1}\hat U_n \hat S_n][\hat V_n'(\hat\Xi_p^-)^{-1}]$ of $\beta_1$ using the singular value decomposition (SVD)
$$\hat\Xi_f^+\,\hat\beta_1\,\hat\Xi_p^- = \hat U\hat S\hat V' = \hat U_n\hat S_n\hat V_n' + \hat R_n.$$
Here $\hat\beta_1 = \langle Y_{t,f}^+, Y_{t,p}^-\rangle\langle Y_{t,p}^-, Y_{t,p}^-\rangle^{-1}$ denotes the unrestricted least squares estimate of $\beta_1$ and
$$\hat\Xi_f^+ := \langle Y_{t,f}^+, Y_{t,f}^+\rangle^{-1/2}, \qquad \hat\Xi_p^- := \langle Y_{t,p}^-, Y_{t,p}^-\rangle^{1/2}.$$
Here the symmetric matrix square root is used. The definition is, however, not of importance and other square roots such as Cholesky factors could be used. $\hat U_n \in \mathbb{R}^{fs\times n}$ denotes the matrix whose columns are the left singular vectors corresponding to the $n$ largest singular values, which are the diagonal entries of $\hat S_n := \operatorname{diag}(\hat\sigma_1, \hat\sigma_2, \ldots, \hat\sigma_n)$, $\hat\sigma_1 \ge \ldots \ge \hat\sigma_n > 0$, and $\hat V_n \in \mathbb{R}^{ps\times n}$ contains the corresponding right singular vectors as its columns. $\hat R_n$ denotes the approximation error.
The system estimate $(\hat A, \hat C, \hat K)$ is then obtained using the estimated state $\hat x_t := \hat{\mathcal{K}}_p Y_{t,p}^-$, $t = p+1, \ldots, T+1$, via regression in the system equations.
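The steps described so far can be summarized in a short sketch. This is our simplified illustration, not the authors' implementation; end effects, scaling conventions and numerical safeguards are ignored.

```python
import numpy as np

def cva(y, f, p, n):
    """Sketch of CVA: stack past/future, weighted RRR via an SVD, estimate
    the state, then regress in the system equations (simplified)."""
    T, s = y.shape
    ts = np.arange(p, T - f + 1)                       # usable time points
    Yf = np.hstack([y[t:t + f].reshape(-1, 1) for t in ts])        # Y_{t,f}^+
    Ym = np.hstack([y[t - p:t][::-1].reshape(-1, 1) for t in ts])  # Y_{t,p}^-
    N = len(ts)
    Gff, Gpp, Gfp = Yf @ Yf.T / N, Ym @ Ym.T / N, Yf @ Ym.T / N
    def sym_pow(M, a):                                 # symmetric matrix power
        w, V = np.linalg.eigh(M)
        return (V * w ** a) @ V.T
    # weighted SVD: Gff^{-1/2} Gfp Gpp^{-1/2} = U S V' (canonical correlations)
    U, sv, Vt = np.linalg.svd(sym_pow(Gff, -0.5) @ Gfp @ sym_pow(Gpp, -0.5))
    Kp = Vt[:n] @ sym_pow(Gpp, -0.5)                   # hat K_p
    X = Kp @ Ym                                        # estimated states x_t
    Y0 = y[ts].T                                       # y_t aligned with x_t
    C = Y0 @ np.linalg.pinv(X)                         # y_t = C x_t + eps_t
    E = Y0 - C @ X                                     # innovation estimates
    Xn = Kp @ np.hstack([y[t + 1 - p:t + 1][::-1].reshape(-1, 1) for t in ts])
    AK = Xn @ np.linalg.pinv(np.vstack([X, E]))        # x_{t+1} = A x_t + K eps_t
    return AK[:, :n], C, AK[:, n:], sv                 # A, C, K, singular values
```

The returned singular values are the estimated canonical correlations used for the inference procedures of Section 5.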
In the algorithm a specific decomposition of the rank $n$ matrix $\hat{\mathcal{O}}_f\hat{\mathcal{K}}_p$ into the two factors $\hat{\mathcal{O}}_f$ and $\hat{\mathcal{K}}_p$ is given such that $\hat{\mathcal{K}}_p\hat\Xi_p^-(\hat\Xi_p^-)'\hat{\mathcal{K}}_p' = I_n$. It is obvious that every other decomposition of $\hat{\mathcal{O}}_f\hat{\mathcal{K}}_p$ produces an estimated state sequence in a different coordinate system, leading to a different observationally equivalent representation of the same transfer function estimator. Therefore, with respect to consistency of the transfer function estimator it is sufficient to show that there exists a factorization of $\hat{\mathcal{O}}_f\hat{\mathcal{K}}_p$ leading to convergent system matrix estimators $(\tilde A, \tilde C, \tilde K)$, even if this factorization cannot be used in actual computations, as it requires information not known at the time of estimation.
In order to generate a consistent initial guess for subsequent quasi likelihood optimization in the set of all state space systems corresponding to processes with state space unit root structure $\Omega_S := \{(\omega_0, (c_0)), \ldots, (\omega_{S/2}, (c_{S/2}))\}$, however, we will derive a realizable (for known integers $c_j$ and matrices $E_j$ such that $E_j' C_{j,C} = I_{c_j}$) consistent system estimate. To this end note that consistency of the transfer function implies (see for example [23]) that the eigenvalues $\tilde\lambda_l$ of $\hat A$ converge (in a specific sense) to the eigenvalues $\lambda_j$ of $A$. Therefore, transforming $\hat A$ into complex Jordan normal form (where $\hat A$ is almost surely diagonalizable) and ordering the eigenvalues such that groups of eigenvalues $\tilde\lambda_l^{(j)}$, $l = 1, \ldots, c_j$, converging to $\lambda_j$ are grouped together, we obtain a realizable system $(\check A, \check C, \check K)$ where the diagonal blocks of the block diagonal matrix $\check A$ corresponding to the unit roots converge to a diagonal matrix with the eigenvalues $z_j$ on the diagonal:
$$\check A_{j,C} = \begin{bmatrix} \tilde\lambda_1^{(j)} & 0 & \cdots & 0 \\ 0 & \tilde\lambda_2^{(j)} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \tilde\lambda_{c_j}^{(j)} \end{bmatrix} \to A_{j,C} = \begin{bmatrix} z_j & 0 & \cdots & 0 \\ 0 & z_j & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & z_j \end{bmatrix} = z_j I_{c_j}.$$
Replacing $\check A_{j,C}$ by the limit $A_{j,C}$ and transforming the estimates to the real Jordan normal form, we obtain estimates $(\breve A, \check C, \check K)$ corresponding to unit root processes with state space unit root structure $\Omega_S$.
Note, however, that this representation does not necessarily converge, as perturbation analysis only implies convergence of the eigenspaces. Therefore, in the final step the estimate $(\breve A, \check C, \check K)$ is converted such that we obtain convergence of the system matrix estimates. In the class of observationally equivalent systems with the matrix
$$\breve A_C = \operatorname{diag}\big(A_{0,C}, A_{1,C}, \overline{A_{1,C}}, \ldots, \overline{A_{S/2-1,C}}, A_{S/2,C}, \check A_\bullet\big), \qquad A_{j,C} = I_{c_j} z_j,$$
block diagonal transformations of the form $T = \operatorname{diag}(T_0, T_1, \overline{T_1}, \ldots, T_{S/2}, I)$ do not change the matrix $\breve A_C$. Here the basis of the stable subsystem can be chosen such that the corresponding transformed $(\breve A, \breve C, \breve K)$ is uniquely defined using an overlapping echelon form (see [22], Section 2.6). The impact of such transformations on the blocks of $C$ is given by $\check C_{j,C} T_j^{-1}$. Therefore, if for each $j = 0, \ldots, S/2$ a matrix $E_j \in \mathbb{C}^{s\times c_j}$ is known such that $E_j' C_{j,C} \in \mathbb{C}^{c_j\times c_j}$ is nonsingular, the restriction $E_j'\breve C_{j,C} = I_{c_j}$ picks a unique representative $(\breve A, \breve C, \breve K)$ of the class of systems observationally equivalent to $(\breve A, \check C, \check K)$.
Note that this estimate $(\breve A, \breve C, \breve K)$ is realizable if the integers $c_j$ (needed to identify the $c_j$ eigenvalues of $\hat A$ closest to $z_j$), the matrices $E_j$ (needed to fix a basis for $x_{t,j}$) and the index corresponding to the overlapping echelon form for the stable part are known. Furthermore, this estimate corresponds to a process with state space unit root structure $\Omega_S$ and hence can be used as a starting value for quasi likelihood maximization.
Finally in this section it should be noted that the estimate of the state x ^ t here mainly serves the purpose of obtaining an estimator for the state space system. Based on this estimate, Kalman filtering techniques can be used to obtain different estimates of the state sequence. The relation between these different estimates is unclear and so is their usage for inference. For this paper the state estimates x ^ t are only an intermediate step in the CVA algorithm.

4. Asymptotic Properties of the System Estimators

As follows from the last section, the central step in the CVA procedure is a RRR problem involving stationary and nonstationary components. The asymptotic properties of the solution to such RRR problems are derived in Theorem 3.2 of [24]. Using these results the following theorem can be proved (see Appendix C.1):
Theorem 1.
Let the process $(y_t)_{t\in\mathbb{Z}}$ be generated according to Assumption 1. Let $(\hat A, \hat C, \hat K)$ denote the CVA estimators of the system matrices using the assumption of correctly specified order $n$, with $f \ge n$ finite and not depending on the sample size, and $p = o((\log T)^{\bar a})$ for some real $0 < \bar a < \infty$, $p \ge -d\log T/\log\rho_0$ for some $d > 1$, where $0 < \rho_0 = |\lambda_{max}(A - KC)| < 1$. Let $(A, C, K)$ be in the form given in (2), where $(A_\bullet, C_\bullet, K_\bullet)$ is in echelon canonical form and for each $j = 0, \ldots, S/2$ there exists a row selector matrix $E_j \in \mathbb{R}^{s\times c_j}$ such that $E_j' C_{j,C}$ is non-singular. Then for some integer $a$:
(I) 
$\hat C\hat A^j\hat K - C A^j K = O_P((\log T)^a/\sqrt{T})$ for each $j \ge 0$.
(II) 
Using $D_x = \operatorname{diag}(T^{-1} I_c, T^{-1/2} I_{n-c})$, where $c = \sum_{j=0}^{S/2} \delta_j c_j$, we have
$$(\breve A - A)D_x^{-1} = O_P((\log T)^a), \quad \sqrt{T}(\breve K - K) = O_P((\log T)^a), \quad (\breve C - C)D_x^{-1} = O_P((\log T)^a)$$
for some integer $a < \infty$.
(III) 
If the noise is assumed to be an iid sequence, then results (I) and (II) hold almost surely.
Besides stating consistency in the seasonally integrated case, the theorem also improves on the results of [20] in the I(1) case by showing that no adaptation of CVA is needed in order to obtain consistent estimators of the impulse response sequence or the system matrices. Note that this consistency result for the impulse response sequence concerns both the short- and the long-run dynamics. In particular it implies that short-run prediction coefficients are estimated consistently. Moreover, the theorem establishes strong consistency rather than weak consistency as opposed to [20]. (II) establishes orders of convergence which, however, apply only to a transformed system that requires knowledge of the integers $c_j$ and matrices $E_j$ to be realized. No tight bounds for the integer $a$ are derived, since they do not seem to be of much value.
Note that the assumptions on the innovations rule out conditionally heteroskedastic processes. However, since the proof mostly relies on convergence properties for covariance estimators for stationary processes and continuous mapping theorems for integrated processes, it appears likely that the results can be extended to conditionally heteroskedastic processes as well. For the stationary cases this follows directly from the arguments in [18], while for integrated processes results (using different assumptions on the innovations) given for example in [25] can be used. The conditions of [25] hold for example in a large number of GARCH type processes, see [26]. The combination of the different sets of assumptions on the innovations is not straightforward, however, and hence would further complicate the proofs. We refrain from including them.
It is worth pointing out that due to the block diagonal structure of $A$ the result $(\breve C - C)D_x^{-1} = O_P((\log T)^a)$ implies consistency of the blocks $\breve C_j$ corresponding to unit root $z_j$ (or the corresponding complex pair) of order almost $T^{-1}$. Using the complex valued canonical form this implies consistent estimation of $C_{j,C}$ by the corresponding $\breve C_{j,C}$. In the canonical form this matrix determines the cointegrating relations (both the static as well as the dynamic ones, for details see [15]) as the unitary complement to this matrix. It thus follows that CVA delivers estimators for the cointegrating relations at the various unit roots that are (super-)consistent. In fact, the proof can be extended to show convergence in distribution of $(\breve C - C)D_x^{-1}$. This distribution could be used in order to derive tests for cointegrating relations. However, preliminary simulations indicate that these estimates and hence the corresponding tests are not optimal and can be improved upon by quasi maximum likelihood estimation in the VARMA setting initialized by the CVA estimates. Therefore we refrain from presenting these results.
Note that the assumptions impose the restriction $\rho_0 > 0$, excluding VAR systems. This is done solely for stating a uniform lower bound on the increase of $p$ as a function of $T$. This bound is related to the lag length selection achieved using BIC, see [27]. In the VAR case the lag length estimator using BIC will converge to the true order and thus remain finite. All results hold true if in the VAR case a fixed (that is, independent of the sample size) $p \ge n$ is used.

5. Inference Based on the Subspace Estimators

Besides consistency of the impulse response sequence, also the specification of the integers $c_0, \ldots, c_{S/2}$ is of interest. First, following [20], this information can be obtained by detecting the unity singular values in the RRR step of the procedure. Second, from the system representation (2) it is clear that the location of the unit roots is determined by the eigenvalues of $A$ on the unit circle: the integers $c_j$ denote the number of eigenvalues at the corresponding locations on the unit circle (provided the eigenvalues are simple). Due to perturbation theory (see for example Lemma A2) we know that the eigenvalues of $\hat A$ converge (for $T\to\infty$) to the eigenvalues of $A$, and the distribution of the mean of all eigenvalues of $\hat A$ converging to an eigenvalue of $A$ can be derived based on the distribution of the estimation error $\hat A - A$. This can be used to derive tests on the number of eigenvalues at a particular location on the unit circle. Third, if $n \le s$ the state process is a VAR(1) process and hence in some cases allows for inference on the number of cointegrating relations and thus also on the integers $c_j$, as outlined in [4]. Tests based on these three arguments will be discussed below.
Theorem 2.
Under the assumptions of Theorem 1 the test statistic $T\sum_{i=1}^{c}(1 - \hat\sigma_i^2)$ converges in distribution to the random variable
$$Z = \operatorname{tr}\left[E\big(\tilde\varepsilon_{t,\perp}\tilde\varepsilon_{t,\perp}'\big)\left(\int_0^1 W(r)W(r)'\,dr\right)^{-1}\right]$$
where $\tilde\varepsilon_{t,\perp} = \tilde\varepsilon_{t,1} - E(\tilde\varepsilon_{t,1}\tilde\varepsilon_{t,\bullet}')\big(E\,\tilde\varepsilon_{t,\bullet}\tilde\varepsilon_{t,\bullet}'\big)^{-1}\tilde\varepsilon_{t,\bullet}$ (for the definition of $\tilde\varepsilon_{t,1}$ and $\tilde\varepsilon_{t,\bullet}$ see the proof in Appendix C.2) and where $W(r)$ denotes a $c$-dimensional Brownian motion with variance
$$\sum_{i=0}^{S-1} A_u^i K_u\,\Omega\,K_u'\,(A_u^i)'$$
with $A_u$ denoting the $c\times c$ heading submatrix of $A$ and $K_u$ denoting the submatrix of $K$ composed of the first $c$ rows, such that $(A_u, C_u, K_u)$ denotes the unstable subsystem.
The theorem is proved in Appendix C.2, where also the many nuisance parameters of the limiting random variable are explained and defined. The proof also corrects an error in Theorem 4 of [20], where the wrong distribution is given since the second order terms were neglected.
As the distribution is not pivotal and in particular contains information that is unknown when performing the RRR step, it is not of much interest for direct application. Nevertheless the order of convergence allows for the derivation of simple consistent estimators of the number of common trends: let $\hat c_T$ denote the number of singular values calculated in the RRR that exceed $1 - h(T)/T$ for arbitrary $h(T)\to\infty$, $h(T) < T$, $h(T)/T\to 0$, for $T\to\infty$. Then it is a direct consequence of Theorem 2 in combination with $\hat\sigma_j\to\sigma_j < 1$, $j > c$, that $\hat c_T\to c$ in probability, implying consistent estimation of $c$. Based on these results also estimators for $c$ could be derived, for example along the lines of [28]. However, as [29] shows, such estimators have not performed well in simulations and thus are not considered subsequently.
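As an illustration, $\hat c_T$ is a one-line computation. The threshold function $h(T) = \log T$ below is merely one admissible example satisfying the stated conditions, not a recommendation from the paper.

```python
import numpy as np

def count_common_trends(sv, T, h=None):
    """Number of CVA singular values exceeding 1 - h(T)/T (consistent for c).
    h(T) = log(T) is an assumed example with h(T) -> inf and h(T)/T -> 0."""
    h = np.log(T) if h is None else h
    return int(np.sum(np.asarray(sv) > 1.0 - h / T))
```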
The singular values do not provide information on the location of the unit roots. This additional information is contained in the eigenvalues of the matrix A :
Theorem 3.
Under the assumptions of Theorem 1 let $\hat\lambda_i^{(m)}$, $i = 1, \ldots, c_m$, denote the $c_m$ eigenvalues of $\hat A$ closest to the unit root $z_m$, $|z_m| = 1$. Then, defining $\hat\mu_m = \sum_{i=1}^{c_m}(\hat\lambda_i^{(m)} - z_m)$, we obtain
$$T\hat\mu_m \overset{d}{\to} \operatorname{tr}\left[\left(\int_0^1 B(r)B(r)'\,dr\right)^{-1}\int_0^1 B(r)\,dB(r)'\right]$$
where $B(r)$ denotes a $c_m$-dimensional Brownian motion with zero expectation and variance $I_{c_m}$ for $z_m = \pm 1$, and a complex Brownian motion with expectation zero and variance equal to the identity matrix else.
Further, if $\tilde A := \langle x_{t+1}, x_t\rangle\langle x_t, x_t\rangle^{-1}$ using the true state $x_t$ and $\tilde\mu_m = \sum_{i=1}^{c_m}(\tilde\lambda_i^{(m)} - z_m)$, where $\tilde\lambda_i^{(m)}$, $i = 1, \ldots, c_m$, denote the $c_m$ eigenvalues of $\tilde A$ closest to $z_m$, then $T(\hat\mu_m - \tilde\mu_m) = o_P(1)$.
Therefore the estimated eigenvalues can be used in order to obtain a test on the number of common trends at a particular frequency, for each frequency separately. The test distribution is obtained as the limit of
$$T\,\operatorname{tr}\left[K_{m,C}\,\langle\varepsilon_t, x_{t,m,C}\rangle\,\langle x_{t,m,C}, x_{t,m,C}\rangle^{-1}\right]$$
where $x_{t,m,C} = \overline{z_m}\,x_{t-1,m,C} + K_{m,C}\,\varepsilon_{t-1}$, $x_{1,m,C} = 0$. The distribution thus does not depend on the presence of other unit roots or stationary components of the state. Furthermore, it can be seen that it is independent of the noise variance and of the matrix $K_{m,C}$. Hence critical values are easily obtained from simulations. Also note that the limiting distribution is identical for all complex unit roots.
Therefore, for each seasonal unit root location $z_m$ we can order the eigenvalues of the estimated matrix $\hat A$ by increasing distance to $z_m$. Then, starting from the null hypothesis $H_0: c_m = \bar c$ (for a reasonable $\bar c$ obtained, e.g., from a plot of the eigenvalues), one can perform the test with statistic $T\hat\mu_m$. If the test rejects, then the hypothesis $H_0: c_m = \bar c - 1$ is tested, until the hypothesis is not rejected anymore, or $H_0: c_m = 1$ is reached. This is then the last test. If $H_0$ is rejected again, no unit root is found at this location. Otherwise we do not have evidence against $c_m = 1$. In any case, the system needs to be estimated only once and the calculation of the test statistics is easy, even for all seasonal unit roots jointly. A sketch of this sequential procedure is given below.
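In the sketch, the critical-value table `crit` is assumed to have been simulated beforehand, and the modulus of the possibly complex statistic is used, as in the variant $\Lambda(c)$ employed in Section 7.

```python
import numpy as np

def mu_hat(A_hat, z_m, c_m):
    """mu_hat_m: sum of (lambda_i - z_m) over the c_m eigenvalues of A_hat
    closest to the unit root z_m (Theorem 3)."""
    lam = np.linalg.eigvals(A_hat)
    lam = lam[np.argsort(np.abs(lam - z_m))]     # order by distance to z_m
    return np.sum(lam[:c_m] - z_m)

def sequential_test(A_hat, z_m, c_bar, T, crit):
    """Test H0: c_m = c_bar, c_bar - 1, ... until the first non-rejection.
    `crit[c]` is an assumed table of simulated critical values for H0: c_m = c."""
    for c in range(c_bar, 0, -1):
        if T * np.abs(mu_hat(A_hat, z_m, c)) <= crit[c]:
            return c                             # no evidence against c_m = c
    return 0                                     # all rejected: no unit root here
```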
The third option for obtaining tests is to use the tests derived in [4] based on the JS framework for VARs. In the case $n \le s$ the state process $x_{t+1} = A x_t + K\varepsilon_t$ is a seasonally integrated VAR(1) process (for $n > s$ the noise variance is singular). The corresponding VECM representation equals
$$p(L)x_t = -\sum_{m=1}^{S}(I_n - A z_m)X_{t-1}^{(m)} + K\varepsilon_{t-1} = \sum_{m=1}^{S}\alpha_m\beta_m' X_{t-1}^{(m)} + K\varepsilon_{t-1}$$
where $z_m = \exp(\frac{2\pi m}{S} i)$, $m = 1, \ldots, S$, and
$$p(L) = 1 - L^S, \quad p_t = p(L)x_t = x_t - x_{t-S}, \quad p_m(L) = \frac{p(L)}{1 - \overline{z_m}L}, \quad X_t^{(m)} = \frac{p_m(L)}{p_m(z_m)\,z_m}\,x_t.$$
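Since $p_m(L) = \sum_{k=0}^{S-1}\overline{z_m}^k L^k$ (because $\overline{z_m}^S = 1$) and hence $p_m(z_m) = S$, the filtered regressor $X_t^{(m)}$ is straightforward to compute. A minimal sketch (our own illustration, not the authors' code):

```python
import numpy as np

def seasonal_level(x, m, S):
    """X_t^{(m)} = p_m(L) x_t / (p_m(z_m) z_m) with z_m = exp(2*pi*1j*m/S).

    Uses p_m(L) = sum_{k=0}^{S-1} conj(z_m)^k L^k and p_m(z_m) = S.
    x : (T, n) real array; returns complex values for rows S-1, ..., T-1."""
    T, n = x.shape
    zm = np.exp(2j * np.pi * m / S)
    w = np.conj(zm) ** np.arange(S) / (S * zm)   # filter weights
    out = np.zeros((T - S + 1, n), dtype=complex)
    for k in range(S):
        out += w[k] * x[S - 1 - k:T - k]
    return out
```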
Note that in this VAR(1) setting no additional stationary regressors of the form $p(L)x_{t-j}$ occur. Also no seasonal dummies are needed, but they could be added to the equation. In this setting [4] suggests to use the eigenvalues $\hat\lambda_i$ (ordered with increasing modulus) of the matrix (the superscript $(.)^\pi$ denotes the residuals with respect to the remaining regressors $X_{t-1}^{(j)}$, $j \neq m$)
$$\big\langle X_{t-1}^{(m),\pi}, p_t^\pi\big\rangle\big\langle p_t^\pi, p_t^\pi\big\rangle^{-1}\big\langle p_t^\pi, X_{t-1}^{(m),\pi}\big\rangle\big\langle X_{t-1}^{(m),\pi}, X_{t-1}^{(m),\pi}\big\rangle^{-1}$$
as the basis for a test statistic
$$\tilde C_m := -\delta_m\sum_{i=1}^{c_m}\log\big(1 - \hat\lambda_i\big),$$
where $\delta_m = 2$ for complex unit roots and $\delta_m = 1$ for real unit roots. In the I(1) case this leads to the familiar Johansen trace test; for seasonal unit roots a different asymptotic distribution is obtained.
Theorem 4.
Under the assumptions of Theorem 1 let $\hat C_m$ be calculated based on the estimated state and let $\tilde C_m$ denote the same statistic based on the true state. Then for $n \le s$ it holds that $\hat C_m - \tilde C_m = o_P(T^{-1})$ and
$$T\hat C_m \overset{d}{\to} \operatorname{tr}\left[\int_0^1 dB(r)\,B(r)'\left(\int_0^1 B(r)B(r)'\,dr\right)^{-1}\int_0^1 B(r)\,dB(r)'\right]$$
where $B(r)$ is a real Brownian motion for $z_m = \pm 1$ and a complex Brownian motion else.
Thus, again, under the null hypothesis the test statistic based on the estimated state and the one based on the true state lead asymptotically to the same decision with probability one. Therefore for $n \le s$ the tests of JS can be used to obtain information on the number of common cycles, ignoring the fact that the estimated state is used in place of the true state process.
Having presented three distinct ideas for providing information on the number and location of unit roots, the question arises which one to use in practice. In the following a number of remarks are given in this respect.
The criterion based on the singular values given in Theorem 2 is of limited information, as it only provides the overall number of unit roots. Since the limiting distribution is not pivotal it cannot be used for tests, and the choice of the cutoff value $h(T)$ is somewhat arbitrary. Nevertheless, using a relatively large value one obtains a useful upper bound on $c$, which can be included in the typical sequential procedures for tests for the $c_j$.
Using the results of Theorem 4 has the advantage of using a framework that is well known to many researchers. It is remarkable that in terms of the asymptotic distributions there is no difference involved in using the estimated state in place of the true state. The assumption $n \le s$, however, is somewhat restrictive except in situations with a large $s$.
Finally the results of Theorem 3 provide simple to use tests for all unit roots, independently of the specification of the model for the remaining unit roots. Again it is remarkable that, under the null, inference is identical for known and for estimated state.
Since our estimators are not quasi maximum likelihood estimators, the question of a comparison with the usual likelihood ratio tests arises. For VAR models, simulation exercises documented in Section 7 below demonstrate that there are situations where the proposed tests outperform tests in the VAR framework. Comparisons with tests in the state space framework (or equivalently in the VARMA framework) are complicated by the fact that no results are currently available in the literature for this framework. One difference, however, is given by the fact that quasi likelihood ratio tests in the VARMA setting require a full specification of the $c_j$ values for all unit roots. This introduces interdependencies, such that the tests for one unit root depend on the specification of the cointegrating rank at the other roots. The interdependencies can be broken by performing tests based on alternative specifications for each unit root. The test based on Theorem 3 does not require this but can be performed based on the same estimate $\hat A$. This is seen as an advantage.
The question of the comparison of the empirical size in finite samples as well as power to local alternatives between the CVA based tests and tests based on quasi-likelihood ratios is left as a research question.

6. Deterministic Terms

Up to now it has been assumed that no deterministic terms appear in the model contrary to common practice. In the VAR framework dealing with trends is complicated by the usage of the VECM representation, see e.g., [30]. In the state space framework used in this paper, however, deterministic terms are easily incorporated.
Theorem 5.
Let the process $(y_t)_{t\in\mathbb{Z}}$ be generated according to Assumption 1 and assume that the process $(\tilde y_t)_{t\in\mathbb{Z}}$ is observed, where $\tilde y_t = y_t + \Phi d_t$ with
$$d_t = \left[1,\ \cos\left(\tfrac{2\pi}{S}t\right),\ \sin\left(\tfrac{2\pi}{S}t\right),\ \ldots,\ (-1)^t\right]' \in \mathbb{R}^S$$
and $\Phi \in \mathbb{R}^{s\times S}$.
Then, if the CVA estimation is applied to
$$\tilde y_t^\pi := \tilde y_t - \left(\sum_{t=1}^{T}\tilde y_t d_t'\right)\left(\sum_{t=1}^{T}d_t d_t'\right)^{-1} d_t, \qquad t = 1, \ldots, T,$$
the results of Theorem 1 hold, i.e., the system is estimated consistently and the orders of convergence for the transformed system $(\breve A, \breve C, \breve K)$ hold true.
Furthermore, the convergence in distribution results in Theorems 2–4 hold true, where in the limits the Brownian motions $B(r)$ occurring in the distributions must be replaced by their demeaned versions $B(r) - \int_0^1 B(s)\,ds$.
In this sense the results are robust to some operations typically termed preprocessing of data such as demeaning and deseasonalizing using seasonal dummies. More general preprocessing steps such as detrending or the extraction of more general deterministic terms analogous to [30] can be investigated along the same lines.
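In code, this preprocessing amounts to a single least squares regression on the dummy matrix $D = [d_1, \ldots, d_T]'$. The sketch below assumes an even period $S$, as in the rest of the paper.

```python
import numpy as np

def remove_deterministics(y, S):
    """Regress y_t on d_t = [1, cos(2*pi*t/S), sin(2*pi*t/S), ..., (-1)^t]'
    and return the residuals used as input to CVA (cf. Theorem 5).
    Assumes S even, so that the last component is (-1)^t."""
    T = y.shape[0]
    t = np.arange(1, T + 1)
    cols = [np.ones(T)]
    for j in range(1, S // 2):                   # harmonics at 2*pi*j/S
        cols += [np.cos(2 * np.pi * j * t / S), np.sin(2 * np.pi * j * t / S)]
    cols.append((-1.0) ** t)                     # frequency pi
    D = np.column_stack(cols)                    # T x S regressor matrix
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return y - D @ coef
```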

7. Simulations

The estimation of the seasonal cointegration ranks and spaces is usually carried out via quasi maximum likelihood methods that originated from the VAR model class. Typical estimators in this setting are those of [2,4,5,31]. In the first two experiments we focus on the estimation of the cointegrating spaces and the specification of the cointegration ranks in the classical situation of quarterly data and show that there are certain situations in which CVA estimators and the test in Theorem 3 possess finite sample properties superior to those of the methods above. In a third experiment the test performance is evaluated for a daily sampling rate. Moreover, the prediction accuracy of CVA is investigated as well as its robustness to innovations exhibiting behaviors often encountered at such higher sampling rates. All simulations are carried out using 1000 replications.
To investigate the practical usefulness of the proposed procedures, we first generate quarterly data using two VAR dgps of dimension $s = 2$ and then two more general VARMA dgps with $s = 8$. Each pair contains dgps with the different state space unit root structures
$$\{(0, (1)),\ (\pi/2, (c_{\pi/2})),\ (\pi, (1))\}, \qquad c_{\pi/2} = 1, 2.$$
From all four dgps samples of size $T \in \{50, 100, 200, 500\}$ are generated with initial values set to zero. Although none of the dgps contains deterministics, the data is adjusted for a constant and quarterly seasonal dummies as in [5]. For reasons of comparability, the adjustment for deterministic terms is done before estimation.
In the third experiment we generate daily data of dimension $s = 4$ from a state space system including unit roots corresponding to weekly frequencies (that is, a period length of seven days). In the simulations we use several years of data (excluding New Year's Day to account for 52 weeks of seven days each). The first 200 observations are discarded to account for the effects of different starting values. In this example the focus lies on a comparison of the prediction accuracy. Furthermore, we investigate the robustness of the test procedures to conditional heteroskedasticity of the GARCH type as well as to non-normality of the innovations.
To assess the performance of specifying the cointegrating rank at unit root $z$ using CVA, the following test statistic is constructed from the results in Theorem 3:
$$\Lambda(c) = T\,\left|\left(\frac{1}{c}\sum_{i=1}^{c}\hat\lambda_i\right) - z\right|.$$
Here $\hat\lambda_1, \ldots, \hat\lambda_n$ are the eigenvalues of $\hat A$ ordered increasingly according to their distance from $z$. Note that a similar test in [20] only uses the $c$-th largest eigenvalue, whereas here the average over the nearest $c$ eigenvalues is taken. Critical values have been obtained by simulation using large sample sizes (sample size 2000 (JS) and 5000 (CVA), 10,000 replications).
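A direct transcription of $\Lambda(c)$ (a sketch):

```python
import numpy as np

def Lambda_stat(A_hat, z, c, T):
    """Lambda(c) = T * |(mean of the c eigenvalues of A_hat closest to z) - z|."""
    lam = np.linalg.eigvals(A_hat)
    lam = lam[np.argsort(np.abs(lam - z))]       # increasing distance to z
    return T * np.abs(lam[:c].mean() - z)
```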
In our first two experiments, usage of $\Lambda(c)$ is compared with variants of the likelihood ratio test from [2] (JS), [4] ($Q_1$), and [5] ($Q_2$, $Q_3$). $Q_1$ is Cubadda's trace test for complex-valued data, $Q_2$ takes the information at frequency $\pi/2$ into account when the analysis is carried out at frequency $3\pi/2$, and $Q_3$ iterates between $\pi/2$ and $3\pi/2$ in the alternating reduced rank regression (ARR) of [5]. For the procedure of [2] the likelihood maximization at frequency $\pi/2$ is carried out using numerical optimization (BFGS) with initial values obtained from an unrestricted regression.
All tests are evaluated by comparing the percentages of correctly detected common trends, or hit rates, with 0.95, the hit rate to be expected under a nominal significance level of 0.05. The testing procedure employed for all tests is the same: at each of the frequencies we start from a null hypothesis of $s$ unit roots against fewer than $s$ unit roots. In case of rejection, $s - 1$ unit roots are tested versus fewer than $s - 1$, and so on, until there are zero unit roots under the alternative.
For the first two experiments the estimation performance of CVA for the simultaneous estimation of the seasonal cointegrating spaces is compared with the maximum likelihood estimates of [2,4,31] (cRRR), and also with an iterative procedure (Generalized ARR or GARR) of [5]. The comparison is carried out by means of the gap metric, measuring the distance between the true and the estimated cointegrating space as in [32]. The smaller the mean gap over all replications, the better is the estimation performance. Throughout a difference between two mean gaps or two hit rates is considered statistically significant if it is larger than twice the Monte Carlo standard error.
For all procedures used in this section, an AR lag length has to be chosen first. For CVA this can be done using the AIC as in ([33], Section 5), as is done in the third experiment.
In the first two experiments, where sample sizes are rather small, we estimate the lag length via minimization of the corrected AIC (AICc) ([34], p. 432), $\hat k_{AICc}$, benefitting the simulation results. For larger sample sizes the two criteria lead to the same choices. Due to the quarterly data we work with, the lag length is then chosen as $\hat k = \max\{\hat k_{AICc}, 4\}$.
Other information criteria could be chosen here. An anonymous referee also suggested the application of the Modified Akaike Information Criterion (MAIC) of [35], proposed there for the I(1)-case. In an attempt to apply it to the seasonally integrated case considered here, it performed considerably worse than the AICc. Thus we refrain from using the MAIC in the following and also omit the results of that attempt. They can be obtained from the authors upon request.
For CVA the truncation indices $f$ and $p$ are chosen as $\hat f = \hat p = 2\hat k$ ([33], Section 5). The system order $n$ is estimated by minimizing ([33], Section 5)
$$SVC(n) = \hat\sigma_{n+1}^2 + \frac{2ns\log T}{T}.$$
Here $\hat\sigma_i$ denotes the $i$-th largest singular value from the singular value decomposition of $\hat\Xi_f^+\hat\beta_1\hat\Xi_p^-$ (Step 2 of CVA). Note that selecting the number of states by $SVC$ is made less influential insofar as $\hat n = \max\{c_0 + 2c_{\pi/2} + c_\pi, \hat n_{SVC}\}$, where $\hat n_{SVC}$ denotes the SVC estimated system order.
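A sketch of the $SVC$ minimization (the lower bound $\hat n \ge c_0 + 2c_{\pi/2} + c_\pi$ mentioned above would be enforced afterwards):

```python
import numpy as np

def select_order_svc(sv, T, s):
    """Minimize SVC(n) = sigma_{n+1}^2 + 2*n*s*log(T)/T over n = 0, ..., len(sv)-1;
    sv are the singular values from the CVA SVD (sv[n] equals sigma_{n+1})."""
    sv = np.asarray(sv)
    svc = sv ** 2 + 2 * np.arange(len(sv)) * s * np.log(T) / T
    return int(np.argmin(svc))
```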
In Section 7.1 we start with the two VAR dgps and find that the likelihood-based procedures are mostly superior. Continuing with the VARMA dgps in Section 7.2, CVA performs better: it is superior for the smaller sample sizes in terms of size and gap, and better for all sample sizes in terms of power. Section 7.3 evaluates the performance of the unit root tests for larger sample sizes, together with the prediction performance in this setting. We find that the tests are robust to the distribution of the innovations as well as to conditional heteroskedasticity of the GARCH type. Furthermore, the empirical size of the tests lies close to the nominal size already for moderate sample sizes, where the tests also show almost perfect power properties.

7.1. VAR Processes

The VAR dgps considered in this paper are given by
$$X_t = \Pi_1 X_{t-1} + \Pi_2 X_{t-2} + \Pi_3 X_{t-3} + \Pi_4 X_{t-4} + \varepsilon_t, \qquad \varepsilon_t\sim N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.5\\ 0.5 & 1\end{bmatrix}\right),$$
where $(\varepsilon_t)_{t\in\mathbb{Z}}$ is white noise and the coefficient matrices are
$$\Pi_1 = \begin{bmatrix}\gamma & 0\\ 0 & 0\end{bmatrix}, \quad \Pi_2 = \begin{bmatrix}-0.4 & -0.4-\gamma\\ 0 & 0\end{bmatrix}, \quad \Pi_3 = \begin{bmatrix}-\gamma & 0\\ 0 & 0\end{bmatrix}, \quad \Pi_4 = \begin{bmatrix}0.6 - (\gamma/10) & -0.4+\gamma\\ 0 & 1\end{bmatrix}.$$
This dgp is adopted from [5] with a slight adjustment to $\Pi_4$. The corresponding VECM representation in the notation of [5] equals
$$X_{0,t} = \begin{bmatrix}-0.2\\ 0\end{bmatrix}\begin{bmatrix}1+\gamma/8 & 1\end{bmatrix} X_{1,t-1} + \begin{bmatrix}0.2\\ 0\end{bmatrix}\begin{bmatrix}1+\gamma/8 & 1\end{bmatrix} X_{2,t-1} + \begin{bmatrix}\gamma\\ 0\end{bmatrix}\begin{bmatrix}1+0.05L & -L\end{bmatrix} X_{3,t-1} + \varepsilon_t.$$
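For replication purposes, a sketch of the simulation of this dgp (initial values set to zero, as described above; our own illustration, not the authors' code):

```python
import numpy as np

def simulate_var_dgp(gamma, T, rng=None):
    """Simulate the bivariate VAR(4) dgp of Section 7.1 with zero initial values.
    The coefficient matrices follow the display above."""
    rng = np.random.default_rng(rng)
    P = [np.array([[gamma, 0.0], [0.0, 0.0]]),
         np.array([[-0.4, -0.4 - gamma], [0.0, 0.0]]),
         np.array([[-gamma, 0.0], [0.0, 0.0]]),
         np.array([[0.6 - gamma / 10, -0.4 + gamma], [0.0, 1.0]])]
    L = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))  # noise covariance
    X = np.zeros((T + 4, 2))                      # four zero initial values
    for t in range(4, T + 4):
        X[t] = sum(P[i] @ X[t - 1 - i] for i in range(4)) + L @ rng.standard_normal(2)
    return X[4:]
```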
As can be seen from Table 1, the dgps possess unit roots at frequencies 0, $\pi$, and $\pi/2$, where $c_{\pi/2} = 2$ [1] for $\gamma = 0$ [0.2], respectively. Note that in all cases the order of integration equals 1, while the number of common cycles at $\pi/2$ is varied.
Table 2 exhibits the hit rates from the application of the different test statistics. At frequencies 0 and $\pi$, $\Lambda$ is compared with the trace test of Johansen (J; based on [31] for the unit root $z = -1$), whereas at $\pi/2$ it is competing with JS, $Q_1$, $Q_2$, and $Q_3$. All competitors are likelihood-based tests, which is the term we are referring to when we compare $\Lambda$ to them as a whole.
The results for 0 and $\pi$ are very similar for both dgps in that $\Lambda$ scores more hits than the likelihood-based tests when the sample size is small, $T\in\{50, 100\}$. Convergence of its finite sample distribution is slower than for the other test statistics, however, as J is closer to 0.95 from $T = 200$ on. For $T = 500$ the distribution of $\Lambda$ only seems to have converged to its asymptotic distribution at frequency 0 when $c_{\pi/2} = 2$, whereas convergence of the likelihood-based tests has occurred in all cases.
At $\pi/2$ the likelihood ratio test of JS strictly dominates all implementations of [5] for all sample sizes and both dgps. It strictly dominates the CVA-based test procedure as well, except for one case: when $c_{\pi/2} = 1$ and $T = 50$, $\Lambda$ scores slightly, but significantly, more hits than the likelihood ratio test of JS. Surprisingly, $\Lambda$ is drastically worse when $T = 100$ with a hit rate of only 8.7%, only to be up at 85% for $T = 200$.
The behavior of $\Lambda$ is explained by $z_5$ and $z_6$ being close to $\pm i$ when $c_{\pi/2} = 1$, cf. Table 1. For future reference we will call the corresponding roots false unit roots.
For $T = 50$ the estimates of the eigenvalues corresponding to actual unit roots are not very close to $\pm i$, in contrast to the false unit roots. Thus the latter are mistaken for actual unit roots (cf. the first panel in Figure 1), leading to a hit rate of 81.1%, one that is even larger than the rates at 0 and $\pi$. As the sample size increases, the eigenvalue estimates of the true unit roots become more and more accurate, visible from the second and third panels in Figure 1. Accordingly they can be detected correctly more often. Unfortunately, however, for $T = 100$ the false unit roots continue to be detected, such that often two instead of just one unit root are found by $\Lambda$, resulting in a hit rate of only 8.7%. For $T\in\{200, 500\}$, $\Lambda$ is able to distinguish the false unit roots from the true ones and the detection rate gets closer to the asymptotic rate, at 85.5% and 92.7%, respectively.
When the VAR dgp without false unit roots and $c_{\pi/2} = 2$ is considered, the hit rates of $\Lambda$ at $\pi/2$ are monotonically increasing in the sample size again. The rates are smaller than those of the likelihood-based tests, however, and also clearly worse than those of $\Lambda$ at 0 and $\pi$, cf. Table 2 again.
Taken together, at the frequencies 0 and $\pi$, which correspond to real-valued unit roots, the use of $\Lambda$ was advantageous for $T = 50$. It also scored more hits for $T = 100$ and $c_{\pi/2} = 1$. For higher sample sizes the likelihood-based tests clearly dominate $\Lambda$ at these two frequencies. At $\pi/2$ this superiority of the likelihood-based tests holds for all sample sizes and both dgps. The example also points to a general weakness: if the sample size is low and false unit roots are present, it can be difficult for $\Lambda$ to distinguish them from actual unit roots.

7.2. VARMA Processes

The second setup consists of VARMA data generated by state space systems $(A_r, C_r, K_r)$, $r = 1, 2$, as in (1), where the matrices $A_1$ and $A_2$ are constructed as in (2) and are taken to be
$$A_1 = \begin{bmatrix}1 & 0 & 0 & 0\\ 0 & -1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & -1 & 0\end{bmatrix}, \qquad A_2 = \begin{bmatrix}1 & 0 & 0 & 0 & 0 & 0\\ 0 & -1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & -1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 0 & -1 & 0\end{bmatrix}.$$
These two choices yield the same state space unit root structures as those of the two VAR dgps, with $c_{\pi/2} = 1$ and $c_{\pi/2} = 2$ for $A_1$ and $A_2$, respectively. The other two system matrices $K_r \in \mathbb{R}^{(2+2r)\times s}$ and $C_r \in \mathbb{R}^{s\times(2+2r)}$ with $s = 8$ are drawn randomly from a standard normal distribution in each replication, and $(\varepsilon_t)_{t\in\mathbb{Z}}$ is multivariate normal white noise with an identity covariance matrix.
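A sketch of how one replication's system can be constructed (the seed handling is our choice):

```python
import numpy as np

def make_varma_system(r, s=8, rng=None):
    """Build (A_r, C_r, K_r) for the Section 7.2 experiment: A_r as displayed
    above, C_r and K_r redrawn standard normal in each replication."""
    rng = np.random.default_rng(rng)
    J = np.array([[0.0, 1.0], [-1.0, 0.0]])      # rotation block for the pair +-i
    blocks = [np.eye(1), -np.eye(1)] + [J] * r   # unit roots 1, -1, and r pairs +-i
    n = 2 + 2 * r
    A = np.zeros((n, n)); i = 0
    for B in blocks:                             # assemble block diagonal A_r
        k = B.shape[0]; A[i:i + k, i:i + k] = B; i += k
    return A, rng.standard_normal((s, n)), rng.standard_normal((n, s))
```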
Note that these systems are within the VARMA model class such that the dgp is contained in the VAR setting only by increasing the lag length as a function of the sample size. While superiority of the CVA approach in such a setting might be expected, this is far from obvious. Moreover, using a long VAR approximation is the industry norm in such situations.
From the hit rates in Table 3 it can be seen that the combination of large $s$, small $T$, and a minimal lag length of four renders the likelihood-based tests useless at all frequencies, with hit rates below ten percent for $T = 50$. $\Lambda$, in contrast, does not suffer from this problem and is already close to 95% for this sample size. Only when $T = 200$ do the likelihood-based tests appear to work, exhibiting hit rates close to 95%.
For all tests alike, however, it is striking that hit rates move away from 95% when $T = 500$. This behavior is most pronounced for $\Lambda$: e.g., from $T = 200$ to $T = 500$ its hit rate drops from 93.1% to 82.4% at 0 when $A_2$ is used. This phenomenon is a consequence of the fact that $f$ and $k$ in the algorithm are chosen data dependent. An inspection of how the hit rates depend on $f$ and $k$, and a comparison with the actually selected $\hat f$, $\hat k$, reveals that for $T = 500$ too large values of $f$ and $k$ are chosen too often, leaving room for improvement in the hit rates, cf. Figure 2. The figure stresses an important point: the performance of the unit root tests is heavily influenced by the selected lag lengths for all procedures. We tested a number of different information criteria in this respect. AICc turned out to be the best criterion overall, but not uniformly. Figure 2 indicates advantages of BIC over AIC for this example, as it on average selects smaller lag lengths, associated here with higher hit rates.
To study the power of the different procedures, the transition dynamics $A_r$ in (9) are multiplied by $\rho\in\{0.8, 0.85, 0.9, 0.95\}$, so that the systems do not contain unit roots at any of the frequencies. Here empirical power is defined as the frequency of choosing zero common trends. Therefore, for $\rho = 1$, when there are in fact common trends present in our specifications, the empirical power values plotted in Figure 3 do not equal the actual size one could define as one minus the hit rate: our measure of empirical power in this situation only counts the false test conclusion of zero common trends, while there are of course multiple ways in which the testing procedure could conclude falsely.
As expected, rejection of the null hypothesis is easiest when $\rho$ is small and very difficult when it is close to 1, cf. Figure 3 for the case of $A_2$.
Further, there are almost no differences among the likelihood-based tests over all combinations of sample size and frequency; only for $T = 100$ is JS significantly worse than the $Q_i$, $i = 1, 2, 3$, at $\pi/2$. It is also clearly visible at all frequencies that the likelihood-based tests possess no or only very limited power when $T = 50$ and $T = 100$, respectively. $\Lambda$, in contrast, is clearly more powerful in these cases. As the sample size increases to $T = 200$, the power of each test improves, yet $\Lambda$ remains the most powerful option. Only for $T = 500$ have the differences almost vanished, with small, but significant, advantages for $\Lambda$ at 0 and $\pi$.
The results are the same when A 1 is used and c π / 2 = 1 and all of the differences described here are statistically significant.
Next the estimation performance of CVA is evaluated by calculating the gaps between the true and the estimated cointegrating spaces. At all frequencies these gaps are compared with those of the GARR procedure of [5], which cycles through the frequencies. At π/2, CVA and GARR are also compared with our implementations of JS and cRRR of [4], whereas at 0 and π they are additionally compared with the usual Johansen procedure. All estimates are conditional on the true state space unit root structure in the sense that the minimal number of states used is larger than or equal to the number of unit roots over all frequencies. Other than imposing this minimum state dimension, the estimation of the order using SVC is not influenced. The likelihood-based procedures, on the other hand, take the unit root structure as given, i.e., do not perform CI rank testing for this estimation exercise.
From the results in Table 4 it can be noted first that the likelihood-based procedures show mostly equal mean gaps. Only for π/2, T = 50 and both dgps does JS possess significantly larger gaps than cRRR and GARR; the other differences are not statistically significant. Thus it does not matter in our example whether the iterative procedure is used or not.
Second, CVA is again superior for T = 50, where it exhibits mean gaps that are significantly smaller than those of the other estimators at all frequencies. This advantage is reversed for larger sample sizes, though: mean gaps are smaller, if only slightly, for the likelihood-based procedures when T ∈ {100, 200, 500} and A_2 is used. When A_1 is used instead, mean gaps do not differ significantly from each other at π/2 when T > 50 and at 0, π when T = 100, and those of CVA are only very modestly worse at 0, π when T ∈ {200, 500}.
Thus, when it comes to estimating the cointegrating spaces, CVA is superior for T = 50 and equally good or only slightly worse than the likelihood-based procedures for higher sample sizes. For the systems analyzed, decreasing c π / 2 leads to gaps that are smaller for all methods and these improvements are slightly larger for CVA than for the other estimators.

7.3. Robustness of Unit Root Tests for Daily Data

In this last simulation example we examine the robustness of the proposed procedures, in terms of test performance and prediction accuracy, with respect to the innovation distribution and to the presence of GARCH-type conditional heteroskedasticity, as these features are often observed in data of higher sampling frequency, for example in financial applications. While our asymptotic results do not depend on the distribution of the innovations (subject to the assumptions), the assumptions do not cover GARCH effects. Nevertheless, the theory in [25,26] suggests that the tests might be robust also in this respect.
We generate a state space system of order n = 8 using the matrix $A = [A_{i,j}]_{i,j=1,\ldots,8}$ where $A_{i,i+1} = 1$, $i = 1,\ldots,6$, $A_{7,1} = 1$, $A_{8,8} = 0.8$ and $A_{i,j} = 0$ else. This implies that the eigenvalues of this matrix are $\lambda_j = \exp(2\pi i j/7)$, $j = 1,\ldots,7$, and $\lambda_8 = 0.8$. Therefore the corresponding process has state space unit root structure
$$\big( (0,(1)),\; (2\pi/7,(1)),\; (4\pi/7,(1)),\; (6\pi/7,(1)) \big).$$
The entries of the matrices C and K are chosen as independent standard normally distributed random variables as before.
A process $(y_t)_{t=1,\ldots,T}$ is generated by filtering an independent identically distributed innovation process $(\varepsilon_t)_{t=-199,\ldots,T+1}$ through the system (A, C, K). The first 200 observations are discarded; the last observations are used for validation purposes. A total of 1000 replications are generated, where in each replication a different system is drawn.
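A minimal sketch of this data generating process follows; the output dimension s = 4 and the burn-in handling are assumptions of the sketch, and all function names are ours.

```python
import numpy as np

def simulate_dgp(T, rng, t_innov=False, burn=200):
    """Simulate y_t from the order n = 8 state space system of this section:
    x_{t+1} = A x_t + K e_t,  y_t = C x_t + e_t, where A carries the seven
    seasonal unit roots exp(2*pi*i*j/7) and one stable eigenvalue 0.8."""
    n, s = 8, 4
    A = np.zeros((n, n))
    for i in range(6):                 # A[i, i+1] = 1, i = 1..6 (1-based)
        A[i, i + 1] = 1.0
    A[6, 0] = 1.0                      # A[7, 1] = 1 closes the 7-cycle
    A[7, 7] = 0.8                      # the stable direction
    # sanity check: moduli of the eigenvalues are 1 (x7) and 0.8
    assert np.allclose(np.sort(np.abs(np.linalg.eigvals(A))),
                       np.sort([1.0] * 7 + [0.8]))
    C = rng.standard_normal((s, n))
    K = rng.standard_normal((n, s))
    if t_innov:                        # t(5) innovations scaled to unit variance
        eps = rng.standard_t(5, size=(burn + T, s)) / np.sqrt(5.0 / 3.0)
    else:
        eps = rng.standard_normal((burn + T, s))
    x = np.zeros(n)
    y = np.empty((burn + T, s))
    for t in range(burn + T):
        y[t] = C @ x + eps[t]
        x = A @ x + K @ eps[t]
    return y[burn:]

y = simulate_dgp(364, np.random.default_rng(0))
```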
With the generated data three different estimates are obtained: an autoregressive model (called AR in the following) is estimated with lag length chosen using AIC with maximal lag length equal to $\sqrt{T}$. Second, an autoregressive model with large lag length (called ARlong) is estimated. This estimate hints at the behavior of an autoregression with lag length equal to a full year, which would correspond to estimating a VECM without rank restrictions when accounting for yearly differences. The third estimate is obtained using CVA, where $f = p = 2\hat{k}_{AIC}$ is chosen; a minimal sketch of the CVA steps is given below. The order is estimated by minimizing SVC; however, we correct orders smaller than n = 7, which would preclude finding all unit roots.
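For reference, a minimal sketch of the CVA steps used throughout: reduced rank regression of the stacked future on the stacked past, state extraction from the leading canonical directions, and least squares for the system matrices. The normalization of the state is one common choice, not the only one; order estimation via SVC and the correction to orders of at least 7 are omitted.

```python
import numpy as np

def cva(y, f, p, n):
    """Minimal CVA sketch with fixed f, p and order n."""
    T, s = y.shape
    rows = T - f - p + 1                 # usable time points t = p, ..., T - f
    Ym = np.hstack([y[p - j: p - j + rows] for j in range(1, p + 1)])  # past
    Yp = np.hstack([y[p + j: p + j + rows] for j in range(f)])         # future
    Gpp = Ym.T @ Ym / rows               # sample second moments
    Gff = Yp.T @ Yp / rows
    Gfp = Yp.T @ Ym / rows
    # canonical correlations via a weighted SVD
    Lf = np.linalg.cholesky(Gff)
    Lp = np.linalg.cholesky(Gpp)
    U, sv, Vt = np.linalg.svd(np.linalg.solve(Lf, Gfp @ np.linalg.inv(Lp).T))
    Khat = Vt[:n] @ np.linalg.inv(Lp)    # maps the stacked past onto the state
    x = Ym @ Khat.T                      # estimated state x_t, t = p, ..., T - f
    # C and the innovations from y_t = C x_t + e_t
    C = np.linalg.lstsq(x, y[p: p + rows], rcond=None)[0].T
    e = y[p: p + rows] - x @ C.T
    # A and K jointly from x_{t+1} = A x_t + K e_t
    Z = np.hstack([x[:-1], e[:-1]])
    AK = np.linalg.lstsq(Z, x[1:], rcond=None)[0].T
    A, K = AK[:, :n], AK[:, n:]
    return A, C, K, sv
```

In the experiments of this section, f and p would be set to twice the AIC lag length of a preliminary VAR fit, as stated above.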
First, we compare the prediction accuracy of the three methods for two different innovation distributions: besides the standard normal distribution, also the Student t-distribution with ν = 5 degrees of freedom (scaled to unit variance) is used. This distribution has considerably heavier tails than the normal distribution but is nevertheless covered by our assumptions.
Figure 4 provides the results for the out-of-sample one-day-ahead mean absolute prediction error (over all coordinates) for the sample sizes T = 364 days (one year), T = 1092 (three years) and T = 3276 (nine years). The long AR model is estimated with lag lengths of 8 weeks for the smallest sample size, 10 weeks for the medium sample size and 12 weeks for the largest sample size.
The figure presents the results for the normally distributed innovations as well as those for the t-distributed innovations (scaled to unit variance). For the two larger sample sizes the mean absolute error for CVA is smallest in all cases. For the smallest sample size, by contrast, results are mixed: for CVA the results for the heavy-tailed distribution are much worse than for the normal distribution, while for the larger sample sizes the differences are small. The maximal standard error of the estimated means over the 1000 replications for T = 1092 and T = 3276 amounts to 0.05, which supports the conclusion that CVA performs better for the two larger sample sizes. For T = 364 there are no statistically significant differences between the three methods: CVA suffers more from a few very large errors (in terms of root mean square error the CVA results for T = 364 are comparatively worse; in terms of the 95% percentiles CVA performs best also for the smallest sample size). This results in a standard error over the replications of the mean absolute error for T = 364 of 0.18 for normally distributed innovations and 3.4 for t-distributed innovations. The long AR models are clearly worse than the two other approaches, even though the lag length is still far from a full year.
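The one-step-ahead predictions underlying these comparisons can be obtained from the innovation-form filter of an estimated system (A, C, K); the following sketch initializes the state by running the filter over the estimation sample, which is our choice and not prescribed by the paper.

```python
import numpy as np

def one_step_mae(A, C, K, y_est, y_val):
    """Out-of-sample one-step-ahead MAE via the innovation-form filter:
    x_{t+1} = A x_t + K (y_t - C x_t), forecast y_hat_{t+1|t} = C x_{t+1}."""
    n = A.shape[0]
    x = np.zeros(n)
    for yt in y_est:                        # warm up the state on the estimation data
        x = A @ x + K @ (yt - C @ x)
    abs_err = []
    for yt in y_val:
        abs_err.append(np.abs(yt - C @ x))  # C x is the one-step-ahead forecast
        x = A @ x + K @ (yt - C @ x)
    return float(np.mean(abs_err))
```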
With regard to the unit root tests we investigate results for the tests of the hypotheses H 0 : c m = 1 versus H 1 : c m = 0 at all frequencies 2 π m / 364 , m = 0 , . . . , 363 . The data generating process features unit roots with c m = 1 at the seven frequencies 2 π k / 7 , k = 0 , . . . , 6 . Therefore the tests should not reject at these frequencies, but should reject at all others.
Consequently we compare the minimum of the non-rejection rates over the seven unit roots (called empirical size below) as well as the minimum of the rejection rates, i.e., one minus the maximum of the non-rejection rates, over the non-unit-root frequencies $\omega_j = 2\pi j/364$, $j \neq 52k$, $k = 0, 1, \ldots, 6$ (called empirical power below).
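Computationally, these two summaries can be obtained from a replications-by-frequencies array of test decisions; a minimal sketch (array names are illustrative):

```python
import numpy as np

def size_power(reject):
    """reject: (replications, 364) boolean array with reject[r, m] = True if
    H0: c_m = 1 was rejected at frequency 2*pi*m/364 in replication r.

    Empirical size: smallest non-rejection rate over the seven unit root
    frequencies m = 52k. Empirical power: smallest rejection rate elsewhere."""
    rej_rate = reject.mean(axis=0)
    unit = np.zeros(364, dtype=bool)
    unit[[52 * k for k in range(7)]] = True
    emp_size = (1.0 - rej_rate[unit]).min()
    emp_power = rej_rate[~unit].min()
    return emp_size, emp_power
```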
For the larger sample sizes the empirical size is practically 95%, while the empirical power is 100%. For T = 364 we obtain an empirical size of 90% for the normal distribution and 91.5% for the t-distribution. The worst empirical power equals 89.3% (normal) and 87.6% (t-distribution). Hence even for one year of data the discrimination properties of the unit root tests are good, and we do not observe relevant differences between normally distributed innovations and the heavy-tailed t-distribution.
Finally we compare the empirical size and power of the tests for the various unit roots for smaller sample sizes T { 104 , 208 , 312 , 416 , 520 } . For the experiments we consider univariate GARCH models of the form
$$\varepsilon_{t,i} = h_{t,i}\,\eta_{t,i}, \qquad h_{t,i}^2 = 1 + \alpha\,\varepsilon_{t-1,i}^2 + \beta\,h_{t-1,i}^2, \qquad i = 1,\ldots,4,$$
where $(\eta_{t,i})_{t\in\mathbb{Z}}$ is independent and identically standard normally distributed and $\alpha, \beta \ge 0$ are reals. It follows that the component processes $(\varepsilon_{t,i})_{t\in\mathbb{Z}}$ show conditional heteroskedasticity, the persistence of which is governed by $\alpha + \beta$. Here $0 < \alpha + \beta < 1$ implies stationarity, while $\alpha + \beta = 1$ implies persistent conditional heteroskedasticity, usually termed I-GARCH. We include five different processes for the innovations (a simulation sketch follows the list):
  • norm: α = β = 0 , no GARCH effects
  • G1: α = 0.8 , β = 0.1
  • IG1: α = 0.8 , β = 0.2
  • IG2: α = 0.5 , β = 0.5
  • IG3: α = 0.2 , β = 0.8
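A minimal simulation sketch of these innovation processes; the function name and the initialization of the conditional variance at its unconditional value in the stationary case are our choices.

```python
import numpy as np

def garch_innovations(T, s, alpha, beta, rng):
    """Simulate s independent GARCH(1,1) components:
    e_{t,i} = h_{t,i} * eta_{t,i},
    h_{t,i}^2 = 1 + alpha * e_{t-1,i}^2 + beta * h_{t-1,i}^2,
    with standard normal eta; alpha + beta = 1 gives the I-GARCH cases."""
    eps = np.zeros((T, s))
    # unconditional variance 1/(1 - alpha - beta) in the stationary case
    h2 = np.full(s, 1.0 / (1.0 - alpha - beta) if alpha + beta < 1 else 1.0)
    for t in range(T):
        eta = rng.standard_normal(s)
        eps[t] = np.sqrt(h2) * eta
        h2 = 1.0 + alpha * eps[t] ** 2 + beta * h2
    return eps

# the five specifications from the list above
specs = {"norm": (0.0, 0.0), "G1": (0.8, 0.1), "IG1": (0.8, 0.2),
         "IG2": (0.5, 0.5), "IG3": (0.2, 0.8)}
```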
For the five different sample sizes, 1000 replications of the CVA estimates are obtained. For each estimate we calculate the test statistic for testing $H_0: c_m = 1$ versus $H_1: c_m = 0$ for $m = 0,\ldots,363$, corresponding to the unit roots $z_m = \exp(2\pi i m/364)$. This set contains all seven unit roots $\exp(2\pi i k/7)$, $k = 0,\ldots,6$.
Figure 5 provides the mean over the 1000 replications of the test statistics Λ(1) for $z_j$, $j = 0,\ldots,363$, and the five sample sizes. It can be seen that the test Λ(1) is able to pinpoint the seven unit roots present in the data generating process fairly accurately even for sample size T = 104. The zoom on the region around the unit root frequency $2\pi/7$ shows that the mean value exceeds the cutoff value of the test (the dashed horizontal line) for the adjacent frequency $2\pi\cdot 53/364$ already for T = 312.
Table 5 lists the minimum over the seven unit root frequencies of the achieved percentages of non-rejections of the test statistic, as well as the maximum over all non-unit-root frequencies. For all GARCH models with T = 312 the test rejects the unit root at every non-unit-root frequency in every replication, while the rejection rate at the unit root frequencies is close to the nominal 5%. For smaller sample sizes the tests are slightly undersized, while for T = 208 a slight oversizing is observed. The two larger sample sizes are omitted, as the tests perform perfectly there.
The examples presented in this subsection show that the test is robust, also in small samples, with respect to heavy-tailed innovation distributions (subject to the assumptions). Furthermore, a remarkable robustness with respect to GARCH-type conditional heteroskedasticity is observed.

8. Application

In this section we apply CVA to the modeling of electricity consumption using a data set from [36]. The dataset contains hourly consumption data (in megawatts) for a number of US regions, scraped from the webpage of PJM Interconnection LLC, a regional transmission organization. The set of regions has changed over time, so the data set contains many missing values. It also contains data aggregated into regions called east and west, which are not used subsequently.
In order to avoid problems with missing values, we restrict the analysis to four regions for which data over the same time period is available: American Electric Power (AEP; in the following printed in blue), the Dayton Power and Light Company (DAYTON; black), Dominion Virginia Power (DOM; red) and Duquesne Light Co. (DUQ; green). We use data from 1 May 2005 until 31 July 2018. In this period only 3 data points are missing for the four regions; they are imputed by interpolating the corresponding previous values. One observation in this sample is an obvious outlier, which is corrected analogously.
The data is split into an estimation sample covering observations up to the end of 2016 (102,291 observations on 4263 days) and a validation sample containing the data of 2017 and 2018 (13,845 observations on 577 days). The data is equidistantly sampled but contains irregular two-hour segments at the switches between winter and summer time. Table 6 contains some summary statistics.
Figure 6 provides an overview of the data: panel (a) shows the full data on an hourly basis, while (b) presents the aggregation to daily frequency. Panel (c) zooms in on a two-year stretch of daily consumption. Panel (d) finally provides hourly data for the first month of the validation data. The figures clearly document strong daily, weekly and yearly patterns. These seasonal fluctuations appear somewhat regular but change over time, so it is not clear whether a fixed seasonal pattern is appropriate. Also note that the sampling frequency is hourly, so a year covers roughly 8760 observations.
In the following we estimate (on the estimation part) and compare (on the validation part) a number of different models, first for the full hourly data set and afterwards for the aggregated daily data. As a benchmark we will use univariate AR models including deterministic seasonal patterns for daily, weekly and yearly variations. Subsequently we estimate models using CVA including different sets of such seasonal patterns.
First, fixed periodic patterns are estimated using dummy variables. We model the natural logarithm of consumption (to reduce problems due to heteroskedasticity) and include dummies for weekdays and hours as well as sine and cosine terms corresponding to the first 20 Fourier frequencies with respect to annual periodicity; a sketch of the corresponding design matrix is given below. The results can be viewed in Figure 7. There clearly is substantial periodic variation and, as expected, the four data sets show very similar patterns.
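A sketch of the design matrix for this regression; the base categories, variable names and annual period convention are our choices, and the paper's exact dummy coding may differ.

```python
import numpy as np
import pandas as pd

def seasonal_design(index, n_fourier=20):
    """Deterministic regressors for the hourly model: constant, weekday
    dummies, hour-of-day dummies, and sine/cosine pairs for the first
    n_fourier Fourier frequencies of the yearly cycle.
    index: pandas DatetimeIndex at hourly frequency."""
    hours_per_year = 365.0 * 24.0
    t = np.arange(len(index))
    cols = [np.ones(len(index))]
    cols += [(index.dayofweek == d).astype(float) for d in range(1, 7)]
    cols += [(index.hour == h).astype(float) for h in range(1, 24)]
    for j in range(1, n_fourier + 1):
        w = 2 * np.pi * j * t / hours_per_year
        cols += [np.sin(w), np.cos(w)]
    return np.column_stack(cols)

# residuals of log consumption after removing the fixed periodic pattern
# (log_y: (T, 4) array; idx: matching DatetimeIndex, both assumed given)
# X = seasonal_design(idx)
# beta = np.linalg.lstsq(X, log_y, rcond=None)[0]
# resid = log_y - X @ beta
```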
After the extraction of these deterministic terms the next step is univariate autoregressive (AR) modeling. Figure 8 shows the BIC values of AR models of lag lengths zero to 800 for the four series as well as the BIC of a multivariate AR model for the same number of lags. The chosen values are given in Table 6.
The BIC curve is extremely flat for the univariate models. Noticeable drops in BIC occur around lags 24 (one day), 144 (six days), 168 (one week), 336 (two weeks) and 504 (three weeks). BIC selects large lag lengths, from 529 (DUQ) up to 554 (DOM); AIC selects lag lengths close to the maximum allowed, with a minimum at 772 lags. The BIC pattern of the multivariate model differs in that the two drops at two and three weeks are missing. Instead, the optimal BIC value is obtained at lag 194, well below the optimal lag lengths in the univariate cases. AIC here opts for lag length 531, just over 22 days.
Subsequently CVA is applied with $f = \hat{k}_{BIC}$, $p = \hat{k}_{AIC}$ as estimated for the multivariate model. This differs from the usual recommendation of $f = p = 2\hat{k}_{AIC}$ in order to avoid numerical problems with huge matrices. The order is chosen according to SVC, resulting in $\hat{n} = 240$. The corresponding model is termed Mod 1 in the following. Note that this configuration of $f$, $\hat{n}$ does not fulfill the requirements of our asymptotic theory: the bound $f \ge n$ ensures that the matrix $\mathcal{O}_f$ has full column rank. Generically this is already the case for $fs \ge n$, leading to a less restrictive assumption. In practice, a too small value of f would be detected by $\hat{n}$ being estimated close to the maximum, which is not the case here.
As a second model we only use weekday dummies but neglect the other deterministics. Again AIC ( k ^ A I C = 531 ) and BIC ( k ^ B I C = 195 ) are used to determine the optimal lag length in the multivariate AR model. The corresponding CVA estimated model uses n ^ = 245 according to SVC, resulting in Mod 2.
The third model uses only a constant as deterministic term. Again similar lag lengths are selected by AIC (555) and BIC (195). A state space model, Mod 3, is estimated using CVA with $\hat{n} = 209$.
Figure 9 provides information on the results. Panel (a) shows the coefficients of the univariate AR models. It can be seen that lags around one day and one to three weeks play the biggest role for all four datasets. Panel (b) shows that the multivariate models lead to better one-step-ahead predictions in terms of the root mean square error (RMSE). Mod 1 and Mod 2 show practically equivalent out-of-sample prediction errors for all four data sets, while Mod 3 delivers the best out-of-sample fit for all four regions.
In particular in financial applications, data of high sampling frequency shows persistent behaviour, also in terms of conditional heteroskedasticity, as well as heavy-tailed innovation distributions. Figure 10 below provides some information in this respect for the residuals of Mod 3. Panel (a) plots the residuals for the year 2018 (contained in the validation period). Large deviations occur occasionally, while otherwise the residuals vary within a tight band around 0. The kernel density estimates of the normalized (to unit variance) residuals over the full validation data set in panel (b) show the typical heavy-tailed shape. Panel (c) contains an ACF plot for the four regions, again calculated using the full validation sample. It demonstrates that the model successfully eliminates essentially all autocorrelation, with only a few ACF values falling outside the confidence interval. Panel (d) provides the ACF plot for the squared innovations to examine GARCH-type effects. While GARCH effects are clearly visible, the ACF drops to zero quickly, with occasional positive values (except perhaps for the Duquesne data).
Applying the eigenvalue-based test Λ(1) for c = 1 at all Fourier frequencies $\omega_j = 2\pi j/(365\cdot 24)$, we find that for Mod 2 and Mod 3 the largest p-value is obtained for $\omega_{365}$, corresponding to a period length of one day: 0.0187 for Mod 2 (test statistic 6.6) and 0.02 for Mod 3 (test statistic 6.5). Thus the unit root at frequency $\omega_{365}$ is not rejected at the 1% significance level, but is rejected at the 5% level. All other unit roots are rejected at every usual significance level. For Mod 1 the test statistic for $\omega_{365}$ equals 41.2, corresponding to a p-value of practically 0. This implies that, on top of a deterministic daily pattern, the series show strong persistence at the daily period. Excluding the hourly dummies pulls the roots closest to $\omega_{365}$ closer to the unit circle, resulting in insignificant unit root tests, and improves the one-step-ahead forecasts. Including the dummies weakens the evidence of a unit root while leading to worse predictions.
The analysis is repeated with data aggregated to daily sampling frequency. The aggregation reduces the required lag lengths, as is visible from Table 6 in the univariate case, and hence we use CVA with the recommended f = p = 2 k ^ A I C . Beside the univariate models, in this case also a naive model of predicting the consumption for today as yesterday’s consumption is used. Three multivariate models are estimated: Mod 1 contains weekday dummies and sine and cosine terms for the first twenty Fourier frequencies corresponding to a period of one year. Mod 2 only contains the weekday dummies, while Mod 3 only uses the constant. Figure 11 provides the out-of-sample RMSE for one day ahead predictions (panel (a)) and seven days ahead predictions (panel (b)).
It can be seen that both Mod 1 and Mod 2 beat the univariate AR models in terms of one-step-ahead prediction error, while Mod 3 performs better for seven-days-ahead prediction. Mod 1 performs on par with Mod 2 for one-step-ahead prediction but is better at predicting seven steps ahead. In Figure 12 the poles and zeros of the three estimated state space models are plotted; a sketch of their computation follows below. Here the poles (marked with 'x') are the eigenvalues of the matrix A; these are the inverses of the determinantal roots of the autoregressive matrix polynomial in the equivalent VARMA representation. The zeros are the inverses of the zeros of the determinant of the MA polynomial. For Mod 3, which uses only a constant, poles close to the seasonal unit roots $\exp(2\pi i j/7)$, $j = 1,\ldots,6$, arise to capture the weekly pattern. The other two models show only one pole close to the unit circle, a real pole of almost z = 1. The pole of Mod 1 is closer to the unit circle than that of Mod 2 (see (b)).
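A sketch of how such pole-zero plots can be computed from an estimated innovation-form system. The formula for the zeros uses the standard fact that the inverse system has state matrix A − KC; in the lag-polynomial convention of the text, its eigenvalues are the inverses of the zeros of the determinant of the MA polynomial.

```python
import numpy as np

def poles_and_zeros(A, C, K):
    """Poles and zeros of the state space model in innovation form:
    x_{t+1} = A x_t + K e_t, y_t = C x_t + e_t.
    Poles: eigenvalues of A. Zeros: eigenvalues of A - K C, i.e. the state
    matrix of the inverse filter e_t = y_t - C x_t, x_{t+1} = (A - KC) x_t + K y_t."""
    poles = np.linalg.eigvals(A)
    zeros = np.linalg.eigvals(A - K @ C)
    return poles, zeros

# e.g., distance of the nearest pole to a seasonal unit root exp(2*pi*1j*j/7):
# poles, _ = poles_and_zeros(A, C, K)
# dist = np.abs(poles - np.exp(2j * np.pi * j / 7)).min()
```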
For Mod 3 we obtain p-values for the tests of the three complex unit roots of 0.05 (ω = 2π/7), 0.165 (4π/7) and 0.01 (6π/7); hence none of them is rejected at significance level α = 0.01. The corresponding test for z = 1 shows a p-value of 0.004, providing evidence against the null hypothesis of the root being present. For Mod 1 the p-value for the test of z = 1 is 0.28, so we cannot reject the null. Mod 2 provides a p-value of 0.023 and hence only weak evidence for the presence of the unit root, in line with the distance of the nearest pole from the point z = 1 in Figure 12.
Jointly this indicates that the location and strength of persistence due to the estimated roots is influenced by the presence of deterministic terms: if the deterministic terms are not included in the model, the cyclical patterns are generated by poles situated close to the unit circle.
Whether unit roots exist on top of the deterministic seasonality is not easy to decide in all cases: for the daily data the locations of the poles indicate that deterministic seasonality is enough to capture the weekly fluctuations, while a unit root at z = 1 appears to be needed to capture yearly variations. For the hourly data there is evidence that the daily cycle is best captured by a unit root at frequency $\omega_{365}$; this leads to the best predictive fit. Finally note that temporal aggregation from hourly to daily data implies that the frequency $\omega_{365}$ of the hourly data aliases to the frequency ω = 0 in the daily data. Therefore the stronger evidence of a unit root at z = 1 found in the daily data might be a consequence of the unit root at frequency $\omega_{365}$ found for the hourly data, compare [37].
The system matrix estimates as well as the evidence in support of unit roots at ω 365 for hourly data and z = 1 for daily data that we obtain from the CVA modeling can be taken as starting points in subsequent quasi maximum likelihood estimation.

9. Conclusions

In this paper the asymptotic properties of CVA estimators for seasonally integrated unit root processes are investigated. The main results can be summarized as follows:
  • CVA provides consistent estimators for long-run and short-run dynamics without knowledge of the location and number of unit roots. Hence the algorithm is robust with respect to the presence of trending components at frequency zero as well as at the other seasonal unit root frequencies.
  • The singular values calculated in the RRR step reveal information on the total number of unit roots. The distance of the singular values to one can be used to construct a consistent estimator of this quantity.
  • The eigenvalues of A ^ can be used in order to test for the number of common trends. Under the null hypothesis these tests are asymptotically equivalent to the corresponding tests using the true state, making the derivation of asymptotic results and the simulation of the test distribution simple.
  • An analogous statement holds for the Johansen trace test in the I(1) case and analogous tests in the MFI(1) case calculated on the basis of the estimated state in the restrictive setting of $n \le s$. Under the null hypothesis these tests reject and accept asymptotically jointly with the corresponding tests calculated using the true state.
  • From the simulation exercises we conclude that CVA performs best when the dgp is of the more general VARMA type, the process dimension is moderate to large and the sample size is small. Then it is superior to the likelihood-based procedures based on VAR approximations in terms of the estimation performance and the size and power of Λ , the test developed from CVA. For higher sample sizes the likelihood-based procedures are clearly superior when it comes to the size of the corresponding tests, whereas Λ remains the best test choice in terms of empirical power. The estimation performance is about equal for all procedures when the sample size is high with slight advantages for the likelihood-based procedures.
  • The simulations also demonstrate that the unit root test results are robust with respect to the distribution of the innovation sequence as well as some forms of conditional heteroskedasticity of the GARCH-type.
Because of the promising performance of CVA, and in particular its robustness, it can be recommended as a simple way to extract information on the number of common trends from the estimated matrix of transition dynamics. This information can be used to reduce the uncertainty in a subsequent likelihood ratio analysis, where quasi maximum likelihood estimates can be obtained starting from the CVA estimates. Since the CVA estimates can be computed quickly for a whole range of orders, they are a valuable starting point for the empirical modeling of time series potentially featuring seasonal cointegration. Moreover, they can also be used in situations where the number of seasons is large or even unclear, as in hourly data sets, as demonstrated in the case study.

Author Contributions

Conceptualization, D.B. and R.B.; methodology, D.B.; software, R.B.; formal analysis, D.B. and R.B.; writing—original draft preparation, D.B. and R.B.; writing—review and editing, D.B. and R.B.; visualization, D.B. and R.B.; supervision, D.B. Both authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation—Projektnummer 276051388) which is gratefully acknowledged. We acknowledge support for the publication costs by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Supporting Material

Appendix A.1. Complex Valued Canonical Form

Additionally to the real valued canonical form (2) we will also use the corresponding complex valued representation obtained by transforming each block corresponding to unit root z j = cos ( ω j ) + i sin ( ω j ) with the transformation matrix
$$T_j = \begin{bmatrix} I_{c_j} & i\,I_{c_j} \\ I_{c_j} & -i\,I_{c_j} \end{bmatrix}$$
leading to the triple of system matrices in the j-th block as:
$$A_{j,C} = \begin{bmatrix} \bar z_j I_{c_j} & 0 \\ 0 & z_j I_{c_j} \end{bmatrix}, \qquad K_{j,C} = \begin{bmatrix} K_{j,C} \\ \overline{K_{j,C}} \end{bmatrix}, \qquad C_{j,C} = \begin{bmatrix} C_{j,C}/2 & \overline{C_{j,C}}/2 \end{bmatrix},$$
such that
$$x_{t+1,j,C} = \bar z_j\, x_{t,j,C} + K_{j,C}\,\varepsilon_t, \qquad x_{t,j} = T_j^{-1}\begin{bmatrix} x_{t,j,C} \\ \overline{x_{t,j,C}} \end{bmatrix}.$$
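For completeness, a short verification (assuming, as in the real canonical form (2), that the block for frequency $\omega_j$ is the rotation matrix shown) that $T_j$ indeed block-diagonalizes it:

$$T_j \begin{bmatrix} \cos\omega_j I_{c_j} & \sin\omega_j I_{c_j} \\ -\sin\omega_j I_{c_j} & \cos\omega_j I_{c_j} \end{bmatrix} T_j^{-1} = \begin{bmatrix} \bar z_j I_{c_j} & 0 \\ 0 & z_j I_{c_j} \end{bmatrix}, \qquad T_j^{-1} = \frac{1}{2}\begin{bmatrix} I_{c_j} & I_{c_j} \\ -i\,I_{c_j} & i\,I_{c_j} \end{bmatrix},$$

since the first block row of $T_j$ times the rotation block equals $(\cos\omega_j - i\sin\omega_j)[I_{c_j}, i I_{c_j}] = \bar z_j [I_{c_j}, i I_{c_j}]$ and the second equals $z_j [I_{c_j}, -i I_{c_j}]$.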
Lemma A1.
Let $x_t = [x_{t,0}', x_{t,1}', \ldots, x_{t,S/2}', x_{t,\bullet}']'$ where $x_{t,j}$ is generated according to $x_{t+1,j} = A_j x_{t,j} + K_j\varepsilon_t$, $t\in\mathbb{N}$, with $A_j$ as in (2) and $K_j = [K_{j,R}', K_{j,I}']' \in \mathbb{R}^{\delta_j c_j\times s}$, using an iid white noise process $(\varepsilon_t)_{t\in\mathbb{N}}$, where $x_{0,j}$ is deterministic. Further let $(x_{t,\bullet})_{t\in\mathbb{N}}$ denote the stationary solution to the equation $x_{t+1,\bullet} = A_\bullet x_{t,\bullet} + K_\bullet\varepsilon_t$ such that $M = E x_{t,\bullet} x_{t,\bullet}' > 0$.
(I) Then, using $Q_T = \sqrt{(\log\log T)/T}$, for $u_t = \sum_{i=0}^q \varphi_i\varepsilon_{t+i}$ for arbitrary $q\in\mathbb{N}$, $q<\infty$, and coefficients $\varphi_i$, $i = 0,\ldots,q$, we have
$$\begin{gathered}
\langle x_{t,\bullet}, u_t\rangle = O(Q_T), \qquad \langle u_{t-j}, u_t\rangle - E u_{t-j}u_t' = O(Q_T), \qquad \langle x_{t,j}, x_{t,\bullet}\rangle = O(\log T), \qquad \langle x_{t,j}, u_t\rangle = O(\log T),\\
\langle x_{t,j}, x_{t,k}\rangle/T = O(\log\log T), \qquad j,k = 0,\ldots,S/2.
\end{gathered}$$
If $(\varepsilon_t)_{t\in\mathbb{Z}}$ only fulfills Assumption 1, then the order bounds hold in probability rather than almost surely.
(II) Furthermore, for $0 < j, k < S/2$,
$$\langle x_{t,j,C}, \varepsilon_t\rangle \xrightarrow{d} \tfrac12\int_0^1 W_j\, dB_{j,C}' =: M_j, \qquad \langle x_{t,j,C}, x_{t,k,C}\rangle/T \xrightarrow{d} \begin{cases} \tfrac12\int_0^1 W_j W_j^* =: N_j, & j = k,\\ 0, & j\ne k,\end{cases}$$
$$\langle x_{t,j}, \varepsilon_t\rangle \xrightarrow{d} \begin{bmatrix} \tfrac12\int_0^1 \big(W_{j,R}\,dB_{j,R}' + W_{j,I}\,dB_{j,I}'\big) \\[0.5ex] \tfrac12\int_0^1 \big(W_{j,I}\,dB_{j,R}' - W_{j,R}\,dB_{j,I}'\big)\end{bmatrix},$$
$$\langle x_{t,k}, x_{t,j}\rangle/T \xrightarrow{d} \begin{cases} \dfrac12\begin{bmatrix} \int_0^1 \big(W_{k,R}W_{k,R}' + W_{k,I}W_{k,I}'\big) & \int_0^1 \big(W_{k,R}W_{k,I}' - W_{k,I}W_{k,R}'\big) \\[0.5ex] \int_0^1 \big(W_{k,I}W_{k,R}' - W_{k,R}W_{k,I}'\big) & \int_0^1 \big(W_{k,R}W_{k,R}' + W_{k,I}W_{k,I}'\big)\end{bmatrix}, & j = k,\\[2ex] 0, & j\ne k,\end{cases}$$
where $W_j = W_{j,R} + i W_{j,I} = K_{j,C} B_{j,C}$, $K_{j,C} = K_{j,R} + i K_{j,I}$, $B_{j,C} = B_{j,R} + i B_{j,I}$, and $B_{j,R}, B_{j,I}$ are two independent Brownian motions with covariance matrix Ω. For $j = 0$ and $j = S/2$ the results hold analogously:
$$\langle x_{t,0}, \varepsilon_t\rangle \xrightarrow{d} \int_0^1 W_{0,R}\,dB_{0,R}', \quad \langle x_{t,0}, x_{t,0}\rangle/T \xrightarrow{d} \int_0^1 W_{0,R}W_{0,R}', \quad \langle x_{t,S/2}, \varepsilon_t\rangle \xrightarrow{d} \int_0^1 W_{S/2,R}\,dB_{S/2,R}', \quad \langle x_{t,S/2}, x_{t,S/2}\rangle/T \xrightarrow{d} \int_0^1 W_{S/2,R}W_{S/2,R}'.$$
Proof. 
Most evaluations in (I) are standard, see for example Lemma 4 in [38].
(II) follows from the results in Section 4 of [2] for the complex valued representations or [39] for the corresponding real case.    □

Appendix A.2. Perturbation of Eigendecompositions

Lemma A2
(Rayleigh–Schrödinger expansion). Let $\hat A_t = A - \delta A_t$ where $\|\delta A_t\| \to 0$, and let $A = U\Lambda U^{-1}\in\mathbb{R}^{n\times n}$, $\Lambda = \mathrm{diag}(\lambda_1 I_{c_1},\ldots,\lambda_J I_{c_J})$, $\sum_{j=1}^J c_j = n$, be diagonalizable. $U = [U_1,\ldots,U_J]\in\mathbb{C}^{n\times n}$ is a nonsingular matrix such that for $U^{-1} = [V_1,\ldots,V_J]^*$ we have $V_j^* U_j = I_{c_j}$.
Then for each circle $B(\lambda_j,\delta)$ around $\lambda_j$ not containing any other eigenvalue of $A$ there exist, from some t onwards:
  • $c_j$ eigenvalues of $\hat A_t$ in the circle $B(\lambda_j,\delta)$ around $\lambda_j$,
  • a basis $\hat U_{t,j}$ of the space spanned by the eigenspaces of these $c_j$ eigenvalues such that $V_j^*\hat U_{t,j} = I_{c_j}$,
  • a sequence of matrices $\hat B_{t,j} = V_j^*\hat A_t\hat U_{t,j}\in\mathbb{C}^{c_j\times c_j}$.
Then $\hat U_{t,j} = \sum_{k=0}^\infty Z_k$ and $\hat B_{t,j} = \sum_{k=0}^\infty C_k$, where
$$Z_0 = U_j, \qquad C_0 = \lambda_j I_{c_j}, \qquad Z_k = \Sigma\Big(\delta A_t Z_{k-1} + \sum_{i=1}^{k-1} Z_{k-i} C_i\Big), \qquad C_k = V_j^*\,\delta A_t\, Z_{k-1}.$$
Here $\Sigma = U(\Lambda - I_n\lambda_j)^+ U^{-1}$, where $\mathrm{diag}(s_1,\ldots,s_n)^+ = \mathrm{diag}(s_1^+,\ldots,s_n^+)$ with $x^+ = 1/x$ for $x\ne 0$ and $x^+ = 0$ else; that is, $(\Lambda - I_n\lambda_j)^+$ denotes a quasi-inverse.
Furthermore, for $\rho = \|\delta A_t\| < 1$ we obtain $\|C_k\| \le \mu_C\rho^k$ and $\|Z_k\| \le \mu_Z\rho^k$, $k\ge 0$.
The results follow directly from Section 2.9 of [23], see in particular Proposition 2.9.1 and the discussion below it. Further note that the results hold for each root separately; hence the restriction $j = 1$ needs to hold only for the investigated root for the results to apply. Finally note that the second order approximations $\hat U_{t,j} = Z_0 + Z_1 + Z_2$ and $\hat B_{t,j} = C_0 + C_1 + C_2$ are accurate to the order $o(\|\delta A_t\|^2)$.
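A small numerical illustration of the first order term of this expansion; all names are ours and the sign convention follows $\hat A_t = A - \delta A_t$, so for a simple eigenvalue the perturbed eigenvalue is approximately $\lambda_j + C_1 = \lambda_j - V_j^*\,\delta A_t\, U_j$, with error of second order.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
lam, U = np.linalg.eig(A)
V = np.linalg.inv(U)                    # row j satisfies V[j] @ U[:, j] = 1
dA = 1e-4 * rng.standard_normal((n, n))
lam_hat = np.linalg.eigvals(A - dA)     # eigenvalues of the perturbed matrix

j = 0
first_order = lam[j] - V[j] @ dA @ U[:, j]          # lambda_j + C_1
closest = lam_hat[np.argmin(np.abs(lam_hat - lam[j]))]
print(abs(closest - first_order))       # roughly of the order ||dA||^2 ~ 1e-8
```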

Appendix A.3. Random Transformation of Systems

Lemma A3.
Let the assumptions of Theorem 1 hold and use the same notation as given there. Let $(\tilde A, \tilde C, \tilde K)$ denote a sequence of systems converging a.s. to $(A, C, K)$ such that $(\tilde A - A)D_x^{-1} = O((\log T)^a)$, $\sqrt T(\tilde K - K) = O((\log T)^a)$, $(\tilde C - C)D_x^{-1} = O((\log T)^a)$, and let $A_0 = S_0 A S_0^{-1} = \mathrm{diag}(A_{0,11}, A_{0,22})$, $K_0 = S_0 K$, $C_0 = C S_0^{-1}$. Further let
$$S_T = \begin{bmatrix} S_{T,11} & S_{T,12} \\ 0 & S_{T,22}\end{bmatrix} \to S_0$$
such that $(S_T - S_0)D_x^{-1} = O((\log T)^a)$. Let $\Delta_S = (S_T - S_0)D_x^{-1}$, $\Delta_A = (\tilde A - A)D_x^{-1}$, and denote the sequence of transformed systems by $(\hat A, \hat C, \hat K) = (S_T\tilde A S_T^{-1}, \tilde C S_T^{-1}, S_T\tilde K)$. Let the block entries of $S_0$ be denoted by $S_{ij}$ and the blocks of $\Delta_S$ by $\Delta_{S,ij}$. Then:
$$\begin{aligned}
T(\hat A_{11} - A_{0,11}) &= \big(\Delta_{S,11}A_{11} - A_{0,11}\Delta_{S,11} + S_{11}\Delta_{A,11} + S_{12}\Delta_{A,21}\big)S_{11}^{-1} + o(1),\\
\sqrt T(\hat A_{12} - A_{0,12}) &= \big(S_{11}\Delta_{A,12} + S_{12}\Delta_{A,22}\big)S_{22}^{-1} + \Delta_{S,12}S_{22}^{-1}A_{0,22} - A_{0,11}\Delta_{S,12}S_{22}^{-1} + o(1),\\
T(\hat A_{21} - A_{0,21}) &= S_{22}\Delta_{A,21}S_{11}^{-1} + o(1),\\
\sqrt T(\hat A_{22} - A_{0,22}) &= \Delta_{S,22}S_{22}^{-1}A_{0,22} + S_{22}\Delta_{A,22}S_{22}^{-1} - A_{0,22}\Delta_{S,22}S_{22}^{-1} + o(1),\\
\sqrt T(\hat K - K_0) &= \begin{bmatrix}\Delta_{S,12}K_2 + S_{11}\sqrt T(\tilde K_1 - K_1) + S_{12}\sqrt T(\tilde K_2 - K_2)\\ \Delta_{S,22}K_2 + S_{22}\sqrt T(\tilde K_2 - K_2)\end{bmatrix} + o(1),\\
(\hat C - C_0)D_x^{-1} &= (\tilde C - C)D_x^{-1}\begin{bmatrix}S_{11}^{-1} & 0\\ 0 & S_{22}^{-1}\end{bmatrix} - C_0\begin{bmatrix}\Delta_{S,11}S_{11}^{-1} & \Delta_{S,12}S_{22}^{-1}\\ 0 & \Delta_{S,22}S_{22}^{-1}\end{bmatrix} + o(1).
\end{aligned}$$
Proof. 
The proof follows from straightforward computations using the various orders of convergence by neglecting higher order terms.    □

Appendix B. Reduced Rank Regression with Integrated Variables

The main results of this paper are based on a more general result documented in [24] (henceforth called BRRR). BRRR uses a slightly different setting and in particular a different dgp. The following lemma provides the essence of the results of BRRR that will be used below.
Lemma A4.
Let $(y_t)_{t\in\mathbb{N}}$, $(z_t^r)_{t\in\mathbb{N}}$, $(z_t^u)_{t\in\mathbb{N}}$, $y_t\in\mathbb{R}^s$, $z_t^r\in\mathbb{R}^m$, $z_t^u\in\mathbb{R}^l$, be three processes related via
$$y_t = b_r z_t^r + b_u z_t^u + u_t$$
where the zero mean stationary process $(u_t)_{t\in\mathbb{N}}$ is such that $E u_t(z_t^r)' = 0$, $E u_t(z_t^u)' = 0$, $E u_t u_t' > 0$, and where $n = \mathrm{rank}(b_r) < \min(s, m)$, that is, $b_r$ is of reduced rank.
Further assume that there exist square nonsingular matrices $T_y\in\mathbb{R}^{s\times s}$, $T_r\in\mathbb{R}^{m\times m}$, $T_u\in\mathbb{R}^{l\times l}$ such that
$$\tilde y_t = T_y y_t = (T_y b_r T_r^{-1})(T_r z_t^r) + (T_y b_u T_u^{-1})(T_u z_t^u) + T_y u_t = \tilde b_r\tilde z_t + \tilde b_u\tilde z_t^u + \tilde u_t$$
such that with $\bar c = n - c$ we have
$$\tilde b_r = \begin{bmatrix} I_c & 0 & 0\\ 0 & 0 & \tilde b_{r,\bullet}\end{bmatrix}, \qquad \tilde b_{r,\bullet} = \tilde O\Gamma, \quad \tilde O\in\mathbb{R}^{(s-c)\times\bar c}, \;\; \Gamma\in\mathbb{R}^{\bar c\times m_\bullet}.$$
Here the partitioning corresponds to $\tilde z_t = [\tilde z_{t,1}', \tilde z_{t,2}', \tilde z_{t,\bullet}']'$, where $\tilde z_{t,1}\in\mathbb{R}^c$ and $\tilde z_{t,2}\in\mathbb{R}^{m-c-m_\bullet}$ are MFI(1) processes and $(\tilde z_{t,\bullet})_{t\in\mathbb{N}}$, $\tilde z_{t,\bullet}\in\mathbb{R}^{m_\bullet}$, is stationary; $\tilde z_t^u = [(\tilde z_{t,1}^u)', (\tilde z_{t,\bullet}^u)']'$, where $(\tilde z_{t,1}^u)_{t\in\mathbb{N}}$ is an MFI(1) process and $(\tilde z_{t,\bullet}^u)_{t\in\mathbb{N}}$ is stationary, and where the following bounds hold ($\tilde z_{t,:} := [\tilde z_{t,1}', \tilde z_{t,2}']'$):
$$\begin{gathered}
\langle\tilde u_t, \tilde u_t\rangle = O(1), \quad \langle\tilde u_t, \tilde z_{t,\bullet}\rangle = O(Q_T), \quad \langle\tilde u_t, \tilde z_{t,\bullet}^u\rangle = O(Q_T), \quad \langle\tilde u_t, \tilde u_t\rangle - E\tilde u_t\tilde u_t' = O(Q_T),\\
\langle\tilde u_t, \tilde z_{t,:}\rangle = O(\log T), \quad \langle\tilde u_t, \tilde z_{t,1}^u\rangle = O(\log T),\\
\hat M = \Big\langle\begin{bmatrix}\tilde z_{t,\bullet}\\ \tilde z_{t,\bullet}^u\end{bmatrix}, \begin{bmatrix}\tilde z_{t,\bullet}\\ \tilde z_{t,\bullet}^u\end{bmatrix}\Big\rangle, \quad \hat M^{-1} = O(1), \quad \hat M = O(1), \quad M > 0,\\
\hat M_1 = \Big\langle\begin{bmatrix}\tilde z_{t,:}\\ \tilde z_{t,1}^u\end{bmatrix}, \begin{bmatrix}\tilde z_{t,:}\\ \tilde z_{t,1}^u\end{bmatrix}\Big\rangle, \quad \hat M_1/T = O(\log\log T), \quad (\hat M_1)^{-1} = O(Q_T^2),\\
\Big\langle\begin{bmatrix}\tilde z_{t,\bullet}\\ \tilde z_{t,\bullet}^u\end{bmatrix}, \begin{bmatrix}\tilde z_{t,:}\\ \tilde z_{t,1}^u\end{bmatrix}\Big\rangle = O(\log T), \quad \hat M - M = O(Q_T).
\end{gathered}$$
Then the reduced rank regression estimator $\hat b_{RRR} = [\hat b_{r,RRR}, \hat b_{u,RRR}]$ maximizing the Gaussian likelihood subject to $\mathrm{rank}(b_r) = n = c + \bar c$ is consistent: $\hat b_{RRR} - b = O((\log T)^a/T^{1/2})$ for some $a < \infty$. Furthermore, $\tilde b_{RRR,r} - \tilde b_r = [O((\log T)^a/T), O((\log T)^a/T^{1/2})]$ with $\tilde b_{RRR,r} = T_y\hat b_{RRR,r} T_r^{-1}$, where the second block has $m_\bullet$ columns and corresponds to the stationary components of the regressor vector.
Proof. 
The theorem slightly extends the results of BRRR by adding high level assumptions instead of low level assumptions on the data generating process. The proof hence consists in adjusting the proof in BRRR. In the following we only indicate where arguments in BRRR need to be replaced. A detailed proof would replicate much of the arguments in BRRR and hence is omitted.
The representation of Theorem 3.1 in BRRR is contained in the assumptions. Then consistency follows from examining the proof of the first part of Theorem 3.2 in BRRR: essential for the norm bounds are Lemma A.1 (I) and (III). The norm bounds stated under point (I) are directly assumed in this lemma except for the filtered version using n t in place of x t . Instead, here the results for n t which are needed in the proof of Theorem 3.2 of BRRR are directly assumed. (III) then follows. Lemmas A.3–A.5 in BRRR do not depend on the assumptions on the various processes and hence continue to hold. Then the proof for consistency in Appendix A.3.1 of BRRR only uses these norm bounds referring also to [38] (which is also only based on the norm bounds contained in the assumptions of this lemma) and hence continues to hold.    □

Appendix C. Proofs of the Theorems

Appendix C.1. Proof of Theorem 1

For proving consistency of the transfer function estimators it is sufficient to find a (possibly) random matrix S ˜ T such that the least squares estimates ( A ˜ , C ˜ , K ˜ ) of one representation ( A , C , K ) of the true system obtained using x ˜ t : = S ˜ T x ^ t converges (a.s.) to ( A , C , K ) . This will be done in two steps: First a particular basis (which is not realizable in practice) will be chosen such that K ˜ p K p = o ( 1 ) sufficiently fast such that in the second step the regressions in the system equations based on the resulting state estimator x ˜ t are consistent. The derivation of the first step will also provide an approximation of the error term which can be used in order to derive the asymptotic distribution.

Appendix C.1.1. Proof of Theorem 1 (I)

The central step in CVA is the solution of the RRR problem. The following proof draws heavily on the results of [24] (henceforth called BRRR), collected in Lemma A4 for easier reference. As in BRRR, in order to derive the asymptotic properties we first transform the vectors so as to separate stationary and nonstationary terms. To achieve the separation let $Z_t = [y_{t-1}', y_{t-2}', \ldots, y_{t-S}']'\in\mathbb{R}^{sS}$. Then for $p = kS$ we obtain
$$Y_{t,p}^- = \begin{bmatrix} y_{t-1}\\ y_{t-2}\\ \vdots\\ y_{t-S}\\ y_{t-S-1}\\ \vdots\\ y_{t-kS}\end{bmatrix} = \begin{bmatrix} Z_t\\ Z_{t-S}\\ \vdots\\ Z_{t-(k-1)S}\end{bmatrix}.$$
It is easy to see that for each j the process $(Z_{rS+j})_{r\in\mathbb{N}}$ is an I(1) process. Moreover, the strict minimum-phase condition for $(A, C, K)$ implies that the strict minimum-phase condition also holds for the system corresponding to $(Z_{rS+j})_{r\in\mathbb{N}}$.
Define the transformation $T_S := [O_{S,1}, O_{S,\bullet}]'$, where $O_{S,1}\in\mathbb{R}^{sS\times c}$ denotes the matrix containing the first c columns of $\mathcal{O}_S$ for the system $(A, C, K)$ in canonical form. Further, $O_{S,\bullet}$ is a block column of an orthonormal matrix such that $O_{S,\bullet}'O_{S,1} = 0$. Then the argument of [20] shows that in $T_S Z_t$ the first c components are integrated while the remaining $sS - c$ components are stationary. Then consider for $p = kS < t \le T - f + 1$ (using $O_{S,1}^\dagger = (O_{S,1}'O_{S,1})^{-1}O_{S,1}'$)
$$\tilde z_{t,p} := \begin{bmatrix} O_{S,1}^\dagger\,\mathcal{O}_S\big(x_t - \bar A^p x_{t-p}\big)\\ O_{S,\bullet}' Z_t\\ O_{S,1}^\dagger\big(Z_t - Z_{t-S}\big)\\ O_{S,\bullet}' Z_{t-S}\\ \vdots\\ O_{S,1}^\dagger\big(Z_{t-(k-2)S} - Z_{t-(k-1)S}\big)\\ O_{S,\bullet}' Z_{t-(k-1)S}\end{bmatrix}, \qquad \tilde y_t := \begin{bmatrix} O_{f,1}^\dagger\\ O_{f,\bullet}'\end{bmatrix} Y_{t,f}^+.$$
Here $O_{f,\bullet}$ is a matrix such that $O_{f,\bullet}'O_{f,1} = 0$ and $O_{f,\bullet}'O_{f,\bullet} = I$. Obviously $\tilde z_{t,p}$ is a linear transformation of $Y_{t,p}^-$ and $\tilde y_t$ one of $Y_{t,f}^+$. It can be shown that the linear transformation is nonsingular, so there is a one-to-one relation between $Y_{t,p}^-$ and $\tilde z_{t,p}$. In $\tilde z_{t,p}$ and $\tilde y_t$ only the first c components are unit root processes, the remaining components being stationary.
For $p \ne kS$ the final $p - kS$ block rows of $\tilde z_{t,p}$ are defined as $y_{t-(k-1)S-j} - y_{t-kS-j}$, $j = 1,\ldots,p-kS$. Clearly these components are also stationary.
Partition $\tilde z_{t,p} = [\tilde z_{t,1}', \tilde z_{t,\bullet}']'$, $\tilde z_{t,1}\in\mathbb{R}^c$, into its first c and the remaining coordinates (omitting the subscript p on the right-hand side for notational convenience). Similarly partition $\tilde y_t = [\tilde y_{t,1}', \tilde y_{t,\bullet}']'$, $\tilde y_{t,1}\in\mathbb{R}^c$. Using these transformed quantities, $Y_{t,f}^+ = \beta_1 Y_{t,p}^- + N_{t,f}^+$ can be written as
$$\tilde y_t = \tilde b_1\tilde z_{t,p} + \tilde N_{t,f,p}^+ = \begin{bmatrix}\tilde y_{t,1}\\ \tilde y_{t,\bullet}\end{bmatrix} = \begin{bmatrix} I_c & 0\\ 0 & \tilde b_{\bullet,p}\end{bmatrix}\begin{bmatrix}\tilde z_{t,1}\\ \tilde z_{t,\bullet}\end{bmatrix} + \tilde{\mathcal{O}}_f\bar A^p x_{t-p} + \begin{bmatrix}\tilde\varepsilon_{t,1}\\ \tilde\varepsilon_{t,\bullet}\end{bmatrix}$$
where
$$\tilde b_1 = \begin{bmatrix} I_c & 0\\ 0 & \tilde b_{\bullet,p}\end{bmatrix}, \qquad \tilde b_{\bullet,p} = E\tilde y_{t,\bullet}\tilde z_{t,\bullet}'\big(E\tilde z_{t,\bullet}\tilde z_{t,\bullet}'\big)^{-1} = O_{\bullet,p}\Gamma_{\bullet,p}, \qquad \tilde b_1 = O_p\Gamma_p$$
and where $\tilde b_{\bullet,p}$ is of rank $n - c$, providing a representation of the form given in Theorem 3.1 of BRRR, except that the error term $\tilde N_{t,f,p}^+$ (defined by the equation) is not white. Finally, (A2) also defines the sub-blocks $\tilde\varepsilon_{t,i}$ of $\tilde N_{t,f,p}^+$, which are linear combinations of $E_{t,f}^+$ and therefore typically MA(f) processes. Note that $\tilde z_{t,1}$, $\tilde z_{t,\bullet}$, $\tilde y_{t,\bullet}$ depend on the choice of f and p, which is not reflected in the notation.
Here $E\tilde z_{t,\bullet}\tilde z_{t,\bullet}'$ and $E\tilde y_{t,\bullet}\tilde z_{t,\bullet}'$ are worth a remark: for $p = kS$ the results of [20] can be used directly to obtain upper and lower bounds on the norms of these matrices uniformly in $k\in\mathbb{N}$. For $p \ne kS$ the additional rows in $\tilde z_{t,\bullet}$ add entries to $E\tilde y_{t,\bullet}\tilde z_{t,\bullet}'$ that are of order $O(\lambda^p)$ for some $0 < \lambda < 1$, as $y_t - y_{t-S}$ is a VARMA process. Similarly, the smallest eigenvalue of $E\tilde z_{t,\bullet}\tilde z_{t,\bullet}'$ can be bounded from below for $p = kS$ following [20], which in turn refers to Theorem 6.6.10 of [22]. The additional terms for $p\ne kS$ correspond to backward innovations with nonsingular covariance matrix, thus also leading to a lower bound on the smallest eigenvalue uniformly in k. (The backward innovations representation of a stationary VARMA process $(y_t)_{t\in\mathbb{Z}}$ equals $y_t = \sum_{j=1}^\infty k_j^b y_{t+j} + \varepsilon_t^b$ and can be obtained from the complex conjugate of the spectral density. Nonsingularity of the spectral density implies that the backward innovations $\varepsilon_t^b$ have a nonsingular covariance matrix. This implies a lower bound on the accuracy with which components of $y_{t-(k-1)S-j}$ can be predicted based on $y_{t-i}$, $i \le (k-1)S$.)
Furthermore, the strict minimum-phase assumption for the state space representation $(A, C, K)$ of the process $(y_t)_{t\in\mathbb{Z}}$ implies the strict minimum-phase assumption for the sub-sampled processes $(Z_{kS+j})_{k\in\mathbb{Z}}$. Thus the arguments of [20] show that $[\tilde b_{\bullet,p}, 0] \to \tilde b_{\bullet,\infty}$, where the norm of the difference is of order $O(\|\bar A^p\|)$. The increase of p as a function of the sample size, jointly with the strict minimum-phase assumption, implies that $O(\|\bar A^p\|) = o(T^{-1})$. This also implies that $\tilde{\mathcal{O}}_f\bar A^p x_{t-p} = o_P(T^{-1/2})$.
Correspondingly, there exists a limiting decomposition $\tilde b_{\bullet,\infty} = O_\bullet\Gamma_\bullet$ such that $\Gamma_\bullet S = I_{n-c}$, where S denotes a selector matrix whose columns contain vectors of the canonical basis. Since $[K, (A-KC)K, (A-KC)^2K, \ldots, (A-KC)^{n-1}K]$ is of full row rank, it can be assumed that S has nonzero entries only in its first ns rows. Denoting the submatrix of the first ps rows by $S_{p,2}$, we then also have $[\Gamma_\bullet]_{1:p}S_{p,2} = I_{n-c}$, where $[\,.\,]_{1:p}$ denotes the first p block columns of a matrix. This fixes a unique decomposition of $\tilde b_{\bullet,\infty}$, and hence $O_\bullet$ and $\Gamma_\bullet$ do not depend on p. Convergence of $\tilde b_{\bullet,p}$ to $\tilde b_{\bullet,\infty}$, jointly with the lower bound on p(T), then implies convergence of order $o(T^{-1})$ of $O_{\bullet,p}$ to $O_\bullet$ and of $\Gamma_{\bullet,p}$ to $[\Gamma_\bullet]_{1:p}$, using the decomposition of $\tilde b_{\bullet,p}$ such that $\Gamma_{\bullet,p}S_{p,2} = I_{n-c}$. Correspondingly, $O_p \to O$ and $\Gamma_p - [\Gamma]_{1:p} \to 0$.
Therefore the reduced rank regression in the CVA procedure has the same structure as that investigated in Lemma A4, with the differences that $\tilde z_{t,2}$ and $\tilde z_t^u$ do not occur and that $\tilde z_{t,\bullet}$ has increasing dimension as a function of the sample size. The next lemma therefore extends the results of BRRR to the RRR used in CVA:
In the following we will use a generic a N in statements like O ( ( log T ) a ) , not necessarily the same in each occurrence. In this sense e.g., the product of two terms that are O ( ( log T ) a ) is again taken to be O ( ( log T ) a ) .
Lemma A5.
Let the assumptions of Theorem 1 hold, where additionally $(\varepsilon_t)_{t\in\mathbb{Z}}$ is iid. Introduce the notation
$$\tilde D_z = \mathrm{diag}\big(T^{-1/2}I_c, I_{ps-c}\big), \qquad \tilde D_y = \mathrm{diag}\big(T^{-1/2}I_c, I_{fs-c}\big), \qquad \tilde D_x = \mathrm{diag}\big(T^{-1/2}I_c, I_{n-c}\big).$$
Let $\bar G_p$ denote a solution to
$$\big(\tilde D_z\langle\tilde z_{t,p},\tilde y_t\rangle\tilde D_y\big)\big(\tilde D_y\langle\tilde y_t,\tilde y_t\rangle\tilde D_y\big)^{-1}\big(\tilde D_y\langle\tilde y_t,\tilde z_{t,p}\rangle\tilde D_z\big)\bar G_p = \big(\tilde D_z\langle\tilde z_{t,p},\tilde z_{t,p}\rangle\tilde D_z\big)\bar G_p\bar R^2$$
using the notation of (A1), where $\bar R^2 \to \Theta^2 = \mathrm{diag}(I_c, \Theta_\bullet)\in\mathbb{R}^{n\times n}$ and where $\bar G_p$ is normalized such that $\bar G_{1,1,p} = I_c$, $\bar G_{\bullet,2,p}S_{p,2} = I_{n-c}$ for a selector matrix $S_{p,2}$. Further let
$$\bar\Gamma_p = \begin{bmatrix} I_c & 0\\ 0 & \bar\Gamma_{\bullet,p}\end{bmatrix}, \qquad \bar\Gamma_{\bullet,p}S_{p,2} = I_{n-c},$$
denote the solution to the decoupled problem, where the stationary and the nonstationary subproblems are separated:
$$\begin{aligned}
\langle\tilde z_{t,1},\tilde y_{t,1}\rangle\langle\tilde y_{t,1},\tilde y_{t,1}\rangle^{-1}\langle\tilde y_{t,1},\tilde z_{t,1}\rangle\,\bar\Gamma_{1,1,p} &= \langle\tilde z_{t,1},\tilde z_{t,1}\rangle\,\bar\Gamma_{1,1,p}\,\bar\Theta_1,\\
\langle\tilde z_{t,\bullet},\tilde y_{t,\bullet}\rangle\langle\tilde y_{t,\bullet},\tilde y_{t,\bullet}\rangle^{-1}\langle\tilde y_{t,\bullet},\tilde z_{t,\bullet}\rangle\,\bar\Gamma_{\bullet,p} &= \langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\,\bar\Gamma_{\bullet,p}\,\bar\Theta_\bullet.
\end{aligned}$$
(I) Then, if $f \ge n$ is fixed independent of T and $p \ge -d\log T/\log\rho_0$, $d > 1$, $p = o((\log T)^{\bar a})$ for some $\bar a < \infty$, the a.s. results of Lemma A.6 (I)-(III) and Lemma A.7 of [24] hold true with $(\log T)^3$ replaced by $(\log T)^a$ for some integer $a < \infty$. In particular, $\bar G_p - \bar\Gamma_p = O((\log T)^a/T^{1/2})$.
(II) Using the notation $\delta_{G,p} := \bar G_p - \bar\Gamma_p$ define
$$\tilde S_T := \begin{bmatrix} I_c & -\sqrt T\,\delta_{G,\bullet,1,p}'\langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\bar\Gamma_{\bullet,p}^{\dagger}\\ 0 & I_{n-c}\end{bmatrix}, \qquad \bar\Gamma_{\bullet,p}^{\dagger} := \bar\Gamma_{\bullet,p}\big(\bar\Gamma_{\bullet,p}'\langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\bar\Gamma_{\bullet,p}\big)^{-1}.$$
Then for $\tilde\Gamma_p := \tilde S_T\tilde D_x^{-1}\bar G_p'\tilde D_z$ and
$$\Gamma = \begin{bmatrix} I & 0\\ 0 & \Gamma_\bullet\end{bmatrix}$$
we obtain $\tilde\Gamma_p - [\Gamma]_{1:p} = [O((\log T)^a/T), O((\log T)^a/T^{1/2})]$, where the partitioning corresponds to the partitioning of $\tilde z_{t,p}$ into $\tilde z_{t,1}$ and $\tilde z_{t,\bullet}$. Here $\Gamma_\bullet$ denotes the right factor of $\tilde b_{\bullet,\infty} = O_\bullet\Gamma_\bullet$ such that $[\Gamma_\bullet]_{1:p}S_{p,2} = I_{n-c}$ holds.
(III) Let the assumptions of Theorem 1 hold. Then $\hat Z_T := T\,\mathrm{vec}\Big(\big(\tilde\Gamma_p - [\Gamma]_{1:p}\big)\begin{bmatrix} I_c\\ 0\end{bmatrix}\Big)$ converges in distribution.
Proof. 
(I) First consider the entries of the vectors $\tilde y_{t,\bullet}$ and $\tilde z_{t,\bullet}$ (see (A1)) more closely. Since in
$$O_{f,\bullet}'Y_{t,f}^+ = O_{f,\bullet}'\big(\mathcal{O}_f x_t + \mathcal{E}_f E_{t,f}^+\big)$$
the nonstationary directions are filtered out by definition, $\tilde y_{t,\bullet}$ is stationary and does not depend on T.
Further, $\tilde z_{t,\bullet}$ is also stationary for fixed p, as the nonstationary directions are either filtered by pre-multiplication with $O_{S,\bullet}'$ or removed by the yearly differencing $Z_t - Z_{t-S}$.
Therefore we obtain from stationary theory, for fixed $p = kS$, that
$$E\tilde y_{t,\bullet}\tilde z_{t,\bullet}'\big(E\tilde z_{t,\bullet}\tilde z_{t,\bullet}'\big)^{-1} - \langle\tilde y_{t,\bullet},\tilde z_{t,\bullet}\rangle\langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle^{-1} = o(1).$$
Here $\sup_p\|(E\tilde z_{t,\bullet}\tilde z_{t,\bullet}')^{-1}\| < \infty$ has been discussed before. Now $E\tilde y_{t,\bullet}\tilde z_{t,\bullet}'(E\tilde z_{t,\bullet}\tilde z_{t,\bullet}')^{-1} = \tilde\beta_{\bullet,p} + o(T^{-1/2}) = O_{\bullet,p}[\Gamma_\bullet]_{1:p} + o(T^{-1/2})$, where the $o(T^{-1/2})$ term appears due to neglecting $\tilde{\mathcal{O}}_f\bar A^p x_{t-p}$. It follows that $\det(\tilde\beta_{\bullet,p}S_{p,2})'(\tilde\beta_{\bullet,p}S_{p,2}) = \det[O_{\bullet,p}'O_{\bullet,p}] > 0$, and hence $\|\hat\beta_{\bullet,p} - \tilde\beta_{\bullet,p}\|_{Fr} = o(1)$ implies $\lim_{T\to\infty}\det(\hat\beta_{\bullet,p}S_{p,2})'(\hat\beta_{\bullet,p}S_{p,2}) > 0$ a.s., where $\hat\beta_{\bullet,p} := \langle\tilde y_{t,\bullet},\tilde z_{t,\bullet}\rangle\langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle^{-1}$. Since $\hat O_{\bullet,p}\bar\Gamma_{\bullet,p} - \tilde\beta_{\bullet,p} = o(1)$ due to consistency, also
$$\lim_{T\to\infty}\det\big(\hat O_{\bullet,p}\bar\Gamma_{\bullet,p}S_{p,2}\big)'\big(\hat O_{\bullet,p}\bar\Gamma_{\bullet,p}S_{p,2}\big) = \lim_{T\to\infty}\det\big[\hat O_{\bullet,p}'\hat O_{\bullet,p}\big]\det\big(\bar\Gamma_{\bullet,p}S_{p,2}\big)^2 > 0 \quad a.s.$$
Since $\bar\Gamma_{\bullet,p} - \Gamma_{\bullet,p} = o(1)$ due to the definition of $\bar\Gamma_{\bullet,p}$ and the continuity of the solution of the eigenvalue problem, it follows that $\hat O_{\bullet,p} - O_{\bullet,p} = o(1)$ and therefore $\limsup_{T\to\infty}\det\hat O_{\bullet,p}'\hat O_{\bullet,p} > 0$. As in Lemma 6 of [40] it can be shown that $\Gamma_{\bullet,p} - [\Gamma_\bullet]_{1:p} = o(T^{-1})$ and $O_{\bullet,p} = O_\bullet + o(T^{-1})$ for the range of p given in Theorem 1, since these matrices correspond to a stationary problem. Hence the chosen normalization of $\bar\Gamma_{\bullet,p}$ can be used a.s.
Next, in order to obtain the convergence of $\bar G_p$ to $\bar\Gamma_p$, Lemma A.6 of BRRR is slightly extended to the current situation (for details and notation see there). Lemma A.6 of BRRR contains three parts: BRRR(I) gives bounds on the errors in the matrices (with $l_T = \log T$)
$$\begin{aligned}
\delta_{yz} &= \begin{bmatrix} \tfrac1T\langle\tilde y_{t,1},\tilde z_{t,1}\rangle & \tfrac1{\sqrt T}\langle\tilde y_{t,1},\tilde z_{t,\bullet}\rangle\\ \tfrac1{\sqrt T}\langle\tilde y_{t,\bullet},\tilde z_{t,1}\rangle & \langle\tilde y_{t,\bullet},\tilde z_{t,\bullet}\rangle\end{bmatrix} - \begin{bmatrix} \tfrac1T\langle\tilde z_{t,1},\tilde z_{t,1}\rangle & 0\\ 0 & \langle\tilde y_{t,\bullet},\tilde z_{t,\bullet}\rangle\end{bmatrix} = \begin{bmatrix} O(\tfrac{l_T^a}{T}) & O(\tfrac{l_T^a}{T})\\ O(\tfrac{l_T^a}{T}) & 0\end{bmatrix},\\
\delta_{yy} &= \begin{bmatrix} \tfrac1T\langle\tilde y_{t,1},\tilde y_{t,1}\rangle & \tfrac1{\sqrt T}\langle\tilde y_{t,1},\tilde y_{t,\bullet}\rangle\\ \tfrac1{\sqrt T}\langle\tilde y_{t,\bullet},\tilde y_{t,1}\rangle & \langle\tilde y_{t,\bullet},\tilde y_{t,\bullet}\rangle\end{bmatrix} - \begin{bmatrix} \tfrac1T\langle\tilde z_{t,1},\tilde z_{t,1}\rangle & 0\\ 0 & \langle\tilde y_{t,\bullet},\tilde y_{t,\bullet}\rangle\end{bmatrix} = \begin{bmatrix} O(\tfrac{l_T^a}{T}) & O(\tfrac{l_T^a}{T})\\ O(\tfrac{l_T^a}{T}) & 0\end{bmatrix},\\
\delta_{zz} &= \begin{bmatrix} \tfrac1T\langle\tilde z_{t,1},\tilde z_{t,1}\rangle & \tfrac1{\sqrt T}\langle\tilde z_{t,1},\tilde z_{t,\bullet}\rangle\\ \tfrac1{\sqrt T}\langle\tilde z_{t,\bullet},\tilde z_{t,1}\rangle & \langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\end{bmatrix} - \begin{bmatrix} \tfrac1T\langle\tilde z_{t,1},\tilde z_{t,1}\rangle & 0\\ 0 & \langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\end{bmatrix} = \begin{bmatrix} 0 & O(\tfrac{l_T^a}{T})\\ O(\tfrac{l_T^a}{T}) & 0\end{bmatrix}.
\end{aligned}$$
BRRR(II) deals with $J = \bar Q - \bar\Phi =$
$$\tilde D_z\langle\tilde z_t,\tilde y_t\rangle\tilde D_y\big(\tilde D_y\langle\tilde y_t,\tilde y_t\rangle\tilde D_y\big)^{-1}\tilde D_y\langle\tilde y_t,\tilde z_t\rangle\tilde D_z - \begin{bmatrix} \tfrac1T\langle\tilde z_{t,1},\tilde z_{t,1}\rangle & 0\\ 0 & \langle\tilde z_{t,\bullet},\tilde y_{t,\bullet}\rangle\langle\tilde y_{t,\bullet},\tilde y_{t,\bullet}\rangle^{-1}\langle\tilde y_{t,\bullet},\tilde z_{t,\bullet}\rangle\end{bmatrix}$$
and BRRR(III) shows that there exists a solution G ¯ p converging to a solution Γ ¯ p of the separated problem.
For showing the orders of convergence of $\delta_{zz}$ the arguments are unchanged, except for noting that in $\langle\tilde z_{t,1},\tilde z_{t,\bullet}\rangle$ the number of columns increases as a function of the sample size. Since the a.s. bounds on the entries of this expression hold uniformly (as follows straightforwardly from the arguments of Lemma A.1 of BRRR), this does not change the arguments. With respect to $\delta_{yz}$, note that now $\tilde y_t = \tilde\beta_1\tilde z_{t,p} + \tilde\varepsilon_t + \tilde{\mathcal{O}}_f\bar A^p x_{t-p}$. Due to the increase of p as a function of the sample size, $\|\bar A^p\| = o(T^{-1-\epsilon})$ for small enough $\epsilon > 0$, and therefore $\tilde{\mathcal{O}}_f\bar A^p x_{t-p} = o(T^{-1/2-\epsilon/2})$, since $\|x_t\| = o(T^{(1+\epsilon)/2})$ (uniformly in $1 \le t \le T$), whether $(x_t)_{t\in\mathbb{Z}}$ is a unit root process or stationary. Hence $\langle\tilde{\mathcal{O}}_f\bar A^p x_{t-p}, \tilde{\mathcal{O}}_f\bar A^p x_{t-p}\rangle = o(1)$. Further, $\langle\tilde{\mathcal{O}}_f\bar A^p x_{t-p}, \tilde\varepsilon_t\rangle = o(T^{-1/2})$ follows from $\langle x_{t-p}, \tilde\varepsilon_t\rangle = O(\log T)$ (see Lemma A.1 (I)). This shows that the additional term is always of lower order and can be neglected. The remaining arguments follow exactly as in the proof of Lemma A.6 of BRRR. The proof of Lemma A.7 of BRRR only uses the order bounds derived above and hence follows immediately. This shows (I).
(II) Using the definition of S ˜ T we obtain:
$$\tilde\Gamma_p = \tilde S_T\tilde D_x^{-1}\bar G_p'\tilde D_z = \tilde S_T\begin{bmatrix} I_c & \sqrt T\,\delta_{G,\bullet,1,p}'\\ \delta_{G,1,2,p}'/\sqrt T & \bar G_{\bullet,2,p}'\end{bmatrix} = \begin{bmatrix} I_c - \delta_{G,\bullet,1,p}'\langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\bar\Gamma_{\bullet,p}^{\dagger}\delta_{G,1,2,p}' & \sqrt T\,\delta_{G,\bullet,1,p}'\big(I - \langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\bar\Gamma_{\bullet,p}^{\dagger}\bar G_{\bullet,2,p}'\big)\\ \delta_{G,1,2,p}'/\sqrt T & \bar G_{\bullet,2,p}'\end{bmatrix}.$$
From (I) and Lemma A.7 of BRRR, $\delta_{G,\bullet,1,p} = O((\log T)^a/T^{1/2})$, $\delta_{G,1,2,p} = O((\log T)^a/T^{1/2})$ and $\bar G_{\bullet,2,p} - \bar\Gamma_{\bullet,p} = o((\log T)^a/T^{1/2})$. Finally,
$$\delta_{G,\bullet,1,p}'\big(I - \langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\bar\Gamma_{\bullet,p}^{\dagger}\bar G_{\bullet,2,p}'\big) = \delta_{G,\bullet,1,p}'\big(I - \langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\bar\Gamma_{\bullet,p}^{\dagger}\bar\Gamma_{\bullet,p}'\big) + O((\log T)^a/T) = O((\log T)^a/T)$$
as in the proof of Lemma A.7 of BRRR. Using Lemma A.5 (III) of BRRR with $\hat\Xi_f = \langle\tilde y_{t,\bullet},\tilde y_{t,\bullet}\rangle^{1/2}$, it follows that $\bar\Gamma_{\bullet,p} - \Gamma_{\bullet,p} = O((\log T)^a T^{-1/2})$. Since $\bar G_{\bullet,2,p} - \bar\Gamma_{\bullet,p} = o((\log T)^a/T^{1/2})$, the same rate of convergence holds for $\bar G_{\bullet,2,p} - \Gamma_{\bullet,p} = O((\log T)^a/T^{1/2})$. It follows that $\tilde\Gamma_p - [\Gamma]_{1:p} = [O((\log T)^a/T), O((\log T)^a/T^{1/2})]$.
(III) From the above we have
$$T\big(\tilde\Gamma_p - [\Gamma]_{1:p}\big)\begin{bmatrix} I_c\\ 0\end{bmatrix} = \begin{bmatrix} -\big(\sqrt T\,\delta_{G,\bullet,1,p}'\langle\tilde z_{t,\bullet},\tilde z_{t,\bullet}\rangle\bar\Gamma_{\bullet,p}^{\dagger}\big)\big(\sqrt T\,\delta_{G,1,2,p}'\big)\\ \sqrt T\,\delta_{G,1,2,p}'\end{bmatrix} + o_P(1).$$
Now from the proof of Lemma A.7 of BRRR we obtain
$$\sqrt T\,\delta_{G,\bullet,1,p}' = \Xi\,O_\bullet\big(I - \Theta_\bullet^2\big)^{-1}\bar\Gamma_{\bullet,p}' + o_P(1).$$
Furthermore, using the expression given in Lemma A.7 of BRRR:
$$\begin{aligned}
\sqrt T\,\delta_{G,1,2,p} &= \sqrt T\,Z_{11}^{-1}\big[\delta_{zz,1\bullet}\Gamma_{\bullet,p}\Theta_\bullet^2 - J_{1\bullet}\Gamma_{\bullet,p}\big]\big(I - \Theta_\bullet^2\big)^{-1} + o_P(1)\\
&= \sqrt T\,Z_{11}^{-1}\big[\delta_{zz,1\bullet}\Gamma_{\bullet,p}\big(\Theta_\bullet^2 - I\big) + \big[\delta_{zz,1\bullet} - J_{1\bullet}\big]\Gamma_{\bullet,p}\big]\big(I - \Theta_\bullet^2\big)^{-1} + o_P(1)\\
&= -Z_{11}^{-1}\langle\tilde z_{t,1}, x_{t,\bullet}\rangle - Z_{11}^{-1}\sqrt T\,\big[J_{1\bullet} - \delta_{zz,1\bullet}\big]\Gamma_{\bullet,p}\big(I - \Theta_\bullet^2\big)^{-1} + o_P(1)\\
&= -Z_{11}^{-1}\langle\tilde z_{t,1}, x_{t,\bullet}\rangle - Z_{11}^{-1}\,E\tilde\varepsilon_{t,1}\tilde\varepsilon_{t,\bullet}'\big(E\tilde y_{t,\bullet}\tilde y_{t,\bullet}'\big)^{-1}E\tilde y_{t,\bullet}x_{t,\bullet}'\big(I - \Theta_\bullet^2\big)^{-1} + o_P(1).
\end{aligned}$$
This shows the result.    □
The transformations in the representation lead to an estimator $\bar G_p$ taking the place of $\hat{\mathcal{K}}_p$. Using $\tilde S_T$ as defined in Lemma A5, the corresponding estimator $\tilde\Gamma_p = \tilde S_T\tilde D_x^{-1}\bar G_p'\tilde D_z$ fulfills $\tilde\Gamma_p - \Gamma_p = [O((\log T)^a/T), O((\log T)^a/T^{1/2})]$.
Based on this result, let $(A, C, K)$ denote the realization of the true transfer function in the state basis corresponding to $\Gamma_p$, where $\Gamma_p S_p = I_n$, and let $(\tilde A, \tilde C, \tilde K)$ denote the (infeasible) CVA estimates using $\tilde x_t := \tilde\Gamma_p\tilde z_{t,p}$. The next lemma then provides the main ingredients for the rest of the proofs:
Lemma A6.
Let the assumptions of Theorem 1 hold and define $D_x := \mathrm{diag}\big(I_c T^{-1}, I_{n-c}T^{-1/2}\big)$. Then there exists an integer $a < \infty$ such that
$$(\tilde A - A)D_x^{-1} = O((\log T)^a), \qquad (\tilde C - C)D_x^{-1} = O((\log T)^a), \qquad \tilde K - K = O((\log T)^a/T^{1/2}).$$
Proof. 
First note that the regression of $Y_{t,f}^+$ onto $Y_{t,p}^-$ includes time points $t = p+1,\ldots,T-f+1$, whereas for estimating the system matrices we can use $\hat x_t$, $t = p+1,\ldots,T+1$, and $y_t$, $t = p+1,\ldots,T$. Thus in this proof we use $\langle a_t, b_t\rangle_{p+1}^T := T^{-1}\sum_{t=p+1}^T a_t b_t'$ instead of $\langle a_t, b_t\rangle = T^{-1}\sum_{t=p+1}^{T-f+1} a_t b_t'$.
The following orders of convergence are straightforward to derive using the results of Lemma A1, $\|\bar A^p\| = o(T^{-1})$, $(\tilde\Gamma_p - [\Gamma]_{1:p})D_z^{-1} = O((\log T)^a)$ and $\tilde x_t - x_t = (\tilde\Gamma_p - [\Gamma]_{1:p})\tilde z_{t,p} - \bar A^p x_{t-p}$, $t > p$, according to Lemma A5 and Lemma A1, for the range of p given in Theorem 1:
$$\begin{gathered}
\langle\varepsilon_t, \tilde x_t - x_t\rangle_{p+1}^T = O(p(\log T)^a/T), \qquad \langle\tilde D_z\tilde z_{t,p}, \tilde x_t - x_t\rangle_{p+1}^T = O(p(\log T)^a/T^{1/2}),\\
\langle\tilde D_z\tilde z_{t+1,p}, \tilde x_t - x_t\rangle_{p+1}^T = O(p(\log T)^a/T^{1/2}), \qquad \langle\tilde D_x x_t, \tilde x_t - x_t\rangle_{p+1}^T = O(p(\log T)^a/T^{1/2}),\\
\langle\tilde x_t - x_t, \tilde x_t - x_t\rangle_{p+1}^T = O(p^2(\log T)^a/T).
\end{gathered}$$
Using these orders of convergence we obtain
$$\tilde D_x\langle\tilde x_t,\tilde x_t\rangle_{p+1}^T\tilde D_x = \tilde D_x\langle x_t, x_t\rangle_{p+1}^T\tilde D_x + O\big(p^2(\log T)^a/T^{1/2}\big) > 0 \quad a.s.$$
From Lemma A1 also $\big(\tilde D_x\langle\tilde x_t,\tilde x_t\rangle_{p+1}^T\tilde D_x\big)^{-1} = \big(\tilde D_x\langle x_t,x_t\rangle_{p+1}^T\tilde D_x\big)^{-1}(1 + o(1)) = O((\log T)^a)$.
Therefore
$$\begin{aligned}
(\tilde C - C)D_x^{-1} &= \sqrt T\,\big\langle\varepsilon_t - C(\tilde x_t - x_t), \tilde x_t\big\rangle_{p+1}^T\tilde D_x\big(\tilde D_x\langle\tilde x_t,\tilde x_t\rangle_{p+1}^T\tilde D_x\big)^{-1}\\
&= \sqrt T\,\langle\varepsilon_t, x_t\rangle_{p+1}^T\tilde D_x\big(\tilde D_x\langle x_t,x_t\rangle_{p+1}^T\tilde D_x + o(1)\big)^{-1} - \sqrt T\,C\langle\tilde x_t - x_t, x_t\rangle_{p+1}^T\tilde D_x\big(\tilde D_x\langle x_t,x_t\rangle_{p+1}^T\tilde D_x\big)^{-1} + o(1)\\
&= O(p(\log T)^a).
\end{aligned}$$
This in particular establishes consistency of the estimate. Next, analogously (using the notation $\delta_{x,t} = \tilde x_t - x_t$), we obtain $(\tilde A - A)D_x^{-1} =$
$$\begin{aligned}
&\sqrt T\,\langle\tilde x_{t+1} - A\tilde x_t, \tilde x_t\rangle_{p+1}^T\tilde D_x\big(\tilde D_x\langle\tilde x_t,\tilde x_t\rangle_{p+1}^T\tilde D_x\big)^{-1}\\
&\quad = \sqrt T\,\big\langle(\tilde x_{t+1} - x_{t+1}) + (x_{t+1} - Ax_t) + A(x_t - \tilde x_t), \tilde x_t\big\rangle_{p+1}^T\tilde D_x\big(\tilde D_x\langle x_t,x_t\rangle_{p+1}^T\tilde D_x + o(1)\big)^{-1}\\
&\quad = \sqrt T\,\big[\langle\delta_{x,t+1}, x_t\rangle_{p+1}^T - A\langle\delta_{x,t}, x_t\rangle_{p+1}^T + \langle K\varepsilon_t, x_t\rangle_{p+1}^T\big]\tilde D_x\big(\tilde D_x\langle x_t,x_t\rangle_{p+1}^T\tilde D_x\big)^{-1} + o(1) = O(p(\log T)^a)
\end{aligned}$$
and therefore consistency of $\tilde A$ is established. Finally note that for
$$\hat\varepsilon_t = y_t - \tilde C\tilde x_t = \varepsilon_t + C(x_t - \tilde x_t) + (C - \tilde C)\tilde x_t$$
it follows that $\langle\hat\varepsilon_t,\hat\varepsilon_t\rangle_{p+1}^T = \Omega + O(p^2(\log T)^a/T^{1/2})$. Furthermore, since $\hat\varepsilon_t$ denotes the residuals of the regression of $y_t$ onto $\tilde x_t$, it follows that $\langle\hat\varepsilon_t, \tilde x_t\rangle_{p+1}^T = 0$. From this we obtain
$$\begin{aligned}
\sqrt T(\tilde K - K) &= \sqrt T\,\big\langle\tilde x_{t+1} - K\hat\varepsilon_t, \hat\varepsilon_t\big\rangle_{p+1}^T\big(\langle\hat\varepsilon_t,\hat\varepsilon_t\rangle_{p+1}^T\big)^{-1}\\
&= \sqrt T\,\big\langle(\tilde x_{t+1} - x_{t+1}) - A\delta_{x,t} + K(\varepsilon_t - \hat\varepsilon_t), \hat\varepsilon_t\big\rangle_{p+1}^T\big(\langle\hat\varepsilon_t,\hat\varepsilon_t\rangle_{p+1}^T\big)^{-1}\\
&= \sqrt T\,\big\langle\delta_{x,t+1} - A\delta_{x,t} + K(\varepsilon_t - \hat\varepsilon_t), \hat\varepsilon_t\big\rangle_{p+1}^T\Omega^{-1}(1 + o(1))\\
&= \sqrt T\,\big\langle\delta_{x,t+1} - A\delta_{x,t} + K(\varepsilon_t - \hat\varepsilon_t), \varepsilon_t\big\rangle_{p+1}^T\Omega^{-1}(1 + o(1)) + o(1)\\
&= \sqrt T\,\langle\delta_{x,t+1}, \varepsilon_t\rangle_{p+1}^T\Omega^{-1} + o(1) = \sqrt T\,\big\langle(\tilde\Gamma_p - \Gamma_p)\tilde z_{t+1,p}, \varepsilon_t\big\rangle_{p+1}^T\Omega^{-1} + o(1) = O(p(\log T)^a).
\end{aligned}$$
   □
These expressions not only show consistency of a specific order, but also give the relevant highest order terms for the asymptotic distribution, which are used below.
As $\hat C\hat A^j\hat K = \tilde C\tilde A^j\tilde K \to CA^jK$, Lemma A6 establishes consistency of the impulse response sequence estimators $\hat C\hat A^j\hat K$ (thus proving Theorem 1 (I)) as well as, jointly with $p = O((\log T)^a)$, the rate of convergence $O((\log T)^a/T^{1/2})$ for the infeasible choice of the basis and the impulse response sequence $CA^jK$.

Appendix C.1.2. Proof of Theorem 1 (II)

In order to arrive at the canonical representation $(\check A, \check C, \check K)$ two steps are performed: first the reordered Jordan normal form is calculated, afterwards the matrices $\tilde C_{j,C}$ are transformed such that $E_j'\check C_{j,C} = I_{c_j}$ holds. We will follow these steps below.
In the first step a transformation matrix $\hat U$ needs to be found such that $\hat U\tilde A\hat U^{-1}$ is in reordered Jordan normal form. In this respect $\tilde A$ and $A$ are used in Lemma A2. Accordingly, $\hat U_t = [\hat U_{t,1},\ldots,\hat U_{t,S/2},\hat U_{t,\bullet}]$ can be defined such that $V_j^*\hat U_{t,j} = I_{c_j}$, where $U\in\mathbb{R}^{n\times n}$ corresponds to the transformation from $A$ to $A$ as given in the theorem. An appropriate choice of $\tilde z_{t,1}$ leads to $U = I_n$. Furthermore, the basis in the space spanned by the columns of $\hat U_{t,\bullet}$, where $\hat U_{t,j}'\hat U_{t,\bullet} = 0$, can be chosen such that $[0, I]\hat U_{t,\bullet} = I$ for large enough T.
A first order approximation according to Lemma A2 then leads to
$$\hat U_{t,j} = U_j + Z_1 + O\big(\|\hat A - A\|^2\big) = U_j - \Sigma(\hat A - A)U_j + O\big(\|\hat A - A\|^2\big)$$
for $j = 0,\ldots,S/2$. Consequently $\hat U_{t,j} - U_j = O((\log T)^a T^{-1})$ and thus also $\hat U_t - U = O((\log T)^a T^{-1})$. Consequently the order of convergence of the transformed system $(\hat A, \hat C, \hat K)$ is unchanged. In a second step an upper triangular transformation matrix $\tilde U$ can be found transforming $(\hat A, \hat C, \hat K)$ such that the transformed transition matrix corresponds to the reordered Jordan normal form. Due to the upper block triangularity of this transformation we can apply Lemma A3 to show that the order of convergence remains identical.
For the second step note that Lemma A3 provides the required terms: an application to the block diagonal transformation $S_T = \mathrm{diag}\big(E_1'\tilde C_{1,C}, \ldots, E_{S/2}'\tilde C_{S/2,C}, S_{T,\bullet}\big)$, where $S_{T,\bullet}$ transforms the stationary subsystem to echelon form, concludes the proof.

Appendix C.1.3. Proof of Theorem 1 (III)

The only argument that uses the iid assumption is the almost sure convergence of $\big(\tilde D_x\langle x_t,x_t\rangle\tilde D_x\big)^{-1}$. Weakening the assumptions on the noise implies that this order of convergence still holds in probability, while the almost sure version cannot be shown with the tools of this paper. This concludes the proof of Theorem 1.

Appendix C.2. Proof of Theorem 2

Using the notation introduced in (A1),

$$\hat{X} = \tilde{D}_y\langle\tilde{y}_t, \tilde{z}_{t,p}\rangle\tilde{D}_z\big(\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{z}_{t,p}\rangle\tilde{D}_z\big)^{-1}\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{y}_t\rangle\tilde{D}_y\big(\tilde{D}_y\langle\tilde{y}_t, \tilde{y}_t\rangle\tilde{D}_y\big)^{-1} \to X_0 = \begin{bmatrix} I_c & 0 \\ 0 & X_{0,\bullet} \end{bmatrix}$$

for a suitable matrix $X_{0,\bullet}$. The eigenvalues of $\hat{X}$ are the squares of the singular values of the RRR problem in the first step of CVA. Therefore

$$T\sum_{i=1}^{c}(1 - \hat{\sigma}_i^2) = -T\,\mathrm{tr}\left\{U_c'(\hat{X} - X_0)\left[U_c - (X_0 - I)^{\dagger}(\hat{X} - X_0)U_c\right]\right\} + o_P(1) = -T\,\mathrm{tr}\left\{\Delta X_{11} - \Delta X_{12}(X_{0,\bullet} - I)^{-1}\Delta X_{21}\right\} + o_P(1)$$
according to a second order approximation in the Rayleigh–Schrödinger expansions (Lemma A2). Now, in the current situation we obtain

$$\begin{aligned}
(I - \hat{X})\begin{bmatrix} I \\ 0 \end{bmatrix} &= \left[\tilde{D}_y\langle\tilde{y}_t, \tilde{y}_t\rangle\tilde{D}_y - \tilde{D}_y\langle\tilde{y}_t, \tilde{z}_{t,p}\rangle\tilde{D}_z\big(\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{z}_{t,p}\rangle\tilde{D}_z\big)^{-1}\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{y}_t\rangle\tilde{D}_y\right]\big(\tilde{D}_y\langle\tilde{y}_t, \tilde{y}_t\rangle\tilde{D}_y\big)^{-1}\begin{bmatrix} I \\ 0 \end{bmatrix} \\
&= \left[\tilde{D}_y\langle\tilde{\varepsilon}_t, \tilde{\varepsilon}_t\rangle\tilde{D}_y - \tilde{D}_y\langle\tilde{\varepsilon}_t, \tilde{z}_{t,p}\rangle\tilde{D}_z\big(\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{z}_{t,p}\rangle\tilde{D}_z\big)^{-1}\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{\varepsilon}_t\rangle\tilde{D}_y\right]\big(\tilde{D}_y\langle\tilde{y}_t, \tilde{y}_t\rangle\tilde{D}_y\big)^{-1}\begin{bmatrix} I \\ 0 \end{bmatrix}.
\end{aligned}$$

Furthermore $\langle\tilde{\varepsilon}_t, \tilde{z}_{t,p}\rangle\tilde{D}_z\big(\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{z}_{t,p}\rangle\tilde{D}_z\big)^{-1}\tilde{D}_z\langle\tilde{z}_{t,p}, \tilde{\varepsilon}_t\rangle = O_P(T^{-1})$ and

$$\big(\tilde{D}_y\langle\tilde{y}_t, \tilde{y}_t\rangle\tilde{D}_y\big)^{-1}\begin{bmatrix} I \\ 0 \end{bmatrix} = \begin{bmatrix} I \\ -\langle\tilde{y}_{t,\bullet}, \tilde{y}_{t,\bullet}\rangle^{-1}\langle\tilde{y}_{t,\bullet}, \tilde{y}_{t,1}\rangle/\sqrt{T} \end{bmatrix}\big(\langle\tilde{y}_{t,1}^{\pi}, \tilde{y}_{t,1}^{\pi}\rangle/T\big)^{-1}$$

where $\tilde{y}_{t,1}^{\pi} = \tilde{y}_{t,1} - \langle\tilde{y}_{t,1}, \tilde{y}_{t,\bullet}\rangle\langle\tilde{y}_{t,\bullet}, \tilde{y}_{t,\bullet}\rangle^{-1}\tilde{y}_{t,\bullet}$.
From this we obtain, using $E\tilde{\varepsilon}_{t,\bullet}\tilde{\varepsilon}_{t,\bullet}' = E\tilde{y}_{t,\bullet}\tilde{y}_{t,\bullet}' - X_{0,\bullet}E\tilde{y}_{t,\bullet}\tilde{y}_{t,\bullet}'$:

$$\begin{aligned}
T(I_c - \hat{X}_{1,1}) &= \left[\langle\tilde{\varepsilon}_{t,1}, \tilde{\varepsilon}_{t,1}\rangle - \langle\tilde{\varepsilon}_{t,1}, \tilde{\varepsilon}_{t,\bullet}\rangle\langle\tilde{y}_{t,\bullet}, \tilde{y}_{t,\bullet}\rangle^{-1}\langle\tilde{y}_{t,\bullet}, \tilde{y}_{t,1}\rangle\right]\big(\langle\tilde{y}_{t,1}^{\pi}, \tilde{y}_{t,1}^{\pi}\rangle/T\big)^{-1} + o_P(1), \\
T\,\Delta X_{2,1} &= \left[\langle\tilde{\varepsilon}_{t,\bullet}, \tilde{\varepsilon}_{t,1}\rangle + (I - X_{0,\bullet})\langle\tilde{y}_{t,\bullet}, \tilde{y}_{t,1}\rangle\right]\big(\langle\tilde{y}_{t,1}^{\pi}, \tilde{y}_{t,1}^{\pi}\rangle/T\big)^{-1} + o_P(1), \\
T\,\Delta X_{1,2} &= E\tilde{\varepsilon}_{t,1}\tilde{\varepsilon}_{t,\bullet}'\big(E\tilde{y}_{t,\bullet}\tilde{y}_{t,\bullet}'\big)^{-1} + o_P(1).
\end{aligned}$$

Thus

$$\begin{aligned}
T\sum_{i=1}^{c}(1 - \hat{\sigma}_i^2) &= \mathrm{tr}\left\{\left[\langle\tilde{\varepsilon}_{t,1}, \tilde{\varepsilon}_{t,1}\rangle - E\tilde{\varepsilon}_{t,1}\tilde{\varepsilon}_{t,\bullet}'\big(E\tilde{y}_{t,\bullet}\tilde{y}_{t,\bullet}'\big)^{-1}(I - X_{0,\bullet})^{-1}E\tilde{\varepsilon}_{t,\bullet}\tilde{\varepsilon}_{t,1}'\right]\big(\langle\tilde{y}_{t,1}, \tilde{y}_{t,1}\rangle/T\big)^{-1}\right\} + o_P(1) \\
&= \mathrm{tr}\left\{\left[\langle\tilde{\varepsilon}_{t,1}, \tilde{\varepsilon}_{t,1}\rangle - E\tilde{\varepsilon}_{t,1}\tilde{\varepsilon}_{t,\bullet}'\big(E\tilde{\varepsilon}_{t,\bullet}\tilde{\varepsilon}_{t,\bullet}'\big)^{-1}E\tilde{\varepsilon}_{t,\bullet}\tilde{\varepsilon}_{t,1}'\right]\big(\langle\tilde{y}_{t,1}, \tilde{y}_{t,1}\rangle/T\big)^{-1}\right\} + o_P(1) \xrightarrow{d} Z.
\end{aligned}$$
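For illustration, a minimal numpy sketch (our notation, not the paper's implementation) of the statistic just analysed: the singular values $\hat{\sigma}_i$ of the whitened cross-covariance between the future and the stacked past deliver the canonical correlations of the first CVA step, and the statistic is $T\sum_{i=1}^{c}(1-\hat{\sigma}_i^2)$. The whitening via Cholesky factors is one possible implementation choice.

```python
# Sketch: statistic T * sum_{i=1}^{c} (1 - sigma_hat_i^2) from the canonical
# correlations of the RRR (first) step of CVA.
import numpy as np

def cva_trace_statistic(Yf, Zp, c):
    # Yf: T x s 'future' block, Zp: T x (p*s) stacked past, c: tested rank
    T = Yf.shape[0]
    Sff = Yf.T @ Yf / T
    Spp = Zp.T @ Zp / T
    Sfp = Yf.T @ Zp / T
    Lf = np.linalg.cholesky(Sff)
    Lp = np.linalg.cholesky(Spp)
    # singular values of Lf^{-1} Sfp Lp^{-T} are the canonical correlations
    M = np.linalg.solve(Lf, np.linalg.solve(Lp, Sfp.T).T)
    sigma = np.linalg.svd(M, compute_uv=False)
    return T * np.sum(1.0 - sigma[:c] ** 2)
```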

Appendix C.3. Proof of Theorem 3

The proof of Theorem 3 follows the same path as the proof of Theorem 1. In (A5) it was shown that the asymptotic distribution of $T(\tilde{A}_{11} - A_{\bullet,11})$ depends on

$$\langle\tilde{x}_{t+1,j} - x_{t+1,j}, x_{t,k}\rangle,\qquad \langle\tilde{x}_{t,j} - x_{t,j}, x_{t,k}\rangle,\qquad \langle\varepsilon_t, x_{t,j}\rangle,\qquad \langle x_{t,k}, x_{t,j}\rangle/T$$

for $j, k = 0, \ldots, S/2$. Note that

$$\delta x_{t+i} = \tilde{x}_{t+i} - x_{t+i} = (\tilde{\Gamma}_p - [\Gamma]_{1:p})\tilde{z}_{t+i,p} + o_P(T^{-1})$$

for $i = 0, 1$. Then the results of Lemma A5 show that the first $c$ columns of $(\tilde{\Gamma}_p - [\Gamma]_{1:p})$ converge to a random variable (below denoted as $Z_\Gamma$) when multiplied with $T$, while the remaining columns converge in distribution when multiplied with $\sqrt{T}$. Therefore

$$\langle\delta x_{t+i}, x_{t,k}\rangle = T(\tilde{\Gamma}_p - [\Gamma]_{1:p})\frac{\langle\tilde{z}_{t+i,p}, x_{t,k}\rangle}{T} + o_P(1) = T(\tilde{\Gamma}_p - [\Gamma]_{1:p})\begin{bmatrix} I_c \\ 0 \end{bmatrix}\frac{\langle\tilde{z}_{t+i,1}, x_{t,k}\rangle}{T} + o_P(1).$$

Due to the definition (A1), $\tilde{z}_{t,1} = [x_{t,j}]_{j=0,\ldots,S/2} + o(T^{-1})$ and hence (using $A_{\bullet} = \mathrm{diag}(A_{\bullet,u}, A_{\bullet,\bullet})$)

$$\langle\tilde{z}_{t+1,1}, x_{t,k}\rangle/T = A_{\bullet,u}\langle\tilde{z}_{t,1}, x_{t,k}\rangle/T + o(1).$$
Considering now the complex-valued representation and using the notation

$$\Delta\Gamma_1 := T(\tilde{\Gamma}_p - [\Gamma]_{1:p})\begin{bmatrix} I_c \\ 0 \end{bmatrix},\qquad S_j = \big[\, 0_{c_j \times \sum_{i<j} c_i},\; I_{c_j},\; 0_{c_j \times \sum_{i>j} c_i} \,\big],$$

where $S_j\tilde{z}_{t,1} = x_{t,j,C}$, it follows that the contribution of these two terms to the limiting distribution of the diagonal block corresponding to the unit root $z_j$ amounts to (using $\langle x_{t,j,C}, x_{t,k,C}\rangle/T \to 0$ for $k \neq j$ and $\delta x_{t,j,C} = \tilde{x}_{t,j,C} - x_{t,j,C}$)

$$\begin{aligned}
\langle\delta x_{t+1,j,C}, x_{t,j,C}\rangle - A_{\bullet,jj}\langle\delta x_{t,j,C}, x_{t,j,C}\rangle &= S_j\Delta\Gamma_1 A_{\bullet,u}\frac{\langle\tilde{z}_{t,1}, x_{t,j,C}\rangle}{T} - A_{\bullet,jj}S_j\Delta\Gamma_1\frac{\langle\tilde{z}_{t,1}, x_{t,j,C}\rangle}{T} + o_P(1) \\
&= S_j\Delta\Gamma_1 S_j'A_{\bullet,jj}\frac{\langle x_{t,j,C}, x_{t,j,C}\rangle}{T} - A_{\bullet,jj}S_j\Delta\Gamma_1 S_j'\frac{\langle x_{t,j,C}, x_{t,j,C}\rangle}{T} + o_P(1) \\
&= S_j\Delta\Gamma_1 S_j'\,\bar{z}_j\frac{\langle x_{t,j,C}, x_{t,j,C}\rangle}{T} - \bar{z}_j S_j\Delta\Gamma_1 S_j'\frac{\langle x_{t,j,C}, x_{t,j,C}\rangle}{T} + o_P(1) = o_P(1).
\end{aligned}$$
Therefore these two terms do not contribute to the diagonal blocks in (A5), and the asymptotic distribution is determined by

$$T\,K_{\bullet,j}\langle\varepsilon_t, x_{t,j}\rangle\langle x_{t,j}, x_{t,j}\rangle^{-1},$$

for which the asymptotic results are provided in Lemma A1. This also shows that estimating the state does not change the asymptotic distribution in the diagonal blocks, as the impact of $\tilde{\Gamma}_p - \Gamma_p$ is of lower order.
In order to derive the distribution of the sum of the eigenvalues note that, as in the proof of Theorem 2, according to Lemma A2 the sum of the eigenvalues of $\tilde{A}$ converging to $z_j$ obeys the following second order approximation:

$$T\sum_{i=1}^{c_j}(\hat{\lambda}_i - z_j) = T\,\mathrm{tr}\left\{U_j'(\hat{A} - A)\left[U_j - A(z_j)(\hat{A} - A)U_j\right]\right\} + o_P(1) = T\,\mathrm{tr}\left\{\hat{A}_{\bullet,jj} - z_jI_{c_j}\right\} + o_P(1),$$

since $(\hat{A} - A)U_j = O((\log T)^a T^{-1})$ in this case, implying that the second order terms vanish. Thus we obtain the asymptotic distribution under the null hypothesis as the limiting distribution of

$$T\,\mathrm{tr}\big[K_{\bullet,j,C}\langle\varepsilon_t, x_{t,j,C}\rangle\langle x_{t,j,C}, x_{t,j,C}\rangle^{-1}\big].$$

It is easy to verify that this test statistic is pivotal for complex and real unit roots. This proves Theorem 3.
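In practice the statistic can be computed directly from the eigenvalues of the estimated transition matrix. A sketch under our simplifying assumption that the $c_j$ eigenvalues of $\hat{A}$ closest to $z_j$ are the ones converging to the unit root:

```python
# Sketch: statistic T * sum_{i=1}^{c_j} (lambda_hat_i - z_j) at the seasonal
# unit root z_j (e.g. z_j = 1j for quarterly frequency pi/2).
import numpy as np

def seasonal_unit_root_statistic(A_hat, z_j, c_j, T):
    lam = np.linalg.eigvals(A_hat)
    closest = lam[np.argsort(np.abs(lam - z_j))[:c_j]]
    return T * np.sum(closest - z_j)   # complex-valued for complex z_j
```

For $z_j = \pm 1$ the value is real; for complex unit roots the pivotal version discussed above refers to the suitably transformed complex-valued coordinates.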

Appendix C.4. Proof of Theorem 4

The result for $\tilde{C}_m$ can be shown using the results of [4]. As the eigenvalues are insensitive to changes of the basis, we can assume without restriction of generality that the only unit root components in $\mathcal{T}X_t^{(m)}$ (with $\mathcal{T}$ denoting the corresponding basis transformation) are contained in the first $c_m$ rows:

$$c_t^{(m)} := \mathcal{T}X_t^{(m)} = \begin{bmatrix} c_{t,u}^{(m)} \\ c_{t,\bullet}^{(m)} \end{bmatrix},\qquad \tilde{D}_c = \begin{bmatrix} T^{-1}I_{c_m} & 0 \\ 0 & I_{n-c_m} \end{bmatrix}.$$

Due to the filtering, $c_{t,\bullet}^{(m)}$ is stationary while $c_{t,u}^{(m)}$ contains the unit root $z_m$. Then the relevant matrix $\hat{X}_m$ can be written as

$$\hat{X}_m := \langle c_{t-1}^{\pi}, p_t^{\pi}\rangle\langle p_t^{\pi}, p_t^{\pi}\rangle^{-1}\langle p_t^{\pi}, c_{t-1}^{\pi}\rangle\langle c_{t-1}^{\pi}, c_{t-1}^{\pi}\rangle^{-1}.$$

Since $p_t = K\varepsilon_{t-1} + \sum_{j=1, j\neq m}^{S}\alpha_j\beta_j'X_{t-1}^{(j)} + [0, \tilde{\alpha}_m]c_{t-1}^{(m)}$, we consequently have $p_t^{\pi} = K\varepsilon_{t-1}^{\pi} + [0, \tilde{\alpha}_m]c_{t-1}^{\pi}$. Therefore, for the three components of $\hat{X}_m$ we obtain, with appropriate definitions of the random variables $S_m$, $T_m$ and using standard asymptotics,

$$\begin{aligned}
\langle p_t^{\pi}, p_t^{\pi}\rangle &= \langle K\varepsilon_{t-1}^{\pi} + \tilde{\alpha}_m c_{t-1,\bullet}^{\pi}, K\varepsilon_{t-1}^{\pi} + \tilde{\alpha}_m c_{t-1,\bullet}^{\pi}\rangle \to K E[\varepsilon_{t-1}\varepsilon_{t-1}']K' + \tilde{\alpha}_m E\big[c_{t-1,\bullet}^{\Pi}(c_{t-1,\bullet}^{\Pi})'\big]\tilde{\alpha}_m' > 0, \\
\langle p_t^{\pi}, c_{t-1}^{\pi}\rangle &= \langle K\varepsilon_{t-1}^{\pi} + \tilde{\alpha}_m c_{t-1,\bullet}^{\pi}, c_{t-1}^{\pi}\rangle \xrightarrow{d} \big[S_m,\; \tilde{\alpha}_m E\,c_{t-1,\bullet}^{\Pi}(c_{t-1,\bullet}^{\Pi})'\big], \\
\langle p_t^{\pi}, c_{t-1}^{\pi}\rangle\langle c_{t-1}^{\pi}, c_{t-1}^{\pi}\rangle^{-1}\tilde{D}_c^{-1} &= [0, \tilde{\alpha}_m] + K\langle\varepsilon_{t-1}^{\pi}, c_{t-1}^{\pi}\rangle\langle c_{t-1}^{\pi}, c_{t-1}^{\pi}\rangle^{-1}\tilde{D}_c^{-1},\qquad K\langle\varepsilon_{t-1}^{\pi}, c_{t-1}^{\pi}\rangle\langle c_{t-1}^{\pi}, c_{t-1}^{\pi}\rangle^{-1}\tilde{D}_c^{-1} \xrightarrow{d} [T_m, 0].
\end{aligned}$$

Correspondingly the first block column $\hat{X}_{m,u}$ of $\hat{X}_m$ converges to zero such that $T\hat{X}_{m,u}$ converges in distribution, while the second block column converges in probability without normalization. This shows that

$$T\sum_{i=1}^{c_m}\hat{\lambda}_i = T\,\mathrm{tr}\left\{U_m'(\hat{X}_m - X_m)\left[U_m - X_m^{\dagger}(\hat{X}_m - X_m)U_m\right]\right\} + o_P(1) = \mathrm{tr}\left\{T\hat{X}_{m,uu}\right\} + o_P(1)$$
converges in distribution. The limit is given in [4].
For the case of the estimated state note that the difference between the estimated and the true state is given as

$$\tilde{x}_t - x_t = \tilde{\Gamma}_p\tilde{z}_{t,p} - \Gamma_p\tilde{z}_{t,p} - \bar{A}^p x_{t-p} = (\tilde{\Gamma}_p - \Gamma_p)\tilde{z}_{t,p} - \bar{A}^p x_{t-p}.$$

The strict minimum-phase assumption and the assumption on the increase of $p = p(T)$ imply that the second term is $o_P(T^{-1})$ and can be neglected. Furthermore

$$(\tilde{\Gamma}_p - \Gamma_p)\tilde{z}_{t,p} = (\tilde{\Gamma}_p - \Gamma_p)\tilde{D}_z^{-1}\,\tilde{D}_z\tilde{z}_{t,p},\qquad (\tilde{\Gamma}_p - \Gamma_p)\tilde{D}_z^{-1} = O_P(T^{-1/2}).$$

Using this it can be concluded that

$$\begin{aligned}
\langle\hat{p}_t, \hat{p}_t\rangle &= \langle p_t, p_t\rangle + O_P(T^{-1/2}), &\qquad \langle\hat{p}_t, \hat{c}_{t-1,\bullet}^{(m)}\rangle &= \langle p_t, c_{t-1,\bullet}^{(m)}\rangle + O_P(T^{-1/2}), \\
\langle\hat{p}_t, \hat{c}_{t-1,u}^{(m)}\rangle &= \langle p_t, c_{t-1,u}^{(m)}\rangle + O_P(T^{-1/2}), &\qquad \langle\hat{c}_{t,u}^{(m)}, \hat{c}_{t,u}^{(k)}\rangle &= \langle c_{t,u}^{(m)}, c_{t,u}^{(k)}\rangle + O_P(1).
\end{aligned}$$

These equations imply that the difference between the expression using the true state and the one using the estimated state converges to zero, so that the two tests asymptotically accept and reject jointly under the null hypothesis.
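As a final illustration, a numpy sketch (hypothetical names; `controls` stands for whatever is partialled out in the $\pi$-projection, e.g. deterministic terms and the filtered components at the other unit roots) of forming $\hat{X}_m$ from projection residuals and summing its $c_m$ eigenvalues converging to zero:

```python
# Sketch: pi-projection residuals and the eigenvalue statistic for X_hat_m.
import numpy as np

def pi_residuals(V, controls):
    beta, *_ = np.linalg.lstsq(controls, V, rcond=None)
    return V - controls @ beta

def theorem4_statistic(p_t, c_lag, controls, c_m):
    T = p_t.shape[0]
    p_pi = pi_residuals(p_t, controls)
    c_pi = pi_residuals(c_lag, controls)
    # X_hat_m = <c,p><p,p>^{-1}<p,c><c,c>^{-1}; the 1/T averaging factors cancel
    Scp = c_pi.T @ p_pi
    Spp = p_pi.T @ p_pi
    Scc = c_pi.T @ c_pi
    X_hat = Scp @ np.linalg.solve(Spp, Scp.T) @ np.linalg.inv(Scc)
    lam = np.linalg.eigvals(X_hat)
    smallest = lam[np.argsort(np.abs(lam))[:c_m]]   # block converging to zero
    return T * np.sum(smallest).real
```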

References

  1. Rodrigues, P.M.; Taylor, A. Alternative estimators and unit root tests for seasonal autoregressive processes. J. Econom. 2004, 120, 35–73.
  2. Johansen, S.; Schaumburg, E. Likelihood Analysis of Seasonal Cointegration. J. Econom. 1999, 88, 301–339.
  3. Hylleberg, S.; Engle, R.; Granger, C.; Yoo, B. Seasonal Integration and Cointegration. J. Econom. 1990, 44, 215–238.
  4. Cubadda, G. Complex Reduced Rank Models for Seasonally Cointegrated Time Series. Oxf. Bull. Econ. Stat. 2001, 63, 497–511.
  5. Cubadda, G.; Omtzigt, P. Small-sample improvements in the statistical analysis of seasonally cointegrated systems. Comput. Stat. Data Anal. 2005, 49, 333–348.
  6. Ahn, S.K.; Cho, S.; Seong, B. Inference of Seasonal Cointegration: Gaussian Reduced Rank Estimation and Tests for Various Types of Cointegration. Oxf. Bull. Econ. Stat. 2004, 66, 261–284.
  7. Vivas, E.; Allende-Cid, H.; Salas, R. A Systematic Review of Statistical and Machine Learning Methods for Electrical Power Forecasting with Reported MAPE Score. Entropy 2020, 22, 1412.
  8. García-Martos, C.; Rodríguez, J.; Sánchez, M.J. Forecasting electricity prices and their volatilities using Unobserved Components. Energy Econ. 2011, 33, 1227–1239.
  9. Dufour, J.M.; Stevanović, D. Factor-augmented VARMA models with macroeconomic applications. J. Bus. Econ. Stat. 2013, 31, 491–506.
  10. Dias, G.; Kapetanios, G. Estimation and forecasting in vector autoregressive moving average models for rich datasets. J. Econom. 2018, 202, 75–91.
  11. Foroni, C.; Marcellino, M.; Stevanović, D. Mixed-frequency models with moving-average components. J. Appl. Econom. 2019, 34, 688–706.
  12. Kascha, C.; Trenkler, C. Simple identification and specification of cointegrated VARMA models. J. Appl. Econom. 2015, 30, 675–702.
  13. Ravenna, F. Vector autoregressions and reduced form representations of DSGE models. J. Monet. Econ. 2007, 54, 2048–2064.
  14. Komunjer, I.; Zhu, Y. Likelihood ratio testing in linear state space models: An application to dynamic stochastic general equilibrium models. J. Econom. 2020, 218, 561–586.
  15. Bauer, D.; Wagner, M. A State Space Canonical Form for Unit Root Processes. Econom. Theory 2012, 28, 1313–1349.
  16. Larimore, W.E. System identification, reduced order filters and modeling via canonical variate analysis. In Proceedings of the 1983 American Control Conference, San Francisco, CA, USA, 22–24 June 1983; pp. 445–451.
  17. Bauer, D. Comparing the CCA subspace method to quasi maximum likelihood methods in the case of no exogenous inputs. J. Time Ser. Anal. 2006, 26, 631–668.
  18. Bauer, D. Using subspace methods for estimating ARMA models for multivariate time series with conditionally heteroskedastic innovations. Econom. Theory 2008, 24, 1063–1092.
  19. Bauer, D. Using Subspace Methods to Model Long-Memory Processes. In Theory and Applications of Time Series Analysis. ITISE 2018. Contributions to Statistics; Valenzuela, O., Rojas, F., Pomares, H., Rojas, I., Eds.; Springer: Berlin/Heidelberg, Germany, 2019.
  20. Bauer, D.; Wagner, M. Estimating Cointegrated Systems Using Subspace Algorithms. J. Econom. 2002, 111, 47–84.
  21. Bauer, D. Estimating linear dynamical systems using subspace methods. Econom. Theory 2005, 21, 181–211.
  22. Hannan, E.J.; Deistler, M. The Statistical Theory of Linear Systems; John Wiley: New York, NY, USA, 1998.
  23. Chatelin, F. Eigenvalues of Matrices; John Wiley & Sons: Hoboken, NJ, USA, 1993.
  24. Bauer, D. Asymptotic Distribution of Estimators in Reduced Rank Regression Settings When the Regressors Are Integrated. Technical Report. 2012. Available online: http://arxiv.org/abs/1211.1439 (accessed on 26 March 2021).
  25. Phillips, P.C.B.; Durlauf, S.N. Multiple Time Series Regression with Integrated Processes. Rev. Econ. Stud. 1986, 53, 473–495.
  26. Carrasco, M.; Chen, X. Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models. Econom. Theory 2002, 18, 17–39.
  27. Bauer, D.; Wagner, M. Autoregressive Approximations to MFI(1) Processes; Technical Report; Department for Mathematical Methods in Economics, TU Wien: Vienna, Austria, 2004.
  28. Bierens, H. Nonparametric cointegration analysis. J. Econom. 1997, 77, 379–404.
  29. Wagner, M. A Comparison of Johansen's, Bierens' and the Subspace Algorithm Method for Cointegration Analysis. Oxf. Bull. Econ. Stat. 2004, 66, 399–424.
  30. Johansen, S.; Nielsen, M. The cointegrated vector autoregressive model with general deterministic terms. J. Econom. 2018, 202, 214–229.
  31. Lee, H. Maximum Likelihood Inference on Cointegration and Seasonal Cointegration. J. Econom. 1992, 54, 1–47.
  32. Bauer, D.; Wagner, M. Using subspace algorithm cointegration analysis: Simulation performance and application to the term structure. Comput. Stat. Data Anal. 2009, 53, 1954–1973.
  33. Bauer, D. Order Estimation for Subspace Methods. Automatica 2001, 37, 1561–1573.
  34. Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2006.
  35. Qu, Z.; Perron, P. A Modified Information Criterion for Cointegration Tests Based on a VAR Approximation. Econom. Theory 2007, 23, 638–658.
  36. Mulla, R. Hourly Energy Consumption. Available online: www.kaggle.com/robikscube/hourly-energy-consumption/ (accessed on 22 January 2021).
  37. del Barrio Castro, T.; Rodrigues, P.M.M.; Taylor, A.M.R. Temporal Aggregation of Seasonally Near-Integrated Processes. J. Time Ser. Anal. 2019, 40, 872–886.
  38. Bauer, D. Almost sure bounds on the estimation error for OLS estimators when the regressors include certain MFI(1) processes. Econom. Theory 2009, 25, 571–582.
  39. Ahn, S.; Reinsel, G. Estimation of Partially Nonstationary Vector Autoregressive Models with Seasonal Behaviour. J. Econom. 1994, 62, 317–350.
  40. Bauer, D.; Deistler, M.; Scherrer, W. Consistency and Asymptotic Normality of Some Subspace Algorithms for Systems Without Observed Inputs. Automatica 1999, 35, 1243–1254.
Figure 1. Eigenvalues around z = i of 1000 replications when γ = 0.2 (c_{π/2} = 1).
Figure 2. Relationship between hit rates and chosen values of f and k, illustration for the VARMA dgp using A_2. The lower x-axes show f or k, above are the choice frequencies of the selection criteria.
Figure 3. Empirical power of the different test procedures (VARMA dgp with A_2). Twice the Monte Carlo standard error is 0.005.
Figure 4. Mean of absolute value of one day ahead prediction error over all four components. CVA (blue), AR (red) and long AR (black). Dash-dot lines refer to the t-distribution.
Figure 5. Results of the unit root tests for all seasonal unit roots jointly.
Figure 6. Electricity consumption data.
Figure 7. Periodic patterns from dummy variables.
Figure 8. BIC values for univariate models and multivariate model (dashed line; divided by four to fit).
Figure 9. Results for the hourly datasets.
Figure 10. Residual analysis.
Figure 11. Results for the hourly datasets.
Figure 12. Poles (x) and zeros (o) of the transfer functions corresponding to the three models: Mod 1 (red), Mod 2 (blue), Mod 3 (magenta).
Table 1. Eigenvalues of the coefficient matrix of the companion form.

                      j:      1       2       3       4        5                6             7         8
  γ = 0.2     z_j:           −1       1       i      −i   0.126 + 0.99i   0.126 − 0.99i   −0.790     0.737
              |z_j|:          1       1       1       1       0.998           0.998        0.790     0.737
  γ = 0       μ_j:           −1       i      −i       1         i               −i          0.775   −0.775
              |μ_j|:          1       1       1       1         1                1          0.775    0.775
Table 2. Hit rates for the different tests (VAR dgp). Twice the maximum (over all entries) Monte Carlo standard error is 0.005.

                          0                       π/2                           π
              T       Λ       J       Λ      JS      Q1      Q2      Q3      Λ       J
  γ = 0      50    0.685   0.348   0.351   0.903   0.844   0.851   0.844   0.681   0.343
            100    0.841   0.732   0.490   0.925   0.900   0.902   0.900   0.831   0.724
            200    0.897   0.951   0.841   0.934   0.925   0.924   0.925   0.876   0.936
            500    0.931   0.938   0.916   0.949   0.941   0.942   0.941   0.927   0.948
  γ = 0.2    50    0.550   0.367   0.811   0.796   0.777   0.778   0.788   0.604   0.297
            100    0.711   0.801   0.087   0.920   0.913   0.908   0.908   0.799   0.806
            200    0.907   0.922   0.855   0.954   0.949   0.948   0.947   0.854   0.939
            500    0.944   0.953   0.927   0.939   0.938   0.938   0.936   0.924   0.942
Table 3. Hit rates for the different tests (VARMA dgp). Twice the maximum (over all entries) Monte Carlo standard error is 0.005.

                          0                       π/2                           π
              T       Λ       J       Λ      JS      Q1      Q2      Q3      Λ       J
  A_1        50    0.890   0.003   0.906   0.024   0.027   0.032   0.025   0.897   0.008
            100    0.928   0.434   0.944   0.755   0.783   0.783   0.761   0.930   0.440
            200    0.936   0.937   0.923   0.925   0.915   0.916   0.915   0.925   0.924
            500    0.852   0.901   0.853   0.919   0.906   0.904   0.904   0.853   0.894
  A_2        50    0.863   0.008   0.785   0.062   0.047   0.063   0.039   0.867   0.006
            100    0.917   0.500   0.880   0.582   0.596   0.596   0.571   0.916   0.518
            200    0.931   0.927   0.882   0.908   0.915   0.913   0.911   0.919   0.922
            500    0.824   0.882   0.786   0.878   0.860   0.859   0.861   0.812   0.865
Table 4. Mean gaps between estimated and true cointegrating spaces (VARMA dgp). 2MCse denotes twice the maximal Monte Carlo standard error for the corresponding row.

                             0                            π/2                            π
        T    2MCse    CVA      J     GARR    CVA     JS    cRRR   GARR    CVA      J     GARR
  A_1   50   0.016   0.116   0.189  0.192   0.091  0.147  0.130  0.130   0.111   0.192  0.197
       100   0.004   0.047   0.048  0.048   0.039  0.035  0.035  0.035   0.047   0.046  0.046
       200   0.003   0.023   0.019  0.019   0.019  0.016  0.016  0.016   0.024   0.019  0.019
       500   0.003   0.012   0.007  0.007   0.008  0.008  0.006  0.006   0.011   0.007  0.007
  A_2   50   0.016   0.174   0.245  0.242   0.250  0.349  0.331  0.331   0.165   0.231  0.234
       100   0.004   0.072   0.061  0.061   0.098  0.080  0.078  0.078   0.069   0.060  0.060
       200   0.003   0.031   0.026  0.026   0.047  0.036  0.034  0.034   0.032   0.027  0.027
       500   0.003   0.016   0.011  0.010   0.021  0.015  0.013  0.013   0.017   0.011  0.011
Table 5. Percentage of accept (minimum for all unit root frequencies) and reject (maximum for non unit root frequencies) of the Λ(1) test statistic.

              Unit Root Frequencies                Non Unit Root Frequencies
    T     norm    G1     IG1    IG2    IG3     norm    G1     IG1    IG2    IG3
  104     0.94   0.89   0.87   0.88   0.87     0.87   0.82   0.79   0.82   0.79
  208     0.98   0.96   0.95   0.94   0.96     0.78   0.75   0.72   0.72   0.69
  312     0.97   0.96   0.96   0.95   0.95     0.00   0.00   0.00   0.00   0.00
Table 6. Summary of data sets.

               Daily Obs. (4263 est., 577 val.)             Hourly Obs. (102,291 est., 13,845 val.)
  Region      Mean    Mean(log)  Std.(log)  AIC   BIC        Mean(log)  Std.(log)   AIC   BIC
  AEP       371,844     12.82      0.127     43    12           9.63      0.168     782   532
  DAYTON     48,897     10.79      0.144     43     3           7.60      0.193     772   531
  DOM       262,727     12.47      0.158     17     3           9.28      0.215     795   554
  DUQ        39,837     10.58      0.130     23     7           7.40      0.177     800   529