1. Introduction
A tornado is a small-scale vortex flow with strong destructive force. The United States is the country with the most tornadoes in the world. According to the records of the National Oceanic and Atmospheric Administration (NOAA) (spc2016) [
1], a total of 62208 tornadoes occurred from 1950 to 2016. Almost every year, tornadoes cause damage to property and the economy and also result in deaths and injuries. Tornadoes are a major concern in the United States. In order to estimate tornado damages, risk analysis is needed [
2].
The risk analysis of tornadoes requires building an appropriate model. In general, the characteristics of tornadoes can be described by physical parameters such as path length and width, moving direction, moving speed, and wind speed model, as well as statistical parameters such as annual frequency and seasonal characteristics. Banik [
3] used the frequency, intensity, path length, and width of tornadoes to establish a tornado risk assessment model in southern Ontario, Canada. Shen [
4] and Coleman [
5] established a tornado risk assessment model in the United States by using the frequency of tornadoes. Tamura [
6] established a tornado risk analysis model for Japanese tornado data according to the frequency, length, width, and moving direction of tornadoes. In the above risk analysis, frequency analysis is an important part of tornado analysis. Generally, meteorological variables such as tornadoes have many uncertainties. The Bayesian hierarchical model is often used in the study of the probability model of meteorological variables [
7]. For example, Cheng [
8] established the probability model by using a Bayesian hierarchical model for the frequency of tornadoes in Canada. Sang [
9] also adopted the spatial model for building extreme temperature values with space-time attributes. Potvin [
10] used Bayesian hierarchical modeling to estimate tornado reporting rates and expected tornado counts in the central United States from 1975 to 2016. To improve estimates of tornado frequency in the central United States, Potvin [
11] developed a sophisticated statistical model that accounts for these population-dependent tornadoes. Moore [
12] analyzed the spatial distribution characteristics of tornadoes in the United States; they [
13] looked at the relationship between tornado activity in the United States and the El Nio/Southern Oscillation in all four seasons and across multiple regions. Coffer [
14] used Random Forest classification to predict tornadoes. Allen [
15] employed Kernel Density Estimation for spatial pattern analysis and space-time cubes to visualize the spatiotemporal frequency of tornadoes and potential trends. Gensini [
16] analyzed the spatial trends in the United States’ tornado frequency. They focused on an environmental covariate approach to examine potential changes in United States tornado frequency. Cao [
17] developed a statistical framework to quantitatively explain two-way interconnections between long-term climate trends and internal variabilities. Some of the studies might consider the spatial correlation of tornadoes [
18,
19,
20,
21] but have not considered the characteristics of dispersion. The aforementioned literature used Poisson distribution to fit the data but did not check if it was appropriate for the data. Recently, Huang [
22] has proposed a new first-order, non-negative, integer-valued autoregressive model with Bell innovations(BL-INAR(1)) to analyze the count data, and the model is suitable for counts exhibiting overdispersion. The data is non-negative and integer-valued Poisson distribution is usually used for count data analysis and sometimes generalized linear model since the data is a non-negative integer-value. The count data with repeated measures may have a dispersion issue, which is often ignored in engineering practice.
The simple analysis of the annual frequency of tornadoes in various regions of the United States from 1967 to 2016 shows that the spatial correlation and overdispersion phenomenon exists. To account for those phenomena, the Bayesian method is applied to analyze the annual frequency of tornadoes with three distributions, i.e., Poisson distribution, negative binomial distribution and Polya distribution. The parameters of each distribution are assumed to vary with the grids to account for the inhomogeneity in the dataset. Spatial correlation is assumed by the a priori distribution of spatial model parameters. MCMC (Markov Chain Monte Carlo) method is used for posterior parameter estimation. Finally, the simulation analysis is carried out by using the established spatial model. The numerical characteristics and spatial correlation of the simulated tornado frequency are compared with the actual data to show the effectiveness of the proposed method.
We organize the rest of this article as follows. In
Section 2, we briefly introduce the Bayesian hierarchical model and the three discrete distributions, including its definition and some properties.
Section 3 introduces the method of data processing and analysis, and explains why we adopt the method in this paper. In
Section 4, we use the MCMC algorithm to estimate the parameters of the model and then compare the results with other models to show the superior performance of the proposed model, and the simulation analysis also shows the practicability of the model. The paper concludes in
Section 5.
2. Methodology
Here we attempt to establish a probability model of the annual frequency of tornadoes by using the Bayesian inference technique. Firstly, the annual frequency of tornadoes in each region is assumed to follow a discrete distribution, and the discrete distribution is decomposed into a series of conditional probability models to obtain a Bayesian hierarchical model. X, Y, and Z are random variables and their joint probability density function can be expressed as .
Hypothesis: , observation data set; , parameter set; , hyperparameter set.
The Bayesian hierarchical model simplifies complex problems into three main parts:
Sampling distribution:
Priors for :
Priors for :
The first part is mainly concerned with the sampling distribution function or likelihood function of the observed values under the condition that the relevant parameters of the observed data are given. The second part is concerned with the prior distribution, which is the probability distribution of the parameters of the data model in part one under the condition of hyperparameters. The last part is the prior distribution of the hyperparameters, also known as the super prior distribution. In application, these parts can be subdivided into some sub-stages to construct a multi-stage hierarchical model, which makes the model more flexible.
The three parts are introduced to study what distribution the process and parameters of interest follow given the observed data, that is, the posterior of the parameters and processes. The expression is given by Bayes’ rule:
In Bayesian inference, assuming that the parameters of the model are not fixed values, but random variables, the appropriate distribution of parameters can be assigned according to experience and prior knowledge, that is, super prior distribution. For the selection of prior distribution, the conjugate distribution of the likelihood function is generally selected to facilitate iterative operation, or without prior information, the uninformative prior is directly selected.
In this paper, under the condition that the annual frequency data of tornadoes in each region are known, the distribution functions of frequency and parameters are assumed to construct a Bayesian hierarchical model. Finally, the posterior parameters of the annual frequency distribution are estimated by MCMC sampling.
2.1. Sampling Distributions
The data set is given, where is the frequency of tornadoes in the k-th area in the i-th year. Here, it is assumed that the distribution of tornado frequency in different years is the same in the same year.
- (1)
Suppose that the annual frequency of tornadoes in each region in the i-th year follows a Poisson distribution under the condition of the expected occurrence frequency :
The probability density function of the Poisson distribution is:
The probability density function of the Poisson distribution is:
and the mean and variance of the Poisson distribution are equal.
- (2)
Suppose that
follows a negative binomial distribution with parameters
and
:
The probability density function of negative binomial distribution, expectation, and variance are respectively:
Equations (6) and (7) show that the mean value of the negative binomial distribution is smaller than the variance.
- (3)
Suppose that
follows a Polya distribution with parameters
and
:
The probability density function of polya [
20] distribution, expectation, and variance, respectively:
2.2. Priors of Parameters
- (1)
Suppose that the parameter
of the Poisson distribution follows the lognormal distribution with parameters
and
:
where
. Covariance matrix
is a function of spatial distance, which can be expressed as:
The off-diagonal elements of the covariance matrix are not zero, indicating that there is the correlation between the elements of a random variable . Where is the spatial distance matrix, taking the Euclidean distance, we have , where represents the latitude of m and n regions, and represents the corresponding longitude.
- (2)
The value of parameter
of the negative binomial distribution is non-negative. Assume that uninformative prior is a lognormal distribution, that is:
The value of parameter
ranges from 0 to 1. In order to consider the spatial correlation,
is assumed to follow a lognormal distribution with parameters
and
, i.e.,
The assumption of covariance matrix is the same as that of Poisson distribution.
- (3)
Suppose that the parameter
of the Polya distribution follows an uninformative prior of the uniform distribution,
where a and b are non-negative integers.
Considering the spatial correlation, assume that the parameter
follows a lognormal distribution with hyperparameters
and
, i.e.,
The assumption of covariance matrix is the same as that of Poisson distribution.
2.3. Priors of Hyperparameters
Assume that the elements
of the mean
of lognormal distribution are independent and identically distributed, and its prior distribution is a normal distribution
where
and
are constants.
Suppose that the hyperparameter follows the uninformative prior distribution of in all three distributions, where and k is a constant.
4. Analysis Results
4.1. Estimation Algorithm
After establishing the Bayesian hierarchical model, a series of parameters of posterior distribution can be sampled by the MCMC method. The MH (metropolis Hastings) algorithm in the MCMC method is adopted in this paper. The basic steps of this algorithm are as follows:
(1) Select the appropriate proposed distribution , and is the objective function.
(2) Generate from the proposed distribution as the initial value of the sampling sequence.
(3) Repeat the following steps until the sampling sequence converges to a stable state according to some criteria.
(3.1) Generate candidate values from .
(3.2) Generate a random number from the uniform distribution .
(3.3) If , accept , , and the sampling sequence is updated in step , otherwise , the sampling sequence has not been updated in step .
(3.4) Increase the value of .
In the Bayesian hierarchical model in this paper, the likelihood function is the target distribution, and the super prior distribution is the proposed distribution. On the premise of known observation data, the estimated values of parameters and super parameters are obtained through multiple iterations; that is, the estimated values of parameters , , , ,
4.2. Comparison between Simulation Frequency and Actual Frequency
According to the Bayesian hierarchical model established above, the posterior parameter estimates of three distributions
are obtained by the MH algorithm. The parameters of Polya’s interpretation model in
Section 2.2 are
a = 0,
b = 10; the values of
and
in the parameter model in
Section 2.3 are
and
in the Poisson distribution; In the negative binomial distribution,
,
; In Polya distribution,
,
;
k= 100 in Poisson distribution and negative binomial distribution,
k = 20 in Polya distribution. It can be seen from the data model in
Section 2.1 that if the parameters of the probability distribution are known, the mean and variance of the distribution function can be estimated from the parameters. Therefore, after the posterior parameter estimates are obtained, the mean and variance corresponding to each grid distribution function are calculated and then compared with the mean and variance calculated from the actual data.
From dataset
, eight grids were randomly selected. The mean and variance of the tornado frequency series of each grid are shown in
Table 2. At the same time, the values of mean and variance obtained from the estimation of posterior parameters in the corresponding grids under several distributions and BL-INAR(1) model are given. The results are also shown in
Table 2.
The results in
Table 2 show that under the three distribution assumptions and BL-INAR(1) model, the mean value calculated by the posterior parameters is relatively close to the actual mean value, except for Polya. It was found that the mean values of the negative binomial distribution and BL-INAR(1) model were close to the actual value. The mean of the Poisson distribution was not bad, but the Polya distribution was not suitable for the data. The variance range of the actual data in each grid was 1.24–1111.56, which fluctuated greatly. It was seen that the variance estimated by the posterior parameters of the negative binomial distribution was closer to the real variance, and better described the volatility and overdispersion of the data, only slightly overestimated. BL-INAR(1) model can closely estimate the mean values and explain part of the overdispersion of the data, but its variances were much smaller than the actual data. Note that the BL-INAR(1) model did not consider the spatial correlation among the sites. It means that the BL-INAR(1) model is a potential method if it can be modified by different distributions of spatial structure in the parameters. By comprehensive comparison analysis, the Bayesian hierarchical model, considering spatial correlation, was effective, and the model with negative binomial distribution was the best, which had more consistent results with the actual data.
After obtaining the estimates of the posterior parameters of the three distributions, the established models are used to simulate the annual tornado frequency of each grid. We used a histogram to compare the distributions of simulated data and real data. The simulation study did not consider the BL-INAR(1) model since it can not simulate the spatial correlation among the sites. It can be seen from
Figure 4 that the simulated data of Poisson distribution and negative binomial distribution are closer to the actual data. More details from
Figure 4 are as follows:
- (1)
The zero frequency of Poisson distribution simulation data is close to that of the real data. The number of grids falling into the first interval of Poisson distribution is higher than that of the real data in the 9–12 interval, and slightly lower than that in the last few intervals.
- (2)
For the simulated data with a negative binomial distribution, the number of grids in the first two intervals is slightly higher than the actual value, the number of grids in the 5–15 intervals is slightly lower than the real number, and the number of grids in the subsequent intervals is higher than the real number. The simulated values of the negative binomial distribution are more dispersed and can better reflect the characteristics of the original numbers.
- (3)
The grid number of simulated data of the Polya distribution is concentrated in the first interval, and the grid number of subsequent intervals is basically lower than that of real data. The simulation value of tornado occurrence times of Polya distribution in each grid is obviously small, and its effect is the worst compared with the first two distributions.
It can be seen that the distribution of the simulated frequencies of tornadoes with a negative binomial model is quite close to the distribution of the real value. Such a result reflects the superiority of the proposed method.
4.3. Analysis of Posterior Parameter Estimates
The spatial correlation of the original data is established by the prior distribution of the parameters in the model. Therefore, the spatial correlation of posterior parameters should be similar to the spatial correlation of actual data. The sample of parameters of Poisson distribution, negative binomial distribution, and Polya distribution is obtained through MCMC, the spatial correlation coefficients of the stabilized parameters
,
and
are calculated, respectively. The spatial correlation coefficient matrix of the parameters adopts the same eigenvector angular ordering method as the spatial correlation coefficient matrix of the original data, and the correlation coefficient matrix shown in
Figure 5 is obtained. The results show that the posterior parameters have an obvious spatial correlation. The correlation coefficient matrix of Poisson distribution parameters is shown in
Figure 5a. It can be seen that the Poisson distribution also reflects the spatial correlation, but its characteristics are significantly different from those in
Figure 2a. The correlation coefficient matrix of the negative binomial distribution parameters is shown in
Figure 5b, and its correlation matrix is more similar to the correlation matrix of the original data (
Figure 2). The correlation coefficient matrix (c) of Polya distribution parameters is similar to that of the Poisson distribution and does not really reflect the spatial correlation of the original data. The correlations of the posterior parameters of the three distributions are different since they are related to their distribution characteristics and parameter assumptions. In addition, the posterior parameters of different distributions retain an obvious correlation. Thus, it is feasible to set the superparameters of the model as a function of spatial distance so as to consider the spatial correlation of tornado occurrence frequency.
5. Conclusions
In this paper, three distributions, including Poisson distribution, negative binomial distribution, and Polya distribution, and BL-INAR(1) models are used to fit the annual tornado frequency data. Considering the spatial correlation of the data, probability models of the annual tornado frequency in each region is established based on the Bayesian hierarchical framework. The means and variances of the data were compared with those of the posterior analysis. The established models were also used to simulate the frequencies of tornadoes and the corresponding spatial relationship. The results are summarized as follows:
- (1)
The posterior analysis shows that estimated means and variances of negative binomial for each grid are closer to those of actual data and provide a better explanation of overdispersion shown in the data. These statistics are better than those by existing distributions, such as the Poisson distribution.
- (2)
The distributions of the simulated frequency based on a negative binomial is close to the distribution of the actual data, meaning the negative binomial model is more suitable for the data.
- (3)
The analysis of raw data reveals the spatial correlation of data. The proposed spatial correlation analysis based on negative binomial distribution is consistent with the results of actual data.
Overall, the proposed method with a negative binomial model can better describe the spatial correlation and overdispersion of tornado frequencies. That has a certain guiding significance for the disaster reduction and prevention of tornadoes. However, it should be noted that the occurrence of tornadoes may be related to time and many meteorological factors, such as temperature, air pressure, vertical wind shear, and even illumination and rainfall, etc. The observed data may also be affected by non-meteorological factors such as monitoring equipment, stations, period, and so on. Therefore, data collection of tornadoes with more variates is needed for more accurate analysis. Furthermore, the advanced models can be developed, such as BL-INAR(1) model with different distributions and spatially correlated parameters, semiparametric models with consideration of overdispersion and spatial structure, etc.