1. Introduction
The hidden Markov model (HMM), introduced by Baum and his colleagues in a series of papers (Baum and Petrie [1]), is a useful way to represent dependent heterogeneous phenomena. It is often used to model stochastic processes and time-dependent sequences and has been applied to a wide range of applications such as medical diagnosis, financial forecasting, and natural language processing. In many applications, the marginal distribution of the observations is clearly multimodal, meaning that the observations come from a mixture of different distributions associated with different regimes. This behavior is a key attribute of HMMs. HMMs can be used to detect these regimes and find the hidden states that correspond to them. Once the hidden states have been found, a mixture of different distributions is identified on them (Titterington et al. [2]; Zucchini and MacDonald [3]). Inference in HMMs is typically based on maximum likelihood or Bayesian approaches. However, the dependence structure of HMMs can make computation more difficult than for ordinary mixtures. Robert et al. [4] provided an efficient Bayesian estimate of HMMs via Gibbs sampling, and Chib [5] introduced a widely used state simulation procedure (see also Campbell et al. [6]). Li et al. [7] developed an approach to multivariate time series anomaly detection in which an HMM is used to detect anomalies in multivariate time series. Nguyen [8] used an HMM to predict the daily stock prices of three actively traded stocks, Apple, Google, and Facebook, based on their historical data.
HMMs have a number of advantages, including flexibility, interpretability, and efficiency, but they also have disadvantages, such as the conditional independence assumption and the limitation of the state space. Another limitation of HMMs lies in directly incorporating additional explanatory variables. Covariate-dependent hidden Markov models (CDHMMs) are a class of statistical models used to model sequential data where the transition probabilities between hidden states depend on observed covariates. This makes CDHMMs well-suited for a wide range of applications, such as medical diagnosis, financial forecasting, and natural language processing. CDHMMs can be used to identify different market states, such as bull markets, bear markets, and sideways markets, or to model the progression of Alzheimer's disease and to identify its different stages. CDHMMs can also be used to segment customers into different groups based on their purchase behavior; for example, they can identify customers who are likely to churn and customers who are likely to be high-value customers.
One of the most common approaches to modeling covariates in HMMs is to use a mixture transition distribution (MTD) model. In an MTD model, the transition probabilities between hidden states are modeled as a mixture of distributions, where the weights of the mixture components are determined by the covariates. This approach is relatively simple to implement and can be used to model a wide range of covariate effects. Another approach to modeling covariates in HMMs is to use a covariate-dependent transition matrix model, illustrated below. In this approach, the transition probabilities between hidden states are parameterized directly using the covariates. This approach is more flexible than the MTD approach, but it can also be more difficult to implement and estimate.
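To fix ideas, a common parameterization of the second approach (a generic sketch, not the specific model developed later in this paper) links each row of the transition matrix to the covariates through a multinomial logit; with $K$ hidden states and covariate vector $\mathbf{x}_t$,

```latex
\Pr(S_t = j \mid S_{t-1} = i, \mathbf{x}_t)
  = \frac{\exp(\alpha_{ij} + \mathbf{x}_t^{\top} \boldsymbol{\gamma}_{ij})}
         {\sum_{k=1}^{K} \exp(\alpha_{ik} + \mathbf{x}_t^{\top} \boldsymbol{\gamma}_{ik})},
  \qquad i, j = 1, \dots, K,
```

with one reference category fixed in each row for identifiability. Each row requires its own set of regression coefficients, which is why the number of parameters grows quickly with the size of the state space.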
Rabiner and Juang [9] applied the CDHMM to speech recognition and showed that it outperformed HMMs without covariate-dependent transition probabilities. Marshall and Jones [10] discussed the application of a multi-state model to diabetic retinopathy under the assumption that a continuous-time Markov process determines the transition times between disease stages. Altman [11] presented a class of mixed HMMs where covariates and random effects capture differences between an observed process and an underlying hidden process. Chamroukhi et al. [12] proposed a finite mixture model of hidden Markov regression with covariate dependence. They applied the CDHMM to financial time series analysis and showed that it outperformed other methods for financial time series modeling. Maruotti [13] provided a general review of statistical methods that combine HMMs and random effects models in a longitudinal setting.
Rubin et al. [14] proposed a joint logistic regression and Markov chain model to describe a binary cross-sectional response, where the unobserved transition rates of a two-state continuous-time Markov chain are included as covariates. Sarkar and Zhu [15] proposed a novel approach for detecting clusters and regimes in time series data in the presence of random covariates, which is based on a CDHMM with covariate-dependent transition probabilities. In a covariate-dependent transition matrix model, HMMs are extended by allowing the transition probabilities to depend on covariates through regression models. In this case, each transition probability requires a different regression model, which can be especially problematic when the dimension of the state variable is large. To avoid this problem, we here introduce the previous hidden state as another explanatory variable in the HMM through a logistic regression model. That is, the Markovian property is achieved by embedding the previous state variable in the logistic regression model. Note that the proposed model can be applied regardless of the dimension of the state variable in the hidden Markov model.
In some cases, it can be difficult to find the factors that affect a particular phenomenon. This is because the factors may be hidden or may interact in complex ways. HMMs can be used to analyze hidden states without using observed values as the dependent variable. This can be an effective way to find significant variables if they cannot be found directly. In Korea, matsutake mushrooms have not been cultivated on a large scale. Instead, they are harvested from natural forests. The effect of climatic factors on matsutake mushroom yield has not been studied in detail. Some studies have been conducted to investigate the relationship between matsutake mushroom occurrence and weather patterns, but they have not been able to identify the significant meteorological factors. The proposed model is used to identify the factors that indirectly affect matsutake mushroom production. This could help to improve the cultivation of matsutake mushrooms in Korea.
The remaining sections are organized as follows. In Section 2, a brief summary of the hidden Markov model is given. In Section 3, we introduce a hidden Markov model based on logistic regression. Then, we present the hierarchical Bayesian procedure of the proposed model in Section 4. In Section 5 and Section 6, we demonstrate the hidden Markov model based on logistic regression by way of a simulation study and a case study example. The last section concludes the paper and mentions further study.
2. Hidden Markov Model
We here introduce the hidden Markov model: a brief definition of the model and its assumptions. A hidden Markov model $\{S_t, Y_t : t = 1, 2, \dots, T\}$ is a particular kind of dependent mixture. In its most general form, the hidden Markov model is defined as follows:
$$\Pr(S_t \mid \mathbf{S}^{(t-1)}) = \Pr(S_t \mid S_{t-1}), \quad t = 2, 3, \dots, T,$$
$$\Pr(Y_t \mid \mathbf{Y}^{(t-1)}, \mathbf{S}^{(t)}) = \Pr(Y_t \mid S_t), \quad t = 1, 2, \dots, T,$$
where $S_t$ and $Y_t$ represent an unobserved state and an observation at time $t$, respectively. Note that $\mathbf{S}^{(t)}$ denotes a vector of unobserved states from time 1 to time $t$, $\mathbf{Y}^{(t)}$ denotes a vector of observations from time 1 to time $t$, and $T$ is the total number of observations. That is, $\mathbf{S}^{(t)} = (S_1, S_2, \dots, S_t)$ and $\mathbf{Y}^{(t)} = (Y_1, Y_2, \dots, Y_t)$.
Unlike typical mixture models, the HMM assumes that the observed data are generated through a finite-valued, unobserved process. This unobserved process is assumed to be in one of a finite number of discrete states at each discrete time point, and given the previous state or states, it is assumed to transit stochastically in the Markov fashion. The data observed at each time point depend only on the value of the corresponding hidden state and are independent of the others. The heterogeneity of the data is represented by the hidden Markov states. In other words, the model consists of pairs $(S_t, Y_t)$ with the state $S_t$ and its random value $Y_t$. Here, $Y_t$ is considered conditionally independent given $S_t$. The HMM derives its name from the following two defining attributes. First, $\{S_t\}$ is distributed as a (finite-state) Markov chain. Second, $\{S_t\}$ is not observed.
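These two attributes together imply the usual factorization of the joint distribution of states and observations; writing it out (a direct consequence of the definitions above, not an additional assumption),

```latex
\Pr(\mathbf{S}^{(T)}, \mathbf{Y}^{(T)})
  = \Pr(S_1) \prod_{t=2}^{T} \Pr(S_t \mid S_{t-1})
    \prod_{t=1}^{T} \Pr(Y_t \mid S_t).
```

Every likelihood-based or Bayesian computation in the sequel reduces to sums and products of these transition and emission terms.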
3. Logistic Regression on Hidden State
A logistic regression is used to model the probability of a certain class or event, such as pass or fail, that is, to model a binary dependent variable. Consider a model with $p$ predictors, $x_1, \dots, x_p$, and one binary response variable $Y$ with probability $\pi = \Pr(Y = 1)$. A linear relationship between the predictor variables and the log-odds of the event that $Y = 1$ is assumed. This linear relationship can be written in the following mathematical form, where $\beta_0, \beta_1, \dots, \beta_p$ are parameters of the model:
$$\log\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$
By simple algebraic manipulation, the probability that $Y = 1$ is
$$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}.$$
Therefore, predictors affecting the binary dependent variable $Y$ can be found with logistic regression analysis by estimating the parameters $\beta_0, \beta_1, \dots, \beta_p$.
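As a quick numerical illustration (the numbers here are made up for exposition only): if the fitted linear predictor for some observation equals $0.5$, the implied success probability is

```latex
\pi = \frac{e^{0.5}}{1 + e^{0.5}} \approx 0.622,
```

so a positive log-odds value always corresponds to a probability above one half.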
We assume that $x_{1t}, \dots, x_{pt}$ are the predictor variables and the space of the hidden state $S_t$ of the HMM is $\{0, 1\}$ at time point $t$. That is, $S_t$ is a binary variable. Then, we model the following logistic regression:
$$\log\frac{\Pr(S_t = 1 \mid S_{t-1})}{\Pr(S_t = 0 \mid S_{t-1})} = \beta_0 + \beta_1 x_{1t} + \cdots + \beta_p x_{pt} + \beta_{p+1} S_{t-1} \quad (1)$$
for $t = 1, \dots, T$, and then the probability that $S_t = 1$ is
$$\pi_t = \Pr(S_t = 1 \mid S_{t-1}) = \frac{\exp(\beta_0 + \beta_1 x_{1t} + \cdots + \beta_p x_{pt} + \beta_{p+1} S_{t-1})}{1 + \exp(\beta_0 + \beta_1 x_{1t} + \cdots + \beta_p x_{pt} + \beta_{p+1} S_{t-1})},$$
where the term $\beta_{p+1} S_{t-1}$ represents the Markovian property in Equation (1). That is, the term involving $S_{t-1}$ allows for a shift in the transition probability depending on the previous hidden state. Therefore, the hidden state model $[S_t \mid S_{t-1}, \boldsymbol{\beta}]$ can be expressed as follows:
$$S_t \mid S_{t-1}, \boldsymbol{\beta} \sim \mathrm{Bernoulli}(\pi_t),$$
where $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_{p+1})$ and $t = 1, \dots, T$. Additionally, in Equation (1), $S_0$ is a starting hidden state immediately before the hidden state $S_1$, and it is assumed as follows:
$$S_0 \sim \mathrm{Bernoulli}(\pi_0),$$
where $\pi_0$ is a parameter of the state $S_0$ distribution. Note that $\mathrm{Bernoulli}(\cdot)$ denotes the Bernoulli distribution.
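The transition mechanism of Equation (1) is straightforward to compute directly. The following minimal sketch (the function name and array layout are our own illustrative choices, not part of the model specification) evaluates $\Pr(S_t = 1 \mid S_{t-1}, x_t)$:

```python
import numpy as np

def transition_prob(beta, x_t, s_prev):
    """P(S_t = 1 | S_{t-1} = s_prev, x_t) under the logistic model (1).

    beta   : length p + 2 array (beta_0, beta_1, ..., beta_p, beta_{p+1})
    x_t    : length p array of covariates at time t
    s_prev : previous hidden state, 0 or 1
    """
    eta = beta[0] + np.dot(beta[1:-1], x_t) + beta[-1] * s_prev  # linear predictor
    return 1.0 / (1.0 + np.exp(-eta))                            # inverse logit
```

For example, `transition_prob(np.array([0.2, 1.0, -0.8]), np.array([0.3]), 1)` gives the probability of moving to state 1 from state 1 with a single covariate; the coefficient $\beta_{p+1}$ shifts this probability up or down depending on the previous state.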
4. Hierarchical Bayesian Approach to HMM Based on Logistic Regression
Now we introduce explanatory variables to HMMs through a logistic regression model in which we add the previous state as another explanatory variable to maintain the temporal dependence of the HMM. At first, the hierarchical structures are described; then we derive the full conditional distributions of each parameter for Gibbs sampling. In addition, the prediction steps of the proposed model are shown.
4.1. HMM Based on Logistic Regression
In addition to the transition probability distribution $[S_t \mid S_{t-1}, \boldsymbol{\beta}]$ in Equation (1), it is assumed that
$$Y_t \mid S_t = k \sim f(y_t \mid \theta_k)$$
for $k = 0, 1$, where $\theta_0$ and $\theta_1$ are parameters of the mixture distributions. Therefore, the data model $[\mathbf{Y}^{(T)} \mid \mathbf{S}^{(T)}, \boldsymbol{\theta}]$ can be expressed as
$$[\mathbf{Y}^{(T)} \mid \mathbf{S}^{(T)}, \boldsymbol{\theta}] = \prod_{t=1}^{T} f(y_t \mid \theta_{s_t}),$$
where $\boldsymbol{\theta} = (\theta_0, \theta_1)$. The process model $[\mathbf{S}^{(T)} \mid S_0, \boldsymbol{\beta}]$ is given by
$$[\mathbf{S}^{(T)} \mid S_0, \boldsymbol{\beta}] = \prod_{t=1}^{T} \pi_t^{s_t} (1 - \pi_t)^{1 - s_t}.$$
Finally, we consider prior distributions for all model parameters and an initial state $S_0$. In summary, the hierarchical structure can be expressed as follows:
Data model: $Y_t \mid S_t, \boldsymbol{\theta} \sim f(y_t \mid \theta_{S_t})$.
Process model: $S_t \mid S_{t-1}, \boldsymbol{\beta} \sim \mathrm{Bernoulli}(\pi_t)$ with $S_0 \mid \pi_0 \sim \mathrm{Bernoulli}(\pi_0)$.
Prior model: $[\boldsymbol{\beta}]$, $[\boldsymbol{\theta}]$, and $[\pi_0]$.
4.2. Bayesian Analysis
Since direct sampling from the joint posterior distribution of the model parameters is computationally difficult, simplified or approximate MCMC approaches, such as the Metropolis–Hastings (M-H) approach within the Gibbs algorithm, are needed.
The joint posterior distribution $[\boldsymbol{\beta}, \boldsymbol{\theta}, \pi_0, S_0, \mathbf{S}^{(T)} \mid \mathbf{Y}^{(T)}]$ is proportional to
$$[\mathbf{Y}^{(T)} \mid \mathbf{S}^{(T)}, \boldsymbol{\theta}]\,[\mathbf{S}^{(T)} \mid S_0, \boldsymbol{\beta}]\,[S_0 \mid \pi_0]\,[\boldsymbol{\beta}]\,[\boldsymbol{\theta}]\,[\pi_0].$$
Note that $[\,\cdot\,]$ and $[\,\cdot \mid \cdot\,]$ denote a prior distribution function and a posterior distribution function given the conditioning variables, respectively. Note that the full conditional distributions used in the Gibbs algorithm can be easily found through the joint posterior distribution.
The full conditional distribution of the logistic regression parameters, $\boldsymbol{\beta}$, is of the form
$$[\boldsymbol{\beta} \mid \cdot\,] \propto [\mathbf{S}^{(T)} \mid S_0, \boldsymbol{\beta}]\,[\boldsymbol{\beta}],$$
where the prior distribution $[\boldsymbol{\beta}]$ is generally assumed to be non-informative, that is, a constant. Note that $[\,\cdot \mid \cdot\,]$ denotes a conditional posterior distribution function.
The full conditional distribution of the data model parameters, $\boldsymbol{\theta}$, can be expressed as
$$[\boldsymbol{\theta} \mid \cdot\,] \propto [\mathbf{Y}^{(T)} \mid \mathbf{S}^{(T)}, \boldsymbol{\theta}]\,[\boldsymbol{\theta}],$$
where the prior distribution $[\boldsymbol{\theta}]$ is also assumed to be a non-informative prior, that is, a constant.
The full conditional distributions of the parameters $\pi_0$ and $S_0$ are given by
$$[\pi_0 \mid \cdot\,] \propto [S_0 \mid \pi_0]\,[\pi_0]$$
and
$$[S_0 \mid \cdot\,] \propto [S_1 \mid S_0, \boldsymbol{\beta}]\,[S_0 \mid \pi_0],$$
respectively. Note that we here assume $\mathrm{Beta}(a, b)$ as the prior distribution $[\pi_0]$. From a non-informative prior assumption ($a = b = 1$), the full conditional distribution of the parameter $\pi_0$ is a beta distribution of the form
$$\pi_0 \mid \cdot \sim \mathrm{Beta}(1 + S_0,\; 2 - S_0).$$
Note that $\mathrm{Beta}(\cdot\,, \cdot)$ denotes a beta distribution. In addition, since the hidden state space is assumed to be $\{0, 1\}$, the probability of the hidden state $S_0$ can be expressed as
$$\Pr(S_0 = s_0 \mid \pi_0) = \pi_0^{s_0}(1 - \pi_0)^{1 - s_0}$$
for $s_0 \in \{0, 1\}$. As a result, the posterior distribution of $S_0$ can be defined as follows:
$$\Pr(S_0 = 1 \mid \cdot\,) = \frac{\Pr(S_1 \mid S_0 = 1, \boldsymbol{\beta})\,\pi_0}{\Pr(S_1 \mid S_0 = 1, \boldsymbol{\beta})\,\pi_0 + \Pr(S_1 \mid S_0 = 0, \boldsymbol{\beta})\,(1 - \pi_0)}.$$
The full conditional distribution of a hidden state $S_j$ can be expressed as
$$[S_j \mid \cdot\,] \propto [Y_j \mid S_j, \boldsymbol{\theta}]\,[S_{j+1} \mid S_j, \boldsymbol{\beta}]\,[S_j \mid S_{j-1}, \boldsymbol{\beta}]$$
for $j = 1, \dots, T - 1$, and
$$[S_T \mid \cdot\,] \propto [Y_T \mid S_T, \boldsymbol{\theta}]\,[S_T \mid S_{T-1}, \boldsymbol{\beta}],$$
where $\mathbf{S}^{(T)}_{-j}$ denotes the unobserved states from time 1 to time $T$ except time $j$. That is, $\mathbf{S}^{(T)}_{-j} = (S_1, \dots, S_{j-1}, S_{j+1}, \dots, S_T)$. In the same way as for $S_0$, the posterior distribution of $S_j$ can be obtained as follows:
$$\Pr(S_j = 1 \mid \cdot\,) = \frac{f(y_j \mid \theta_1)\,\Pr(S_{j+1} \mid S_j = 1)\,\Pr(S_j = 1 \mid S_{j-1})}{\sum_{k=0}^{1} f(y_j \mid \theta_k)\,\Pr(S_{j+1} \mid S_j = k)\,\Pr(S_j = k \mid S_{j-1})}$$
for $j = 1, \dots, T - 1$, and
$$\Pr(S_T = 1 \mid \cdot\,) = \frac{f(y_T \mid \theta_1)\,\Pr(S_T = 1 \mid S_{T-1})}{\sum_{k=0}^{1} f(y_T \mid \theta_k)\,\Pr(S_T = k \mid S_{T-1})}.$$
Each full conditional distribution obtained in this way can be used for sampling with the Gibbs algorithm. Note that the full conditional distributions of the logistic parameters $\boldsymbol{\beta}$ and the data model parameters $\boldsymbol{\theta}$ are not given in closed form. Thus, the M-H algorithm is required for sampling from these distributions.
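To make the M-H-within-Gibbs step concrete, here is a minimal random-walk sketch for updating $\boldsymbol{\beta}$ under the flat prior (the function names, proposal scale, and array layout are our own illustrative choices; the paper does not prescribe a particular proposal):

```python
import numpy as np

def log_process_density(beta, X, s, s0):
    """log [ S^(T) | S_0, beta ]: sum of Bernoulli transition log-terms."""
    s_prev = np.concatenate(([s0], s[:-1]))          # S_{t-1} for t = 1, ..., T
    eta = beta[0] + X @ beta[1:-1] + beta[-1] * s_prev
    return np.sum(s * eta - np.logaddexp(0.0, eta))  # s_t*eta_t - log(1 + e^eta_t)

def mh_update_beta(beta, X, s, s0, step=0.1, rng=np.random.default_rng()):
    """One random-walk M-H step for beta; the flat prior cancels in the ratio."""
    proposal = beta + step * rng.standard_normal(beta.shape)
    log_ratio = (log_process_density(proposal, X, s, s0)
                 - log_process_density(beta, X, s, s0))
    return proposal if np.log(rng.uniform()) < log_ratio else beta
```

One full Gibbs sweep then alternates this update with a draw of $\boldsymbol{\theta}$ (also via M-H), the closed-form beta draw of $\pi_0$, and the Bernoulli draws of $S_0, S_1, \dots, S_T$.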
4.3. Prediction
The posterior distribution sample can be generated as in Section 4.2, and the data model and process model are already specified in Section 4.1. The Bayesian prediction process is then carried out in the following steps:
1. Generate model parameters $(\boldsymbol{\beta}, \boldsymbol{\theta})$ from the joint posterior distribution.
2. Predict the state variable $S_{T+1}$ from the process model $[S_{T+1} \mid S_T, \boldsymbol{\beta}]$.
3. Finally, predict $Y_{T+1}$ from the data model $[Y_{T+1} \mid S_{T+1}, \boldsymbol{\theta}]$.
Note that step 1 can be done through the Gibbs algorithm. A sketch of these steps in code follows.
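The following minimal sketch (assuming the Gaussian data model used later in Section 6; `transition_prob` is the helper from the Section 3 sketch, and all names are illustrative) draws one posterior predictive sample of $Y_{T+1}$ per retained Gibbs draw:

```python
import numpy as np

def predict_one_step(post_beta, post_theta, post_sT, x_next,
                     rng=np.random.default_rng()):
    """Posterior predictive draws of Y_{T+1} following steps 1-3.

    post_beta  : posterior draws of beta (step 1, from the Gibbs sampler)
    post_theta : posterior draws of (mu, sigma), each a pair indexed by state
    post_sT    : posterior draws of the last hidden state S_T
    x_next     : covariate vector at time T + 1
    """
    y_pred = []
    for beta, (mu, sigma), s_T in zip(post_beta, post_theta, post_sT):
        p1 = transition_prob(beta, x_next, s_T)       # step 2: P(S_{T+1}=1 | S_T)
        s_next = rng.binomial(1, p1)                  # draw the next hidden state
        y_pred.append(rng.normal(mu[s_next], sigma[s_next]))  # step 3: draw Y_{T+1}
    return np.asarray(y_pred)                         # posterior predictive sample
```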
5. Simulation Study
We present the results of a small simulation study designed to investigate the proposed model-based approach in terms of parameter estimation. We consider two models, based on a Gaussian mixture and an exponential mixture, respectively. Since our interest lies in how the previous hidden state and the degree of mixing of the two distributions affect the estimation, both are varied in the simulation. For each model, we execute the following two steps:
Generate hidden states sequentially based on the transition probabilities through the logistic regression.
Generate a sequence of observations from the mixture distribution corresponding to the hidden states.
Let $Y_t$ be the observation at time $t$ and let $S_t$ be the associated hidden state, with $x_{1t}, \dots, x_{pt}$ the associated covariates. We consider the following eight models, four for each mixture family:
Gaussian mixture: $Y_t \mid S_t = k \sim N(\mu_k, \sigma_k^2)$ for $k = 0, 1$, under four settings of the mixture parameters $(\mu_0, \sigma_0, \mu_1, \sigma_1)$ and the logistic parameters $(\beta_0, \beta_1, \beta_2)$.
Exponential mixture: $Y_t \mid S_t = k \sim \mathrm{Exp}(\lambda_k)$ for $k = 0, 1$, under four settings of the rate parameters $(\lambda_0, \lambda_1)$ and the logistic parameters $(\beta_0, \beta_1, \beta_2)$.
Note that the settings of $(\mu_0, \sigma_0, \mu_1, \sigma_1)$ and $(\lambda_0, \lambda_1)$ with well-separated components correspond to a mild mixing of the two distributions, while the settings with overlapping components correspond to a strong mixing of the two distributions. In addition, $\beta_2$, the coefficient of the previous hidden state, controls the effect of that state. The eight models were simulated and averaged statistics calculated, with the simulation exercise repeated 200 times. A sketch of the data-generating steps is given below.
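As a concrete illustration of the two generating steps above, the following minimal sketch (the sample size, covariate distribution, and initial-state probability are our own assumptions, since the original settings are not reproduced here) simulates one Gaussian-mixture replicate:

```python
import numpy as np

def simulate_replicate(beta, mu, sigma, T=200, rng=np.random.default_rng()):
    """One series from the Gaussian-mixture HMM: states first, then observations."""
    p = len(beta) - 2                          # number of covariates
    X = rng.standard_normal((T, p))            # covariates (assumed N(0, 1) here)
    s = np.empty(T, dtype=int)
    s_prev = rng.binomial(1, 0.5)              # S_0 ~ Bernoulli(0.5) (assumed)
    for t in range(T):
        eta = beta[0] + X[t] @ beta[1:-1] + beta[-1] * s_prev
        s[t] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))   # step 1: hidden state
        s_prev = s[t]
    y = rng.normal(np.asarray(mu)[s], np.asarray(sigma)[s])  # step 2: observation
    return X, s, y
```

For the exponential-mixture models, only the last draw changes, e.g., `y = rng.exponential(1.0 / np.asarray(lam)[s])` with rates `lam`, while step 1 is unchanged.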
Table 1 and Table 2 show the averaged posterior means and posterior standard deviations for the parameters based on the Gaussian mixtures and the exponential mixtures, respectively. It turns out that the stronger the mixing and the weaker the influence of the hidden states, the greater the uncertainty in the estimation. Overall, however, the proposed model demonstrates good performance.
6. Data Analysis
Matsutake mushrooms are a valuable forest product that can increase rural incomes by up to 100%. They are wild mushrooms that grow naturally in Pinus densiflora forests. In Korea, matsutake mushrooms have not been cultivated due to a lack of agricultural technology, and the impact of climate on matsutake mushroom yield has not been well studied. Previous studies have attempted to relate matsutake mushroom occurrence to weather patterns in Korea but have not been able to identify the specific meteorological factors that are most influential. Here, we use hidden Markov models (HMMs) to identify hidden states in a specific area of Bonghwa-gun, Gyeongsangbuk-do. We then use these hidden states to identify the meteorological factors that indirectly affect matsutake mushroom yield.
6.1. Data Description
The data for this study are the annual production of matsutake mushrooms (kg) observed in a region of Bonghwa-gun, Gyeongsangbuk-do, from 1997 to 2016. Matsutake mushrooms are harvested in Korea from late August to late October, with the peak harvest period occurring for about 10 days from late September to early October. The production of matsutake mushrooms can vary significantly depending on the weather. Therefore, this study considered the meteorological factors in May, June, July, and August from 1997 to 2016. The meteorological factors were combined and analyzed in terms of time and space. The detailed variables are summarized in Table 3.
6.2. Hierarchical Modeling
We here assume that the production of matsutake mushrooms is in one of two states: a lean year or a bumper year. Each hidden state has an independent distribution of total matsutake mushroom production, which is assumed to be a Gaussian distribution. From year to year, the persistence of each state varies because it is governed by the previous state and the predictors at each time point. Let $S_t$ denote the (unobserved) annual year state at time $t$ (i.e., $S_t = 1$ for a bumper year, $S_t = 0$ for a lean year). Let $Y_t$ be the (observed) total matsutake mushroom production amount at time $t$ for $t = 1, \dots, T$. The total matsutake mushroom production amount is conditionally independent given the current year state $S_t$. For a hierarchical modeling, we express the full conditional distributions more specifically.
We consider a Gaussian mixture for an observation $Y_t$. That is, $Y_t \mid S_t = 0 \sim N(\mu_0, \sigma_0^2)$ and $Y_t \mid S_t = 1 \sim N(\mu_1, \sigma_1^2)$. We assume the non-informative prior $[\boldsymbol{\beta}] = c$, where $c$ is a constant. Then, the full conditional distribution of the logistic parameters $\boldsymbol{\beta}$ is obtained by
$$[\boldsymbol{\beta} \mid \cdot\,] \propto \prod_{t=1}^{T} \pi_t^{s_t}(1 - \pi_t)^{1 - s_t}.$$
Based on the non-informative prior $[\boldsymbol{\theta}] = c$, where $c$ is a constant, the full conditional distribution of the data model parameters $\boldsymbol{\theta} = (\mu_0, \sigma_0, \mu_1, \sigma_1)$ is given by
$$[\boldsymbol{\theta} \mid \cdot\,] \propto \prod_{t=1}^{T} \phi(y_t \mid \mu_{s_t}, \sigma_{s_t}^2),$$
where $\phi(\cdot \mid \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. Assuming $\pi_0 \sim \mathrm{Beta}(1, 1)$, the full conditional distributions of the hyperparameter $\pi_0$ and the initial state $S_0$ can be expressed as
$$\pi_0 \mid \cdot \sim \mathrm{Beta}(1 + S_0,\; 2 - S_0)$$
and
$$S_0 \mid \cdot \sim \mathrm{Bernoulli}(p_0),$$
where
$$p_0 = \frac{\Pr(S_1 \mid S_0 = 1, \boldsymbol{\beta})\,\pi_0}{\Pr(S_1 \mid S_0 = 1, \boldsymbol{\beta})\,\pi_0 + \Pr(S_1 \mid S_0 = 0, \boldsymbol{\beta})\,(1 - \pi_0)}.$$
Finally, the full conditional distribution of the hidden state $S_j$ is of the form
$$S_j \mid \cdot \sim \mathrm{Bernoulli}(p_j),$$
where
$$p_j = \frac{\phi(y_j \mid \mu_1, \sigma_1^2)\,\Pr(S_{j+1} \mid S_j = 1)\,\Pr(S_j = 1 \mid S_{j-1})}{\sum_{k=0}^{1} \phi(y_j \mid \mu_k, \sigma_k^2)\,\Pr(S_{j+1} \mid S_j = k)\,\Pr(S_j = k \mid S_{j-1})}$$
for $j = 1, \dots, T - 1$, and
$$S_T \mid \cdot \sim \mathrm{Bernoulli}(p_T),$$
where
$$p_T = \frac{\phi(y_T \mid \mu_1, \sigma_1^2)\,\Pr(S_T = 1 \mid S_{T-1})}{\sum_{k=0}^{1} \phi(y_T \mid \mu_k, \sigma_k^2)\,\Pr(S_T = k \mid S_{T-1})}.$$
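Since the state full conditionals above are exact Bernoulli distributions, the corresponding Gibbs update can be sampled directly. A minimal sketch (helper names and array layout are illustrative; `norm.pdf` is SciPy's Gaussian density):

```python
import numpy as np
from scipy.stats import norm

def trans(s_new, s_old, x, beta):
    """P(S_t = s_new | S_{t-1} = s_old, x) under the logistic model (1)."""
    p1 = 1.0 / (1.0 + np.exp(-(beta[0] + x @ beta[1:-1] + beta[-1] * s_old)))
    return p1 if s_new == 1 else 1.0 - p1

def update_state(j, s, y, X, beta, mu, sigma, s0, rng=np.random.default_rng()):
    """Draw S_j from its Bernoulli full conditional with probability p_j."""
    s_prev = s0 if j == 0 else s[j - 1]
    w = np.empty(2)
    for k in (0, 1):                      # unnormalized weight of S_j = k
        w[k] = norm.pdf(y[j], mu[k], sigma[k]) * trans(k, s_prev, X[j], beta)
        if j < len(s) - 1:                # the last state S_T has no forward term
            w[k] *= trans(s[j + 1], k, X[j + 1], beta)
    return rng.binomial(1, w[1] / w.sum())
```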
6.3. Analysis
Before examining the parameters from the posterior distribution, we first perform a regression analysis using the production of matsutake mushrooms as the dependent variable to identify the meteorological factors that explain the variation in matsutake mushroom production. Two weather variables are found to be significant (see Table 4).
For a comparison with our proposed method, Altman's approach can be considered. Altman [11] considered the following model:
$$Y_t \mid S_t \sim N(\mu_t, \sigma^2), \quad \mu_t = \beta_0 + \beta_1 x_{1t} + \cdots + \beta_p x_{pt} + \alpha S_t,$$
at time point $t$. That is, conditional on the hidden state $S_t$, the observation $Y_t$ is assumed to be normally distributed with mean $\mu_t$ and variance $\sigma^2$, where $S_t \in \{0, 1\}$. It can also be extended by allowing the transition probabilities to depend on covariates or random effects through logistic regression models. Table 5 shows the estimated coefficients with their standard errors for the model to be compared. In Altman's approach, only the hidden state and the maximum temperature during June are found to be significant, which means that most meteorological variables do not directly affect the observed annual production of matsutake mushrooms.
For a Bayesian analysis of the proposed HMM based on logistic regression, two parallel chains are used to check the convergence of the chain. After generating 35,000 samples, the initial 15,000 samples are discarded to eliminate the influence of the initial value, and then 1000 samples are extracted by selecting every 20th sample to eliminate the autocorrelation. Gelman–Rubin (G-R) statistics are also checked (Gilks et al. [16]). First, let us look at the sequence of hidden states, $\{S_t\}$, corresponding to the matsutake production (see Figure 1). The dashed line represents the annual matsutake mushroom production, and the solid line shows the yearly hidden state corresponding to the annual matsutake mushroom production. When $S_t = 0$, it means a lean year, and when $S_t = 1$, it means a bumper year. Second, let us take a look at the logistic parameters, $\boldsymbol{\beta}$. In the logistic regression based on the hidden Markov model (HMM), we use the hidden state in each year as the dependent variable of the logistic regression model to identify the meteorological factors that indirectly explain the variation in matsutake mushroom production (see Table 6). As a result of the analysis, four variables are found to be significant. The analysis shows that the total precipitation in August and the mean ground temperature in May are additional variables that affect matsutake mushroom production, compared to the previous model. In addition, there is a significant shift in the transition probability depending on whether the previous year was a lean year or a bumper year.
7. Concluding Remarks
Covariate-dependent hidden Markov models, extended by allowing the transition probabilities to depend on covariates, are a powerful class of statistical models and are used to model a wide range of sequential data. In this paper, the Markovian property was achieved by embedding the previous state variable in the logistic regression model. The logistic regression based on the hidden Markov model (HMM) was applied to identify significant variables affecting the annual production of matsutake mushrooms (kg) observed in the region of Bonghwa-gun from 1997 to 2016. The proposed method differs from existing analysis methods by using the state variable in the logistic regression and the mixture distribution of the states, rather than using the observed values directly in the analysis. It is particularly useful for identifying the relevance of variables that are difficult to detect with existing models. As a result, we find additional meteorological factors affecting the annual production of matsutake mushrooms compared to the existing methods. The proposed covariate-dependent hidden Markov model can be a useful tool for sequential data in the presence of covariates and can be used in a variety of applications, including financial time series analysis, medical diagnosis, and customer segmentation. In addition, it can be applied to modeling covariates in HMMs regardless of the dimension of the state variable in the hidden Markov model.