The FCR-HL model mainly addresses four problems: (1) the clustering optimization problem: partitioning data into clusters from the regression perspective incorporates more information and achieves a better clustering effect, and how to solve this optimization problem is the key point; (2) the parameter estimation problem: the regression parameters that explain the impact of the covariate on the response variable must be estimated within each cluster; (3) the estimation of the number of clusters, which decides how many clusters are needed; and (4) the iterative algorithm: since it is difficult to solve the partition and the parameter estimation simultaneously, our model uses an iterative process to solve the three problems above. The following subsections explain the solutions to these four problems.
2.1. Clustering Optimization and Parameter Estimation
Ramsay and Dalzell [24] proposed functional data analysis, which uses non-parametric ideas to fit data and can effectively capture its continuous characteristics. Within functional data analysis, the functional regression model is an effective and convenient tool. This paper focuses on one typical functional regression model, in which the covariate is functional data and the response variable is a scalar:

$$ Y_i = \int_0^T X_i(t)\,\beta(t)\,dt + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1) $$
where the response variable $Y_i$ is a scalar with vector expression $\mathbf{Y} = (Y_1, \ldots, Y_n)^{\mathrm{T}}$, $n$ is the number of observations, and the covariate $X_i(t)$ represents the $i$th functional trajectory on a domain with bounded upper limit $T$. Assuming $X_i(t) \in L^2([0, T])$, the Karhunen-Loeve expansion can be applied to the functional covariates to obtain Equation (2):

$$ X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik}\,\phi_k(t), \qquad (2) $$
where $\mu(t)$ represents the mean function of the covariate, and $\phi_k(t)$ is the eigenfunction corresponding to the $k$th largest eigenvalue $\lambda_k$ of the covariance function $G(s,t) = \operatorname{Cov}(X(s), X(t))$; the eigenfunctions are orthogonal to each other and satisfy $\int_0^T \phi_k^2(t)\,dt = 1$ and $\int_0^T \phi_k(t)\phi_j(t)\,dt = 0$ for $k \neq j$. Using functional principal component analysis (FPCA), the functional principal component scores of $X_i(t)$ in the direction of $\phi_k(t)$, $\xi_{ik} = \int_0^T \left( X_i(t) - \mu(t) \right) \phi_k(t)\,dt$, are obtained; they satisfy $E(\xi_{ik}) = 0$ and $\operatorname{Var}(\xi_{ik}) = \lambda_k$. According to Formula (2), Formula (1) can be rewritten as:

$$ Y_i = \beta_0 + \sum_{k=1}^{\infty} \beta_k\,\xi_{ik} + \varepsilon_i, \qquad (3) $$
where $\beta_0 = \int_0^T \mu(t)\beta(t)\,dt$ and $\beta_k = \int_0^T \beta(t)\phi_k(t)\,dt$: substituting (2) into (1) shows that the mean function $\mu(t)$ of $X_i(t)$ is mapped to the constant parameter $\beta_0$, and the eigenfunctions $\phi_k(t)$ are mapped to the parameters $\beta_k$. In other words, the parameter $\beta_0$ includes the mean value of $Y_i$ when $\xi_{ik} = 0$ and the information of the mean trend of $X_i(t)$, while the parameter $\beta_k$ stands for the effect on $Y_i$ of the $k$th deviation of $X_i(t)$ from its mean. In this way, the auxiliary information between the covariate and the response variable is reflected in the parameters $(\beta_0, \beta_1, \beta_2, \ldots)$. This paper builds the FCR-HL model on this auxiliary information to cluster the data.
For Equation (3), the summation is truncated at $K$ terms, where $K$ is determined using the AIC criterion of Li et al. [25]; that is, the optimal $K$ is the minimizer of the sum of the pseudo-Gaussian log-likelihood and a penalty on $K$. Writing $\boldsymbol{\xi}_i = (\xi_{i1}, \ldots, \xi_{iK})^{\mathrm{T}}$ and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_K)^{\mathrm{T}}$, we rewrite Formula (3) in a matrix expression:

$$ Y_i = \beta_0 + \boldsymbol{\xi}_i^{\mathrm{T}} \boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \ldots, n. $$
The advantage of the FPCA technique is that the infinite-dimensional functional data are converted into low-dimensional data, which then helps to construct a linear regression model on the principal component scores. On the one hand, this reduces the computational difficulty and the algorithmic complexity caused by the curse of dimensionality. On the other hand, it preserves the nonlinear characteristics of the covariate, which are utilized in the regression analysis. At the same time, the principal component scores estimated by FPCA have good statistical properties, especially unbiasedness and consistency, which support the inference on the parameter estimates discussed later.
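To make the score construction concrete, the following minimal sketch (Python with NumPy; all variable names and the simulated curves are illustrative) performs the discretized FPCA described above on densely observed curves and builds the low-dimensional score matrix; a variance-explained cutoff stands in here for the AIC-based choice of $K$.

```python
# A minimal FPCA sketch, assuming curves densely observed on a common grid.
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 1.0, 101)                   # grid on [0, T] with T = 1
dt = t[1] - t[0]

# Simulate X_i(t) = mu(t) + xi_i1 phi_1(t) + xi_i2 phi_2(t)
mu = np.sin(2 * np.pi * t)
phi1 = np.sqrt(2) * np.cos(2 * np.pi * t)        # orthonormal in L2[0, 1]
phi2 = np.sqrt(2) * np.sin(4 * np.pi * t)
xi = rng.normal(0.0, 1.0, (n, 2)) * np.array([2.0, 1.0])
X = mu + xi[:, [0]] * phi1 + xi[:, [1]] * phi2

# Empirical mean and covariance, then eigendecomposition (discretized KL)
mu_hat = X.mean(axis=0)
G_hat = np.cov(X, rowvar=False)                  # covariance surface on grid
eigval, eigvec = np.linalg.eigh(G_hat)
order = np.argsort(eigval)[::-1]
lam_hat = eigval[order] * dt                     # rescale to L2 eigenvalues
phi_hat = eigvec[:, order] / np.sqrt(dt)         # rescale to L2 norm 1

# Truncate at K (variance-explained cutoff as a stand-in for AIC)
K = int(np.searchsorted(np.cumsum(lam_hat) / lam_hat.sum(), 0.99) + 1)

# Scores xi_ik = integral (X_i - mu) phi_k, approximated by a Riemann sum
scores = (X - mu_hat) @ phi_hat[:, :K] * dt
print(K, np.round(lam_hat[:K], 2), scores.shape)  # low-dimensional design
```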
The goal of clustering optimization in this paper is to cluster data from the perspective of the regression hyperplane. The FCR-HL model alternates between two steps: first, obtaining the parameter estimates under a given partition; second, re-clustering the samples based on those parameter estimates. Through this two-step iterative algorithm, the optimal regression clustering result can be found.
First, given a partition, the parameters are estimated from the perspective of the regression hyperplane using the partitioned data. Compared with a random partition, this takes the relationship between the covariate and the response variable as auxiliary information for clustering, and the parameters can be estimated more accurately once the partition captures the heterogeneity in the data. It is assumed that samples from the same cluster satisfy the following relationship:

$$ Y_i^{(m)} = \beta_0^{(m)} + \boldsymbol{\xi}_i^{(m)\mathrm{T}} \boldsymbol{\beta}^{(m)} + \varepsilon_i^{(m)}, \qquad i = 1, \ldots, n_m, \qquad (5) $$
where $m = 1, \ldots, M$ indexes the sub-populations (clusters) $C_1, \ldots, C_M$ and $\sum_{m=1}^{M} n_m = n$, $n_m$ is the sample size of cluster $C_m$, and $M$ is the number of clusters, which may grow with the sample size; $Y_i^{(m)}$ are the observed responses belonging to cluster $C_m$, $\boldsymbol{\xi}_i^{(m)}$ are the score vectors derived from the observed functional covariates $X_i^{(m)}(t)$ belonging to cluster $C_m$, and $(\beta_0^{(m)}, \boldsymbol{\beta}^{(m)})$ are the regression coefficients of cluster $C_m$.
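The following small sketch illustrates the heterogeneous structure assumed in (5): two hypothetical sub-populations share the same score space but follow different regression hyperplanes; all coefficient values are made up for illustration.

```python
# Illustrative data generation for the cluster-wise regression model (5).
import numpy as np

rng = np.random.default_rng(1)
K = 2
betas = {0: (1.0, np.array([2.0, -1.0])),      # (beta_0^(m), beta^(m))
         1: (-3.0, np.array([0.5, 4.0]))}
n_m = {0: 120, 1: 80}

Y, Xi, label = [], [], []
for m, (b0, b) in betas.items():
    xi_m = rng.normal(size=(n_m[m], K))        # FPCA scores within cluster m
    eps = rng.normal(scale=0.5, size=n_m[m])   # i.i.d. normal errors
    Y.append(b0 + xi_m @ b + eps)              # Y_i = beta_0 + xi' beta + eps
    Xi.append(xi_m)
    label += [m] * n_m[m]

Y, Xi, label = np.concatenate(Y), np.vstack(Xi), np.array(label)
print(Y.shape, Xi.shape)    # pooled data hides the two regression hyperplanes
```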
In (5), the unknown functional principal component scores must be estimated first; only then can the parameters $(\beta_0^{(m)}, \boldsymbol{\beta}^{(m)})$ be estimated. It should be noted that the estimates of the scores directly affect the parameter estimates, and the PACE (principal component analysis through conditional expectation) method proposed by Yao et al. [26] is an unbiased and consistent estimation method for the functional principal component scores. PACE estimates each score by its conditional expectation given the observed trajectory,

$$ \hat{\xi}_{ik} = \hat{E}\!\left[ \xi_{ik} \mid X_i \right] = \hat{\lambda}_k \hat{\boldsymbol{\phi}}_{ik}^{\mathrm{T}} \hat{\boldsymbol{\Sigma}}_{X_i}^{-1} \left( \mathbf{X}_i - \hat{\boldsymbol{\mu}}_i \right), $$

where $\mathbf{X}_i$ and $\hat{\boldsymbol{\mu}}_i$ are the trajectory and the estimated mean evaluated at the observation points, $\hat{\boldsymbol{\Sigma}}_{X_i}$ is the corresponding estimated covariance matrix, and $\xi_{ik}$ and $X_i(t)$ are assumed to be jointly Gaussian. The PACE method is then used to estimate the functional principal component scores and the mean function in Formula (2), and the score estimates $\hat{\xi}_{ik}$ and the mean function estimate $\hat{\mu}(t)$ have the following convergence properties:

$$ \lim_{n \to \infty} \hat{\xi}_{ik} = E\!\left[ \xi_{ik} \mid X_i \right] \ \text{in probability}, \qquad (6) $$

$$ \sup_{t \in [0,T]} \left| \hat{\mu}(t) - \mu(t) \right| = O_p\!\left( \frac{1}{\sqrt{n}\,h} \right), \qquad (7) $$
where $\hat{\mu}(t)$ is obtained by the local linear smoother and $h$ is the bandwidth used in that smoother. Formulas (6) and (7) show that $\hat{\mu}(t)$ converges to $\mu(t)$ and that the $\hat{\xi}_{ik}$ are unbiased estimates of $E[\xi_{ik} \mid X_i]$ when $n \to \infty$, which are the good statistical properties mentioned before. Thus, the scores $\boldsymbol{\xi}_i^{(m)}$ can be replaced by their estimates $\hat{\boldsymbol{\xi}}_i^{(m)}$, giving the new regression model shown in Formula (8):

$$ Y_i^{(m)} = \beta_0^{(m)} + \hat{\boldsymbol{\xi}}_i^{(m)\mathrm{T}} \boldsymbol{\beta}^{(m)} + \varepsilon_i^{(m)}, \qquad i = 1, \ldots, n_m. \qquad (8) $$
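As a rough illustration of the conditional-expectation idea behind PACE, the sketch below computes the best linear prediction of the scores for one noisy discretized curve. For brevity, the mean, eigenfunctions, eigenvalues, and noise variance are taken as known, whereas in practice they are estimated by local linear smoothing as described above.

```python
# PACE-style score prediction: xi_hat_k = lambda_k phi_k' Sigma^{-1} (X - mu),
# with Sigma = Phi diag(lambda) Phi' + sigma^2 I on the observation grid.
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 51)
mu = np.sin(2 * np.pi * t)
Phi = np.stack([np.sqrt(2) * np.cos(2 * np.pi * t),
                np.sqrt(2) * np.sin(4 * np.pi * t)], axis=1)  # (51, 2)
lam = np.array([4.0, 1.0])         # eigenvalues lambda_k (taken as known)
sigma2 = 0.25                      # measurement-error variance (taken as known)

# One noisy observed trajectory X = mu + Phi xi + noise
xi_true = rng.normal(0.0, np.sqrt(lam))
X_obs = mu + Phi @ xi_true + rng.normal(0.0, np.sqrt(sigma2), t.size)

# Best linear prediction E[xi | X] under joint Gaussianity
Sigma = (Phi * lam) @ Phi.T + sigma2 * np.eye(t.size)
xi_hat = lam * (Phi.T @ np.linalg.solve(Sigma, X_obs - mu))
print(xi_true, xi_hat)             # the prediction shrinks towards zero
```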
Based on Formula (8), the log-likelihood function is given in Formula (9):

$$ \ell\!\left( \mathcal{C}, \boldsymbol{\beta}, \boldsymbol{\sigma}^2 \right) = \sum_{m=1}^{M} \sum_{i \in C_m} \left[ -\frac{1}{2}\log\!\left( 2\pi\sigma_m^2 \right) - \frac{\left( Y_i - \beta_0^{(m)} - \hat{\boldsymbol{\xi}}_i^{\mathrm{T}} \boldsymbol{\beta}^{(m)} \right)^2}{2\sigma_m^2} \right], \qquad (9) $$

where $\mathcal{C} = \{C_1, \ldots, C_M\}$ denotes the partition.
It is difficult to obtain the optimal partition and the estimates of the unknown parameters in (9) simply by maximizing $\ell$. Thus, an iterative method is proposed. First, fixing the partition $\mathcal{C}$ at $\mathcal{C}^{(s)}$ and the parameters $\boldsymbol{\beta}$ at $\hat{\boldsymbol{\beta}}^{(s)}$, the optimization objective of clustering is to maximize the log-likelihood of each observation $(Y_i, \hat{\boldsymbol{\xi}}_i)$ over the cluster it belongs to:

$$ m_i = \arg\max_{1 \le m \le M} \left[ -\frac{1}{2}\log\!\left( 2\pi\hat{\sigma}_m^2 \right) - \frac{\left( Y_i - \hat{\beta}_0^{(m)} - \hat{\boldsymbol{\xi}}_i^{\mathrm{T}} \hat{\boldsymbol{\beta}}^{(m)} \right)^2}{2\hat{\sigma}_m^2} \right]. \qquad (10) $$
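A minimal sketch of the assignment step (10): with the cluster parameters held fixed, each observation is assigned to the cluster maximizing its Gaussian log-likelihood. The parameter values below are placeholders.

```python
# Assignment step: argmax of the per-cluster Gaussian log-likelihood.
import numpy as np

def assign(Y, Xi, params):
    """params: list of (beta0, beta, sigma2), one triple per cluster."""
    ll = np.stack([
        -0.5 * np.log(2 * np.pi * s2) - (Y - b0 - Xi @ b) ** 2 / (2 * s2)
        for b0, b, s2 in params
    ], axis=1)                       # (n, M) per-cluster log-likelihoods
    return ll.argmax(axis=1)         # maximizing cluster index for each i

rng = np.random.default_rng(3)
Xi = rng.normal(size=(10, 2))
Y = 1.0 + Xi @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=10)
params = [(1.0, np.array([2.0, -1.0]), 0.09),
          (-3.0, np.array([0.5, 4.0]), 0.09)]
print(assign(Y, Xi, params))         # most points should pick cluster 0
```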
To solve Formula (10), the parameter estimates $(\hat{\beta}_0^{(m)}, \hat{\boldsymbol{\beta}}^{(m)}, \hat{\sigma}_m^2)$ need to be obtained first. The idea is to maximize the log-likelihood of the data within each cluster. Formula (11) is the log-likelihood function of the data in cluster $C_m$:

$$ \ell_m\!\left( \beta_0^{(m)}, \boldsymbol{\beta}^{(m)}, \sigma_m^2 \right) = \sum_{i \in C_m} \left[ -\frac{1}{2}\log\!\left( 2\pi\sigma_m^2 \right) - \frac{\left( Y_i - \beta_0^{(m)} - \hat{\boldsymbol{\xi}}_i^{\mathrm{T}} \boldsymbol{\beta}^{(m)} \right)^2}{2\sigma_m^2} \right]. \qquad (11) $$

Then, the parameters are obtained according to the maximum likelihood estimation:

$$ \hat{\boldsymbol{\beta}}^{(m)} = \left( \mathbf{Z}_m^{\mathrm{T}} \mathbf{Z}_m \right)^{-1} \mathbf{Z}_m^{\mathrm{T}} \mathbf{Y}_m, \qquad (12) $$

$$ \hat{\sigma}_m^2 = \frac{1}{n_m} \left\| \mathbf{Y}_m - \mathbf{Z}_m \hat{\boldsymbol{\beta}}^{(m)} \right\|^2, \qquad (13) $$

where $n_m$ represents the sample size of cluster $C_m$ and, for simplification, $\mathbf{Z}_m$ denotes the $n_m \times (K+1)$ design matrix whose $i$th row is $(1, \hat{\boldsymbol{\xi}}_i^{(m)\mathrm{T}})$, so that $\hat{\boldsymbol{\beta}}^{(m)}$ in (12) contains the intercept $\hat{\beta}_0^{(m)}$. Then, the parameter estimates are substituted into Formula (9) to obtain the log-likelihood function of the complete data:

$$ \ell(\mathcal{C}) = \sum_{m=1}^{M} \ell_m\!\left( \hat{\beta}_0^{(m)}, \hat{\boldsymbol{\beta}}^{(m)}, \hat{\sigma}_m^2 \right). $$
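The within-cluster estimation step (12)-(13) reduces to ordinary least squares plus the MLE variance, as in the following sketch (synthetic data, illustrative names).

```python
# Within-cluster maximum likelihood step, (12)-(13).
import numpy as np

def cluster_mle(Y_m, Xi_m):
    Z = np.column_stack([np.ones(len(Y_m)), Xi_m])       # Z = (1, xi_hat)
    beta_hat = np.linalg.lstsq(Z, Y_m, rcond=None)[0]    # (Z'Z)^(-1) Z'Y
    resid = Y_m - Z @ beta_hat
    sigma2_hat = resid @ resid / len(Y_m)                # MLE divisor n_m
    return beta_hat, sigma2_hat

rng = np.random.default_rng(4)
Xi_m = rng.normal(size=(50, 2))
Y_m = 1.0 + Xi_m @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)
print(cluster_mle(Y_m, Xi_m))
```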
When the partition $\mathcal{C}$ is fixed, $\hat{\boldsymbol{\beta}}^{(m)}$ and $\hat{\sigma}_m^2$ are the maximum likelihood estimators of the within-cluster regression, as shown in (12) and (13). When $\hat{\boldsymbol{\beta}}^{(m)}$ and $\hat{\sigma}_m^2$ are fixed, the likelihood is maximized by assigning each observation to the cluster given by (10). Since the log-likelihood is non-decreasing over these alternating steps, a local maximum is reached after a finite number of iterations. Furthermore, the parameter estimates derived from this optimization also have good statistical properties. First, the principal component scores obtained by FPCA are produced by projecting the information of the data itself onto the principal component directions.
The $\hat{\xi}_{ik}$ are unbiased estimates of $E[\xi_{ik} \mid X_i]$. Thus, $\hat{\boldsymbol{\xi}}_i$ and $\mathbf{Z}_m$ can be treated as non-random with respect to the response variable $Y_i$, and the maximum likelihood estimation then yields estimates with good statistical properties, for example, unbiasedness:

$$ E\!\left( \hat{\boldsymbol{\beta}}^{(m)} \right) = \boldsymbol{\beta}^{(m)}, \qquad (16) $$

$$ \operatorname{Var}\!\left( \hat{\boldsymbol{\beta}}^{(m)} \right) = \sigma_m^2 \left( \mathbf{Z}_m^{\mathrm{T}} \mathbf{Z}_m \right)^{-1}, \qquad (17) $$

where the variance of $\hat{\boldsymbol{\beta}}^{(m)}$ can be used to test the significance of the parameters. Only when the variance of $\hat{\boldsymbol{\beta}}^{(m)}$ is estimated correctly are the significance results for the parameter estimates reliable.
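The significance check based on (17) can be sketched as follows: treating the scores as fixed, the estimated covariance of the coefficients yields t-statistics. The degrees-of-freedom correction and the two-sided test are standard textbook choices, not prescribed by the model.

```python
# Significance testing from Var(beta_hat) = sigma^2 (Z'Z)^(-1), as in (17).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_m, K = 80, 2
Z = np.column_stack([np.ones(n_m), rng.normal(size=(n_m, K))])
beta = np.array([1.0, 2.0, 0.0])                 # last coefficient is null
Y = Z @ beta + rng.normal(scale=0.5, size=n_m)

beta_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]
resid = Y - Z @ beta_hat
sigma2_hat = resid @ resid / (n_m - K - 1)       # unbiased variance estimate
cov_beta = sigma2_hat * np.linalg.inv(Z.T @ Z)   # estimated Var(beta_hat)
t_stat = beta_hat / np.sqrt(np.diag(cov_beta))
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n_m - K - 1)
print(np.round(t_stat, 2), np.round(p_val, 4))   # last p-value should be large
```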
From Formulas (6), (7), (16), and (17), it follows that $\hat{\boldsymbol{\beta}}^{(m)}$ converges to $\boldsymbol{\beta}^{(m)}$ in probability. Therefore, when the data are clustered from the perspective of the regression hyperplane, the estimated optimal number of clusters converges to the true number with probability 1.
In addition, note that the maximum likelihood estimates above rest on the classical assumption that the error term in Formula (8) is independently and identically normally distributed. Once this assumption is violated, the maximum likelihood estimates become problematic. Thus, for data violating the i.i.d. normal error assumption, a robust estimation (M-estimation) scheme, a generalized maximum likelihood method, is given. A special case of M-estimation corresponds to the Huber distribution, whose density is normal near the origin and exponential in the tails. The parameter estimates can be obtained according to the Huber distribution:

$$ \left( \hat{\beta}_0^{(m)}, \hat{\boldsymbol{\beta}}^{(m)} \right) = \arg\min_{\beta_0, \boldsymbol{\beta}} \sum_{i \in C_m} \rho\!\left( Y_i - \beta_0 - \hat{\boldsymbol{\xi}}_i^{\mathrm{T}} \boldsymbol{\beta} \right), \qquad \rho(u) = \begin{cases} u^2/2, & |u| \le c, \\ c|u| - c^2/2, & |u| > c, \end{cases} $$

where $\rho$ is the error function of the Huber distribution and $c$ is a fixed constant. Given the parameter estimates $(\hat{\beta}_0^{(m)}, \hat{\boldsymbol{\beta}}^{(m)})$, the optimal objective function for clustering the sample observations $(Y_i, \hat{\boldsymbol{\xi}}_i)$ is:

$$ m_i = \arg\min_{1 \le m \le M} \rho\!\left( Y_i - \hat{\beta}_0^{(m)} - \hat{\boldsymbol{\xi}}_i^{\mathrm{T}} \hat{\boldsymbol{\beta}}^{(m)} \right). $$

This objective function is likewise monotone over the iterations, so the alternating algorithm again reaches a local optimum after a finite number of steps.
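Huber M-estimation is commonly computed by iteratively reweighted least squares; the sketch below follows that standard recipe. The cutoff c = 1.345 and the MAD scale estimate are conventional defaults, not values fixed by the text.

```python
# Huber M-estimation via iteratively reweighted least squares (IRLS).
import numpy as np

def huber_fit(Y, Xi, c=1.345, n_iter=50):
    Z = np.column_stack([np.ones(len(Y)), Xi])
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]          # OLS start
    for _ in range(n_iter):
        r = Y - Z @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12        # robust scale (MAD)
        u = r / s
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u)) # Huber psi(u)/u
        WZ = Z * w[:, None]
        beta = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)       # weighted LS update
    return beta

rng = np.random.default_rng(6)
Xi = rng.normal(size=(100, 2))
Y = 1.0 + Xi @ np.array([2.0, -1.0]) + rng.standard_t(df=2, size=100)
print(huber_fit(Y, Xi))     # close to (1, 2, -1) despite heavy-tailed errors
```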
The partition above assumes a given number of clusters, so the next step is to provide an estimation method for the number of clusters.
2.2. Estimation of the Optimal Number of Clusters
After estimating the model parameters and optimizing the clustering scheme, we need to discuss the estimation of the optimal number of clusters. In this paper, an information criterion is used as the clustering loss function within the iterative algorithm, so that the identification of heterogeneity between clusters and the optimal number of clusters can be updated simultaneously.
After applying FPCA, the sample data are $\{(Y_i, \hat{\boldsymbol{\xi}}_i)\}_{i=1}^{n}$. According to the previous analysis, it is assumed that the sample is composed of $M$ sub-populations, and the characteristics of each sub-population are represented by the regression hyperplane determined by its parameters.
Denote the partition by $\mathcal{C} = \{C_1, \ldots, C_M\}$; the regression model of each sub-population is:

$$ \mathbf{Y}_m = \mathbf{Z}_m \boldsymbol{\beta}^{(m)} + \boldsymbol{\varepsilon}_m, \qquad m = 1, \ldots, M, $$

where $n_m$ is the sample size of cluster $C_m$ and $\sum_{m=1}^{M} n_m = n$; $\mathbf{Y}_m$ and $\mathbf{Z}_m$ are the response vector and the matrix of principal component scores belonging to $C_m$, respectively; $\boldsymbol{\varepsilon}_m \sim N(\mathbf{0}, \sigma_m^2 \mathbf{I}_{n_m})$ for $m = 1, \ldots, M$, and $\mathbf{I}_{n_m}$ is an $n_m \times n_m$ identity matrix. Notice that $\mathbf{Z}_m$ is an $n_m \times (K+1)$ matrix, and both $\mathbf{Y}_m$ and $\boldsymbol{\varepsilon}_m$ are $n_m \times 1$ vectors. The estimation of the number of clusters adopts the information criterion based on the maximum likelihood estimation proposed by Shao and Wu [27], which is denoted as LS-C and can be obtained by:

$$ \text{LS-C}(M, \mathcal{C}) = \sum_{m=1}^{M} \left\| \mathbf{Y}_m - \mathbf{Z}_m \hat{\boldsymbol{\beta}}^{(m)} \right\|^2 + \hat{\sigma}^2\, P(n, M), \qquad (24) $$
where $\hat{\sigma}^2$ is estimated by maximum likelihood in this case, and the penalty $P(n, M)$ is generally a strictly increasing function of $M$ and $n$, for example of order $M \log n$ or $M \sqrt{n}$. The first part is the residual sum of squares, and the second part is a penalty function relating to $M$ and $n$. Moreover, Shao and Wu [27] proved that the estimate obtained by minimizing LS-C converges to the correct number of regression hyperplanes (the number of clusters) with probability 1 when the sample size is large enough ($n \to \infty$). Note that LS-C is based on maximum likelihood estimation; again, a robust alternative is needed when the errors are not independently and identically normally distributed. Rao et al. [28] constructed the robust information criterion, denoted RM-C:

$$ \text{RM-C}(M, \mathcal{C}) = \sum_{m=1}^{M} \sum_{i \in C_m} \rho\!\left( Y_i - \hat{\beta}_0^{(m)} - \hat{\boldsymbol{\xi}}_i^{\mathrm{T}} \hat{\boldsymbol{\beta}}^{(m)} \right) + \hat{\sigma}^2\, P(n, M), $$
where $\hat{\sigma}^2$ is obtained by M-estimation in this case, and the penalty term $P(n, M)$ is the same as in Formula (24). By minimizing the information criterion LS-C or RM-C, according to whether or not the error distribution is i.i.d. normal, the number of clusters and the partition can be obtained together. The advantage of the information criteria LS-C and RM-C is that the estimated number of clusters converges to the true number when the sample size is large enough; details can be found in Shao and Wu [27] and Rao, Wu and Shao [28].
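A sketch of the selection step: for each candidate $M$, evaluate an LS-C-style criterion, that is, within-cluster RSS plus a penalty increasing in $M$ and $n$, and keep the minimizer. The BIC-like penalty $M(K+1)\log n$ below is only an illustrative choice, not the exact penalty of Shao and Wu [27].

```python
# Choosing the number of clusters with an LS-C-style penalized criterion.
import numpy as np

def rss_of_partition(Y, Xi, labels):
    rss = 0.0
    for m in np.unique(labels):
        idx = labels == m
        Z = np.column_stack([np.ones(idx.sum()), Xi[idx]])
        beta = np.linalg.lstsq(Z, Y[idx], rcond=None)[0]
        rss += np.sum((Y[idx] - Z @ beta) ** 2)
    return rss

def ls_c(Y, Xi, labels, sigma2_hat):
    n, K = Xi.shape
    M = len(np.unique(labels))
    return rss_of_partition(Y, Xi, labels) + sigma2_hat * M * (K + 1) * np.log(n)

rng = np.random.default_rng(7)
Xi = rng.normal(size=(100, 2))
group = (np.arange(100) >= 50).astype(int)
Y = np.where(group == 0, 1.0 + Xi @ [2.0, -1.0], -3.0 + Xi @ [0.5, 4.0])
Y = Y + rng.normal(scale=0.5, size=100)
print(ls_c(Y, Xi, np.zeros(100, dtype=int), 0.25),  # M = 1
      ls_c(Y, Xi, group, 0.25))                     # M = 2 should be smaller
```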
2.3. Iterative Algorithm Design
In the FCR-HL model proposed in this paper, parameter estimation, clustering optimization, and the estimation of the number of clusters are all continuously updated in the iterative process, and this section will explain the iterative algorithm.
First, the within-cluster residual sum of squares based on maximum likelihood estimation is denoted RSS (residual sum of squares), and that based on M-estimation is denoted RRSS (robust residual sum of squares):

$$ \text{RSS} = \sum_{m=1}^{M} \left\| \mathbf{Y}_m - \mathbf{Z}_m \hat{\boldsymbol{\beta}}^{(m)} \right\|^2, \qquad \text{RRSS} = \sum_{m=1}^{M} \sum_{i \in C_m} \rho\!\left( Y_i - \hat{\beta}_0^{(m)} - \hat{\boldsymbol{\xi}}_i^{\mathrm{T}} \hat{\boldsymbol{\beta}}^{(m)} \right). $$

Then, for each candidate number of clusters, the within-cluster RSS is computed for the least squares regression, or the RRSS for the M-estimation-based regression, to approximate the local minimum, and the optimal number of clusters is determined by the information criterion LS-C or RM-C, respectively.
In addition, the regression-based clustering method is easily affected by the initial partition. The global minimum of the information criterion, or a good approximation to it, can be achieved only from a good initial partition. Thus, it is necessary to determine the initial partition $\mathcal{C}^{(0)}$. Based on the idea proposed by Qian et al. [29], we extend it to handle functional data. Algorithm 1 shows the iterative initial-partition algorithm.
Algorithm 1 An iterative algorithm for the initial partition.
Step 1: Use FPCA on $\{X_i(t)\}_{i=1}^{n}$ to estimate the functional principal component scores $\hat{\xi}_{ik}$.
Step 2: Through mapping the mean function $\mu(t)$ and the basis functions $\phi_k(t)$ to the parameters $\beta_0$ and $\beta_k$, respectively, build a functional regression model with the functional principal component scores as covariates.
Step 3: Estimate the parameters by maximum likelihood estimation or robust estimation based on the whole data set.
Step 4: (1) Set a distance threshold $d_0$ and a sample size constant $n_0$. (2) For $j = 1$, calculate the distance between each point $(Y_i, \hat{\boldsymbol{\xi}}_i)$ and the regression hyperplane obtained in Step 3. If the distance is less than the threshold $d_0$, the point is partitioned into $C_1$, otherwise into $D_1$; continue if $|D_1| \ge n_0$, otherwise go to Step 5. (3) For $j \ge 2$, using the points in the data set $D_{j-1}$, estimate the parameters again and calculate the new distances. If a distance is less than $d_0$, the point is partitioned into $C_j$, otherwise into $D_j$; continue if $|D_j| \ge n_0$, otherwise go to Step 5.
Step 5: Obtain the initial partition $\mathcal{C}^{(0)} = \{C_1, C_2, \ldots, D_j\}$.
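The following sketch mirrors the peeling idea of Algorithm 1: repeatedly fit a regression on the remaining points and split off those close to the fitted hyperplane. Plain least squares and the fixed thresholds stand in for the robust fit and the data-driven constants $d_0$ and $n_0$.

```python
# Hierarchical peeling for an initial partition (Algorithm 1 sketch).
import numpy as np

def initial_partition(Y, Xi, d0=1.0, n0=10):
    Z = np.column_stack([np.ones(len(Y)), Xi])
    remaining = np.arange(len(Y))
    clusters = []
    while len(remaining) >= n0:
        Zr, Yr = Z[remaining], Y[remaining]
        beta = np.linalg.lstsq(Zr, Yr, rcond=None)[0]  # fit on remaining data
        dist = np.abs(Yr - Zr @ beta)                  # distance to hyperplane
        close = dist < d0
        if close.sum() == 0 or close.all():
            break                                      # nothing left to split off
        clusters.append(remaining[close])              # peel off cluster C_j
        remaining = remaining[~close]                  # leftover set D_j
    clusters.append(remaining)
    return clusters

rng = np.random.default_rng(8)
Xi = rng.normal(size=(60, 2))
Y = np.where(np.arange(60) < 30, Xi @ [2.0, -1.0], 5.0 + Xi @ [-1.0, 3.0])
Y = Y + rng.normal(scale=0.2, size=60)
print([len(c) for c in initial_partition(Y, Xi)])
```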
It should be noted that the constants $d_0$ and $n_0$ are set based on the data. The initial partition procedure is an iterative hierarchical binary clustering method that fits a regression model, such as least squares regression, in each iteration. Because the regression is robust, with a high breakdown point, Algorithm 1 is highly likely to produce a reasonable initial partition. After the initial partition, the iterative algorithm of the FCR-HL model is shown in Algorithm 2:
Algorithm 2 The partition iteration algorithm based on the initial partition.
Step 1: Let $s = 0$; calculate the LS-C (or RM-C) of the initial partition $\mathcal{C}^{(0)}$ and the parameter estimates $\hat{\boldsymbol{\beta}}^{(0)}$.
Step 2: Let $s = s + 1$; reassign the observations and recalculate the LS-C (or RM-C) of the data in the updated clusters, obtaining $\text{LS-C}^{(s)}$ (or $\text{RM-C}^{(s)}$) such that $\text{LS-C}^{(s)} < \text{LS-C}^{(s-1)}$ (or $\text{RM-C}^{(s)} < \text{RM-C}^{(s-1)}$). The updated partition is $\mathcal{C}^{(s)}$, with the corresponding estimates $\hat{\boldsymbol{\beta}}^{(s)}$.
Step 3: Iterate Step 2 until the LS-C (or RM-C) no longer drops.
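A compact sketch of the alternation in Algorithm 2, using the negative log-likelihood as the stopping criterion in place of LS-C/RM-C for brevity: fit within clusters, reassign by (10), and stop when the criterion no longer drops.

```python
# Alternating fit/reassign loop of the FCR-HL iteration (Algorithm 2 sketch).
import numpy as np

def fcr_iterate(Y, Xi, labels, max_iter=100, tol=1e-8):
    crit_old = np.inf
    for _ in range(max_iter):
        params = []                                  # within-cluster MLE, (12)-(13)
        for m in np.unique(labels):
            idx = labels == m
            Z = np.column_stack([np.ones(idx.sum()), Xi[idx]])
            beta = np.linalg.lstsq(Z, Y[idx], rcond=None)[0]
            s2 = max(np.mean((Y[idx] - Z @ beta) ** 2), 1e-8)
            params.append((beta[0], beta[1:], s2))
        ll = np.stack([-0.5 * np.log(2 * np.pi * s2)  # assignment step, (10)
                       - (Y - b0 - Xi @ b) ** 2 / (2 * s2)
                       for b0, b, s2 in params], axis=1)
        labels = ll.argmax(axis=1)
        crit = -ll.max(axis=1).sum()                 # criterion should not increase
        if crit_old - crit < tol:
            break
        crit_old = crit
    return labels

rng = np.random.default_rng(9)
Xi = rng.normal(size=(100, 2))
true = (np.arange(100) >= 50).astype(int)
Y = np.where(true == 0, 1.0 + Xi @ [2.0, -1.0], -3.0 + Xi @ [0.5, 4.0])
Y = Y + rng.normal(scale=0.3, size=100)
labels = fcr_iterate(Y, Xi, rng.integers(0, 2, size=100))
acc = np.mean(labels == true)
print(max(acc, 1.0 - acc))                           # accuracy up to label switching
```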
In summary, the parameter estimates and the partition are updated together in the iterative algorithm, which finally yields the regression clustering result.
In the simulation analysis and the empirical data analysis, the K-means method is used as a comparison model, as it is a representative clustering method that only utilizes the distances between the observations themselves. In contrast, our model emphasizes the auxiliary information between the response and the covariate and clusters the data from the regression perspective to uncover heterogeneity.