Article

Response-Aided Score-Matching Representative Approaches for Big Data Analysis and Model Selection under Generalized Linear Models †

by Duo Zheng 1, Keren Li 2 and Jie Yang 3,*
1 Amazon, 425 106th Ave NE, Bellevue, WA 98004, USA
2 Department of Mathematics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
3 Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
* Author to whom correspondence should be addressed.
† This work was performed while at the University of Illinois at Chicago and is not associated with Amazon.
Algorithms 2024, 17(10), 456; https://doi.org/10.3390/a17100456
Submission received: 5 August 2024 / Revised: 22 September 2024 / Accepted: 9 October 2024 / Published: 14 October 2024
(This article belongs to the Special Issue Machine Learning Algorithms for Big Data Analysis (2nd Edition))

Abstract

In this paper, we propose an efficient method called the response-aided score-matching representative (RASMR) approach to facilitate massive data model selection and data analysis with generalized linear models (GLMs) and a predetermined data partition due to data localization. Similar to the original score-matching representative (SMR) approach, RASMR constructs an artificial data point, called the representative, for each data block. It then fits a GLM on the representative dataset, which provides not only an efficient approach for massive data analysis but also an ideal solution in response to privacy concerns by avoiding the transfer of sensitive data. By further splitting the data blocks according to the values of the response variables, RASMR can obtain more accurate parameter estimates than SMR. Furthermore, by theoretical justifications and simulation studies, we show that RASMR can be more efficiently utilized for model selection and variable selection for a massive dataset by approximating the Akaike information criterion (AIC) and the aggregated prediction errors for cross-validation, which are commonly used for choosing the most appropriate statistical model and drawing reliable conclusions. We also apply the proposed RASMR approach to the airline on-time performance data, which consists of 371 data files labeled by month, and show that RASMR can be successfully used for selecting the most appropriate model for real massive data analysis.

1. Introduction

The numerous innovations in data analysis in the last decade have dramatically affected the technologies used in our daily lives. Datasets with unprecedented sizes and complexities are rapidly generated and collected from a great variety of resources [1]. While big datasets bring us incredible opportunities for new discoveries, many traditional methods that perform well for moderate sample sizes are no longer realistic for analyzing massive amounts of data [2].
To address the big data challenges, many traditional statistical tools have been reinvented or adapted to deal with the gigantic volume or size of big data, which is a major goal of our work. Comprehensive reviews, such as [3,4], have been provided for algorithmic solutions to big data problems, including divide-and-conquer, subsampling-based approaches, stochastic gradient descent, and online updating.
Divide-and-conquer approaches split the big data into manageable blocks, extract local summaries from each data block, and then generate overall insight. Various types of algorithms have been proposed to adapt classic statistical tools to big data problems [5,6,7,8,9,10,11,12].
Subsampling-based approaches draw subsamples from the original data using elaborately designed sampling mechanisms and then perform downstream inference and prediction based on the subsamples. For example, for leveraging in big data regression, [13] constructed nonuniform sampling probabilities so that influential data points are sampled with high probability; [14] proposed a novel method called information-based optimal sub-data selection (IBOSS), which selects samples from a big dataset based on the D-optimality criterion and has been extended to various statistical models, such as logistic regression [15].
When multiple subsamples of the original data are available, different approaches have been proposed to aggregate the estimates obtained from different subsamples, including bagging or bootstrap aggregating [16], stacking [17,18], magging or maximin aggregating [19], and neagging or normalized entropy aggregating [20,21]. These aggregation approaches are especially useful for inhomogeneous large-scale data under regression analysis.
Stochastic gradient descent (SGD) algorithms [22,23,24] update in a sequential manner based on a noisy gradient. Local SGD is the key building block of different federated learning algorithms [25,26], which were discussed for both homogeneous nodes [27,28,29,30,31,32] and heterogeneous nodes [29,33,34,35,36,37,38,39,40,41,42].
As stated in [43], new challenges in the big data era involve data security and data localization legislation. According to [44,45], legislation on data localization has emerged as a global trend, with more nations developing laws that prohibit the transfer of sensitive data. Data localization may not kill global cooperation in data science, but it will undoubtedly create unpleasant barriers to data communication. A novel approach called the score-matching representative (SMR) approach was proposed by [43] for analyzing big data under communication constraints. Given an existing data partition, such as data blocks labeled by countries, regions, or sources, SMR constructs model-specified data representative(s) for each block and performs downstream analysis on the constructed representatives. Unlike the divide-and-conquer or subsampling approaches, SMR does not require communicating the actual data but rather their representatives, which are not part of the original data and, thus, avoid their transfer. Comprehensive studies show that the accuracy of the model parameters estimated with SMR can be comparable to that of the full data estimates [43]. According to [43], the computational complexity of SMR is $O(Np)$, given the sample size $N$ and the number $p$ of model parameters.
Nevertheless, there are three hidden traps in [43]’s SMR method (see Section 2.1 for more details), as follows: (1) there can be multiple solutions to the score-matching equations; (2) the constructed representatives may be quite far away from their data blocks; and (3) the SMR algorithm may fail to converge. Due to these issues, SMR may not be accurate enough for more elaborate tasks, such as model selection and variable selection.
In this paper, we develop a new representative approach, called the response-aided score-matching representative (RASMR) approach, for big data analysis under generalized linear models (GLMs). It improves SMR significantly by splitting the data blocks further with the aid of the response variable (see Figure 1 for a graphical display). Since the refined data blocks yield only one solution to the score-matching equation, RASMR ensures the uniqueness of the data representatives. Compared with SMR, RASMR can not only provide much more accurate estimates for model parameters but also be used for model selection and variable selection efficiently, which are critical for data scientists to draw reliable conclusions from data analysis.

2. RASMR for Big Data Analysis under GLM

2.1. SMR Approach for GLM

In this section, we review the key components of the original SMR approach proposed by [43], along with its potential issues.
Following the notation of [43], the original data $\{(x_i, y_i), i = 1, \ldots, N\}$ are provided with an index partition $\{I_1, \ldots, I_K\}$ of $I = \{1, \ldots, N\}$, where $x_i \in \mathbb{R}^d$ represents the $i$th covariate vector, $y_i \in \mathbb{R}$ is the corresponding response, $I = \bigcup_{j=1}^K I_j$, $I_j \neq \emptyset$ for each $j$, and $I_i \cap I_j = \emptyset$ for each $i \neq j$.
Under a generalized linear model (GLM) [46,47], there is a link function $g$, $p$ known predictor functions $h_1, \ldots, h_p$, and $p$ unknown regression parameters $\beta_1, \ldots, \beta_p$, such that the expectation of the response variable $Y_i$ given $x_i$ satisfies the following:
$$E(Y_i \mid x_i) = \mu_i \quad \text{and} \quad \eta_i = g(\mu_i) = X_i^T \beta, \qquad (1)$$
where $X_i = (h_1(x_i), \ldots, h_p(x_i))^T$, $i = 1, \ldots, N$, and $\beta = (\beta_1, \ldots, \beta_p)^T$.
Given the GLM (1), we denote the $k$th data block by $D_k = \{(X_i, y_i), i \in I_k\}$ and let $s_k(\beta) = \sum_{i \in I_k} [y_i - G(\eta_i)]\,\nu(\eta_i)\, X_i$ be the contribution made by the $k$th data block to the score function $s(\beta) = \sum_{k=1}^K s_k(\beta)$ [43,46], where $G(\eta) = g^{-1}(\eta)$, $\nu(\eta) = G'(\eta)/h(\eta)$, and $h(\eta_i) = \mathrm{Var}(Y_i)$ are functions of $\eta$ or $\eta_i$. According to Section 2.5 of [46], the maximum likelihood estimate (MLE) of $\beta$ solves the score equation $s(\beta) = 0$. The SMR algorithm [43] was designed to find $\tilde{y}_k \in \mathbb{R}$ and $\tilde{X}_k \in \mathbb{R}^p$ solving $s_k(\beta) = n_k [\tilde{y}_k - G(\tilde{\eta}_k)]\,\nu(\tilde{\eta}_k)\,\tilde{X}_k = \tilde{s}_k(\beta)$, where $n_k = |I_k|$ is the size of $I_k$ and $\tilde{\eta}_k = \tilde{X}_k^T \beta$. More specifically, the SMR algorithm first chooses the following:
$$\tilde{y}_k = \Big[\sum_{i \in I_k} \nu(\eta_i)\,\eta_i\Big]^{-1} \sum_{i \in I_k} \nu(\eta_i)\,\eta_i\, y_i, \qquad (2)$$
then solves the score-matching equation
$$n_k\,\nu(\tilde{\eta}_k)\,[\tilde{y}_k - G(\tilde{\eta}_k)]\,\tilde{\eta}_k = \sum_{i \in I_k} \nu(\eta_i)\,[y_i - G(\eta_i)]\,\eta_i \qquad (3)$$
for $\tilde{\eta}_k \in \mathbb{R}$, and then constructs the $k$th representative $(\tilde{X}_k, \tilde{y}_k)$ by calculating the following:
$$\tilde{X}_k = \big\{ n_k\,\nu(\tilde{\eta}_k)\,[\tilde{y}_k - G(\tilde{\eta}_k)]\big\}^{-1} \sum_{i \in I_k} \nu(\eta_i)\,[y_i - G(\eta_i)]\, X_i. \qquad (4)$$
Then the score function $s(\beta) = \sum_{i=1}^N [y_i - G(\eta_i)]\,\nu(\eta_i)\, X_i = \sum_{k=1}^K s_k(\beta) = \sum_{k=1}^K \tilde{s}_k(\beta) = \tilde{s}(\beta)$. The MLE $\hat{\beta}$ of $\beta$ based on the full data, which solves the score equation $s(\beta) = 0$, is expected to be the same as the SMR estimate $\tilde{\beta}$ based on the weighted representative data $\{(n_k, \tilde{X}_k, \tilde{y}_k), k = 1, \ldots, K\}$, which solves the matched score equation $\tilde{s}(\beta) = 0$. In practice, since $\beta$ is unknown, the SMR algorithm first solves the matched score equation $\tilde{s}(\beta) = 0$ with some initial representatives for $\tilde{\beta}$, and then solves the score-matching Equation (3) for the representatives. The procedure may continue iteratively until reaching a predetermined accuracy level. Three iterations were suggested by [43] for general applications, with an overall computational complexity of $O(Np)$.
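To make this construction concrete, the following minimal Python sketch builds one representative for a single block under logistic regression (logit link, for which $\nu(\eta)$ is constant and cancels in Equations (2)-(4)). The function names and the use of SciPy's brentq root finder are our own illustrative choices, not the authors' implementation, and the block is assumed to have already been split so that all $\eta_i$ share the same sign.

```python
import numpy as np
from scipy.optimize import brentq

def G(eta):
    """Inverse logit link: G(eta) = E(Y | eta)."""
    return 1.0 / (1.0 + np.exp(-eta))

def smr_representative(X_k, y_k, beta):
    """Construct one score-matching representative (n_k, X_tilde, y_tilde)
    for a data block under logistic regression, where nu(eta) is constant
    and therefore cancels in Equations (2)-(4)."""
    n_k = len(y_k)
    eta = X_k @ beta                                  # eta_i = X_i^T beta
    y_tilde = np.sum(eta * y_k) / np.sum(eta)         # Equation (2)
    rhs = np.sum((y_k - G(eta)) * eta)                # right-hand side of Equation (3)
    f = lambda e: n_k * (y_tilde - G(e)) * e - rhs
    lo, hi = float(eta.min()), float(eta.max())
    if lo == hi or f(lo) * f(hi) > 0:                 # no usable bracket: fall back to the block mean
        eta_tilde = float(eta.mean())
    else:
        eta_tilde = brentq(f, lo, hi)                 # solve Equation (3) within the block's eta-range
    X_tilde = ((y_k - G(eta)) @ X_k) / (n_k * (y_tilde - G(eta_tilde)))   # Equation (4)
    return n_k, X_tilde, y_tilde
```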
Despite the successful applications of SMR in estimating parameter values for big data analysis under a GLM, its accuracy may not be high enough for more delicate applications, including model selection and variable selection, due to the following three issues. Firstly, there can be more than one solution to Equation (3) (see Section 2.2). SMR chooses the solution whose representative is closest to the mean representative (MR, that is, $\bar{X}_k = n_k^{-1}\sum_{i \in I_k} X_i$, $\bar{y}_k = n_k^{-1}\sum_{i \in I_k} y_i$), which, however, may not match the likelihood well. Secondly, the predictor representative $\tilde{X}_k$ obtained from Equation (4) may be far away from its corresponding data block (see Section S2 in the Supplementary Materials), in which case it may not represent its data block well. Thirdly, the SMR algorithm may not converge well in practice, which may occur with misspecified link functions or highly skewed predictor distributions (see Section 4).

2.2. Solving the Score-Matching Equation with Splitting Points

In this section, we investigate the number of solutions for the score-matching Equation (3) and explain why more stable and accurate solutions can be obtained for Equation (3) by further splitting the data blocks, which is our motivation to propose the RASMR algorithms in later sections.
According to Equations (2) and (4), the $k$th representative $(\tilde{X}_k, \tilde{y}_k)$ can be obtained as weighted averages of the $k$th data block $D_k = \{(X_i, y_i), i \in I_k\}$. The weights of the $y_i$'s and $X_i$'s are $\nu(\eta_i)\,\eta_i$ and $\nu(\eta_i)\,[y_i - G(\eta_i)]$, respectively. To stabilize the constructed representatives, we propose to keep the weights with the same sign (that is, all positive or all negative) in each individual data block. That is, besides splitting the data block according to $\mathrm{sgn}(\eta_i)$, as suggested in Remark 3.1 of [43], we suggest further splitting the data blocks by $\mathrm{sgn}(y_i - G(\eta_i))$.
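A minimal sketch of this refined split, under our own naming and with zeros merged into the nonnegative side in the spirit of Remark A1, could look as follows:

```python
import numpy as np

def response_aided_split(eta, y, G):
    """Split a block's row indices into at most four sub-blocks by the signs
    of eta_i and of y_i - G(eta_i), so that the weights in Equations (2) and
    (4) keep a constant sign within each sub-block."""
    pos_eta = eta >= 0
    pos_res = (y - G(eta)) >= 0
    sub_blocks = []
    for s_eta in (True, False):
        for s_res in (True, False):
            idx = np.where((pos_eta == s_eta) & (pos_res == s_res))[0]
            if idx.size > 0:
                sub_blocks.append(idx)
    return sub_blocks
```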
Now, we investigate the number of solutions to Equation (3) and how to further split the data block to ensure a unique solution. Following [43], we denote $S(\eta) = \nu(\eta)\,[\tilde{y}_k - G(\eta)]\,\eta$. According to the proof for Theorem 3.1 in [43], Equation (3) can be rewritten as follows:
$$S(\tilde{\eta}_k) = \frac{1}{n_k}\sum_{i \in I_k} S(\eta_i). \qquad (5)$$
We further denote $\eta_k^{\min} = \min_{i \in I_k}\{\eta_i\}$ and $\eta_k^{\max} = \max_{i \in I_k}\{\eta_i\}$. If both $\nu(\eta)$ and $G(\eta)$ are continuous, then there exists a solution $\eta^* \in [\eta_k^{\min}, \eta_k^{\max}]$ that solves Equation (5) (see Theorem 3.1 in [43]).
Examples of $\nu(\eta)$ and $G(\eta)$ for commonly used GLMs can be found in Table 1 of [43], which are all continuous. If $S(\eta)$ is strictly monotone on $[\eta_k^{\min}, \eta_k^{\max}]$, then there exists a unique $\eta^* \in [\eta_k^{\min}, \eta_k^{\max}]$ that solves (3) (see Theorem A1 in Appendix A).
If $\nu(\eta)$ is a constant, such as for the normal model with an identity link, the Bernoulli model with a logit link, the Poisson model with a log link, the gamma model with a reciprocal link, and the inverse Gaussian model with an inverse-square link, we rewrite, without any loss of generality, $S(\eta) = [\tilde{y}_k - G(\eta)]\,\eta$. Then its first derivative is $S'(\eta) = \tilde{y}_k - G(\eta) - G'(\eta)\,\eta$, with the key component $T(\eta) = G(\eta) + G'(\eta)\,\eta$. In this case, we have $\tilde{y}_k = \frac{\sum_{i \in I_k}\eta_i y_i}{\sum_{i \in I_k}\eta_i} = \frac{n_k^{-1}\sum_{i \in I_k}\eta_i y_i}{\bar{\eta}_k}$, where $\bar{\eta}_k = n_k^{-1}\sum_{i \in I_k}\eta_i$ corresponds to the mean representative (MR). We denote $\tilde{G}_k = \frac{\sum_{i \in I_k}\eta_i G(\eta_i)}{\sum_{i \in I_k}\eta_i} = \frac{n_k^{-1}\sum_{i \in I_k}\eta_i G(\eta_i)}{\bar{\eta}_k}$. Then, we have the following:
$$\bar{S} = \frac{1}{n_k}\sum_{i \in I_k} S(\eta_i) = \frac{1}{n_k}\sum_{i \in I_k}[\tilde{y}_k - G(\eta_i)]\,\eta_i = \bar{\eta}_k\,(\tilde{y}_k - \tilde{G}_k). \qquad (6)$$
For the normal model with an identity link, that is, the usual linear model, there can be up to two solutions to (3) (see Theorem A2 in Appendix A).
For Bernoulli models (see Table 1 of [43]), $y_i \in \{0, 1\}$ and $G(\eta_i) \in (0, 1)$ for all $i$. Then, $y_i < G(\eta_i)$ always implies $y_i = 0$, and $y_i > G(\eta_i)$ always implies $y_i = 1$. Suppose (i) either $\eta_i > 0$ for all $i \in I_k$ or $\eta_i < 0$ for all $i \in I_k$; and (ii) either $y_i > G(\eta_i)$ for all $i \in I_k$ or $y_i < G(\eta_i)$ for all $i \in I_k$. Then, $\tilde{y}_k$ is either 0 or 1. Depending on whether $\tilde{y}_k = 0$ or 1, we denote $S(\eta)$ as $S_0(\eta)$ or $S_1(\eta)$, and the potential splitting point as $\eta_l$ or $\eta_r$ (see Table 1), respectively. For Bernoulli models with logit, probit, cloglog, loglog, or cauchit links, a data block $I_k$ after splitting at $\eta_l$ or $\eta_r$ yields a unique solution solving (3) (see Theorem A3 in Appendix A and Lemmas S1–S5 in Section S6 of the Supplementary Materials). We summarize in Table 1 the potential splitting points $\eta_l$ for blocks with $\tilde{y}_k = 0$ and $\eta_r$ for blocks with $\tilde{y}_k = 1$.
For the Poisson model with a log link, if either $\eta_i > 0$ for all $i \in I_k$ or $\eta_i < 0$ for all $i \in I_k$, then there are up to two solutions solving (3) (see Theorem A4 in Appendix A).
For the gamma model with a reciprocal link, $\bar{\eta}_k = n_k^{-1}\sum_{i \in I_k}\eta_i$ is the unique solution solving (3) (see Theorem A5 in Appendix A).
For the inverse Gaussian model with an inverse-square link, in general, there are up to two solutions solving (3) (see Theorem A6 in Appendix A).
In conclusion, the score-matching Equation (3) often yields two solutions for commonly used GLMs, confirming the first hidden trap mentioned in the Introduction section for the original SMR algorithm. On the other hand, there are, at most, two solutions for Equation (3) (see Theorems A2–A6 in Appendix A), which motivates us to further split the corresponding data block in a way that each sub-block yields a unique solution.

2.3. Response-Aided Score-Matching Representative Approach

In this section, we propose a new algorithm (see Algorithm 1 for its pseudocode) with further splits based on the values of the response variable $y_i$, and call it the response-aided score-matching representative (RASMR) approach. Since each block is split by at most two binary criteria, the signs of $\eta_i$ and of $y_i - G(\eta_i)$, the further splits may at most quadruple the original number of data blocks. Since the time complexity of the original SMR is $O(Np)$ [43], which does not depend on the number of data blocks, the RASMR algorithm does not consume significantly more time than the SMR algorithm (see Section 4.2); its time complexity is $O(Np)$ as well.
Algorithm 1: RASMR.
Data: $D = \{(X_i, y_i), i = 1, \ldots, N\}$ with a partition of $K$ blocks indexed by $\{I_1, \ldots, I_K\}$. Denote the $k$th data block $D_k = \{(X_i, y_i), i \in I_k\}$.
Result: Parameter estimate $\tilde{\beta}$ of a given GLM with a predetermined number $T$ of iterations.
Calculate the initial weighted representative data $\tilde{D}^{(0)} = \{(n_k, \tilde{X}_k^{(0)} = \bar{X}_k, \tilde{y}_k^{(0)} = \bar{y}_k)\}_{k=1}^K$ with $\bar{X}_k = n_k^{-1}\sum_{i \in I_k} X_i$ and $\bar{y}_k = n_k^{-1}\sum_{i \in I_k} y_i$, that is, the mean representatives.
Implement the iteratively reweighted least squares (IRLS) procedure [48] on $\tilde{D}^{(0)}$ to obtain the initial estimate $\tilde{\beta}^{(0)}$.
[Algorithm 1 pseudocode figure]
Remark 1.
The iteratively reweighted least squares (IRLS or IWLS) procedure has been widely used for finding the MLE of a GLM [48,49,50]. Nevertheless, some more robust variants of IRLS have been proposed to make the estimates less sensitive to outliers (see [51] and references therein). One may choose a more robust procedure than IRLS in Algorithm 1, given that the targeted parameter estimate β ^ is under the same criterion.
Remark 2.
Following the splits described in step (2) of Algorithm 1, according to Theorems A2–A6 in Appendix A, many sub-blocks yield a unique solution to (3). There are still leftover cases for the normal/linear, Bernoulli, Poisson, and inverse Gaussian models, under which some sub-blocks may yield up to two solutions to (3). For those cases, according to the proofs of Theorems A2–A4 and A6 (see Section S6 in the Supplementary Materials), we may further split such a block into two sub-blocks according to the peak value $\eta$ of $S(\eta)$, which is $\frac{1}{2}\tilde{y}_{kl}$ for normal/linear models, $\eta_l$ or $\eta_r$ for Bernoulli models, $u(\tilde{y}_{kl})$ for Poisson models, or $(4\tilde{y}_{kl}^2)^{-1}$ for inverse Gaussian models. Based on our experience, such a split often leads to data blocks with a unique solution to (3).
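Putting the pieces together, the following sketch mirrors the overall flow of Algorithm 1 for logistic regression, reusing G, smr_representative, and response_aided_split from the earlier sketches. The small weighted IRLS routine and the helper fit_on_representatives are our own simplified stand-ins for the IRLS step on the weighted representative data, not the authors' code.

```python
import numpy as np

def irls_logistic(X, y, freq_w, beta0=None, n_iter=25):
    """Weighted IRLS for logistic regression on (possibly fractional)
    representative responses y with frequency weights freq_w."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        v = np.clip(mu * (1.0 - mu), 1e-10, None)     # variance function
        w = freq_w * v                                # working weights
        z = eta + (y - mu) / v                        # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

def fit_on_representatives(reps, beta0=None):
    """Refit the GLM on the weighted representative data {(n_k, X_tilde_k, y_tilde_k)}."""
    nw = np.array([r[0] for r in reps], dtype=float)
    Xr = np.vstack([r[1] for r in reps])
    yr = np.array([r[2] for r in reps], dtype=float)
    return irls_logistic(Xr, yr, nw, beta0=beta0)

def rasmr(blocks, T=10):
    """Sketch of Algorithm 1: blocks is a list of (X_k, y_k) arrays. Start from
    the mean representatives, then alternate between rebuilding the
    response-aided representatives and refitting on them."""
    reps = [(len(y), X.mean(axis=0), y.mean()) for X, y in blocks]
    beta = fit_on_representatives(reps)
    for _ in range(T):
        new_reps = []
        for X_k, y_k in blocks:
            eta = X_k @ beta
            for idx in response_aided_split(eta, y_k, G):
                new_reps.append(smr_representative(X_k[idx], y_k[idx], beta))
        beta = fit_on_representatives(new_reps, beta0=beta)
    return beta
```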

2.4. RASMR with the Delta Ratio Split

As we will demonstrate in Section 4.2, RASMR significantly improves upon SMR in estimating model parameters. However, it may not approximate the likelihood sufficiently well for model selection purposes when the covariates $x_i$ or predictors $X_i$ are non-Gaussian.
To investigate this, we start with the mean representative (MR) of the predictors in block $I_k$, $\bar{X}_k = n_k^{-1}\sum_{i \in I_k} X_i$, which is a natural choice for the block center. The predictor radius of $I_k$ can then be defined as $\Delta_k = \max_{i \in I_k}\|X_i - \bar{X}_k\|$, where $\|\cdot\|$ is the Euclidean norm. We call the relative distance of $\tilde{X}_k$ from $\bar{X}_k$ the delta ratio, defined as $\tilde{\delta}_k = \|\tilde{X}_k - \bar{X}_k\| / \Delta_k$.
By definition, $\tilde{\delta}_k = 0$ if $n_k = 1$. If $\tilde{\delta}_k > 1$ for some $I_k$, then its predictor representative $\tilde{X}_k$ is outside the corresponding data block, which implies that it may not be a good representative for calculating the likelihood. Compared with SMR, RASMR's delta ratios are significantly smaller (see Figure S1 in the Supplementary Materials), which partly explains why RASMR improves SMR significantly. Nevertheless, the delta ratios of RASMR may tend to inflate when the distribution of covariates is too extreme or complicated (see Section S2 in the Supplementary Materials for more details).
If $\tilde{\delta}_k$ is greater than a predefined threshold $\delta_0$, a fourth layer of further splitting at the mean $\bar{\eta}_k = n_k^{-1}\sum_{i \in I_k}\eta_i$ of the $\eta_i$'s may be applied. Since the sample mean is sensitive to outliers, using $\bar{\eta}_k$ as the cutoff point of the split may enable us to separate the outliers from the majority of the data block, which makes the remaining members of the data block closer to each other. In other words, we may identify the minority of data points in the original data block and separate them into their own data block.
In practice, if the likelihood approximation is not a serious concern, setting the delta ratio threshold to one is a conservative choice, since it only requires the representatives to stay inside their data blocks, not necessarily very close to the centers (see Section 4.3 for further discussion on choosing $\delta_0$). The pseudocode of the RASMR algorithm with the delta ratio split is described in Algorithm 2.
Algorithm 2: RASMR algorithm with the delta ratio split.
[Algorithm 2 pseudocode figure]
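A minimal sketch of the delta ratio computation and the fourth-layer split at $\bar{\eta}_k$, reusing smr_representative from the earlier sketch; the function names and the default threshold are our own choices.

```python
import numpy as np

def delta_ratio(X_k, X_tilde):
    """Relative distance of the predictor representative from the block mean:
    delta_tilde_k = ||X_tilde - X_bar|| / max_i ||X_i - X_bar||."""
    X_bar = X_k.mean(axis=0)
    Delta_k = np.max(np.linalg.norm(X_k - X_bar, axis=1))
    if Delta_k == 0.0:                       # singleton block or identical rows
        return 0.0
    return float(np.linalg.norm(X_tilde - X_bar) / Delta_k)

def delta_ratio_split(X_k, y_k, beta, delta0=1.0):
    """Build the representative; if it lies too far from the block center,
    split the block at the mean of the eta_i's and rebuild one representative
    for each half."""
    n_k, X_tilde, y_tilde = smr_representative(X_k, y_k, beta)
    if delta_ratio(X_k, X_tilde) <= delta0:
        return [(n_k, X_tilde, y_tilde)]
    eta = X_k @ beta
    halves = [eta <= eta.mean(), eta > eta.mean()]
    return [smr_representative(X_k[m], y_k[m], beta) for m in halves if m.any()]
```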

2.5. Learning Rate Scheduling

As an iterative method to maximize the log-likelihood function, the RASMR approach may suffer from a convergence issue, since the solution at each iteration may be far away from the global maximizer, or jump back and forth at the final stages (see Figure S13 in the Supplementary Materials for such an example).
The convergence issues of RASMR can arise for various reasons. They could be caused by a poor initial estimate via the mean representative (MR). A small number of data blocks may also lead to poor convergence (see Figure S12 in the Supplementary Materials for such an example; see also Section S1 for generating data blocks), especially when the number of parameters is moderately large (see Figure S13). The inherent complexities of the original data, such as ultra-high-dimensional, highly skewed, or heavy-tailed distributions of the covariates or predictors, may cause similar problems as well.
To make the RASMR approach more robust and practically useful for big data analysis, we adopt a learning rate scheduling strategy, which is a popular idea in the modern development of machine learning and artificial intelligence to resolve convergence issues (see, e.g., [52,53], and references therein). More specifically, we update the RASMR parameter estimates with an exponential learning rate decay as follows:
$$\beta^{(t)} = \beta^{(t-1)} + e^{-\theta t}\,\big(\tilde{\beta}^{(t)} - \beta^{(t-1)}\big), \qquad (7)$$
where $\tilde{\beta}^{(t)}$ is the original RASMR parameter estimate after the $t$th iteration, and $\theta > 0$ is a hyperparameter controlling the learning rate $e^{-\theta t}$. Clearly, if $\theta = 0$, that is, with learning rate 1, (7) yields the original RASMR estimates. A larger value of $\theta$ imposes a faster decay and forces the convergence of the iterative parameter estimates. On the other hand, if the learning rate quickly shrinks to zero, the estimate is updated very slowly, making the improvement of later iterations negligible. To keep the learning rate from decaying too fast, we suggest truncating it at, say, the 10th iteration and keeping it constant afterward. This truncation strategy for the exponential learning rate decay works well even when the number of variables is as large as 100 (see Figure S13).
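A minimal sketch of the truncated exponential decay in (7); the default values of $\theta$ and the truncation point are illustrative only.

```python
import numpy as np

def scheduled_update(beta_prev, beta_rasmr_t, t, theta=0.1, t_trunc=10):
    """Exponential learning-rate decay of Equation (7), truncated so that the
    rate stays constant after iteration t_trunc."""
    rate = np.exp(-theta * min(t, t_trunc))
    return beta_prev + rate * (beta_rasmr_t - beta_prev)
```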

3. Model Selection and Variable Selection Using RASMR

3.1. Information-Based Criteria and Model Selection

During the RASMR parameter estimation process in Algorithms 1 and 2, two very useful by-products are also generated, namely the response representative $\tilde{y}_k$ and the predictor representative $\tilde{X}_k$. The quadruple $(n_k, \tilde{y}_k, \tilde{X}_k, \tilde{\beta})$ provides the information needed to approximate the maximum likelihood, which is a key component of information-based criteria, such as the Akaike information criterion (AIC, [54,55]) and the Bayesian information criterion (BIC, [55]).
Recall that the log-likelihood function $l(\beta; y, X)$ through the full dataset $y = (y_1, \ldots, y_N)^T$ and $X = (X_1, \ldots, X_N)^T$ attains its maximum at the MLE $\hat{\beta}$. We denote the log-likelihood approximated through the RASMR representative quadruple set $\{(n_k, \tilde{y}_k, \tilde{X}_k, \tilde{\beta}), k = 1, \ldots, K\}$ by $l(\tilde{\beta}; \tilde{y}, \tilde{X})$, where $\tilde{y} = (\tilde{y}_1, \ldots, \tilde{y}_K)^T$ and $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_K)^T$.
Denoting by $\tilde{\Delta} = \max_k \max_{i \in I_k} \|X_i - \tilde{X}_k\|$ the maximum Euclidean distance of the predictor vectors from their corresponding predictor representatives across all data blocks, the following theorem guarantees the consistency of the estimated log-likelihood $l(\tilde{\beta}; y, X)$ as $\tilde{\Delta}$ goes to zero.
Theorem 1.
Suppose the log-likelihood function $l(\beta; y, X)$ is twice differentiable and strictly concave on a compact set $C \subset \mathbb{R}^p$, with its maximum not located on the boundary of $C$. Suppose $\tilde{\beta}$ is the estimate obtained by Algorithm 1 or 2, and $\tilde{y}_k = n_k^{-1}\sum_{i \in I_k} y_i + O(\tilde{\Delta})$ for each $k$; then $l(\tilde{\beta}; y, X) - l(\hat{\beta}; y, X) = O(\tilde{\Delta}^{1/2})$ as $\tilde{\Delta} \to 0$.
The proof of Theorem 1, as well as all other proofs in this paper, has been relegated to Section S6 of the Supplementary Materials.
We denote the AIC and BIC values calculated based on $l(\tilde{\beta}; y, X)$ as $\widetilde{AIC}$ and $\widetilde{BIC}$, respectively. As a direct conclusion of Theorem 1, we have the following corollary:
Corollary 1.
For a generalized linear model under the same technical conditions of Theorem 1, the values of $\widetilde{AIC}$ and $\widetilde{BIC}$ converge to the original values of AIC and BIC, respectively, as $\tilde{\Delta} \to 0$. Moreover, $\widetilde{AIC} = AIC + O(\tilde{\Delta}^{1/2})$ and $\widetilde{BIC} = BIC + O(\tilde{\Delta}^{1/2})$.
The computation of $l(\tilde{\beta}; \tilde{y}, \tilde{X})$ only requires the $K$ representatives rather than the whole dataset. Its accuracy depends not only on the parameter estimate $\tilde{\beta}$ but also on the representatives $\tilde{y}$ and $\tilde{X}$. The next theorem confirms that the estimated log-likelihood $l(\tilde{\beta}; \tilde{y}, \tilde{X})$ based on the RASMR algorithms is consistent as $\tilde{\Delta}$ goes to zero.
Theorem 2.
Under the same technical conditions of Theorem 1, $l(\tilde{\beta}; \tilde{y}, \tilde{X}) - l(\hat{\beta}; y, X) = O(\tilde{\Delta}^{1/2})$ as $\tilde{\Delta} \to 0$.
We denote the AIC and BIC calculated with $l(\tilde{\beta}; \tilde{y}, \tilde{X})$ as $\widetilde{\widetilde{AIC}}$ and $\widetilde{\widetilde{BIC}}$, respectively. As a direct conclusion of Theorem 2, we have the following corollary:
Corollary 2.
For a generalized linear model under the same technical conditions of Theorem 1, the values of $\widetilde{\widetilde{AIC}}$ and $\widetilde{\widetilde{BIC}}$ converge to the original values of AIC and BIC, respectively, as $\tilde{\Delta} \to 0$. Moreover, $\widetilde{\widetilde{AIC}} = AIC + O(\tilde{\Delta}^{1/2})$ and $\widetilde{\widetilde{BIC}} = BIC + O(\tilde{\Delta}^{1/2})$.
Corollary 2 ensures that $\widetilde{\widetilde{AIC}}$ and $\widetilde{\widetilde{BIC}}$ can be used to approximate the original values of AIC and BIC, with an error that vanishes as the maximum block radius $\tilde{\Delta}$ shrinks. Both the SMR and RASMR algorithms can be used to generate $\widetilde{\widetilde{AIC}}$ and $\widetilde{\widetilde{BIC}}$, but our numerical experiments overwhelmingly favor RASMR, as it improves both the parameter estimates and the representatives.
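For logistic regression, the approximation can be computed from the representatives alone, as in the following sketch; using the full sample size $N$ in the BIC penalty is our reading of the criterion, and the function name is our own.

```python
import numpy as np

def approx_aic_bic(reps, beta, N):
    """Approximate AIC and BIC for logistic regression computed only from the
    representatives (n_k, X_tilde_k, y_tilde_k) and the RASMR estimate beta."""
    p = len(beta)
    loglik = 0.0
    for n_k, X_tilde, y_tilde in reps:
        eta = float(X_tilde @ beta)
        # Bernoulli log-likelihood contribution: y*eta - log(1 + exp(eta))
        loglik += n_k * (y_tilde * eta - np.logaddexp(0.0, eta))
    return -2.0 * loglik + 2.0 * p, -2.0 * loglik + p * np.log(N)
```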

3.2. Link Function Selection

For the GLM (1), the link function $g$ plays a key role in modeling the relationship between a linear combination of the predictors and the expectation of the response variable. Choosing the most appropriate link function for a given dataset is therefore critical for data analysis. An information-based approach, such as AIC or BIC, requires a precise estimate of the maximum likelihood. Theorem 2 and Corollary 2 provide theoretical justification for using $\widetilde{\widetilde{AIC}}$ or $\widetilde{\widetilde{BIC}}$ to select the most appropriate link function.
In our simulation studies (see Section S3 in the Supplementary Materials), SMR fails to choose the logit link function over its competitors (see Table S1), due to the low quality of its representatives. RASMR with the delta ratio split (Algorithm 2) outperforms SMR with representatives of better quality, making the corresponding $\widetilde{\widetilde{AIC}}$ and $\widetilde{\widetilde{BIC}}$ more reliable (see Table S2).

3.3. Variable Selection

Variable selection is an essential step in statistical data analysis, including subset selection and stepwise selection (see, for example, [55]).
RASMR can be directly extended to subset selection in big data variable selection problems with a moderate number $p$ of predictors. In the steps involving model fitting and evaluating information criteria (AIC or BIC), RASMR can readily be applied to the dataset to produce accurate results (see Section S5.1 in the Supplementary Materials).
To balance the processing time and the performance of variable selection, we introduce a quick variable screening step using the mean representatives (MR; see Section S4 in the Supplementary Materials). The information criteria $\overline{AIC}$ and $\overline{BIC}$ based on MR perform sufficiently well (see Table S1).
For forward stepwise variable selection, we recommend using RASMR to perform a finer selection until the stopping criterion is met (see Section S4 in the Supplementary Materials).

3.4. Cross-Validation

Cross-validation is a widely used data-driven technique for evaluating model performance (see, for example, [55]). For big data analysis under GLMs, we extend RASMR to V-fold cross-validation (VFCV) as follows:
  • Data $\{(X_i, y_i), i = 1, \ldots, N\}$ are given with a partition $I_1, \ldots, I_K$ of $\{1, \ldots, N\}$.
  • A random partition $A_1, \ldots, A_V$ of $\{1, \ldots, N\}$ is given for V-fold cross-validation.
  • For $j = 1, \ldots, V$, fit the target model on the training set $\{(X_i, y_i), i \notin A_j\}$ using RASMR with blocks $I_1 \setminus A_j, \ldots, I_K \setminus A_j$ after removing empty ones, and then calculate the aggregated prediction error $\hat{R}_j^{VFCV}$ when applying the fitted model to the testing set $\{(X_i, y_i), i \in A_j\}$.
  • Report $N^{-1}\sum_{j=1}^V \hat{R}_j^{VFCV}$ as the estimated average prediction error.
Note that we do not need to run the same number of iterations for each model fit. Since RASMR is an iterative approach that benefits from good initial parameter estimates, the estimate from the previous model fit can be used as the initial value for the next one. For this reason, we may set a larger number of iterations for the first RASMR model fit and a smaller number for later fits (see Section S5.2 in the Supplementary Materials).
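A compact sketch of this V-fold procedure for logistic regression, reusing the rasmr sketch from Section 2.3 and using misclassification counts as the aggregated prediction error; the warm-start across folds mentioned above is omitted for brevity, and $\hat{R}_j^{VFCV}$ may be any aggregated prediction error in general.

```python
import numpy as np

def rasmr_vfold_cv(blocks, V=5, seed=0):
    """V-fold cross-validation with RASMR: rows are randomly assigned to folds,
    each training fit uses the original blocks intersected with the training
    rows, and held-out misclassification counts are aggregated and averaged."""
    rng = np.random.default_rng(seed)
    fold_of = [rng.integers(0, V, size=len(y)) for _, y in blocks]
    N = sum(len(y) for _, y in blocks)
    total_err = 0.0
    for j in range(V):
        train_blocks = [(X_k[f_k != j], y_k[f_k != j])
                        for (X_k, y_k), f_k in zip(blocks, fold_of)
                        if (f_k != j).any()]
        beta = rasmr(train_blocks, T=3)
        for (X_k, y_k), f_k in zip(blocks, fold_of):
            test = f_k == j
            if test.any():
                prob = 1.0 / (1.0 + np.exp(-(X_k[test] @ beta)))
                total_err += np.sum((prob > 0.5) != y_k[test])
    return total_err / N
```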

4. Simulation Studies and Numerical Justifications

4.1. Simulation Setup and Evaluation Method

To justify the performance improvement of the RASMR algorithm on data of massive sizes, we conduct extensive simulations on a main-effects GLM as follows:
$$E(Y_i) = \mu_i \quad \text{and} \quad \eta_i = g(\mu_i) = X_i^T\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id}, \qquad (8)$$
where $i = 1, \ldots, N$. Following [43,56], we choose $d = 7$, $\beta_0 = 0$, and $\beta_1 = \cdots = \beta_7 = 0.5$. The feature variables $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$ are randomly generated from one of seven different distributions, namely mzNormal, nzNormal, ueNormal, mixNormal, $T_3$, EXP, and BETA (see Section S5 in the Supplementary Materials for more details).
For Bernoulli regression models (see Example 1 in Section 4.2), we consider four commonly used link functions, namely logit, cloglog, probit, and cauchit. We first simulate $N = 10^6$ data points from each of the seven distributions under the logit link. We then obtain the data blocks $I_1, \ldots, I_K$ using K-means clustering with $K = 1000$. Assuming each of the four link functions in turn, we obtain the corresponding parameter estimates using the MR, SMR, and RASMR algorithms, respectively. We then compare the estimate $\tilde{\beta}_i$ with the full data estimate $\hat{\beta}_i$ under the corresponding link function. The estimation performance is measured by the root mean square error (RMSE), that is, $\big[\sum_{i=1}^7 (\tilde{\beta}_i - \hat{\beta}_i)^2/7\big]^{1/2}$. A smaller RMSE indicates a better approach.
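For reference, a sketch of this simulation pipeline for the mzNormal scenario under the logit link; the other six covariate distributions are set up analogously, and K-means on $10^6$ points may take a while.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, d, K = 10**6, 7, 1000
beta_true = np.r_[0.0, np.full(d, 0.5)]               # beta_0 = 0, beta_1..7 = 0.5

x = rng.standard_normal((N, d))                       # mzNormal-style covariates
X = np.column_stack([np.ones(N), x])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))         # logit link
y = rng.binomial(1, prob)

labels = KMeans(n_clusters=K, n_init=1, random_state=0).fit_predict(x)
blocks = [(X[labels == k], y[labels == k]) for k in range(K)]

def rmse_slopes(beta_est, beta_ref):
    """RMSE over the seven slope coefficients, as in Section 4.1."""
    return float(np.sqrt(np.mean((beta_est[1:] - beta_ref[1:]) ** 2)))

# beta_tilde = rasmr(blocks, T=10)   # see the Algorithm 1 sketch in Section 2.3
```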
Similarly, we consider a Poisson regression model with a log link (see Example 2 in Section 4.2) and a Gamma regression model with a reciprocal link (see Example 3 in Section 4.2). For each example in Section 4.2 and Section 4.3, the corresponding simulation is repeated 100 times. The summarized results, including the average RMSE and the sample standard deviation (std in parenthesis) of 100 RMSEs, are listed in the corresponding tables.
Note that the K-means clustering algorithm performed for each simulation study is used for illustration purposes. It generates a partition with blocks whose data points are homogeneous. Other clustering methods, such as k-medoids (see, for example, [57] and references therein), may be used for the same purpose as well. Our goal here is to evaluate the performance of MR, SMR, and RASMR given the same partition, not to compare different clustering algorithms.
For SMR, we set the number of iterations $T$ to 3 in each simulation study, as suggested by [43], since the improvement beyond the third iteration is often negligible (see Figure S12 in the Supplementary Materials). In contrast, RASMR typically continues to improve its parameter estimates as $T$ increases, provided the number of data blocks is sufficiently large (see Figure S12). Overall, we recommend setting $T$ to 10 for the RASMR algorithms to balance estimation accuracy and computational time. It should be noted that RASMR still outperforms SMR even with the same number of iterations (see Table 6).

4.2. Performance of RASMR, Algorithm 1

In this section, we evaluate the performance of Algorithm 1 on Bernoulli, Poisson, and gamma regression models, respectively.
Example 1.
In this example, we consider Bernoulli regression models (see (8)) for binary classifications, with one of four commonly used link functions, namely logit, cloglog, probit, and cauchit. The goal is to check not only the estimation performance of RASMR when the link function is correctly specified but also its performance when the link function is misspecified, which is crucial when selecting the link function.
The results of the corresponding simulation study as described in Section 4.1 are summarized in Table 2. In almost all scenarios, the RMSEs of RASMR are significantly lower than those of the mean representative (MR) approach or the original SMR approach, which implies that the RASMR algorithm (Algorithm 1) outperforms MR and the original SMR algorithm in this example. The estimates based on RASMR are not only much more accurate than the estimates using MR or SMR but also more robust.
Nevertheless, for the case with the cauchit link and nzNormal distribution, the RASMR algorithm performs as poorly as other methods. In Section 4.3, we will show further improvement by using Algorithm 2.
Example 2.
In this example, we consider Poisson regression with its canonical link function, namely the log link. The corresponding simulation is repeated 100 times, and the results are summarized in Table 3.
From Table 3 we can see that (1) in terms of the average RMSE (the smaller, the better), the original SMR outperforms MR, and RASMR further improves the accuracy of the estimates impressively, although not as accurately as the full data estimate in this case (see Table 3 in [43]); and (2) in terms of the percentage of NAs (the smaller, the more robust) out of 100 simulations, both MR and RASMR achieve 0% and outperform the original SMR.
Example 3.
In this example, we consider Gamma regression with its canonical link function, namely the reciprocal link. Since Gamma regression with the reciprocal link is relatively sensitive to the distribution of the features, we only consider features generated from the Beta distribution. Again, the data clusters are generated by K-means clustering with $K = 1000$. Similar to Examples 1 and 2, we obtain the RMSEs of the MR, SMR, and RASMR estimates from the full data estimate and summarize the results in Table 4. The pattern of the results is similar to that in Table 3, which confirms the significant improvement of RASMR over both MR and SMR under Gamma regression models.
In [43], extensive comparisons between SMR and other big data approaches have been made in terms of computing speed. According to [43], given a partition of data, SMR is noticeably faster than a divide-and-conquer (DC) approach proposed by [5], but slower than the subsampling approach under A-optimality [56] for logistic regression. Using their results of computing time comparison as a reference, we compare the computing speed of RASMR with SMR for binary classification with four link functions: logit, cloglog, probit, and cauchit. The covariates are generated from the seven distributions described previously and the computation is conducted on a PC running Windows 10 Home (Version 2004) with 2.80 GHz 4-core Intel i7-7700HQ and 16 GB of memory. The simulation is repeated 100 times and the average CPU time for one iteration of SMR and RASMR is recorded in Table 5.
According to Table 5, RASMR is slightly slower than SMR. This is because additional splits are made on a proportion of the data blocks, which results in a larger number of data blocks for RASMR given the same partition of the data. Nevertheless, based on our experience, most of the RASMR split conditions are not triggered, so the final number of data blocks for RASMR is only slightly larger than that for SMR.
Similar to [43], we calculate the corresponding RMSE from the full data estimate, $\big[\sum_{i=1}^7 (\tilde{\beta}_i - \hat{\beta}_i)^2/7\big]^{1/2}$, and the RMSE from the true parameter values, $\big[\sum_{i=1}^7 (\tilde{\beta}_i - \beta_i)^2/7\big]^{1/2}$, respectively. The simulation follows the same settings as in Example 1, and the comparisons are among MR, SMR with $T = 3$, RASMR with $T = 3$, and DC with $K = 1000$. The results are listed in Table 6.
According to Table 6, with the same number of iterations ($T = 3$), RASMR still outperforms MR, SMR, and DC. When the comparison is made with respect to the true parameter values, RASMR performs almost as well as the full data estimate, with negligible differences. Recall that the data contain all the information available for estimating the parameters, so the RASMR approach essentially attains its theoretical limit of estimation accuracy in this case.

4.3. Performance of RASMR with the Delta Ratio Split, Algorithm 2

In Section 2.4, we developed Algorithm 2 to better estimate the maximum likelihood, especially for non-Gaussian covariates or predictors. To justify the improvement of Algorithm 2 over RASMR (Algorithm 1), in this section we employ the same simulation setting as described in Section 4.1 (see Example 1). For illustration purposes, the threshold $\delta_0$ for the delta ratio split is chosen to be 0.05, 0.1, 0.5, and 1, respectively. We fit the logistic regression model using SMR, RASMR, and RASMR with the delta ratio split under various thresholds. The estimated maximum log-likelihood approaches the full data value as the delta ratio threshold decreases, except in the $T_3$ case, where the improvement is negligible (see Figure S11 in the Supplementary Materials). One reason is that in the $T_3$ case, the delta ratios are much smaller than those in other cases (see Figure S1 in the Supplementary Materials), and even a threshold of 0.05 does not capture many blocks.
To evaluate how the delta ratio split impacts the accuracy of parameter estimates, we choose K = 100 , 500 , and 1000 for K-means clustering, respectively. With moderately large K, such as K = 500 or 1000, RASMR with a delta ratio split performs roughly the same as the original RASMR (see Figure S12 in the Supplementary Materials), and both RASMR and RASMR with a delta ratio split outperform SMR. Nevertheless, when K is as small as 100, the RASMR algorithms may fail for some data structures, such as nzNormal (where the data have imbalanced responses), T 3 (where the data distribute with heavy tails), or mixNormal (where the data come from a mixture of two distributions).
In Table 7, we list the performance of RASMR with the delta ratio split and δ 0 = 0.1 when the link function is misspecified. Comparing Table 7 with Table 2, we can see that RASMR with the delta ratio split (Algorithm 2) is, in general, better than or comparable with Algorithm 1 in terms of estimation accuracy. The improvements are significant in the unbalanced or heavy-tailed cases, such as nzNormal with cauchit link, and ueNormal with cloglog or cauchit links (see Section S5 in the Supplementary Materials for more details).

5. Real Data Analysis

In this section, we use a real example—the airline on-time performance data—collected from the Bureau of Transportation Statistics (https://www.transtats.bts.gov/, accessed on 31 August 2018), to illustrate how RASMR works for analyzing real data with a massive size.

5.1. The Airline On-Time Performance Data

The airline on-time performance data contain detailed information on US domestic flights since October 1987. The original data are saved in individual files labeled by month. For illustration purposes, in this study we use data from October 1987 to August 2018, 371 files/months in total, which contain 182,751,882 flights. After removing records with missing or invalid inputs, N = 179,528,198 flights are actually considered in this study.
Following [43], we formulate the data into a binary classification problem aiming to model the flight delay status, a binary response ArrDelayLabel (arrival delay label), which is generated by cutting the continuous variable ARRIVAL DELAY at the 15-min point. For simplicity, the departure time block DepTimeBlk, originally a factor of 17 levels, is regrouped into a factor with 4 levels (1 for 12:00 a.m.–05:59 a.m., 2 for 06:00 a.m.–11:59 a.m., 3 for 12:00 p.m.–05:59 p.m., and 4 for 06:00 p.m.–11:59 p.m.). Moreover, three categorical variables (QUARTER, DayOfWeek, and DepTimeBlk) and three continuous variables (DISTANCE, DepDelay, and CRSTimeElapsed) are used as illustrations to explain the status of flight arrival delays (see Table S8 in the Supplementary Materials for a list of the variables).

5.2. Model Selection

In this section, we select the most appropriate model for the airline data, which consists of 371 months and 179,528,198 flight records.
As described in Section 3.2, we first perform the link function selection using A I C ˜ ˜ and five-fold cross-validation based on RASMR with the delta ratio split (Algorithm 2). In this step, all variables listed in Table S8 are included, and the candidate link functions include logit, cloglog, probit, and cauchit.
According to the link function selection results listed in Table 8, we select the logit link for our model (see Figure S14 in the Supplementary Materials for the fitted model based on “glm2” function in [58] and the RASMR representative data).
As described in Section 3.3, we further perform variable selection for the airline data by implementing the stepwise selection strategy with an initial screening based on MR and a fine selection using RASMR (see Section S4 in the Supplementary Materials). The selection criterion is again $\widetilde{\widetilde{AIC}}$. When using “glm2”, we estimate the dispersion parameter by Pearson's Chi-square statistic divided by its degrees of freedom (see Figure S15 in the Supplementary Materials). If $n$, $N$, and $p$ denote the number of representatives, the full data size, and the number of parameters in the model, respectively, the degrees of freedom for Pearson's Chi-square statistic are $n - p$ for RASMR and $N - p$ for the full data. To estimate the dispersion parameter of the full data, the RASMR dispersion estimate needs to be adjusted by the degrees of freedom, i.e., $\hat{\phi} = \hat{\phi}_{RASMR}\cdot (n - p)/(N - p)$, where $\hat{\phi}$ is the dispersion parameter estimated from the full data and $\hat{\phi}_{RASMR}$ is the estimate from the RASMR representative dataset. After such an adjustment, the estimated dispersion parameters are 3291.14 and 3341.67 for the model with all variables and the model with the selected variables, respectively, which indicates strong over-dispersion.
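A one-line helper for this degrees-of-freedom adjustment (the function name is ours):

```python
def adjust_dispersion(phi_rasmr, n_reps, N, p):
    """Rescale the dispersion estimated on the representative dataset
    (Pearson chi-square divided by n - p) to the full-data degrees of
    freedom N - p."""
    return phi_rasmr * (n_reps - p) / (N - p)
```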
In conclusion, the GLM after variable selection contains only one variable, namely, DepDelay (departure delay in minutes), which is the most informative variable for predicting the arrival delay of a flight.

5.3. Comparison Analysis

For performance evaluation and comparison purposes, we create an oracle model by fitting a logistic regression model using data from between March 2012 and February 2017, 60 months in total, utilizing all explanatory variables listed in Table S8. Then, the oracle response of ArrDelayLabel for the entire dataset is generated accordingly based on the oracle model.
Assuming that we do not know the true link (i.e., logit), we fit the Bernoulli model with one of four different links (namely, logit, cloglog, probit, and cauchit) using four approaches: (1) iterative reweighted least square (IRLS) on the combined data; (2) the mean representative (MR) approach; (3) the score-matching representative (SMR) approach; and (4) RASMR with the delta ratio split and exponential learning rate decay.
The data blocks for MR, SMR, and RASMR are created by the following strategy. The original airline data are saved in monthly files, which are regarded as natural data blocks. Before applying MR, SMR, or RASMR, we further split each of the monthly files according to the distinct values of three categorical variables, namely, QUARTER, DayOfWeek, and DepTimeBlk. Then, each monthly file is divided into 4 × 7 × 4 = 112 sub-blocks. We then split each sub-block further according to the three continuous variables via one of the following two schemes. The first scheme is a correlation-based quantile split. That is, we split each data block at the 25%, 50%, and 75% quantiles of the three continuous variables as the cutting points, and obtain 4 × 4 × 4 = 64 sub-blocks. The second scheme is to apply K-means clustering to the three continuous variables with K = 64 . As a result, each monthly file is divided into 112 × 64 = 7168 sub-blocks (see Section S1 in the Supplementary Materials for more discussion).
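A sketch of the quantile-split blocking for one monthly file, using the variable names listed in Table S8; crossing the three categorical variables with the quartile bins of the three continuous variables yields up to 112 × 64 = 7168 sub-blocks. The correlation-based ordering described in Section S1 is omitted here, so this is a plain quantile-split variant.

```python
import pandas as pd

def quantile_split_blocks(df):
    """Blocking for one monthly file: cross the three categorical variables,
    then cut each continuous variable at its 25%/50%/75% quantiles; returns a
    list of row-index arrays, one per sub-block."""
    cats = ["QUARTER", "DayOfWeek", "DepTimeBlk"]
    conts = ["DISTANCE", "DepDelay", "CRSTimeElapsed"]
    keys = [df[c].astype(str) for c in cats]
    for c in conts:
        bins = pd.Series(pd.qcut(df[c], q=4, labels=False, duplicates="drop"),
                         index=df.index)
        keys.append(bins.astype(str))
    block_id = keys[0].str.cat(keys[1:], sep="_")
    return [g.index.to_numpy() for _, g in df.groupby(block_id)]
```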
We compare MR, SMR, and RASMR with four different data sizes, namely, 60 months, 120 months, 240 months, and 371 months. The IRLS estimates on the combined data are referred to as the full data estimate, but only available for 60-month and 120-month datasets. As a result, in Table 9 and Table 10, we list RMSE of MR, SMR, and RASMR (with the delta ratio split) from the IRLS estimates for 60 months and 120 months, and their RMSE from the oracle parameter values for 240 months and 371 months.
According to Table 9 and Table 10, RASMR produces consistently more accurate estimates, regardless of the sample size or clustering scheme.
To further compare the convergence properties of SMR and RASMR under different link functions, we plot log (RMSE) based on SMR or RASMR with K-means clustering in Figure 2 and quantile split clustering in Figure 3, respectively, against the number of iterations. Both Figure 2 and Figure 3 show that RASMR has a better convergence property than the original SMR in this case, which makes RASMR benefit considerably from the increasing number of iterations.

6. Conclusions

The proposed RASMR approach manages to generate more accurate estimates of parameters due to better-quality representatives after additional model-specific splitting. The estimation accuracy improves along with the increasing number of iterations. For data with a relatively simple structure, typically a small number of iterations can provide accurate enough estimates for further data analysis, including model selection. For data with a more complicated structure or a larger number of parameters, we recommend an increased number of iterations and an exponential learning rate decay to ensure better convergence behavior.
For a practical implementation of RASMR, K-means clustering may consume a considerable amount of time. When a natural partition is not available, we recommend K-means clustering with a small K or a correlation-based quantile split for a faster clustering process (see Section S1 in the Supplementary Materials for more details). When the number of covariates or predictors is large, we recommend the correlation-based quantile split strategy due to its speed.
When applying RASMR to the model selection, we recommend RASMR with the delta ratio split. A default threshold of delta ratio is 1 since the delta ratios rarely go beyond 1 for typical applications. When the competing models have very close performance, we recommend a smaller threshold, such as 0.1 , to ensure better-quality representatives.
A key step with RASMR is to construct a good set of representatives and find an accurate approximation β ˜ for the MLE β ^ based on the full data. When some collinearity exists among the covariates or predictors, β ^ may not be unique and the RASMR algorithm may encounter an identifiability issue. It is worth exploring how to adjust RASMR for variable selection under the presence of collinearity.
The proposed RASMR approach is especially useful for analyzing big data with binary responses, such as intrusion detection for cyber security [59] and fraud detection for insurance companies [60]. Nevertheless, when the responses have three or more categories, the data may be modeled by multinomial logistic models [61,62,63] instead of GLMs. In this case, it is challenging to construct representatives for given data blocks that are more efficient than the corresponding mean representatives.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/a17100456/s1, including more discussions on S1. Generating Data Blocks; S2. More on Delta Ratio Split; S3. More on Link Function Selection; S4. More on Variable Selection Using RASMR; S5. More Simulation Studies; S6. Proofs and Relevant Lemmas; S7. More on Airline Data; S8. More Figures [64,65,66,67,68,69,70,71,72,73].

Author Contributions

Conceptualization, D.Z., K.L. and J.Y.; methodology, D.Z. and J.Y.; software, D.Z. and K.L.; validation, D.Z.; formal analysis, D.Z.; investigation, D.Z. and J.Y.; resources, K.L. and J.Y.; data curation, D.Z.; writing—original draft preparation, D.Z. and J.Y.; writing—review and editing, K.L. and J.Y.; visualization, D.Z.; supervision, J.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the U.S. NSF grant DMS-1924859.

Data Availability Statement

The airline on-time performance data are publicly available via the Bureau of Transportation Statistics (https://www.transtats.bts.gov/ (accessed on 31 August 2018)).

Conflicts of Interest

Author D.Z.’s work was performed while at the University of Illinois at Chicago. All authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Theorems on Solving the Score-Matching Equation

In this appendix, we provide theorems and remarks on solving the score-matching Equation (3), including explicit solutions for the normal/linear model (Theorem A2), the gamma model (Theorem A5), and the inverse Gaussian model (Theorem A6). With the suggested second split based on the sign of $y_i - G(\eta_i)$ and the further splits for Bernoulli models (Remark A2) and the Poisson model (Remark A3), we largely resolve the issue of multiple solutions to (3) in the original SMR algorithm, as explained in Section 2.2.
Theorem A1.
Suppose both $\nu(\eta)$ and $G(\eta)$ are continuous on $[\eta_k^{\min}, \eta_k^{\max}]$. If $S(\eta)$ is strictly monotone on $[\eta_k^{\min}, \eta_k^{\max}]$, then there exists a unique $\eta^* \in [\eta_k^{\min}, \eta_k^{\max}]$ that solves (3).
Theorem A2.
For the normal model with an identity link, that is, the usual linear model, there are up to two solutions solving (3):
$$\tilde{\eta}_{k,1} = \frac{1}{2}\Big(\tilde{y}_k - \sqrt{\tilde{y}_k^2 - 4\bar{S}}\Big), \qquad \tilde{\eta}_{k,2} = \frac{1}{2}\Big(\tilde{y}_k + \sqrt{\tilde{y}_k^2 - 4\bar{S}}\Big), \qquad (A1)$$
where $\bar{S} = \bar{\eta}_k\,\tilde{y}_k - n_k^{-1}\sum_{i \in I_k}\eta_i^2$ and $\bar{\eta}_k = n_k^{-1}\sum_{i \in I_k}\eta_i$. Furthermore,
(1) If $\eta_i < 0$ and $y_i > G(\eta_i)$ for all $i \in I_k$, then the only solution is $\tilde{\eta}_{k,1}$.
(2) If $\eta_i > 0$ and $y_i < G(\eta_i)$ for all $i \in I_k$, then the only solution is $\tilde{\eta}_{k,2}$.
In general, if $\tilde{y}_k/2 \notin (\eta_k^{\min}, \eta_k^{\max})$, then the solution is unique.
Note that essentially the same explicit solutions as in (A1) are described in Section 4.2.2.1 of [64].
Along with Theorem A1 and Lemmas S1–S5 in the Supplementary Materials (see Section S6), we conclude the following summarized results for Bernoulli models.
Theorem A3.
For Bernoulli models with logit, probit, cloglog, loglog, or cauchit links, a data block $I_k$ satisfying one of the following six conditions yields a unique solution solving (3):
(i) $\eta_i < \eta_l$ and $y_i < G(\eta_i)$ for all $i \in I_k$;
(ii) $\eta_l < \eta_i < 0$ and $y_i < G(\eta_i)$ for all $i \in I_k$;
(iii) $\eta_i > 0$ and $y_i < G(\eta_i)$ for all $i \in I_k$;
(iv) $\eta_i < 0$ and $y_i > G(\eta_i)$ for all $i \in I_k$;
(v) $0 < \eta_i < \eta_r$ and $y_i > G(\eta_i)$ for all $i \in I_k$;
(vi) $\eta_i > \eta_r$ and $y_i > G(\eta_i)$ for all $i \in I_k$,
where $\eta_l < 0$ and $\eta_r > 0$ are the constants listed in Table 1. A general data block under Bernoulli models may yield up to two solutions solving (3).
Remark A1.
In practice, the “<” and “>” in Theorem A3 can be relaxed to “≤” and “≥”, respectively, as long as “=” can only be attained by a small portion of the observations.
Remark A2.
Theorem A3 suggests further splitting sub-blocks at $\eta_l$ or $\eta_r$, which are constants associated with the link functions. More specifically, if $\eta_i < 0$ and $y_i < G(\eta_i)$ for all $i \in I_k$, we further split $I_k$ into two sub-blocks at the splitting point $\eta_l$ (see cases (i) and (ii)); if $\eta_i > 0$ and $y_i > G(\eta_i)$ for all $i \in I_k$, we further split $I_k$ at the splitting point $\eta_r$ (see cases (v) and (vi)). Each resulting block yields a unique solution solving (3), which is equivalent to solving $S(\eta) = \bar{S}$ (see (6)). For example, for Bernoulli models with the logit link, that is, logistic regression models, we have the following:
$$S(\eta) = \begin{cases} -\dfrac{\eta\, e^{\eta}}{1 + e^{\eta}}, & \text{for cases (i), (ii), and (iii)};\\[6pt] \dfrac{\eta}{1 + e^{\eta}}, & \text{for cases (iv), (v), and (vi)}. \end{cases}$$
Furthermore, the solutions are associated with the Lambert W-function $P(z)$, which solves $z = w e^{w}$ (see the R function lambertW in package VGAM [65]). It can be verified that, for cases (ii) and (iii), $\tilde{\eta}_k = P(-\bar{S} e^{\bar{S}}) - \bar{S}$; for cases (iv) and (v), $\tilde{\eta}_k = \bar{S} - P(-\bar{S} e^{\bar{S}})$. Nevertheless, lambertW does not provide solutions for cases (i) and (vi) for now. One may use a quasi-Newton algorithm to solve them.
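A small numerical check of these closed forms with SciPy's principal-branch Lambert W (scipy.special.lambertw); the sign placement follows our reading of the expressions above, so one case is verified against a bracketing root finder.

```python
import numpy as np
from scipy.special import lambertw        # principal branch W_0 by default
from scipy.optimize import brentq

def eta_logit_ytilde0(S_bar):
    """Cases (ii)/(iii): y_tilde = 0, solve -eta*exp(eta)/(1+exp(eta)) = S_bar."""
    return float(lambertw(-S_bar * np.exp(S_bar)).real) - S_bar

def eta_logit_ytilde1(S_bar):
    """Cases (iv)/(v): y_tilde = 1, solve eta/(1+exp(eta)) = S_bar."""
    return S_bar - float(lambertw(-S_bar * np.exp(S_bar)).real)

# numerical check: recover eta = 1 for the y_tilde = 1 case
S_bar = 1.0 / (1.0 + np.exp(1.0))                      # S(1) under y_tilde = 1
root = brentq(lambda e: e / (1.0 + np.exp(e)) - S_bar, 1e-6, 1.27)
assert abs(eta_logit_ytilde1(S_bar) - root) < 1e-6
```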
Theorem A4.
For the Poisson model with a log link, suppose either $\eta_i > 0$ for all $i \in I_k$ or $\eta_i < 0$ for all $i \in I_k$. Then, there are up to two solutions $\tilde{\eta}_{k,1} \in [\eta_k^{\min}, u(\tilde{y}_k)]$ and $\tilde{\eta}_{k,2} \in [u(\tilde{y}_k), \eta_k^{\max}]$ solving (3), where $u(y) \geq -1$ denotes the unique root of the transcendental equation $e^{\eta}(1+\eta) = y$ for given $y \geq 0$. Special cases include the following:
(i) If $y_i = 0$ and $\eta_i \geq 0$ for all $i \in I_k$, then there exists a unique solution $\tilde{\eta}_k = 1 + u(-\bar{S}/e) \geq 0$.
(ii) If $y_i = 0$ and $\eta_i \in (-1, 0)$ for all $i \in I_k$, then there exists a unique solution in $(-1, 0)$.
(iii) If $y_i = 0$ and $\eta_i \leq -1$ for all $i \in I_k$, then there exists a unique solution in $(-\infty, -1]$.
Remark A3.
Theorem A4 suggests that the observations with $y_i = 0$ be extracted and further partitioned into three sub-blocks according to whether $\eta_i \in (-\infty, -1]$, $(-1, 0)$, or $[0, \infty)$, respectively, or, equivalently, with splitting points $-1$ and $0$. Then, the $\tilde{\eta}_k$ solving (3) within each sub-block is unique and can be obtained easily. Moreover, all three special cases in Theorem A4 are related to the well-known Lambert W-function $P(z)$, which solves $z = w e^{w}$. The function “lambertW” from the R package VGAM may be used for this purpose as well.
Theorem A5.
For the gamma model with a reciprocal link, $\bar{\eta}_k = n_k^{-1}\sum_{i \in I_k}\eta_i$ is the unique solution solving (3). Furthermore, if $y_i < G(\eta_i)$ for all $i \in I_k$, then $\bar{\eta}_k \in (0, \tilde{y}_k^{-1})$; if $y_i > G(\eta_i)$ for all $i \in I_k$, then $\bar{\eta}_k \in (\tilde{y}_k^{-1}, \infty)$.
Remark A4.
For the gamma model with a reciprocal link, $\tilde{\eta}_k = \bar{\eta}_k = \bar{X}_k^T\beta$ is the only solution solving (3), where $\bar{X}_k = n_k^{-1}\sum_{i \in I_k} X_i$ is the mean representative. In this case, $\nu(\eta)$ is a constant (see Table 1 in [43]), and we have the following:
$$\tilde{X}_k = \frac{\sum_{i \in I_k}(y_i - \eta_i^{-1})\, X_i}{n_k\,(\tilde{y}_k - \bar{\eta}_k^{-1})} \qquad (A2)$$
according to (4). That is, $\tilde{X}_k \neq \bar{X}_k$ in general, as shown in (A2), even though $\tilde{\eta}_k = \bar{X}_k^T\beta$.
Theorem A6.
For the inverse Gaussian model with an inverse-square link, in general, there are up to two solutions, i.e.,
$$\tilde{\eta}_{k,1} = \frac{1 - 4\tilde{y}_k\bar{S} - \sqrt{1 - 8\tilde{y}_k\bar{S}}}{2\tilde{y}_k^2}, \qquad \tilde{\eta}_{k,2} = \frac{1 - 4\tilde{y}_k\bar{S} + \sqrt{1 - 8\tilde{y}_k\bar{S}}}{2\tilde{y}_k^2},$$
solving (3), where $\bar{S} = n_k^{-1}\sum_{i \in I_k}\frac{1}{2}\big(\eta_i^{-1/2} - \tilde{y}_k\big)\eta_i$, and $\tilde{\eta}_{k,1} \leq (4\tilde{y}_k^2)^{-1} \leq \tilde{\eta}_{k,2}$. Furthermore, if $y_i > G(\eta_i)$ for all $i \in I_k$, then $\bar{S} < 0$ and $\tilde{\eta}_k = \big(1 - 4\tilde{y}_k\bar{S} + \sqrt{1 - 8\tilde{y}_k\bar{S}}\,\big)/\big(2\tilde{y}_k^2\big)$ is the only solution.

References

  1. Dautov, R.; Distefano, S. Quantifying volume, velocity, and variety to support (big) data-intensive application development. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2843–2852. [Google Scholar]
  2. Fan, J.; Han, F.; Liu, H. Challenges of big data analysis. Natl. Sci. Rev. 2014, 1, 293–314. [Google Scholar] [CrossRef]
  3. Wang, C.; Chen, M.; Schifano, E.; Wu, J.; Yan, J. Statistical methods and computing for big data. Stat. Its Interface 2016, 9, 399–414. [Google Scholar] [CrossRef]
  4. Lin, L.; Lu, J. A race-DC in Big Data. arXiv 2019, arXiv:1911.11993. [Google Scholar]
  5. Lin, N.; Xi, R. Aggregated estimating equation estimation. Stat. Its Interface 2011, 4, 73–83. [Google Scholar] [CrossRef]
  6. Chen, X.; Xie, M.g. A split-and-conquer approach for analysis of extraordinarily large data. Stat. Sin. 2014, 24, 1655–1684. [Google Scholar]
  7. Schifano, E.; Wu, J.; Wang, C.; Yan, J.; Chen, M. Online updating of statistical inference in the big datasetting. Technometrics 2016, 58, 393–403. [Google Scholar] [CrossRef]
  8. Zhao, T.; Cheng, G.; Liu, H. A partially linear framework for massive heterogeneous data. Ann. Stat. 2016, 44, 1400–1437. [Google Scholar] [CrossRef]
  9. Lee, J.D.; Liu, Q.; Sun, Y.; Taylor, J.E. Communication-efficient sparse regression. J. Mach. Learn. Res. 2017, 18, 115–144. [Google Scholar]
  10. Battey, H.; Fan, J.; Liu, H.; Lu, J.; Zhu, Z. Distributed testing and estimation under sparse high dimensional models. Ann. Stat. 2018, 46, 1352–1382. [Google Scholar] [CrossRef]
  11. Shi, C.; Lu, W.; Song, R. A massive data framework for M-estimators with cubic-rate. J. Am. Stat. Assoc. 2018, 113, 1698–1709. [Google Scholar] [CrossRef]
  12. Chen, X.; Liu, W.; Zhang, Y. Quantile regression under memory constraint. Ann. Stat. 2019, 47, 3244–3273. [Google Scholar] [CrossRef]
  13. Ma, P.; Sun, X. Leveraging for big data regression. Wiley Interdiscip. Rev. Comput. Stat. 2015, 7, 70–76. [Google Scholar] [CrossRef]
  14. Wang, H.; Yang, M.; Stufken, J. Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 2019, 114, 393–405. [Google Scholar] [CrossRef]
  15. Cheng, Q.; Wang, H.; Yang, M. Information-based optimal subdata selection for big data logistic regression. J. Stat. Plan. Inference 2020, 209, 112–122. [Google Scholar] [CrossRef]
  16. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  17. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
  18. Breiman, L. Stacked regressions. Mach. Learn. 1996, 24, 49–64. [Google Scholar] [CrossRef]
  19. Bühlmann, P.; Meinshausen, N. Magging: Maximin aggregation for inhomogeneous large-scale data. Proc. IEEE 2015, 104, 126–135. [Google Scholar] [CrossRef]
  20. da Conceição Costa, M.; Macedo, P. Normalized entropy aggregation for inhomogeneous large-scale data. In Proceedings of the Theory and Applications of Time Series Analysis: Selected Contributions from ITISE 2018, Granada, Spain, 19–21 September 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 19–29. [Google Scholar]
  21. Costa, M.C.; Macedo, P.; Cruz, J.P. Neagging: An aggregation procedure based on normalized entropy. In Proceedings of the AIP Conference Proceedings, Rhodes, Greece, 17–23 September 2020; AIP Publishing: Melville, NY, USA, 2022; Volume 2425. [Google Scholar]
  22. Tran, D.; Toulis, P.; Airoldi, E.M. Stochastic gradient descent methods for estimation with large datasets. arXiv 2015, arXiv:1509.06459. [Google Scholar]
  23. Lin, J.; Rosasco, L. Optimal Rates for multi-pass stochastic gradient methods. J. Mach. Learn. Res. 2017, 18, 1–47. [Google Scholar]
  24. Airoldi, E.; Toulis, P. Stochastic Gradient Methods for Principled Estimation with Large Data Sets. In Handbook of Big Data; Chapman & Hall: London, UK, 2016; pp. 243–266. [Google Scholar]
  25. Konečnỳ, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
  26. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR: New York, NY, USA, 2017; pp. 1273–1282. [Google Scholar]
  27. Stich, S.U. Local SGD converges fast and communicates little. arXiv 2018, arXiv:1805.09767. [Google Scholar]
  28. Stich, S.U.; Karimireddy, S.P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed updates. J. Mach. Learn. Res. 2020, 21, 1–36. [Google Scholar]
  29. Khaled, A.; Mishchenko, K.; Richtárik, P. Tighter theory for local SGD on identical and heterogeneous data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 26–28 August 2020; PMLR: New York, NY, USA, 2020; pp. 4519–4529. [Google Scholar]
  30. Spiridonoff, A.; Olshevsky, A.; Paschalidis, Y. Communication-efficient sgd: From local sgd to one-shot averaging. Adv. Neural Inf. Process. Syst. 2021, 34, 24313–24326. [Google Scholar]
  31. Wang, J.; Joshi, G. Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms. J. Mach. Learn. Res. 2021, 22, 1–50. [Google Scholar]
  32. Zhou, F.; Cong, G. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv 2017, arXiv:1708.01012. [Google Scholar]
  33. Koloskova, A.; Loizou, N.; Boreiri, S.; Jaggi, M.; Stich, S. A unified theory of decentralized sgd with changing topology and local updates. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: New York, NY, USA, 2020; pp. 5381–5393. [Google Scholar]
  34. Jiang, P.; Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Montréal, QC, Canada, 3–8 December 2018; Curran Associates: Red Hook, NY, USA, 2018; pp. 2530–2541. [Google Scholar]
  35. Haddadpour, F.; Mahdavi, M. On the convergence of local descent methods in federated learning. arXiv 2019, arXiv:1910.14425. [Google Scholar]
  36. Zhu, Z.; Hong, J.; Zhou, J. Data-free knowledge distillation for heterogeneous federated learning. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 12878–12889. [Google Scholar]
  37. Li, A.; Sun, J.; Li, P.; Pu, Y.; Li, H.; Chen, Y. Hermes: An efficient federated learning framework for heterogeneous mobile clients. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, New Orleans, LA, USA, 31 January–4 February 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 420–437. [Google Scholar]
  38. Sery, T.; Shlezinger, N.; Cohen, K.; Eldar, Y.C. Over-the-air federated learning from heterogeneous data. IEEE Trans. Signal Process. 2021, 69, 3796–3811. [Google Scholar] [CrossRef]
  39. Mendieta, M.; Yang, T.; Wang, P.; Lee, M.; Ding, Z.; Chen, C. Local learning matters: Rethinking data heterogeneity in federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 8397–8406. [Google Scholar]
  40. Qu, L.; Zhou, Y.; Liang, P.P.; Xia, Y.; Wang, F.; Adeli, E.; Fei-Fei, L.; Rubin, D. Rethinking architecture design for tackling data heterogeneity in federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10061–10071. [Google Scholar]
  41. Fang, X.; Ye, M. Robust federated learning with noisy and heterogeneous clients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10072–10081. [Google Scholar]
  42. Ye, M.; Fang, X.; Du, B.; Yuen, P.C.; Tao, D. Heterogeneous federated learning: State-of-the-art and research challenges. Acm Comput. Surv. 2023, 56, 1–44. [Google Scholar] [CrossRef]
  43. Li, K.; Yang, J. Score-matching representative approach for big data analysis with generalized linear models. Electron. J. Stat. 2022, 16, 592–635. [Google Scholar] [CrossRef]
  44. Bowman, C. Data localization laws: An emerging global trend. JURIST–Hotline. 6 January 2017. Available online: https://www.jurist.org/commentary/2017/01/courtney-bowman-data-localization/ (accessed on 8 October 2024).
  45. Chander, A.; Lê, U.P. Data nationalism. Emory LJ 2014, 64, 677. [Google Scholar]
  46. McCullagh, P.; Nelder, J. Generalized Linear Models, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 1989. [Google Scholar]
  47. Dobson, A.; Barnett, A. An Introduction to Generalized Linear Models, 4th ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2018. [Google Scholar]
  48. Green, P.J. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. R. Stat. Soc. Ser. 1984, 46, 149–170. [Google Scholar] [CrossRef]
  49. Gentle, J.E. Matrix Algebra; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
  50. R Core Team and contributors worldwide. In The R Stats Package; R Package Version 4.5.0.; R Foundation for Statistical Computing: Vienna, Austria, 2024.
  51. Peng, L.; Kümmerle, C.; Vidal, R. On the convergence of IRLS and its variants in outlier-robust estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 17808–17818. [Google Scholar]
  52. Bengio, Y. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 437–478. [Google Scholar]
  53. You, K.; Long, M.; Wang, J.; Jordan, M.I. How does learning rate decay help modern neural networks? arXiv 2019, arXiv:1908.01878. [Google Scholar]
  54. Akaike, H. Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, 2–8 September 1971; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281. [Google Scholar]
  55. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  56. Wang, H.; Zhu, R.; Ma, P. Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 2018, 113, 829–844. [Google Scholar] [CrossRef] [PubMed]
  57. Schubert, E.; Rousseeuw, P.J. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Inf. Syst. 2021, 101, 101804. [Google Scholar] [CrossRef]
  58. Marschner, I.; Donoghoe, M.W. glm2: Fitting Generalized Linear Models; R Package Version 1.2.1.; R Foundation for Statistical Computing: Vienna, Austria, 2018. [Google Scholar]
  59. Kumar, V.S. A Big Data Analytical Framework for Intrusion Detection Based on Novel Elephant Herding Optimized Finite Dirichlet Mixture Models. Int. J. Data Inform. Intell. Comput. 2023, 2, 11–20. [Google Scholar]
  60. Jones, K.I.; Sah, S. The Implementation of Machine Learning In The Insurance Industry with Big Data Analytics. Int. J. Data Inform. Intell. Comput. 2023, 2, 21–38. [Google Scholar]
  61. Glonek, G.; McCullagh, P. Multivariate logistic models. J. R. Stat. Soc. Ser. 1995, 57, 533–546. [Google Scholar] [CrossRef]
  62. Zocchi, S.; Atkinson, A. Optimum experimental designs for multinomial logistic models. Biometrics 1999, 55, 437–444. [Google Scholar] [CrossRef]
  63. Bu, X.; Majumdar, D.; Yang, J. D-optimal Designs for Multinomial Logistic Models. Ann. Stat. 2020, 48, 983–1000. [Google Scholar] [CrossRef]
  64. Li, K. Score-Matching Representative Approach for Big Data Analysis with Generalized Linear Models. Ph.D. Thesis, University of Illinois at Chicago, Chicago, IL, USA, 2018. [Google Scholar]
  65. Yee, T.; Moler, C. VGAM: Vector Generalized Linear and Additive Models; R Package Version 1.1-11.; R Foundation for Statistical Computing: Vienna, Austria, 2024. [Google Scholar]
  66. Lohr, S. Sampling: Design and Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2019. [Google Scholar]
  67. Gordon, R. Values of Mills’ ratio of area to bounding ordinate and of the normal probability integral for large values of the argument. Ann. Math. Stat. 1941, 12, 364–366. [Google Scholar] [CrossRef]
  68. Birnbaum, Z. An inequality for Mill’s ratio. Ann. Math. Stat. 1942, 13, 245–246. [Google Scholar] [CrossRef]
  69. Mitrinovic, D.; Vasic, P. Analytic Inequalities; Springer: Berlin/Heidelberg, Germany, 1970. [Google Scholar]
  70. Baricz, A. Mills’ ratio: Monotonicity patterns and functional inequalities. J. Math. Anal. Appl. 2008, 340, 1362–1370. [Google Scholar] [CrossRef]
  71. Marshall, A.; Olkin, I. Life Distributions; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
  72. Corless, R.; Gonnet, G.; Hare, D.; Jeffrey, D.; Knuth, D. On the Lambert W function. Adv. Comput. Math. 1996, 5, 329–359. [Google Scholar] [CrossRef]
  73. Dunn, P.; Smyth, G. Generalized Linear Models with Examples in R; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
Figure 1. Graphical display for SMR and RASMR.
Figure 2. Trend of log(RMSE) for SMR and RASMR under different link functions for the first 60 months with K-means clustering.
Figure 3. Trend of log(RMSE) for SMR and RASMR estimations under different link functions for the first 60 months with quantile split clustering.
Table 1. Splitting points for Bernoulli models with different links.
Link Function | η_l for S_0(η) | η_r for S_1(η)
logit | −1.278464542761 | 1.278464542761
probit | −0.839923675692 | 0.839923675692
cloglog | −1 | 0.729114174900
loglog | −0.729114174900 | 1
cauchit | −0.801916425045 | 0.801916425045
Table 2. Average (std) of RMSEs (×10⁻³) over 100 simulations for binary classification.
SimulationBinary Classification, K-Means ( K = 1000 ), True Link Function = Logit
RepresentativesMROriginal SMRRASMR
SetupLogitCloglogProbitCauchitLogitCloglogProbitCauchitLogitCloglogProbitCauchit
mzNormal17.99.29.233.41.81.61.02.4 3.7 × 10 5 2.3 × 10 1 6.4 × 10 5 1.0 × 10 2
(0.3)(0.3)(0.2)(1.2)(0.6)(0.5)(0.3)(2.4) ( 1.0 × 10 5 ) ( 1.8 × 10 1 ) ( 2.6 × 10 5 ) ( 4.2 × 10 3 )
nzNormal14.35.07.01354.00.81.551.6 1.5 × 10 3 2.5 × 10 2 2.5 × 10 4 69.6
(0.7)(0.3)(0.3)(45.8)(1.1)(0.4)(0.5)(34.6) ( 5.0 × 10 4 ) ( 3.0 × 10 2 ) ( 7.4 × 10 5 ) (20.0)
ueNormal2111141104553.3112.119.4 7.2 × 10 5 4.3 4.7 × 10 4 10.4
(1.4)(1.5)(0.7)(5.2)(1.4)(3.4)(1.1)(10.1) ( 3.3 × 10 6 ) ( 1.2 ) ( 9.9 × 10 5 ) ( 10.2 )
mixNormal17.58.68.648.13.01.81.23.2 3.6 × 10 4 5.0 × 10 1 1.6 × 10 4 1.9 × 10 1
(0.4)(0.6)(0.2)(3.1)(0.9)(0.7)(0.3)(2.5) ( 1.6 × 10 4 ) ( 1.8 × 10 1 ) ( 1.8 × 10 4 ) ( 5.5 × 10 2 )
T 3 12.210.17.810.010.715.06.78.6 6.3 × 10 2 1.2 × 10 1 5.2 × 10 2 2.8 × 10 2
(3.1)(2.6)(1.9)(2.7)(3.1)(39.5)(1.9)(2.7) ( 3.5 × 10 2 ) ( 7.2 × 10 2 ) ( 5.2 × 10 2 ) ( 2.8 × 10 2 )
EXP12.43.96.210.45.81.42.84.5 1.4 × 10 6 3.3 × 10 4 5.3 × 10 6 9.6 × 10 3
(0.9)(0.5)(0.5)(1.4)(1.0)(0.3)(0.5)(1.4) ( 3.3 × 10 7 ) ( 8.1 × 10 5 ) ( 6.9 × 10 7 ) ( 4.4 × 10 3 )
BETA3.11.31.77.62.00.91.011.3 9.2 × 10 7 4.7 × 10 5 2.7 × 10 5 5.6 × 10 3
(0.8)(0.4)(0.5)(1.7)(0.7)(0.3)(0.3)(2.3) ( 6.2 × 10 7 ) ( 2.4 × 10 5 ) ( 2.6 × 10 6 ) ( 6.7 × 10 3 )
Table 3. Average (std) of RMSEs (×10⁻³) over 100 simulations for the Poisson regression.
Simulation: Poisson regression, K-means (K = 1000)
Setup | MR | Percentage of NAs | Original SMR | Percentage of NAs | RASMR | Percentage of NAs
mzNormal | 37.5 (12.9) | 0% | 13.2 (10.2) | 0% | 2.0 (0.5) | 0%
nzNormal | 37.5 (12.9) | 0% | 23.6 (15.7) | 3% | 0.2 (3.6 × 10⁻²) | 0%
ueNormal | 82.5 (25.9) | 0% | 9.5 (7.2) | 69% | 8.5 × 10⁻² (2.3 × 10⁻²) | 0%
mixNormal | 49.9 (20.7) | 0% | 29.5 (17.7) | 1% | 0.7 (0.1) | 0%
T_3 | 298 (458) | 0% | 97.2 (144) | 10% | 31.3 (79.1) | 0%
EXP | 31.2 (0.7) | 0% | 6.2 (1.0) | 0% | 1.9 × 10⁻⁴ (5.8 × 10⁻⁵) | 0%
BETA | 0.5 (0.2) | 0% | 2.3 (0.3) | 0% | 1.2 × 10⁻⁷ (1.8 × 10⁻⁹) | 0%
Table 4. Average (std) of RMSEs (×10⁻³) over 100 simulations for the gamma regression.
Simulation: gamma regression, K-means (K = 1000)
Setup | MR | Percentage of NAs | Original SMR | Percentage of NAs | RASMR | Percentage of NAs
Beta | 7.4 (0.7) | 0% | 5.8 (1.2) | 11% | 2.0 × 10⁻² (0.2) | 0%
Table 5. Average CPU time (seconds) over 100 simulations for binary classification.
Simulation: binary classification, K-means (K = 1000), true link function = logit
Setup | SMR (T = 3): Logit | Cloglog | Probit | Cauchit | RASMR (T = 3): Logit | Cloglog | Probit | Cauchit
mzNormal | 6.04 | 8.95 | 11.09 | 6.51 | 7.32 | 9.75 | 12.95 | 7.63
nzNormal | 5.99 | 8.90 | 12.82 | 6.22 | 6.60 | 9.82 | 13.65 | 6.70
ueNormal | 6.75 | 10.33 | 13.55 | 7.14 | 7.68 | 10.38 | 15.82 | 7.79
mixNormal | 5.90 | 8.52 | 11.26 | 6.22 | 6.63 | 9.52 | 12.64 | 7.02
T_3 | 6.32 | 8.43 | 8.10 | 6.36 | 6.77 | 9.06 | 8.50 | 7.00
EXP | 5.66 | 7.98 | 10.51 | 6.24 | 6.55 | 9.20 | 12.11 | 6.90
BETA | 5.66 | 7.96 | 10.73 | 6.36 | 6.51 | 9.10 | 12.57 | 7.01
Table 6. Average (std) of RMSEs (×10⁻³) over 100 simulations from full data estimates or true parameter value.
SimulationBinary Classification, K-Means (K = 1000), True Link Function = Logit
BenchmarkRMSE from Full Data EstimateRMSE from True Parameter Value
MethodsRepresentative ApproachesDCFull DataRepresentative ApproachesDC
Setup MRSMR ( T  = 3)RASMR ( T  = 3)MRSMR ( T  = 3)RASMR ( T  = 3)
mzNormal17.98(0.31)1.92(0.62)0.054(0.018)6.93(0.12)3.72(1.08)18.40(1.02)4.17(1.16)3.72(1.08)7.90(1.04)
nzNormal14.30(0.70)4.53(1.18)0.29(0.085)20.20(0.35)7.23(2.05)16.04(1.74)8.58(2.02)7.24(2.07)21.49(1.64)
ueNormal211.17(1.44)3.83(1.57)0.72(0.018)13.12(0.26)2.11(0.82)211.17(1.35)4.41(1.77)2.22(0.84)13.24(1.24)
mixNormal17.37(0.33)2.79(0.83)0.14(0.043)11.20(0.20)4.96(1.33)17.94(1.05)5.91(1.48)4.96(1.33)12.09(1.15)
T 3 12.23(3.13)11.3(3.23)1.89(0.74)12.06(0.34)16.00(4.43)20.51(5.44)20.03(5.54)16.03(4.51)19.63(3.76)
EXP12.4(9.15)6.58(0.82)0.014(0.0013)16.88(0.31)6.18(1.67)14.5(2.18)9.25(2.02)6.18(1.67)18.25(2.30)
BETA3.03(0.80)2.34 0.68)0.00031(0.000090)5.92(0.20)7.49(2.38)7.89(2.30)7.73(2.34)7.49(2.38)9.31(2.54)
Table 7. Average (std) of RMSEs (×10⁻³) over 100 simulations using RASMR with the delta ratio split.
SimulationK-Means (K = 1000), True Link Function = Logit
ApproachRASMR with the Delta Ratio Split, Threshold = 0.1
SetupLogitCloglogProbitCauchit
mzNormal 4.0 × 10 5 3.2 × 10 2 6.1 × 10 5 8.2 × 10 3
( 9.9 × 10 6 ) ( 3.7 × 10 2 ) ( 2.4 × 10 5 ) ( 3.2 × 10 3 )
nzNormal 1.6 × 10 3 1.5 × 10 2 2.7 × 10 4 8.8
( 5.6 × 10 4 ) ( 1.4 × 10 2 ) ( 8.5 × 10 5 ) ( 4.5 )
ueNormal 2.0 × 10 4 9.9 × 10 2 1.2 × 10 3 2.3 × 10 2
3.4 × 10 3 9.6 × 10 1 2.4 × 10 3 3.3 × 10 2
mixNormal 3.8 × 10 4 1.9 × 10 1 1.8 × 10 4 7.3 × 10 2
( 1.5 × 10 4 ) ( 1.4 × 10 1 ) ( 1.8 × 10 4 ) ( 3.0 × 10 2 )
T 3 6.3 × 10 2 1.0 × 10 1 5.2 × 10 2 2.7 × 10 2
3.4 ( × 10 2 ) ( 5.6 × 10 2 ) ( 2.5 × 10 2 ) ( 1.5 × 10 2 )
EXP 2.3 × 10 6 4.2 × 10 4 5.8 × 10 6 1.4 × 10 2
( 1.8 × 10 6 ) ( 9.0 × 10 5 ) ( 7.7 × 10 7 ) ( 9.7 × 10 3 )
BETA 4.6 × 10 6 7.3 × 10 5 2.6 × 10 5 1.0 × 10 2
( 3.9 × 10 7 ) ( 3.4 × 10 5 ) ( 3.4 × 10 6 ) ( 3.8 × 10 3 )
Table 8. Link function selection using $\widetilde{\widetilde{\mathrm{AIC}}}$ and 5-fold cross-validation with RASMR.
Link FunctionLogitCloglogProbitCauchit
A I C ˜ ˜ 902840229717250395844750113052242
5-fold CV with Cross-entropy Loss1.384411.74911.82282.1535
Table 9. Average (std) of RMSEs (×10⁻³) over 10 simulations of the airline on-time performance data using K-means clustering.
SimulationBinary Classification, K-Means (K = 64), True Link Function = Logit
RepresentativesMRSMRRASMR
SetupLogitCloglogProbitCauchitLogitCloglogProbitCauchitLogitCloglogProbitCauchit
60 months28.444.312.1135.729.3212.3 (NA removed)12.6115.51.63.00.334.3
( 4.5 × 10 4 ) ( 4.6 × 10 5 ) ( 1.2 × 10 4 ) ( 7.0 × 10 4 ) ( 2.2 × 10 3 ) 8.7 (NA removed) ( 7.6 × 10 4 ) ( 1.6 × 10 2 ) ( 1.0 × 10 5 ) ( 8.5 × 10 6 ) ( 2.0 × 10 6 ) ( 2.5 × 10 5 )
120 months28.344.312.2135.429.3212.1 (NA removed)12.6115.31.63.00.334.3
( 4.5 × 10 4 ) ( 4.5 × 10 5 ) ( 1.2 × 10 4 ) ( 7.0 × 10 4 ) ( 2.2 × 10 3 ) 8.6 (NA removed) ( 7.6 × 10 4 ) ( 1.6 × 10 2 ) ( 1.0 × 10 5 ) ( 8.5 × 10 6 ) ( 2.0 × 10 6 ) ( 2.5 × 10 5 )
240 months24.741.89.0111.433.2219.8 (NA removed)12.7142.01.63.20.331.1
( 3.3 × 10 4 ) ( 2.6 × 10 5 ) ( 8.9 × 10 5 ) ( 7.4 × 10 4 ) ( 4.3 × 10 3 ) 9.4 (NA removed) ( 7.6 × 10 4 ) ( 2.2 × 10 2 ) ( 9.7 × 10 6 ) ( 1.0 × 10 5 ) ( 2.8 × 10 6 ) ( 1.7 × 10 5 )
371 months24.241.49.0111.235.9216.1 (NA removed)12.7140.31.63.20.331.2
( 3.1 × 10 4 ) ( 2.6 × 10 5 ) ( 8.9 × 10 5 ) ( 7.4 × 10 4 ) ( 6.2 × 10 3 ) 9.4 (NA removed) ( 7.6 × 10 4 ) ( 2.0 × 10 2 ) ( 9.7 × 10 6 ) ( 1.0 × 10 5 ) ( 2.8 × 10 6 ) ( 1.6 × 10 5 )
Notes: RMSEs of 60 months and 120 months are calculated with respect to the IRLS estimates; RMSEs of 240 months and 371 months are calculated with respect to the oracle parameter values.
Table 10. Average (std) of RMSEs (×10⁻³) over 10 simulations of the airline on-time performance data using correlation-based quantile split.
SimulationBinary Classification, Correlation-Based Quantile Split, True Link Function = Logit
RepresentativesMRSMRRASMR
SetupLogitCloglogProbitCauchitLogitCloglogProbitCauchitLogitCloglogProbitCauchit
60 months62.7182.830.1376.4150.1239.249.1384.021.320.62.437.2
( 1.9 × 10 3 ) ( 7.9 × 10 2 ) ( 1.5 × 10 2 ) ( 4.2 × 10 2 ) 0.70.4 ( 2.6 × 10 2 ) 1.5 ( 1.5 × 10 4 ) ( 1.1 × 10 3 ) ( 1.2 × 10 4 ) ( 6.6 × 10 3 )
120 months62.5182.730.1374.2150.0239.549.4384.721.320.62.437.2
( 1.9 × 10 3 ) ( 7.9 × 10 2 ) ( 1.5 × 10 2 ) ( 4.2 × 10 2 ) 0.70.4 ( 2.6 × 10 2 ) 1.5 ( 1.5 × 10 4 ) ( 1.1 × 10 3 ) ( 1.2 × 10 4 ) ( 6.6 × 10 3 )
240 months60.4180.427.7385.4147.3242.148.4373.520.220.82.436.9
( 1.1 × 10 3 ) ( 8.1 × 10 2 ) ( 1.0 × 10 2 ) ( 4.4 × 10 2 ) 0.70.4 ( 2.7 × 10 2 ) 1.4 ( 1.4 × 10 4 ) ( 1.1 × 10 3 ) ( 1.2 × 10 4 ) ( 6.4 × 10 3 )
371 months60.3180.127.7385.2147.3242.148.4373.420.220.842.436.9
( 1.1 × 10 3 ) ( 8.1 × 10 2 ) ( 1.0 × 10 2 ) ( 4.4 × 10 2 ) 0.70.4 ( 2.6 × 10 2 ) 1.4 ( 1.4 × 10 4 ) ( 1.1 × 10 3 ) ( 1.2 × 10 4 ) ( 6.4 × 10 3 )
Notes: RMSEs of 60 months and 120 months are calculated with respect to the IRLS estimates; RMSEs of 240 months and 371 months are calculated with respect to the oracle parameter values.