Response-Aided Score-Matching Representative Approaches for Big Data Analysis and Model Selection under Generalized Linear Models
Abstract
1. Introduction
2. RASMR for Big Data Analysis under GLM
2.1. SMR Approach for GLM
2.2. Solving the Score-Matching Equation with Splitting Points
2.3. Response-Aided Score-Matching Representative Approach
Algorithm 1: RASMR.
Data: the full dataset together with a partition into K blocks indexed by k = 1, …, K; denote the kth data block by D_k. Result: parameter estimate of a given GLM after a predetermined number T of iterations. Initialization: calculate the initial weighted representative data, that is, the mean representatives of the K blocks with the block sizes as weights, and implement the iteratively reweighted least squares (IRLS) procedure [48] on these weighted representatives to obtain the initial estimate (a minimal code sketch of this initialization follows).
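Below is a minimal sketch of this initialization step, assuming a logistic regression model and a data frame that already carries a block label; the names dat, block, y, x1, and x2 are illustrative placeholders rather than objects from the paper.

```r
# Minimal sketch of the RASMR initialization (mean representatives + IRLS),
# assuming a binary response y, predictors x1 and x2, and a block label "block".
initial_estimate <- function(dat) {
  # Mean representatives: average the response and predictors within each block.
  reps <- aggregate(cbind(y, x1, x2) ~ block, data = dat, FUN = mean)
  # Block sizes act as the weights of the representatives.
  reps$w <- as.vector(table(dat$block)[as.character(reps$block)])
  # IRLS on the K weighted representatives (glm() uses IRLS internally) gives
  # the initial estimate; any non-integer-successes warning can be ignored here.
  fit0 <- suppressWarnings(
    glm(y ~ x1 + x2, family = binomial(link = "logit"),
        data = reps, weights = w)
  )
  coef(fit0)
}
```

The same template would extend to other GLM families by changing the family argument; the subsequent RASMR iterations then refine the fit by replacing each block mean with a score-matching representative computed at the current estimate.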
2.4. RASMR with the Delta Ratio Split
Algorithm 2: RASMR algorithm with the delta ratio split.
2.5. Learning Rate Scheduling
3. Model Selection and Variable Selection Using RASMR
3.1. Information-Based Criteria and Model Selection
3.2. Link Function Selection
3.3. Variable Selection
3.4. Cross-Validation
- Data are given with a partition into K blocks.
- A random partition {S_1, …, S_V} of the data is given for V-fold cross-validation.
- For v = 1, …, V, fit the target model on the training set (the data outside S_v) using RASMR, with the blocks restricted to the training set and empty ones removed, and then calculate the aggregated prediction error when applying the fitted model to the testing set S_v (see the sketch after this list).
- Report the average of the V aggregated prediction errors as the estimated average prediction error.
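A minimal sketch of this V-fold scheme is given below, assuming a binary response and cross-entropy loss as the prediction error; fit_fun is an illustrative placeholder standing in for an actual RASMR fit on the training blocks.

```r
# Minimal sketch of V-fold cross-validation with an aggregated prediction error,
# assuming a binary response y; fit_fun is a placeholder for a RASMR fit that
# returns an object predict() can handle.
cv_error <- function(dat, fit_fun, V = 5) {
  fold <- sample(rep_len(seq_len(V), nrow(dat)))   # random partition S_1, ..., S_V
  per_fold <- numeric(V)
  for (v in seq_len(V)) {
    train <- dat[fold != v, , drop = FALSE]
    test  <- dat[fold == v, , drop = FALSE]
    fit   <- fit_fun(train)                        # e.g., RASMR on the training blocks
    p     <- predict(fit, newdata = test, type = "response")
    # Aggregated cross-entropy loss on the held-out fold S_v
    per_fold[v] <- -sum(test$y * log(p) + (1 - test$y) * log(1 - p))
  }
  mean(per_fold)                                   # estimated average prediction error
}
```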
4. Simulation Studies and Numerical Justifications
4.1. Simulation Setup and Evaluation Method
4.2. Performance of RASMR, Algorithm 1
4.3. Performance of RASMR with the Delta Ratio Split, Algorithm 2
5. Real Data Analysis
5.1. The Airline On-Time Performance Data
5.2. Model Selection
5.3. Comparison Analysis
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Theorems on Solving the Score-Matching Equation
- (1) If … and … for all …, then the only solution is ….
- (2) If … and … for all …, then the only solution is ….
- (i) … and … for all …;
- (ii) … and … for all …;
- (iii) … and … for all …;
- (iv) … and … for all …;
- (v) … and … for all …;
- (vi) … and … for all …,
- (i) If … and … for all …, then there exists a unique solution ….
- (ii) If … and … for all …, then there exists a unique solution in ….
- (iii) If … and … for all …, then there exists a unique solution in … (a numerical sketch of locating such a root follows this list).
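The existence and uniqueness statements above guarantee that, under the corresponding sign conditions, the one-dimensional score-matching equation has exactly one root, possibly confined to a known interval. A minimal numerical sketch of exploiting such a guarantee is given below; the function g is a hypothetical stand-in for the score-matching function, not the paper's exact expression.

```r
# Illustration only: given a uniqueness guarantee on a bracketing interval,
# a standard root finder recovers the splitting point of the score-matching
# equation g(t) = 0; g below is a hypothetical monotone surrogate.
find_splitting_point <- function(g, lower, upper) {
  # uniroot() needs a sign change on [lower, upper], which the uniqueness
  # results ensure when their conditions hold.
  uniroot(g, interval = c(lower, upper), tol = 1e-10)$root
}

# Example with a monotone surrogate on (0, 1):
g <- function(t) plogis(2 * t - 1) - 0.6
find_splitting_point(g, lower = 0, upper = 1)
```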
References
- Dautov, R.; Distefano, S. Quantifying volume, velocity, and variety to support (big) data-intensive application development. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2843–2852. [Google Scholar]
- Fan, J.; Han, F.; Liu, H. Challenges of big data analysis. Natl. Sci. Rev. 2014, 1, 293–314. [Google Scholar] [CrossRef]
- Wang, C.; Chen, M.; Schifano, E.; Wu, J.; Yan, J. Statistical methods and computing for big data. Stat. Its Interface 2016, 9, 399–414. [Google Scholar] [CrossRef]
- Lin, L.; Lu, J. A race-DC in Big Data. arXiv 2019, arXiv:1911.11993. [Google Scholar]
- Lin, N.; Xi, R. Aggregated estimating equation estimation. Stat. Its Interface 2011, 4, 73–83. [Google Scholar] [CrossRef]
- Chen, X.; Xie, M.-g. A split-and-conquer approach for analysis of extraordinarily large data. Stat. Sin. 2014, 24, 1655–1684. [Google Scholar]
- Schifano, E.; Wu, J.; Wang, C.; Yan, J.; Chen, M. Online updating of statistical inference in the big data setting. Technometrics 2016, 58, 393–403. [Google Scholar] [CrossRef]
- Zhao, T.; Cheng, G.; Liu, H. A partially linear framework for massive heterogeneous data. Ann. Stat. 2016, 44, 1400–1437. [Google Scholar] [CrossRef]
- Lee, J.D.; Liu, Q.; Sun, Y.; Taylor, J.E. Communication-efficient sparse regression. J. Mach. Learn. Res. 2017, 18, 115–144. [Google Scholar]
- Battey, H.; Fan, J.; Liu, H.; Lu, J.; Zhu, Z. Distributed testing and estimation under sparse high dimensional models. Ann. Stat. 2018, 46, 1352–1382. [Google Scholar] [CrossRef]
- Shi, C.; Lu, W.; Song, R. A massive data framework for M-estimators with cubic-rate. J. Am. Stat. Assoc. 2018, 113, 1698–1709. [Google Scholar] [CrossRef]
- Chen, X.; Liu, W.; Zhang, Y. Quantile regression under memory constraint. Ann. Stat. 2019, 47, 3244–3273. [Google Scholar] [CrossRef]
- Ma, P.; Sun, X. Leveraging for big data regression. Wiley Interdiscip. Rev. Comput. Stat. 2015, 7, 70–76. [Google Scholar] [CrossRef]
- Wang, H.; Yang, M.; Stufken, J. Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 2019, 114, 393–405. [Google Scholar] [CrossRef]
- Cheng, Q.; Wang, H.; Yang, M. Information-based optimal subdata selection for big data logistic regression. J. Stat. Plan. Inference 2020, 209, 112–122. [Google Scholar] [CrossRef]
- Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
- Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
- Breiman, L. Stacked regressions. Mach. Learn. 1996, 24, 49–64. [Google Scholar] [CrossRef]
- Bühlmann, P.; Meinshausen, N. Magging: Maximin aggregation for inhomogeneous large-scale data. Proc. IEEE 2015, 104, 126–135. [Google Scholar] [CrossRef]
- da Conceição Costa, M.; Macedo, P. Normalized entropy aggregation for inhomogeneous large-scale data. In Proceedings of the Theory and Applications of Time Series Analysis: Selected Contributions from ITISE 2018, Granada, Spain, 19–21 September 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 19–29. [Google Scholar]
- Costa, M.C.; Macedo, P.; Cruz, J.P. Neagging: An aggregation procedure based on normalized entropy. In Proceedings of the AIP Conference Proceedings, Rhodes, Greece, 17–23 September 2020; AIP Publishing: Melville, NY, USA, 2022; Volume 2425. [Google Scholar]
- Tran, D.; Toulis, P.; Airoldi, E.M. Stochastic gradient descent methods for estimation with large datasets. arXiv 2015, arXiv:1509.06459. [Google Scholar]
- Lin, J.; Rosasco, L. Optimal rates for multi-pass stochastic gradient methods. J. Mach. Learn. Res. 2017, 18, 1–47. [Google Scholar]
- Airoldi, E.; Toulis, P. Stochastic Gradient Methods for Principled Estimation with Large Data Sets. In Handbook of Big Data; Chapman & Hall: London, UK, 2016; pp. 243–266. [Google Scholar]
- Konečnỳ, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR: New York, NY, USA, 2017; pp. 1273–1282. [Google Scholar]
- Stich, S.U. Local SGD converges fast and communicates little. arXiv 2018, arXiv:1805.09767. [Google Scholar]
- Stich, S.U.; Karimireddy, S.P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed updates. J. Mach. Learn. Res. 2020, 21, 1–36. [Google Scholar]
- Khaled, A.; Mishchenko, K.; Richtárik, P. Tighter theory for local SGD on identical and heterogeneous data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 26–28 August 2020; PMLR: New York, NY, USA, 2020; pp. 4519–4529. [Google Scholar]
- Spiridonoff, A.; Olshevsky, A.; Paschalidis, Y. Communication-efficient SGD: From local SGD to one-shot averaging. Adv. Neural Inf. Process. Syst. 2021, 34, 24313–24326. [Google Scholar]
- Wang, J.; Joshi, G. Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms. J. Mach. Learn. Res. 2021, 22, 1–50. [Google Scholar]
- Zhou, F.; Cong, G. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv 2017, arXiv:1708.01012. [Google Scholar]
- Koloskova, A.; Loizou, N.; Boreiri, S.; Jaggi, M.; Stich, S. A unified theory of decentralized SGD with changing topology and local updates. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: New York, NY, USA, 2020; pp. 5381–5393. [Google Scholar]
- Jiang, P.; Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Montréal, QC, Canada, 3–8 December 2018; Curran Associates: Red Hook, NY, USA, 2018; pp. 2530–2541. [Google Scholar]
- Haddadpour, F.; Mahdavi, M. On the convergence of local descent methods in federated learning. arXiv 2019, arXiv:1910.14425. [Google Scholar]
- Zhu, Z.; Hong, J.; Zhou, J. Data-free knowledge distillation for heterogeneous federated learning. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 12878–12889. [Google Scholar]
- Li, A.; Sun, J.; Li, P.; Pu, Y.; Li, H.; Chen, Y. Hermes: An efficient federated learning framework for heterogeneous mobile clients. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, New Orleans, LA, USA, 31 January–4 February 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 420–437. [Google Scholar]
- Sery, T.; Shlezinger, N.; Cohen, K.; Eldar, Y.C. Over-the-air federated learning from heterogeneous data. IEEE Trans. Signal Process. 2021, 69, 3796–3811. [Google Scholar] [CrossRef]
- Mendieta, M.; Yang, T.; Wang, P.; Lee, M.; Ding, Z.; Chen, C. Local learning matters: Rethinking data heterogeneity in federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 8397–8406. [Google Scholar]
- Qu, L.; Zhou, Y.; Liang, P.P.; Xia, Y.; Wang, F.; Adeli, E.; Fei-Fei, L.; Rubin, D. Rethinking architecture design for tackling data heterogeneity in federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10061–10071. [Google Scholar]
- Fang, X.; Ye, M. Robust federated learning with noisy and heterogeneous clients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10072–10081. [Google Scholar]
- Ye, M.; Fang, X.; Du, B.; Yuen, P.C.; Tao, D. Heterogeneous federated learning: State-of-the-art and research challenges. Acm Comput. Surv. 2023, 56, 1–44. [Google Scholar] [CrossRef]
- Li, K.; Yang, J. Score-matching representative approach for big data analysis with generalized linear models. Electron. J. Stat. 2022, 16, 592–635. [Google Scholar] [CrossRef]
- Bowman, C. Data localization laws: An emerging global trend. JURIST–Hotline. 6 January 2017. Available online: https://www.jurist.org/commentary/2017/01/courtney-bowman-data-localization/ (accessed on 8 October 2024).
- Chander, A.; Lê, U.P. Data nationalism. Emory LJ 2014, 64, 677. [Google Scholar]
- McCullagh, P.; Nelder, J. Generalized Linear Models, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 1989. [Google Scholar]
- Dobson, A.; Barnett, A. An Introduction to Generalized Linear Models, 4th ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2018. [Google Scholar]
- Green, P.J. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. R. Stat. Soc. Ser. B 1984, 46, 149–170. [Google Scholar] [CrossRef]
- Gentle, J.E. Matrix Algebra; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
- R Core Team and contributors worldwide. The R Stats Package; R Package Version 4.5.0; R Foundation for Statistical Computing: Vienna, Austria, 2024.
- Peng, L.; Kümmerle, C.; Vidal, R. On the convergence of IRLS and its variants in outlier-robust estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 17808–17818. [Google Scholar]
- Bengio, Y. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 437–478. [Google Scholar]
- You, K.; Long, M.; Wang, J.; Jordan, M.I. How does learning rate decay help modern neural networks? arXiv 2019, arXiv:1908.01878. [Google Scholar]
- Akaike, H. Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, 2–8 September 1971; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281. [Google Scholar]
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
- Wang, H.; Zhu, R.; Ma, P. Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 2018, 113, 829–844. [Google Scholar] [CrossRef] [PubMed]
- Schubert, E.; Rousseeuw, P.J. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Inf. Syst. 2021, 101, 101804. [Google Scholar] [CrossRef]
- Marschner, I.; Donoghoe, M.W. glm2: Fitting Generalized Linear Models; R Package Version 1.2.1.; R Foundation for Statistical Computing: Vienna, Austria, 2018. [Google Scholar]
- Kumar, V.S. A Big Data Analytical Framework for Intrusion Detection Based on Novel Elephant Herding Optimized Finite Dirichlet Mixture Models. Int. J. Data Inform. Intell. Comput. 2023, 2, 11–20. [Google Scholar]
- Jones, K.I.; Sah, S. The Implementation of Machine Learning In The Insurance Industry with Big Data Analytics. Int. J. Data Inform. Intell. Comput. 2023, 2, 21–38. [Google Scholar]
- Glonek, G.; McCullagh, P. Multivariate logistic models. J. R. Stat. Soc. Ser. B 1995, 57, 533–546. [Google Scholar] [CrossRef]
- Zocchi, S.; Atkinson, A. Optimum experimental designs for multinomial logistic models. Biometrics 1999, 55, 437–444. [Google Scholar] [CrossRef]
- Bu, X.; Majumdar, D.; Yang, J. D-optimal Designs for Multinomial Logistic Models. Ann. Stat. 2020, 48, 983–1000. [Google Scholar] [CrossRef]
- Li, K. Score-Matching Representative Approach for Big Data Analysis with Generalized Linear Models. Ph.D. Thesis, University of Illinois at Chicago, Chicago, IL, USA, 2018. [Google Scholar]
- Yee, T.; Moler, C. VGAM: Vector Generalized Linear and Additive Models; R Package Version 1.1-11.; R Foundation for Statistical Computing: Vienna, Austria, 2024. [Google Scholar]
- Lohr, S. Sampling: Design and Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2019. [Google Scholar]
- Gordon, R. Values of Mills’ ratio of area to bounding ordinate and of the normal probability integral for large values of the argument. Ann. Math. Stat. 1941, 12, 364–366. [Google Scholar] [CrossRef]
- Birnbaum, Z. An inequality for Mill’s ratio. Ann. Math. Stat. 1942, 13, 245–246. [Google Scholar] [CrossRef]
- Mitrinovic, D.; Vasic, P. Analytic Inequalities; Springer: Berlin/Heidelberg, Germany, 1970. [Google Scholar]
- Baricz, A. Mills’ ratio: Monotonicity patterns and functional inequalities. J. Math. Anal. Appl. 2008, 340, 1362–1370. [Google Scholar] [CrossRef]
- Marshall, A.; Olkin, I. Life Distributions; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
- Corless, R.; Gonnet, G.; Hare, D.; Jeffrey, D.; Knuth, D. On the Lambert W function. Adv. Comput. Math. 1996, 5, 329–359. [Google Scholar] [CrossRef]
- Dunn, P.; Smyth, G. Generalized Linear Models with Examples in R; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
Link Function | for | for |
---|---|---|
logit | −1.278464542761 | 1.278464542761 |
probit | −0.839923675692 | 0.839923675692 |
cloglog | −1 | 0.729114174900 |
loglog | −0.729114174900 | 1 |
cauchit | −0.801916425045 | 0.801916425045 |
Simulation | Binary Classification, K-Means (), True Link Function = Logit | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Representatives | MR | Original SMR | RASMR | |||||||||
Setup | Logit | Cloglog | Probit | Cauchit | Logit | Cloglog | Probit | Cauchit | Logit | Cloglog | Probit | Cauchit |
mzNormal | 17.9 | 9.2 | 9.2 | 33.4 | 1.8 | 1.6 | 1.0 | 2.4 | ||||
(0.3) | (0.3) | (0.2) | (1.2) | (0.6) | (0.5) | (0.3) | (2.4) | |||||
nzNormal | 14.3 | 5.0 | 7.0 | 135 | 4.0 | 0.8 | 1.5 | 51.6 | 69.6 | |||
(0.7) | (0.3) | (0.3) | (45.8) | (1.1) | (0.4) | (0.5) | (34.6) | (20.0) | ||||
ueNormal | 211 | 114 | 110 | 455 | 3.3 | 11 | 2.1 | 19.4 | ||||
(1.4) | (1.5) | (0.7) | (5.2) | (1.4) | (3.4) | (1.1) | (10.1) | |||||
mixNormal | 17.5 | 8.6 | 8.6 | 48.1 | 3.0 | 1.8 | 1.2 | 3.2 | ||||
(0.4) | (0.6) | (0.2) | (3.1) | (0.9) | (0.7) | (0.3) | (2.5) | |||||
12.2 | 10.1 | 7.8 | 10.0 | 10.7 | 15.0 | 6.7 | 8.6 | |||||
(3.1) | (2.6) | (1.9) | (2.7) | (3.1) | (39.5) | (1.9) | (2.7) | |||||
EXP | 12.4 | 3.9 | 6.2 | 10.4 | 5.8 | 1.4 | 2.8 | 4.5 | ||||
(0.9) | (0.5) | (0.5) | (1.4) | (1.0) | (0.3) | (0.5) | (1.4) | |||||
BETA | 3.1 | 1.3 | 1.7 | 7.6 | 2.0 | 0.9 | 1.0 | 11.3 | ||||
(0.8) | (0.4) | (0.5) | (1.7) | (0.7) | (0.3) | (0.3) | (2.3) |
Simulation | Poisson Regression, K-Means () | |||||
---|---|---|---|---|---|---|
Representatives | MR | Percentage of NAs | Original SMR | Percentage of NAs | RASMR | Percentage of NAs |
mzNormal | 37.5 (12.9) | 0% | 13.2 (10.2) | 0% | 2.0 (0.5) | 0% |
nzNormal | 37.5 (12.9) | 0% | 23.6 (15.7) | 3% | 0.2 | 0% |
ueNormal | 82.5 (25.9) | 0% | 9.5 (7.2) | 69% | 0% | |
mixNormal | 49.9 (20.7) | 0% | 29.5 (17.7) | 1% | 0.7 (0.1) | 0% |
298 (458) | 0% | 97.2 (144) | 10% | 31.3 (79.1) | 0% | |
EXP | 31.2 (0.7) | 0% | 6.2 (1.0) | 0% | 0% | |
BETA | 0.5 (0.2) | 0% | 2.3 (0.3) | 0% | 0% |
Simulation | Gamma Regression, K-Means () | |||||
---|---|---|---|---|---|---|
Representatives | MR | Percentage of NAs | Original SMR | Percentage of NAs | RASMR | Percentage of NAs |
Beta | 7.4 (0.7) | 0% | 5.8 (1.2) | 11% | (0.2) | 0% |
Simulation | Binary Classification, K-Means (K = 1000), True Link Function = Logit | |||||||
---|---|---|---|---|---|---|---|---|
Representatives | SMR ( = 3) | RASMR ( = 3) | ||||||
Setup | Logit | Cloglog | Probit | Cauchit | Logit | Cloglog | Probit | Cauchit |
mzNormal | 6.04 | 8.95 | 11.09 | 6.51 | 7.32 | 9.75 | 12.95 | 7.63 |
nzNormal | 5.99 | 8.90 | 12.82 | 6.22 | 6.60 | 9.82 | 13.65 | 6.70 |
ueNormal | 6.75 | 10.33 | 13.55 | 7.14 | 7.68 | 10.38 | 15.82 | 7.79 |
mixNormal | 5.90 | 8.52 | 11.26 | 6.22 | 6.63 | 9.52 | 12.64 | 7.02 |
6.32 | 8.43 | 8.10 | 6.36 | 6.77 | 9.06 | 8.50 | 7.00 | |
EXP | 5.66 | 7.98 | 10.51 | 6.24 | 6.55 | 9.20 | 12.11 | 6.90 |
BETA | 5.66 | 7.96 | 10.73 | 6.36 | 6.51 | 9.10 | 12.57 | 7.01 |
Simulation | Binary Classification, K-Means (K = 1000), True Link Function = Logit | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Benchmark | RMSE from Full Data Estimate | RMSE from True Parameter Value | ||||||||
Methods | Representative Approaches | DC | Full Data | Representative Approaches | DC | |||||
Setup | MR | SMR ( = 3) | RASMR ( = 3) | MR | SMR ( = 3) | RASMR ( = 3) | ||||
mzNormal | 17.98(0.31) | 1.92(0.62) | 0.054(0.018) | 6.93(0.12) | 3.72(1.08) | 18.40(1.02) | 4.17(1.16) | 3.72(1.08) | 7.90(1.04) | |
nzNormal | 14.30(0.70) | 4.53(1.18) | 0.29(0.085) | 20.20(0.35) | 7.23(2.05) | 16.04(1.74) | 8.58(2.02) | 7.24(2.07) | 21.49(1.64) | |
ueNormal | 211.17(1.44) | 3.83(1.57) | 0.72(0.018) | 13.12(0.26) | 2.11(0.82) | 211.17(1.35) | 4.41(1.77) | 2.22(0.84) | 13.24(1.24) | |
mixNormal | 17.37(0.33) | 2.79(0.83) | 0.14(0.043) | 11.20(0.20) | 4.96(1.33) | 17.94(1.05) | 5.91(1.48) | 4.96(1.33) | 12.09(1.15) | |
12.23(3.13) | 11.3(3.23) | 1.89(0.74) | 12.06(0.34) | 16.00(4.43) | 20.51(5.44) | 20.03(5.54) | 16.03(4.51) | 19.63(3.76) | ||
EXP | 12.4(9.15) | 6.58(0.82) | 0.014(0.0013) | 16.88(0.31) | 6.18(1.67) | 14.5(2.18) | 9.25(2.02) | 6.18(1.67) | 18.25(2.30) | |
BETA | 3.03(0.80) | 2.34(0.68) | 0.00031(0.000090) | 5.92(0.20) | 7.49(2.38) | 7.89(2.30) | 7.73(2.34) | 7.49(2.38) | 9.31(2.54)
Simulation | K-Means (K = 1000), True Link Function = Logit | |||
---|---|---|---|---|
Approach | RASMR with the Delta Ratio Split, Threshold = | |||
Setup | Logit | Cloglog | Probit | Cauchit |
mzNormal | ||||
nzNormal | ||||
ueNormal | ||||
mixNormal | ||||
EXP | ||||
BETA | ||||
Link Function | Logit | Cloglog | Probit | Cauchit |
---|---|---|---|---|
90284022 | 97172503 | 95844750 | 113052242 | |
5-fold CV with Cross-entropy Loss | 1.38441 | 1.7491 | 1.8228 | 2.1535 |
Simulation | Binary Classification, K-Means (K = 64), True Link Function = Logit | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Representatives | MR | SMR | RASMR | |||||||||
Setup | Logit | Cloglog | Probit | Cauchit | Logit | Cloglog | Probit | Cauchit | Logit | Cloglog | Probit | Cauchit |
60 months | 28.4 | 44.3 | 12.1 | 135.7 | 29.3 | 212.3 (NA removed) | 12.6 | 115.5 | 1.6 | 3.0 | 0.3 | 34.3 |
8.7 (NA removed) | ||||||||||||
120 months | 28.3 | 44.3 | 12.2 | 135.4 | 29.3 | 212.1 (NA removed) | 12.6 | 115.3 | 1.6 | 3.0 | 0.3 | 34.3 |
8.6 (NA removed) | ||||||||||||
240 months | 24.7 | 41.8 | 9.0 | 111.4 | 33.2 | 219.8 (NA removed) | 12.7 | 142.0 | 1.6 | 3.2 | 0.3 | 31.1 |
9.4 (NA removed) | ||||||||||||
371 months | 24.2 | 41.4 | 9.0 | 111.2 | 35.9 | 216.1 (NA removed) | 12.7 | 140.3 | 1.6 | 3.2 | 0.3 | 31.2 |
9.4 (NA removed) |
Simulation | Binary Classification, Correlation-Based Quantile Split, True Link Function = Logit | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Representatives | MR | SMR | RASMR | |||||||||
Setup | Logit | Cloglog | Probit | Cauchit | Logit | Cloglog | Probit | Cauchit | Logit | Cloglog | Probit | Cauchit |
60 months | 62.7 | 182.8 | 30.1 | 376.4 | 150.1 | 239.2 | 49.1 | 384.0 | 21.3 | 20.6 | 2.4 | 37.2 |
0.7 | 0.4 | 1.5 | ||||||||||
120 months | 62.5 | 182.7 | 30.1 | 374.2 | 150.0 | 239.5 | 49.4 | 384.7 | 21.3 | 20.6 | 2.4 | 37.2 |
0.7 | 0.4 | 1.5 | ||||||||||
240 months | 60.4 | 180.4 | 27.7 | 385.4 | 147.3 | 242.1 | 48.4 | 373.5 | 20.2 | 20.8 | 2.4 | 36.9 |
0.7 | 0.4 | 1.4 | ||||||||||
371 months | 60.3 | 180.1 | 27.7 | 385.2 | 147.3 | 242.1 | 48.4 | 373.4 | 20.2 | 20.84 | 2.4 | 36.9 |
0.7 | 0.4 | 1.4 |