Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data
Abstract
1. Introduction
2. Material and Method
2.1. Multinomial Logistic Models
2.2. The Most Appropriate Order for Categorical Responses
2.3. Sparse Variable Screening Using the AZIAD Package
2.4. Best Number of Covariates for Categorical Regression Analysis
2.5. Backward Variable Selection Based on AIC
3. RNA-Seq Gene Expression Data
4. Data Analysis and Results
4.1. Model Selection and Variable Selection for Sparse Genes
4.2. Order Selection for Response Categories
4.3. Backward Variable Selected Models
4.4. Final Models
4.5. Prognostic Multi-Gene Signatures
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. List of 31 Selected Genes for the Adjacent-Categories po Model (See Section 4.3 and Section 4.5)
Appendix B. List of 74 Selected Genes for the Adjacent-Categories npo Model (See Section 4.3 and Section 4.5)
Appendix C. Fitted Adjacent-Categories npo Model with 74 Selected Genes (See Section 4.4)
k | Name | | | | |
---|---|---|---|---|---|
1 | Intercept | −19.842 | −1.624 | 9.602 | −10.581 |
2 | gene_706 | −0.232 | 0.0232 | −0.592 | 0.281 |
3 | gene_742 | −0.549 | −0.651 | −0.406 | 0.355 |
4 | gene_1510 | 0.0648 | 0.0107 | 0.0758 | −0.344 |
5 | gene_2288 | −0.791 | −0.277 | −0.493 | 1.134 |
6 | gene_3439 | 0.113 | 0.897 | −0.997 | −0.00296 |
7 | gene_3461 | 0.272 | 0.028 | −0.389 | 0.531 |
8 | gene_3552 | 0.235 | −0.398 | 0.428 | −0.432 |
9 | gene_3598 | 0.225 | −0.142 | −0.583 | 0.983 |
10 | gene_3737 | 0.431 | −0.394 | 0.0109 | 0.827 |
11 | gene_3836 | 0.297 | −0.546 | 0.538 | −0.162 |
12 | gene_3862 | −0.463 | 0.207 | −0.372 | 0.0824 |
13 | gene_4223 | −0.521 | 0.313 | −1.592 | 1.269 |
14 | gene_4467 | 0.167 | 0.139 | 0.500 | −0.673 |
15 | gene_4618 | −0.669 | 0.358 | −2.068 | 1.973 |
16 | gene_4640 | −0.738 | 0.711 | −0.173 | 0.168 |
17 | gene_4833 | 0.286 | 0.317 | −0.611 | 0.229 |
18 | gene_4979 | −0.121 | 0.390 | −1.267 | 1.178 |
19 | gene_5050 | −0.100 | 0.201 | −0.256 | 0.222 |
20 | gene_5394 | 0.124 | 0.150 | −1.162 | 1.046 |
21 | gene_6162 | −1.111 | −0.530 | −0.306 | 0.327 |
22 | gene_6226 | 0.135 | 0.130 | 0.663 | −0.826 |
23 | gene_6722 | −0.392 | 0.356 | −0.234 | 0.0363 |
24 | gene_6838 | 0.525 | −0.276 | 0.597 | −0.494 |
25 | gene_6890 | −0.159 | 0.0228 | −0.112 | −0.333 |
26 | gene_7235 | −1.114 | 0.178 | −0.0569 | 0.193 |
27 | gene_7560 | 0.160 | −0.275 | 0.206 | −0.392 |
28 | gene_7792 | 0.331 | −0.0609 | −0.253 | 0.430 |
29 | gene_7964 | 1.010 | 0.529 | −0.142 | −0.498 |
30 | gene_7965 | 1.559 | 0.139 | 0.745 | −0.534 |
31 | gene_8003 | 0.042 | −0.319 | 0.188 | −0.185 |
32 | gene_8349 | 0.488 | −0.235 | −0.0766 | 0.273 |
33 | gene_8891 | 0.371 | 0.111 | 1.253 | −1.249 |
34 | gene_9175 | 0.0943 | −0.219 | 1.001 | 0.192 |
35 | gene_9176 | 0.0486 | 0.211 | −1.320 | 0.985 |
36 | gene_9181 | 0.0786 | −0.196 | 0.139 | −0.159 |
37 | gene_9626 | 0.429 | 0.0889 | 0.204 | 0.169 |
38 | gene_9680 | −0.189 | 0.00538 | 0.364 | −0.354 |
39 | gene_9979 | −0.201 | −0.139 | 0.904 | −0.950 |
40 | gene_10061 | −0.544 | 0.389 | −1.463 | 0.909 |
41 | gene_10284 | −0.0725 | 0.0631 | −0.593 | 0.712 |
42 | gene_10460 | 0.526 | −1.255 | −0.416 | 1.049 |
43 | gene_10489 | −0.862 | 0.466 | −0.206 | 0.231 |
44 | gene_10809 | −0.0157 | 0.00227 | −0.662 | 0.613 |
45 | gene_10950 | 0.127 | 0.219 | −0.186 | 0.244 |
46 | gene_11440 | −0.548 | 0.385 | 0.0711 | −0.231 |
47 | gene_11449 | 0.366 | 0.0614 | −0.336 | −0.542 |
48 | gene_11566 | 0.0015 | −0.958 | 0.320 | 0.210 |
49 | gene_12013 | 0.810 | −0.662 | −0.391 | 0.499 |
50 | gene_12068 | 0.233 | 0.171 | 0.477 | −0.391 |
51 | gene_12695 | 0.544 | −0.276 | 0.426 | −0.101 |
52 | gene_12977 | −0.173 | 0.319 | −1.094 | 0.702 |
53 | gene_12995 | −0.0335 | −0.032 | −0.0601 | 0.207 |
54 | gene_13210 | 0.736 | −0.422 | 1.404 | −0.650 |
55 | gene_13497 | 0.0561 | −0.175 | 0.335 | −0.142 |
56 | gene_14569 | 0.374 | −0.000751 | 0.077 | −0.262 |
57 | gene_14646 | 0.269 | −0.328 | −0.199 | 0.212 |
58 | gene_14866 | −0.00596 | 0.307 | 1.347 | −1.000 |
59 | gene_15447 | −0.209 | 0.00293 | 0.0196 | 0.0194 |
60 | gene_15633 | −0.000597 | −0.0341 | −0.0535 | −0.122 |
61 | gene_15894 | −0.0516 | −0.309 | 0.732 | −0.596 |
62 | gene_15896 | −0.0755 | 0.0453 | 0.268 | −0.332 |
63 | gene_15898 | 0.640 | 0.272 | 0.674 | −0.676 |
64 | gene_15945 | −1.183 | 0.683 | 0.232 | −0.576 |
65 | gene_16169 | −0.633 | 0.213 | −0.944 | 0.848 |
66 | gene_16246 | −0.248 | 0.646 | −1.134 | 0.682 |
67 | gene_16337 | −0.384 | −0.0586 | 0.373 | −0.220 |
68 | gene_16392 | 0.529 | 0.680 | 0.118 | −0.271 |
69 | gene_16817 | 0.0321 | −0.0286 | −0.127 | −0.353 |
70 | gene_17688 | 1.044 | −0.438 | 1.351 | −1.215 |
71 | gene_17801 | 0.680 | −0.0994 | 0.475 | −0.0449 |
72 | gene_17949 | 0.357 | 0.176 | 0.356 | −0.456 |
73 | gene_19236 | −0.117 | −0.0792 | −0.0906 | −0.173 |
74 | gene_19661 | −0.280 | 0.325 | −0.501 | 0.327 |
75 | gene_20476 | 0.151 | −0.0272 | 0.597 | −0.542 |
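
As a rough illustration of how a coefficient table like the one above can be turned into class probabilities, the sketch below assumes that each of the four numeric columns holds the coefficients of one adjacent-categories logit, written as log(π_j/π_{j+1}) for j = 1, ..., 4, so that four logits imply five response categories. The column labels did not survive in this copy, and some software defines the logit with the reverse ratio, so the orientation is an assumption. The function names `linear_predictors` and `adjacent_category_probs` and the toy coefficients are hypothetical and are not taken from the fitted model.

```python
import math


def linear_predictors(x, beta):
    """Compute one linear predictor per logit.

    x    : covariate values, with a leading 1.0 for the intercept.
    beta : one row per covariate (same order as x); each row holds the
           J-1 coefficients, one per adjacent-categories logit.
    """
    n_logits = len(beta[0])
    return [sum(x_k * row[j] for x_k, row in zip(x, beta))
            for j in range(n_logits)]


def adjacent_category_probs(eta):
    """Convert J-1 adjacent-categories logits into J class probabilities.

    Assumption: eta[j] = log(pi_j / pi_{j+1}), so that
    log(pi_j / pi_J) is the sum of eta[j], ..., eta[J-2].
    """
    # Unnormalised log-probabilities relative to the last category.
    log_unnorm = [sum(eta[j:]) for j in range(len(eta))] + [0.0]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_unnorm)
    weights = [math.exp(v - m) for v in log_unnorm]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: an intercept plus one gene, with hypothetical coefficients.
toy_beta = [
    [0.5, -0.5, 0.2, 0.0],   # intercept row (hypothetical values)
    [0.25, 0.0, -0.1, 0.3],  # one gene row (hypothetical values)
]
toy_x = [1.0, 2.0]           # 1.0 for the intercept, 2.0 as expression level
eta = linear_predictors(toy_x, toy_beta)
probs = adjacent_category_probs(eta)
print(probs)
```

Under the npo (non-proportional odds) structure each logit has its own coefficient for every gene, which is why the table needs four numbers per row; a po model would share one coefficient per gene across all logits.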
Top t Genes | Best Models | Cross-Entropy Loss | Error Count/Rate |
---|---|---|---|
25 | Baseline-category npo | 1.50 | 40/801 = 0.050 |
30 | Adjacent-categories npo | 0.59 | 11/801 = 0.014 |
30 | Baseline-category npo | 0.59 | 11/801 = 0.014 |
50 | Adjacent-categories po | 1.88 | 58/801 = 0.072 |
50 | Adjacent-categories npo | 0.14 | 7/801 = 0.0087 |
50 | Baseline-category npo | 0.33 | 7/801 = 0.0087 |
60 | Adjacent-categories po | 2.01 | 60/801 = 0.075 |
60 | Adjacent-categories npo | 0.15 | 6/801 = 0.0075 |
60 | Baseline-category npo | 0.15 | 6/801 = 0.0075 |
70 | Adjacent-categories po | 1.95 | 60/801 = 0.075 |
70 | Adjacent-categories npo | 0.21 | 5/801 = 0.0062 |
70 | Baseline-category npo | 0.21 | 5/801 = 0.0062 |
80 | Adjacent-categories po | 2.01 | 70/801 = 0.087 |
80 | Adjacent-categories npo | 0.14 | 4/801 = 0.0050 |
80 | Baseline-category npo | 0.22 | 4/801 = 0.0050 |
100 | Adjacent-categories po | 1.79 | 60/801 = 0.075 |
150 | Adjacent-categories po | 2.41 | 69/801 = 0.086 |
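
The two summary columns in the table above can be computed from predicted class probabilities along the following lines. This is a minimal sketch under stated assumptions: the cross-entropy loss is taken here as the average negative log predicted probability of the observed class, and an error is counted whenever the most probable class differs from the observed one. Whether the reported losses are averaged or summed, and how the predictions were obtained, cannot be recovered from this copy; the function name `cross_entropy_and_errors` and the toy data are hypothetical.

```python
import math


def cross_entropy_and_errors(probs, labels, eps=1e-15):
    """Return (mean cross-entropy loss, error count, error rate).

    probs  : one list of class probabilities per observation.
    labels : observed class index (0-based) per observation.
    eps    : floor applied to probabilities to avoid log(0).
    """
    n = len(labels)
    loss = 0.0
    errors = 0
    for p, y in zip(probs, labels):
        # Negative log-probability assigned to the observed class.
        loss -= math.log(max(p[y], eps))
        # Predicted class is the index with the largest probability.
        predicted = max(range(len(p)), key=p.__getitem__)
        errors += int(predicted != y)
    return loss / n, errors, errors / n


# Toy example with three observations and two classes (hypothetical data).
toy_probs = [[0.8, 0.2], [0.4, 0.6], [0.9, 0.1]]
toy_labels = [0, 1, 1]
mean_loss, n_errors, error_rate = cross_entropy_and_errors(toy_probs, toy_labels)
print(mean_loss, n_errors, error_rate)
```

The error rate in each table row is simply the error count divided by the 801 samples, for example 7/801 for the 50-gene npo models.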
Number of Genes | Type of Model | Cross-Entropy Loss | Error Count/Rate |
---|---|---|---|
31 | Adjacent-categories po | 0.063 | 1/801 = 0.0012 |
74 | Adjacent-categories npo | 0 | |
4 | Baseline-category npo | 0.031 | 2/801 = 0.0025 |
Gene set | acpo | acnpo | nomnpo | top50 |
---|---|---|---|---|
acpo | 31 | 29 | 2 | 20 |
acnpo | 29 | 74 | 4 | 45 |
nomnpo | 2 | 4 | 4 | 2 |
top50 | 20 | 45 | 2 | 50 |
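
A table of shared-gene counts like the one above can be produced by intersecting the selected gene lists pairwise. The sketch below is illustrative only: the function name `overlap_matrix` and the toy gene identifiers are hypothetical, and the real gene lists from Appendices A and B are not reproduced here.

```python
def overlap_matrix(gene_sets):
    """Count shared genes for every pair of named gene sets.

    gene_sets : dict mapping a set name to an iterable of gene identifiers.
    Returns a nested dict: result[a][b] = number of genes in both a and b.
    """
    as_sets = {name: set(genes) for name, genes in gene_sets.items()}
    names = list(as_sets)
    return {a: {b: len(as_sets[a] & as_sets[b]) for b in names}
            for a in names}


# Toy example with hypothetical gene identifiers.
toy_sets = {
    "model_a": ["g1", "g2", "g3"],
    "model_b": ["g2", "g3", "g4", "g5"],
    "model_c": ["g5"],
}
counts = overlap_matrix(toy_sets)
for name, row in counts.items():
    print(name, row)
```

By construction the matrix is symmetric and each diagonal entry equals the size of the corresponding gene set, which matches the diagonal of the table (31, 74, 4, and 50).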
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Dousti Mousavi, N.; Aldirawi, H.; Yang, J. Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data. BioTech 2023, 12, 52. https://doi.org/10.3390/biotech12030052