Statistical Analysis and Data Science for Complex Data

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Probability and Statistics".

Deadline for manuscript submissions: 15 December 2024 | Viewed by 5722

Special Issue Editor


Dr. Li-pang Chen
Guest Editor
Department of Statistics, National Chengchi University, Taipei 116, Taiwan
Interests: graphical models; high-dimensional data analysis; machine learning; measurement error and error classification; survival analysis

Special Issue Information

Dear Colleagues,

Thanks to the rapid development of technology, datasets can now be collected easily in many fields, such as biology and manufacturing. A given dataset may have (i) a large sample size or (ii) a large number of variables, yielding so-called big data or high-dimensional data, respectively. However, often only a small portion of the samples or variables is informative for data analysis. In addition, datasets usually contain complex structures caused by the collection procedure, such as censoring, measurement errors, or missingness. With noisy data, it becomes more challenging to choose informative subdata, detect important variables, or conduct analyses. In light of these challenges, this Special Issue aims to provide a platform for publishing novel statistical methods and algorithms that handle these complex structures in various research fields. Topics of interest for this Special Issue include, but are not limited to, biostatistics, bioinformatics, causal inference, meta-analysis, statistical process control, and survival analysis.

Dr. Li-pang Chen
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • algorithm
  • big data
  • high dimensionality
  • noisy data

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)

Research

14 pages, 1238 KiB  
Article
Semiparametric Analysis of Additive–Multiplicative Hazards Model with Interval-Censored Data and Panel Count Data
by Tong Wang, Yang Li, Jianguo Sun and Shuying Wang
Mathematics 2024, 12(23), 3667; https://doi.org/10.3390/math12233667 - 22 Nov 2024
Abstract
In survival analysis, interval-censored data and panel count data represent two prevalent types of incomplete data. Given that, within certain research contexts, the events of interest may simultaneously involve both data types, it is imperative to perform a joint analysis of these data to fully comprehend the occurrence process of the events being studied. In this paper, a novel semiparametric joint regression analysis framework is proposed for the analysis of interval-censored data and panel count data. It is hypothesized that the failure time follows an additive–multiplicative hazards model, while the recurrent events follow a nonhomogeneous Poisson process. Additionally, a gamma-distributed frailty is introduced to describe the correlation between the failure time and the count process of recurrent events. To estimate the model parameters, a sieve maximum likelihood estimation method based on Bernstein polynomials is proposed. The performance of this estimation method under finite sample conditions is evaluated through a series of simulation studies, and an empirical study is illustrated.
(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)
17 pages, 307 KiB  
Article
Analyzing Treatment Effect by Integrating Existing Propensity Score and Outcome Regressions with Heterogeneous Covariate Sets
by Yi-Hau Chen, Szu-Yuan Hsu, Jie-Huei Wang and Chien-Chou Su
Mathematics 2024, 12(14), 2265; https://doi.org/10.3390/math12142265 - 19 Jul 2024
Viewed by 618
Abstract
Analyzing treatment or exposure effect is a major research theme in scientific studies. In the current big-data era where multiple sources of data are available, it is of interest to perform a synthesized analysis of treatment effects by integrating information from different data sources or studies. However, studies may contain heterogeneous and incomplete covariate sets, and individual data therein may not be accessible. We apply and extend the generalized meta-analysis method to integrate summary results (e.g., regression coefficients) of outcome and treatment (propensity score, PS) regression analyses across different datasets that may contain heterogeneous covariate sets. The proposed integrated analysis utilizes a reference dataset, which contains data on the complete set of covariates. The asymptotic distribution for the proposed integrated estimator is established. Simulations reveal that the proposed estimator performs well. We apply the proposed method to obtain the causal effect of waist circumference on hypertension by integrating two existing outcomes and PS regression analyses with different sets of covariates.
24 pages, 2452 KiB  
Article
Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data
by Jie-Huei Wang, Cheng-Yu Liu, You-Ruei Min, Zih-Han Wu and Po-Lin Hou
Mathematics 2024, 12(14), 2209; https://doi.org/10.3390/math12142209 - 15 Jul 2024
Cited by 1 | Viewed by 1105
Abstract
The complexity of cancer development involves intricate interactions among multiple biomarkers, such as gene-environment interactions. Utilizing microarray gene expression profile data for cancer classification is anticipated to be effective, thus drawing considerable interest in the fields of bioinformatics and computational biology. Due to the characteristics of genomic data, problems of high-dimensional interactions and noise interference do exist during the analysis process. When building cancer diagnosis models, we often face the dilemma of model adaptation errors due to an imbalance of data types. To mitigate the issues, we apply the SMOTE-Tomek procedure to rectify the imbalance problem. Following this, we utilize the overlapping group screening method alongside a binary logistic regression model to integrate gene pathway information, facilitating the identification of significant biomarkers associated with clinically imbalanced cancer or normal outcomes. Simulation studies across different imbalanced rates and gene structures validate our proposed method’s effectiveness, surpassing common machine learning techniques in terms of classification prediction accuracy. We also demonstrate that prediction performance improves with SMOTE-Tomek treatment compared to no imbalance treatment and SMOTE treatment across various imbalance rates. In the real-world application, we integrate clinical and gene expression data with prior pathway information. We employ SMOTE-Tomek and our proposed methods to identify critical biomarkers and gene-environment interactions linked to the imbalanced binary outcomes (cancer or normal) in patients from the Cancer Genome Atlas datasets of lung adenocarcinoma and breast invasive carcinoma. Our proposed method consistently achieves satisfactory classification accuracy. Additionally, we have identified biomarkers indicative of gene-environment interactions relevant to cancer and have provided corresponding estimates of odds ratios. Moreover, in high-dimensional imbalanced data, for achieving good prediction results, we recommend considering the order of balancing processing and feature screening.
21 pages, 377 KiB  
Article
Joint Statistical Inference for the Area under the ROC Curve and Youden Index under a Density Ratio Model
by Siyan Liu, Qinglong Tian, Yukun Liu and Pengfei Li
Mathematics 2024, 12(13), 2118; https://doi.org/10.3390/math12132118 - 5 Jul 2024
Cited by 2 | Viewed by 1011
Abstract
The receiver operating characteristic (ROC) curve is a valuable statistical tool in medical research. It assesses a biomarker’s ability to distinguish between diseased and healthy individuals. The area under the ROC curve (AUC) and the Youden index (J) are common summary indices used to evaluate a biomarker’s diagnostic accuracy. Simultaneously examining AUC and J offers a more comprehensive understanding of the ROC curve’s characteristics. In this paper, we utilize a semiparametric density ratio model to link the distributions of a biomarker for healthy and diseased individuals. Under this model, we establish the joint asymptotic normality of the maximum empirical likelihood estimator of (AUC, J) and construct an asymptotically valid confidence region for (AUC, J). Furthermore, we propose a new test to determine whether a biomarker simultaneously exceeds prespecified target values of AUC0 and J0 with the null hypothesis H0: AUC ≤ AUC0 or J ≤ J0 against the alternative hypothesis Ha: AUC > AUC0 and J > J0. Simulation studies and a real data example on Duchenne Muscular Dystrophy are used to demonstrate the effectiveness of our proposed method and highlight its advantages over existing methods.
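For readers unfamiliar with the two summary indices, the following is a minimal sketch of their plain empirical (nonparametric) versions. It is not the density-ratio-model estimator proposed in the article; the function names, the toy biomarker values, and the convention that larger values indicate disease are all assumptions made for illustration.

```python
def empirical_auc(healthy, diseased):
    """Empirical AUC: share of (diseased, healthy) pairs in which the
    diseased value is larger, counting ties as one half."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(healthy) * len(diseased))


def empirical_youden(healthy, diseased):
    """Empirical Youden index: maximum over cutoffs c of
    sensitivity(c) + specificity(c) - 1, where a subject is classified
    as diseased when its biomarker value exceeds c."""
    best_j, best_cutoff = 0.0, None
    for c in sorted(set(healthy) | set(diseased)):
        sensitivity = sum(d > c for d in diseased) / len(diseased)
        specificity = sum(h <= c for h in healthy) / len(healthy)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j, best_cutoff = j, c
    return best_j, best_cutoff


# Hypothetical biomarker values, for illustration only.
healthy = [0.8, 1.1, 1.3, 1.9, 2.4]
diseased = [1.7, 2.2, 2.8, 3.1, 3.5]

auc = empirical_auc(healthy, diseased)
j, cutoff = empirical_youden(healthy, diseased)
print(f"AUC = {auc:.2f}, J = {j:.2f} at cutoff {cutoff}")
```

Reporting both values together reflects the abstract's point that AUC summarizes the whole curve while J describes its best single operating point.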
22 pages, 1823 KiB  
Article
Computation of the Mann–Whitney Effect under Parametric Survival Copula Models
by Kosuke Nakazono, Yu-Cheng Lin, Gen-Yih Liao, Ryuji Uozumi and Takeshi Emura
Mathematics 2024, 12(10), 1453; https://doi.org/10.3390/math12101453 - 8 May 2024
Cited by 1 | Viewed by 1166
Abstract
The Mann–Whitney effect is a measure for comparing survival distributions between two groups. The Mann–Whitney effect is interpreted as the probability that a randomly selected subject in a group survives longer than a randomly selected subject in the other group. Under the independence assumption of two groups, the Mann–Whitney effect can be expressed as the traditional integral formula of survival functions. However, when the survival times in two groups are not independent of each other, the traditional formula of the Mann–Whitney effect has to be modified. In this article, we propose a copula-based approach to compute the Mann–Whitney effect with parametric survival models under dependence of two groups, which may arise in the potential outcome framework. In addition, we develop a Shiny web app that can implement the proposed method via simple commands. Through a simulation study, we show the correctness of the proposed calculator. We apply the proposed methods to two real datasets.
16 pages, 393 KiB  
Article
Data-Adaptive Multivariate Test for Genomic Studies Using Fused Lasso
by Masao Ueki
Mathematics 2024, 12(10), 1422; https://doi.org/10.3390/math12101422 - 7 May 2024
Viewed by 825
Abstract
In genomic studies, univariate analysis is commonly used to discover susceptible variants. It applies univariate regression for each variant and tests the significance of the regression coefficient or slope parameter. This strategy, however, may miss signals that are jointly detectable with other variants. Multivariate analysis is another popular approach, which tests grouped variants with a predefined group, e.g., based on a gene, pathway, or physical location. However, the power will be diminished if the modeling assumption is not suited to the data. Therefore, data-adaptive testing that relies on fewer modeling assumptions is preferable. Possible approaches include a data-adaptive test proposed by Ueki (2021), which applies to various data-adaptive regression models using a generalization of Yanai’s generalized coefficient of determination. While several regression models are possible choices for the data-adaptive test, this paper focuses on the fused lasso that can account for the effect of adjacent variants and investigates its performance through comparison with other existing tests. Simulation studies demonstrate that the test using fused lasso has a high power compared to the existing tests including the univariate regression test, saturated regression test, SKAT (sequence kernel association test), burden test, SKAT-O (optimized sequence kernel association test), and the tests using lasso, ridge, and elastic net when assuming a similar effect of adjacent variants.