Mathematics

Research

10 pages, 473 KiB

Open Access Feature Paper Article

by Mahlon Scott and Hsin-Hsiung Huang

Mathematics 2025, 13(3), 486; https://doi.org/10.3390/math13030486 - 31 Jan 2025

Viewed by 233


Storm surges present a severe risk to coastal communities and infrastructure, underscoring the critical importance of accurately estimating extreme events such as the 100-year return surge. These estimates are essential not only for effective hazard assessment but also for informing resilient coastal design. Inspired by principles of robust statistical modeling, this paper introduces a Bayesian hierarchical model integrated with Gaussian processes to account for spatial random effects. This approach enhances the precision of long return period storm surge estimates and enables the seamless generalization of predictions to nearby unmonitored coastal regions, much like the way advanced Bayesian frameworks are applied to high-dimensional neuroimaging or spatiotemporal data, bridging gaps between observations and uncharted territories.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)

27 pages, 392 KiB

Open Access Article

Analysis of Receiver Operating Characteristic Curves for Cure Survival Data and Mismeasured Biomarkers

by Li-Pang Chen

Mathematics 2025, 13(3), 424; https://doi.org/10.3390/math13030424 - 27 Jan 2025

Viewed by 300

Abstract


Cure models and receiver operating characteristic (ROC) curve estimation are two important issues in survival analysis and have received attention for many years. In the development of biostatistics, these two topics have been well discussed separately. However, few methods are available for estimating the ROC curve from survival data with a cure fraction. Moreover, while a large body of estimation methods has been proposed, they rely on an implicit assumption that the variables are precisely measured. In applications, measurement errors are ubiquitous, and ignoring them can bias the estimator and lead to the wrong conclusion. In this paper, we study the estimation of the ROC curve and the area under the curve (AUC) when variables or biomarkers are subject to measurement error. We propose a valid procedure to handle measurement error effects and estimate the parameters in the cure model, as well as the AUC. We also establish the theoretical properties with rigorous justification.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)

26 pages, 12514 KiB

Open Access Article

Reconstruction and Prediction of Chaotic Time Series with Missing Data: Leveraging Dynamical Correlations Between Variables

by Jingchan Lv, Hongcun Mao, Yu Wang and Zhihai Yao

Mathematics 2025, 13(1), 152; https://doi.org/10.3390/math13010152 - 3 Jan 2025

Viewed by 595

Abstract


Although data-driven machine learning methods have been successfully applied to predict complex nonlinear dynamics, forecasting future evolution based on incomplete past information remains a significant challenge. This paper proposes a novel data-driven approach that leverages the dynamical relationships among variables. By integrating Non-Stationary Transformers with LightGBM, we construct a robust model where LightGBM builds a fitting function to capture and simulate the complex coupling relationships among variables in dynamically evolving chaotic systems. This approach enables the reconstruction of missing data, restoring sequence completeness and overcoming the limitations of existing chaotic time series prediction methods in handling missing data. We validate the proposed method by predicting the future evolution of variables with missing data in both dissipative and conservative chaotic systems. Experimental results demonstrate that the model maintains stability and effectiveness even with increasing missing rates, particularly in the range of 30% to 50%, where prediction errors remain relatively low. Furthermore, the feature importance extracted by the model aligns closely with the underlying dynamic characteristics of the chaotic system, enhancing the method’s interpretability and reliability. This research offers a practical and theoretically sound solution to the challenges of predicting chaotic systems with incomplete datasets.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)


15 pages, 3267 KiB

Open Access Article

EWMA Control Chart Integrated with Time Series Models for COVID-19 Surveillance

by Chen-Rui Hsu and Hsiuying Wang

Mathematics 2025, 13(1), 115; https://doi.org/10.3390/math13010115 - 30 Dec 2024

Viewed by 645

Abstract


The global outbreak of coronavirus disease 2019 (COVID-19) has posed a severe threat to public health and caused widespread socioeconomic disruptions in the past several years. While the pandemic has subsided, it is essential to explore effective disease surveillance tools to aid in controlling future pandemics. Several studies have proposed methods to capture the epidemic trend and forecast new daily confirmed cases. In this study, we propose the use of exponentially weighted moving average (EWMA) control charts integrated with time series models to monitor the number of daily new confirmed cases of COVID-19. The conventional EWMA control chart directly monitors the number of daily new confirmed cases. The proposed methods, however, monitor the residuals of time series models fitted to these data. Two time series models are considered: the auto-regressive integrated moving average (ARIMA) model and the vector auto-regressive moving average (VARMA) model. The results are compared with those of the conventional EWMA control chart using three datasets from India, Malaysia, and Thailand. The findings demonstrate that the proposed method can detect disease outbreak signals earlier than conventional control charts.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)
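The idea of charting model residuals rather than raw counts can be sketched as follows. This is a minimal illustration and not the authors' implementation: an AR(1) model fitted by least squares stands in for the ARIMA and VARMA fits, and the function names `ar1_residuals` and `ewma_chart`, the smoothing weight `lam`, the limit width `L`, and the baseline window `n_baseline` are all assumptions made for the example. The control limits use the standard time-varying EWMA formula.

```python
import numpy as np


def ar1_residuals(y):
    """One-step-ahead residuals from an AR(1) model fitted by least squares.

    Hypothetical stand-in for the ARIMA/VARMA fits named in the abstract.
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])  # intercept and lag-1 value
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return y[1:] - X @ coef


def ewma_chart(residuals, lam=0.2, L=3.0, n_baseline=None):
    """EWMA statistic, control limits, and indices of out-of-control points.

    The in-control mean and standard deviation are estimated from the first
    n_baseline residuals (all residuals if n_baseline is None).
    """
    r = np.asarray(residuals, dtype=float)
    base = r if n_baseline is None else r[:n_baseline]
    mu, sigma = base.mean(), base.std(ddof=1)

    # z_t = lam * r_t + (1 - lam) * z_{t-1}, started at the in-control mean
    z = np.empty_like(r)
    prev = mu
    for t, x in enumerate(r):
        prev = lam * x + (1.0 - lam) * prev
        z[t] = prev

    # Time-varying limits: mu +/- L * sigma * sqrt(lam/(2-lam) * (1-(1-lam)^(2t)))
    t_idx = np.arange(1, len(r) + 1)
    half_width = L * sigma * np.sqrt(
        lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t_idx))
    )
    lcl, ucl = mu - half_width, mu + half_width
    signals = np.flatnonzero((z > ucl) | (z < lcl))
    return z, lcl, ucl, signals
```

Given a daily case series `cases` (a hypothetical array), `ewma_chart(ar1_residuals(cases), n_baseline=60)` would return the indices at which the residual EWMA leaves its limits. Smaller values of `lam` give more weight to history and tend to be more sensitive to small sustained shifts.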


20 pages, 368 KiB

Open Access Article

Adaptive Reduction of Curse of Dimensionality in Nonparametric Instrumental Variable Estimation

by Ming-Yueh Huang and Kwun Chuen Gary Chan

Mathematics 2025, 13(1), 106; https://doi.org/10.3390/math13010106 - 30 Dec 2024

Viewed by 389

Abstract


Nonparametric estimation of instrumental variable treatment effects typically builds on various nonparametric identification results. However, these estimators often face challenges from the curse of dimensionality in practice, as multi-dimensional covariates are common. To address this issue, we investigate the nonparametric identification of a range of treatment effects within different sufficient dimension reduction models. We also examine the efficiency of estimation and find that, unlike fully nonparametric approaches, nonparametric estimators derived from maximal dimension reduction based on identification results may not be efficient. We study the conditions for achieving maximal dimension reduction to ensure efficiency for a binary instrumental variable and extend these results to multivariate and general instrumental variables. The proposed nonparametric sufficient dimension reduction framework imposes no constraints on the distribution of the observed data while mitigating the curse of dimensionality in a data-adaptive manner.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)

14 pages, 1244 KiB

Open Access Article

Semiparametric Analysis of Additive–Multiplicative Hazards Model with Interval-Censored Data and Panel Count Data

by Tong Wang, Yang Li, Jianguo Sun and Shuying Wang

Mathematics 2024, 12(23), 3667; https://doi.org/10.3390/math12233667 - 22 Nov 2024

Viewed by 565

Abstract


In survival analysis, interval-censored data and panel count data represent two prevalent types of incomplete data. Given that, within certain research contexts, the events of interest may simultaneously involve both data types, it is imperative to perform a joint analysis of these data to fully comprehend the occurrence process of the events being studied. In this paper, a novel semiparametric joint regression analysis framework is proposed for the analysis of interval-censored data and panel count data. It is assumed that the failure time follows an additive–multiplicative hazards model, while the recurrent events follow a nonhomogeneous Poisson process. Additionally, a gamma-distributed frailty is introduced to describe the correlation between the failure time and the count process of recurrent events. To estimate the model parameters, a sieve maximum likelihood estimation method based on Bernstein polynomials is proposed. The finite-sample performance of this estimation method is evaluated through a series of simulation studies, and the method is illustrated with an empirical study.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)


17 pages, 307 KiB

Open Access Article

Analyzing Treatment Effect by Integrating Existing Propensity Score and Outcome Regressions with Heterogeneous Covariate Sets

by Yi-Hau Chen, Szu-Yuan Hsu, Jie-Huei Wang and Chien-Chou Su

Mathematics 2024, 12(14), 2265; https://doi.org/10.3390/math12142265 - 19 Jul 2024

Viewed by 806

Abstract


Analyzing treatment or exposure effects is a major research theme in scientific studies. In the current big-data era where multiple sources of data are available, it is of interest to perform a synthesized analysis of treatment effects by integrating information from different data sources or studies. However, studies may contain heterogeneous and incomplete covariate sets, and individual data therein may not be accessible. We apply and extend the generalized meta-analysis method to integrate summary results (e.g., regression coefficients) of outcome and treatment (propensity score, PS) regression analyses across different datasets that may contain heterogeneous covariate sets. The proposed integrated analysis utilizes a reference dataset, which contains data on the complete set of covariates. The asymptotic distribution for the proposed integrated estimator is established. Simulations reveal that the proposed estimator performs well. We apply the proposed method to obtain the causal effect of waist circumference on hypertension by integrating two existing outcome and PS regression analyses with different sets of covariates.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)

24 pages, 2452 KiB

Open Access Article

Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data

by Jie-Huei Wang, Cheng-Yu Liu, You-Ruei Min, Zih-Han Wu and Po-Lin Hou

Mathematics 2024, 12(14), 2209; https://doi.org/10.3390/math12142209 - 15 Jul 2024

Cited by 2 | Viewed by 1597

Abstract


The complexity of cancer development involves intricate interactions among multiple biomarkers, such as gene-environment interactions. Utilizing microarray gene expression profile data for cancer classification is anticipated to be effective, thus drawing considerable interest in the fields of bioinformatics and computational biology. Due to the characteristics of genomic data, problems of high-dimensional interactions and noise interference arise during analysis. When building cancer diagnosis models, we often face the dilemma of model adaptation errors due to an imbalance of data types. To mitigate these issues, we apply the SMOTE-Tomek procedure to rectify the imbalance problem. Following this, we utilize the overlapping group screening method alongside a binary logistic regression model to integrate gene pathway information, facilitating the identification of significant biomarkers associated with clinically imbalanced cancer or normal outcomes. Simulation studies across different imbalanced rates and gene structures validate our proposed method’s effectiveness, surpassing common machine learning techniques in terms of classification prediction accuracy. We also demonstrate that prediction performance improves with SMOTE-Tomek treatment compared to no imbalance treatment and SMOTE treatment across various imbalance rates. In the real-world application, we integrate clinical and gene expression data with prior pathway information. We employ SMOTE-Tomek and our proposed methods to identify critical biomarkers and gene-environment interactions linked to the imbalanced binary outcomes (cancer or normal) in patients from The Cancer Genome Atlas datasets of lung adenocarcinoma and breast invasive carcinoma. Our proposed method consistently achieves satisfactory classification accuracy. Additionally, we have identified biomarkers indicative of gene-environment interactions relevant to cancer and have provided corresponding estimates of odds ratios. Moreover, to achieve good prediction results in high-dimensional imbalanced data, we recommend considering the order of balancing processing and feature screening.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)


21 pages, 377 KiB

Open Access Article

Joint Statistical Inference for the Area under the ROC Curve and Youden Index under a Density Ratio Model

by Siyan Liu, Qinglong Tian, Yukun Liu and Pengfei Li

Mathematics 2024, 12(13), 2118; https://doi.org/10.3390/math12132118 - 5 Jul 2024

Cited by 2 | Viewed by 1352

Abstract


The receiver operating characteristic (ROC) curve is a valuable statistical tool in medical research. It assesses a biomarker’s ability to distinguish between diseased and healthy individuals. The area under the ROC curve ($AUC$) and the Youden index ($J$) are common summary indices used to evaluate a biomarker’s diagnostic accuracy. Simultaneously examining $AUC$ and $J$ offers a more comprehensive understanding of the ROC curve’s characteristics. In this paper, we utilize a semiparametric density ratio model to link the distributions of a biomarker for healthy and diseased individuals. Under this model, we establish the joint asymptotic normality of the maximum empirical likelihood estimator of $(AUC, J)$ and construct an asymptotically valid confidence region for $(AUC, J)$. Furthermore, we propose a new test to determine whether a biomarker simultaneously exceeds prespecified target values of $AUC_{0}$ and $J_{0}$ with the null hypothesis $H_{0}: AUC \leq AUC_{0}$ or $J \leq J_{0}$ against the alternative hypothesis $H_{a}: AUC > AUC_{0}$ and $J > J_{0}$. Simulation studies and a real data example on Duchenne Muscular Dystrophy are used to demonstrate the effectiveness of our proposed method and highlight its advantages over existing methods.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)
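To make the two summary indices concrete, the sketch below computes plain empirical (nonparametric) versions of the AUC and the Youden index from two samples. It does not implement the density ratio model or the empirical likelihood estimator described in the abstract. The function names are hypothetical, and the code assumes that larger biomarker values indicate disease, so a subject is classified as diseased when the value exceeds the cutoff.

```python
import numpy as np


def empirical_auc(healthy, diseased):
    """Empirical AUC: P(diseased > healthy) + 0.5 * P(diseased == healthy)."""
    h = np.asarray(healthy, dtype=float)[:, None]
    d = np.asarray(diseased, dtype=float)[None, :]
    # Compare every healthy/diseased pair; ties count one half
    return float(np.mean((d > h) + 0.5 * (d == h)))


def empirical_youden(healthy, diseased):
    """Empirical Youden index J = max_c {sensitivity(c) + specificity(c) - 1}.

    Returns (J, cutoff), where values above the cutoff are called diseased.
    """
    h = np.asarray(healthy, dtype=float)
    d = np.asarray(diseased, dtype=float)
    cutoffs = np.unique(np.concatenate([h, d]))  # candidate cutoffs, sorted
    sens = np.array([(d > c).mean() for c in cutoffs])
    spec = np.array([(h <= c).mean() for c in cutoffs])
    j = sens + spec - 1.0
    k = int(np.argmax(j))  # first maximizer if several cutoffs tie
    return float(j[k]), float(cutoffs[k])
```

For example, with `healthy = [1, 2, 3, 4]` and `diseased = [3, 5, 6, 7]`, the AUC is 14.5/16 = 0.90625 and the Youden index is 0.75, attained at cutoff 4. The two indices summarize different aspects of the same ROC curve, which is why the abstract argues for examining them jointly.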


22 pages, 1823 KiB

Open Access Article

Computation of the Mann–Whitney Effect under Parametric Survival Copula Models

by Kosuke Nakazono, Yu-Cheng Lin, Gen-Yih Liao, Ryuji Uozumi and Takeshi Emura

Mathematics 2024, 12(10), 1453; https://doi.org/10.3390/math12101453 - 8 May 2024

Cited by 2 | Viewed by 1498

Abstract


The Mann–Whitney effect is a measure for comparing survival distributions between two groups. The Mann–Whitney effect is interpreted as the probability that a randomly selected subject in one group survives longer than a randomly selected subject in the other group. Under the independence assumption of two groups, the Mann–Whitney effect can be expressed as the traditional integral formula of survival functions. However, when the survival times in two groups are not independent of each other, the traditional formula of the Mann–Whitney effect has to be modified. In this article, we propose a copula-based approach to compute the Mann–Whitney effect with parametric survival models under dependence of two groups, which may arise in the potential outcome framework. In addition, we develop a Shiny web app that can implement the proposed method via simple commands. Through a simulation study, we show the correctness of the proposed calculator. We apply the proposed methods to two real datasets.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)
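The independent-groups case mentioned in the abstract can be illustrated numerically. For continuous survival times, the probability that a subject from group 1 outlives a subject from group 2 is the integral of S1(t) f2(t) over t, where S1 is the survival function of group 1 and f2 the density of group 2. The sketch below evaluates that integral with a trapezoidal rule. It covers only the independent case, not the copula-based method of the paper, and the function name, the truncation point `upper`, and the grid size are assumptions made for the example.

```python
import numpy as np


def mann_whitney_effect(surv1, dens2, upper, n_grid=200_001):
    """Approximate P(T1 > T2) = integral_0^upper S1(t) * f2(t) dt.

    surv1: vectorized survival function of group 1
    dens2: vectorized density function of group 2
    upper: truncation point; should be large enough that the tail is negligible
    """
    t = np.linspace(0.0, upper, n_grid)
    integrand = surv1(t) * dens2(t)
    # Trapezoidal rule written out explicitly
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(t)) / 2.0)


# Exponential survival times with rates 1 (group 1) and 2 (group 2);
# the closed form is rate2 / (rate1 + rate2) = 2/3.
rate1, rate2 = 1.0, 2.0
p = mann_whitney_effect(
    lambda t: np.exp(-rate1 * t),
    lambda t: rate2 * np.exp(-rate2 * t),
    upper=50.0,
)
```

Here `p` agrees with the closed-form value 2/3 to within the discretization error. A value above 0.5 means that subjects in group 1 tend to survive longer than those in group 2.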


16 pages, 393 KiB

Open Access Article

Data-Adaptive Multivariate Test for Genomic Studies Using Fused Lasso

by Masao Ueki

Mathematics 2024, 12(10), 1422; https://doi.org/10.3390/math12101422 - 7 May 2024

Viewed by 953

Abstract


In genomic studies, univariate analysis is commonly used to discover susceptible variants. It applies univariate regression for each variant and tests the significance of the regression coefficient or slope parameter. This strategy, however, may miss signals that are jointly detectable with other variants. Multivariate analysis is another popular approach, which tests grouped variants with a predefined group, e.g., based on a gene, pathway, or physical location. However, the power will be diminished if the modeling assumption is not suited to the data. Therefore, data-adaptive testing that relies on fewer modeling assumptions is preferable. Possible approaches include a data-adaptive test proposed by Ueki (2021), which applies to various data-adaptive regression models using a generalization of Yanai’s generalized coefficient of determination. While several regression models are possible choices for the data-adaptive test, this paper focuses on the fused lasso that can account for the effect of adjacent variants and investigates its performance through comparison with other existing tests. Simulation studies demonstrate that the test using fused lasso has high power compared to the existing tests including the univariate regression test, saturated regression test, SKAT (sequence kernel association test), burden test, SKAT-O (optimized sequence kernel association test), and the tests using lasso, ridge, and elastic net when assuming a similar effect of adjacent variants.

(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)



Statistical Analysis and Data Science for Complex Data


Published Papers (11 papers)
