Recent Advances in Statistical Inference for High Dimensional Data

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (30 June 2024) | Viewed by 9787

Special Issue Editors


Guest Editor
Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403-0206, USA
Interests: model selection in mixed models; generalized linear models; bootstrap methods; high dimensional data; modeling diagnostics; multiple comparison procedures; Bayesian inference

Guest Editor
Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403-0206, USA
Interests: mixture models; high dimensional data; zero-inflated population; generalized linear models; transformed data analysis

Special Issue Information

Dear Colleagues,

High-dimensional data, in which the number of variables exceeds the number of cases, pose statistical and computational challenges. To meet these challenges, a growing number of statistical methodologies for high-dimensional data have been developed and applied extensively in a wide range of fields, including biology, medical informatics, engineering, psychology, financial time series, and climate forecasting. In this Special Issue, we welcome research on high-dimensional data. We strongly encourage interdisciplinary work with real data analysis.

This Special Issue calls for papers in, but not limited to, the following areas:

  • Statistical modeling methods for high-dimensional data and applications (e.g., regression, mixed models, mixture models, generalized linear models);
  • Model selection for high-dimensional data and applications;
  • Information theory and applications (e.g., decision optimization, clustering, classification);
  • Dimensionality reduction methods and applications in different real datasets;
  • Variable selection based on feature screening for high-dimensional data (e.g., bioinformatics, medical informatics, psychology, economics);
  • Statistical learning methods for high-dimensional data and applications (e.g., Lasso, splines, trees, random forests, neural networks, clustering, classification);
  • Applications based on Bayesian inference for high-dimensional data;
  • Statistical computing for high-dimensional data.

Prof. Dr. Junfeng Shang
Prof. Dr. Hanfeng Chen
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • statistical modeling
  • statistical inference
  • model selection
  • information theory
  • dimensionality reduction
  • feature screening
  • statistical learning
  • interdisciplinary applications
  • bioinformatics
  • Bayesian inference
  • statistical computing

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (5 papers)

Research

17 pages, 382 KiB  
Article
Can a Transparent Machine Learning Algorithm Predict Better than Its Black Box Counterparts? A Benchmarking Study Using 110 Data Sets
by Ryan A. Peterson, Max McGrath and Joseph E. Cavanaugh
Entropy 2024, 26(9), 746; https://doi.org/10.3390/e26090746 - 31 Aug 2024
Viewed by 1189
Abstract
We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it allows for flexibility and user control in varying the shade of the opacity of black box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical of higher-order polynomials and interactions a priori compared to main effects, and hence, the inclusion of these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open source R package, sparseR) to the test in a predictive model “bakeoff” (i.e., a benchmarking study of ML algorithms applied “out of the box”, that is, with no special tuning). Algorithms were trained on a large set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black box approaches. We found that interpretable approaches predicted optimally or within 5% of the optimal method in most real-world data sets. We provide a more in-depth comparison of the performances of random forests to interpretable methods for several case studies, including exemplars in which algorithms performed similarly, and several cases when interpretable methods underperformed. 
This work provides a strong rationale for including human-centered transparent algorithms such as ours in predictive modeling applications.
(This article belongs to the Special Issue Recent Advances in Statistical Inference for High Dimensional Data)

29 pages, 2777 KiB  
Article
Research on Active Safety Situation of Road Passenger Transportation Enterprises: Evaluation, Prediction, and Analysis
by Lili Zheng, Shiyu Cao, Tongqiang Ding, Jian Tian and Jinghang Sun
Entropy 2024, 26(6), 434; https://doi.org/10.3390/e26060434 - 21 May 2024
Viewed by 834
Abstract
The road passenger transportation enterprise is a complex system, requiring a clear understanding of its active safety situation (ASS), trends, and influencing factors. Such an understanding enables transportation authorities to promptly receive signals and take effective measures. Through exploratory factor analysis and confirmatory factor analysis, we delved into potential factors for evaluating ASS and extracted an ASS index. To predict obtaining a higher ASS information rate, we compared multiple time series models, including GRU (gated recurrent unit), LSTM (long short-term memory), ARIMA, Prophet, Conv_LSTM, and TCN (temporal convolutional network). This paper proposed the WDA-DBN (water drop algorithm-Deep Belief Network) model and employed DEEPSHAP to identify factors with higher ASS information content. TCN and GRU performed well in the prediction. Compared to the other models, WDA-DBN exhibited the best performance in terms of MSE and MAE. Overall, deep learning models outperform econometric models in terms of information processing. The total time spent processing alarms positively influences ASS, while variables such as fatigue driving occurrences, abnormal driving occurrences, and nighttime driving alarm occurrences have a negative impact on ASS.

33 pages, 1742 KiB  
Article
A Blockwise Bootstrap-Based Two-Sample Test for High-Dimensional Time Series
by Lin Yang
Entropy 2024, 26(3), 226; https://doi.org/10.3390/e26030226 - 1 Mar 2024
Viewed by 1296
Abstract
We propose a two-sample testing procedure for high-dimensional time series. To obtain the asymptotic distribution of our ℓ∞-type test statistic under the null hypothesis, we establish high-dimensional central limit theorems (HCLTs) for an α-mixing sequence. Specifically, we derive two HCLTs for the maximum of a sum of high-dimensional α-mixing random vectors under the assumptions of bounded finite moments and exponential tails, respectively. The proposed HCLT for α-mixing sequences under the bounded finite moments assumption is novel, and in comparison with existing results, we improve the convergence rate of the HCLT under the exponential tails assumption. To compute the critical value, we employ the blockwise bootstrap method. Importantly, our approach does not require the independence of the two samples, making it applicable for detecting change points in high-dimensional time series. Numerical results emphasize the effectiveness and advantages of our method.
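As a rough illustration of how a blockwise bootstrap can supply a critical value for a maximum-type statistic, the following Python sketch uses a Gaussian multiplier scheme over non-overlapping blocks. It is not the procedure from the paper: the equal-length paired samples, the statistic sqrt(n) * max|mean difference|, the fixed block length, and the function names are all assumptions made for this example.

```python
import numpy as np

def max_type_statistic(D):
    """sqrt(n) times the largest absolute coordinate of the mean of D (n x p)."""
    n = D.shape[0]
    return float(np.sqrt(n) * np.abs(D.mean(axis=0)).max())

def block_multiplier_critical_value(D, block_len, alpha=0.05, n_boot=500, seed=0):
    """Bootstrap (1 - alpha) quantile of the max-type statistic.

    Assumed scheme: centre the series, sum it over non-overlapping blocks
    of length block_len (so short-range dependence stays inside a block),
    then reweight the block sums with independent N(0, 1) multipliers.
    """
    rng = np.random.default_rng(seed)
    n, p = D.shape
    n_blocks = n // block_len
    used = n_blocks * block_len              # drop the incomplete last block
    centered = D[:used] - D[:used].mean(axis=0)
    block_sums = centered.reshape(n_blocks, block_len, p).sum(axis=1)
    multipliers = rng.standard_normal((n_boot, n_blocks))
    boot_stats = np.abs(multipliers @ block_sums).max(axis=1) / np.sqrt(used)
    return float(np.quantile(boot_stats, 1 - alpha))

# Usage: D holds coordinate-wise differences between the two series.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
Y = rng.standard_normal((200, 30)) + 5.0     # large mean shift in every coordinate
D = X - Y
reject = max_type_statistic(D) > block_multiplier_critical_value(D, block_len=10)
```

Working with the differences D means the two series are never resampled separately, so the sketch does not rely on the samples being independent of each other; the block length controls how much serial dependence each bootstrap draw preserves.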

15 pages, 325 KiB  
Article
Distance Correlation-Based Feature Selection in Random Forest
by Suthakaran Ratnasingam and Jose Muñoz-Lopez
Entropy 2023, 25(9), 1250; https://doi.org/10.3390/e25091250 - 23 Aug 2023
Cited by 11 | Viewed by 3234
Abstract
The Pearson correlation coefficient (ρ) is a commonly used measure of correlation, but it has limitations as it only measures the linear relationship between two numerical variables. The distance correlation measures all types of dependencies between random vectors X and Y in arbitrary dimensions, not just the linear ones. In this paper, we propose a filter method that utilizes distance correlation as a criterion for feature selection in Random Forest regression. We conduct extensive simulation studies to evaluate its performance compared to existing methods under various data settings, in terms of the prediction mean squared error. The results show that our proposed method is competitive with existing methods and outperforms all other methods in high-dimensional (p ≥ 300) nonlinearly related data sets. The applicability of the proposed method is also illustrated by two real data applications.
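To make the filter idea concrete, here is a minimal Python sketch that computes the sample distance correlation from double-centred pairwise distance matrices and ranks features by it. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the top_k cut-off, and the use of the plain (biased) sample estimator are choices made for this example, and the Random Forest fit on the retained features is omitted.

```python
import numpy as np

def _double_centered_distances(v):
    """Pairwise Euclidean distances with row, column, and grand means removed."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation in [0, 1] (biased, V-statistic form)."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

def rank_features_by_dcor(X, y, top_k):
    """Return indices and scores of the top_k features, highest score first."""
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]

# Usage: a purely quadratic signal in column 2, which Pearson correlation would miss.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X[:, 2] ** 2 + 0.1 * rng.standard_normal(200)
order, top_scores = rank_features_by_dcor(X, y, top_k=2)
```

The retained columns X[:, order] would then be passed to whichever regression learner is used downstream.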

26 pages, 357 KiB  
Article
Feature Screening for High-Dimensional Variable Selection in Generalized Linear Models
by Jinzhu Jiang and Junfeng Shang
Entropy 2023, 25(6), 851; https://doi.org/10.3390/e25060851 - 26 May 2023
Viewed by 2023
Abstract
The two-stage feature screening method for linear models applies dimension reduction at the first stage to screen out nuisance features and dramatically reduce the dimension to a moderate size; at the second stage, penalized methods such as LASSO and SCAD can be applied for feature selection. Most subsequent work on sure independence screening methods has focused on the linear model. This motivates us to extend the independence screening method to generalized linear models, particularly those with a binary response, by using the point-biserial correlation. We develop a two-stage feature screening method called point-biserial sure independence screening (PB-SIS) for high-dimensional generalized linear models, aiming for high selection accuracy and low computational cost. We demonstrate that PB-SIS is a feature screening method with high efficiency. The PB-SIS method possesses the sure independence property under certain regularity conditions. A set of simulation studies is conducted and confirms the sure independence property and the accuracy and efficiency of PB-SIS. Finally, we apply PB-SIS to one real data example to show its effectiveness.
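The first (screening) stage can be sketched in Python as follows: each feature is scored by the absolute point-biserial correlation with the binary response, and only the highest-scoring features are kept. This is a minimal sketch, not the PB-SIS implementation from the paper; the function names and the default cut-off d = floor(n / log n), a common choice in the screening literature, are assumptions made for this example, and the second-stage penalized fit is omitted.

```python
import numpy as np

def point_biserial(x, y):
    """Point-biserial correlation between a numeric x and a 0/1 response y.

    Equals the Pearson correlation of x and y, written in terms of the
    two group means and the overall (population) standard deviation.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    x1, x0 = x[y == 1], x[y == 0]
    n1, n0, n = len(x1), len(x0), len(x)
    s = x.std()                      # ddof=0 matches the formula below
    if s == 0 or n1 == 0 or n0 == 0:
        return 0.0
    return float((x1.mean() - x0.mean()) / s * np.sqrt(n1 * n0 / n ** 2))

def pb_screen(X, y, d=None):
    """Keep the d features with the largest absolute point-biserial correlation."""
    n, p = X.shape
    if d is None:
        d = min(p, int(n / np.log(n)))   # assumed default cut-off
    scores = np.abs([point_biserial(X[:, j], y) for j in range(p)])
    keep = np.argsort(scores)[::-1][:d]
    return np.sort(keep), scores

# Usage: only column 3 drives the binary response.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = (X[:, 3] + 0.5 * rng.standard_normal(200) > 0).astype(int)
kept, scores = pb_screen(X, y)
```

A penalized generalized linear model (for example, LASSO-penalized logistic regression) would then be fitted on X[:, kept] in the second stage.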