1. Introduction
Scientists are often led to study the relationships and dependencies between a response variable and several covariates. Regression analysis is the standard statistical tool for investigating such relationships, and it is one of the most commonly used statistical methods in many scientific fields, such as medicine, biology, agriculture, economics, engineering, and sociology, where it is routinely used to interpret the association between variables. However, the basic form of regression analysis is not suitable in many cases, where the relationships are often non-linear and the probability distribution of the response variable may be non-normal.
For such dependence modeling problems, we attempt to provide a functional form that summarizes the relationship between the response and the explanatory variables. In many practical situations, a vector of covariates is used to explain, interpret, or predict the response variable Y; this is encountered in many fields, including medicine and the social sciences. The type of functional relationship we seek may depend on the marginal behavior of the variables or on their joint behavior. In this paper, we consider the construction of dependence modeling procedures based on the separation of these two behaviors when the covariates are a mixture of continuous and discrete variables.
In this context, we consider procedures that allow the representation of a multivariate distribution as a function of its univariate marginals through a connection function called a copula. Copulas have become increasingly popular for modeling statistical dependence in multivariate data and have been applied in various areas, including medical research, environmental science, econometrics, actuarial science, agronomy, and others. A key feature of copulas is that they provide flexible representations of the multivariate distribution by allowing the dependence structure of the variables of interest to be modeled separately from the marginal structure; by specifying a copula, we summarize all the dependencies between the margins (see Nelsen [1] for more on this subject).
The power of this approach lies principally in the ability of a practitioner to model the dependence structure independently of the marginal behaviors. Further advantages of using copulas in modeling are the ability to model both linear and non-linear dependence, the free choice of marginal distributions, and the capability of modeling dependence at the extremes. In particular, copula regression places no restrictions on the probability distributions that can be used for the margins.
Copula-based regression models offer significant advantages in capturing complex dependencies between variables, which makes them useful in various fields. In finance, they allow for better portfolio risk management by modeling non-linear dependencies between asset returns and macroeconomic factors, especially during market downturns. In insurance, copula-based regression can be applied to explain pricing in terms of dependent claim components, such as frequency and severity. In environmental studies, copula-based regression is useful for establishing the relationship between rainfall and river discharge, especially in the case of non-linear dependence. In healthcare, copula-based regression enables researchers to examine how lifestyle factors influence health outcomes, such as cholesterol levels, while capturing the potential interdependence among these health indicators.
In the literature, there are many recent studies of regression based on copulas; see, for example, Sheikhi et al. [2] and Ali et al. [3], among others. As a new contribution to this domain, we consider in this paper the problem of estimating the mean regression function for a regression model, where $X = (X_1, \dots, X_d)$ is a random vector of dimension $d$ and Y is a random variable with cumulative distribution function (c.d.f.) $F_0$ and density function $f_0$. Y is the response variable and $X$ is the set of covariates. We denote by $F_j$ the c.d.f. of the variable $X_j$, $j = 1, \dots, d$, and by $f_j$ its corresponding density. For a given $x = (x_1, \dots, x_d)$, we write $F(x)$ as shorthand for $(F_1(x_1), \dots, F_d(x_d))$. From the inspiring work of Sklar [4], the c.d.f. of $(Y, X)$ evaluated at $(y, x)$ can be expressed as $C(F_0(y), F(x))$, where C is the copula distribution of $(Y, X)$, that is, the function from $[0,1]^{d+1}$ to $[0,1]$ defined by

$$C(u_0, u_1, \dots, u_d) = P\big(F_0(Y) \le u_0,\, F_1(X_1) \le u_1,\, \dots,\, F_d(X_d) \le u_d\big).$$

Recently, Noh et al. [5] exploited the above decomposition to introduce a novel idea consisting of expressing the mean regression function $m(x) = E(Y \mid X = x)$ in terms of the copula and margins of $(Y, X)$ as follows:

$$m(x) = \frac{E\big[\,Y\, c(F_0(Y), F(x))\,\big]}{c_X(F(x))}, \qquad (1)$$

where $c$ is the copula density corresponding to C and $c_X$ is the copula density of $X$. This shows that the mean regression function $m(x)$ is the ratio of a numerator that only captures the mean dependence between Y and X and a denominator that captures the dependence within X. It is worth mentioning that formula (1) is only valid when the covariates are continuous. A new reformulation is needed when the covariates are not all continuous, which is the case for many real-world applications, especially in medicine.
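As a minimal numerical sketch of the copula representation of the mean regression function, consider a single continuous covariate, so that the copula density of X is identically 1 and the representation reduces to $m(x) = E[Y\, c(F_0(Y), F_1(x))]$, where $F_0$ and $F_1$ denote the c.d.f.s of Y and X and $c$ is the copula density of (Y, X). The choices below (a Gaussian copula with correlation 0.5, Y normal with mean 2 and standard deviation 3, X standard exponential) are illustrative assumptions, not settings taken from the paper. Under these assumptions the integral has the closed form $\mu + \sigma \rho\, \Phi^{-1}(F_1(x))$, which the code uses as a check.

```python
import numpy as np
from scipy import integrate, stats


def gaussian_copula_density_from_scores(a, b, rho):
    """Bivariate Gaussian copula density c(u0, u1), written in terms of the
    normal scores a = Phi^{-1}(u0) and b = Phi^{-1}(u1)."""
    expo = -(rho**2 * (a**2 + b**2) - 2.0 * rho * a * b) / (2.0 * (1.0 - rho**2))
    return np.exp(expo) / np.sqrt(1.0 - rho**2)


def mean_regression(x, rho, mu=2.0, sigma=3.0):
    """m(x) = E[Y c(F0(Y), F1(x))] for Y ~ N(mu, sigma^2) and X ~ Exp(1).

    The expectation is computed by numerical integration after the change
    of variable y = mu + sigma * a, so that F0(y) = Phi(a).
    """
    b = stats.norm.ppf(stats.expon.cdf(x))  # normal score of F1(x)

    def integrand(a):
        weight = gaussian_copula_density_from_scores(a, b, rho)
        return (mu + sigma * a) * weight * stats.norm.pdf(a)

    value, _ = integrate.quad(integrand, -np.inf, np.inf)
    return value


# Illustrative (assumed) evaluation point and dependence level.
x, rho = 1.5, 0.5
numeric = mean_regression(x, rho)
closed_form = 2.0 + 3.0 * rho * stats.norm.ppf(stats.expon.cdf(x))
```

Although the copula is Gaussian, the resulting regression curve is non-linear in x because the covariate margin is not normal; this is the separation of marginal and joint behavior described above.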
Furthermore, Noh et al. [5] proposed a semi-parametric estimator for the regression function given in (1). Specifically, they used the inference function for margins (IFM) technique to estimate the copula-based regression curve. This method proceeds in two stages: first, it estimates the marginal parameters, and then it estimates the corresponding dependence parameter. These authors demonstrate, both theoretically and empirically, that the resulting estimates exhibit desirable properties when the parametric copula family is adequately chosen.
The work of Noh et al. [5] stimulated extensive research on copula-based regression. Noh et al. [6] applied the method of Noh et al. [5] to quantile regression with completely observed i.i.d. or time series data. De Backer et al. [7] extended the method of Noh et al. [6] to quantile regression with censored data. Kraus and Czado [8] studied quantile regression with complete data using D-vine copulas. Rémillard et al. [9] discussed the asymptotic connection between the estimators of Noh et al. [6] and Kraus and Czado [8]. Chang and Joe [10] proposed an algorithm for computing the conditional distribution function via vine copulas. Furthermore, Nagler and Vatter [11] unified various copula-based regressions by formulating a general loss function that need not be continuously differentiable. Their generalized regression model includes the conditional mean regression of Noh et al. [5], the conditional quantile regression of Noh et al. [6], and the asymmetric least squares of Newey and Powell [12] as special cases. This unified framework supports a systematic interpretation of the existing regression methods. For additional discussion of similar methods, see [13,14,15,16,17] and the literature cited therein.
As an extension of the framework of Noh et al. [5], we incorporate discrete variables into the set of covariates $X$. By establishing a connection with various classes of copulas through an alternative equation to (1), we calculate the conditional mean, $m(x)$, of $Y$ given $X = x$. In this context, we develop the relationship between the copula and the marginals. Furthermore, we illustrate this relationship for specific families of copulas, such as Archimedean copulas and the Gaussian copula, highlighting the properties that are beneficial for our analysis.
The next step addresses the estimation problem. Here, we also adopt a semi-parametric approach together with the inference function for margins (IFM) method to estimate the proposed regression curve. First, we estimate the marginal distributions by their empirical distributions, and then we estimate the dependence parameter associated with the underlying copula. Simulation studies for different classes of copulas and different distributions of the response Y are considered to illustrate the usefulness of the findings.
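The two-stage idea can be sketched as follows for the simplest setting of one continuous covariate and a Clayton copula. This is an illustration under stated assumptions and not the estimator developed in Section 4: the mixed continuous and discrete case requires the alternative formula derived later in the paper, which is not reproduced here. The function names, the rank/(n + 1) rescaling of the empirical c.d.f., the search bounds for the parameter, and the simulated data are all hypothetical choices.

```python
import numpy as np
from scipy import optimize, stats


def pseudo_observations(z):
    """Stage 1: rescaled empirical c.d.f., rank / (n + 1), kept inside (0, 1)."""
    z = np.asarray(z, dtype=float)
    return stats.rankdata(z) / (z.size + 1.0)


def clayton_log_density(u, v, theta):
    """Log-density of the bivariate Clayton copula with parameter theta > 0."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return (np.log1p(theta)
            - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(s))


def fit_clayton(u, v):
    """Stage 2: maximize the pseudo-log-likelihood over theta."""
    def neg_loglik(theta):
        return -np.sum(clayton_log_density(u, v, theta))

    result = optimize.minimize_scalar(neg_loglik, bounds=(1e-3, 20.0),
                                      method="bounded")
    return result.x


def sample_clayton_pair(n, theta, rng):
    """Draw (U0, U1) from a Clayton copula by conditional inversion."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
    return u, v


# Simulated data with assumed margins: Y standard normal, X standard exponential.
rng = np.random.default_rng(0)
u0, u1 = sample_clayton_pair(2000, theta=2.0, rng=rng)
y = stats.norm.ppf(u0)
x = stats.expon.ppf(u1)

theta_hat = fit_clayton(pseudo_observations(y), pseudo_observations(x))
```

Because the first stage uses ranks only, the estimate of the dependence parameter does not depend on how the margins are specified, which is the practical appeal of the semi-parametric approach.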
The rest of the paper is organized as follows. Section 2 discusses different copula concepts in the multivariate setting. Section 3 outlines the proposed copula-based regression model for the case where the set of covariates includes both discrete and continuous variables. Section 4 covers the estimation procedure for the proposed regression model. Section 5 is dedicated to a simulation study that assesses the performance of the suggested copula-based regression. Conclusions and remarks are given in Section 6.
5. Simulation Study
The objective of this section is to conduct simulations to compare the proposed conditional mean estimator with some competitors. To achieve this, we focus on the case where
with mixed covariates; specifically,
is continuous, and
is discrete. In this case, the proposed estimator is deduced from its general form expressed in (19) as follows,
where
As scenarios, we consider the most common cases to show the improvement of our estimator over the OLS estimator. For the copula of
, we consider Clayton, Frank, and Gumbel with parameter
and for the variables
or
, while
and
with distribution
,
and
. The generalized inverse of
is
or equivalently,
Simulation algorithm:Given , and .
For .
Generate from a copula .
Set , and .
Use the generated sample , to estimate and define the empirical distributions of , , and .
Evaluate the estimator
for
belonging to the grid defined by
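The data-generating steps of the algorithm can be sketched as below for one replicate. Everything specific here is an assumption made for illustration: a trivariate Clayton copula sampled with the Marshall-Olkin frailty construction, a standard normal response, a Uniform(0, 1) continuous covariate, and a three-point discrete covariate with hypothetical probabilities (0.3, 0.4, 0.3). The discrete covariate is obtained through the generalized inverse $F^{-1}(u) = \inf\{x : F(x) \ge u\}$.

```python
import numpy as np
from scipy import stats


def sample_clayton(n, dim, theta, rng):
    """Marshall-Olkin sampler for a Clayton copula with theta > 0:
    U_i = (1 + E_i / V)^(-1/theta), V ~ Gamma(1/theta, 1), E_i ~ Exp(1)."""
    v = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))
    e = rng.exponential(size=(n, dim))
    return (1.0 + e / v) ** (-1.0 / theta)


def discrete_quantile(u, support, probs):
    """Generalized inverse of a discrete c.d.f.: smallest support point
    whose cumulative probability is at least u."""
    cum = np.cumsum(probs)
    cum[-1] = 1.0  # guard against floating-point round-off in the last step
    return np.asarray(support)[np.searchsorted(cum, u, side="left")]


def simulate_sample(n, theta, rng):
    """One replicate (Y, X1, X2) with assumed margins."""
    u = sample_clayton(n, 3, theta, rng)
    y = stats.norm.ppf(u[:, 0])          # response: standard normal (assumed)
    x1 = u[:, 1]                          # continuous covariate: Uniform(0, 1)
    x2 = discrete_quantile(u[:, 2], support=[0, 1, 2],
                           probs=[0.3, 0.4, 0.3])  # hypothetical probabilities
    return y, x1, x2


rng = np.random.default_rng(1)
y, x1, x2 = simulate_sample(n=2000, theta=2.0, rng=rng)
```

Repeating `simulate_sample` J times with independent seeds gives the J samples of size n on which the estimator is evaluated.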
For fixed
, we first compute the theoretical value
and then evaluate
using
J random samples of size
n. We denote the corresponding estimates by
, where
. To assess the performance, we employ the empirical integrated mean squared error (IMSE), which is formulated as follows:
where
denotes the cardinality of the grid
F. Notably,
can be decomposed into the square of empirical bias,
, and the empirical variance,
, as follows:
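A minimal sketch of this decomposition, under assumed conventions: the estimates are stored in an array with one row per replicate and one column per grid point, the IMSE is the squared error averaged over replicates and grid points, the integrated squared bias uses the replicate-averaged estimate at each grid point, and the integrated variance uses the divisor J. The grid, the true curve, and the noisy estimates in the example are synthetic placeholders.

```python
import numpy as np


def imse_decomposition(estimates, truth):
    """Return (IMSE, integrated squared bias, integrated variance).

    estimates: array of shape (J, G), row j holds the estimate of the
               regression function on the G grid points for replicate j.
    truth:     array of shape (G,), the true regression function on the grid.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    mean_estimate = estimates.mean(axis=0)           # average over replicates

    imse = np.mean((estimates - truth) ** 2)
    ibias2 = np.mean((mean_estimate - truth) ** 2)
    ivar = np.mean((estimates - mean_estimate) ** 2)  # divisor J, not J - 1
    return imse, ibias2, ivar


# Synthetic placeholder data: a biased, noisy "estimator" of sin on a grid.
rng = np.random.default_rng(2)
grid = np.linspace(0.1, 0.9, 9)
truth = np.sin(grid)
estimates = truth + 0.05 + 0.1 * rng.standard_normal((100, grid.size))

imse, ibias2, ivar = imse_decomposition(estimates, truth)
```

With the divisor J in the variance term, the identity IMSE = squared bias + variance holds exactly rather than approximately.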
In this simulation study, different values of the parameters are considered, which represent different dependence scenarios ranging from weak to strong, with Kendall’s tau,
, values lying in the interval
. With a sample size
, the response,
Y, is generated from
distribution and Student’s
t-distribution with 3 degrees of freedom. Also,
is generated from a Uniform(0, 1) and
with distribution
,
and
, where
and
. In this context, we report the integrated mean squared error (IMSE) and the integrated mean absolute error (IMAE) of the proposed estimator and compare them with the corresponding errors of the least squares (LS) regression method. The values reported in Table 1 and Table 2, corresponding to the normal distribution and Student's t-distribution, respectively, are averages over 100 realizations, which accounts for the variability in outcomes across realizations. The results show that the proposed method outperforms least squares regression in all the scenarios considered, for both IMSE and IMAE and for all values of Kendall's tau and all sample sizes. We also analyzed the evolution of the IMSE with the sample size, confirming a clear reduction as n grows, which reflects improved accuracy and stability of the estimator (see Table 3). Specifically, we considered
and
, which are relatively small. As n increases, the estimator improves significantly in terms of IMSE.
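Since the dependence scenarios are indexed by Kendall's tau, it can help to see how a target tau translates into a copula parameter. The sketch below uses the standard relations for the three families considered, restricted to positive dependence: tau = theta / (theta + 2) for Clayton, tau = 1 - 1/theta for Gumbel, and tau = 1 - (4/theta)(1 - D_1(theta)) for Frank, where D_1 is the first Debye function. The function names and the root-finding bracket are assumptions, and the specific tau values used in the paper are not reproduced here.

```python
import numpy as np
from scipy import integrate, optimize


def clayton_theta(tau):
    """Invert tau = theta / (theta + 2), valid for 0 < tau < 1."""
    return 2.0 * tau / (1.0 - tau)


def gumbel_theta(tau):
    """Invert tau = 1 - 1 / theta, valid for 0 <= tau < 1."""
    return 1.0 / (1.0 - tau)


def frank_tau(theta):
    """Kendall's tau of the Frank copula for theta > 0."""
    def integrand(t):
        return t / np.expm1(t) if t > 0.0 else 1.0

    integral, _ = integrate.quad(integrand, 0.0, theta)
    debye = integral / theta                  # first Debye function D_1(theta)
    return 1.0 - 4.0 / theta * (1.0 - debye)


def frank_theta(tau):
    """Solve frank_tau(theta) = tau numerically; the bracket (1e-3, 100)
    covers tau roughly between 0.0002 and 0.96."""
    return optimize.brentq(lambda th: frank_tau(th) - tau, 1e-3, 100.0)


# Example: parameters giving the same moderate dependence level tau = 0.5.
theta_clayton = clayton_theta(0.5)
theta_gumbel = gumbel_theta(0.5)
theta_frank = frank_theta(0.5)
```

Matching the families on tau rather than on the raw parameter makes the weak-to-strong dependence scenarios comparable across copulas.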
In particular, the proposed method is more accurate and robust, yielding lower IMSE and IMAE between the estimated and true regression values than the least squares method. This improved performance can be attributed to the ability of the proposed method to capture the underlying dependence structure in the data, whose strength is summarized by Kendall's tau. Unlike the least squares method, which assumes a specific (linear) form of relationship, the proposed method offers a more flexible approach to analyzing data with varying degrees of dependence and complexity.