1. Introduction
In regression models, categorical variables are important in applications, and the explanatory variables often have a natural group structure. For both the interpretability and the accuracy of the model, this group information should be taken into account, especially in high-dimensional settings where sparsity and variable selection play a crucial role in estimation accuracy. Generally speaking, penalized regression models perform well for variable-selection problems, and a large body of literature has been devoted to this topic [1,2,3,4,5,6,7,8,9,10,11,12]. When the explanatory variables have a group structure, penalized regularization also plays an important role, for example in the group least absolute shrinkage and selection operator (LASSO) [13], the group smoothly clipped absolute deviation (SCAD) penalty [14] and the group minimax concave penalty (MCP) [15] models.
It is well known that the $\ell_p$ norm ($0<p<1$) is a good approximation of the $\ell_0$ norm and can recover a sparser solution than the $\ell_1$ norm [16]. For variables with a group structure, the $\ell_{p,q}$ norm plays an important role in inducing group sparsity. The $\ell_{p,q}$ norm with a group structure is defined as follows:
$$\|x\|_{p,q} := \Big(\sum_{i=1}^{r} \|x_{G_i}\|_q^{p}\Big)^{1/p}, \qquad (1)$$
where $p>0$, $q\ge 1$, and $x=(x_{G_1}^\top,\ldots,x_{G_r}^\top)^\top\in\mathbb{R}^n$ is the grouping of the variable $x$. Here $G_i\subseteq\{1,\ldots,n\}$ denotes the index set corresponding to the $i$-th group. We denote $\{G_1,\ldots,G_r\}$ to be the group index sets, and, for $S\subseteq\{1,\ldots,r\}$, $G_S$ denotes the index set $\bigcup_{i\in S}G_i$.
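To make formulation (1) concrete, the following Python sketch computes the group $\ell_{p,q}$ norm for a coefficient vector whose grouping is given by a list of index arrays; the function name and the toy example are ours, not part of the original paper.

```python
import numpy as np

def group_lpq_norm(x, groups, p, q):
    """Compute the l_{p,q} norm (sum_i ||x_{G_i}||_q^p)^(1/p) of x.

    x      : 1-D coefficient vector
    groups : list of index arrays G_1, ..., G_r (a partition of the indices)
    p, q   : parameters of the norm, p > 0 and q >= 1
    """
    group_norms = np.array([np.linalg.norm(x[g], ord=q) for g in groups])
    return np.sum(group_norms ** p) ** (1.0 / p)

# small usage example with two groups of sizes 3 and 2
x = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
groups = [np.arange(0, 3), np.arange(3, 5)]
print(group_lpq_norm(x, groups, p=0.5, q=2))
```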
For the linear regression problem with $\ell_{p,q}$ regularization, the oracle inequality and the global recovery bound were established in [17], which clearly demonstrated the merits of $\ell_{p,q}$ regularization. In this paper, we employ this penalized regularization for variable selection in the logistic regression model.
We assume that $\beta\in\mathbb{R}^n$ is the vector of coefficients of the explanatory variables. The matrix $X$ collects the explanatory variables, $x_i^\top$ denotes the $i$-th row of $X$, and $y_i\in\{0,1\}$ ($i=1,\ldots,m$) are the categorical response variables.
In this paper, we consider the logistic regression model with the $\ell_{p,q}$ norm, described as follows:
$$\min_{\beta\in\mathbb{R}^n}\; \ell(\beta) + \lambda\|\beta\|_{p,q}^{p}, \qquad (2)$$
where $\ell(\beta)$ is the loss function and $\|\beta\|_{p,q}$ is defined by formulation (1). Moreover, from the properties of the logistic regression model we know that the loss takes the negative log-likelihood form $\ell(\beta)=\frac{1}{m}\sum_{i=1}^{m}\big[\ln\big(1+e^{x_i^\top\beta}\big)-y_i x_i^\top\beta\big]$, $0<p\le 1$, $q\ge 1$, and $\lambda>0$ is the penalized parameter.
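A minimal sketch of the objective of model (2), under the loss form written above and reusing `group_lpq_norm` from the previous sketch; the helper names are ours, and the loss form is our reading of the stripped formula.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Average logistic negative log-likelihood: (1/m) * sum[log(1 + exp(x_i'b)) - y_i * x_i'b]."""
    scores = X @ beta
    return np.mean(np.logaddexp(0.0, scores) - y * scores)

def penalized_objective(beta, X, y, groups, lam, p, q):
    """Objective of model (2): loss(beta) + lam * ||beta||_{p,q}^p."""
    return logistic_loss(beta, X, y) + lam * group_lpq_norm(beta, groups, p, q) ** p
```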
The group LASSO for the logistic regression model [18] is able to perform variable selection on groups of variables, and the model has the following form:
$$\min_{\beta}\; \ell(\beta) + \lambda\sum_{i=1}^{r}\sqrt{|G_i|}\,\|\beta_{G_i}\|_2, \qquad (3)$$
where $\lambda\ge 0$ controls the amount of penalization and $\sqrt{|G_i|}$ is used to rescale the penalty with respect to the dimensionality of the parameter vector $\beta_{G_i}$.
Moreover, a quite general composite absolute penalty for the group sparsity problem was considered in [19], and that model includes the group LASSO as a special case. The group LASSO is an important extension of $\ell_1$ regularization: it imposes an $\ell_2$ regularization on each group and, ultimately, yields sparsity in a group manner. This property can be observed in the numerical experiments.
Models (2) and (3) are both logistic regression models with penalized regularizations that aim to produce group-sparse solutions. However, the logistic regression model with $\ell_{p,q}$ regularization differs from the LASSO logistic regression: by adjusting the values of $p$ and $q$, it can yield a sparser solution within a group or between groups. This is illustrated in the numerical experiments.
To further illustrate the advantages of model (2), we also consider the logistic regression problem with the elastic net penalty, which yields a sparse solution and is described as follows:
$$\min_{\beta}\; \ell(\beta) + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2, \qquad (4)$$
where $\lambda_1$ and $\lambda_2$ are the penalized parameters. Model (4) does not promote group sparsity well, as shown in the numerical experiments.
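For comparison, an elastic net logistic regression such as (4) can be fitted with scikit-learn; this is only a usage sketch on synthetic data, and scikit-learn parameterizes the penalty through `C` (inverse strength) and `l1_ratio` rather than $\lambda_1$ and $\lambda_2$ directly, so the values shown are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
y_train = rng.integers(0, 2, size=100)

# the elastic net penalty requires the 'saga' solver; C is the inverse penalty strength
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
enet.fit(X_train, y_train)
print(enet.coef_)
```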
This paper is organized as follows. In Section 2, we introduce inequalities for the $\ell_{p,q}$ norm, the properties of the loss function of the logistic regression model and the $q$-group restricted eigenvalue condition relative to $(S,N)$ ($q$-GREC(S,N)), and we establish the oracle inequality and the global recovery bound for model (2). In Section 3, we apply the ADMM to solve model (2) and show that the subproblems of the algorithm can be solved efficiently. In Section 4, we use two simulation experiments that are commonly used for variable-selection problems to demonstrate the advantages of model (2) and the ADMM algorithm; the results of the LASSO logistic regression model (3) and the logistic regression model with the elastic net penalty (4) are compared with those of model (2). In Section 5, we report the results of model (2) on real data, which show the effectiveness of the proposed model and algorithm. The last section draws conclusions and presents future work.
We introduce some notation that will be used in the following analysis. Let $S$ be the index set of nonzero groups of the underlying sparse coefficient vector $\bar\beta$, let $S^c$ be the complement of $S$, and let $s:=|S|$ be the group sparsity of $\bar\beta$. For a variable $\beta$ and an index set $T\subseteq\{1,\ldots,r\}$, we employ $\beta_{G_T}$ to denote the subvector of $\beta$ corresponding to $G_T$. For a group $G_i$, we employ $\beta_{G_i}=0$ to describe a zero group, which means that $\beta_j=0$ for all $j\in G_i$. We use $\mathrm{rank}(\|\beta_{G_i}\|_q)$ to denote the rank of $\|\beta_{G_i}\|_q$ among the group norms $\{\|\beta_{G_j}\|_q\}$ (in decreasing order), and we employ $T(\beta;N)$ to denote the index set of the first $N$ largest groups in the value of $\|\beta_{G_i}\|_q$, that is, $T(\beta;N):=\{i:\mathrm{rank}(\|\beta_{G_i}\|_q)\le N\}$.
2. Theoretical Analysis
In this section, we analyze the oracle property and the global recovery bound of the penalized regression model (2). Firstly, we introduce the following inequalities of the $\ell_{p,q}$ norm and the properties of the loss function $\ell(\beta)$.
Lemma 1 ([17], p. 8). Let , and let K be the smallest integer such that . Then the following relation holds:

Lemma 2 ([17], p. 9). Let and . Then we have

Lemma 3 ([17], p. 13). Let , and , and for . Then the following inequalities hold:

Moreover, the following propositions concern the Lipschitz continuity and the convexity of the loss function $\ell(\beta)$.
Proposition 1. For , we have

Proof. For , based on the differential mean value theorem and the properties of norms, we can get

Hence, we obtain the desired result. □
Proposition 2. For any $\beta\in\mathbb{R}^n$, the function $\ell(\beta)$ is convex.

Proof. From the definition of the function $\ell(\beta)$, we know the Hessian matrix of the function $\ell(\beta)$ is
$$\nabla^2\ell(\beta)=\frac{1}{m}X^{\top}\Lambda X, \qquad \Lambda=\mathrm{diag}\big(\pi_1(1-\pi_1),\ldots,\pi_m(1-\pi_m)\big), \quad \pi_i=\frac{e^{x_i^{\top}\beta}}{1+e^{x_i^{\top}\beta}}.$$
Here, $0<\pi_i<1$, so the diagonal entries of $\Lambda$ are nonnegative, and for any $v\in\mathbb{R}^n$ we have $v^{\top}\nabla^2\ell(\beta)v=\frac{1}{m}\|\Lambda^{1/2}Xv\|_2^2\ge 0$. Thus, the Hessian matrix is a positive semi-definite matrix. Hence, the function $\ell(\beta)$ is convex. □
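The positive semi-definiteness used in the proof can also be checked numerically; the sketch below builds the Hessian $\frac{1}{m}X^{\top}\Lambda X$ for the loss form above on random data (the helper name and the data are ours) and verifies that its smallest eigenvalue is nonnegative.

```python
import numpy as np

def logistic_hessian(beta, X):
    """Hessian (1/m) * X' diag(pi_i (1 - pi_i)) X of the average logistic loss."""
    m = X.shape[0]
    pi = 1.0 / (1.0 + np.exp(-X @ beta))   # pi_i = e^{x_i'b} / (1 + e^{x_i'b})
    weights = pi * (1.0 - pi)              # diagonal of Lambda, always in (0, 1/4]
    return (X.T * weights) @ X / m

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
beta = rng.normal(size=8)
eigvals = np.linalg.eigvalsh(logistic_hessian(beta, X))
print(eigvals.min() >= -1e-12)   # numerically positive semi-definite
```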
The above lemmas and propositions state the inequalities for $\ell_{p,q}$ regularization, and they will be used in the proofs of the oracle inequality and the global recovery bound. Oracle inequalities for the prediction error were discussed in [20,21], where they were derived for LASSO-type estimators without restricted eigenvalue or sparsity conditions. Moreover, for group LASSO problems, oracle inequalities were discussed in [22,23,24] under restricted eigenvalue assumptions. For linear regression with $\ell_{p,q}$ regularization, the oracle inequality was established in [17] with the help of the $q$-GREC(S,N).
Moreover, the $q$-GREC(S,N) is very important for the analysis of the oracle property and the global recovery bound for the $\ell_{p,q}$ norm. We also use the smallest non-zero eigenvalue of the Hessian matrix of the function $\ell(\beta)$. The condition is introduced in the following definition.

Definition 1. Let . The $q$-GREC(S,N) is said to be satisfied if

The oracle property is an important property for variable selection: it gives an upper bound on the squared error of the logistic regression problem and on the violation of the true nonzero groups for each point in the level set of the objective function of problem (2). For a given value, the level set is the set of points at which the objective function of problem (2) does not exceed that value. From the definition of the level set in ([25], p. 8), we know that many properties of the optimization problem (2) are related to this level set.
Theorem 1. Let , and let the $q$-GREC(S,N) hold. Let $\bar\beta$ be the unique solution at group sparsity level $s$, and let $S$ be the index set of nonzero groups of $\bar\beta$. Let $K$ be the smallest integer such that . Then, for any $\beta$ in the level set, the following oracle inequality holds:

Moreover, letting , we have

Proof. Let $\beta$ be a point in the level set as above; by the definition of the level set, we have

Then, by Lemmas 1 and 2 and the fact that , one has the following formulation:

Then, the $q$-GREC(S,N) implies the following:

From the expansion of the Taylor formula, we obtain the following relationship:

where . The first inequality of formulation (10) is based on the fact that $\bar\beta$ is the unique optimal solution at group sparsity level $s$. Moreover, because of the uniqueness of $\bar\beta$, the smallest eigenvalue of the Hessian matrix of the function $\ell$ is positive; hence, we obtain the second inequality.

Combining this with formulation (9), we get

Hence, by formulations (12) and (13), we obtain the oracle inequality (8). Moreover, from the definition of , the $q$-GREC(S,N) implies that

Thus, the proof is complete. □
In the following, we establish the global recovery bound for the $\ell_{p,q}$ regularization problem (2). The global recovery bound shows that the sparse solution $\bar\beta$ can be recovered by any point $\beta$ in the level set; in particular, $\beta$ can be taken as a global optimal solution of problem (2) when the penalized parameter $\lambda$ is small enough. We present the global recovery bound for the $\ell_{p,q}$ regularization problem (2) in the next theorem.
Theorem 2. Let , and suppose the $q$-GREC(S,S) holds. Let $\bar\beta$ be the unique solution at group sparsity level $s$, and let $S$ be the index set of nonzero groups of $\bar\beta$. Let $K$ be the smallest integer such that . Then, for any $\beta$ in the level set, the following global recovery bound for problem (2) holds:

Proof. Let $\beta$ be as defined in Theorem 1. Since , from Lemma 3 and Theorem 1, we get

Furthermore, from Theorem 1 and the fact that , we get

Hence, formulation (14) holds.

Moreover, if , we have and we get

If , we know . Hence , and we get

Thus, formulation (15) holds. □
Remark 1. From Proposition 2, we know that $\ell(\beta)$ is convex. The convexity of the function $\ell$ makes the assumptions on the variable $\beta$ in Theorems 1 and 2 reasonable, and these conditions help us to obtain the desired results.
3. ADMM Algorithm
In this section, we give an algorithm based on the ADMM [26,27] for solving the logistic regression model with $\ell_{p,q}$ regularization (2). The ADMM performs very well for problems in which the variables can be separated. By introducing an auxiliary variable $z$, model (2) can be equivalently described as follows:
$$\min_{\beta,z}\ \ell(\beta)+\lambda\|z\|_{p,q}^{p}\quad \text{s.t.}\ \beta-z=0.$$
The augmented Lagrange function of the above model is as follows:
$$L_{\mu}(\beta,z,u)=\ell(\beta)+\lambda\|z\|_{p,q}^{p}+u^{\top}(\beta-z)+\frac{\mu}{2}\|\beta-z\|_{2}^{2},$$
where $u$ is the dual variable and $\mu>0$ is the augmented Lagrange multiplier.
Generally speaking, the structure of the ADMM algorithm is given as follows:
$$\beta^{k+1}=\arg\min_{\beta}\ L_{\mu}(\beta,z^{k},u^{k}), \qquad (19)$$
$$z^{k+1}=\arg\min_{z}\ L_{\mu}(\beta^{k+1},z,u^{k}), \qquad (20)$$
$$u^{k+1}=u^{k}+\mu(\beta^{k+1}-z^{k+1}).$$
Based on the above structure and Propositions 1 and 2, subproblem (19) is an unconstrained convex optimization problem whose objective function is Lipschitz continuous. Because of these good properties, subproblem (19) can be solved effectively by many optimization algorithms, such as the trust region algorithm, sequential quadratic programming and gradient-based algorithms. Moreover, the first-order optimality conditions of problem (19) are given by the following formulation:
$$\nabla_{\beta} L_{\mu}(\beta,z^{k},u^{k})=0.$$
Thus, we obtain the optimal solution of subproblem (19) by solving the following nonlinear equations:
$$\nabla\ell(\beta)+u^{k}+\mu(\beta-z^{k})=0, \qquad (22)$$
where $\nabla\ell(\beta)$ denotes the gradient of the loss function $\ell$ at $\beta$.
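Because subproblem (19) is smooth and convex, a gradient-based SciPy solver can be used to solve it (equivalently, to solve the nonlinear Equation (22)); the following is only a sketch under the loss form of Section 1 and the variable splitting above, with function names of our own.

```python
import numpy as np
from scipy.optimize import minimize

def beta_update(X, y, z_k, u_k, mu, beta_init):
    """Solve subproblem (19): min_beta loss(beta) + u_k'(beta - z_k) + (mu/2)||beta - z_k||^2."""
    def value_and_grad(beta):
        scores = X @ beta
        pi = 1.0 / (1.0 + np.exp(-scores))
        loss = np.mean(np.logaddexp(0.0, scores) - y * scores)
        grad_loss = X.T @ (pi - y) / X.shape[0]
        diff = beta - z_k
        val = loss + u_k @ diff + 0.5 * mu * diff @ diff
        grad = grad_loss + u_k + mu * diff
        return val, grad

    res = minimize(value_and_grad, beta_init, jac=True, method="L-BFGS-B")
    return res.x
```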
From this fact, we know that the variable $z$ is group sparse. Hence, we can divide the variables $z$ and $\beta$ by the group structure: $z_{G_i}$ and $\beta_{G_i}$ denote the $i$-th group variables of $z$ and $\beta$, respectively, and subproblem (20) is solved by exploiting this group structure. Then, for $i=1,\ldots,r$, $z^{k+1}_{G_i}$ can be given by solving the following optimization problem:
$$z^{k+1}_{G_i}=\arg\min_{z_{G_i}}\ \lambda\|z_{G_i}\|_q^{p}+\frac{\mu}{2}\Big\|z_{G_i}-\beta^{k+1}_{G_i}-\frac{1}{\mu}u^{k}_{G_i}\Big\|_2^{2}. \qquad (23)$$
We can then obtain the solution for subproblem (20) by the following:
$$z^{k+1}=\big((z^{k+1}_{G_1})^{\top},\ldots,(z^{k+1}_{G_r})^{\top}\big)^{\top}. \qquad (24)$$
Moreover, we find that problem (23) can be equivalently solved by the following proximal problem:
$$z^{k+1}_{G_i}=\operatorname{prox}_{\frac{\lambda}{\mu}\|\cdot\|_q^{p}}\Big(\beta^{k+1}_{G_i}+\frac{1}{\mu}u^{k}_{G_i}\Big). \qquad (25)$$
The proximal gradient method given in [17] has proven very useful for solving (25). Based on the above analysis, we find that the ADMM is effective for solving the logistic regression problem with $\ell_{p,q}$ regularization, and we describe the complete procedure as Algorithm 1.
Algorithm 1: ADMM algorithm for solving (2).
Step 1: Initialization: give $\beta^{0}$, $z^{0}$, $u^{0}$, $\lambda$, $\mu$, and set $k:=0$;
Step 2: for $k=0,1,2,\ldots$, if the stopping criterion is satisfied, stop the algorithm; otherwise go to Step 3;
Step 3: Update $\beta^{k+1}$: $\beta^{k+1}$ is given by solving the nonlinear Equation (22);
Step 4: Update $z^{k+1}$: for $i=1,\ldots,r$, employ the proximal gradient method to solve the optimization problem (25) and obtain $z^{k+1}_{G_i}$; then $z^{k+1}$ is given by (24);
Step 5: Update $u^{k+1}$: $u^{k+1}=u^{k}+\mu(\beta^{k+1}-z^{k+1})$; set $k:=k+1$ and go to Step 2.
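To show how the steps of Algorithm 1 fit together, here is a sketch of the outer ADMM loop; it assumes the `beta_update` helper from the previous sketch and a user-supplied group proximal map `prox_group` for problem (25) (for instance, computed by the proximal gradient method of [17]); the names and the simple stopping rule are ours.

```python
import numpy as np

def admm_lpq_logistic(X, y, groups, lam, mu, prox_group,
                      max_iter=200, tol=1e-6):
    """ADMM loop of Algorithm 1: beta-step (22), group-wise z-step (25), dual step."""
    n = X.shape[1]
    beta = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(max_iter):
        beta = beta_update(X, y, z, u, mu, beta)           # Step 3
        for g in groups:                                   # Step 4, one prox per group
            z[g] = prox_group(beta[g] + u[g] / mu, lam / mu)
        u = u + mu * (beta - z)                            # Step 5
        if np.linalg.norm(beta - z) <= tol:                # simple stopping criterion
            break
    return z
```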
4. Simulation Examples
In this section, we employ simulation data to illustrate the efficiency of the logistic regression model with $\ell_{p,q}$ regularization and the ADMM algorithm.
$\ell_{1/2}$ regularization has been shown to give a sparser optimal solution than the $\ell_1$ norm [28]. Hence, we employed $p=1/2$ in our numerical experiments, and we used the ADMM algorithm designed in Section 3 to solve model (2). The environment for the simulations was Python 3.7.
In order to verify the prediction and classification performance of the penalized logistic regression model (2), we designed two simulation experiments with different data structures. At the same time, we applied the LASSO logistic regression model and the logistic regression model with the elastic net penalty to the same problems. The numerical results illustrate the advantages of the logistic regression model with $\ell_{p,q}$ regularization.
We mainly considered two aspects of model performance: the ability of the model to select variables, and the classification and prediction performance of the penalized logistic regression models. The evaluation indexes used in this section mainly include the following:
P: the number of non-zero coefficients in the estimate that the model gives.
TP: the number of coefficients predicted to be non-zero which are actually non-zero.
TN: the number of coefficients predicted to be zero which are actually zero.
FP: the number of coefficients predicted to be non-zero but which are actually zero.
FN: the number of coefficients predicted to be zero but which are actually non-zero.
Ratio: the ratio of the number of non-zero coefficients in the predicted case to that of the true case, that is, P divided by the true number of non-zero coefficients.
ACC: the accuracy of the prediction for the test data, that is, the proportion of correctly classified test samples.
AUC: area under the ROC curve.
A value of P close to the true number of non-zero coefficients shows that the model is good. The greater the ACC and the AUC, the better the model. A ratio close to 1 also indicates the goodness of the model.
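The selection and prediction indexes above can be computed directly from an estimated coefficient vector, the true one and a test set; a minimal sketch (variable names are ours; the AUC uses scikit-learn):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def selection_metrics(beta_hat, beta_true, tol=1e-8):
    """P, TP, TN, FP, FN and the ratio P / (true number of non-zeros)."""
    pred_nz = np.abs(beta_hat) > tol
    true_nz = np.abs(beta_true) > tol
    P = int(pred_nz.sum())
    TP = int((pred_nz & true_nz).sum())
    TN = int((~pred_nz & ~true_nz).sum())
    FP = int((pred_nz & ~true_nz).sum())
    FN = int((~pred_nz & true_nz).sum())
    ratio = P / max(int(true_nz.sum()), 1)
    return dict(P=P, TP=TP, TN=TN, FP=FP, FN=FN, ratio=ratio)

def prediction_metrics(beta_hat, X_test, y_test):
    """Accuracy and AUC of the fitted logistic model on the test set."""
    prob = 1.0 / (1.0 + np.exp(-X_test @ beta_hat))
    acc = np.mean((prob >= 0.5) == y_test)
    return acc, roc_auc_score(y_test, prob)
```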
Moreover, for the penalized parameter $\lambda$, we used a test-set verification method to choose it. Firstly, we selected a value of $\lambda$ that made all coefficients equal to 0 and set it as $\lambda_{\max}$. Secondly, we chose a number very close to 0, such as 0.0001, and set it as $\lambda_{\min}$. We then used a grid of candidate values between $\lambda_{\min}$ and $\lambda_{\max}$ for the numerical experiments, and finally selected the $\lambda$ that produced the maximum value of the AUC. The augmented Lagrange multiplier $\mu$ did not influence the convergence of the algorithm, but a proper choice gives a faster convergence rate; when we performed the numerical experiments, we simply chose a value that made the algorithm converge quickly.
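A sketch of the test-set verification described above: evaluate a grid of $\lambda$ values between $\lambda_{\min}$ and $\lambda_{\max}$ and keep the one with the largest AUC; the grid size and the generic `fit`/`predict_auc` callables are our own abstraction of the procedure.

```python
import numpy as np

def choose_lambda(fit, predict_auc, lam_min=1e-4, lam_max=1.0, num=20):
    """Pick the penalty parameter that maximizes the AUC on the test set.

    fit(lam)          -> estimated coefficient vector for penalty lam
    predict_auc(beta) -> AUC of that estimate on the test data
    """
    grid = np.geomspace(lam_min, lam_max, num)
    aucs = [predict_auc(fit(lam)) for lam in grid]
    best = int(np.argmax(aucs))
    return grid[best], aucs[best]
```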
4.1. Simulation Experiment with Non-Sparse Variables in the Group
Firstly, we constructed group-structured features with similar features within a group and different features between groups, and we generated the data as in [1]. The data were generated according to the following model:

Here, the explanatory variables for the groups followed multivariate normal distributions, and the error term followed the standard normal distribution. The correlation coefficient between variables within a group was $\rho$. Generally speaking, we employed $\rho=0.2$ or $\rho=0.7$ to denote weak or strong correlation of the variables in the group.
For this simulation experiment, we generated data using 10 groups independently, and each group contained five variables. Hence, the total number of variables was 50. There were three groups that were significant, and the other seven groups were not significant. The correlation coefficients within the groups were 0.2 and 0.7, respectively. The sample size was 500. We selected
of the data for the training set and the others were the test set. The experimental simulation was repeated 30 times. For this example, the penalized parameter was
for the LASSO logistic regression model. For model (
2) the penalized parameter was
and the augmented Lagrange multiplier was chosen as
. For the logistic regression model with the elastic net penalty, the parameters were set as
and
.
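The exact generative model is not fully recoverable from the text, so the following sketch only reproduces the stated structure of this experiment (10 independent groups of five variables, three significant groups, equicorrelation $\rho$ within a group, sample size 500); everything beyond that, including the coefficient values and the response mechanism, is our own illustrative choice.

```python
import numpy as np

def make_grouped_data(n=500, n_groups=10, group_size=5, n_signif=3, rho=0.2, seed=0):
    """Simulate grouped covariates with equicorrelation rho inside each group."""
    rng = np.random.default_rng(seed)
    cov = np.full((group_size, group_size), rho) + (1 - rho) * np.eye(group_size)
    X = np.hstack([rng.multivariate_normal(np.zeros(group_size), cov, size=n)
                   for _ in range(n_groups)])
    beta = np.zeros(n_groups * group_size)
    beta[: n_signif * group_size] = 1.0          # illustrative non-zero coefficients
    logits = X @ beta + rng.normal(size=n)       # linear predictor plus noise
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    groups = [np.arange(i * group_size, (i + 1) * group_size) for i in range(n_groups)]
    return X, y, beta, groups
```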
The following table gives the numerical results for this example with different models and different correlation coefficients.
According to Table 1, the logistic regression model with the elastic net penalty, the LASSO logistic regression model and the logistic regression model with $\ell_{p,q}$ regularization could all perform variable selection. Under both correlation levels, the criteria of the logistic regression model with $\ell_{p,q}$ regularization were better than those of the other two models. According to the indexes P and TP, all variables with non-zero coefficients were screened out by the logistic regression model with $\ell_{p,q}$ regularization under both correlation levels; moreover, its value of P was close to TP, so the selected variables were essentially all variables with truly non-zero coefficients. When choosing different parameters, the logistic regression model with $\ell_{p,q}$ regularization selected more variables with truly non-zero coefficients than the other two models, and the number of selected variables was closer to the number of variables with truly non-zero coefficients; its ratio was closer to 1, and it could select the significant variables. In terms of prediction, the AUC and the accuracy of the logistic regression model with $\ell_{p,q}$ regularization showed better prediction performance under both correlation coefficients.
According to the variable-selection and prediction performance of the models, the logistic regression model with $\ell_{p,q}$ regularization was better than the logistic regression model with the elastic net penalty and the LASSO logistic regression model. Since the data for this simulation experiment were designed so that the coefficients within a group were either all zero or all non-zero, a choice of $p$ and $q$ that compresses variables at the group level performed best: the corresponding model could select or discard a whole group of variables and thus had an ideal group variable-selection effect. The logistic regression model with the elastic net penalty and the LASSO logistic regression model compressed the variables individually to achieve a sparse effect, but because they did not make full use of the group structure information of the variables, they selected too many variables with zero coefficients during variable selection, so the performance of these two models was not as good.
4.2. Simulation Experiment with the Sparse Variables in the Group
The data were generated similarly to the above experiment. The differences between these two simulation experiments are as follows. A total of six groups of variables with intragroup correlation were simulated, and each group contained 10 variables. Two groups contained significant variables, 12 in total: one group was completely significant, and the other group contained two significant variables. The sample size was 500. The ratio of the training set to the test set was
. The correlation coefficients within the group were 0.2 and 0.7, respectively. For this example, the penalized parameter was
for the LASSO logistic regression model, and for model (
2) the penalized parameter was
and the augmented Lagrange multiplier was
. When
, the parameters for the logistic regression model with the elastic net penalty were
and
. Moreover, when
, the parameters for the logistic regression model with the elastic net penalty were
and
.
The following table gives us the numerical results for this example with different models and different correlation coefficients.
According to Table 2, the logistic regression model with $\ell_{p,q}$ regularization performed very well. From the perspective of variable selection, the logistic regression model with the elastic net penalty, the LASSO logistic regression model and the logistic regression model with $\ell_{p,q}$ regularization could all screen out the variables with non-zero coefficients, but the first two models also selected too many variables whose true coefficients are zero. The logistic regression model with $\ell_{p,q}$ regularization not only recovered all the variables with non-zero coefficients, but its values of P and TP were also very close, which means that the variables it selected were close to the truly significant ones. For this kind of data, we found that a penalty that compresses a whole group of variables to be zero or non-zero at the same time could select the significant groups, but it could not filter out the important variables within a group. In terms of prediction, the logistic regression model with $\ell_{p,q}$ regularization performed well for different correlation coefficients, and its prediction ability also improved compared with the univariate selection models.
Combining the effects of variable selection and model prediction, the logistic regression model with $\ell_{p,q}$ regularization performed well when the variables were sparse within the group. Due to the "all in all out" mechanism, a penalty that acts only at the group level can only screen out important variable groups; the important variables within a group cannot be screened out, so its variable-selection ability is limited. The logistic regression model with $\ell_{p,q}$ regularization can overcome this disadvantage.
From the above two simulation experiments, we found that the logistic regression model with $\ell_{p,q}$ regularization performs well in variable selection and prediction for data with a group structure. Moreover, for different data, the value of $p$ can be adjusted to adapt to the problem.
5. Real-Data Experiment
The data description and preprocessing are as follows. In this section, real data are considered. The data came from the excellent mining quantitative platform (https://uqer.datayes.com/, accessed on 31 October 2021). The data were the factor data and the yield data of the constituent stocks of the Shanghai and Shenzhen 300 index in China's stock market from 1 January 2010 to 31 December 2020. The advantages of using these data are the good performance, large scale, high liquidity and active trading of these stocks in the market.

In order to ensure the accuracy and rationality of the analysis results, we needed to select and correct the range of samples. According to the development experience of China's stock market and previous research experience, the empirical part located the starting point of the sample in 2010. Secondly, because the capital of companies in the financial industry has the characteristics of high leverage and high debt, the construction of some financial indicators differs considerably from that of other listed companies; therefore, we excluded some financial companies based on our experience. In addition, ST and PT stocks have abnormal financial conditions and performance losses, so we also excluded these stocks with weak comparability. Among the 243 stock factors available on the excellent mining quantitative platform, 34 factors belonging to nine groups were selected to evaluate model (2).

The data were daily data, but in practice, if daily transaction data were used for investment, the frequent transactions would lead to a significant increase in transaction costs, and the rise of transaction costs would affect the annualized rate of return. In order to reduce the impact of transaction costs, we used monthly transaction data for modeling. Hence, it was necessary to average each factor's data by month, and we used the monthly data for stock selection. In the division of the data set, the data from 1 January 2010 to 30 April 2018 were selected as the training set, and the data from 1 May 2018 to 31 December 2020 were used for back testing.
In order to ensure the quality of subsequent factor screening and the effect of stock rise and fall prediction, we needed to preprocess the data before the analysis. The methods for data preprocessing included noise cleaning, missing value processing, data standardization, lag processing and so on.
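As an illustration of the listed preprocessing steps, here is a small pandas sketch; the column layout (a `stock` identifier plus factor columns) and the concrete choices (median imputation, z-score standardization, a one-month lag) are ours, not the paper's.

```python
import pandas as pd

def preprocess_factors(df, factor_cols):
    """Missing-value handling, standardization and one-period lag of factor columns."""
    out = df.copy()
    out[factor_cols] = out[factor_cols].fillna(out[factor_cols].median())    # missing values
    out[factor_cols] = (out[factor_cols] - out[factor_cols].mean()) / out[factor_cols].std()  # z-score
    out[factor_cols] = out.groupby("stock")[factor_cols].shift(1)            # lag by one month
    return out.dropna(subset=factor_cols)
```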
Using the above factors and data processing methods to generate the stock factor matrix, we employed the logistic regression model with $\ell_{p,q}$ regularization and the ADMM algorithm proposed in Section 3 to estimate the coefficients of the factors. We then calculated the posterior probability for each stock, sorted the probabilities from large to small, and bought the top ten stocks with equal weight.
In the following, we introduce the historical back-test evaluation indicators. When we performed the back-test analysis of the solutions, in order to evaluate them objectively and comprehensively, it was necessary to specify evaluation indicators. We selected nine indicators: the annualized return rate, the benchmark annualized return rate, the Sharpe ratio, the volatility, the return unrelated to market fluctuations (alpha), the sensitivity to market changes (beta), the information ratio, the maximum drawdown and the annual turnover rate.
Based on the above evaluation criteria, we used the selected data to back test and verify the effectiveness of model (2), and we employed the model to predict the stock trend. Moreover, we sorted the predicted probabilities, selected the top 10 stocks with the highest probability of rising as the stock portfolio for the month, and constructed an equally weighted portfolio, which we held until the end of the month to compute that month's return. The initial capital was set at 10 million yuan, the tax for buying was 0.003, the tax for selling was 0.0013, and the slippage was set to 0. Moreover, the parameters for model (2) were given as ; was chosen for the LASSO logistic regression model; and we adopted for the logistic regression model with the elastic net penalty. We employed the ADMM algorithm given in Section 3 to solve this example, and the Lagrange multipliers of the ADMM algorithm for these models were chosen as . After the calculation, the transaction back-test results and the cumulative yield curve are given in Table 3 and Figure 1, respectively.
From Table 3, we found that the annualized return rates of these three models were all higher than the benchmark annualized return rate. Among the three models, the annualized return rate of the logistic regression model with $\ell_{p,q}$ regularization was the highest. Model (2) also gave a good strategy from the perspective of the excess return (alpha) and the Sharpe ratio. Under the same risk coefficient, the investment strategy based on the logistic regression model with $\ell_{p,q}$ regularization could help investors make effective investment decisions and obtain higher yields. Moreover, the strategy kept the maximum drawdown within an acceptable range and could more effectively control the drawdown risk.
From the graph of the cumulative rate of return, the curve given by the logistic regression model with $\ell_{p,q}$ regularization was almost always above the benchmark return curve, which indicates that the return of the portfolio constructed with the group information was consistently stable. The portfolio constructed based on the logistic regression model with $\ell_{p,q}$ regularization could not only screen out the important factor types affecting stock returns, but could also screen out the important factor indicators within each group, so as to predict the probability of stock returns rising more accurately. When constructing a portfolio, this model therefore had certain advantages over the other two regression models.
6. Conclusions
Motivated by the data requirements, this paper proposed a logistic regression model with $\ell_{p,q}$ regularization. We showed the properties of the $\ell_{p,q}$ norm and of the loss function of the logistic regression problem. Moreover, the oracle inequality and the global recovery bound for the penalized regression model were established with the help of the $q$-group restricted eigenvalue condition. These properties are important for variable selection. In Section 3, we presented the framework of the ADMM algorithm for solving the penalized logistic regression model, and we gave a method for solving its subproblems, so as to reduce the difficulty and complexity of solving model (2).
The numerical simulation results showed that, since the logistic regression model with $\ell_{p,q}$ regularization comprehensively considers the group structure information of the variables, it can eliminate more redundant variables than univariate selection methods that consider only a single variable, and thus screen out the features that are more important for the response. It also achieved higher accuracy on the test set. Moreover, in the real-data experiment, the logistic regression model with $\ell_{p,q}$ regularization showed a more stable performance in variable selection and prediction.
In future work, we could extend the group logistic regression model with $\ell_{p,q}$ regularization. On the theoretical side, we could carry out further analysis, such as establishing local recovery bounds, and design a more suitable algorithm with guaranteed convergence. Moreover, in the numerical part, we could consider more choices of $p$ and $q$ to illustrate the merits of model (2).