In regression trees, the node impurity is commonly measured by the mean squared error. In this case, minimizing the within-node impurity is equivalent to maximizing a two-sample F test statistic that compares the two child nodes. For the time being, we consider continuous predictors only; the case of categorical predictors, as well as binary variables, is discussed later. For simplicity, we denote
as
The splitting statistic can be cast into the following linear model:
where
is the indicator function corresponding to the split and
The two-sample F test is the same as the squared t test for testing H0: β1 = 0 vs. Ha: β1 ≠ 0.
The indicator function in (1) can be written in terms of the Heaviside (unit) step function, denoted H(·). Although the value of H at 0 does not really matter in most uses (e.g., integration), one natural choice is H(0) = 1/2 for the sake of rotational symmetry.
Because this search must be carried out over all variables, it can impose significant computational strain in big data contexts where many features are present, each containing numerous potential splits; it also suffers from other problems, as discussed earlier. This motivates us to consider a new way of splitting data. Our idea is to approximate the threshold indicator function with a smooth sigmoid surrogate (SSS) function.
2.1. Sigmoid Functions
A sigmoid function is a mathematical function with an ‘S’ shape. Very often, the sigmoid function refers to the special case of the logistic function (also called the expit function) s(x) = 1/(1 + exp(−x)).
While there are various types of sigmoid functions, we primarily choose the logistic function due to its widespread use in statistics, such as in logistic regression, and the extensive research on its properties, e.g.,
Readers are referred to Balakrishnan [
19], Johnson, Kotz, and Balakrishnan [
20] and Rosenblatt [
16] (Section 23) for a book-length account of the logistic function and the logistic distribution as well as their properties and applications.
Figure 2 plots the logistic function with different values of the shape or scale parameter a. Note that s(−x) = 1 − s(x); hence, s(a(x − c)) and its complement 1 − s(a(x − c)) are mirror images of each other. Either one can be used to represent a split in the recursive partitioning setting; therefore, without loss of generality (WLOG), our consideration is restricted to s(a(x − c)). As a increases, s(a(x − c)) provides a better approximation of the indicator function 1{x > c}; similarly, 1 − s(a(x − c)) approximates 1{x ≤ c}. On the other hand, s(a(x − c)) becomes nearly linear in x for small a. Because tree methods model data with threshold effects, we are primarily interested in seeking an approximation to the indicator function, for which a relatively large a is desirable; hence, leaving
a as a free parameter for estimation is not advisable. To fix
a, we first note that the spread or scale of the predictor X on which the best cutoff is sought affects the appropriate choice of a. This concern leads to specifying a relative to the sample standard deviation of the predictor, which is equivalent to working with the standardized predictor. In other words, standardization is a necessary first step in SSS splitting. In terms of approximation, note that s(a(x − c)) converges to 1{x > c} as a → ∞ for any x ≠ c, at an exponential rate.
Thus, a reasonably large constant
a would provide a good approximation. Another possibility is to make a adaptive to the sample size n such that a → ∞ as n → ∞; many growth rates can be used because of the exponential convergence rate. The choice of
a will be further explored and discussed in the subsequent sections.
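To fix ideas, here is a minimal Python sketch (the paper's own implementation is in R; expit and soft_indicator are our own illustrative names) showing how the logistic surrogate hardens toward a 0/1 step at the cutoff as a grows:

```python
import math

def expit(x):
    """Logistic (expit) function s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def soft_indicator(x, c, a):
    """Smooth sigmoid surrogate s(a(x - c)) for the indicator 1{x > c}."""
    return expit(a * (x - c))

# The surrogate sharpens toward a step at c = 0.5 as a increases.
for a in (1, 5, 50):
    left, right = soft_indicator(0.4, 0.5, a), soft_indicator(0.6, 0.5, a)
    print(f"a={a:3d}: s at x=0.4 -> {left:.3f}, s at x=0.6 -> {right:.3f}")
```

For small a the two values are close to 1/2 (nearly linear behavior), while for large a they approach 0 and 1, mimicking the hard threshold.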
2.2. Estimating the Best Cutoff Point
To continue, a natural approach is to replace
in (
1) with a sigmoid function. The model then becomes
where
Note that we have suppressed the parameter
a in the notation, as it is fixed a priori. By choosing a large
a that corresponds to a sharper ‘S’ shape and estimating the
c corresponding to the best fit, we are essentially finding a sharp step function-type change in the data. Model (
3) is a nonlinear parametric model involving four parameters
Its estimation and related inference can be accomplished via nonlinear least squares [
21]. In fact, the related optimization is a separable least squares problem (Section 9.4.1, [
22]).
Estimating Model (
3) involves multivariate optimization. However, we only need to estimate
c for splitting purposes. This motivates us to approximate the splitting statistic directly and reduce the problem to a one-dimensional optimization with decision variable
c only. We first note that the splitting statistic
used in CART can be treated as an objective function for
c and rewritten as follows:
where
denote the sample size and the average response in the left child node, respectively; similarly,
for the right child node. Throughout this article, we use the notation
to indicate ‘equivalence or correspondence up to some irrelevant constant’. In particular, we have ignored irrelevant terms that do not involve
c in rewriting
in (
4).
To approximate
in (
4), it is only necessary to approximate
and
by replacing
with
such that
and
The approximated objective function, denoted as
, becomes
where
,
with
and
are
n-dimensional vectors.
Suppose further that, WLOG, the response has been centered
such that
It follows that
; hence,
can be further simplified as
up to some irrelevant constant. Its approximation
reduces to
The objective function in (
5) or (
6) involves a scale parameter
a. Whereas leaving
a for estimation is important in HME [
17] and soft trees [
18], it is not a good idea for our purpose. If
a is treated as a free parameter, a small
a estimate that corresponds to a nearly linear relationship may provide a better fit, but may not be well suited for identifying the best cutoff point in ordinary tree procedures. This is because tree models account for the relationships between the response and predictors via hard threshold functions. Moreover, if the best cutoff point is estimated for each predictor with a different scale parameter
a, comparison across predictors becomes groundless. Specific recommendations on the choice of
a will be provided in the subsequent sections. As discussed earlier, in order to fix
a it is crucial to standardize the predictor via (x − x̄)/s_x, where x̄ and s_x denote the sample mean and standard deviation of the variable, respectively. Alternatively, the range of the data may be normalized into the unit interval [0, 1] by setting x′ = (x − x_min)/(x_max − x_min), where x_min and x_max are the minimum and maximum of the variable, respectively.
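Both rescaling options can be sketched as follows (a minimal Python illustration; helper names are ours):

```python
import statistics

def standardize(x):
    """Center by the sample mean and scale by the sample standard deviation."""
    mean, sd = statistics.mean(x), statistics.stdev(x)
    return [(v - mean) / sd for v in x]

def min_max(x):
    """Normalize the sample range onto the unit interval [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

x = [2.0, 4.0, 6.0, 8.0]
print(standardize(x))  # zero mean, unit sample standard deviation
print(min_max(x))      # endpoints map to 0 and 1
```

With the predictor rescaled in either way, a fixed value of a carries the same meaning across variables.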
With fixed
a, the best cutoff point
can be obtained by maximizing
in (
5) or (
6) with respect to
c, and then transforming it back to the original data scale for interpretability. This is a one-dimensional optimization problem that can be conveniently solved by many standard algorithms. In this paper, we use Brent's method [23] as implemented in the function optimize in R [24]. Brent's derivative-free method is very popular because it combines the bisection method, the secant method, and inverse quadratic interpolation, taking advantage of the best aspects of each.
One issue concerns non-concavity. The objective function in (
5) or (
6), although smooth, is not concave, as demonstrated in
Figure 1. As a result, the algorithm may become stuck at a local optimum. In this case, global optimization procedures (see, e.g., [
25,
26]) can be helpful. Thanks again to the one-dimensional nature of the problem, one simple solution is to divide the search range into several intervals, find the maximum within each, and then identify the highest among them. We have included this step as an option in our implementation; it may occasionally improve performance, at the cost of slowing down the splitting procedure. This option is not used in any of the results reported in this paper. As shown in our numerical studies later, Brent's method seems quite effective in practice at locating the true cutpoint.
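To make this concrete, the following self-contained Python sketch illustrates the one-dimensional search, including the optional interval-subdivision guard against local optima. It is a sketch under stated assumptions, not the paper's implementation (which is in R via optimize): we substitute a simple golden-section search for Brent's method, and the criterion below, (Σ w_i y_i)² · n/(n_L n_R) with soft weights w_i = s(a(x_i − c)) and a centered response, is our rendering of the smoothed splitting statistic.

```python
import math
import random

def expit(x):
    """Logistic function s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sss_objective(c, x, y, a):
    """Smoothed splitting criterion for a centered response y:
    (sum_i w_i y_i)^2 * n / (n_L * n_R), where w_i = s(a(x_i - c)) are the
    soft membership weights of the right child node."""
    w = [expit(a * (xi - c)) for xi in x]
    n_r = sum(w)
    n_l = len(x) - n_r
    if n_l < 1e-8 or n_r < 1e-8:
        return 0.0
    swy = sum(wi * yi for wi, yi in zip(w, y))
    return swy * swy * len(x) / (n_l * n_r)

def golden_max(f, lo, hi, tol=1e-6):
    """Derivative-free golden-section maximization (a stand-in for Brent)."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    c1, c2 = hi - phi * (hi - lo), lo + phi * (hi - lo)
    while hi - lo > tol:
        if f(c1) < f(c2):
            lo, c1 = c1, c2
            c2 = lo + phi * (hi - lo)
        else:
            hi, c2 = c2, c1
            c1 = hi - phi * (hi - lo)
    return (lo + hi) / 2.0

def best_cutoff(x, y, a=30.0, n_intervals=5):
    """Subdivide the search range, maximize within each piece, keep the best."""
    ybar = sum(y) / len(y)
    yc = [yi - ybar for yi in y]  # center the response (WLOG)
    f = lambda c: sss_objective(c, x, yc, a)
    edges = [min(x) + i * (max(x) - min(x)) / n_intervals
             for i in range(n_intervals + 1)]
    cands = [golden_max(f, e0, e1) for e0, e1 in zip(edges, edges[1:])]
    return max(cands, key=f)

# Step-function data with a true cutpoint at c = 0.5
random.seed(1)
x = [random.random() for _ in range(500)]
y = [(1.0 if xi > 0.5 else 0.0) + random.gauss(0.0, 0.2) for xi in x]
print(best_cutoff(x, y))  # should land near the true cutpoint 0.5
```

With a fixed and the response centered, each criterion evaluation is a single pass over the data, so the search costs on the order of m·n operations, in line with the complexity discussion below.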
An immediate advantage of SSS over GS is computational efficiency. The following proposition provides an asymptotic quantification of the computational complexity involved in both GS and SSS splitting.
Proposition 1. Consider a typical dataset of size n in the regression tree setting. Both GS and SSS are used to find the best cutoff point of a continuous predictor X which has distinct values. In terms of computational complexity, GS is at best O(n log n) with the updating scheme and O(n²) without the updating scheme, while SSS is O(mn), where m is the number of iterations in Brent's method.
The proof of Proposition 1 is provided in
Appendix B. It is a common misconception that LS splitting with updating is only O(n); updating entails sorting the response values according to the X values, and it turns out that this sorting step asymptotically dominates the complexity of the algorithm. Comparatively, the cost of SSS depends on the number of iterations in Brent's method, denoted as m. Although the number of iterations is affected by the convergence criterion and the desired accuracy, m is generally small, as Brent's method has guaranteed convergence at a superlinear rate. Based on our limited numerical experience, m rarely exceeds 12 even for large n. In other words, the O(mn) rate for SSS essentially amounts to the linear rate O(n). Therefore, SSS can be more computationally efficient in least squares splitting. Additionally, SSS can be even more advantageous in other recursive partitioning scenarios where an updating formula is not readily available, especially when evaluating each individual split point involves iterative model estimation procedures.
To speed up GS, subsampling and splitting at percentiles are two common strategies. Clearly, subsampling can be combined with SSS as well; nevertheless, reduced samples could weaken performance and leave weaker signals undetectable. For percentile splitting, the number g of percentiles used is an issue. In a single-tree analysis, the investigator is keenly interested in the ‘best’ cutoff point for the sake of interpretability and relevance to the scientific or business question at hand. Using percentiles reduces the fineness of the identified cutpoint to the 1% level at best, which is undesirable, especially when dealing with big data. Moreover, computing all g percentiles of X is not computationally thrifty; in terms of complexity, this step alone is nontrivial, and g could be much larger than m, i.e., the number of iterative steps in SSS.
2.3. Simulation Studies on Single Splits
To evaluate SSS and investigate several relevant issues, we simulated data from the following model:
where the predictor
X is generated from a uniform [0, 1] distribution. Two sample sizes
were investigated, as recursive partitioning always involves both small sample and large sample problems. We considered two true cutoff points
corresponding to the balanced and unbalanced scenarios, and three values for
, corresponding to null, weak, and strong signals, respectively. These considerations form ten combinations in total, as the choice of
no longer matters in the null case of
For each model configuration, 200 simulation runs were used. For each simulated dataset, both GS and SSS were used to identify the best cutoff point
For SSS, a range of choices was considered for
With the identified cutpoints from all simulation runs, we obtained the empirical density curves of
, as presented in
Figure 3.
Figure 3a,b corresponds to the null model case. The results of GS show extraordinarily high peaks at both ends. This phenomenon can be explained by the ‘end-cut preference’ (ECP) problem (see Section 11.8 of CART; [
1]), which refers to the fact that the least squares criterion tends to favor unbalanced splits, i.e., splits in which one child node contains only a very small proportion of the observations. Comparatively, the smoothing effect of SSS substantially ameliorates the ECP problem. Although SSS is intended to approximate GS, the weighting introduced by the SSS smoothing process markedly mitigates the ECP problem by changing the summations, e.g.,
, into weighted sums
A smaller
a is associated with more distributed weights, and seems more beneficial for this purpose. It is also of note that
in SSS regardless of the choice of
a, as opposed to 0/1 indicator values in GS. These features make cutting at the extreme ends less likely for SSS. To remedy ECP for least squares splitting in regression trees, Torgo [
27] suggested setting a minimum number of observations allowable for either child node, which can be implemented via the
minbucket argument in the R function
rpart [
28]. A similar strategy can be made available in SSS as well. Because Brent’s method asks for a range over which the optimum is sought, it is convenient to define the search interval as
, where
and
denote the sample
-th and
-th quantiles of
respectively. In our implementation, we have included
as an option.
Figure 3c–f corresponds to the weak signal scenarios, where it can be seen that SSS outperforms GS by a great deal in finding the true cutpoint with all choices of
a. The failure of GS in Model B can be partly attributed to the ECP problem. Breiman et al. [
1] showed that ECP in LS splitting is a joint result of the Kolmogorov inequality and the law of the iterated logarithm in theory, as further confirmed in a recent article of [
29] for other splitting criteria. Ishwaran [
29] argued that although ECP is unfavorable for single tree analysis, it is a beneficial feature for random forests, as it maximizes the sample size for a tree to recover from a bad split based on a pure noise predictor. However, we note that the ECP effect can undesirably block out weaker signals, as illustrated here. In addition to the adverse effect of ECP, the splitting statistics computed with GS can be rather turbulent, with considerable variation, especially when the signal grows weak. An illustration from one simulation run is provided in
Figure 1. In this scenario, the overall weak signal is likely to be swamped by local spikes owing to the large variation. Smoothing can be very helpful here, as smoothing generally preserves global patterns better than local ones. Because SSS facilitates a parametric mode of smoothing, in which
a plays the role of a smoothing parameter, it is reasonable to expect that SSS helps to remove irregular local fluctuations in order to better isolate the weak signal with a more stable objective function, especially with a smaller
a. A relatively smaller
a is even preferable in recovering weak signals, as a smaller
a helps to smooth out more local erratic patterns.
Figure 3g–j illustrates the strong signal scenarios, in which both GS and SSS perform exceptionally well.
We also computed the mean squared error MSE =
as a performance measure for each non-null model configuration. The results are presented in
Figure 4.
Figure 4a–d depicts the weak signal scenarios and reinforces our earlier conclusion that SSS is more advantageous than GS in these contexts. The figure also reveals a pattern where MSE decreases as
a decreases for a certain range of
a values, say,
However, excessively small
a values result in poor approximations, leading to diminished performance. This indicates a tradeoff between approximation accuracy and smoothness in terms of the choice of
a.
Figure 4e–h represents the strong signal scenarios, where both GS and SSS perform very well. Notably, SSS outperforms GS in the small sample scenario with
across all choices of
a considered. In contrast, with larger samples
the results are mixed, and the difference between SSS and GS becomes negligible for reasonably large
a values (e.g.,
). Additionally, both GS and SSS demonstrate better performance with the balanced cutpoint
compared to the unbalanced cutpoint
, as illustrated by the MSE scales in
Figure 4.
To further investigate the sensitivity with respect to
a,
Figure 5 plots
versus
a in each simulation run for 100 simulation runs with
It can be seen that the lines are essentially flat, except for very small
a values. This indicates that the performance of SSS remains fairly stable with respect to
a, particularly for reasonably large values. Similar patterns are observed in the weaker signal case (
), although these results are not presented here. This suggests that while the scale parameter
a functions as a smoothing parameter and further research on its fine-tuning may be valuable, setting
a as a constant seems to be a sensible approach.
Additionally, the averaged best cutoff point
at each
a overlaps well with the true cutoff point
, except in Panel (b), where the estimation is biased downwards. GS exhibits a similar downward bias; see panel (i) in
Figure 3.
Figure 5b highlights the increased difficulty in estimating an unbalanced cutoff point, especially with a small sample size. On the other hand, the notable difference between the mean
values and
in Panel (b) can be partially attributed to the nonrobustness of means. If we consider the median
values (represented by the blue line), they align much more closely with the true value
. As the sample size increases, the bias becomes negligible, as shown in panel (d).
Another alternative for specifying
a is to make
a adaptive to the sample size
To investigate this, we reran the experiments, varying the sample size n in {50, 100, 150, …, 1000}. Two strategies for setting
a values were tried: constant
a in
, and adaptive choices of
a in
A total of 200 simulation runs were conducted for each scenario and the MSE values were recorded. Because the scale of the MSE drops dramatically as
n increases, we computed the relative difference in MSE for the selected cutpoints from SSS as opposed to GS; thus, a negative value indicates improvement upon GS.
Figure 6 plots the results versus
n for each choice of
a. It can be seen that SSS consistently outperforms GS with weaker signals
and that a smaller
a seems preferable. In the stronger signal
scenario, the constant choices
work very well, while the adaptive choice
outperforms the other two.
In summary, a larger
a yields a better approximation of the threshold effect, while a smaller
a offers increased smoothness. With a larger
a, the performance of SSS is relatively more stable, as shown in
Figure 3,
Figure 4 and
Figure 5. Conversely, a smaller
a can be beneficial for detecting weak signals and addressing the end-cut preference (ECP) problem. Our empirical results demonstrate that the performance of SSS remains stable as
a varies, and that SSS outperforms GS in most scenarios with a reasonable choice of
a. While some optimization of
a could be explored, we believe that a reasonable choice of
a will suffice for practical applications. Based on our findings, we tentatively recommend fixing
a in the range of [20, 60], while the adaptive choice of
also seems advisable.
To investigate computational efficiency, we generated data from Model A in (
7) with
and
Because
, the number of distinct values that it has increases with
We considered different constant choices of
and an adaptive choice
For greedy search, we used the R function
rpart [
28] by setting its options
rpart.control such that it produces one single split only and all other features, such as competitive and surrogate splits, are suppressed. The implementation of SSS was accomplished solely in R with the
optimize function.
Figure 7a plots the CPU time (in seconds) that SSS and GS spent on ten simulation runs. It can be seen that SSS consistently outperforms GS.
Figure 7b plots the averaged number
m of iterative steps in Brent’s method for SSS. It can be seen that
m ranges from 7 to 10 in this example. In addition, a smaller
a, corresponding to a smoother objective function, leads to a smaller number of iterative steps, as expected. In practice, similar experiments can be performed on specific datasets to determine the approximate threshold at which SSS becomes more advantageous than GS.
2.4. Variable Selection Bias
The variable selection bias problem is deemed inherent in recursive partitioning. By optimally selecting a goodness-of-split measure across all permissible splits from all variables, greedy search gives preference to variables that have more distinct levels or values. As a remedy, Loh [
9] (GUIDE) proposed first selecting the most important splitting variable and then finding its best cutoff point via greedy search. In addition to its negligible variable selection bias, the method is advantageous in reducing the computational cost. A similar idea was followed in [
30] (MOB). However, determining the most important splitting variable is an equally if not more difficult problem; a predictor that is important in its threshold effect may not be so in other senses. In GUIDE, residuals from linear models are first grouped; each predictor, either already categorical or otherwise made categorical, is cross-tabulated against the residual-based grouping to form a two-way contingency table, and the most important variable is then selected based on Pearson chi-square tests. In MOB, the selection is based on a stability assessment of the estimated coefficients in a linear model. One potential problem with both methods is that the variable selected via linear models may not be the one with the most important threshold effect for partitioning purposes. Furthermore, these methods lose power in detecting true threshold signals on continuous or nominal predictors, leading to a different type of bias in variable selection.
In other approaches, the best cutoff point is associated with a maximally selected random variable. Akin to change-point problems (see, e.g., [
31,
32]), asymptotic inference on maximally selected statistics resorts to standard results in stochastic processes (see, e.g., [
10]), while their exact distributions can be determined with permutation-based methods (see, e.g., [
12], 2008). Shih and Tsai [
11] evaluated these approaches in partitioning data and demonstrated their potential in reducing variable selection bias via simulations. In these approaches, the
p-value associated with the best cutpoint for each predictor is first computed and possibly transformed. These
p-values are then compared across predictors to determine the best split of the data. Different methods might be used to obtain the
p-value for the best cutpoint of each predictor depending on the type of the predictor. Nevertheless, the associated distributions for maximally selected statistics may not be easily available, and the computation could be intensive when either the sample size
n or the number of distinct levels
K becomes large. For example, permutation-based methods are clearly not a good choice for this purpose, as the exact p-value is often computationally infeasible, while the approximate p-value lacks the precision needed for comparison with p-values obtained from, for example, the Brownian bridge. Moreover, most of the above-mentioned work on the distribution of maximally selected statistics is applicable or feasible only for continuous or at least ordinal predictors, and not for nominal variables in the strict sense.
The SSS method can help to get around the variable selection bias problem by providing a simple alternative way of evaluating splits. The significance assessment of the best cutoff point on a continuous predictor can be made via its connection to a nonlinear parametric model (
3). While other testing procedures are available, it is convenient to use the likelihood ratio test (LRT) of H0: β1 = 0 in Model (
3). Computing the LRT entails estimation of Model (
3). This is a separable nonlinear least squares problem that can be efficiently solved by the variable projection method of Golub and Pereyra [
33]. The main idea is that
can be readily estimated given
We define matrix
The nonlinear least squares criterion is
with
For a given
c, the estimate of
can be obtained via ordinary LS:
Plugging it into the LS criterion leads to an objective function for
c only:
where
denotes the column subspace of matrix
,
is the subspace perpendicular to
, matrix
denotes the orthogonal projector to
, and
denotes the Euclidean norm unless otherwise specified. Estimates of
c can be obtained by minimizing (
8), which is equivalent to maximizing
in (
5). A popular implementation is to combine the Gauss–Newton method with variable projection, as considered by Kaufman [
34] and summarized in Osborne [
35]. The algorithm is iterative with an approximately quadratic convergence rate. The corresponding pseudocode for estimating Model (
3) is provided in
Appendix A.
In our approach, the best cutoff
is efficiently obtained by optimizing (
8) or the approximated splitting statistic directly. Given
Model (
3) becomes linear and
can be obtained via OLS. The LRT statistic (see, e.g., Section 12.4 of [
21]) is given by
where
denotes the
matrix evaluated at
This test compares Model (
3) with the null model
. Because
a is fixed, the resulting LRT statistic follows a
distribution under
Alternatively, an
F test can be used (see Section 5 of Seber and Wild [
21]). This suggests a possible connection between the
distribution and the asymptotic distribution derived from the Brownian bridge [
10], which could be further investigated in future research. We define the associated logworth
as the minus base-10 logarithm of the resulting
p-value when referring to a
distribution. The logworth provides a coalescent measure of the strength for the best cutoff point of each predictor regardless of the variable type or the methods used to evaluate its best cutpoint. The best split of the data is then the one with the maximum logworth among all predictors.
We next discuss how to handle other types of predictors in combination with SSS. For a 0–1 binary variable, the LRT for
in Model (
1) with
can be used to evaluate the single split. The corresponding
p-value, together with the associated logworth, can be computed by referring to the
distribution. For an ordinal variable with
K distinct values, we proceed with SSS by treating it as if it were continuous. However, when
is small, it is possible to resort to the maximally selected statistics approaches. Several approaches are available for computing the associated
p-value. A method that combines the change-point approach of Hawkins [
36] and Genz [
37], as implemented in the R package
maxstat [
38], is particularly appealing for the small
K scenario in terms of computational feasibility and efficiency.
For a nominal variable
X with
K distinct levels, we first ‘ordinalize’ its levels by sorting its group means and then proceed by treating it as if it were continuous. This standard strategy in GS greatly reduces the total number of splits to be evaluated, and is theoretically justified (see CART, Section 9.4, [
1]); however, both its steps introduce overoptimism or bias in terms of the association between
Y and
X. The ordinalization step may make an irrelevant
X artificially correlated with
Y, while the maximal selection of splitting criteria over the ordinalized levels of
X induces additional overoptimism. Clearly, SSS can be used in the latter step to find the best cutpoint and help avoid bias owing to maximal selection. However, the overoptimism stemming from the ordinalization step remains an issue. In general, how to split nominal variables in regression trees without bias has not been satisfactorily addressed in the extant literature, and further research is warranted.
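The ordinalization step can be sketched as follows (Python; ordinalize is a hypothetical helper name): levels are ranked by their group-mean responses, after which the recoded variable may be split as if it were ordinal:

```python
from collections import defaultdict

def ordinalize(levels, y):
    """Recode nominal levels as the ranks of their group-mean responses."""
    groups = defaultdict(list)
    for lv, yi in zip(levels, y):
        groups[lv].append(yi)
    means = {lv: sum(v) / len(v) for lv, v in groups.items()}
    rank = {lv: r for r, lv in enumerate(sorted(means, key=means.get))}
    return [rank[lv] for lv in levels]

levels = ["a", "b", "c", "a", "b", "c"]
y = [2.0, 0.0, 1.0, 2.2, 0.2, 0.8]
print(ordinalize(levels, y))  # b -> 0, c -> 1, a -> 2: [2, 0, 1, 2, 0, 1]
```

As noted above, this recoding alone can make an irrelevant nominal predictor look correlated with the response, so significance assessments downstream should account for that optimism.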
To demonstrate this, we simulate data from a null model where the response
Y is unrelated to any of the nine predictors. For the reasons discussed above, we have restricted ourselves to ordinal or continuous predictors only. The numbers of distinct values for
were set as {2, 3, 4, 5, 10, 20, 50, 100, 500}, respectively. Two sample sizes
and
were considered.
Figure 8 plots the frequency of each variable being selected for splitting data by SSS (combined with maximally selected statistics) and GS. For SSS, we tried different choices of
a in {10, 30, and 50}. It can be seen that greedy search suffers severely from selection bias, while SSS substantially alleviates the problem. Different choices of
a for SSS yield very similar results, with a slight preference for smaller
a values.
For further exploration, we consider another trial where data were generated from
where
is binary 0–1 valued and we include a second variable
In this model,
Y is only related to the binary predictor
; nevertheless, the unrelated predictor
has distinct values of
We then allowed the coefficient
to vary from
We recorded the percentage of correct selection of the true split
made by GS and SSS. For SSS, we also tried different choices of
{10, 30, 50, 100}.
Figure 9 plots the results against
for two sample sizes,
and
SSS clearly outperforms GS in picking up the correct split. For weak signals, it is more likely for GS to have spurious splits introduced into the tree structure. As expected, both perform better as the signal grows stronger or the sample size grows larger. For SSS, a smaller
a tends to yield a higher percentage of correct selection, albeit with seemingly minor differences.
In the above simulation, we included only one binary predictor in order to better isolate the specific challenges of variable selection bias. As one reviewer noted, the effects of multiple predictors on the response can complicate the delineation of a variable’s impact. While future research with more extensive simulations is necessary to address this issue, we would like to emphasize that the approach used in recursive partitioning is somewhat analogous to coordinate descent in optimization, where each variable is handled separately. This allows us to focus on the individual impact of each predictor at each split.