Article

Weighted Least Squares Regression with the Best Robustness and High Computability

1 Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
2 Department of Computer Science, Michigan State University, East Lansing, MI 48824, USA
* Author to whom correspondence should be addressed.
Axioms 2024, 13(5), 295; https://doi.org/10.3390/axioms13050295
Submission received: 21 March 2024 / Revised: 15 April 2024 / Accepted: 21 April 2024 / Published: 27 April 2024
(This article belongs to the Special Issue New Perspectives in Mathematical Statistics)

Abstract
A novel regression method is introduced and studied. The procedure weights squared residuals according to their magnitude. Unlike the classic least squares, which treats every squared residual as equally important, the new procedure exponentially down-weights squared residuals that lie far away from the cloud of all residuals and assigns a constant weight (one) to squared residuals that lie close to the center of the squared-residual cloud. The new procedure keeps a good balance between robustness and efficiency; it possesses the highest breakdown point attainable by any regression equivariant procedure, being much more robust than the classic least squares, yet much more efficient than the benchmark robust method, the least trimmed squares (LTS) of Rousseeuw. With a smooth weight function, the new procedure can be computed very quickly by first-order (first-derivative) and second-order (second-derivative) methods. The assertions and other theoretical findings are verified in simulated and real data examples.

1. Introduction

In classical regression analysis, we assume that there is a relationship, for a given data set {(x_i, y_i), i ∈ {1, 2, …, n}}, of the form
y_i = (1, x_i^⊤) β_0 + e_i,  i ∈ {1, …, n},   (1)
where y_i ∈ R^1, ⊤ stands for the transpose, β_0 = (β_{01}, …, β_{0p})^⊤ ∈ R^p is the true unknown parameter, x_i = (x_{i1}, …, x_{i(p−1)})^⊤ ∈ R^{p−1} (p ≥ 2), and e_i ∈ R^1 is called the error term (or random fluctuation/disturbance, usually assumed to have zero mean and variance σ^2 in classic regression theory). In particular, β_{01} is the intercept term of the model. Writing w_i = (1, x_i^⊤)^⊤, one has y_i = w_i^⊤ β_0 + e_i, which is used interchangeably with (1).
One wants to estimate β_0 based on a given sample z^{(n)} := {(x_i, y_i), i ∈ {1, …, n}} from the model y = (1, x^⊤) β_0 + e. Call the difference between y_i and w_i^⊤ β the i-th residual, r_i(β), for a candidate coefficient vector β (which is often suppressed). That is,
r_i := r_i(β) = y_i − w_i^⊤ β.
To estimate β_0, the classic least squares (LS) estimator minimizes the sum of squared residuals,
β̂_ls = arg min_{β ∈ R^p} Σ_{i=1}^{n} r_i^2.
Alternatively, one can replace the square above with the absolute value to obtain the least absolute deviations estimator (i.e., the L_1 estimator, in contrast to the L_2 (LS) estimator).
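A tiny base-R illustration (not from the authors' code; the simulated data and names are ours) of the LS estimator above: the arg min has the closed form (X^⊤X)^{−1}X^⊤y, which lm() computes internally via a QR decomposition.

set.seed(1)
n <- 20; x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)                                    # rows are w_i = (1, x_i)
beta_ls <- solve(crossprod(X), crossprod(X, y))     # normal equations (X'X)^{-1} X'y
cbind(normal_eq = as.vector(beta_ls), lm = coef(lm(y ~ x)))   # the two columns agree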
The LS estimator is very popular in practice across a broad spectrum of disciplines due to its great computability and its optimality when the errors e_i are i.i.d. and follow a normal N(0, σ^2) distribution. It can, however, behave badly when the error distribution departs only slightly from the normal distribution, particularly when the errors are heavy-tailed or contain outliers.
Robust alternatives to the β ^ l s have abounded in the literature for a long time. The most popular ones are M-estimators [1], the least median squares (LMS) and least trimmed squares (LTS) estimators [2], S-estimators [3], MM-estimators [4], τ -estimators [5], maximum depth estimators ([6,7]), and the recent least squares of trimmed residuals (LST) regression [8], among others. For more related discussions, see Sections 1.2 and 4.4 of [9], and Section 5.14 of [10].
Robust methods with a high breakdown point are usually computationally intensive and have non-differentiable objective functions (e.g., LMS, LTS, and LST). In this article, we introduce a smooth, differentiable objective function that greatly facilitates the computation of the underlying estimator. We introduce a new class of alternatives for robust regression, the weighted least squares (WLS) estimators β̂_wls:
β̂_wls = arg min_{β ∈ R^p} Σ_{i=1}^{n} w_i r_i^2(β),   (3)
where w_i is the weight associated with r_i, with one fundamental feature: it assigns equal weight (one) to all r_i^2 that are small (no greater than a cut-off value) and exponentially down-weights (penalizes) the large ones (those r_i^2 greater than the cut-off value).
Weighted least squares estimation has been proposed and discussed in the literature, including the famous Huber M-estimators, which, however, can have the lowest breakdown point if the derivative of the weight (or loss) function is non-decreasing; see [9] (p. 13) or [10,11]. For more discussion, see Section 1.2 of [9] or Section 5.11 of [10]. Previous weight functions in the literature are either constant (e.g., LS with weight 1, or LMS and LTS with weights 0 and 1), rank-based, insufficiently aggressive in down-weighting large residuals, or non-differentiable. Among these weight-induced regression estimators, few possess a high breakdown point (50%), high efficiency, and high computability simultaneously.
On the other hand, there is much room for smooth weight functions. Successful examples in the location setting have already appeared in the literature, e.g., [12]. This motivates us to extend those smooth weight functions to the regression setting and to achieve a high breakdown point, high efficiency, and high computability simultaneously. We propose using a differentiable w(r) that assigns weight one to the r_i that lie close to the center of the cloud of all r_i. The other points, which lie on the outskirts of the cloud, could be viewed as outliers, so a lower positive weight (not zero) should be given to them. This balances efficiency with robustness. The weighted procedure proposed in this article has never appeared before; specially chosen w_i in (3) recover the famous LMS and LTS of [2] and the LST of [8]. More discussion of w and β̂_wls is given in Section 2, where an ad hoc choice of the weight function with the above property in mind is introduced.
The rest of this article is organized as follows. Section 2 introduces a class of differentiable weight functions and a class of weighted least squares estimators. Section 3 establishes the existence of β̂_wls and studies its properties, including its finite sample breakdown robustness. Section 4 discusses the computation of β̂_wls. Section 5 presents some concrete examples, comparing the performance of β̂_wls with other leading estimators. Section 6 ends the article with some concluding remarks. Long proofs of the main results are deferred to Appendix A.

2. A Class of Weighted Least Squares

2.1. A Class of Weight Functions

An ad hoc choice of the weight function with the property mentioned in Section 1 takes the form
w(x) = 1(|x| ≤ c) + 1(|x| > c) (e^{−k(1 − c/|x|)^2} − e^{−k}) / (1 − e^{−k}),  c, k > 0,   (4)
where the tuning parameter k > 1 is a positive number (say, between 1 and 10) controlling the steepness of the exponential decrease (see the left panel of Figure 1): the larger the k, the steeper the curve (the key difference from trimmed procedures, where the weight becomes exactly zero). The tuning parameter c is the point at which the weight function changes from the constant one to an exponentially decreasing function. c (> 1) usually can be set to a large positive number (say 10), or it can be residual dependent, say the 50% or 75% percentile of the residuals; a larger c is favorable for higher efficiency. c is assumed to be finite to exclude the LS case (i.e., w(x) is not constantly one).
One example of w(x) is given in Figure 1, where w(x) and its derivative are plotted for k = 5 and c = 100. For a general w(x), it is straightforward to verify the following:
P1
w(x) is twice differentiable and 0 < w(x) ≤ 1. When x → ∞, w(x) is asymptotically equivalent to α(e^{γ x^{−1}} − 1) for some positive constants α and γ.
P2
If r_i → ∞, then w(r_i^2/c^*) r_i^2 → 2ckc^*/(e^k − 1), where c^* := Med_i{y_i^2}, the median of {y_1^2, y_2^2, …, y_n^2} (a numerical check of P2 is sketched right after this list).
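The following minimal R sketch (not the authors' released code; the function name wfun and the toy constants are ours) implements the weight function in (4) and checks P1/P2 numerically: w equals one on [−c, c], stays in (0, 1], and w(r^2/c^*) r^2 approaches 2ckc^*/(e^k − 1) for a huge residual.

wfun <- function(x, c0, k) {                        # the weight function (4); c0 plays the role of c
  ifelse(abs(x) <= c0, 1,
         (exp(-k * (1 - c0/abs(x))^2) - exp(-k)) / (1 - exp(-k)))
}
c0 <- 10; k <- 5; cstar <- 4                        # cstar stands in for c* = Med{y_i^2}
wfun(c(0.5, 10, 50, 1000), c0, k)                   # 1, 1, then strictly between 0 and 1
r <- 1e4                                            # a huge residual
c(wfun(r^2 / cstar, c0, k) * r^2,                   # weighted squared residual
  2 * c0 * k * cstar / (exp(k) - 1))                # the P2 limit; the two nearly agree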

2.2. Weighted Least Squares Estimators

With the weight function given above, we are ready to specify the weighted least squares estimator in (3) in more detail:
β̂_wls = arg min_{β ∈ R^p} Σ_{i=1}^{n} w_i r_i^2(β),   (5)
where the weight w_i := w(r_i^2/c^*), with w(x) being a weight function of the form (4) that satisfies P2, and c^* defined in P2.
The behavior of the function w(r^2/c^*) r^2 when r^2/c^* > c, for different values of c^*, is illustrated in Figure 2 below. Inspecting the figure reveals that it is strictly decreasing and convex over this range.

3. Properties of β̂_wls

3.1. Existence

Does the minimizer of the objective function O(β, z^{(n)}) := Σ_{i=1}^{n} w_i r_i^2(β) on the right-hand side (RHS) of (5) exist? We now address this formally. We need the following assumption.
A1: For a given sample z^{(n)} := {z_i}_{i=1}^{n} = {(x_i, y_i), i ∈ {1, 2, …, n}} and any β ∈ R^p, the points (x_i, y_i) whose residuals satisfy r_i^2/c^* ≤ c do not all lie in a single vertical hyperplane.
The assumption holds true with probability one if the sample comes from a distribution of ( x , y ) that has a density. Now, we have the following existence result.
Theorem 1. 
If A1 holds true, then the minimizer β̂_wls of O(β, z^{(n)}) always exists.
Proof. 
See Appendix A. □

3.2. Equivariance

Desirable fundamental properties of regression estimators include regression, scale, and affine equivariance. For x ∈ R^{n×(p−1)} and y ∈ R^n, a regression estimator β̂ := t(w, y), with w = (1, x) the n × p design matrix, is called regression, scale, and affine equivariant, respectively, if
t(w, y + w b) = t(w, y) + b for any b ∈ R^p;
t(w, s y) = s t(w, y) for any s ∈ R;
t(w A, y) = A^{−1} t(w, y) for any nonsingular A ∈ R^{p×p}
(see page 116 of [9]). All of the aforementioned regression estimators are regression, scale, and affine equivariant.
Theorem 2. 
β̂_wls defined in (3) is regression, scale, and affine equivariant.
Proof. 
Notice the identities r_i = y_i − w_i^⊤ β = (y_i + w_i^⊤ b) − w_i^⊤(β + b), s r_i = s y_i − w_i^⊤(s β), and r_i = y_i − (A^⊤ w_i)^⊤(A^{−1} β). Meanwhile, r_i^2/c^* is regression, scale, and affine invariant. The desired result follows. □
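A quick numerical illustration (base R; the simulated data are ours) of the regression-equivariance identity t(w, y + wb) = t(w, y) + b, using the LS estimator via lm(); by Theorem 2, β̂_wls satisfies the same identity.

set.seed(42)
n <- 30; x <- rnorm(n); y <- 2 + 3 * x + rnorm(n)
w <- cbind(1, x)                        # design matrix with rows w_i = (1, x_i)
b <- c(5, -1)                           # an arbitrary shift vector in R^p
fit1 <- coef(lm(y ~ x))
y_shift <- as.vector(y + w %*% b)       # transformed responses y + w b
fit2 <- coef(lm(y_shift ~ x))
all.equal(as.numeric(fit2), as.numeric(fit1 + b))   # TRUE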

3.3. Robustness

As an alternative to the least squares β̂_ls, is β̂_wls more robust?
The most prevailing quantitative measure of the global robustness of any location or regression estimators in the finite sample practice is the finite sample breakdown point (FSBP), introduced in [13].
Roughly speaking, the FSBP is the minimum fraction of 'bad' (or contaminated) data points that can affect the estimator to an arbitrarily large extent. For example, in the context of estimating the center of a data set, the sample mean has a breakdown point of 1/n (or 0%) because even one bad observation can change the mean by an arbitrary amount; in contrast, the sample median has a breakdown point of ⌊(n + 1)/2⌋/n (or roughly 50%), where ⌊·⌋ is the floor function.
Definition 1 
([13]). The finite sample replacement breakdown point (RBP) of a regression estimator t at the given sample z ( n ) = { z 1 , , z n } , where z i : = ( x i , y i ) , is defined as
RBP(t, z^{(n)}) = min_{1 ≤ m ≤ n} { m/n : sup_{z_m^{(n)}} ||t(z_m^{(n)}) − t(z^{(n)})|| = ∞ },
where z_m^{(n)} denotes an arbitrary contaminated sample obtained by replacing m original sample points of z^{(n)} with arbitrary points in R^p. Namely, the RBP of an estimator is the minimum replacement fraction that can drive the estimator beyond any bound. It turns out that both the L_1 (least absolute deviations) and L_2 (least squares) estimators have RBP 1/n (or 0%), the lowest possible value, whereas β̂_wls can have RBP (⌊(n − p)/2⌋ + 1)/n (or roughly 50%), the highest possible value for any regression equivariant estimator (see p. 125 of [9]).
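A tiny arithmetic check (base R) of the two RBP values quoted above, for a sample of size n = 50 in dimension p = 5:

n <- 50; p <- 5
rbp_wls <- (floor((n - p) / 2) + 1) / n   # attainable RBP of beta_wls: 23/50 = 0.46
rbp_ls  <- 1 / n                          # RBP of the LS (and L1) estimator: 0.02
c(rbp_wls, rbp_ls)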
We shall say z^{(n)} is in general position when any p observations in z^{(n)} give a unique determination of β. In other words, any (p − 1)-dimensional subspace of the (x, y) space contains at most p observations of z^{(n)}. When the observations come from continuous distributions, the event that z^{(n)} is in general position happens with probability one.
Theorem 3. 
Assume that A1 holds true, n > p, and z^{(n)} is in general position. Then,
RBP(β̂_wls^n, z^{(n)}) = ⌊(n + 1)/2⌋/n if p = 1, and (⌊(n − p)/2⌋ + 1)/n if p > 1.
Proof. 
See Appendix A. □
We need the following important result for the Proof of Theorem 3.
Lemma 1. 
For any r_i^2 > r_j^2 > c c^*, w(r_i^2/c^*) r_i^2 < w(r_j^2/c^*) r_j^2 when r_j^2 is sufficiently large.
Proof. 
See Appendix A. □
Remark 1. 
The RBP result in Theorem 3 is the highest possible breakdown point for any regression equivariant estimator in the literature (see p. 125 of [9]). Very few regression estimators possess this highest breakdown point robustness.

4. Computation of the WLS

Now, we address the most important issue for a high breakdown point estimator: its computation. The objective function in (5) is
O(β) := O(β, z^{(n)}) = Σ_{i=1}^{n} w(r_i^2/c^*) r_i^2,
which is differentiable with respect to β since the weight function w(·) is twice differentiable, with
w′(x) = −α^* e^{−k(1 − c/|x|)^2} (1 − c/|x|) sgn(x)/x^2 · 1(|x| > c),
w″(x) = α^* e^{−k(1 − c/|x|)^2} [2kc(1 − c/|x|)^2/|x| + 2 − 3c/|x|] / x^3 · 1(|x| > c),
where α^* = 2kc/(1 − e^{−k}). The problem in (3) is an unconstrained minimization. This type of problem has been thoroughly discussed and studied in the literature. Common approaches to finding the solution include (i) methods using first-order derivatives (gradient descent/steepest descent/conjugate gradient), (ii) methods using second-order derivatives (the Hessian matrix; Newton's method), and (iii) quasi-Newton methods; see [14,15]. We select the conjugate gradient method for reasons of speed/efficiency and accuracy.
Note that
∇O(β) = ∂O(β)/∂β = Σ_{i=1}^{n} (w′(r_i^2/c^*) r_i^2 + c^* w(r_i^2/c^*)) ∂(r_i^2/c^*)/∂β = Σ_{i=1}^{n} (w′(r_i^2/c^*) r_i^2 + c^* w(r_i^2/c^*)) (2 r_i/c^*)(−w_i) = −Σ_{i=1}^{n} (2 r_i/c^*) (w′(r_i^2/c^*) r_i^2 + c^* w(r_i^2/c^*)) w_i.
∇^2 O(β) = ∂^2 O(β)/∂β ∂β^⊤ = −(2/c^*) Σ_{i=1}^{n} [∂( r_i (w′(r_i^2/c^*) r_i^2 + c^* w(r_i^2/c^*)) )/∂β] w_i^⊤ = (2/c^*) Σ_{i=1}^{n} w_i w_i^⊤ [ 5 r_i^2 w′(r_i^2/c^*) + c^* w(r_i^2/c^*) + (2 r_i^4/c^*) w″(r_i^2/c^*) ] = X_n^⊤ W X_n,
where X_n = (w_1, …, w_n)^⊤, W is the diagonal matrix with i-th diagonal entry 2γ_i/c^*, and
γ_i = 5 r_i^2 w′(r_i^2/c^*) + c^* w(r_i^2/c^*) + (2 r_i^4/c^*) w″(r_i^2/c^*).
Writing γ_i/c^* as g(t_i), we have g(t_i) = 5 t_i w′(t_i) + 2 t_i^2 w″(t_i) + w(t_i), where t_i = r_i^2/c^* > c, and g(t) < 0 for t > c for different values of c > 0, as indicated below in Figure 3. Namely, W need not be positive definite when some t_i > c.
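The following R sketch (not the authors' implementation; the names wfun, objO, gradO and the defaults c = k = 6 taken from Section 5 are our choices) codes the weight function (4), its two derivatives, the objective O(β), and the gradient derived above. X denotes the n × p matrix whose rows are w_i = (1, x_i^⊤).

wfun  <- function(x, c0, k) ifelse(abs(x) <= c0, 1,
           (exp(-k * (1 - c0/abs(x))^2) - exp(-k)) / (1 - exp(-k)))
wfun1 <- function(x, c0, k) {                       # w'(x)
  a <- 2 * k * c0 / (1 - exp(-k))
  ifelse(abs(x) <= c0, 0,
         -a * exp(-k * (1 - c0/abs(x))^2) * (1 - c0/abs(x)) * sign(x) / x^2)
}
wfun2 <- function(x, c0, k) {                       # w''(x)
  a <- 2 * k * c0 / (1 - exp(-k))
  ifelse(abs(x) <= c0, 0,
         a * exp(-k * (1 - c0/abs(x))^2) *
           (2 * k * c0 * (1 - c0/abs(x))^2 / abs(x) + 2 - 3 * c0/abs(x)) / x^3)
}
objO <- function(beta, X, y, c0 = 6, k = 6) {       # O(beta) = sum_i w(r_i^2/c*) r_i^2
  r <- as.vector(y - X %*% beta); cstar <- median(y^2)
  sum(wfun(r^2 / cstar, c0, k) * r^2)
}
gradO <- function(beta, X, y, c0 = 6, k = 6) {      # the gradient derived above
  r <- as.vector(y - X %*% beta); cstar <- median(y^2)
  ti <- r^2 / cstar
  s  <- (2 * r / cstar) * (wfun1(ti, c0, k) * r^2 + cstar * wfun(ti, c0, k))
  -as.vector(crossprod(X, s))                       # = -sum_i (2 r_i/c*)(...) w_i
}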
The algorithm for the conjugate gradient method (CGM) is as follows:
(i)
Step 1. Pick a β_0 (which can be the LS estimate, but for robustness, the LTS ([2]) or LST ([8]) estimate is a better choice). Set v_0 = −∇O(β_0). Set a tolerance ε. If ||v_0|| < ε, return β_0.
(ii)
Step 2. For k = 0, 1, …, n − 1:
(a)
Set β_{k+1} = β_k + α_k v_k, where α_k is the minimizer of O(β_k + α v_k) with respect to α (using a backtracking line search; see page 464 of [14]), or set
α_k = −∇O(β_k)^⊤ v_k / (v_k^⊤ H(β_k) v_k),
where H(β_k) = ∇^2 O(β_k).
(b)
Compute ∇O(β_{k+1}); if ||∇O(β_{k+1})|| < ε, return β_{k+1}.
(c)
If k = n − 1, break; else set v_{k+1} = −∇O(β_{k+1}) + α_k v_k, where
α_k = ∇O(β_{k+1})^⊤ ∇O(β_{k+1}) / (∇O(β_k)^⊤ ∇O(β_k)).
End the for loop.
(iii)
Step 3. Replace β_0 by β_n and go to Step 1. (A small self-contained sketch of these steps is given right below.)
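Below is a self-contained R sketch of Steps 1–3 (the names cgm, f, gr are ours, not from the authors' code). It uses a backtracking (Armijo) line search and the Fletcher–Reeves update and, as is common, restarts every length(beta0) iterations rather than every n; it is demonstrated on a simple quadratic so its behavior is easy to verify. Plugging in objO and gradO from the sketch above (and a robust initial β_0) would give a naive version of the WLS computation.

cgm <- function(beta0, f, grad, eps = 1e-4, max_restart = 50, cycle = length(beta0)) {
  beta <- beta0
  for (restart in 1:max_restart) {                  # Step 3: restart the cycle
    g <- grad(beta)
    if (sqrt(sum(g^2)) < eps) return(beta)          # Step 1 stopping rule
    v <- -g
    for (k in 1:cycle) {                            # Step 2
      alpha <- 1                                    # (a) backtracking line search
      while (f(beta + alpha * v) > f(beta) + 1e-4 * alpha * sum(g * v)) {
        alpha <- alpha / 2
        if (alpha < 1e-12) break
      }
      beta_new <- beta + alpha * v
      g_new <- grad(beta_new)
      if (sqrt(sum(g_new^2)) < eps) return(beta_new)   # (b)
      lam <- sum(g_new^2) / sum(g^2)                   # (c) Fletcher-Reeves coefficient
      v <- -g_new + lam * v
      beta <- beta_new; g <- g_new
    }
  }
  beta
}
# demonstration on a strictly convex quadratic f(b) = 0.5 b'Ab - b'd
A <- matrix(c(4, 1, 1, 3), 2, 2); d <- c(1, 2)
f  <- function(b) 0.5 * sum(b * (A %*% b)) - sum(b * d)
gr <- function(b) as.vector(A %*% b) - d
cgm(c(0, 0), f, gr)                                  # close to solve(A, d)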
Convergence of the gradient algorithm or gradient descent method to the global minimum has been thoroughly analyzed on pp. 466–467 of Boyd and Vandenberghe (2004) [14]. The global convergence of conjugate gradient methods specifically has been addressed in Gilbert and Nocedal (1992) [16].

5. Examples and Comparison

Now, we investigate the performance of our new procedure, WLS, and compare it with some leading competitors, including the robust benchmark, the least trimmed squares (LTS) estimator of Rousseeuw [2] (known for its high robustness and fast computation), the MM estimator of Yohai [4] (known for its high robustness and high efficiency), and the classical least squares (LS) estimator (known for its high efficiency under i.i.d. normal errors), via some concrete examples.

5.1. Performance Criteria

• Empirical mean squared error (EMSE). For a general regression estimator t, we calculate EMSE := Σ_{i=1}^{R} ||t_i − β_0||^2 / R, the empirical mean squared error of t. If t is regression equivariant, then we can assume (w.l.o.g.) that the true parameter is β_0 = 0 ∈ R^p (see p. 124 of [9]). Here, t_i is the realization of t obtained from the i-th sample of size n and dimension p, and the replication number R is usually set to 1000.
• Total time consumed for all replications in the simulation (TT). This criterion measures the speed of a procedure: the faster and more accurate, the better. One possible issue is the fairness of comparing different procedures, because different programming languages (e.g., C, Rcpp, Fortran, and R) are employed by different procedures.
• Finite sample relative efficiency (FSRE). In the following, we investigate via simulation studies the finite-sample relative efficiency of the different robust alternatives to the LS with respect to the benchmark, the classical least squares line/hyperplane; the latter is optimal under normal models by the Gauss–Markov theorem. We generate R = 1000 samples from the linear regression model y_i = β_0 + β_1 x_{i1} + ⋯ + β_{p−1} x_{i(p−1)} + e_i, i ∈ {1, …, n}, with various sample sizes n and dimensions p, where e_i ∼ N(0, σ^2). The finite sample RE of a procedure is the EMSE of the LS divided by the EMSE of the procedure, expressed as a percentage. (A small sketch of the EMSE and RE computations is given right after this list.)
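A small sketch (base R; not the authors' code) of the EMSE and RE computations for the LS estimator, with β_0 = 0 as discussed above; for any other procedure, its R × p matrix of estimates would replace ls_est.

set.seed(1)
R <- 100; n <- 50; p <- 3; beta0 <- rep(0, p)            # true beta taken as 0
emse <- function(est, beta0) mean(rowSums(sweep(est, 2, beta0)^2))
ls_est <- t(replicate(R, {
  x <- matrix(rnorm(n * (p - 1)), n); y <- rnorm(n)      # model with beta0 = 0
  coef(lm(y ~ x))
}))
emse_ls <- emse(ls_est, beta0)
# RE of a procedure: emse_ls / emse(its estimates, beta0)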
All R code (downloadable via https://github.com/left-github-4-codes/WLS, accessed on 19 March 2024) for the simulations, examples, and figures in this article was run on a desktop with an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz.

5.2. Examples

In the sequel, the cutoff (tolerance) value ε is set to 10^{−4} for the WLS procedure. For simplicity, we set the tuning parameters c = k = 6 for the weight function.
Example 1 
(Simple linear regression). To take advantage of the graphical illustration of data sets and plots, we start with p = 2, the simple linear regression case.
We generated a data set of seven artificial, highly correlated (correlation 0.88 between x and y) bivariate normal points. It is plotted in the left panel of Figure 4. Two reference regression lines (y = 0 and y = x) are also provided.
Inspecting the left panel of the figure immediately reveals that points 5 and 6 appear to be outliers and that the overall pattern of the data set is a line y = cx with c > 0. The right panel further reveals that the LS, LTS, and MM lines are very sensitive to the outlying points, whereas WLS still captures the overall linear pattern under the influence of the two outliers.
One might immediately argue that the example above has at least two drawbacks: (i) the data set is too small, and (ii) it is purely artificial. In Figure 5, the sample size is increased to 80 highly correlated normal points, with 30% of them contaminated by other normal points. Inspecting the figure reveals that all four procedures capture the linear pattern perfectly in the left panel (the uncontaminated bivariate normal points), while in the right panel the LTS, MM, and LS lines are drastically changed by the 24 contaminating points, whereas WLS resists the influence of the outliers and captures the original overall linear pattern.
In practice, there are many cases with more than one independent variable; in the following, we consider the case p > 2.
Example 2 
(Multiple linear regression with contaminated normal points). Now, we no longer have the visual advantage of the p = 2 case. To compare the performance of the different procedures, we have to appeal to the performance measures discussed in Section 5.1.
We consider a scheme of contaminated, highly correlated normal data points. We generate 1000 samples {z_i = (x_i^⊤, y_i)^⊤, i ∈ {1, …, n}} with various n from the normal distribution N(μ, Σ), where μ is the zero vector in R^p and Σ is a p × p matrix with diagonal entries 1 and off-diagonal entries 0.9. Then, an ε fraction of each sample is contaminated: m = ⌈nε⌉ points, where ⌈·⌉ is the ceiling function, are randomly selected from {z_i, i ∈ {1, …, n}} and replaced by (3, 3, …, 3, 3).
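A hedged R sketch (not the authors' code; gen_sample is our name) of this contamination scheme, using MASS::mvrnorm for the correlated normal points.

library(MASS)
gen_sample <- function(n, p, eps) {
  Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
  z <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)      # columns x_1, ..., x_{p-1}, y
  m <- ceiling(n * eps)
  if (m > 0) z[sample(n, m), ] <- matrix(3, m, p)     # replace m rows by (3, ..., 3)
  list(x = z[, -p, drop = FALSE], y = z[, p])
}
s <- gen_sample(n = 100, p = 10, eps = 0.10)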
The performance of the CGM in Section 4 (or rather of any iterative procedure) depends heavily on the given initial point β_0. In light of the cyclic feature of the CGM for non-quadratic objective functions (see page 195 of [15]) and our extensive empirical simulation experience, the β returned by the CGM usually is not very different from (or is better than) the initially selected β_0. To achieve better performance for the WLS, we modified the LST of Zuo and Zuo [8] and utilized it as the initial β_0 for the CGM. Results for the four methods and different n, p, and contamination levels ε are listed in Table 1.
Inspecting the table reveals that (i) LS is the fastest in all cases considered and the best performer for pure normal data sets, except the case p = 20 and n = 200, where WLS is even slightly more efficient; it, however, becomes the worst performer when there is contamination (except in the ε = 0.30 cases, where the LTS and MM surprisingly become the worst performers; in theory, both MM and LTS can resist up to 50% contamination without breaking down). (ii) WLS has the smallest EMSE when there is contamination, and this is true even with no contamination when p = 20 and n = 200; it is also the second fastest performer (except in the cases ε = 0.3 and p = 5 or 10, where MM is faster). (iii) LTS is inferior to WLS in all cases, and so is MM (except that MM runs faster when ε = 0.3 and p = 5 or 10). (iv) MM performs better than LTS in TT and in EMSE (except when p = 20 and ε = 0.0, 0.10, or 0.20).
Example 3 
(Performance when β_0 is given). In the calculation of the EMSE above, one assumes that β_0 = 0, in light of the regression equivariance of an estimator t. In this example, we instead specify β_0 (for convenience, still written as β_0) and calculate y_i via the formula y_i = (1, x_i^⊤) β_0 + e_i, where x_i is simulated from a normal distribution with zero mean vector and identity covariance matrix, and e_i follows a standard normal distribution.
We set p = 10, n = 100, and β_0 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)^⊤. There is an ε = 10% contamination of each of the 1000 normal samples (generated as in Example 2), with the following contamination scheme: we randomly select m = ⌈nε⌉ points out of {z_i, i ∈ {1, …, n}} and replace them by (4.5, 4.5, …, 4.5). We then calculate the squared deviation (SD) ||β̂_i − β_0||^2 for each sample, the total time (TT) consumed by each procedure over all 1000 samples, and the relative efficiency (RE) (the ratio of the EMSE of LS to the EMSE of a procedure). The performance of the four procedures with respect to these criteria is displayed via the boxplots in Figure 6.
Inspecting the figure reveals that (i) in terms of squared deviations, LTS and LS perform the same, both with a wide spread and a high EMSE, whereas MM has a much smaller EMSE and WLS has the smallest EMSE (the EMSEs for the four methods (mm, wls, lts, ls) are 1.188640, 1.037962, 2.245551, and 2.245551, respectively). (ii) In terms of total time consumed, LS is the absolute winner, LTS is the absolute loser, and WLS is much better than LTS and slightly better than MM. (iii) In terms of relative efficiency, LTS is the loser (it performs as badly as the LS), whereas WLS earns the trophy and is much more robust against the 10% contamination; MM is the second best.
Up to this point, we have dealt with synthetic data sets. Next, we investigate the performance of MM, WLS, LTS, and LS with respect to real data sets in high dimension.
Example 4 
(Performance for a large real data set). Boston housing is a famous data set (see [17]) that has been studied by many authors with different emphases (transformation, quantile regression, nonparametric regression, etc.) in the literature. For a more detailed description of the data set, see http://lib.stat.cmu.edu/datasets/, accessed on 19 March 2024.
The analysis reported here does not reproduce any of the previous results but consists of just a straight linear regression of the dependent variable (the median price of a house) on the thirteen explanatory variables, as might be used in an initial exploratory analysis of a new data set. We have sample size n = 506 and dimension p = 14.
We assess the performance of the MM, the LTS, the WLS, and the LS as follows. Since some methods depend on randomness, we run the computation R = 1000 times to alleviate the randomness. (i) We compute β̂ with each of the methods, and we do this 1000 times. (ii) We calculate the total time consumed (in seconds) by each method for all replications and the EMSE (with the true β_0 replaced by the sample mean of the 1000 β̂s from (i)), which is the sample variance of all the β̂s up to a factor of 1000/999. The results are reported in Table 2.
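A hedged sketch (not the authors' code) of a single round of these fits: the Boston data ship with the MASS package, ltsReg() comes from robustbase, and lmrob() from robustbase with its default setting gives an MM-type estimator; the authors' study repeats such fits 1000 times and also includes the WLS.

library(MASS)        # Boston data
library(robustbase)  # ltsReg(), lmrob()
fit_ls  <- lm(medv ~ ., data = Boston)
fit_lts <- ltsReg(medv ~ ., data = Boston)
fit_mm  <- lmrob(medv ~ ., data = Boston)
cbind(LS = coef(fit_ls), LTS = coef(fit_lts), MM = coef(fit_mm))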
Inspecting the table reveals that (i) WLS and LS produce the same β̂ for every sample, so their sample variance is zero, whereas this is not the case for MM and LTS. (ii) LS is the fastest, followed by MM, LTS, and WLS. (iii) The relative efficiencies of MM and LTS are 0% since the sample variance of LS is 0, whereas the RE of WLS versus LS is undefined (not a number, since 0 appears in the denominator); on the other hand, one can interpret WLS as being as good as LS in this case, with an RE of 100%.
Example 5 
(Performance for a real data set which is known to contain outliers). We examine the data set of Buxton (1920) [18], which has been studied repeatedly in the literature; see Hawkins and Olive (2002) [19], Olive (2017) [20], Park, Kim, and Kim (2012) [21], and Olive and Hawkins (2011) [22].
We fit the different methods to the Buxton data, an 87 by 7 matrix (the original row 9 was deleted), with height as the response variable and the other four variables as predictors (two variables are excluded due to missing values), as Olive did. For more explanation, see Olive's website at http://parker.ad.siu.edu/Olive/buxton.txt, accessed on 19 March 2024.
We list in Table 3 the output of the methods (mm, lts, lms, wls, ls, hbreg, and rmreg2), where the last two methods were proposed by Olive and Hawkins (2011) and Olive (2017) [20,22], respectively.
With great help from Dr. Olive, we were able to produce the pairwise scatter plots of (ŷ_i, y_i), namely, fitted values versus observed values and fitted values versus fitted values for the different methods. The plot is given in Figure 7 (lms is omitted; it performs much the same as most of the other robust methods).
Inspecting Figure 7 reveals that there are five obvious outliers in the response variable y. Further examination of the data set confirms that observations 61–65 have unusually small response values, between 18 and 19 (while all the others lie between 1500 and 1800), and unusually large head length values. The first row of Figure 7 plots (ŷ_i, y_i) for the different methods. It is seen that five out of the six methods perform much the same, while rmreg2 performs remarkably differently.
The latter produces much larger fitted values for the five outliers, which might be interpreted as the method resisting the influence of the outliers, while the others accommodate the five outliers and produce fitted values of the same order of magnitude as the observed values, which might be interpreted as these methods being heavily influenced by the five outliers.
To better understand the performance of the six methods, we produced a classic plot of fitted values versus standardized residuals in Figure 8, which clearly identifies the five outliers and the performance difference among the six methods (rmreg2 performs remarkably differently from all the others).
Furthermore, to better appreciate the hyperplanes induced by the β̂ in Table 3, and to take advantage of two-dimensional graphical visualization, we look at the two-dimensional vertical cross-section of the hyperplanes in the fifth dimension (restricted/projected to the y versus x_3 plane) and plot the lines (intercept and head) based on Table 3 for the different methods (these differ from the regression lines based on (x_3, y) computed by the different methods) in Figure 9. From the figure, we obtain a better understanding of the behavior of the different methods. All seven lines but the one from rmreg2 have a negative slope.
Note that both the hbreg and rmreg2 functions output more than one solution. We chose hbreg$coef (which is identical to ls) and rmreg2$Bhat for this data set. The lines from hbreg, wls, and lts are almost parallel, while the lines from mm and lms are also almost parallel to the majority but lie far away from the data cloud and should be discarded in this case. Similar plots with other variables could also be constructed.
The lines in Figure 9 are induced from the hyperplanes in Table 3 by projection onto the (head, height) plane in the five-dimensional space. One naturally wonders: are they the same as the lines from a direct regression on (head, height) by the different methods? To appreciate the difference between the two types of lines, we fit (head, height) (as (x, y)) with the different methods; the lines are given in Figure 10. Inspecting the figure reveals that all the lines behave much the same except the one from rmreg2.

6. Concluding Remarks

With a novel weighting scheme, the proposed weighted least squares (WLS) estimator performs as efficiently as the classic least squares (LS) estimator for perfect normal data, while being more efficient than MM and much more efficient than the LTS estimator. It is much more robust than the LS when there is contamination or there are outliers (it is also more robust than MM and LTS when the contamination level reaches 30%). It performs as robustly as the LTS and the MM while being more efficient than both when there are outliers. It possesses the best finite sample breakdown point robustness while achieving high efficiency and computability. It could serve as a robust alternative to the LTS and the MM in practice.

Author Contributions

Writing—original draft, Y.Z. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors thank Wei Shao for insightful comments and stimulating discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Main Results

Proof of Theorem 1. 
For a given z^{(n)} and any β, write M := Σ_{i=1}^{n} y_i^2 ≥ Σ_{i=1}^{n} w(y_i^2/c^*) y_i^2 = O(0_{p×1}, z^{(n)}). For a given β ∈ R^p, hereafter let H_β be the hyperplane determined by y = w^⊤ β, and let H_h be the horizontal hyperplane (i.e., y = 0, the w-space).
Partition the space of all β into two parts, S_1 and S_2, with S_1 containing all β such that H_β and H_h are parallel and S_2 consisting of the remaining β, for which H_β and H_h are not parallel.
If one can show that there is a minimizer of O(β, z^{(n)}) over each S_i, i = 1, 2, then one has an overall minimizer.
Over S_1, β = (β_0, 0_{(p−1)×1}^⊤)^⊤ and r_i = y_i − β_0. If the minimizer does not exist, then no bounded β_0 can minimize O(β, z^{(n)}), and the absolute value of the minimizer β_0^* must be greater than any M^* > 0. We now seek a contradiction. Denote the minimizer by β^* = (β_0^*, 0_{(p−1)×1}^⊤)^⊤ and define β_1^* = (2β_0^*, 0_{(p−1)×1}^⊤)^⊤; then it is readily seen that r_i^2(β_1^*) > r_i^2(β^*) for large enough |β_0^*|. By Lemma 1 below, one has O(β^*, z^{(n)}) > O(β_1^*, z^{(n)}). A contradiction is obtained.
Over S_2, denote by l_β the intersection of H_β with the horizontal hyperplane H_h (we call it a hyperline, though it is (p − 1)-dimensional). Let θ_β ∈ (−π/2, π/2) be the acute angle between H_β and H_h (with θ_β ≠ 0). Consider two cases.
Case I. All w_i = (1, x_i^⊤)^⊤ with r_i^2/c^* ≤ c lie on the hyperline l_β, where r_i = y_i − w_i^⊤ β. Then there is a vertical hyperplane, perpendicular to the horizontal hyperplane H_h (y = 0) and intersecting H_h at l_β, that contains all these points. But this contradicts A1. So we only need to consider the other case.
Case II. Otherwise, define
δ = (1/2) inf{ τ : N(l_β, τ) contains all w_i with r_i^2/c^* ≤ c },
where N(l_β, τ) is the set of points in the w-space whose distance to l_β is no greater than τ. Clearly, 0 < δ < ∞ (the case δ = 0 has been covered in Case I, and δ ≤ max_i ||w_i|| < ∞, where i runs over the indices with r_i^2/c^* ≤ c; the first inequality follows from the fact that the hypotenuse is always longer than either leg).
We now show that when ||β|| > (1 + η)√M/δ, where η > 1 is a fixed number, then
O(β, z^{(n)}) = Σ_{i=1}^{n} w(r_i^2/c^*) r_i^2(β) > M ≥ O(0_{p×1}, z^{(n)}).   (A1)
That is, for the solution of the minimization of (5), one only needs to search over the ball ||β|| ≤ (1 + η)√M/δ, a compact set. Note that O(β, z^{(n)}) is continuous in β since r_i(β) and w(r_i^2/c^*) are. The minimization problem therefore certainly has a solution over this compact set.
The proof is complete if we can show (A1) when ||β|| > (1 + η)√M/δ. It is not difficult to see that there is at least one i_0 such that r_{i_0}^2/c^* ≤ c and w_{i_0} ∉ N(l_β, δ), since otherwise the definition of δ above would be contradicted. Note that θ_β is the angle between the normal vectors (β^⊤, −1)^⊤ and (0^⊤, 1)^⊤ of the hyperplanes H_β and H_h, respectively. Then |tan θ_β| = ||β|| (see [8]), and (see Figure A1)
|w_{i_0}^⊤ β| > δ |tan θ_β| = δ ||β|| > (1 + η)√M.
Now, we have
|r_{i_0}(β)| = |w_{i_0}^⊤ β − y_{i_0}| ≥ |w_{i_0}^⊤ β| − |y_{i_0}| > (1 + η)√M − |y_{i_0}|.
Therefore,
O(β, z^{(n)}) = Σ_{j=1}^{n} w(r_j^2/c^*) r_j^2(β) ≥ w(r_{i_0}^2/c^*) r_{i_0}^2(β) = r_{i_0}^2(β) > ((1 + η)√M − |y_{i_0}|)^2 ≥ ((1 + η)√M − √M)^2 = η^2 M > M ≥ O(0_{p×1}, z^{(n)}).
That is, we have certified (A1). □
Figure A1. A two-dimensional vertical cross-section (one that goes through the points (w_i^⊤, 0) and (w_i^⊤, w_i^⊤ β)) of a figure in R^p. The hyperplanes H_h and H_β intersect at the hyperline l_β (which does not necessarily pass through (0, 0); this is just for illustration). The vertical distance from the point (w_i^⊤, 0) to the hyperplane H_β, |w_i^⊤ β|, is greater than δ |tan(θ_β)|.
Proof of Lemma 1. 
Write w(r^2/c^*) r^2 = c^* w(r^2/c^*) (r^2/c^*) := c^* w(x^2) x^2, where x = |r|/√c^* and x^2 = r^2/c^* > c. It suffices to show that w(x^2) x^2 is strictly decreasing in x (this is intuitively clear from Figure 2), or equivalently, to show that the derivative of w(x^2) x^2 is negative. A straightforward calculus derivation yields
(w(x^2) x^2)′ = (2x/(1 − e^{−k})) [ e^{−k(1 − c/x^2)^2} (1 − (2kc/x^2)(1 − c/x^2)) − e^{−k} ].
Now it suffices to show that
e^{−k(1 − c/x^2)^2} (1 − (2kc/x^2)(1 − c/x^2)) − e^{−k} < 0,
or, equivalently, it suffices to show that
e^{k[(1 − c/x^2)^2 − 1]} > 1 − (2kc/x^2)(1 − c/x^2).
For convenience, write t := c/x^2; then t → 0 as x^2 → ∞. Now, we want to show that
e^{−kt(2 − t)} > 1 − 2kt(1 − t).   (A3)
A straightforward Taylor expansion e^{x} = 1 + x + x^2/2! + x^3/3! + ⋯ applied to the left-hand side (LHS) of (A3) yields
e^{−kt(2 − t)} = 1 + (−2kt + kt^2) + (−2kt + kt^2)^2/2 + (−2kt + kt^2)^3/3! + (−2kt + kt^2)^4/4! + ⋯
> 1 + (−2kt + kt^2) + (kt(2 − t))^2/2 − (kt(2 − t))^3/3!
= 1 − 2kt(1 − t) − kt^2 + (kt(2 − t))^2/6 + [2(kt(2 − t))^2 − (kt(2 − t))^3]/6
= 1 − 2kt(1 − t) + kt^2 [k(2 − t)^2/6 − 1] + k^2 t^2 (2 − t)^2 [2 − kt(2 − t)]/6
> 1 − 2kt(1 − t),   (A4)
where the first inequality follows from the fact that
(kt(2 − t))^{2n}/(2n)! − (kt(2 − t))^{2n+1}/(2n + 1)! = (kt(2 − t))^{2n} (2n + 1 − kt(2 − t))/(2n + 1)! > 0
for n ≥ 2 and small enough t, and the last inequality in (A4) follows from the facts that (i) k(2 − t)^2/6 − 1 > 0 (if t < 2 − √(6/k)) and (ii) 2 − kt(2 − t) > 0 (if t < 1 − √(1 − 2/k)).
Combining (A4) with (A3), we complete the proof. □
Proof of Theorem 3. 
It suffices to treat the case p > 1; furthermore, by Theorem 4 on p. 125 of [9], it is sufficient to show that m = ⌊(n − p)/2⌋ contaminating points are not enough to break down β̂_wls. Assume otherwise. This implies that either
(I)
|(β̂_wls^n((z_m^{(n)})_j))_1| → ∞ while ||(β̂_wls^n((z_m^{(n)})_j))_2|| stays finite, or
(II)
||(β̂_wls^n((z_m^{(n)})_j))_2|| = |tan(θ_{β̂_wls^n((z_m^{(n)})_j)})| → ∞,
along a sequence of contaminated samples (z_m^{(n)})_j as j → ∞, where the subscripts 1 and 2 correspond to the intercept and the non-intercept terms, respectively, as in the decomposition β = (β_1, β_2^⊤)^⊤ ∈ R^p. We seek a contradiction in both cases. For simplicity of description, write β^j := β̂_wls^n((z_m^{(n)})_j).
Case (I). For simplicity, write β^j = (β_1, β_2^⊤)^⊤ and β^{jj} = (2mβ_1, β_2^⊤)^⊤. Then it is readily seen that r_i^2(β^j) < r_i^2(β^{jj}) for large positive m. In light of Lemma 1, one has O(β^j) > O(β^{jj}); a contradiction is obtained.
Case (II). This case implies that there is a sequence of hyperplanes induced from β̂_wls^n((z_m^{(n)})_j) that tend to the eventual vertical position as j → ∞. Denote these hyperplanes by H_j, and let H_j intersect the horizontal hyperplane H_h at l_j, the hyperlines (the common part of H_j and H_h).
For simplicity, write the minimizer β^j = (β_1, β_2^⊤)^⊤ := β̂_wls^n((z_m^{(n)})_j). Introduce a new hyperplane determined by β^{jj} = (αβ_1, κβ_2^⊤)^⊤ (κ > 1 is a positive integer). This β^{jj} amounts to tilting H_j (corresponding to β^j) along l_j to a more vertical position H_{jj} (corresponding to β^{jj}). Note that it is possible that no data points are touched during the tilting process, except those originally on H_j, since both hyperplanes are almost vertical. It is readily seen that r_i^2(β^{jj}) > r_i^2(β^j), except for those points (x_i, y_i) that originally lie on l_j with a zero residual. By Lemma 1, O(β^j) > O(β^{jj}), and a contradiction is reached. □

References

1. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Statist. 1964, 35, 73–101.
2. Rousseeuw, P.J. Least median of squares regression. J. Amer. Statist. Assoc. 1984, 79, 871–880.
3. Rousseeuw, P.J.; Yohai, V.J. Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis; Lecture Notes in Statistics; Springer: New York, NY, USA, 1984; Volume 26, pp. 256–272.
4. Yohai, V.J. High breakdown-point and high efficiency estimates for regression. Ann. Statist. 1987, 15, 642–656.
5. Yohai, V.J.; Zamar, R.H. High breakdown estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 1988, 83, 406–413.
6. Rousseeuw, P.J.; Hubert, M. Regression depth (with discussion). J. Am. Stat. Assoc. 1999, 94, 388–433.
7. Zuo, Y. On general notions of depth for regression. Stat. Sci. 2021, 36, 142–157.
8. Zuo, Y.; Zuo, H. Least sum of squares of trimmed residuals regression. Electron. J. Stat. 2023, 17, 2416–2446.
9. Rousseeuw, P.J.; Leroy, A. Robust Regression and Outlier Detection; Wiley: New York, NY, USA, 1987.
10. Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; John Wiley & Sons: Hoboken, NJ, USA, 2006.
11. Müller, C. Redescending M-estimators in regression analysis, cluster analysis and image analysis. Discuss. Math. Stat. 2004, 24, 59–75.
12. Zuo, Y. Projection-based depth functions and associated medians. Ann. Stat. 2003, 31, 1460–1490.
13. Donoho, D.L.; Huber, P.J. The notion of breakdown point. In A Festschrift for Erich L. Lehmann; Bickel, P.J., Doksum, K.A., Hodges, J.L., Jr., Eds.; Wadsworth: Belmont, CA, USA, 1983; pp. 157–184.
14. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
15. Edgar, T.F.; Himmelblau, D.M.; Lasdon, L.S. Optimization of Chemical Processes, 2nd ed.; McGraw-Hill Chemical Engineering Series; McGraw-Hill: New York, NY, USA, 2001.
16. Gilbert, J.C.; Nocedal, J. Global convergence properties of conjugate gradient methods for optimization. SIAM J. Optim. 1992, 2, 21–42.
17. Harrison, D.; Rubinfeld, D.L. Hedonic prices and the demand for clean air. J. Environ. Econ. Manag. 1978, 5, 81–102.
18. Buxton, L.H.D. The anthropology of Cyprus. J. R. Anthropol. Inst. Great Br. Irel. 1920, 50, 183–235.
19. Hawkins, D.M.; Olive, D.J. Inconsistency of resampling algorithms for high breakdown regression estimators and a new algorithm (with discussion). J. Am. Stat. Assoc. 2002, 97, 136–159.
20. Olive, D.J. Robust Multivariate Analysis; Springer: New York, NY, USA, 2017.
21. Park, Y.; Kim, D.; Kim, S. Robust regression using data partitioning and M-estimation. Commun. Stat. Simul. Comput. 2012, 8, 1282–1300.
22. Olive, D.J.; Hawkins, D.M. Practical High Breakdown Regression. 2011. Available online: http://www.math.siu.edu/olive/pphbreg.pdf (accessed on 19 March 2024).
Figure 1. Weight function w(x) when k = 5 and c = 100. Left: w(x); right: w′(x).
Figure 2. Behavior of the function w(x^2/c^*) x^2 when k = 5 and c = 100, x > c.
Figure 3. Behavior of the function γ_i(r_i)/c^* when k = 5 and r_i > c, with different values of c.
Figure 4. Left panel: plot of the seven artificial points and two reference lines, y = 0 and y = x. Right panel: the same seven points fitted by LTS, WLS, MM, and LS (benchmark). The solid black line is the LTS line given by ltsReg; the green dashed line is given by WLS; the red dotted line is given by LS, which is identical to the LTS line and almost identical to the blue dot-dashed line given by MM in this case.
Figure 5. We show 80 highly correlated normal points with 30% of them contaminated by other normal points. Left: scatterplot of the uncontaminated perfect normal data set and four almost identical lines. Right: LTS, WLS, MM, and LS lines for the contaminated data set. The solid black line is the LTS line, the dotted green line is the WLS line, the dot-dashed blue line is given by MM, and the dashed red line is given by LS (parallel to the LTS line in this case). The MM line is almost identical to the LTS and LS lines.
Figure 6. Performance of the four procedures with respect to 1000 normal samples (points are highly correlated) with p = 10 and n = 100; each sample suffers 10% contamination.
Figure 7. Pairwise plots of fitted values versus observed values and fitted values versus fitted values for the six different methods.
Figure 8. Plot of fitted values versus standardized residuals for the six different methods.
Figure 9. Restricted to the (head length, height)-space, the two-dimensional vertical cross-sections of the hyperplanes of the seven different methods.
Figure 10. Regression lines based on (x = head length, y = height) by the seven different methods.
Table 1. EMSE, TT (s), and RE for MM, LTS, WLS, and LS based on all 1000 samples for various n, p, and contamination levels ε.
Normal data sets, each with contamination rate ε.

p = 5, n = 50
            ε = 0%                        ε = 10%
Method   EMSE      TT       RE         EMSE      TT       RE
mm       0.3356    9.9427   0.9767     0.3357    9.8483   2.9876
wls      0.3309    7.3604   0.9905     0.3324    9.4740   3.0178
lts      0.3975    15.883   0.8246     0.3670    15.957   2.7326
ls       0.3278    1.4243   1.0000     1.0030    1.2834   1.0000
            ε = 20%                       ε = 30%
mm       0.3565    9.8519   5.3673     8.4738    10.532   0.3311
wls      0.3546    12.329   5.3951     0.3711    15.846   7.5618
lts      0.6546    16.662   2.9228     27.223    17.026   0.1030
ls       1.9132    1.3549   1.0000     2.8060    1.3472   1.0000

p = 10, n = 100
            ε = 0%                        ε = 10%
mm       0.2378    21.421   0.8839     0.2372    20.892   5.5816
wls      0.2105    11.112   0.9985     0.2226    15.680   5.9499
lts      0.2919    48.648   0.7201     0.2584    49.615   5.1245
ls       0.2102    1.3298   1.0000     1.3242    1.2542   1.0000
            ε = 20%                       ε = 30%
mm       0.2410    20.669   10.244     5.1124    21.891   0.6979
wls      0.2372    20.535   10.407     0.2600    29.146   13.724
lts      0.2635    55.018   9.3714     40.403    64.803   0.0883
ls       2.4691    1.2462   1.0000     3.5680    1.2626   1.0000

p = 20, n = 200
            ε = 0%                        ε = 10%
mm       0.2429    84.709   0.6564     0.2183    83.525   6.6713
wls      0.1592    28.664   1.0021     0.1726    39.100   8.4390
lts      0.2208    259.21   0.7224     0.2015    293.40   7.2261
ls       0.1595    1.4936   1.0000     1.4564    1.4775   1.0000
            ε = 20%                       ε = 30%
mm       0.5299    78.387   5.1922     20.908    90.385   0.1899
wls      0.1875    51.280   14.677     0.2126    71.148   18.672
lts      0.1983    387.56   13.877     33.918    832.75   0.1170
ls       2.7512    1.4566   1.0000     3.9694    1.4300   1.0000
Table 2. EMSE, TT (seconds), and RE for MM, LTS, WLS, and LS based on the Boston housing real data set.

Performance measure   MM                 WLS          LTS                LS
EMSE                  4.352446 × 10^−5   0.0000       4.619404 × 10^−1   0.0000
TT                    120.368098         161.465350   125.707603         1.487204
RE                    0                  NaN          0                  NaN
Table 3. Outputs of different methods based on the Buxton data set.

Methods   Intercept       Head         Nasal       Bigonal      Cephalic
hbreg     1546.3737947    −1.1288988   6.1133570   −0.5871985   1.1263726
rmreg2    807.3303643     1.7963508    4.8262483   −0.1481552   3.9353752
wls       1437.3761729    −1.1107210   5.2669763   0.9199388    0.9766958
lts       1066.188018     −1.104774    6.476802    2.523815     2.623706
lms       449.515         −1.061       7.317       6.215        4.790
mm        1511.5503972    −1.1289155   6.5942674   −0.6341536   1.2965989
ls        1546.3737947    −1.1288988   6.1133570   −0.5871985   1.1263726
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
