Article

Improving the Robustness of the Theil-Sen Estimator Using a Simple Heuristic-Based Modification

1 Department of Data Science and Engineering, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
2 Biotechnology Center, Silesian University of Technology, Bolesława Krzywoustego 8, 44-100 Gliwice, Poland
Symmetry 2024, 16(6), 698; https://doi.org/10.3390/sym16060698
Submission received: 25 April 2024 / Revised: 31 May 2024 / Accepted: 3 June 2024 / Published: 5 June 2024

Abstract: One of the most widely used robust regression methods for solving simple linear regression problems is the Theil-Sen (TS) estimator. This estimator has some notable advantages; however, it does not belong to the most robust estimation methods (called high-breakdown estimators) and is prone to outliers whose distribution is highly asymmetric with respect to the correct data points. This paper presents a modification of the TS estimator, the Robustified Theil-Sen (RTS) estimator. The new method uses a heuristic-based selection procedure to reduce the number of initial estimates of the regression function parameters computed with at least one outlier, thereby improving the regression results. The use of this heuristic procedure only slightly increases the computational time required for using the RTS estimator compared to the TS estimator. Preliminary results of two numerical experiments presented in the paper show that the RTS estimator outperforms other comparable estimators, i.e., the TS estimator and the repeated median estimator, in terms of robustness. The results presented also suggest that the breakdown point value (which is a measure of the robustness of estimators) of the RTS estimator is higher than that of the TS estimator and equal to that of the high-breakdown estimators.

1. Introduction

Regression analysis of a data set is a critical step in many applications [1,2]. One of the important problems that occurs during this analysis is the presence of outliers—that is, atypical values in the context of this data set—in the analyzed data sets and their negative influence on the regression results [3,4,5,6,7]. For real data, it should always be assumed that the analyzed data set can contain a certain number of outliers. It is important to note that the usage of standard and very commonly used regression methods based on the idea of Least Squares (LS) minimization (e.g., the Ordinary Least Squares (OLS) method) can lead to obtaining wrong results even when there is a single outlier in the analyzed data set. Hence, there is a need for regression methods that are not as sensitive to outliers as the LS-based methods—in the literature, such methods are referred to as robust regression methods [4,6].
One of the most widely used robust regression methods [8] for solving probably the most common regression problem, simple linear regression (SLR) in the presence of outliers, is the Theil-Sen (TS) estimator. The method was introduced by Henri Theil in 1950 [9] and later extended by Pranab K. Sen in 1968 [10]. Although more than 50 years have passed since its development, the TS estimator is still used in many areas, such as image processing and machine vision [11,12], object positioning using 5G and 6G networks [13], and data preprocessing for machine learning applications [14], to name a few. It should be noted that the TS estimator, along with the Mann–Kendall test [15,16], is widely used for trend analysis in earth, environmental, and physical geography research. These two methods are used, for example, in the analysis of air pollution [17,18], long-term changes in wind power [19], streamflow [20], long-term trends in meteorological and hydrological drought [21], changes in rainfall [22], rainfall and temperature trends [23,24], annual rainfall and meteorological drought [25], groundwater storage [26], vegetation trends [27] or changes in oceanic or sea wave height [28,29,30,31].
The popularity of the TS estimator is due to its advantages, including, but not limited to, its high asymptotic efficiency, its ability to handle skewed or heteroskedastic data, and its simplicity and computational efficiency [32,33]. The disadvantages of the TS estimator include its limited robustness (it is not among the most robust estimators) and the relatively large errors in its estimation results when there are outliers whose distribution is highly asymmetric with respect to the correct data points.
The robustness of an estimator T can be evaluated using, among other measures, the asymptotic breakdown point
ε*(T) = lim_{n→∞} θ,
where θ = o/n, o is the number of outliers, and n is the number of all data points in the analyzed data set D. The measure ε*(T) gives the smallest ratio of o to n for which the results obtained using T can be arbitrarily bad or meaningless [34,35,36]. In the case of the TS estimator [4],
ε*(TS) = 1 − 1/√2 ≈ 0.293,
whereas the most robust estimators, known as high-breakdown estimators or resistant estimators, have ε*(T) = 0.5. It should be noted that ε*(T) = 0.5 is the maximum possible value of this measure; a value ε*(T) > 0.5 would mean that the number of outliers in the analyzed data set is larger than the number of correctly measured values [36].
This paper focuses on improving two previously mentioned drawbacks of the TS estimator, namely its limited robustness and its sensitivity to asymmetrically distributed outliers. The proposed solution, in the form of a Robustified TS (RTS) estimator, uses an additional simple heuristic-based procedure that leads to a significant improvement in the properties of the new estimator compared to the original TS estimator, at the cost of only a slight increase in the required computational effort.
The rest of this article is organized as follows. Section 2 briefly describes the TS estimator algorithm. Section 3 discusses the main reasons for the limited robustness of the TS estimator and presents the RTS estimator. Section 4 presents the results of two numerical experiments conducted to compare the RTS estimator with the TS and Repeated Median estimators. The last section, Section 5, concludes the article.

2. The Theil-Sen Estimator

The TS estimator can be used, among other problems, for solving the SLR problem, i.e., the estimation of the regression model parameters for a data set D = {d_1, d_2, …, d_i, …, d_n} of n bivariate data points d_i = (x_i, y_i) representing observations (measurements), where x_i is an explanatory (independent) variable and y_i is an explained (dependent) variable, and the linear function
y_i = α x_i + β
is used as the regression model. The purpose of SLR is to estimate the values of the parameters α and β, that is, α̂ and β̂, respectively, based on the data points d_i ∈ D. Note that the number of outliers in the analyzed data set D should satisfy the condition
0 ≤ o ≤ n/2,
which is a common assumption when using robust estimators [36].
In the case of SLR, the TS estimator [9,10] works in two simple steps. In the first step, the estimate of the line slope α is calculated according to the formula
α̂ = median_{i<j} α̃_ij,
where
α̃_ij = (y_j − y_i) / (x_j − x_i)
and i, j ∈ {1, …, n}. The second step is the estimation of β̂, that is, of the intercept of the line with the Y axis; this step can be realized in one of two ways:
  • Directly—in this case, the value β̂ is calculated based only on the values of the data points:
    β̂ = median_{i<j} β̃_ij,
where
    β̃_ij = (x_j y_i − x_i y_j) / (x_j − x_i),
  • Hierarchically—using the previously calculated value of α̂, see (5), that is,
    β̂ = median_i (y_i − α̂ x_i).
From now on, TSd and TSh will denote, respectively, the two above-mentioned versions of the TS estimator.
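For reference, both TS variants can be sketched in a few lines of Python with NumPy (a minimal illustration, not the authors' code; the function name and the vectorized pairing via `triu_indices` are implementation choices):

```python
import numpy as np

def theil_sen(x, y, variant="h"):
    """Theil-Sen estimator for simple linear regression.
    variant: 'd' (direct intercept) or 'h' (hierarchical intercept)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    i, j = np.triu_indices(len(x), k=1)            # all index pairs i < j
    dx = x[j] - x[i]
    a_hat = np.median((y[j] - y[i]) / dx)          # median of pairwise slopes
    if variant == "d":
        # direct: intercept from pairwise estimates only
        b_hat = np.median((x[j] * y[i] - x[i] * y[j]) / dx)
    else:
        # hierarchical: intercept from residuals under the estimated slope
        b_hat = np.median(y - a_hat * x)
    return a_hat, b_hat
```

On outlier-free data on the line y = 2x + 1, both variants return the slope 2 and intercept 1 exactly.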

3. The Robustified Theil-Sen Estimator

3.1. Causes of the Limited Robustness of the TS Estimator

To understand the causes of the limited robustness of the TS estimator, it is important to analyze how the TSd estimator works. Let us assume that there is a bivariate data set D with cardinality card(D) = n, of which o data points are outliers; this means that c = n − o of the data points in D are correct data points. Taking this into account, it can be seen that only
d̑ = (n − o)(n − o − 1)/2 = c(c − 1)/2
values of α̃_ij and β̃_ij are calculated using data points d_i and d_j such that each of these points is a correct data point, that is, neither of them is an outlier. From now on, such estimates α̃_ij and β̃_ij are referred to as correct estimates and denoted as α̑_ij and β̑_ij, respectively. Hence, d̑ is the number of correct estimates for each of the estimated parameters.
The number of all calculated estimates for α and β is equal to
t = card({α̃_ij | i < j}) = card({β̃_ij | i < j}) = n(n − 1)/2.
The relation between d̑ and t is best represented by the ratio
τ(n, c) = d̑/t = c(c − 1) / (n(n − 1)).
Let ρ be the fraction of correct data points in the analyzed data set, that is, ρ = c/n. Thus, substituting ρn for c,
τ(n, ρ) = ρn(ρn − 1) / (n(n − 1)) = ρ² · (1 − 1/(ρn)) / (1 − 1/n).
Therefore, for n → ∞,
τ_ρ = lim_{n→∞} τ(n, ρ) = ρ²,
which means that for large n and ρ = 0.5, only 25% of all estimates are correct estimates; for smaller n, this fraction can be even smaller than 25% (see Appendix A).
In the case of the TSd estimator, the median function is used to calculate the α̂ and β̂ values, (5) and (7), respectively. Since the breakdown point of the median function is ε*(median) = 0.5, whenever τ(n, ρ) ≤ 0.5 the TSd estimator may return incorrect estimates. In the case of the TSh estimator, the above analysis applies only to the estimation of α̂. However, a wrong value of α̂ affects the estimation of β̂, (8). This means that even in the case of the TSh estimator, the condition τ(n, ρ) > 0.5 must be satisfied to reduce the possibility of obtaining meaningless results.
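The shrinking fraction of correct pairwise estimates is easy to check numerically; a minimal sketch (the helper name is illustrative, not from the paper):

```python
def tau(n, c):
    """Fraction of pairwise estimates computed from two correct points:
    tau(n, c) = c(c - 1) / (n(n - 1))."""
    return c * (c - 1) / (n * (n - 1))

# With half the points correct, the fraction tends to rho^2 = 0.25
# as n grows, and is even lower for small n:
print(tau(10, 5))          # 0.222...
print(tau(10_000, 5_000))  # close to 0.25
```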

3.2. General Idea of the RTS Estimator

The presented analysis shows that the quality of the estimation results obtained by the TS estimator can be improved by introducing two additional steps into this method, namely
  • A selection of the subsets of correct estimates Ȃ = {α̑_ij} and B̑ = {β̑_ij} from the sets of all initial estimates, Ã = {α̃_ij} and B̃ = {β̃_ij}, respectively,
  • A search for the values of the optimal estimates α̂ and β̂ only among the sets of correct estimates Ȃ and B̑, respectively.
The use of these additional steps is the founding idea of the developed RTS estimator.
The search for the optimal estimates α̂ and β̂ in the sets Ȃ and B̑ can be carried out in the same way as in the TS estimator, that is, using the median function. Nonetheless, selecting the subsets of correct estimates, i.e., Ȃ from Ã and B̑ from B̃, is a much more difficult problem. However, it should be noted that the median function is a high-breakdown estimator (ε*(median) = 0.5), which means that the occurrence of a limited number of non-correct estimates in the Ȃ or B̑ sets should have only a limited influence on the final estimation result obtained from the RTS estimator. Taking this into account, it is not necessary for the selection procedure to be perfectly accurate—it is sufficient if only the majority of the selected estimates are correct estimates. The consequence of this observation is the possibility of using the heuristic approach described below as the selection procedure.
The heuristic selection procedure implemented in the RTS estimator is based on an observation concerning the difference
Δs_Ẽ(i) = e_{i+s−1} − e_i,
where e_i is the first and e_{i+s−1} is the last element of an s-element subset of consecutive elements of the estimate set Ẽ (in the case of the RTS estimator, Ã or B̃) sorted in ascending order. This difference is usually smaller when only correct estimates belong to the subset than when the subset contains both correct and non-correct estimates or only non-correct estimates. Furthermore, the values of the differences of s-nearest elements defined in this way usually increase with the increasing number of non-correct estimates belonging to these s-element subsets. These observations have been made for data sets D for which the dispersion of the observation (measurement) error of the correct data points in D is smaller than the dispersion of the outliers belonging to D. This heuristic made it possible to reduce the task of selecting the correct estimates to a simple procedure of finding the minimum value in the one-dimensional list
[Δs_Ẽ(1), Δs_Ẽ(2), …, Δs_Ẽ(t − s + 1)],
where s is a parameter of the procedure and represents the desired number of correct estimates in the set Ẽ.
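The heuristic therefore amounts to sorting the estimates and sliding a window of length s over them, keeping the window with the smallest spread. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def densest_window(estimates, s):
    """Return the s consecutive sorted estimates with the smallest spread,
    i.e. the window minimizing e[i + s - 1] - e[i]."""
    e = np.sort(np.asarray(estimates, float))
    diffs = e[s - 1:] - e[:len(e) - s + 1]   # all s-nearest differences
    k = int(np.argmin(diffs))                # index of the tightest window
    return e[k:k + s]
```

For example, with estimates [5.0, 0.9, 9.0, 1.1, 1.0] and s = 3, the tightest window is the cluster [0.9, 1.0, 1.1], which here plays the role of the correct estimates.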

3.3. Description of the RTS Estimator Algorithm

The ideas presented in the previous section have been implemented in two versions of the RTS estimator, i.e., the direct (RTSd) and hierarchical (RTSh) versions. As with the TSd and TSh estimators, the only difference between the RTSd and RTSh estimators is the method of calculating the intercept estimate β̂. The algorithm for calculating the line slope estimate α̂ is common to both variants of the RTS estimator.

3.3.1. Calculation of α ^ in RTSd and RTSh Estimators

The first step of both versions of the RTS estimator is the calculation of all estimates α̃_ij of the line slope α, that is,
∀ i < j: α̃_ij = (y_j − y_i) / (x_j − x_i),
where all estimates α̃_ij form the set Ã, i.e., Ã = {α̃_ij}. The next stage is the selection of the subset of correct estimates Ȃ from Ã. For this purpose, all elements of Ã have to be sorted in ascending order, after which
Ã := sort(Ã) = sort({α̃_ij}) = {α̃_(k)},
where (k) is the ordering index, sort(·) is the ordering operator, and := is the assignment operator. Hence,
α̃_(1) ≤ α̃_(2) ≤ … ≤ α̃_(k) ≤ … ≤ α̃_(t),
where t is calculated using (11). The next step is the search for an optimal s-element subset of Ã, i.e.,
k̂ = argmin_k Δs_Ã(k) = argmin_k (α̃_(l) − α̃_(k)),
where Δs_Ã(k) is the k-th s-nearest-elements difference for Ã, the idea of which was introduced in Section 3.2, and
k ∈ {1, …, t − s + 1}, l = k + s − 1.
The value of s can easily be calculated based on the assumed number of outliers õ in the analyzed data set D. The relationship between õ and s is analogous to the relationship (10) between o and d̑, so s is calculated as follows:
s = (n − õ)(n − õ − 1)/2,
where
õ = ⌊θ̃ · n⌋;
⌊v⌋ denotes the floor function, i.e., a function that returns the largest integer less than or equal to a given real number v. The parameter θ̃ is the only parameter of the RTS estimator that must be provided by the user; it is the assumed ratio of the number of outliers to the number of all data points in the analyzed set D, hence 0 ≤ θ̃ ≤ 0.5. For θ̃ = 0, the RTS estimator gives the same results as the standard TS estimator. As the number of outliers o in the data set D increases, the value of the parameter θ̃ should also be increased. Choosing θ̃ = 0.5 means that the worst case is assumed, that is, that 50% of the points in the set D are outliers. Consequently, θ̃ = 0.5 can be used as a default value—as a result, the RTS estimator can be considered a parameterless estimator.
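For example, with the default θ̃ = 0.5 and n = 100 data points, the procedure assumes õ = 50 outliers and searches for a window of s = 50 · 49/2 = 1225 estimates. A small sketch of the two formulas (the function name is an assumption):

```python
import math

def window_size(n, theta=0.5):
    """Window length s derived from the assumed outlier ratio theta."""
    o_assumed = math.floor(theta * n)       # assumed number of outliers
    return (n - o_assumed) * (n - o_assumed - 1) // 2

print(window_size(100))        # 1225
print(window_size(100, 0.0))   # 4950 = n(n-1)/2, i.e. plain TS
```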
Based on k̂, the set Ȃ is determined according to the formula
Ȃ = {α̃_(k̂), α̃_(k̂+1), …, α̃_(l̂)},
where l̂ = k̂ + s − 1. The last stage of the estimation of the line slope α is the calculation of α̂, which is defined as
α̂ = median(Ȃ).
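The steps above can be put together in a short, self-contained sketch of the slope stage (an illustration under the stated assumptions; names are not from the paper):

```python
import math
import numpy as np

def rts_slope(x, y, theta=0.5):
    """Slope estimate of the RTS estimator (Section 3.3.1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    i, j = np.triu_indices(n, k=1)
    a = np.sort((y[j] - y[i]) / (x[j] - x[i]))      # sorted pairwise slopes
    o_assumed = math.floor(theta * n)               # assumed outlier count
    s = (n - o_assumed) * (n - o_assumed - 1) // 2  # window length
    k = int(np.argmin(a[s - 1:] - a[:len(a) - s + 1]))  # tightest window
    return float(np.median(a[k:k + s]))             # median over the window
```

On ten points of the line y = 2x + 5 with three large one-sided outliers, this sketch recovers the slope 2 exactly, whereas the plain median of all 45 pairwise slopes is pulled above 2.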

3.3.2. Calculation of β ^ Using the RTSd Estimator

In the case of the RTSd estimator, the algorithm for calculating β̂ is analogous to the algorithm presented above for calculating α̂. Hence, first the set B̃ = {β̃_ij} is created, with the estimates β̃_ij calculated as follows:
∀ i < j: β̃_ij = (x_j y_i − x_i y_j) / (x_j − x_i).
Then, the selection of the subset B̑ of correct estimates from the set B̃ is performed. This procedure begins with sorting B̃ in ascending order, that is,
B̃ := sort(B̃) = sort({β̃_ij}) = {β̃_(k)},
which means that
β̃_(1) ≤ β̃_(2) ≤ … ≤ β̃_(k) ≤ … ≤ β̃_(t).
Then, the search for an optimal s-element subset of B̃ is performed as follows:
k̂ = argmin_k Δs_B̃(k) = argmin_k (β̃_(l) − β̃_(k)),
where k and l are given by (21), and its result is
B̑ = {β̃_(k̂), …, β̃_(l̂)}.
The last step of the RTSd estimator algorithm is the calculation of the final estimate of β according to the formula
β̂ = median(B̑).

3.3.3. Calculation of β ^ Using the RTSh Estimator

The calculation of β̂ in the RTSh version of the RTS estimator is based on the previously calculated value of α̂. The procedure starts by creating the set B̃ = {β̃_i}, where
∀ i: β̃_i = y_i − α̂ x_i.
The next step is the procedure of selecting the subset B̑ from the set B̃, which begins with the ascending sorting of B̃, i.e.,
B̃ := sort(B̃) = sort({β̃_i}) = {β̃_(k)},
thereby
β̃_(1) ≤ β̃_(2) ≤ … ≤ β̃_(k) ≤ … ≤ β̃_(n).
Then, the optimal value of k is calculated as follows:
k̂ = argmin_k Δ_B̃(k) = argmin_k (β̃_(l) − β̃_(k)),
but because the estimates β̃_i are calculated based on only one data point d_i, the values k and l have to be defined using õ and n instead of the s and t values used in (21); thus,
k ∈ {1, …, n − õ + 1}, l = k + õ − 1.
The value õ is calculated according to (23). Then, using k̂, the subset B̑ is created as follows:
B̑ = {β̃_(k̂), …, β̃_(l̂)},
where, with respect to (36), l̂ = k̂ + õ − 1. Finally, the last step of the RTSh estimator algorithm is the calculation of the final estimate of β, that is,
β̂ = median(B̑).
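The intercept stage of the RTSh variant follows the same pattern on the n residuals, with the window defined by õ and n as in (36). An illustrative sketch (names are assumptions; with the default θ̃ = 0.5 the window of length õ coincides with the worst-case number of correct points):

```python
import math
import numpy as np

def rtsh_intercept(x, y, a_hat, theta=0.5):
    """Hierarchical intercept of the RTSh estimator (Section 3.3.3)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = np.sort(y - a_hat * x)            # one residual per data point
    w = math.floor(theta * n)             # window length, per Eq. (36)
    k = int(np.argmin(b[w - 1:] - b[:n - w + 1]))  # tightest residual window
    return float(np.median(b[k:k + w]))
```

Given the slope estimate α̂ = 1 for ten points of y = x + 10 with three large outliers, the sketch returns the intercept 10 exactly.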

4. Numerical Experiments

4.1. General Assumptions

The purpose of the experiments was to compare the newly developed RTS estimator with its direct predecessor, the TS estimator, and with another modification of the TS estimator, the Repeated Median (RM) estimator [37,38]. The comparison was made in terms of:
  • Exp-A—evaluation of the dispersion of the estimation results,
  • Exp-B—preliminary determination of the breakdown point value of the RTS estimator based on the comparison of the RTS estimation results with the results of the TS and RM estimators for data sets D containing a different number o of outliers.
In both experiments, the normal distribution (with different parameters) was used to simulate both the correct observations and the outliers, and the outliers are highly asymmetrically distributed with respect to the correct data points. The estimation results for α̂ and β̂ were compared using both the direct and the hierarchical versions of the analyzed estimators. Using the RTS estimator requires specifying the assumed ratio θ̃ of outliers in the analyzed set D; the default value θ̃ = 0.5 was used in both experiments. Both experiments were carried out using specially developed software written in the MATLAB R2023b Update 7 environment under the MS Windows 10 operating system.

4.2. Experiment A

The comparison was performed for a simulated SLR problem with data sets D containing n = 100 data points, o = 40 of which are outliers (hence, the outlier ratio in D is θ = o/n = 0.4). Each set D was created in four steps:
  • Drawing the values of the x_i ∈ [2, 10] coordinates according to a uniform distribution, i ∈ {1, …, 100},
  • Drawing, according to a uniform distribution, a list of o = 40 indices i of the data points d_i that will be outliers,
  • For correctly measured (correct, non-outlier) data points, simulating their y_i values with a normally distributed N(μ, σ) measurement error as follows:
    y_i = α x_i + β + N(μ, σ),
    where α = 1, β = 5, μ = 0, and σ = 1,
  • Calculating the outlier values according to the formula
    y_i = 2α x_i + β − α x_out + Δ_out + N(0, σ_out),
    where
    Δ_out = −σ(δ_out + σ_out) for x_i < x_out, and Δ_out = σ(δ_out + σ_out) for x_i ≥ x_out,
    and δ_out = 7.5, σ_out = 5, and x_out = 5.
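The four steps can be replicated as below. This is a sketch, not the authors' code: the RNG seed and the reconstructed signs in the outlier formula are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)                  # seed chosen arbitrarily
n, o = 100, 40
alpha, beta, mu, sigma = 1.0, 5.0, 0.0, 1.0
delta_out, sigma_out, x_out = 7.5, 5.0, 5.0

x = rng.uniform(2, 10, n)                        # step 1: x coordinates
is_out = np.zeros(n, dtype=bool)
is_out[rng.permutation(n)[:o]] = True            # step 2: outlier indices

y = alpha * x + beta + rng.normal(mu, sigma, n)  # step 3: correct points
shift = np.where(x < x_out, -1.0, 1.0) * sigma * (delta_out + sigma_out)
y_out = (2 * alpha * x + beta - alpha * x_out    # step 4: outlier values
         + shift + rng.normal(0, sigma_out, n))
y[is_out] = y_out[is_out]                        # overwrite the o outliers
```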
The exemplary data set D and the estimation results obtained for this set using the compared estimators are presented in Figure 1. In addition, this figure also presents the results obtained using the OLS estimator, as well as the results obtained using the OLS estimator but only for the correct data points contained in D (these results are denoted as OLSc). It should be noted that the results obtained with the OLSc estimator are only a theoretically realizable idealization intended to show how accurate the estimation can be for the data sets used in the experiment.
In order to obtain more reliable results for the comparison, the whole process of drawing the data sets D and calculating the estimates α ^ and β ^ based on them was repeated 1000 times. The box plots of the obtained sets of the α ^ and β ^ values are presented in Figure 2.

4.3. Experiment B

The purpose of this experiment was to test how replacing correct points with outliers affects the estimation results. To limit the influence of other factors, 100 specially created families D of data sets were used for the study. Each family D = {D_o} contains data sets D_o, where o is the number of outliers in that particular data set; in the experiment, the sets D_o ∈ D are data sets of a simulated SLR problem, each D_o contains n = 100 data points, and o ∈ {0, 1, …, 55} was used. For each D, the sets D_o ∈ D and D_{o+1} ∈ D differ only in the value y_{n−(o+1)} assigned to the data point d_{n−(o+1)} = (x_{n−(o+1)}, y_{n−(o+1)}), where the value of x_{n−(o+1)} is the same in both data sets. Furthermore, in the set D_o, the data point d_{n−(o+1)} is a correct data point, while in the set D_{o+1}, this data point is already an outlier; the other data points in both data sets are identical. The correct data points are thus replaced by outliers in order of decreasing x, i.e., the correct data point with the largest x in the data set is converted first. The above-described process of replacing data points can be observed in Figure 3.
Creating the D families is completed in three steps:
  • The creation of each family D begins by drawing the x coordinates of n = 100 points in the range [2, 10] using a uniform distribution. The resulting vector is sorted in ascending order so that the returned vector X = [x_1, x_2, …, x_i, …, x_n] satisfies the condition ∀ i < j: x_i < x_j. Then, the vector Y of the y coordinates of the correct data points is created according to the formula
    y_i = α x_i + β + N(μ, σ),
    where α = 1, β = 10, μ = 0, σ = 1.
  • Next, the vector Y_out = [y_1^out, y_2^out, …, y_i^out, …, y_{o_max}^out] of the y coordinates of o_max = max(o) = 55 outliers is drawn using the normal distribution N(μ_out, σ_out), where μ_out = 50 and σ_out = 10.
  • The final step is to create the data set D_o = {d_i} for each analyzed value of o; this is achieved as follows:
    d_i = (x_i, y_i) for i ≤ n − o (correct data point), and d_i = (x_i, y_{i−(n−o_max)}^out) for i > n − o (outlier).
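One family D = {D_o} can be generated as below. This is a sketch under stated assumptions: the seed and the exact indexing into the outlier vector (which keeps each converted point's value fixed across the whole family) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)                   # seed is arbitrary
n, o_max = 100, 55
alpha, beta, mu, sigma = 1.0, 10.0, 0.0, 1.0
mu_out, sigma_out = 50.0, 10.0

x = np.sort(rng.uniform(2, 10, n))               # step 1: sorted x coordinates
y = alpha * x + beta + rng.normal(mu, sigma, n)  # correct y values
y_out = rng.normal(mu_out, sigma_out, o_max)     # step 2: outlier y values

def make_D(o):
    """Step 3: data set D_o, with the o largest-x points turned into
    outliers whose values stay fixed across the whole family."""
    yo = y.copy()
    if o > 0:
        yo[n - o:] = y_out[o_max - o:]
    return x, yo

# consecutive family members differ in exactly one data point
_, y3 = make_D(3)
_, y4 = make_D(4)
print(int((y3 != y4).sum()))   # 1
```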
The previously mentioned Figure 3 shows, in addition to the process of replacing data points with outliers, the effect of this process on the estimation results obtained with the tested robust estimators, as well as with the OLS and OLSc estimators. Figure 4 shows the change, as a function of θ ∈ [0, 0.55], in the median of the α̂ and β̂ estimates obtained for all 100 simulated D families using the tested estimators.

4.4. Discussion of the Results

The presented results show that for the conditions of Exp-A, the RTS estimator outperforms other tested estimators in terms of accuracy and precision of the obtained results. It is also worth noting that the RTS estimator gives comparable results to those obtained by the OLSc estimator, which for a data set containing correct data points with homoscedastic error and outliers (data sets D used in Exp-A fulfill these conditions) is the optimal estimator in the class of linear unbiased estimators—this statement follows from the Gauss–Markov theorem.
The results of Exp-B show that, for the entire practically used range of θ ∈ [0, 0.5), the RTS estimator gives more accurate estimation results than the TS and RM estimators. Analyzing the effect of replacing subsequent correct data points with outliers, it is easy to see that among the tested estimators, the TS estimator is the most sensitive to these changes. The RM estimator is slightly less sensitive than the TS estimator, and the RTS estimator is virtually unaffected by these changes. In the range θ ∈ [0, 0.47], the changes in the values of α̂ are hardly noticeable; see Figure 4a. In the case of the β̂ estimate, this range is even larger; see Figure 4b.
A comparison of the results of Exp-A and especially Exp-B with the values of the breakdown points of the TS and RM estimators—ε*(TS) ≈ 0.293 and ε*(RM) = 0.5 [39], respectively—together with knowledge of how the RTS estimator works, provides a basis for concluding that
  • The breakdown point value for the RTS estimator is higher than the breakdown point value of the TS estimator, formally
    ε*(RTS) > ε*(TS),
and probably that
  • The breakdown point value for the RTS estimator is equal to the breakdown point value of the RM estimator, hence formally
    ε*(RTS) = ε*(RM) = 0.5.
This last statement means that the RTS estimator is actually a high-breakdown estimator.
The conclusions presented above are based on the obtained results and the analysis of the algorithms of the compared estimators. Therefore, these results should be treated as preliminary hypotheses that require further research for their possible confirmation.

5. Conclusions

The preliminary results obtained with the RTS estimator presented in this article are promising. The improvement in the robustness of the RTS estimator compared to the TS estimator is significant. The RTS estimator also shows better robustness and stability than the RM estimator. It is worth noting that these quality improvements were achieved at the expense of only a slight increase in computational complexity (see Appendix B).
The preliminary nature of the presented research also means that more research is needed to better understand the properties of the new estimator, such as consistency and efficiency. It is also important to answer the question of how the number and distribution of data points and outliers, selection of the θ ˜ parameter value, as well as the properties of the observation (measurement) error, affect the estimation results obtained by the RTS estimator. Attempts to answer these questions will be the goal of future research.
No less important than the above-mentioned research on the properties of the RTS estimator is its comparison with robust estimators other than those used in this paper, including, among others, the Least Absolute Deviations, Least Trimmed Squares [40], and Least Median of Squares [40] estimators. Such a comparison is planned for both simulated data and data derived from real-world problems.

Funding

This work was supported by the Polish Ministry of Education and Science under internal grant 02/070/BK_24/0055 for the Silesian University of Technology, Gliwice, Poland.

Data Availability Statement

Data from this study are available on request from the author.

Acknowledgments

The author would like to thank the anonymous reviewers for their valuable comments, which were very helpful in improving the manuscript. The author is also grateful to the reviewers for their suggestions of interesting directions for further research. The author would like to thank Henryk Palus for stimulating discussions and valuable comments.

Conflicts of Interest

The author declares no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
OLS: Ordinary Least Squares estimator
OLSc: Ordinary Least Squares estimator calculated only for the correct data points in the data set
RM: Repeated Median estimator
RMd: Repeated Median estimator, directly calculated estimate β̂
RMh: Repeated Median estimator, hierarchically calculated estimate β̂
RTS: Robustified Theil-Sen estimator
RTSd: Robustified Theil-Sen estimator, directly calculated estimate β̂
RTSh: Robustified Theil-Sen estimator, hierarchically calculated estimate β̂
TS: Theil-Sen estimator
TSd: Theil-Sen estimator, directly calculated estimate β̂
TSh: Theil-Sen estimator, hierarchically calculated estimate β̂

Appendix A. Effect of Data Set Size on τ Coefficient Value

The relation of τ to n and c is given by (12). To plot the τ values over an assumed range of n values, it is convenient to use assumed values of ρ, that is, the required value of ρ for each n. This approach was used to draw the plots of τ for n ∈ {5, 6, 7, …, 10⁴} shown in Figure A1.
Figure A1. Effect of n and ρ values on τ value.
Since the values n, c, and o are natural numbers, when calculating the value of τ for a given value of ρ and a given value of n using (12), the value of c was not calculated as the product ρ · n, because it is not certain that the resulting value would be a natural number; instead, it was calculated as follows:
c = ⌈ρ · n⌉,
where ⌈g⌉ denotes the ceiling function, i.e., a function that returns the smallest integer greater than or equal to a given real number g. The desired value of ρ can therefore be interpreted as the minimum desired ratio of the number c of correct data points in the analyzed data set D to the number n of all data points in this set.
As can be seen from (A1), the desired value of ρ and the actual value of ρ determined for a given pair of values n and c may differ. These differences mean that the curves of τ visible in the graph for different values of ρ are not smooth and that there are fluctuations in the value of this coefficient, which can easily be seen for τ determined for n ≤ 100. It is worth noting that such fluctuations, with a correspondingly smaller range, occur not only for small values of n but also for large values of this parameter.

Appendix B. Comparison of Computational Times of the Compared Robust Estimators

A measurement of the computational effort required by each of the tested robust estimators was carried out for data sets D containing
n ∈ {5, 7, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10,000}
data points. For each value of n, the measurements were repeated 100 times. The test data sets D were generated in the same way as the data sets in Exp-B, but the proportion θ of outliers to all points in D was constant and set to θ = 0.25. The actual number o of outliers for a D of size n was calculated according to the equation
o = ⌊θ · n⌋,
where ⌊x⌋ denotes the floor function, i.e., a function that returns the largest integer less than or equal to a given real number x. In the case of the RTS estimator, θ̃ = 0.5 was used during the measurements. The measurements were performed on a computer equipped with an Intel i5-13500 processor and 32 GB of RAM. All the required programs were written and executed in MATLAB R2023b Update 7 under the MS Windows 10 operating system.
The graphs of the median measurement times as a function of n are shown in Figure A2. These graphs show that all three compared estimators have a similar computational complexity, i.e., O(n²). A more detailed analysis of the results (Figure A3) shows that the TS estimator requires the least computational effort, while the RM estimator is generally the slowest (at least for the tested range of n values). The times obtained for the RTS estimator lie between those obtained for the TS and RM estimators. For small n, the running times of the RTS estimator are closer to those of TS than to those of RM. However, as n increases, the time required by RTS grows, and for n = 10⁴ the medians of the computation times for the RM and RTS estimators are practically identical, with some test data sets yielding longer computation times for the RTS estimator than for the RM estimator.
It should be noted that the computation time measurements were made for the base implementations of the TS and RM estimators. Such implementations were used to maintain similar measurement conditions for all compared estimators. In the literature, deterministic (e.g., [41,42]) and randomized (e.g., [43,44]) implementations with complexity lower than O(n²) have been proposed for the TS and RM estimators. Given the similarities between all tested estimators, it can be expected that the computational complexity of the RTS estimator can likewise be reduced by adapting the solutions already developed for the TS and RM estimators.
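As a rough illustration of the measurement procedure, the sketch below times a base O(n²) Theil-Sen slope computation (the median of all pairwise slopes) and reports the median over repeated runs, analogous to Figure A2. This is a hedged Python sketch, not the MATLAB code used in the paper; `theil_sen_slope` and `median_time` are hypothetical helper names, and the synthetic data generation is an assumption for illustration only:

```python
import time
import random
import statistics

def theil_sen_slope(x, y):
    """Base O(n^2) Theil-Sen slope: median of all pairwise slopes."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x))
              for j in range(i + 1, len(x))
              if x[j] != x[i]]
    return statistics.median(slopes)

def median_time(estimator, n, repeats=10):
    """Median wall-clock time of `estimator` on n points over repeated runs."""
    times = []
    for _ in range(repeats):
        x = list(range(n))                              # regular design points
        y = [2.0 * v + random.gauss(0, 1) for v in x]   # noisy linear response
        t0 = time.perf_counter()
        estimator(x, y)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```

Doubling n should roughly quadruple the reported median time, consistent with the O(n²) behavior visible in Figure A2.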
Figure A2. Median computation time as a function of the number of data points n in the analyzed dataset D; results were obtained for 100 repetitions of each time measurement.
Figure A3. Comparison of the variance of the computation time of the compared robust estimators for different numbers n of data points in the analyzed data set D: (a) n = 50, (b) n = 10², (c) n = 5·10², (d) n = 10³, (e) n = 5·10³, (f) n = 10⁴.

References

  1. Chatterjee, S.; Simonoff, J.S. Handbook of Regression Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2013. [Google Scholar] [CrossRef]
  2. Draper, N.R.; Smith, H. Applied Regression Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2014. [Google Scholar] [CrossRef]
  3. Grubbs, F.E. Procedures for Detecting Outlying Observations in Samples. Technometrics 1969, 11, 1–21. [Google Scholar] [CrossRef]
  4. Rousseeuw, P.J.; Leroy, A.M. Robust Regression and Outlier Detection; John Wiley: Hoboken, NJ, USA, 1987. [Google Scholar] [CrossRef]
  5. Møller, S.F.; Frese, J.V.; Bro, R. Robust methods for multivariate data analysis. J. Chemom. 2005, 19, 549–563. [Google Scholar] [CrossRef]
  6. Rousseeuw, P.J.; Hubert, M. Anomaly detection by robust statistics. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1236. [Google Scholar] [CrossRef]
  7. Wang, H.; Bah, M.J.; Hammad, M. Progress in Outlier Detection Techniques: A Survey. IEEE Access 2019, 7, 107964–108000. [Google Scholar] [CrossRef]
  8. El-Shaarawi, A.H.; Piegorsch, W.W. (Eds.) Encyclopedia of Environmetrics, Volume 1; John Wiley & Sons: Hoboken, NJ, USA, 2001; p. 19. [Google Scholar]
  9. Theil, H. A rank-invariant method of linear and polynomial regression analysis. Parts: I, II, III. Proc. R. Neth. Acad. Sci. 1950, 53, 386–392, 521–525, 1397–1412. [Google Scholar]
  10. Sen, P.K. Estimates of the Regression Coefficient Based on Kendall’s Tau. J. Am. Stat. Assoc. 1968, 63, 1379–1389. [Google Scholar] [CrossRef]
  11. Guerrero, J.; Guijarro, M.; Montalvo, M.; Romeo, J.; Emmi, L.; Ribeiro, A.; Pajares, G. Automatic expert system based on images for accuracy crop row detection in maize fields. Expert Syst. Appl. 2013, 40, 656–664. [Google Scholar] [CrossRef]
  12. Choi, K.H.; Han, S.K.; Park, K.H.; Kim, K.S.; Kim, S. Vision based guidance line extraction for autonomous weed control robot in paddy field. In Proceedings of the 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), Zhuhai, China, 6–9 December 2015; pp. 831–836. [Google Scholar] [CrossRef]
  13. Henninger, M.; Sengupta, S.; Mandelli, S.; ten Brink, S. Performance Evaluation of Array Calibration for Angle-of-Arrival-Based 5G Positioning. In Proceedings of the WSA & SCC 2023 26th International ITG Workshop on Smart Antennas and 13th Conference on Systems, Communications, and Coding, Braunschweig, Germany, 27 February 2023; pp. 1–6. [Google Scholar]
  14. Kasimati, A.; Espejo-García, B.; Darra, N.; Fountas, S. Predicting Grape Sugar Content under Quality Attributes Using Normalized Difference Vegetation Index Data and Automated Machine Learning. Sensors 2022, 22, 3249. [Google Scholar] [CrossRef] [PubMed]
  15. Mann, H.B. Nonparametric Tests Against Trend. Econometrica 1945, 13, 245–259. [Google Scholar] [CrossRef]
  16. Kendall, M.G. Further Contributions to the Theory of Paired Comparisons. Biometrics 1955, 11, 43–62. [Google Scholar] [CrossRef]
  17. Davtalab, M.; Byčenkienė, S.; Bimbaitė, V. Long-term spatial and temporal evaluation of the PM2.5 and PM10 mass concentrations in Lithuania. Atmos. Pollut. Res. 2023, 14, 101951. [Google Scholar] [CrossRef]
  18. Chen, Y.; Rich, D.Q.; Hopke, P.K. Changes in source specific PM2.5 from 2010 to 2019 in New York and New Jersey identified by dispersion normalized PMF. Atmos. Res. 2024, 304, 107353. [Google Scholar] [CrossRef]
  19. Carreno-Madinabeitia, S.; Ibarra-Berastegi, G.; Sáenz, J.; Ulazia, A. Long-term changes in offshore wind power density and wind turbine capacity factor in the Iberian Peninsula (1900–2010). Energy 2021, 226, 120364. [Google Scholar] [CrossRef]
  20. Yeh, C.F.; Wang, J.; Yeh, H.F.; Lee, C.H. Spatial and Temporal Streamflow Trends in Northern Taiwan. Water 2015, 7, 634–651. [Google Scholar] [CrossRef]
  21. Kubiak-Wójcicka, K.; Pilarska, A.; Kamiński, D. The Analysis of Long-Term Trends in the Meteorological and Hydrological Drought Occurrences Using Non-Parametric Methods—Case Study of the Catchment of the Upper Noteć River (Central Poland). Atmosphere 2021, 12, 1098. [Google Scholar] [CrossRef]
  22. Caloiero, T.; Filice, E.; Coscarelli, R.; Pellicone, G. A Homogeneous Dataset for Rainfall Trend Analysis in the Calabria Region (Southern Italy). Water 2020, 12, 2541. [Google Scholar] [CrossRef]
  23. Muthoni, F. Spatial-Temporal Trends of Rainfall, Maximum and Minimum Temperatures Over West Africa. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2960–2973. [Google Scholar] [CrossRef]
  24. Ali Mohammed, J.; Gashaw, T.; Worku Tefera, G.; Dile, Y.T.; Worqlul, A.W.; Addisu, S. Changes in observed rainfall and temperature extremes in the Upper Blue Nile Basin of Ethiopia. Weather. Clim. Extrem. 2022, 37, 100468. [Google Scholar] [CrossRef]
  25. Kourtis, I.M.; Vangelis, H.; Tigkas, D.; Mamara, A.; Nalbantis, I.; Tsakiris, G.; Tsihrintzis, V.A. Drought Assessment in Greece Using SPI and ERA5 Climate Reanalysis Data. Sustainability 2023, 15, 15999. [Google Scholar] [CrossRef]
  26. Henao Casas, J.; Fernández Escalante, E.; Ayuga, F. Increasing groundwater storage and maintaining irrigation through managed aquifer recharge. Groundw. Sustain. Dev. 2022, 19, 100842. [Google Scholar] [CrossRef]
  27. Aubard, V.; Paulo, J.A.; Silva, J.M.N. Long-Term Monitoring of Cork and Holm Oak Stands Productivity in Portugal with Landsat Imagery. Remote Sens. 2019, 11, 525. [Google Scholar] [CrossRef]
  28. Vanem, E.; Walker, S.E. Identifying trends in the ocean wave climate by time series analyses of significant wave height data. Ocean. Eng. 2013, 61, 148–160. [Google Scholar] [CrossRef]
  29. Aydoğan, B.; Ayat, B. Spatial variability of long-term trends of significant wave heights in the Black Sea. Appl. Ocean. Res. 2018, 79, 20–35. [Google Scholar] [CrossRef]
  30. Wang, J.; Liu, J.; Wang, Y.; Liao, Z.; Sun, P. Spatiotemporal variations and extreme value analysis of significant wave height in the South China Sea based on 71-year long ERA5 wave reanalysis. Appl. Ocean. Res. 2021, 113, 102750. [Google Scholar] [CrossRef]
  31. Caloiero, T.; Aristodemo, F.; Ferraro, D.A. Annual and seasonal trend detection of significant wave height, energy period and wave power in the Mediterranean Sea. Ocean. Eng. 2022, 243, 110322. [Google Scholar] [CrossRef]
  32. Wilcox, R.R. Fundamentals of Modern Statistical Methods. Substantially Improving Power and Accuracy; Springer: New York, NY, USA, 2010. [Google Scholar]
  33. Borroni, C.G.; Cifarelli, D.M. Some maximum-indifference estimators for the slope of a univariate linear model. J. Nonparametric Stat. 2016, 28, 395–412. [Google Scholar] [CrossRef]
  34. Hampel, F.R. A General Qualitative Definition of Robustness. Ann. Math. Statist. 1971, 42, 1887–1896. [Google Scholar] [CrossRef]
  35. Donoho, D.L.; Gasko, M. Breakdown Properties of Location Estimates Based on Halfspace Depth and Projected Outlyingness. Ann. Statist. 1992, 20, 1803–1827. [Google Scholar] [CrossRef]
  36. Hubert, M.; Rousseeuw, P.J.; Van Aelst, S. Robustness. In Encyclopedia of Actuarial Science; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar] [CrossRef]
  37. Siegel, A.F. Robust regression using repeated medians. Biometrika 1982, 69, 242–244. [Google Scholar] [CrossRef]
  38. Stein, A.; Werman, M. Finding the Repeated Median Regression Line. In Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’92), Philadelphia, PA, USA, September 1992; pp. 409–413. [Google Scholar]
  39. Rousseeuw, P.J.; Netanyahu, N.S.; Mount, D.M. New Statistical and Computational Results on the Repeated Median Regression Estimator. In New Directions in Statistical Data Analysis and Robustness; Morgenthaler, S., Ronchetti, E., Stahel, W.A., Eds.; Birkhäuser: Basel, Switzerland, 1993; pp. 177–194. [Google Scholar]
  40. Rousseeuw, P.J. Least Median of Squares Regression. J. Am. Stat. Assoc. 1984, 79, 871–880. [Google Scholar] [CrossRef]
  41. Katz, M.J.; Sharir, M. Optimal slope selection via expanders. Inf. Process. Lett. 1993, 47, 115–122. [Google Scholar] [CrossRef]
  42. Brönnimann, H.; Chazelle, B. Optimal slope selection via cuttings. Comput. Geom. 1998, 10, 23–29. [Google Scholar] [CrossRef]
  43. Matoušek, J. Randomized optimal algorithm for slope selection. Inf. Process. Lett. 1991, 39, 183–187. [Google Scholar] [CrossRef]
  44. Matoušek, J.; Mount, D.M.; Netanyahu, N.S. Efficient Randomized Algorithms for the Repeated Median Line Estimator. Algorithmica 1998, 20, 136–150. [Google Scholar] [CrossRef]
Figure 1. The distribution of points for the sample data set D of the experiment Exp-A and the estimation results obtained for this data set.
Figure 2. Comparison of the dispersion of the obtained values of the estimates in the experiment Exp-A: (a) results for α̂, (b) results for β̂.
Figure 3. The effect of changing the number o of outliers in the analyzed data set D on the results of the estimations (experiment Exp-B): (a) o = 15 (θ = 0.15), (b) o = 20 (θ = 0.2), (c) o = 25 (θ = 0.25), (d) o = 30 (θ = 0.3), (e) o = 35 (θ = 0.35), (f) o = 40 (θ = 0.4), (g) o = 45 (θ = 0.45), (h) o = 50 (θ = 0.5).
Figure 4. Change in the median of the estimates α̂ and β̂ as a function of θ ∈ [0, 0.55] (θ = o/n) for 100 simulated data sets D: (a) results for α̂, (b) results for β̂ (experiment Exp-B).

Share and Cite

Bal, A. Improving the Robustness of the Theil-Sen Estimator Using a Simple Heuristic-Based Modification. Symmetry 2024, 16, 698. https://doi.org/10.3390/sym16060698
