1. Introduction
The presence of outliers in data may have an appreciable impact on the analysis, often leading to erroneous conclusions and, in turn, to serious decision-making mistakes. It is therefore necessary to detect outliers before carrying out statistical analysis. Moreover, outlier detection has a wide range of applications, including the prevention of financial fraud, disease diagnosis, and the assessment of the reliability of military information.
Refs. [1,2] define outliers as those observations which are surprisingly far away from the main group. In the one-dimensional case, if the observations are arranged in ascending order of magnitude, there are only three types of outlier detection problems: (i) only upper outliers; (ii) only lower outliers; and (iii) both upper and lower outliers.
The commonly used approaches to dealing with outliers are outlier detection and robust statistical methods. Robust methods aim to analyze the data while retaining the outliers, minimizing the deviation of the analytical results from the theoretical ones. Outlier detection, in contrast, identifies outliers in the sample by a reasonable statistical procedure and then analyzes the remaining observations. In this paper, we focus on the latter approach.
In the field of statistics, there are many results on the detection of outliers, and many effective methods have been proposed. These methods include descriptive statistics, machine learning, and hypothesis testing.
Descriptive statistics are intuitive and involve almost no computational burden. Commonly used methods include the box-plot, the Hampel rule, etc. The box-plot requires the first quartile $Q_1$ and the third quartile $Q_3$ of the sample. Denoting the interquartile range by $\mathrm{IQR}=Q_3-Q_1$, the observations located in the interval $[Q_1-1.5\,\mathrm{IQR},\ Q_3+1.5\,\mathrm{IQR}]$ are regarded as clean observations, and the others are flagged as outliers. According to [3], a data point is identified as an outlier if its distance from the sample median exceeds 4.5 times the MAD, where $\mathrm{MAD}=\operatorname{median}_i\,|x_i-\operatorname{median}_j x_j|$.
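For illustration, the following is a minimal Python sketch of the two rules just described; the 1.5-IQR whisker is the conventional box-plot choice, the 4.5-MAD threshold follows [3], and the function names and toy data are illustrative only.

```python
import numpy as np

def boxplot_outliers(x, whisker=1.5):
    """Flag points outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (x < lower) | (x > upper)

def hampel_outliers(x, threshold=4.5):
    """Flag points farther than threshold * MAD from the sample median."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) > threshold * mad

x = np.array([1.2, 1.5, 1.1, 1.3, 1.4, 9.8])   # toy sample with one large value
print(boxplot_outliers(x))
print(hampel_outliers(x))
```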
Machine learning methods train on the sample to detect outliers according to the characteristics of the data, combining mathematical models with statistical principles. Common methods include the one-class support vector machine (one-class SVM) and the minimum spanning tree (MST). A one-class SVM typically trains a minimal ellipsoid that contains all normal observations from historical or otherwise clean data; observations falling outside the ellipsoid are then treated as outliers; see [4]. The MST algorithm takes the Euclidean distance between points, regards the points as nodes, and finds a path connecting every node with the smallest total distance. Based on a given criterion, the sample is then divided into different classes: the largest class is treated as the inlying data, while the rest are treated as outliers; see [5].
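For illustration, a minimal sketch of applying a one-class SVM with scikit-learn is given below; note that it uses the common RBF-kernel form rather than the ellipsoid-based variant described above, and all data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.gamma(shape=3.0, scale=1.0, size=(200, 1))   # "clean" historical data
X_test = np.array([[2.5], [3.1], [25.0]])                  # last point is suspicious

# nu bounds the fraction of training points treated as outliers / support vectors
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
print(clf.predict(X_test))   # +1 = inlier, -1 = outlier
```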
Hypothesis testing is a basic method for outlier detection. By setting appropriate null and alternative hypotheses and constructing test statistics with suitable properties, hypothesis testing can determine whether there are outliers in the sample at a given significance level.
In the univariate setting, and unlike the more restrictive exponential distribution, observations following a gamma distribution arise more widely and are easier to collect. This paper therefore studies multiple outlier detection under the gamma distribution, using a parameter-slippage model. Since the 1950s, there have been many results on outlier detection based on hypothesis testing, but most of them aim at detecting a single outlier or outliers in a normal distribution. In the 1970s, outlier detection under more general distributions, such as the exponential, Pareto, and uniform distributions, received much attention. Multiple outlier detection has recently drawn considerable attention in practice owing to the development of science and technology and the diversification of data collection methods. We briefly introduce five commonly used statistics that are suitable for detecting multiple upper outliers in a gamma sample.
Dixon’s statistic, proposed in [6], is based on the idea that the dispersion of the suspect observations accounts for a large proportion of the dispersion of the whole sample. This method is further extended in [7,8,9]; in particular, [8] proposes a statistic for jointly testing the k suspect observations, which are identified as outliers at a given significance level whenever the statistic exceeds its critical value. Later, another Dixon-type statistic for detecting outliers in a gamma distribution is proposed in [10,11]; Ref. [10] gives its critical values for a given significance level, and the suspect observations are regarded as outliers if the statistic exceeds the corresponding critical value. A third test statistic is also introduced in [10,11], with critical values again obtained in [10], and the same exceedance rule declares the suspect observations as outliers. The fourth test statistic is a “gap test” ([12]); Ref. [12] provides its critical values for a given significance level, and the suspect observations are identified as outliers if the gap statistic exceeds its critical value. The fifth test statistic is proposed in [13], where its distribution is derived so that critical values can be obtained for a given significance level; once again, the suspect observations are regarded as outliers if the statistic exceeds the critical value.
The remainder of this article is organized as follows. In Section 2, we propose a test statistic for detecting outliers in a gamma sample and derive its density function. To obtain the critical values, a Monte Carlo procedure and a kernel density estimation procedure are proposed. In Section 3, simulation results demonstrate that the proposed test statistic outperforms its competitors. In Section 4, an improved method is suggested that can eliminate the swamping effect in multiple outlier detection. A real data analysis is performed in Section 5, and Section 6 concludes. All proofs of the theoretical results are presented in Appendix A, and the data used in the empirical applications are given in Appendix B.
4. Modified Test-ITK
In practice, almost all test statistics used to detect multiple outliers suffer from the swamping effect. This phenomenon occurs because large outliers may inflate the sum of several observations in a block test. To reduce or eliminate the swamping effect, we suggest a modified test, ITK, which retains a high probability of detecting outliers together with a low error probability when there is no outlier in the gamma sample.
Note that in multiple outlier detection, some inlying observations may be falsely judged as outliers when k is chosen improperly. For example, suppose a sample contains two extreme upper outliers and the block test is applied to its three largest observations. Because the two extreme values are far larger than the inlying observations, the sum of the three tested observations may exceed the critical value obtained by Algorithm 1 in Section 2.3.1, so that all three are declared outliers, even though the third-largest observation is a genuine member of the inlying cluster; this is the swamping effect. This negative impact is eliminated if we instead take k = 2.
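The statistic T and the exact critical-value procedure (Algorithm 1 of Section 2.3.1) are defined earlier in the paper and are not reproduced in this section, so the following Python sketch uses a hypothetical ratio-type block statistic and a plain Monte Carlo critical value purely to illustrate how a block test can swamp an inlying point; block_stat, mc_critical_value, and all numerical values are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def block_stat(x, k):
    """Hypothetical block statistic: share of the k largest observations
    in the total sum (a stand-in for the paper's T from Section 2)."""
    x = np.sort(np.asarray(x, dtype=float))
    return x[-k:].sum() / x.sum()

def mc_critical_value(n, k, shape, alpha=0.05, reps=20000, seed=1):
    """Monte Carlo (1 - alpha) critical value under a gamma null with known
    shape (the scale cancels in the ratio statistic)."""
    rng = np.random.default_rng(seed)
    sims = rng.gamma(shape, 1.0, size=(reps, n))
    stats = np.sort(sims, axis=1)[:, -k:].sum(axis=1) / sims.sum(axis=1)
    return np.quantile(stats, 1 - alpha)

# Swamping illustration: two planted outliers can drag a genuine inlier with them.
rng = np.random.default_rng(2)
x = np.append(rng.gamma(3.0, 1.0, size=18), [30.0, 35.0])   # 18 inliers + 2 outliers
for k in (3, 2):
    t, c = block_stat(x, k), mc_critical_value(len(x), k, shape=3.0)
    print(f"k={k}: T={t:.3f}, critical value={c:.3f}, reject={t > c}")
```

In this toy sample, the block test rejects for both values of k; the point is that with k = 3 all three tested observations, including the genuine inlier, would be declared outliers, whereas k = 2 flags only the two planted outliers.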
To deal with the swamping effect, a method for choosing a reasonable k is needed. Our modified test therefore consists of two stages: (1) pick a reasonable k and use the test to examine the k largest observations; (2) apply a stepwise forward test to the remaining observations, or a stepwise backward test to the “outlier” group obtained in the first stage.
4.1. Estimation of k
According to [14], the number of outliers should be less than a certain proportion of the sample size. Later, [1] points out that, if the sample is collected properly, the number of outliers is usually even smaller. Here, we take k to be the integer part of the resulting bound, that is, the greatest integer less than or equal to it.
4.2. The Improvement of the Test-ITK
Based on Section 4.1, we propose an improved test procedure, ITK, as follows:
Step 1. At the given significance level, the k largest observations are preliminarily judged as outliers, forming a preliminary outlier group, if the block statistic exceeds its critical value; otherwise, go to Step 5. The remaining observations constitute the preliminary inlying group.
Step 2 (step forward test). Use the step forward test to check whether the preliminary inlying group still contains outliers: its largest observation is declared an outlier if the corresponding single-outlier statistic exceeds its critical value; otherwise, go to Step 4.
Step 3. Repeat Step 2 until no further outlier can be detected in the inlying group. The smallest outlier found in this way and all larger observations are then the outliers in the data; stop the procedure.
Step 4 (step backward test). After the step forward test has stopped, apply the step backward test to the preliminary outlier group from Step 1, beginning with its smallest member. At the given significance level, if that member is judged an outlier, the step backward test ends; otherwise, return it to the inlying group and test the next member. Repeat this step until an outlier is detected. If even the largest observation is not judged an outlier, there is no outlier in the sample and the data are inlying.
Step 5. Let k = k − 1 and return to Step 1. If k = 0, there is no outlier in the sample and the test procedure ends.
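A schematic Python sketch of this two-stage idea follows, under the simplifying assumptions that the shape parameter is known, that a ratio-type block statistic stands in for the paper's T (Section 2), and that Monte Carlo simulation (in the spirit of Algorithm 1, Section 2.3.1) supplies the critical values; the names ratio_stat, mc_crit, and itk, the initial k, and the toy data are illustrative only.

```python
import numpy as np

def ratio_stat(x, k):
    """Placeholder block statistic (share of the k largest in the total sum);
    the paper's actual T from Section 2 should be substituted here."""
    x = np.sort(np.asarray(x, dtype=float))
    return x[-k:].sum() / x.sum()

def mc_crit(n, k, shape, alpha, reps=20000, seed=0):
    """Monte Carlo (1 - alpha) critical value of ratio_stat under a gamma null."""
    rng = np.random.default_rng(seed)
    sims = rng.gamma(shape, 1.0, size=(reps, n))
    stats = np.sort(sims, axis=1)[:, -k:].sum(axis=1) / sims.sum(axis=1)
    return np.quantile(stats, 1 - alpha)

def itk(x, shape, k, alpha=0.05):
    """Sketch of the two-stage ITK idea: block test for the k largest points,
    then stepwise forward/backward checks to undo swamping."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    while k >= 1:                      # Step 1 / Step 5: shrink k until the block test rejects
        if ratio_stat(x, k) > mc_crit(n, k, shape, alpha):
            break
        k -= 1
    if k == 0:
        return []                      # no outliers detected
    # Step 2/3 (forward): flag further points from the inlying group one at a time
    inliers = x[:n - k]
    while len(inliers) > 0 and ratio_stat(inliers, 1) > mc_crit(len(inliers), 1, shape, alpha):
        inliers = inliers[:-1]
        k += 1
    # Step 4 (backward): return swamped points, smallest preliminary outlier first
    while k > 0 and ratio_stat(x[:n - k + 1], 1) <= mc_crit(n - k + 1, 1, shape, alpha):
        k -= 1
    return list(x[n - k:])

rng = np.random.default_rng(3)
sample = np.append(rng.gamma(3.0, 5.0, size=17), [120.0, 150.0, 180.0])
print(itk(sample, shape=3.0, k=5))     # initial k, e.g., from the rule of Section 4.1
```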
5. Empirical Applications
In this section, we apply the ITK test method to two data sets, alcohol-related mortality rates and artificial scout position data, and compare it with the six other test statistics introduced above.
5.1. Alcohol-Related Mortality Rates in Selected Countries in 2000
The dataset (see Appendix B) is taken from the Office for National Statistics (ONS). The Kolmogorov-Smirnov test indicates that the data follow a gamma distribution. The sample size determines k according to Section 4.1, and the shape parameter is estimated by the Newton-Raphson algorithm. From Appendix B, the block statistic is computed for the k largest observations, and the corresponding critical value is obtained by Algorithm 1 in Section 2.3.1. Since the statistic exceeds the critical value, the k largest observations are detected as outliers preliminarily.
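For illustration, a minimal sketch of the two computations mentioned here is given below, assuming the scale parameter is profiled out of the likelihood: the gamma shape is estimated by Newton-Raphson applied to the profile score equation, and a Kolmogorov-Smirnov test is then run against the fitted gamma (with estimated parameters, the nominal KS p-value is only approximate). The data below are simulated stand-ins; the actual mortality data are in Appendix B.

```python
import numpy as np
from scipy import special, stats

def gamma_shape_mle(x, tol=1e-10, max_iter=100):
    """Newton-Raphson for the gamma shape MLE: solve
    log(a) - digamma(a) = log(mean(x)) - mean(log(x))."""
    x = np.asarray(x, dtype=float)
    s = np.log(x.mean()) - np.log(x).mean()
    a = (3 - s + np.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)   # standard starting value
    for _ in range(max_iter):
        f = np.log(a) - special.digamma(a) - s
        fprime = 1.0 / a - special.polygamma(1, a)
        step = f / fprime
        a -= step
        if abs(step) < tol:
            break
    return a, x.mean() / a            # shape and scale estimates

rng = np.random.default_rng(4)
data = rng.gamma(2.0, 3.0, size=60)   # stand-in for the mortality data
shape, scale = gamma_shape_mle(data)
print(shape, scale)
print(stats.kstest(data, "gamma", args=(shape, 0, scale)))   # KS test against fitted gamma
```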
Then, we use the step forward test for the remaining sample. Since the single-outlier statistic for its largest observation does not exceed the critical value, that observation is a normal observation.
We now use the step backward test for the preliminary outlier group. The statistic for its smallest member exceeds the critical value at the 5% significance level, so all of the preliminarily flagged observations are detected as upper outliers.
For comparison, we also apply the six competing test statistics to detect outliers; the results are shown in Table 3.
As Table 3 shows, ITK and three of the competing statistics identify the outliers correctly without misjudgment, owing to a reasonable choice of k, whereas the remaining three statistics perform poorly in multiple upper outlier detection.
Furthermore, the results in Table 3 show that Ireland, France, Austria, Slovenia, Portugal, Denmark, the United Kingdom of Great Britain and Northern Ireland, the Republic of Korea, the Russian Federation, and Australia have higher alcohol-related mortality rates, which means that these countries need to pay more attention to alcohol-related mortality.
5.2. Artificial Scout Position Data
In military applications, the gamma model is often used to describe the positions of objects. Consider a military scenario in which 20 scouts reconnoiter a certain area during a mission, and their location components are indexed 1, ⋯, 20; the larger a scout's location component, the farther that scout is from the landing site. If a location component deviates from the main group, the corresponding soldier is separated from the troops and may not be able to obtain support in time in an emergency, so such movements require attention.
In our setting, the basic model is gamma (3,5) and the alternative model is gamma (3,10). The initial data are outlined in
Appendix B.
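For illustration, the following is a minimal sketch of how such a dataset could be generated, assuming the (shape, scale) parameterization of gamma(3, 5) and gamma(3, 10) and, purely for the example, three observations from the slipped model (matching the three outliers ultimately flagged below); the actual data are those listed in Appendix B.

```python
import numpy as np

rng = np.random.default_rng(2000)
n, n_out = 20, 3                                 # 20 scouts; 3 drawn from the slipped model
# Parameterization assumed as (shape, scale); the real data are in Appendix B.
base = rng.gamma(3.0, 5.0, size=n - n_out)       # basic model gamma(3, 5)
slipped = rng.gamma(3.0, 10.0, size=n_out)       # alternative (slipped) model gamma(3, 10)
x = np.sort(np.concatenate([base, slipped]))     # location components, smallest to largest
print(x)
```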
Here, the shape parameter is known. The sample size is 20, and k is determined as in Section 4.1. From Appendix B, the block statistic is computed for the k largest observations. At the 0.05 significance level, we use the Monte Carlo method to calculate the corresponding critical value. Since the statistic exceeds this critical value, the four largest observations are placed into the initial outlier group.
Furthermore, we continue to test the remaining sample and carry out the step forward test: since the critical value exceeds the statistic for the largest remaining observation, that observation is not an outlier.
Next, we apply the step backward test to the four observations in the initial outlier group. At the 0.05 significance level, the statistic for the smallest of them does not exceed the critical value, so it is not an outlier; for the next one the statistic exceeds the critical value, so the test procedure ends. Therefore, the three largest observations are the outliers in the sample.
Meanwhile, we apply the six competing test statistics to detect outliers; the results are shown in Table 4.
It can be observed from Table 4 that the ITK method performs better than the six competing test statistics, because it not only detects all outliers in the sample but also has the lowest misjudgment probability.
Further, from the result of the ITK method, we see that the scouts with IDs 18, 19, and 20 are far away from the landing site, which means that they would be endangered in case of an emergency.
6. Concluding Remarks
The simulations show that, as k and n increase, our test statistic has higher power and a relatively lower “false alarm” rate for outlier detection than the other test statistics, especially at lower significance levels. However, the swamping effect still exists for the proposed statistic, and this phenomenon causes a loss of information. Therefore, to reduce the impact of the swamping effect, we design the ITK test. In the two real data analyses, the ITK test attains the same high power as the proposed test statistic and lower error probabilities than the six competing test statistics. In conclusion, compared with the other test statistics, ITK has the highest detection capability for outliers and the lowest “false alarm” rate. Thus, the ITK method is recommended for identifying multiple outliers in a sample.
In this paper, we design two algorithms, based on Monte Carlo simulation and on kernel density estimation, to obtain the critical values of the proposed statistic. How to derive its exact critical values is an interesting open problem. Further, when k is unknown, we take a conservative estimate of it, so choosing a more appropriate value of k in the ITK method is worth studying. This article discusses only the case of multiple upper outliers in a gamma sample; since lower outliers, or both upper and lower outliers, may occur in practice, it is necessary to extend our detection methods to these situations. In addition, the masking effect of our methods is not discussed in this paper and remains a topic for future research, as does the extension of our approaches to other distributions.