2. Brief Survey of the Previous Results
The goal of G. Podolski [
1] was “to make the critical distinction between the two methods of calculating” the values of standard deviation (SD):

$$SD_{n-1}=\sqrt{\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}{n-1}},\qquad(1)$$

$$SD_{mR}=\frac{\overline{mR}}{d_2},\qquad \overline{mR}=\frac{1}{n-1}\sum_{i=2}^{n}\left|x_i-x_{i-1}\right|,\qquad(2)$$

where n is the total number of observations, mR is the moving range, x_i is the value of the i-th observation, $\bar{x}$ is the mean of all n results, and d_2 = 1.128. For simplicity of notation, we will denote $\overline{mR}/d_2$ as SDmR and $\overline{mR}$ as AMR (average moving range) below.
SDmR “is used in conjunction with control charting and process capability. This method of calculation eliminates most of the variation in a data set that is contributed by special causes or an unstable process (identified on a control chart by runs, shifts, trends, patterns, and outliers)” [
1].
SDn−1 “does not eliminate any of the effects of special cause variation in the data. It is a measure of the total variation present in a data set… If a process is in statistical control,
SDn−1 and
SDmR will be approximately equal. If a process is not in statistical control, then
SDn−1 will be larger than
SDmR (except when special cause variation is occurring in a sawtooth pattern)” (ibid., except for our changes in the notation, Shper et al.). At the end of his paper, Podolski proposed a tool to determine whether a process is stable: an F-test of the ratio $(SD_{n-1})^2/(SD_{mR})^2$. It is worth noting that the example used by Podolski in his work is based on obviously nonhomogeneous data. This is clear from Figure 3 in [
1], and this point is essential for the discussion that follows.
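To make the two estimates concrete, the following short sketch (our own illustration, not code from [1]) computes SD_{n-1} by Equation (1) and SD_mR by Equation (2) for a stable series and for a series with a step shift in the mean. The data, the seed, and the shift size are arbitrary choices made for illustration only.

```python
import numpy as np

D2 = 1.128  # bias-correction constant for moving ranges of two consecutive points

def sd_n_minus_1(x):
    """Classical sample standard deviation, Equation (1)."""
    return np.std(x, ddof=1)

def sd_mr(x):
    """Moving-range estimate SD_mR = AMR / d2, Equation (2)."""
    amr = np.mean(np.abs(np.diff(x)))  # average moving range
    return amr / D2

rng = np.random.default_rng(1)
stable = rng.normal(10.0, 1.0, size=50)
shifted = np.concatenate([rng.normal(10.0, 1.0, 25),
                          rng.normal(13.0, 1.0, 25)])  # step shift in the mean

for name, x in (("stable", stable), ("shifted", shifted)):
    s1, s2 = sd_n_minus_1(x), sd_mr(x)
    print(f"{name:8s} SDn-1 = {s1:.2f}  SDmR = {s2:.2f}  ratio^2 = {(s1 / s2) ** 2:.2f}")
# For the stable series the two estimates are close; for the shifted series
# SDn-1 is inflated by the shift while SDmR is not, so the ratio grows.
```

This behavior is exactly the distinction quoted above: the moving-range estimate largely ignores between-segment variation, while SD_{n-1} absorbs it.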
Cruthis and Rigdon [
2] pointed out that the ratio $(SD_{n-1})^2/(SD_{mR})^2$ did not follow an
F-distribution and simulated the distribution function (
DF) of this ratio for some variants of control charts. They presented tables of the 90th, 95th, 99th, and 99.9th percentiles and noted: “A value greater than one of these upper percentiles indicates that the process was not in control over the time period that the data were collected” [
2]. In their view, the cause of large values of the ratio may be a shift in the mean or in the process variability, or positive autocorrelation. The authors did not mention which DF they used for the simulation; by default, one may assume the normal distribution. We checked this by repeating Cruthis and Rigdon’s simulation for the x-chart and for
n = 10. Our results coincided with Table 1 in [
2] with an accuracy of 1%. Thus, the limits for the ratio $(SD_{n-1})^2/(SD_{mR})^2$ are based on independent variables taken from the normal distribution.
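For reference, here is a sketch of the kind of check mentioned above: a Monte Carlo estimate of the 90th, 95th, 99th, and 99.9th percentiles of $(SD_{n-1})^2/(SD_{mR})^2$ for n = 10 independent standard normal observations (the x-chart case). The number of replications and the seed are our choices, so the resulting values reproduce Table 1 of [2] only approximately.

```python
import numpy as np

D2 = 1.128          # moving-range constant
N, REPS = 10, 100_000

rng = np.random.default_rng(2024)
ratios = np.empty(REPS)
for k in range(REPS):
    x = rng.standard_normal(N)                 # independent N(0, 1) observations
    sd1 = np.std(x, ddof=1)                    # SD_{n-1}
    sdmr = np.mean(np.abs(np.diff(x))) / D2    # SD_mR
    ratios[k] = (sd1 / sdmr) ** 2

print(np.percentile(ratios, [90, 95, 99, 99.9]))
```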
Ramirez and Runger [
3] proposed three measures to evaluate process stability quantitatively: the ratio $(SD_{n-1})^2/(SD_{mR})^2$ (which they called the stability ratio,
SR), the ANOVA approach that compares “within” to “between” subgroup variation, and the instability ratio (INSR), a metric calculated by counting the number of subgroups having one or more violations of the Western Electric rules. They did not mention the works of Podolski or Cruthis and Rigdon, but wrote, with a reference to Wheeler [
12] that comparison between
SDn−1 and
SDmR has been used extensively in quality control applications. As Gauri [
4] pointed out later—and we agree with him—the problem with
SR is that the exact
DF is unknown and using a standard
F-test can be justified only for large sample sizes. The ANOVA method is difficult for practitioners, and the
INSR test for small numbers of subgroups (e.g., 50–100) leads to a high type I error rate [
4]. Furthermore, as noted by Jensen et al. [
8], the INSR depends on the choice of additional rules used, and the ANOVA approach for individual data is based on an artificial grouping of the data.
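As an illustration of the INSR idea (not the exact procedure of [3]), the sketch below flags a subgroup as unstable when its mean violates one of two detection rules on an X-bar chart, namely a point beyond the 3-sigma limits or a run of eight consecutive points on one side of the center line, and reports the fraction of flagged subgroups. The subgroup size of 5, the reduced rule set, and the test data are our simplifications.

```python
import numpy as np

def insr(data, d2=2.326):
    """Instability-ratio sketch: data has shape (k, 5), i.e., k subgroups of size 5
    (d2 = 2.326 is the range-to-sigma constant for subgroups of size 5)."""
    xbar = data.mean(axis=1)                              # subgroup means
    rbar = np.mean(data.max(axis=1) - data.min(axis=1))   # mean subgroup range
    center = xbar.mean()
    sigma_xbar = (rbar / d2) / np.sqrt(data.shape[1])
    ucl, lcl = center + 3 * sigma_xbar, center - 3 * sigma_xbar

    violated = (xbar > ucl) | (xbar < lcl)                # rule 1: beyond the limits
    side = np.sign(xbar - center)
    for i in range(len(xbar) - 7):                        # rule 2: run of 8 on one side
        if abs(side[i:i + 8].sum()) == 8:
            violated[i:i + 8] = True
    return violated.mean()                                # share of flagged subgroups

rng = np.random.default_rng(7)
print(insr(rng.normal(0.0, 1.0, size=(50, 5))))           # in-control data: INSR near 0
```

Even for in-control data the reported share is not exactly zero, which is the source of the type I error problem for small numbers of subgroups noted above.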
Gauri [
4] proposed an interesting measure called the process stability indicator (PSI). This index combines the analysis of run rules with a regression analysis of the data segmented into several pieces. This technique has “a considerable amount of complexity in both … the calculation… and the interpretation…” [
8] and will hardly ever be used by practitioners.
Wooluru et al. [
5] took one set of real data (with a sample size of 32) and compared the conclusions about the stability of this specific process across the following approaches: regression analysis; the SR method; the INSR method; the run test; the ANOVA method; and Levene’s test. The idea of their work was to find the approach that would be optimal when “stability cannot be monitored using control charts due to lack of data and time for establishing control limits” [
5]. The process turned out to be stable by all criteria, and they concluded that running a chart using MINITAB gives the best result for assessing process stability. In fact, this assertion was not supported by any objective evidence. In our view, such a broad generalization cannot be based on a single small set of obviously stable data.
Britt et al. [
6] performed an extensive simulation study and compared the performance of visual control chart analysis with the SR and ANOVA methods for sample sizes of
n = 30, 40, …, 100 taken from a normal distribution. They came to the conclusion that the SR and ANOVA approaches gave better results than the traditional use of the Shewhart control chart. We beg to differ. First, the authors of [
6] studied whether the past data in a chosen sample fell within the control chart limits. However, this is not the goal of Shewhart charts, which are focused not on the product already made: “the action limits… provide a
means of directing action toward the process with a view to the elimination of assignable causes of variation so that the quality of the product
not yet made may be less variable on the average” [
13] (italics by Shewhart). In other words, the work of Britt et al. [
6] compared the ability of the three abovementioned techniques to analyze the stability of past data within phase I of control chart application. This procedure is equivalent to checking whether the past data are homogeneous (consistent) or not (see [
14]). However, it is not equivalent to answering whether a newly measured point reflects a common or an assignable cause of variation. Second, the authors of [
6] dealt with random, independent, normally distributed data, which is almost never the case in real practice. We will discuss this facet of the problem below.
Sall [
7] proposed a technique for “monitoring thousands of process measures that are to be analyzed retrospectively”. He considered the SR an adequate measure of process stability and, without any basis (see [
7], right column), chose the value SR = 1.5 as the boundary between stable and unstable processes. He used a two-dimensional process performance graph to visualize process health (Figure 19 in [
7]), with the stability ratio on the horizontal axis and the capability index Ppk on the vertical axis. According to Sall’s data, 84.3% of the processes shown in this figure were stable and capable, 5.2% capable but unstable, 7.5% stable but incapable, and 3.0% unstable and incapable. We will come back to these suggestions below.
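A sketch of the kind of two-dimensional classification Sall describes is given below. The SR cut-off of 1.5 is taken from [7]; the Ppk cut-off of 1.33 is our illustrative assumption (a commonly used capability threshold, mentioned later in the text), not a value we attribute to [7], and the example processes are made up.

```python
def classify(sr: float, ppk: float, sr_cut: float = 1.5, ppk_cut: float = 1.33) -> str:
    """Place a process into one of the four quadrants of the SR-Ppk plane."""
    stability = "stable" if sr <= sr_cut else "unstable"
    capability = "capable" if ppk >= ppk_cut else "incapable"
    return f"{stability} and {capability}"

# Illustrative (made-up) processes, one per quadrant.
for sr, ppk in [(1.1, 1.60), (2.3, 1.50), (1.2, 0.90), (2.0, 0.80)]:
    print(f"SR = {sr:.1f}, Ppk = {ppk:.2f} -> {classify(sr, ppk)}")
```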
Jensen et al. performed a detailed comparative analysis of the previous works on the basis of five criteria proposed in [
8]: “ease of use, interpretability, applicability, connection to capability, and statistical performance”. As a result, they suggested the
SI, which is the square root of the SR:

$$SI=\sqrt{SR}=\frac{SD_{n-1}}{SD_{mR}}.\qquad(3)$$

Then, they proposed a boundary for the
SI equal to 1.25 and carefully discussed all the pros and cons of their approach. We agree with their comparative estimates of all the abovementioned methods. However, we cannot agree with their main recommendation to use the
SI as a best practice for assessing process stability [
8]. Furthermore, we think that it is much better and more useful for practitioners not to combine stability and capability measures into one common picture and not to set any cut-off values for the indices, such as 1.25 and 1.33. Our arguments for this reasoning will be presented in the following section. The second paper by the same group of authors [
9] is merely a more popular version of the same work, published in Quality Progress.
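Continuing the sketch given after the discussion of [1], the SI of Equation (3) is simply the ratio of the two estimates. The decision rule below, using the 1.25 boundary of [8], is a minimal illustration under our simulated data, not the complete procedure of Jensen et al.

```python
import numpy as np

def stability_index(x, d2=1.128):
    """SI = SD_{n-1} / SD_mR, Equation (3)."""
    return np.std(x, ddof=1) / (np.mean(np.abs(np.diff(x))) / d2)

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(10.0, 1.0, 25),
                    rng.normal(12.0, 1.0, 25)])   # series with a step shift
si = stability_index(x)
print(f"SI = {si:.2f}, flagged as unstable: {si > 1.25}")  # boundary from [8]
```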
White et al. [
10] tried to expand the narrow limits of assessing process quality by recommending “a set of indices for evaluating process health that captures a holistic view of process performance”. The integral set of indices recommended in that article includes several well-known capability and performance indices (
Cp,
Cpk,
Ppk), the
SI (3), the Target Index (
TI, defined with respect to the target value T), the Intraclass Correlation Coefficient (
ICC), the Precision to Tolerance Ratio (
P/T), the Measurement System Index (
%MS) plus two two-dimensional graphs:
Ppk vs. SI and
Cp vs. %MS. In our view, all of the suggestions by White et al. [
10] are limited by their primary assumptions of uncorrelated measurements and normality. Moreover, the proposed algorithm of process analysis is based on the following assertion: “Our experience has shown that even in the case of unstable processes, the indices that we discuss are still informative in identifying problematic processes that warrant more investigation” (ibid). However, our experience firmly supports the viewpoint of D. Wheeler: “An unpredictable (i.e., unstable) process does not have a well-defined capability” [
12]. A real example demonstrating the pointlessness of estimating both the capability indices and the SI for an unstable process is presented below. These are not all of our questions about the paper [
10], so we will return to this discussion in the following section.
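For readers less familiar with the indices listed above, here is a minimal sketch (our own, not taken from [10]) of the three basic capability/performance indices for individual data: Cp and Cpk use the within (moving-range) sigma, while Ppk uses the overall SD_{n-1}. The specification limits and data are illustrative, and, in line with Wheeler’s point quoted above, the numbers are meaningful only for a stable process.

```python
import numpy as np

def capability_indices(x, lsl, usl, d2=1.128):
    """Return (Cp, Cpk, Ppk) for individual data with spec limits [lsl, usl]."""
    sigma_within = np.mean(np.abs(np.diff(x))) / d2   # short-term sigma (SD_mR)
    sigma_overall = np.std(x, ddof=1)                 # long-term sigma (SD_{n-1})
    mu = x.mean()
    cp = (usl - lsl) / (6 * sigma_within)
    cpk = min(usl - mu, mu - lsl) / (3 * sigma_within)
    ppk = min(usl - mu, mu - lsl) / (3 * sigma_overall)
    return cp, cpk, ppk

rng = np.random.default_rng(3)
print(capability_indices(rng.normal(10.0, 1.0, 100), lsl=6.0, usl=14.0))
```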
Kim Jeong-bae et al. [
11] investigated the statistical properties of the SI analytically by using approximations to the
F-distribution. They analyzed the cases in which control charts for averages and ranges/standard deviations are used, with the process mean and SD either constant or changing stepwise. The results obtained by this group are close to those of the previous works, and they also use a two-dimensional plane to distinguish good processes from bad ones. The critical points for the SI turned out to depend on the chart type, the number of subgroups, and the subgroup size. Thus, Kim Jeong-bae et al. [
11] concluded that there is no single measure to assess process stability, and we agree with this.
5. Conclusions
In this article, we presented a number of arguments that using the SI without a good understanding of its nature and properties may lead not to process improvement but, on the contrary, to process worsening. In effect, the quality of business decisions may deteriorate because of management errors. These errors may be caused by a misunderstanding of what small or large values of the SI really mean or by a misinterpretation of what follow-up actions should be carried out. They can lead to unneeded distraction, wasted time and effort, and increased process variability. Furthermore, they can lead to disappointment in SPC methods in general and ShCCs in particular. To paraphrase the well-known remark of R. Descartes, the right use of the right words could cut the number of misconceptions in half. The SI is an ill-chosen term, as we have tried to show in this work. So, what are we to do to improve the situation?
First, this indicator should be renamed in order to focus the practitioner’s attention in the right direction. The name “stability index” makes the practitioner focus on process stability only, narrowing his or her view of the process and, in some cases, distorting the analysis of the process state.
Second, the values of the ratio (3) should be correctly interpreted, and the new name, the PStI, can help assess whether the process has some peculiarities that require comprehensive examination first. Such findings, if any, can help practitioners better understand their set of processes and improve both process and business decisions.
Third, a lot of new research is necessary. The statistical community needs to refine the notion of process stability in order to distinguish between ephemeral events (Dr. Deming’s definition of an assignable cause of variation) and a change of the system (as in non-homogeneous processes). Similarly, the impact of data nonrandomness on process behavior remains insufficiently investigated. In particular, the properties of the PStI should be analyzed by other researchers from different viewpoints and under various conditions.
Finally, it is fortunate that there is no need to invent any new metric when an appropriate one already exists; this is in keeping with Occam’s razor.
The data that support the findings of this study are available on request from the corresponding author, [SV]. The data are not publicly available due to privacy restrictions.