3.1.3. Data Analysis
Before starting the actual data analysis, it is useful to check for possible correlations. The correlation between two input parameters $x$ and $y$ is given by:

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

where $\bar{x}$ is the mean of $x$. The correlation $r_{xy}$ can take values in the interval [−1, 1]. Values of $|r_{xy}|$ close to 1 indicate strong (anti-)correlations; if $r_{xy}$ is close to 0, $x$ and $y$ are not correlated. A strong correlation between $x$ and $y$ means that tests with high values of $x$ also tend to have high values of $y$. Similarly, a strong anticorrelation between $x$ and $y$ means that high values of $x$ are often associated with low values of $y$. Strong (anti-)correlations between the inputs can easily lead to wrong conclusions during the evaluation because the associated effects cannot be separated.
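The correlation coefficient defined above can be sketched in a few lines of NumPy; the toy data below are purely illustrative:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two factor columns."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Perfectly correlated and perfectly anticorrelated toy data:
a = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_r(a, 2 * a))   # -> 1.0  (strong correlation)
print(pearson_r(a, -a))      # -> -1.0 (strong anticorrelation)
```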
The three phases of the main test programme were optimized by the design of experiments method [
29], which also minimises the correlation between the factors. However, the available collection of tests deviated from the planned test matrix, since some tests were invalid or were not carried out as specified. Furthermore, additional data were contributed by some project partners, and some test conditions were modified during the project. These circumstances could have introduced correlations between the independent variables.
Table 4 lists the correlations between the factors in the main programme. The two largest (anti-)correlations both involve the surface roughness. An anticorrelation between the hold time and the surface roughness was expected: in Phases II and III, no tests with holds were carried out any more, whereas in Phase II, a higher surface roughness was introduced. One would therefore expect tests with holds to have, on average, a lower surface roughness, and hence an anticorrelation between the two factors. The second correlation, however, is unexpected. Most likely, it is a random effect resulting from the grinding process that was used to produce the rough surface finishes and that yielded a distribution of surface roughnesses rather than specific target values (Figure 3). These two largest (anti-)correlations were below 0.15 in magnitude and should not have a major impact on the evaluation.
The actual data analysis was based on a second degree factorial model, i.e., a model including the main effects and all second order interactions:

$$\ln N = I + \sum_i a_i x_i + \sum_{i<j} b_{ij}\, x_i x_j \qquad (9)$$

The $x_i$ are the different factors (such as the strain range or the surface roughness). The parameters $a_i$ and $b_{ij}$ are the model parameters for the main effects and the two-factor interactions, and $I$ is the intercept. For every test, an equation like Equation (9) is formulated. The best model is the model for which the parameters $a_i$, $b_{ij}$ and $I$ best describe the experimental data. A lognormal distribution for the fatigue life $N$ is assumed, as recommended in ISO 12107 [37]. In a lognormal distribution, the expected (i.e., mean) value $E[X]$ of the lognormally distributed variable $X$ is:

$$E[X] = e^{\mu + \sigma^2/2} \qquad (10)$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the natural logarithm of $X$.
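The lognormal mean formula can be checked numerically by sampling; the parameter values below are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 7.0, 0.5  # mean and std of ln(X); arbitrary example values

# Draw lognormal samples and compare the empirical mean to exp(mu + sigma^2/2)
x = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)
expected = np.exp(mu + sigma**2 / 2)
print(x.mean(), expected)  # the two values agree to within sampling error
```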
The model parameters in Equation (9) depend on the scaling of the factors $x_i$. Normalizing the factors to the range [−1, 1] allows comparing the impact of the different main and interaction effects by simply comparing the corresponding model parameters. Table 5 lists the normalization conventions for the factors in the main programme. In this work, a superscript indicates normalized factors. For consistency, the categorical factors, like the environment $E$, are labelled in the same way.
The aim of the current study is not only to obtain a numerical model that allows predicting the fatigue life of a specimen under a specific set of test conditions, but especially to determine which of the investigated factors have a significant impact on fatigue life. Therefore, the selected model should not only describe the data, but also include only those variables that have a significant effect on fatigue life. Many algorithms are available for fitting a model to the data. For the present study, we chose the backward elimination method [38]. This algorithm starts with a full model including all factors and interactions under consideration (in this case, a second order factorial model, Equation (9)) and evaluates its predictive performance. In the next step, one model parameter (main effect or interaction) is removed, and the performance of the reduced model is evaluated. This procedure is repeated iteratively until only the intercept is left. This approach makes it easier to compare models with different numbers of factors than algorithms that do not eliminate factors at all or that do not change the number of factors in every step.
The model that best fits the data is not necessarily the most useful model, since models with more parameters can easily overfit the data (i.e., fit the noise). Two approaches were used here for model selection. In the first approach, the data set is divided into a training set and a validation set. The data in the training set are used to determine the model parameters. The data in the validation set are then used to evaluate the predictive performance of the model. Since the data in the validation set were not used to determine the model parameters, the predictive performance on the validation set is a good measure of the model performance under new conditions within the parameter range in which the model was optimized.
From Figure 2, it is clear that the data sets can be roughly separated into four distinct groups by two of the factors, one of which is the environment $E$. The training and validation sets were selected in such a way that 75% of the data in each of the four groups are in the training set and 25% in the validation set. This approach is shown in Figure 4a, where $-\ln L$ for the training and the validation sets is plotted over the iteration steps of the algorithm. The quantity $-\ln L$, the negative natural logarithm of the likelihood function, is a measure of the goodness of fit, whereby smaller values indicate a better fit. The iteration steps of the algorithm start with Step 0, i.e., the full model including all main effects and all two-factor interactions. Moving left on the abscissa follows the progression of the algorithm until, at the leftmost step (here, Step 15), only the intercept remains.
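Under the lognormal assumption, the negative log-likelihood of a candidate model reduces to the Gaussian negative log-likelihood of $\ln N$ about the model prediction; a minimal sketch, with arbitrary toy numbers:

```python
import numpy as np

def neg_log_likelihood(log_n_obs, log_n_pred, sigma):
    """-ln L for lognormally distributed fatigue lives N:
    ln(N) is treated as normal with mean log_n_pred and std sigma."""
    r = np.asarray(log_n_obs) - np.asarray(log_n_pred)
    n = r.size
    return 0.5 * n * np.log(2.0 * np.pi * sigma**2) + np.sum(r**2) / (2.0 * sigma**2)

# A model with smaller residuals yields a smaller -ln L (a better fit):
obs = np.array([7.0, 8.0, 9.0])
print(neg_log_likelihood(obs, obs + 0.1, sigma=0.3)
      < neg_log_likelihood(obs, obs + 0.5, sigma=0.3))  # -> True
```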
The dashed line refers to the training set. The $-\ln L$ for the training set rises continuously as the algorithm progresses (from right to left). This is expected, since reducing the number of terms in the model necessarily leads to worse fits. The behaviour of the solid curve for the validation set is different: initially, the $-\ln L$ drops until it reaches a minimum in Step 10 (indicated by the vertical red line) and rises continuously from there. This means that the model that best describes the validation set is reached in Step 10 of the algorithm. The corresponding model coefficients are listed in Table 6, Model (a) (Appendix A).
An alternative approach for selecting a model and avoiding overfitting is to use a measure of the quality of the fit that penalises models with a larger number of parameters. The Bayesian information criterion (BIC) is such a measure. It is defined as:

$$\mathrm{BIC} = -2 \ln L + k \ln n \qquad (11)$$

where $k$ is the number of parameters in the model and $n$ is the number of data points. As for the $-\ln L$ discussed above, lower values of the BIC indicate a better fit. The second term of the sum in Equation (11) penalizes models with more parameters.
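The trade-off encoded in the BIC can be made concrete with made-up numbers: an extra parameter must buy enough likelihood to outweigh its $\ln n$ penalty.

```python
import numpy as np

def bic(neg_log_lik, k, n):
    """Bayesian information criterion: 2*(-ln L) + k*ln(n)."""
    return 2.0 * neg_log_lik + k * np.log(n)

# The 8-parameter model fits slightly better (-ln L = 49 vs. 50), but with
# n = 100 data points the complexity penalty makes the 5-parameter model win:
n = 100
print(bic(neg_log_lik=50.0, k=5, n=n) < bic(neg_log_lik=49.0, k=8, n=n))  # -> True
```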
Figure 4b shows the BIC for the different steps of the backward elimination algorithm applied to the full data set. The best model is again reached in Step 10; the corresponding model coefficients are listed in Table 6, Model (b).
3.1.4. Discussion
Comparing the model coefficients listed in Table 6, Models (a) and (b), shows that both models include the same terms, namely the main effects of the strain range, the environment and the surface roughness, as well as the two interactions strain range × environment and strain range × surface roughness. The estimates for the coefficients of all three main effects are negative, indicating their detrimental effect on fatigue life.
The estimated coefficient for the strain range × environment interaction is positive; large values of either factor, i.e., large strain ranges or testing in the LWR environment, therefore partly compensate the negative effects of the corresponding main effects: at high strain ranges, there is less environmental effect. This is consistent with the observation reported in [3].
Similarly, the positive coefficient of the strain range × surface roughness interaction reduces the negative impact of a high surface roughness at high strain ranges. This is understandable: the surface roughness affects crack initiation rather than crack growth, so one would expect it to have a more deleterious impact in situations where fatigue life is dominated by crack initiation, i.e., at low strain ranges, which is what the models predict.
Models (a) and (b) were determined using the same algorithm (backward elimination), but with different validation methods. Published and project-internal analyses with different algorithms and slightly different data sets consistently showed the main effects of the strain range, the environment $E$ and the surface roughness to have the largest impact [40]. In most cases, one or two two-factor interactions were found to be statistically significant, but not practically relevant, i.e., they did not have a major impact on the predicted fatigue life. The interactions that were found to be statistically significant varied in evaluations with increasing size of the data set and depending on the algorithm used for the model optimization. This may indicate that the size of these effects is at the limit of what is detectable with the number of tests available in this work.
This is confirmed by the optimization curves (the solid black lines) for both models in
Figure 4. In both cases, the best model is found in Step 10, but the performance of the models in Step 9 or 11 is very comparable. A further reduced model, including only the main effects, was therefore calculated (using the BIC validation); the model parameters are listed in
Table 7.
Figure 5a compares the fatigue lives predicted by the three models to the experimentally observed values. As could be expected from Table 6, Models (a) and (b) are hardly distinguishable; only at very high fatigue lives do differences become apparent. Model (c), which only includes the main effects, differs visibly from the other two models. For high fatigue lives, Model (c) systematically predicts lower values, whereas the contrary can be observed in the medium range around 4000 cycles. In the region around 1000 cycles, all three models match well in general, with Model (c) deviating from the others in some cases. These differences result from omitting the interaction effects. However, the differences between the reduced Model (c) and the optimal Models (a) and (b) are small compared to the scatter observed experimentally. Therefore, Model (c) seems good enough to make realistic predictions.
During the analysis, all tests were considered to be carried out either in air or in the LWR environment, where the LWR environment included simulated PWR as well as simulated VVER conditions, and no distinction was made between the latter two. Furthermore, all tests in the VVER environment (and only these) were performed on a 321 steel. The question is whether combining the PWR and VVER tests in this way was a sensible approach.
Figure 5b–d compares the fatigue lives predicted by Model (a) to the experimentally observed values, whereby the colour coding indicates the different environments, strain ranges and surface roughnesses. The VVER data in Figure 5b are distributed around the black reference line and do not show any particularities. Hence, based on the data available here, the model describes the VVER data just as well as the PWR data. Similarly, the model predictions work equally well for the different strain ranges (c) and surface roughnesses (d). The effect of the surface roughness on the predicted fatigue life is visible in the separation between the blue points with very low and the grey/red points with higher roughness values. The gap between these two groups is larger for longer fatigue lives, reflecting the interaction between the strain range and the surface roughness.