1. Introduction
We have long records of instrumental climate observations for a large part of the earth and for several climate variables. These records represent an enormous asset in the evaluation of past, present, and future climate changes. However, technical changes in observation practices affect the temporal comparability of observed data. For instance, station relocations and changes in instrumentation, observing staff, the time schedule of observations, etc. may cause non-climatic biases, so-called inhomogeneities, in the time series of observed data. In fact, long series without such technical changes are rare [1], and the mean frequency of inhomogeneities is generally estimated to be 5–7 per 100 years [2,3,4]. The problem of inhomogeneities in climate records has been known for more than a century [5], but finding the best tools to remove inhomogeneities is still a challenging task [5,6]. The two main sources for any possible solution are the statistical analysis of time series and the use of documented information, so-called metadata, about the technical changes of the observations.
The lists of documented technical changes would provide the optimal solution for inhomogeneity removal if such lists were complete and the quantitative impact of each technical change were known. For instance, if metadata show that a station relocation occurred in 1950 from station “A” to station “B”, we know that sections of data before 1950 are inhomogeneous when compared with data after 1950. We know the size of the change from metadata only when parallel observations at station “A” and station “B” were performed for several years and their results are preserved among the metadata. In practice, metadata lists are incomplete, since a part of the technical changes are unintentional, and most metadata are unquantified [5,7,8].
The principal idea of statistical homogenization is that station-specific inhomogeneities can be made visible by the comparison of time series from nearby stations, since the temporal evolution of climate is similar within a given climatic region. Such statistical procedures are named relative homogenization. Relative homogenization can be performed with or without the joint use of metadata. In the last few decades, several automatic methods have been constructed for the homogenization of large climatic datasets [5]. Metadata still offer additional information for effective homogenization, but the fruitful combination of a statistical method with unquantified pieces of information is not a simple task. This paper assesses the potential benefit of metadata use in automatic homogenization procedures on the example of the ACMANT homogenization method [9,10], which is tested on a large, synthetically developed monthly temperature benchmark dataset. Before presenting our own examinations, we give a brief review of the usual metadata use in homogenization.
There is a wide consensus among experts regarding the generally high importance of metadata [5,8,11]. However, not all pieces of metadata have the same importance, and the most important are those that point to synchronous technical changes in many or all stations of an observing network. In such cases, the basic idea of relative homogenization fails, but when metadata provide sufficient information, the inhomogeneities can be removed by a separate operation [12] performed before the general procedure of relative homogenization. The majority of inhomogeneities are station specific, and they are also referred to as station effects. Most frequently, a change in the station effect results in a sudden shift, a so-called break, in the section mean values of the time series. Hereafter, metadata means station-specific metadata, except when the context specifies otherwise. In spite of the theoretical consensus about the importance of metadata, their practical treatment varies between individual studies, and not only because of the varying availability of metadata. Starting with studies in which little attention is dedicated to metadata, Pérez-Zanón et al. [13] omitted the metadata dates from the break dates in homogenizing with the HOMER method [14]. In the homogenization of an integrated water vapor dataset, Nguyen et al. [15] reported that only 30–35% of the statistically detected breaks were confirmed by metadata, and they explained this by the ability of the statistical method to find relatively small breaks. By contrast, in many other studies, low ratios of metadata confirmation are explained by the lack of metadata availability, or they are interpreted as an overestimation of break frequency by statistical methods. In a few studies, statistically detected breaks without metadata confirmation are left out of consideration [16,17], partly because the limited possibilities of time series comparison weakened the reliability of the statistical detection results in those studies. Finally, finding inconsistencies in break detection results, O’Neil et al. [18] compared statistical break detection without metadata use to the use of imaginary maps, citing an old economist [19]: “A man who uses an imaginary map, thinking that it is a true one, is likely to be worse off than someone with no map at all”.
We know of a few studies in which the results of a relative homogenization method with metadata use were compared with those of the same homogenization method without metadata use. In all these studies [20,21,22], the compared homogenization results showed minor differences, so they could not confirm the usefulness of metadata in relative homogenization.
The value of metadata depends on the manner of their use. In automatic homogenization procedures, metadata indicating a likely break occurrence are often used to refine the dates of statistically detected breaks, when the metadata date falls within the confidence interval of the statistically detected break [5,23,24]. Further, less strict significance thresholds can be applied in statistical break detection when the breaks can be confirmed by metadata. For instance, coincident statistical break detection results for the seasonal and annual series of the same time series are expected when the breaks are not supported by metadata, while breaks are accepted without such coincidences in the reverse case [25,26,27]. The same logic of metadata use, in slightly different ways, is applied in the newly developed “automated HOMER” of Joelsson et al. [28], and also in ACMANTv5 [9].
All the examples described in the previous paragraph represent restrictive metadata use, where restriction means that the use of any piece of metadata is conditioned on some indication of statistical significance. By contrast, in permissive metadata use every piece of metadata is considered as a break position, disregarding statistical break detection results. In this study, we examine both restrictive and permissive ways of metadata use.
2. Materials and Methods
2.1. Benchmark Database
To test the usefulness of metadata, a large monthly temperature benchmark dataset has been developed. It consists of a seed dataset and 40 further datasets; in each of the latter, one parameter of the seed dataset is altered. Each dataset has homogeneous and inhomogeneous sections, and each of them contains 500 networks. Within a network, all time series cover the same period, and no data gaps occur. The datasets also include metadata showing the dates of the inserted breaks. However, some metadata dates are false, i.e., they point to non-existing breaks, similarly to the occurrence of such instances in real-world metadata.
2.1.1. Seed Dataset
In setting the parameters of the seed dataset, the aim was to provide a dataset that (a) includes several kinds of inhomogeneity problems, such as multiple breaks, occurrences of gradual inhomogeneities, notable seasonal cycles of inhomogeneity magnitudes, and significant network mean biases; (b) is characterized by a moderately high signal-to-noise ratio; and (c) is realistic, i.e., the networks and their time series are similar to those of real-world data. To fulfill these expectations, a slightly modified version of the “U2” dataset of the MULTITEST project [29] was selected. The modifications included the reduction of the mean magnitude and maximum duration of short-term platform-shaped inhomogeneities, a slight increase in break frequency, and the enlargement of the dataset to 500 networks. Each network consists of 10 time series, which are 60 years long. Here, a shortened description of the dataset generation is provided.
The homogeneous set originates from the synthetic daily temperature dataset produced for four U.S. regions [30,31]. The original series are 42 years long. The 210 homogeneous time series of the southeastern region, version 2, were taken (Wyoming data were used for U2), and 100-year-long monthly series were created from them, keeping the original spatial connections unchanged. See details of this step in [32]. Note that although the time series of the seed dataset are only 60 years long, 100-year-long series were created for some other test datasets.
In generating the inhomogeneous set, inhomogeneities and outliers were randomly added to the time series. Three kinds of inhomogeneities are included: breaks, linearly changing biases, and short-term platform-shaped inhomogeneities. The mean frequencies per 100 years are 5 breaks, 1 linear change, 5 platform inhomogeneities, and 2 outliers, and the frequencies vary randomly between time series. The size distribution of inhomogeneities is Gaussian with zero mean, and the standard deviation of inhomogeneity magnitudes for breaks and linear changes (platform inhomogeneities) is 0.8 °C (0.6 °C). The length of linear changes varies between 5 and 99 years. The length of platform inhomogeneities varies between 1 month and 60 months, and their frequency increases quadratically with decreasing length. The sequence of inhomogeneities is a “limited random walk” [32]. This concept means that inhomogeneity sizes are generally independent and are simply added to the previous bias; however, thresholds are set for the accumulated bias, which are not allowed to be exceeded in the dataset generation. The thresholds for accumulated biases differ according to the sign of the bias, and this resulted in notable network mean trend biases in the inhomogeneous dataset. Coincidental breaks in more than one time series of a given network may accidentally occur, but synchronous or semi-synchronous breaks were not produced intentionally in the dataset creation. The seasonal cycle of inhomogeneity sizes follows a semi-sinusoid pattern in 75% of the inhomogeneities, while it is flat in the other cases. More details can be found in [29].
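The limited random walk concept can be illustrated with a short sketch. The Python code below is only a minimal illustration under our own assumptions (the function name, the handling of the break rate, the asymmetric threshold values, and the simple redraw rule enforcing them are ours); it is not the actual generation code of the benchmark.

import numpy as np

rng = np.random.default_rng(42)

def limited_random_walk_effect(n_years=100, break_rate=5 / 100, sigma=0.8,
                               lower=-1.5, upper=2.0):
    """Generate a piecewise-constant station effect at annual resolution.

    Break sizes are Gaussian (mean 0, std `sigma`) and are added to the
    previous bias; a proposed break is redrawn whenever the accumulated
    bias would leave the [lower, upper] band. The asymmetric band is an
    illustrative stand-in for sign-dependent thresholds, which produce
    network mean trend biases.
    """
    effect = np.zeros(n_years)
    bias = 0.0
    for year in range(1, n_years):
        if rng.random() < break_rate:          # on average ~5 breaks per 100 years
            size = rng.normal(0.0, sigma)
            while not (lower <= bias + size <= upper):
                size = rng.normal(0.0, sigma)  # redraw so the accumulated bias stays bounded
            bias += size
        effect[year] = bias
    return effect

if __name__ == "__main__":
    station_effect = limited_random_walk_effect()
    print("final accumulated bias: %.2f °C" % station_effect[-1])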
The metadata list contains all the dates of the inserted breaks, while it does not contain dates related to gradually changing biases or short-term, platform-shaped inhomogeneities. Twenty percent of the dates in the metadata list are false, i.e., they point to non-existing breaks. Note that in the homogenization of the seed dataset, an arbitrarily selected 25% of the metadata were excluded, simulating metadata incompleteness.
2.1.2. Secondary Datasets
Secondary datasets were created from the seed dataset by changing one parameter of the dataset generation. Five parameters were varied: (a) the number of time series per network, (b) the time series length, (c) the standard deviation of inhomogeneity size, (d) the mean spatial correlation between time series, and (e) the ratio of false break dates in the metadata. In addition, the apparent incompleteness of the metadata was varied by manipulating a parameter of the homogenization procedure. As a general rule, only one parameter was altered in comparison with the seed dataset generation and homogenization, but there is one deviation from this rule: each dataset with networks of 10 time series has a twin dataset whose only difference is that its networks contain 5 time series. This is because the importance of metadata is higher for small networks than for larger ones.
The generation of secondary datasets was technically simple; only the variation of spatial correlations needs explanation. The spatial correlations are calculated for the increment series of deseasonalized monthly values [32], and in this study, they are calculated for the homogeneous section of the data. The mean correlation in the source U.S. database is 0.883. Since the time series were selected randomly, the mean spatial correlation in our seed dataset is the same. For raising (lowering) the mean correlation, minimum (maximum) correlation thresholds were set in the selection of time series for a given network. To lower the mean correlation more effectively, Gaussian red noise processes of zero mean and 0.15 autocorrelation were also added, but these had a smaller role than the use of correlation thresholds. When the mean correlation was lowered the most, down to 0.67, the standard deviation of the added noise was 0.3 °C.
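For illustration, the sketch below shows one way to compute the correlation of the increment series of deseasonalized monthly values and to add Gaussian red noise to a series; the function names and implementation details are our assumptions, and the actual dataset generation may differ.

import numpy as np

def increment_correlation(x, y):
    """Correlation of the increment (first-difference) series of two
    deseasonalized monthly series; a minimal sketch of the measure
    described in the text, assuming whole years of monthly data."""
    def deseasonalize(z):
        z = z.reshape(-1, 12)                 # years x months
        return (z - z.mean(axis=0)).ravel()   # remove the mean annual cycle
    dx = np.diff(deseasonalize(x))
    dy = np.diff(deseasonalize(y))
    return np.corrcoef(dx, dy)[0, 1]

def add_red_noise(x, sigma=0.3, phi=0.15, rng=None):
    """Add AR(1) Gaussian red noise (zero mean, autocorrelation `phi`,
    standard deviation `sigma` in °C) to lower spatial correlations."""
    rng = rng or np.random.default_rng()
    noise = np.zeros_like(x, dtype=float)
    eps = rng.normal(0.0, sigma * np.sqrt(1 - phi ** 2), size=x.shape)
    for t in range(1, len(x)):
        noise[t] = phi * noise[t - 1] + eps[t]
    return x + noise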
The complete benchmark database consists of 41 datasets, 20,500 networks, and 170,000 time series.
2.2. Homogenization with ACMANT
ACMANT (Adapted Caussinus–Mestre Algorithm for the homogenization of Networks of climatic Time series) includes theoretically sophisticated and practically tested solutions for all important phases of the homogenization of climate time series. In the method comparison tests performed on the 1900 synthetic monthly temperature networks of the MULTITEST project, ACMANT produced the most accurate results [29]. ACMANTv4 is described in [10], while the changes in ACMANTv5 relative to the earlier versions are presented in [9]. Here we summarize a few important features of the method.
ACMANT homogenizes section means only. It applies a maximum likelihood multiple break detection method [33], which has univariate and bivariate modes within ACMANT [10]. The so-called combined time series comparison [9] is included, which unifies the advantages of pairwise comparisons and composite reference series use. It applies ensemble homogenization [10] to reduce random homogenization errors. The correction terms for inhomogeneity bias removal are calculated jointly for an entire network by the equation system of the ANOVA correction model [33,34], which gives better results than any other known correction method in most practical cases [5]. ACMANT can be applied either to daily or monthly datasets, and it is characterized by a high missing data tolerance. The newest version (ACMANTv5) has both automatic and interactive modes.
When climatic conditions suggest that inhomogeneity biases frequently have a sinusoid or semi-sinusoid annual cycle, bivariate homogenization can be applied, in which the breaks of the annual means and those of the summer–winter differences are jointly detected, and separate ANOVA correction term calculations are performed for these two variables [10]. In this study, both the univariate and bivariate ACMANT homogenizations are applied to each test dataset. The homogenizations are performed in fully automatic mode.
2.3. Metadata Use
Regarding metadata use, three kinds of homogenization were performed: (i) restrictive metadata use, (ii) permissive metadata use, and (iii) exclusion of metadata. Overall, each test dataset was homogenized in six modes, combining univariate and bivariate homogenization with the three modes of metadata use.
2.3.1. Restrictive Metadata Use
In restrictive metadata use, the automatic mode of ACMANTv5.0 homogenization was applied [9,35]. Metadata are used in two steps of the homogenization, and in both cases, the indication of some statistical results is needed to include pieces of metadata.
(a) Metadata use in the pairwise comparison step of the homogenization: In summarizing the coincident pieces of detection results for any date of the currently examined time series, referred to as the candidate series, a confirming metadata date is considered in the same way as a confirming indication from the comparison between the candidate series and one of its neighbor series. The existence of a break is accepted when the total weight of the confirming pieces of information exceeds the threshold of 2.1. Note that such pieces of detection results can be 0 or 1 in the original development of the automatic evaluation of pairwise comparison results [36], while they can also be fractions between 0 and 1 in ACMANT [9]. Pieces of metadata are always considered with weight 1.
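The decision rule of point (a) can be summarized in a few lines of Python. This is an illustrative re-expression under our own assumptions about the interface (the function and argument names are ours), not the ACMANT source code.

def break_accepted(neighbor_weights, metadata_confirms, threshold=2.1):
    """Sum the confirming evidence for a candidate break date.

    neighbor_weights  -- confirming indications from the pairwise comparisons
                         with neighbor series; in ACMANT these may also be
                         fractions between 0 and 1
    metadata_confirms -- True when a metadata date confirms the break;
                         metadata always contribute weight 1
    """
    total = sum(neighbor_weights) + (1.0 if metadata_confirms else 0.0)
    return total > threshold

# Example: 1.4 from neighbor comparisons alone is rejected, but with a
# confirming metadata date the total (2.4) exceeds the threshold of 2.1.
print(break_accepted([0.8, 0.6], metadata_confirms=False))  # False
print(break_accepted([0.8, 0.6], metadata_confirms=True))   # True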
(b) In the monthly precision step of the ACMANT procedure, metadata dates have a certain degree of preference. In the monthly precision, the most likely break date is searched for within a 28-month-wide symmetric window around the date detected at the annual scale. Step functions with one step are fitted with varying step positions. In univariate homogenization, the sections before and after the step are flat, while in bivariate homogenization the best fitting sinusoid annual cycles are also included for them [10], so that modified step functions are applied. Generally, the step position producing the lowest sum of squared errors (SSE) is selected as the break date. When a metadata date occurs in the examined window, and its SSE exceeds the minimum SSE by no more than 2.0 (1.5) standard deviations of the examined data in univariate (bivariate) homogenization, the metadata date is selected as the break date. When several metadata dates occur in an examined window, the one with the lowest SSE is preferred.
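A simplified sketch of the univariate version of this step is given below; the window handling, array layout, and names are our assumptions, and the bivariate case (with fitted annual cycles) is omitted for brevity.

import numpy as np

def monthly_precision(segment, window_idx, metadata_idx=(), tol_sd=2.0):
    """Sketch of the univariate monthly precision step described in (b).

    segment      -- relative series around the annually detected break date
    window_idx   -- candidate step positions (the 28-month window)
    metadata_idx -- positions of metadata dates falling inside the window
    tol_sd       -- tolerance in standard deviations of the examined data
                    (2.0 univariate, 1.5 bivariate, following the text)
    """
    def sse_of_step(pos):
        # one-step function: flat sections before and after the step
        left, right = segment[:pos], segment[pos:]
        return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

    sse = {pos: sse_of_step(pos) for pos in window_idx}
    best_pos = min(sse, key=sse.get)
    tolerance = tol_sd * segment.std()
    # a metadata date is preferred when its SSE is close enough to the minimum
    sse_meta = {m: sse[m] if m in sse else sse_of_step(m) for m in metadata_idx}
    candidates = [m for m in metadata_idx if sse_meta[m] <= sse[best_pos] + tolerance]
    if candidates:
        return min(candidates, key=lambda m: sse_meta[m])
    return best_pos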
2.3.2. Permissive Metadata Use
In permissive metadata use, all kinds of metadata use described in Section 2.3.1 are kept, and additionally, all metadata dates are included in the final application of the ANOVA correction model, regardless of whether any statistical result indicates their significance. In the ANOVA model, every time series of observed climate data is considered to be the sum of a regionally common climate signal and a site-specific station effect, and the temporal evolution of the station effects is described by step functions [10]. The input data of the model consist of the observed climatic data and the dates of the breaks related to known inhomogeneities or to inhomogeneities detected by statistical methods; hence, the inclusion of metadata dates in the ANOVA correction model is straightforward.
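The description above can be condensed into a schematic model equation; the notation below is ours and is only meant to recall the structure that is formulated exactly in [10,33,34]:
\[
x_{j,t} = c_t + h_{j,t} + \varepsilon_{j,t}, \qquad
h_{j,t} = \sum_{k:\,T_{j,k} \le t} d_{j,k},
\]
where x_{j,t} is the observed value of series j at time t, c_t is the regionally common climate signal, h_{j,t} is the station effect written as a step function with break dates T_{j,k} and step sizes d_{j,k}, and ε_{j,t} is the noise. In permissive metadata use, the metadata dates are simply added to the set of break dates {T_{j,k}}.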
2.4. Efficiency Measures
We use efficiency measures that report directly on the success in reconstructing the climate trends and climate variability. Let V and Z stand for the vectors of the homogeneous series and the homogenized series with metadata use, respectively. The differences in the temporal evolution of V and Z represent the residual errors after homogenization. In this study, these residual errors are characterized by four error measures: the centered root mean squared error of monthly values (CRMSEm), the centered root mean squared error of annual values (CRMSEy), the mean absolute error of linear trends fitted to individual time series (Trbias), and the mean absolute error of linear trends fitted to network mean time series (Trnetb).
2.4.1. Centered Root Mean Squared Error
The centered root mean squared error differs from the common root mean squared error in that the former excludes the difference between the means of the compared time series. Its use is justified by the fact that time series homogenization aims to reconstruct the temporal evolution of climate data, but it does not and cannot reconstruct station-specific climatic normal values. The concept of CRMSE was introduced to time series homogenization by [3], and its use has been widespread since then. Equation (1) shows the calculation of CRMSE for time series of n data points. Equation (1) is usable both for monthly and annual time series.
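Written out with v_i and z_i denoting the elements of V and Z and overbars denoting series means, Equation (1) takes the standard centered form (our transcription, consistent with the definition above):
\[
\mathrm{CRMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl[(z_i - \bar{z}) - (v_i - \bar{v})\bigr]^{2}}.
\]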
2.4.2. Trend Bias
Linear regression with the minimization of the mean quadratic difference is fitted to the time series V and Z, and the trend slopes are denoted by αV and αZ, respectively. Trbias and Trnetb can be calculated by Equation (2).
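In a form consistent with the definitions above (the indexing over the J series of a dataset is our own notation), Equation (2) can be transcribed as:
\[
\mathrm{Trbias} = \frac{1}{J}\sum_{j=1}^{J}\left|\alpha_{Z,j} - \alpha_{V,j}\right|,
\]
with Trnetb calculated in the same way from the network mean series.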
2.4.3. Efficiency of Metadata Use
To examine the metadata effects on the homogenization results, the results with metadata use (Z) are compared with the results without metadata use (U). Then the efficiency (f) of metadata use is calculated as the percentage reduction of any error term (E) by Equation (3).
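In a transcription consistent with the description above, Equation (3) reads:
\[
f = 100\,\frac{E(U) - E(Z)}{E(U)}\ [\%].
\]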
The sign of f indicates whether the metadata effect on the homogenization results is positive or negative, and in the theoretical case of perfect homogenization with metadata, f = 100 (%). Equation (3) is used for datasets but not for individual time series, since E(U) can be very small or even zero for some series. In the examination of the efficiencies for individual time series (f1), the efficiency is normalized with the dataset mean value of E(U), Equation (4).
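A transcription of Equation (4), with the normalization by the dataset mean of E(U) as described above (the indexing is our notation), is:
\[
f_{1,j} = 100\,\frac{E_j(U) - E_j(Z)}{\tfrac{1}{J}\sum_{i=1}^{J}E_i(U)}\ [\%].
\]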
In Equation (4), J stands for the number of time series in a given dataset. Unlike f, f1 may take values above 100%.
4. Discussion
The presented examinations show that the inclusion of metadata in automatic homogenization improves the accuracy of the homogenization results. However, we have also found some unfavorable features. The average improvement of the homogenization accuracy due to metadata use is not very large and may be smaller than expected. In addition, in the results of individual network homogenizations, a notable worsening of the accuracy was sometimes found, although less frequently than notable improvement. The presented results are linked to the conditions of the test experiments, i.e., (i) the base statistical software (ACMANT), (ii) the algorithm of metadata use, and (iii) the test dataset properties. In evaluating whether the settings of the presented experiments could have produced some of the unfavorable results, we point to the general stochastic behavior of homogenization accuracy with a simple example (Figure 9).
Let us suppose that during the homogenization of a series whose platform-shaped inhomogeneity is shown in Figure 9a, only the first break (the break in 1975) has been detected and adjusted. In the shown synthetic example, the unrevealed break in 2005 affected the calculation of the adjustment term for the break in 1975, while any additional errors from noise or from neighbor series inhomogeneities (not shown) were zero. However, in the present context, the important point is not the accuracy of the adjustment for the first break, but the fact that adjusting only one of the two existing breaks resulted in an increased trend bias for the period 1961–2020: the trend bias is zero in Figure 9a for symmetry reasons, while in Figure 9b the linear trend slope for 1961–2020 is −0.62 °C/100 years. As most climate time series include multiple inhomogeneities of varied magnitude [3,38,39], the stochastic behavior of the accuracy changes related to the adjustment of individual inhomogeneities is an inherent characteristic of homogenization. We can conclude that the accuracy improvements provided by metadata use are limited by the following factors: the efficiency of the statistical procedure, the incompleteness of metadata, false metadata occurrences, and stochastic effects.
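The trend-bias mechanism can be reproduced schematically in a few lines; the platform magnitude, dates, and the resulting slope below are illustrative only and are not the exact values behind Figure 9.

import numpy as np

years = np.arange(1961, 2021)
# platform-shaped inhomogeneity of +1.0 °C (illustrative magnitude),
# placed symmetrically within 1961-2020 so that the raw trend is exactly zero
platform = np.where((years >= 1976) & (years <= 2005), 1.0, 0.0)
raw = platform.copy()                      # flat true climate assumed

# adjust only the first break: shift the earlier section to the platform level,
# leaving the second break unrevealed (analogous to Figure 9b)
partially_adjusted = raw.copy()
partially_adjusted[years < 1976] += 1.0

slope_raw = np.polyfit(years, raw, 1)[0] * 100
slope_adj = np.polyfit(years, partially_adjusted, 1)[0] * 100
print(f"trend of the unadjusted series:     {slope_raw:+.2f} °C/100 yr")  # zero by symmetry
print(f"trend after the partial adjustment: {slope_adj:+.2f} °C/100 yr")  # clearly negative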
A general experience is that an undetected inhomogeneity tends to do more harm than the mistaken detection of a non-existing break. Nevertheless, false breaks have negative impacts on homogenization accuracy. Such negative impacts are smallest when the breaks indicated by metadata but not confirmed by statistical tests are considered only in the final correction step of the homogenization procedure. This permissive metadata use exploits the benefit of all metadata while minimizing the error propagation between time series caused by the uncertainty of breaks indicated only by metadata.
The usefulness of metadata is influenced by their reliability and relevance. A metadata date is rarely erroneous, but some pieces of metadata can be irrelevant. While several types of technical changes almost always cause inhomogeneities, the relevance is less certain in some other cases, or can even be doubtful, e.g., when metadata indicate “maintenance works”, “station inspection”, etc.
For time series of spatially dense and highly correlated observations, high homogenization accuracy can be achieved even without metadata use, and the presented tests confirm that metadata can have only a minor role in such homogenization tasks. Although the mean efficiency of metadata use remains positive for any network density according to the presented tests, the possible exclusion of metadata use may sometimes be justified by the workload of metadata selection and digitization. However, the following factors must be considered in connection with the possible exclusion of metadata use: (i) metadata of synchronous or semi-synchronous technical changes must be treated in a distinct way, and they can never be omitted from homogenization; (ii) the unevenness of spatial correlations or data gaps may cause a low spatial density of observed data for some stations or for some periods in an otherwise spatially dense dataset.
Networks of 4 time series can be homogenized by the automatic version of ACMANT when no metadata are available. When metadata are available, the use of a manual or interactive homogenization method is recommended, which can be the interactive version of ACMANTv5. Larger networks can be homogenized in automatic mode either with or without metadata, although the inclusion of permissive metadata use in the ACMANT software remains a task for the future.