1. Introduction
In an image classification system based on machine learning, the main objective is to obtain a trained model from the input data such that it makes predictions on new data with the lowest possible error. These new data are assumed to be sampled from the same data distribution used during training [1]. In other words, the characteristics of the data encountered by the model in production must be similar to those of the data used for training. This condition is usually referred to as i.i.d. (independently and identically distributed) data, where the data distribution remains constant and each drawn sample is independent of previously drawn samples. This situation can be considered an ideal environment for a machine learning system, but it does not hold in all cases.
The importance of this aspect is related to the evaluation and performance of the model, particularly with regard to training error and generalization error. In the first case, there are usually no drawbacks since the training error is evaluated on the same training/validation dataset. In contrast, the generalization error corresponds to the evaluation of the model on a continuous stream of additional data that, in principle, should be drawn from the same data distribution as the original samples [
1,
2].
However, if the inference data show changing dynamics, i.e., if they do not necessarily resemble previous data or if their behavior changes over time, the i.i.d. condition will not be met; instead, a distribution shift (training and test data distributions differ) will be present [
3]. A distribution shift occurs when the data encountered by a machine learning model deviate from the data it was trained on [
4]. Real-world applications often face data-related challenges, such as differences in distributions between two comparable datasets in the same domain [
5], or gradual differences over time as the real world evolves (e.g., image or sensor data collected over extended periods) [
4].
This may raise two main questions. (i) What happens if the data never seen by the model do not come from the same distribution? (ii) How can we identify if the data used in inference come from the same distribution as the training dataset? A change in distribution may cause the model’s performance to be much lower than that achieved during training [
6]. That is, the model may perform very well on the validation set, but fail when deployed and evaluated if the data distribution changes [
7,
8].
Consequently, methods have been proposed in the literature to evaluate the generalization of a model under variations in the data distribution. First, there is a broad set of techniques for the exploitation and/or identification of out-of-distribution (OOD) data, i.e., outliers or samples outside the usual training distribution [9]. These include data augmentation, robust loss functions, and calibration techniques; data augmentation, for example, can be an effective alternative to robust models for real-world OOD data problems [10]. Furthermore, several methods propose improved solutions to increase the robustness of models to OOD data, or to identify OOD samples or labeling errors, as in the case of Cleanlab [11].
Secondly, there is work aimed at evaluating the impact of shifts in the data distribution. For example, in [
8], the performance of models trained on the CIFAR-10 dataset was evaluated against a modified version of the dataset, finding performance reductions of up to 15% in accuracy. Another method involves constructing parametric robustness datasets, whose distribution approximates the original data distribution and allows the evaluation of machine learning models’ robustness to changes in the data distribution [
12]. Frameworks have also been proposed to analyze machine learning models deployed in environments different from those in which they were trained and validated [
4,
13].
Similarly, there have been proposals aimed at improving the performance of machine learning models when the data they encounter deviate from the data they were trained on. For example, the use of self-learning techniques, in particular entropy minimization and pseudo-labeling, has been proposed [
14]. Normalization methods, specifically CrossNorm and SelfNorm, have also been used to improve generalization under distribution changes; here, CrossNorm is responsible for exchanging the mean and variance (at the channel level) between feature maps to broaden the training distribution, and SelfNorm allows recalibration of statistics in order to bridge the gap between the training and test distributions [
15]. Some authors have evaluated the use of data augmentation in the generalization process of a model, finding that it can be beneficial if such augmentation is focused on selectively randomizing spurious variations between domains [
10].
To answer the second question, some proposals for detecting dataset shift have been made in the literature. Dimensionality reduction and two-sample testing have been combined into a practical process that requires pre-trained classifiers to detect distribution changes in machine learning systems [
6]. Another approach uses GradNorm to detect OOD inputs using information extracted from gradient space. The general idea is based on the fact that the magnitude of gradients is larger for data from the same distribution than for OOD data [
16]. Studies on specific types of distribution shift have also been performed, such as subpopulation shift, characterized by changes in the proportion of some subpopulations between training and deployment [
17], or temporal shift of the dataset associated with changes over time in a given context [
18]. From a machine learning perspective, assessing distributional shifts between training and test data can be approached by training a supervised model on the training data and subsequently evaluating its performance across datasets with varying distributions. Specifically, the model’s predictive performance can be compared between a test dataset that mirrors the original training data distribution and one in which the distribution has demonstrably shifted. This methodology enables the detection and quantification of distributional changes, allowing for an analysis of how these shifts impact model performance and generalization [
16].
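To illustrate this model-based check, the following is a minimal sketch (not the implementation of [16]; the classifier, feature matrices, and variable names are placeholders) that quantifies a shift as the accuracy drop between an in-distribution test set and a possibly shifted one:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def performance_drop(X_train, y_train,
                     X_test_iid, y_test_iid,
                     X_test_shift, y_test_shift):
    """Train once on the original data, then compare accuracy on an
    in-distribution test set and on a (possibly) shifted test set."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    acc_iid = accuracy_score(y_test_iid, clf.predict(X_test_iid))
    acc_shift = accuracy_score(y_test_shift, clf.predict(X_test_shift))
    # A large positive drop suggests a distribution shift.
    return acc_iid - acc_shift
```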
In order to address the problem of shift in the data distribution, a collection of 12,868 sets of natural images classified into 410 classes has recently been made public. It is intended as a framework to facilitate the evaluation of methods or techniques to overcome data shift. This initiative is known as MetaShift, and its construction takes the context of each image into account in order to offer multiple subsets of each class in different scenarios. To differentiate these subsets, the authors include a distance metric and illustrate its values for the subsets provided in MetaShift, but they do not provide details of its implementation or resources that would allow its application to be replicated on other datasets [19].
Based on the above, the methods proposed to assess the degree of distribution shift in the data either lack global approaches that explain the differences between datasets in a human-understandable way [5], require long execution times, caused either by the computational cost of the methods used or by the need to fit and train supervised models for a preliminary classification, or do not provide sufficient information to replicate their implementation in new problems or scenarios.
Another alternative that could be used to identify differences in the distributions of the datasets is adversarial validation, which, to our knowledge, has not been explored in the context of computer vision. This approach, initially oriented towards structured (tabular) data [
20,
21,
22], consists of combining the two datasets (training and test) into a single labeled dataset, where the training data are labeled as one class (e.g., class X) and the test data as the other class (e.g., class Y). The idea of this method is to determine whether the two classes of this new dataset are easily separable. To accomplish this, a binary classifier is created and trained, and its performance is interpreted as a measure of the separability of the classes. If the classes are easily separable, the training and test data have different characteristics and therefore present a distribution shift. If, on the other hand, the performance of the classifier is close to chance, the data have similar behavior or characteristics, and there is no evidence of a distribution shift.
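As a reference, a minimal sketch of this classifier-based adversarial validation is given below, assuming scikit-learn and flattened image arrays; the function name and parameters are illustrative rather than taken from the cited works:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_test, seed=0):
    """Label the training data as 0 and the test data as 1, then measure how
    well a binary classifier separates them: an ROC AUC near 0.5 indicates
    similar distributions, while values close to 1.0 indicate a shift."""
    X = np.vstack([X_train.reshape(len(X_train), -1),
                   X_test.reshape(len(X_test), -1)])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```

For image data, X_train and X_test would hold, for example, the original and the modified samples as flattened pixels or embedding vectors.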
A review of the uses of this approach shows that, although it is common for structured data, it is rarely applied to other types of data, such as images for classification. Moreover, applying it involves defining, creating and training a classification model. Again, the question is whether it is possible to determine the distribution shift without the need to create and train a model. Answering this question is the aim of this article.
Accordingly, this paper presents a proposal to determine the degree of shift between the distributions of two datasets that can be applied without the need to train models multiple times, reducing the ecological footprint given its low computational cost, and that can complement the analysis of the generalization of classification models. To this end, a metric for evaluating dataset complexity is used, which can be applied to multi-class problems by comparing each pair of classes across the two sets. The contributions of this work are summarized below:
We propose a methodology using class-level adversarial validation to determine the degree of distribution shift in two image datasets through the evaluation of their complexity, allowing us to assess those modifications that have a strong impact on the generalization of the models.
The proposed methodology was tested on three well-known datasets, MNIST, CIFAR-10, and CIFAR-100, along with corrupted versions of each, showing that it can be applied without the need to train models.
We propose the use of the Cumulative Spectral Gradient (CSG) metric as the basis for class-level adversarial validation, comparing data from the same class under possible changes in their distribution.
2. Cumulative Spectral Gradient Metric
CSG is a metric designed to characterize the difficulty of a classification problem without the need to train a model, particularly the difficulty of its dataset. The calculation of this metric is based on the probabilistic divergence between classes within a spectral clustering framework [
23,
24].
First, the CSG method projects the input images into a lower-dimensional latent space, allowing the data features to better align with what the models learn. Under this projection, images with similar content stay close to each other, while dissimilar images move apart [
23]. The projection function used in this method corresponds to t-SNE (t-distributed Stochastic Neighbor Embedding) applied to the embedding of a Convolutional Neural Network (CNN) autoencoder. From this projection, the overlap between classes is estimated as the expected value ($\mathbb{E}$) of one distribution ($P$) over the other (Equation (1)):

$$\Phi_{i|j} = \mathbb{E}_{P(\phi(x)\mid C_j)}\!\left[ P(\phi(x)\mid C_i) \right] \qquad (1)$$

where $x$ represents the input samples, $\phi(x)$ corresponds to the projection of the samples with i.i.d. characteristics, and $C_i$ and $C_j$ denote the two classes.
The expected value can be approximated by applying the Monte Carlo method, as shown in Equation (2):

$$\Phi_{i|j} \approx \frac{1}{M}\sum_{m=1}^{M} P\!\left(\phi(x_m)\mid C_i\right) \qquad (2)$$

where $\phi(x_m)$ are the $M$ projections drawn from the distribution $P(\phi(x)\mid C_j)$.
Since $P(\phi(x)\mid C_i)$ is unknown, it can be approximated by a K-nearest-neighbor estimator, and the expected value in Equation (2) is estimated by averaging the probability of belonging to one class over the other across the $M$ samples. Thus, the approximate divergence between classes is given by Equation (3):

$$\hat{\Phi}_{i|j} = \frac{1}{M}\sum_{m=1}^{M} \frac{k_{C_i}}{M\,V} \qquad (3)$$

where $k_{C_i}$ is the number of neighbors of class $C_i$ around $\phi(x_m)$, $M$ is the number of samples selected in class $C_j$, and $V$ is the volume of the hypercube surrounding the $k$ closest samples to $\phi(x_m)$ in class $C_j$ [23].
The result of this process for a set of $K$ classes is a similarity matrix $S$ of dimensions $K \times K$, where $S_{ij}$ is the Monte Carlo approximation of the divergence between classes $C_i$ and $C_j$.
Subsequently, a Laplacian matrix is derived from the similarity matrix as $L = D - W$. For the calculation of $W$, each column of the matrix $S$ is considered a signature vector of its class, so that the similarity between two classes can be evaluated by applying a distance metric between their signature vectors. In this case, the Bray–Curtis distance is used (Equation (4)):

$$W_{ij} = 1 - \frac{\sum_{k=1}^{K} \lvert S_{ki} - S_{kj} \rvert}{\sum_{k=1}^{K} \left( S_{ki} + S_{kj} \right)} \qquad (4)$$

where $i$ and $j$ represent the two evaluated classes and $K$ the total number of classes. Thus, $W_{ij} = 0$ when there is no overlap between classes $i$ and $j$, and $W_{ij} = 1$ when the distributions are the same.
On the other hand, $D$ is a degree matrix, defined as given in Equation (5):

$$D_{ii} = \sum_{j=1}^{K} W_{ij}, \qquad D_{ij} = 0 \;\; \text{for } i \neq j \qquad (5)$$
The eigenvalues ($\lambda_i$) of $L$ make up its spectrum, and the gradient discontinuities in this spectrum are called eigengaps ($\Delta\lambda_i = \lambda_{i+1} - \lambda_i$). Thus, the CSG metric (Equation (6)) is calculated as the cumulative maximum (cummax) of the eigengaps, each normalized by its position:

$$\mathrm{CSG} = \sum_{i=1}^{K-1} \frac{\mathrm{cummax}(\Delta\lambda)_i}{i} \qquad (6)$$

where $\mathrm{cummax}(\Delta\lambda)_i = \max\{\Delta\lambda_1, \ldots, \Delta\lambda_i\}$.
The higher the value of the CSG metric, the more similar the two distributions are; the lower the value, the less the two sets overlap and the more easily they can be separated.
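To make the computation concrete, the following is a minimal sketch of CSG-based, class-level adversarial validation written from the steps described above, not from the authors' code. The class overlap is estimated here with a simple k-nearest-neighbor vote instead of the hypercube-volume estimator of Equation (3), the embeddings (e.g., t-SNE of autoencoder features) are assumed to be precomputed, and the values of k and M are illustrative; the normalization of the eigengaps follows the description given for Equation (6) and may differ from the original implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def similarity_matrix(embeddings, labels, k=5, m=100, seed=0):
    """S[i, j]: average estimated probability that a sample drawn from class j
    belongs to class i, using a k-NN vote in the embedding space (a
    simplification of the estimator in Equation (3))."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    # Note: the query points themselves are in the fitted set, which slightly
    # favors their own class; acceptable for a sketch.
    nn = NearestNeighbors(n_neighbors=k).fit(embeddings)
    S = np.zeros((len(classes), len(classes)))
    for j, cj in enumerate(classes):
        idx_j = np.where(labels == cj)[0]
        sel = rng.choice(idx_j, size=min(m, len(idx_j)), replace=False)
        _, neigh = nn.kneighbors(embeddings[sel])
        neigh_labels = labels[neigh]            # (m, k) neighbor labels
        for i, ci in enumerate(classes):
            S[i, j] = np.mean(neigh_labels == ci)
    return S

def csg(S):
    """Bray-Curtis adjacency, Laplacian, and cumulative spectral gradient
    (Equations (4)-(6))."""
    K = S.shape[0]
    W = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            W[i, j] = 1 - np.abs(S[:, i] - S[:, j]).sum() / (S[:, i] + S[:, j]).sum()
    L = np.diag(W.sum(axis=1)) - W              # L = D - W
    eigvals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(eigvals)                     # eigengaps
    cummax = np.maximum.accumulate(gaps)
    return float((cummax / np.arange(1, len(gaps) + 1)).sum())

# Class-level adversarial validation for one class and one corruption:
# emb_orig and emb_mod hold the embedded original and corrupted samples of
# the same class; the two sets play the roles of the two "classes".
# X = np.vstack([emb_orig, emb_mod])
# y = np.concatenate([np.zeros(len(emb_orig)), np.ones(len(emb_mod))])
# shift_score = csg(similarity_matrix(X, y))  # high -> similar distributions
```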
4. Results and Discussion
The results obtained after applying the proposed methodology to the three datasets explained above and their variants are shown below. Since the proposed methodology is focused on the evaluation at the class level,
Figure 5,
Figure 6,
Figure 7 and
Figure 8 show the impact of each of the modifications on each class. In this way, they allow us to identify which types of distortions have a greater impact on the distribution of the data. These results and their discussion are presented for each of the datasets.
4.1. MNIST Dataset
The results for the 160 validations performed on the MNIST dataset are shown in
Figure 5. The first case shows the results ordered according to the CSG metric and its respective relationship with data separability (
Figure 5a), where each column shows the result of the adversarial validation, represented by the value of the CSG metric for a given class (classes 0 to 9 in MNIST represented in a shade of blue) for each type of transformation. On the other hand,
Figure 5b shows the variability of the interclass CSG metric for each type of modification.
According to
Figure 5a, for the MNIST dataset, there are modifications where the data are not easily separable from the original data, implying that the distribution of these modified data deviates very little from the original distribution. This is the case for the noise addition modifications (impulse noise and shot noise), as well as the addition of splotches (spatter) and dotted lines to the MNIST images, which do not generate a significant impact. These are closely followed by the zigzag modification, which adds zigzag lines to the image in a controlled way by adapting the brightness of each straight segment.
In addition, in
Figure 5a, it is observed that affine transformations such as shear and rotate reach values that slightly exceed the 50th percentile of the metric, but other transformations that may involve loss of information, such as translation and scaling, cause the distribution of the modified data to move away from the distribution of the original data (CSG metric close to zero). It is also observed that distortions such as blurring and edge-only images do not help to preserve data with a distribution similar to that of the original data. Finally, in the MNIST-C dataset, the distortions of blurring, brightness and the inversion of the pixel values (stripe) completely distance the distribution of the modified data from the distribution of the original data (CSG values close to zero).
On the other hand, the analysis of the results by class in the MNIST dataset shows that there are modifications in which the results between classes are consistent (see Figure 5b). This tendency is present in modifications with consistently high CSG values across all classes, as is the case of impulse noise, spatter, dotted line and shot noise. It is also possible to find consistent modifications with low CSG values across all classes, as in the case of the distortions mentioned above that completely separate the modified data from the original data.
However, some modifications were found where the results of the CSG metric varied considerably between classes (see
Figure 5b). For example, the modification of zigzag addition to an image presents a difference of 0.54 between the class with the highest CSG value (class 0) and the class with the lowest CSG value (class 1). Other modifications with inter-class variability correspond to rotation (difference of 0.35 between class 0 (high) and class 7 (low)), Canny filter (difference of 0.34 between class 5 (high) and class 1 (low)) and shear transformation (difference of 0.25 between class 4 (high) and class 3 (low)). This means that there are corruptions that affect some classes more than others.
4.2. CIFAR-10 Dataset
The results for the 200 validations performed with the CIFAR-10 dataset and their modified datasets are shown in
Figure 6. This figure includes data sorted according to the average CSG metric for all classes (
Figure 6a), as well as data to analyze the inter-class variability of the metric (
Figure 6b).
In the case of CIFAR-10, the validation results are higher than for MNIST, with values generally above 0.4. The type of corruption that retains the greatest similarity with the original data distribution is blurring, with average values above 0.89. Of the seven digital processing operations, three show high CSG values, namely elastic transform, pixelate, and JPEG compression, modifications that largely preserve the characteristics of the original data.
Noise modifications closely follow this trend, with average values of 0.85, as well as corruptions such as spatter, contrast, saturate and fog, which maintain values above 0.8. Finally, the modifications that have the strongest impact on the data correspond to a digital processing modification (brightness) and two weather-type operations (snow and frost), the latter being the one with the worst performance. It should be noted that although the types of modification of the MNIST and CIFAR-10 datasets are not the same, the results show similarities in the worst-performing corruptions such as fog and brightness and similar values for blurring or noise modifications.
Regarding the behavior between the same classes (
Figure 6b), the results for CIFAR-10 show more uniformity than the MNIST results. This is mainly due to the type of corruptions applied to each dataset and to the differences in image characteristics. In any case, in CIFAR-10, the maximum difference between the class with the highest CSG value and the class with the lowest CSG value for the same type of modification does not exceed 0.15, whereas in MNIST this difference reached 0.54. Thus, the characteristics of the CIFAR-10 images, together with the types of modifications applied, result in low interclass variability.
4.3. CIFAR-100 Dataset
The results for the 2000 validations performed on the CIFAR-100 dataset are shown in
Figure 7 and
Figure 8. Due to the number of classes, the first result illustrates the behavior of the CSG metric for each type of modification in each of the 100 classes, sorted alphabetically by subclass (
Figure 7). The second case shows the variability of the interclass CSG metric for each type of modification (
Figure 8).
As in CIFAR-10, the results in CIFAR-100 (
Figure 7) show that the modifications with the least impact on the distribution of the data are those of the blurring type, with average values above 0.89. The order of the remaining modifications is similar to that of CIFAR-10 (although not exactly in the same order). In this case, those with the worst average performance in terms of the CSG metric are snow (0.72), brightness (0.70), and frost (0.57).
Although each curve in
Figure 7 shows the values for all one hundred classes, one important aspect that can be observed is that there are two classes that differ significantly from the rest of the classes (highly visible valleys shown on most curves). These are “pear” from the subclass “fruit and vegetables” and “pickup truck” from the subclass “Vehicles 1”. In most cases, the value of the CSG metric for these two classes is reduced to approximately fifty percent of the mean value for that type of modification.
This aspect can be seen more clearly in
Figure 8, where the points corresponding to these two classes for each type of modification are considerably different from the rest of the classes (note the two points located on the left in each of the horizontal ranges). Of these two points, the darker one corresponds to the pickup truck class and the lighter one corresponds to the pear class. The important point here is that these two classes greatly impact the models’ generalization to data that deviate from the original distribution due to modifications.
As a final analysis, the results of the present study show that not all types of modifications proposed for these traditional datasets present challenges of similar complexity when evaluating data distributions. Overall, the CIFAR datasets and their modifications can facilitate the generalization of models to a greater extent than the MNIST dataset. In particular, cases in which the CSG metric is close to 1 in the adversarial validation, i.e., when comparing the distribution of the original data with that of the modified data, may lead to higher performance of the classification models because the data (original and modified) are not easily separable, i.e., they present similar distributions.
4.4. Qualitative Analysis
According to the quantitative results on color images, the types of modifications with the greatest impact on the data distribution are frost, brightness and snow, whereas those with the least impact are the blurring modifications (Gaussian, defocus, motion and zoom). To illustrate qualitatively the reasons behind these results, an example showing the degree of change for a given image is presented below. In this case, since the CIFAR datasets have been modified with five levels of severity, both the original image and the images with these seven modifications were taken at the maximum level of severity (
Figure 9).
As shown in
Figure 9, the types of distortions with the least impact on the adversarial validation present a high spectral similarity with respect to the original image. At the spatial level, an acceptable structure is maintained that could be tolerated in the downsampling process performed by the CNN. On the other hand, the images of the three types of distortions with the greatest impact on the adversarial validation show important changes in the spectral information of the image, as well as distortions at the spatial level, particularly noticeable in frost.
In any case, it is important to note that the adversarial validation proposed in this paper is performed at the class level and not at the sample level. That is, the above example only tries to illustrate the possible reasons for the impact of each type of modification on the separability of the data.
4.5. Comparison with State-of-the-Art Methods
One of the recent initiatives addressing the problem of distribution shifts in image datasets is MetaShift, a collection of 12,868 natural image sets of 410 classes, which takes the image context into account in its structuring [
19]. In addition, the work performed by these researchers includes a score (
d) that measures the distance between any two subsets.
MetaShift has many subsets of a class, and each subset corresponds to the class in a different context. In this way, MetaShift facilitates the evaluation of changes in these subsets taking into account the context of the change [
19]. In order to compare the results of the present paper with the results illustrated in the aforementioned paper [
19], some MetaShift subsets were selected and the similarity between these subsets is compared with respect to the values obtained with the adversarial validation methodology of the present paper.
The selected subsets correspond to a binary classification task, and the evaluation considers the distance between a class under a given context and the same class under a different context. Four different training subsets were taken from the Dog class: Dog (cabinet + bed), Dog (bag + box), Dog (bench + bike) and Dog (boat + surfboard). Each of these subsets is compared against the Dog (shelf) subset using a distance metric, given by both MetaShift and the proposed methodology.
The two distance metrics shown in
Table 1 (
d and CSG) quantitatively indicate the degree of similarity between the two contexts evaluated. The two metrics coincide in identifying that the dataset with the highest affinity with the subset
Dog (shelf) is
Dog (cabinet + bed), and that the dataset with the lowest affinity is
Dog (boat + surfboard). The degree of difficulty assigned to the other two subsets is also consistent between the two metrics. Here, two aspects are important. The first is that the trends of the metrics are inverse, i.e., high values of the MetaShift distance and low values of CSG correspond to a higher degree of distribution shift between the subsets. The second is that the MetaShift distance has no maximum value, which can make it difficult to judge the degree of distribution shift. In the case of CSG-based adversarial validation, the results are easier to interpret because the metric ranges between 0.0 and 1.0 when two classes are compared.
Although the MetaShift publication does not provide information on computational cost, the execution time of the proposed methodology is reported here. This time corresponds to the computation of the distance between subsets using CSG on a Google Colaboratory machine with a CPU with 2 logical cores (1 physical core) operating at 2200 MHz, 12,978 MB of RAM, and no GPU.
5. Conclusions
This article proposes a methodology to perform adversarial validation on multi-class datasets. The method is based on the analysis of the separability between data of the same class as these data change. Representative state-of-the-art datasets were used for validation, together with modified versions, also available in the literature, based on distortions such as noise, blurring, or affine transformations. To perform class-level adversarial validation by assessing the complexity of separating data of the same class, the Cumulative Spectral Gradient metric was used. The results show that the CSG metric enables this comparison, making it possible to determine the degree to which each type of data corruption affects the data, as well as the classes that are most affected by such modifications. The presented methodology can be applied without the need to train models and can complement the analysis of the generalization of classification models. In addition, it can be used to evaluate the influence of different data augmentation techniques when these are required to address robustness issues in real environments and to mitigate underfitting problems.
In addition, this study provides a framework for evaluating the impact of data configuration processes prior to the training and evaluation of deep learning models. By assessing potential distributions or changes in new data, this approach can help researchers observe and understand the effectiveness and robustness of potential AI solutions and applications in changing environments, as well as their inherent limitations. Consequently, it can serve as a critical tool for refining deep learning techniques, ensuring that they are more resilient and better equipped to handle the complexities of real-world applications. This systematic scrutiny is fundamental to advancing the field of artificial intelligence, as it guides the optimization process to improve the performance and reliability of models in a variety of environments. It also aims to promote the study of new algorithms that guarantee an appropriate split of the data, ensuring a data distribution that reflects real-world conditions.
Finally, in future work, it is planned to contrast the results of the proposed methodology with the performance of classifiers based on recent architectures such as ConvNext [
29,
30], ConvNext2 [
31] or ViT [
32].