Article

Adversarial Validation in Image Classification Datasets by Means of Cumulative Spectral Gradient

1
Facultad de Ingeniería, Universidad Militar Nueva Granada, Carrera 11 101-80, Bogotá 110111, Colombia
2
Facultad de Ingeniería, Universidad Panamericana, Augusto Rodin 498, Ciudad de México 03920, Mexico
*
Author to whom correspondence should be addressed.
Algorithms 2024, 17(11), 531; https://doi.org/10.3390/a17110531
Submission received: 11 October 2024 / Revised: 8 November 2024 / Accepted: 14 November 2024 / Published: 19 November 2024
(This article belongs to the Special Issue Machine Learning Algorithms for Image Understanding and Analysis)

Abstract

The main objective of a machine learning (ML) system is to obtain a trained model from input data in such a way that it allows predictions to be made on new i.i.d. (Independently and Identically Distributed) data with the lowest possible error. However, how can we assess whether the training and test data have a similar distribution? To answer this question, this paper presents a proposal to determine the degree of distribution shift between two datasets. To this end, a metric for evaluating complexity in datasets is used, which can be applied in multi-class problems, comparing each pair of classes of the two sets. The proposed methodology has been applied to three well-known datasets, MNIST, CIFAR-10 and CIFAR-100, together with corrupted versions of these. Through this methodology, it is possible to evaluate which types of modification have a greater impact on the generalization of the models without the need to repeatedly train multiple models, while also determining which classes are more affected by corruption.

1. Introduction

In an image classification system based on machine learning, the main objective is to obtain a trained model from the input data in such a way that it allows predictions to be made on new data with the lowest possible error. These new data are said to be sampled from the same data distribution used during training [1]. In other words, the characteristics of the data used with the model in production must be similar to those of the data used for training. This concept is usually referred to as an i.i.d. (Independently and Identically Distributed) data condition, where the data distribution remains constant, and each extracted value is independent of previously extracted values. This situation can be considered an ideal environment for a machine learning system, but it is not always achieved in practice.
The importance of this aspect is related to the evaluation and performance of the model, particularly with regard to training error and generalization error. In the first case, there are usually no drawbacks since the training error is evaluated on the same training/validation dataset. In contrast, the generalization error corresponds to the evaluation of the model on a continuous stream of additional data that, in principle, should be drawn from the same data distribution as the original samples [1,2].
However, if the inference data show a changing dynamic, i.e., if they do not necessarily resemble previous data or if their behavior may change over time, the i.i.d. condition will not be met, but instead a distribution shift (training and test data distributions are different) will be present [3]. A distribution shift occurs when the data encountered by a machine learning model deviate from the data it was trained on [4]. Real-world applications often face data-related challenges, such as differences in distributions between two comparable datasets in the same domain [5], or gradual differences over time as the real world evolves (e.g., image or sensor data collected over extended periods) [4].
This may raise two main questions. (i) What happens if the data never seen by the model do not come from the same distribution? (ii) How can we identify if the data used in inference come from the same distribution as the training dataset? A change in distribution may cause the model’s performance to be much lower than that achieved during training [6]. That is, the model may perform very well on the validation set, but fail when deployed and evaluated if the data distribution changes [7,8].
Consequently, methods have been proposed in the literature to evaluate the generalization of a model under variations in the data distribution. First, there is a broad set of techniques proposed for the exploitation and/or identification of out-of-distribution (OOD) data, i.e., outliers or samples outside the usual training distribution [9]. Such techniques may include data augmentation, robust loss functions, or calibration techniques, where data augmentation, for example, can be an effective alternative to robust models for real-world OOD data problems [10]. Furthermore, several methods propose improved solutions to increase the robustness of models to OOD, or also for the identification of OOD samples or labeling errors, as in the case of Cleanlab [11].
Secondly, there is work aimed at evaluating the impact of displacement on data distribution. For example, in [8], the performance of models trained on the CIFAR-10 dataset was evaluated against a modified version of the dataset, finding performance reductions of up to 15% in accuracy. Another method involves constructing parametric robustness datasets, whose distribution approximates the original data distribution and allows the evaluation of machine learning models’ robustness to changes in the data distribution [12]. Frameworks have also been proposed to analyze machine learning models deployed in environments different from those in which they were trained and validated [4,13].
Similarly, there have been proposals aimed at improving the performance of machine learning models when the data they encounter deviate from the data they were trained on. For example, the use of self-learning techniques, in particular entropy minimization and pseudo-labeling, has been proposed [14]. Normalization methods, specifically CrossNorm and SelfNorm, have also been used to improve generalization under distribution changes; here, CrossNorm is responsible for exchanging the mean and variance (at the channel level) between feature maps to broaden the training distribution, and SelfNorm allows recalibration of statistics in order to bridge the gap between the training and test distributions [15]. Some authors have evaluated the use of data augmentation in the generalization process of a model, finding that it can be beneficial if such augmentation is focused on selectively randomizing spurious variations between domains [10].
To answer the second question, some proposals for detecting dataset shift have been made in the literature. Dimensionality reduction and two-sample testing have been combined into a practical process that requires pre-trained classifiers to detect distribution changes in machine learning systems [6]. Another approach uses GradNorm to detect OOD inputs using information extracted from gradient space. The general idea is based on the fact that the magnitude of gradients is larger for data from the same distribution than for OOD data [16]. Studies on specific types of distribution shift have also been performed, such as subpopulation shift, characterized by changes in the proportion of some subpopulations between training and deployment [17], or temporal shift of the dataset associated with changes over time in a given context [18]. From a machine learning perspective, assessing distributional shifts between training and test data can be approached by training a supervised model on the training data and subsequently evaluating its performance across datasets with varying distributions. Specifically, the model’s predictive performance can be compared between a test dataset that mirrors the original training data distribution and one in which the distribution has demonstrably shifted. This methodology enables the detection and quantification of distributional changes, allowing for an analysis of how these shifts impact model performance and generalization [16].
To address the problem of distribution shift, a collection of 12,868 sets of natural images classified into 410 classes has recently been made public. It is intended as a framework to facilitate the evaluation of methods or techniques to overcome data shift. This initiative is known as MetaShift, and its development takes into account the context of the image, in order to offer multiple subsets of each class in different scenarios. To differentiate these subsets, the authors include a distance metric and report its values for the MetaShift subsets, but they do not provide implementation details or resources that would allow it to be applied to other datasets [19].
Based on the above, the methods proposed to assess the degree of distribution shift in the data are characterized by a lack of global approaches that explain the differences between datasets in a human-understandable way [5], may require long execution times, caused either by the computational cost of the methods used or by the need to fit and train supervised models for a preliminary classification, or lack sufficient information to replicate their implementation in new problems or scenarios.
Another alternative that could be used to identify differences in the distributions of the datasets is adversarial validation, which, to our knowledge, has not been explored in the context of computer vision. This approach, initially oriented towards structured (tabular) data [20,21,22], consists of combining the two datasets (training and test) into a single labeled dataset, where the training data are labeled as the first class (e.g., class X) and the test data as the other class (e.g., class Y). The idea of this method is to determine whether the two classes of this new dataset are easily separable. To accomplish this, a binary classifier is created and trained, and its performance is interpreted as a measure of the separability of the classes. If the classes are easily separable, the training and test data have different characteristics, and thus a distribution shift is present. If, on the other hand, the performance of the classifier is very low, the two sets have similar characteristics and there is no evidence of a distribution shift.
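As a point of reference, the classical adversarial validation scheme for tabular data can be sketched as follows. This is a minimal illustration rather than the implementation of the cited works: the choice of classifier and all variable names are assumptions, and the cross-validated ROC AUC is read as a separability score (values near 0.5 suggest similar distributions; values near 1.0 suggest a strong shift).

```python
# Minimal sketch of classical adversarial validation on tabular data.
# The classifier choice and variable names are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_test, n_splits=5):
    """Label training rows as 0 and test rows as 1, then measure how well a
    classifier separates them (ROC AUC ~ 0.5: similar distributions;
    ROC AUC ~ 1.0: strong distribution shift)."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=n_splits, scoring="roc_auc")
    return scores.mean()
```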
A review of the applications of this approach shows that, although it is commonly used with structured data, it is rarely applied to other types of data, such as images. Moreover, applying this solution involves defining, creating and training a classification model. Again, the question is whether it is possible to determine the distribution shift without the need to create and train a model. Therefore, the aim of this article is to answer this question.
Accordingly, this paper presents a proposal to determine the degree of shift between the distributions of two datasets, which can be applied without the need to repeatedly train multiple models, reducing the ecological footprint given its low computational cost, and which can complement the analysis of the generalization of classification models. To this end, a metric for evaluating complexity in datasets is used, which can be applied in multi-class problems, comparing each pair of classes of the two sets. The contributions of this work are summarized below:
  • We propose a methodology using class-level adversarial validation to determine the degree of distribution shift in two image datasets through the evaluation of their complexity, allowing us to assess those modifications that have a strong impact on the generalization of the models.
  • The proposed methodology was tested on three well-known datasets: MNIST, CIFAR-10, and CIFAR-100, along with their corrupted versions, showing that this methodology can be applied without the need to train models.
  • It is proposed to use the Cumulative Spectral Gradient (CSG) metric as the basis for class-level adversarial validation to compare data from the same class in the face of possible changes in their distribution.

2. Cumulative Spectral Gradient Metric

CSG is a metric designed to characterize the difficulty of a classification problem without the need to train a model, particularly the difficulty of its dataset. The calculation of this metric is based on the probabilistic divergence between classes within a spectral clustering framework [23,24].
First, the CSG method projects the input images into a lower-dimensional latent space, allowing the data features to better align with what the models learn. With this projection, images with similar content remain close to each other, while dissimilar images move apart [23]. The projection function used in this method corresponds to t-SNE (t-distributed Stochastic Neighbor Embedding) applied to the embedding of a Convolutional Neural Network (CNN) autoencoder. From this projection, the overlap between classes is estimated as the expected value (E) of one distribution (P) over the other (Equation (1)):
$$\mathbb{E}_{P(\phi(x)\mid C_i)}\left[\,P(\phi(x)\mid C_j)\,\right] \quad \text{or} \quad \mathbb{E}_{P(\phi(x)\mid C_j)}\left[\,P(\phi(x)\mid C_i)\,\right], \tag{1}$$
where $x$ represents the input samples, $\phi(x)$ corresponds to the projection of the (i.i.d.) samples, and $C_i$ and $C_j$ denote the two classes.
The expected value could be approximated by applying the Monte Carlo method as shown in Equation (2):
$$\mathbb{E}_{P(\phi(x)\mid C_i)}\left[\,P(\phi(x)\mid C_j)\,\right] \approx \frac{1}{M}\sum_{m=1}^{M} P(\phi(x_m)\mid C_j), \tag{2}$$
where $\phi(x_1), \phi(x_2), \ldots, \phi(x_{M-1}), \phi(x_M)$ are the $M$ projections drawn from the distribution $P(\phi(x)\mid C_i)$.
Since $P(\phi(x)\mid C_j)$ is unknown, it could be approximated by a K-nearest estimator, and the expected value in Equation (2) is estimated by averaging the probability of belonging to one class over the other across M samples. Thus, the approximate divergence between classes is given by Equation (3):
$$\mathbb{E}_{P(\phi(x)\mid C_i)}\left[\,P(\phi(x)\mid C_j)\,\right] \approx \frac{1}{M}\sum_{m=1}^{M} \frac{K_{C_j}}{M \cdot V}, \tag{3}$$
where $K_{C_j}$ is the number of neighbors of $\phi(x)$ belonging to class $C_j$, $M$ is the number of samples selected in class $C_j$, and $V$ is the volume of the hypercube surrounding the $k$ closest samples to $\phi(x)$ in class $C_j$ [23].
The result of this process for a set of $K$ classes is a similarity matrix $S$ of dimensions $K \times K$, with $S_{ij}$ being the Monte Carlo approximation of the divergence between $C_i$ and $C_j$.
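As a rough illustration of how such a similarity matrix can be obtained, the sketch below uses a k-nearest-neighbor approximation of the class overlap in Equations (2) and (3). It assumes the samples have already been projected into the latent space (e.g., t-SNE applied to autoencoder embeddings), and it replaces the explicit hypercube-volume term with the fraction of neighbors belonging to each class; it is not the reference implementation of [23].

```python
# Simplified sketch of the class-overlap estimate behind Equations (2) and (3).
# `embeddings` holds the latent projections phi(x) (shape (N, d)) and `labels`
# their class indices; the hypercube-volume term is replaced by the fraction
# of k nearest neighbours of each sampled point that fall in class j.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def similarity_matrix(embeddings, labels, n_classes, M=250, k=20, seed=0):
    embeddings, labels = np.asarray(embeddings), np.asarray(labels)
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    S = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        idx_i = np.flatnonzero(labels == i)
        sample = rng.choice(idx_i, size=min(M, len(idx_i)), replace=False)
        _, neigh = nn.kneighbors(embeddings[sample])
        neigh_labels = labels[neigh[:, 1:]]   # drop each query point itself
        for j in range(n_classes):
            # Monte Carlo estimate of P(class j) around samples of class i
            S[i, j] = np.mean(neigh_labels == j)
    return S
```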
Subsequently, a Laplacian matrix derived from the similarity matrix is calculated as $L = D - W$. For the calculation of $W$, each column of the matrix $S$ is considered a vector of signatures of each class, so that, to evaluate the similarity between two classes, it is sufficient to apply a distance metric between their signature vectors. In this case, the Bray–Curtis distance (Equation (4)) is used:
$$w_{ij} = 1 - \frac{\sum_{k}^{K} \lvert S_{ik} - S_{jk}\rvert}{\sum_{k}^{K} \lvert S_{ik} + S_{jk}\rvert}, \tag{4}$$
where $i$ and $j$ represent the two evaluated classes and $K$ the total number of classes. Thus, $w_{ij} = 0$ when there is no overlap between classes $i$ and $j$, and $w_{ij} = 1$ when the distributions are the same.
On the other hand, D is a degree matrix and it is defined as given in Equation (5):
$$D_{ii} = \sum_{j} w_{ij}. \tag{5}$$
The eigenvalues $\lambda_i$ of $L$ make up its spectrum, and the gradient discontinuities in this spectrum are called eigengaps ($\lambda_{i+1} - \lambda_i$). Thus, the CSG metric (Equation (6)) is calculated as the cumulative maximum (cummax) of the eigengaps $\Delta\tilde{\lambda}$ normalized by their position:
$$\mathrm{CSG} = \sum_{i} \operatorname{cummax}\!\left(\Delta\tilde{\lambda}\right)_i, \tag{6}$$
where
$$\Delta\tilde{\lambda}_i = \frac{\lambda_{i+1} - \lambda_i}{K - i}.$$
The higher the value of the CSG metric, the greater the overlap between the two sets (i.e., the more similar their distributions); the lower the value, the more separable the sets are.
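To make the remaining steps concrete, the following sketch computes the CSG value from a similarity matrix S following Equations (4)–(6): W is built from the class signature vectors with the Bray–Curtis distance, the Laplacian L = D - W is formed, and the position-normalized eigengaps are accumulated. It is a simplified reading of the method rather than the authors' implementation.

```python
# Sketch of Equations (4)-(6): Bray-Curtis weights, Laplacian, eigengaps, CSG.
import numpy as np
from scipy.spatial.distance import braycurtis

def csg_from_similarity(S):
    K = S.shape[0]
    W = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            # w_ij = 1 - Bray-Curtis distance between class signature columns
            W[i, j] = 1.0 - braycurtis(S[:, i], S[:, j])
    D = np.diag(W.sum(axis=1))            # degree matrix, Equation (5)
    L = D - W                             # Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))
    # eigengaps normalised by their position, Equation (6)
    gaps = (eigvals[1:] - eigvals[:-1]) / (K - np.arange(1, K))
    return float(np.sum(np.maximum.accumulate(gaps)))
```

In the class-level adversarial validation of Section 3, the two "classes" are the original and corrupted samples of the same category, so S reduces to a 2 × 2 matrix.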

3. Materials and Methods

In order to assess whether a test dataset has characteristics similar to those of the training dataset, the methodology illustrated in Figure 1 is presented. This methodology performs an adversarial validation scheme on each class of the dataset: the separability of the training data is evaluated against data that do not necessarily follow the same distribution, and this separability is compared with that obtained between the training data and the original (unmodified) test set. The datasets used and the main phases of the adversarial validation methodology are outlined below.

3.1. Original Datasets

To evaluate the proposed methodology, three datasets of wide recognition in the world of machine learning and computer vision were used: MNIST, CIFAR-10, and CIFAR-100. However, this methodology can be applied to datasets with different image sizes, total number of images in the dataset or different number of samples per class.
MNIST is a reference dataset due to its simplicity and widespread availability. It includes grayscale images of handwritten decimal system digits (ten classes). MNIST images are low-resolution ( 28 × 28 ), with a total of 70,000 images, distributed in 60,000 for training and 10,000 for testing, with an equal number of samples per class [25]. Moreover, the MNIST dataset constitutes a subset extracted from the NIST’s Special Database (SD)-1 and Special Database (SD)-3, both of which include binary images of handwritten digits. The digits in the MNIST dataset have been standardized in size and centered within fixed-size images. Specifically, the training set comprises 30,000 patterns from SD-1 and an additional 30,000 from SD-3, while the test set consists of 5000 patterns from each of these databases. The training dataset incorporates samples contributed by approximately 250 different writers. The MNIST dataset is publicly available from https://yann.lecun.com/exdb/mnist/ (accessed on 4 April 2024).
CIFAR-10 is a low-resolution ( 32 × 32 pixels) color image dataset that has been widely used for the validation of new object classification and recognition algorithms. This dataset comprises 60,000 images, each with a single label of 10 mutually exclusive classes. It also presents a distribution of 50,000 images for training and 10,000 images for testing, with an equal number of images in each class in both the training and test sets [26].
On the other hand, CIFAR-100 is a dataset similar to CIFAR-10, but differs in two fundamental aspects: the number of classes and the number of samples. In this case, 20 super classes and 100 different classes have been defined (i.e., each super class groups five classes). Each of the samples in the dataset has been labeled both with the superclass to which it belongs and with its particular class. As for the number of examples, the difference lies in the number of samples per class (600 images per class) and not in the total number of samples (60,000 images), showing a distribution similar to that of CIFAR-10 [26]. Both CIFAR-10 and CIFAR-100 datasets are publicly available from https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 4 April 2024).

3.2. Corrupted Datasets

MNIST-C is a dataset proposed by Google researchers from corruptions of MNIST images. It was proposed to facilitate the evaluation of OOD robustness in computer vision algorithms. According to its authors, the corruptions are realistic, diverse and are designed in such a way that they preserve the semantic content of the underlying image. Another important aspect is that a simple data augmentation does not trivially solve the challenge of this dataset. To the original MNIST dataset, 15 corruptions related to noise (shot noise, impulse noise), blurring (glass blur, motion blur), affine transformations (shear, translate, scale, rotate), digital processing (brightness, stripe, zigzag, dotted line, Canny edge detector) and weather (fog, spatter) have been applied [27]. Examples of the types of corruptions in MNIST are shown in Figure 2.
As for the CIFAR datasets, there are modified versions for 19 corruptions. Each corruption has been applied with five levels of severity, where the lowest level was applied to the first ten thousand images, and the highest level to the last ten thousand. The corruptions can be grouped into four categories: noise (Gaussian, shot, impulse, speckle), blur (defocus, glass, motion, zoom, Gaussian), weather (snow, frost, fog, spatter) and digital (brightness, contrast, elastic transform, pixelate, jpeg compression, saturate). These corruptions took into account aspects such as semantic preservation of content or the use of real-world-like corruptions, and their severity was adjusted to identify a level that reduces model performance while retaining semantic integrity [28]. Figure 3 shows examples of corruption found in CIFAR-10-C and CIFAR-100-C datasets.
Consequently, the idea in this study is to compare each of the classes of the MNIST, CIFAR-10, and CIFAR-100 training subsets with respect to the modified data in each of the MNIST-C, CIFAR-10-C, and CIFAR-100-C classes. This evaluation is performed under varying conditions, i.e., in each of the corruption types listed above, in order to evaluate the distribution shift using the method described below.
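As an illustration of this pairing step, the following sketch selects Set X (one class of the original training set) and Set Y (the same class in a corrupted test set). It assumes the corrupted datasets are distributed as NumPy arrays, as in the public MNIST-C and CIFAR-10-C releases; the file names in the commented example are hypothetical.

```python
# Hedged sketch: pairing one class of the original training set (Set X) with
# the same class of a corrupted test set (Set Y). File names and array layout
# are assumptions and may differ from the actual dataset releases.
import numpy as np

def select_class_pair(train_images, train_labels,
                      corrupted_images, corrupted_labels, class_id):
    """Return (Set X, Set Y) for a single class and a single corruption."""
    set_x = train_images[train_labels == class_id]           # original samples
    set_y = corrupted_images[corrupted_labels == class_id]   # corrupted samples
    return set_x, set_y

# Example (hypothetical paths): CIFAR-10-C stores one .npy file per corruption,
# with the five severity levels stacked along the first axis.
# images_c = np.load("CIFAR-10-C/fog.npy")      # e.g., shape (50000, 32, 32, 3)
# labels_c = np.load("CIFAR-10-C/labels.npy")   # e.g., shape (50000,)
# set_x, set_y = select_class_pair(x_train, y_train, images_c, labels_c, class_id=3)
```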

3.3. Adversarial Validation

In order to perform the adversarial validation, the data of a class of the dataset are compared against data that, in principle, belong to the same class. This evaluation is performed by applying the CSG metric presented in Section 2, and although the calculation of the CSG metric involves all the steps described in that section, in the end, it is only necessary to set two main parameters: the number of samples selected per class (M) and the number of neighbors in the K-nearest estimator ($K_{C_j}$) given in Equation (3). Tuning these values is straightforward; in our case, 250 and 20, respectively, were used for the three datasets. Consequently, if the data show similar characteristics (they come from similar distributions), a high metric value will be obtained; otherwise, the CSG value will be low (high separability between data of the "same" class).
The process carried out to perform the adversarial validation between the original datasets presented in Section 3.1 and the modified versions of these datasets was carried out both at the class and the corruption level. The stages of the adversarial validation process are described as follows, and Figure 4 shows the corresponding flowchart diagram:
  • Stage 1: Data Selection. Select a class from the original training dataset (referred to as Set X) and the corresponding class from the modified test dataset that has undergone a specific type of corruption (referred to as Set Y).
  • Stage 2: Class-Level Adversarial Validation. Compute the CSG metric between Set X and Set Y, as described in Section 2. Since the data belong to the same class, the metric estimates the degree of distribution shift of the modified data with respect to the original training data. If the two sets have a similar distribution, the data are hardly separable, so a high CSG value (close to 1) is obtained. Conversely, a low CSG value (close to zero) indicates that data of the same class (original and corrupted) are easily separable.
  • Stage 3: Dataset-Level Adversarial Validation. Stages 1–2 are repeated for each of the classes of each of the corrupted datasets to be evaluated.
  • Stage 4: Corruption-level Adversarial Validation. Finally, stages 1–3 are applied for each of the corruptions of the evaluated datasets.
In addition, as a benchmark, the comparison is also performed against the original (unmodified) test dataset. Thus, for the MNIST dataset, 160 validations (10 classes × (15 manipulation types + 1 original test data)) were performed. For the CIFAR-10 dataset, 200 validations were performed (10 classes × (19 types of manipulation + 1 original test data)), whereas for the CIFAR-100 dataset, 2000 validations were performed (100 classes × (19 types of manipulation + 1 original test data)).
Thus, following the stages shown in Figure 4, the adversarial validation process systematically assesses the similarity and separability of data within the same class across original and corrupted versions of the datasets. This comprehensive assessment helps to understand the impact of different types of corruption on data integrity and possible changes in data distribution over time.
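A minimal sketch of Stages 1–4 is given below. The helper compute_csg is assumed to implement the CSG metric of Section 2 for the two-set problem (for instance, by combining the sketches given there); the data structures are illustrative.

```python
# Minimal sketch of Stages 1-4 of the adversarial validation procedure.
# `compute_csg(set_x, set_y)` is an assumed helper implementing the CSG metric
# of Section 2 for the two-class problem {original, corrupted}.
import numpy as np

def adversarial_validation(train_images, train_labels,
                           corrupted_sets, classes, compute_csg):
    """corrupted_sets: dict mapping corruption name -> (images, labels).
    Returns results[corruption][class_id] = CSG value."""
    results = {}
    for corruption, (images_c, labels_c) in corrupted_sets.items():    # Stage 4
        results[corruption] = {}
        for class_id in classes:                                       # Stage 3
            set_x = train_images[train_labels == class_id]             # Stage 1
            set_y = images_c[labels_c == class_id]
            results[corruption][class_id] = compute_csg(set_x, set_y)  # Stage 2
    return results
```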

4. Results and Discussion

The results obtained after applying the proposed methodology to the three datasets explained above and their variants are shown below. Since the proposed methodology is focused on the evaluation at the class level, Figure 5, Figure 6, Figure 7 and Figure 8 show the impact of each of the modifications on each class. In this way, they allow us to identify which types of distortions have a greater impact on the distribution of the data. These results and their discussion are presented for each of the datasets.

4.1. MNIST Dataset

The results for the 160 validations performed on the MNIST dataset are shown in Figure 5. The first case shows the results ordered according to the CSG metric and its respective relationship with data separability (Figure 5a), where each column shows the result of the adversarial validation, represented by the value of the CSG metric for a given class (classes 0 to 9 in MNIST represented in a shade of blue) for each type of transformation. On the other hand, Figure 5b shows the variability of the interclass CSG metric for each type of modification.
According to Figure 5a, for the MNIST dataset, there are modifications whose data are not easily separable from the original data, implying that the distribution of these modified data deviates very little from the original distribution. This is the case for the noise-addition modifications (impulse noise and shot noise), as well as the addition of splotches (spatter) and dotted lines to the MNIST images, none of which generate a significant impact. These are closely followed by the zigzag corruption, which adds zigzag patterns to the image in a controlled way by adjusting the brightness of each straight segment.
In addition, in Figure 5a, it is observed that affine transformations such as shear and rotate reach values that slightly exceed the 50th percentile of the metric, but other transformations that may involve loss of information, such as translation and scaling, cause the distribution of the modified data to move away from the distribution of the original data (CSG metric close to zero). It is also observed that distortions such as blurring and edge-only images do not help to preserve data with a distribution similar to that of the original data. Finally, in the MNIST-C dataset, the distortions of blurring, brightness and the inversion of the pixel values (stripe) completely distance the distribution of the modified data from the distribution of the original data (CSG ≈ 0).
On the other hand, the analysis of the results by class in the MNIST dataset shows that there are modifications in which the results between classes are consistent (see Figure 5b). This tendency is present in modifications with high CSG values (CSG above 0.9 for all classes), as is the case of impulse noise, spatter, dotted line and shot noise. Consistently low CSG values (below 0.1 for all classes) are also found, as in the case of the brightness, stripe and blurring distortions noted above.
However, some modifications were found where the results of the CSG metric varied considerably between classes (see Figure 5b). For example, the modification of zigzag addition to an image presents a difference of 0.54 between the class with the highest CSG value (class 0) and the class with the lowest CSG value (class 1). Other modifications with inter-class variability correspond to rotation (difference of 0.35 between class 0 (high) and class 7 (low)), Canny filter (difference of 0.34 between class 5 (high) and class 1 (low)) and shear transformation (difference of 0.25 between class 4 (high) and class 3 (low)). This means that there are corruptions that affect some classes more than others.

4.2. CIFAR-10 Dataset

The results for the 200 validations performed with the CIFAR-10 dataset and their modified datasets are shown in Figure 6. This figure includes data sorted according to the average CSG metric for all classes (Figure 6a), as well as data to analyze the inter-class variability of the metric (Figure 6b).
In the case of CIFAR-10, the validation results are higher than in MNIST, being generally above 0.4. The type of corruption that retains the greatest similarity with the original data distribution is blurring, with average values above 0.89. Among the digital processing operations, three show high CSG values, namely elastic transform, pixelate, and jpeg compression; these modifications largely preserve the characteristics of the original data.
Noise modifications closely follow this trend, with average values of 0.85, as well as corruptions such as spatter, contrast, saturate and fog, which maintain values above 0.8. Finally, the modifications that have the strongest impact on the data correspond to a digital processing modification (brightness) and two weather-type operations (snow and frost), the latter being the one with the worst performance. It should be noted that although the types of modification of the MNIST and CIFAR-10 datasets are not the same, the results show similarities in the worst-performing corruptions such as fog and brightness and similar values for blurring or noise modifications.
Regarding the behavior across classes (Figure 6b), the results for CIFAR-10 show more uniformity compared to the MNIST results. This is mainly due to the type of corruptions applied to each dataset and to the differences between the image characteristics. In any case, in CIFAR-10, the maximum difference between the class with the highest CSG value and the class with the lowest CSG value for the same type of modification does not exceed 0.15, whereas in MNIST, this difference reached 0.54. Thus, the characteristics of the CIFAR-10 images, together with the types of modifications applied, result in low interclass variability.

4.3. CIFAR-100 Dataset

The results for the 2000 validations performed on the CIFAR-100 dataset are shown in Figure 7 and Figure 8. Due to the number of classes, the first result illustrates the behavior of the CSG metric for each type of modification in each of the 100 classes, sorted alphabetically by subclass (Figure 7). The second case shows the variability of the interclass CSG metric for each type of modification (Figure 8).
As in CIFAR-10, the results in CIFAR-100 (Figure 7) show that the modifications with the least impact on the distribution of the data are those of the blurring type, with average values above 0.89. The order of the remaining modifications is similar to that of CIFAR-10 (although not exactly in the same order). In this case, those with the worst average performance in terms of the CSG metric are snow (0.72), brightness (0.70), and frost (0.57).
Although each curve in Figure 7 shows the values for all one hundred classes, one important aspect that can be observed is that there are two classes that differ significantly from the rest of the classes (highly visible valleys shown on most curves). These are “pear” from the subclass “fruit and vegetables” and “pickup truck” from the subclass “Vehicles 1”. In most cases, the value of the CSG metric for these two classes is reduced to approximately fifty percent of the mean value for that type of modification.
This aspect can be seen more clearly in Figure 8, where the points corresponding to these two classes for each type of modification are considerably different from the rest of the classes (note the two points located on the left in each of the horizontal ranges). Of these two points, the darker one corresponds to the pickup truck class and the lighter one corresponds to the pear class. The important point here is that these two classes greatly impact the models’ generalization to data that deviate from the original distribution due to modifications.
As a final analysis, the results of the present study show that not all types of modification proposed for traditional datasets present challenges of similar complexity when evaluating data distributions. Overall, the CIFAR datasets and their modifications are more favorable for model generalization than the MNIST dataset and its corrupted version. In particular, cases in which the CSG metric is close to 1 in the adversarial validation, i.e., when comparing the distribution of the original data with respect to the modified data, may lead to higher performance of the classification models, because the data (original and modified) are not easily separable, i.e., they present similar distributions.

4.4. Qualitative Analysis

According to the quantitative results on color images, the types of modification with the greatest impact on the data distribution are frost, brightness and snow, whereas the types with the least impact are the blurring corruptions (Gaussian, defocus, motion and zoom). In order to qualitatively illustrate the reasons behind these results, an example showing the degree of change for a given image is presented below. In this case, since the CIFAR datasets have been modified with five levels of severity, both the original image and the images with these seven modifications were taken at the maximum level of severity (Figure 9).
As shown in Figure 9, the types of distortion with the least impact in the adversarial validation present a high spectral similarity with respect to the original image. At the spatial level, an acceptable structure is maintained that could be tolerated in the downsampling process performed by the CNN. On the other hand, the images of the three types of distortion with the greatest impact in the adversarial validation show important changes in the spectral information of the image, as well as distortions at the spatial level, particularly noticeable in frost.
In any case, it is important to note that the adversarial validation proposed in this paper is performed at the class level and not at the sample level. That is, the above example only tries to illustrate the possible reasons for the impact of each type of modification on the separability of the data.

4.5. Comparison with State-of-the-Art Methods

One of the recent initiatives addressing the problem of distribution shifts in image datasets is MetaShift, a collection of 12,868 natural image sets of 410 classes, which takes the image context into account in its structuring [19]. In addition, the work performed by these researchers includes a score (d) that measures the distance between any two subsets.
MetaShift has many subsets of a class, and each subset corresponds to the class in a different context. In this way, MetaShift facilitates the evaluation of changes in these subsets, taking into account the context of the change [19]. In order to compare the results of the present paper with those reported in [19], some MetaShift subsets were selected, and the similarity between these subsets was evaluated both with the MetaShift distance and with the adversarial validation methodology proposed here.
The selected subsets correspond to a binary classification task, and the distance evaluation between a class under a given context and that same class under a different context has been taken into account. Four different training subsets were taken from the Dog class: Dog (cabinet + bed), Dog (bag + box), Dog (bench + bike) and Dog (boat + surfboard). Each of these subsets is compared against the Dog (shelf) class using a distance metric, given by both MetaShift and the proposed methodology.
The two distance metrics shown in Table 1 (d and CSG) quantitatively indicate the degree of similarity between the two contexts evaluated. The two metrics coincide in identifying that the dataset with the highest affinity with the subset Dog (shelf) is Dog (cabinet + bed), and that the dataset with the lowest affinity is Dog (boat + surfboard). The degree of difficulty for the other two subsets is also consistent between the two metrics. Here, two aspects should be taken into account. The first is that the trends of the metrics are inverse, i.e., high values of the MetaShift distance and low values of CSG correspond to a higher degree of distribution shift between the subsets. The second is that the distance in MetaShift does not have a maximum value, which can make it difficult to evaluate the degree of distribution shift. In the case of CSG-based adversarial validation, the results are easier to interpret, since the metric is bounded between 0.0 and 1.0 for the comparison of two classes.
Although the MetaShift publication does not provide information on the computational cost, we report the execution time of the proposed methodology. This time corresponds to the computation of the distance between subsets using CSG on a machine running on Google Colaboratory, using a CPU with 2 logical cores (1 physical core) operating at 2200 MHz, 12,978 MB of RAM, and no GPU.

5. Conclusions

This article proposes a methodology to perform adversarial validation on multi-class datasets. The method is based on the analysis of separability between data of the same class, as these data change. Representative state-of-the-art datasets have been used for validation, with modified versions based on distortions such as noise, blurring, or affine transformations, which are also available in the literature. To perform class-level adversarial validation by assessing the complexity of separating data of the same class, the Cumulative Spectral Gradient metric was used. The results show that the CSG metric allows us to perform this comparison, and that such an analysis allows us to determine the degree to which each possible type of data corruption affects the data, as well as to determine the classes that are most affected by such modifications. The presented methodology can be applied without the need to train models, and can complement the analysis process of generalization of classification models. In addition, it can be used to evaluate the degree of influence when different data augmentation techniques are required to address robustness issues in real environments and to mitigate underfitting problems.
In addition, this study provides a framework for evaluating the impact of data configuration processes prior to the training and evaluation of deep learning models. By assessing potential distribution shifts in new data, this approach helps researchers observe and understand the effectiveness and robustness of potential AI solutions and applications in changing environments, while also revealing their inherent limitations. Consequently, it can serve as a critical tool for refining deep learning techniques, ensuring that they are more resilient and better equipped to handle the complexities of real-world applications. This systematic scrutiny is fundamental to advancing the field of artificial intelligence, as it guides the optimization process to improve the performance and reliability of models in a variety of environments. In addition, it aims to promote the study of new algorithms that guarantee an optimal split of the data, thus ensuring a correct distribution of the data taking into account real-world conditions.
Finally, in future work, it is planned to contrast the results of the proposed methodology with the performance of classifiers based on recent architectures such as ConvNext [29,30], ConvNext2 [31] or ViT [32].

Author Contributions

Conceptualization, D.R.; methodology, D.R.; software, D.R.; validation, E.M.-A. and A.C.; formal analysis, D.R., E.M.-A. and A.C.; investigation, D.R.; resources, E.M.-A. and A.C.; data curation, E.M.-A. and A.C.; writing—original draft preparation, D.R. and E.M.-A.; writing—review and editing, D.R., E.M.-A. and A.C.; visualization, A.C.; supervision, D.R. and E.M.-A.; project administration, D.R.; funding acquisition, E.M.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by “Universidad Militar Nueva Granada-Vicerrectoría de Investigaciones” under the grant INV-ING-3946 of 2024.

Data Availability Statement

The image datasets used in this work are publicly available from: MNIST dataset: https://yann.lecun.com/exdb/mnist/ (accessed on 4 April 2024); CIFAR-10 and CIFAR-100 datasets: [26]: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 4 April 2024).

Acknowledgments

Ernesto Moya-Albor thanks the Facultad de Ingeniería and the Institutional Program “Fondo Open Access” of the Vicerrectoría General de Investigación of the Universidad Panamericana for all their support in this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN       Convolutional Neural Network
CSG       Cumulative Spectral Gradient
Green AI  Green Artificial Intelligence
i.i.d.    Independently and Identically Distributed
ML        Machine Learning
OOD       Out-of-Distribution
t-SNE     t-distributed Stochastic Neighbor Embedding
ViT       Vision Transformer

References

  1. Zhang, A.; Lipton, Z.C.; Li, M.; Smola, A.J. Dive into deep learning. arXiv 2021, arXiv:2106.11342.
  2. Renza, D.; Ballesteros, D. Sp2PS: Pruning Score by Spectral and Spatial Evaluation of CAM Images. Informatics 2023, 10, 72.
  3. Lu, S.; Nott, B.; Olson, A.; Todeschini, A.; Vahabi, H.; Carmon, Y.; Schmidt, L. Harder or different? A closer look at distribution shift in dataset reproduction. In Proceedings of the ICML Workshop on Uncertainty and Robustness in Deep Learning, Virtual Event, 13–18 July 2020; Volume 5, p. 15.
  4. Yao, H.; Choi, C.; Cao, B.; Lee, Y.; Koh, P.W.W.; Finn, C. Wild-time: A benchmark of in-the-wild distribution shift over time. Adv. Neural Inf. Process. Syst. 2022, 35, 10309–10324.
  5. Babbar, V.; Guo, Z.; Rudin, C. What is different between these datasets? arXiv 2024, arXiv:2403.05652.
  6. Rabanser, S.; Günnemann, S.; Lipton, Z. Failing loudly: An empirical study of methods for detecting dataset shift. Adv. Neural Inf. Process. Syst. 2019, 32, 1396–1408. Available online: https://dl.acm.org/doi/10.5555/3454287.3454412 (accessed on 4 April 2024).
  7. Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J.; Lakshminarayanan, B.; Snoek, J. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Adv. Neural Inf. Process. Syst. 2019, 32, 14003–14014.
  8. Recht, B.; Roelofs, R.; Schmidt, L.; Shankar, V. Do CIFAR-10 Classifiers Generalize to CIFAR-10? arXiv 2018, arXiv:1806.00451.
  9. De Silva, A.; Ramesh, R.; Priebe, C.; Chaudhari, P.; Vogelstein, J.T. The value of out-of-distribution data. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 7366–7389.
  10. Gao, I.; Sagawa, S.; Koh, P.W.; Hashimoto, T.; Liang, P. Out-of-Distribution Robustness via Targeted Augmentations. In Proceedings of the NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, New Orleans, LA, USA, 2–8 December 2022.
  11. Northcutt, C.; Jiang, L.; Chuang, I. Confident learning: Estimating uncertainty in dataset labels. J. Artif. Intell. Res. 2021, 70, 1373–1411.
  12. Thams, N.; Oberst, M.; Sontag, D. Evaluating robustness to dataset shift via parametric robustness sets. Adv. Neural Inf. Process. Syst. 2022, 35, 16877–16889.
  13. Chen, M.; Goel, K.; Sohoni, N.S.; Poms, F.; Fatahalian, K.; Ré, C. Mandoline: Model evaluation under distribution shift. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 1617–1629.
  14. Rusak, E.; Schneider, S.; Pachitariu, G.; Eck, L.; Gehler, P.; Bringmann, O.; Brendel, W.; Bethge, M. If your data distribution shifts, use self-learning. arXiv 2021, arXiv:2104.12928.
  15. Tang, Z.; Gao, Y.; Zhu, Y.; Zhang, Z.; Li, M.; Metaxas, D.N. CrossNorm and SelfNorm for generalization under distribution shifts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 52–61.
  16. Huang, R.; Geng, A.; Li, Y. On the importance of gradients for detecting distributional shifts in the wild. Adv. Neural Inf. Process. Syst. 2021, 34, 677–689.
  17. Yang, Y.; Zhang, H.; Katabi, D.; Ghassemi, M. Change is hard: A closer look at subpopulation shift. arXiv 2023, arXiv:2302.12254.
  18. Guo, L.L.; Pfohl, S.R.; Fries, J.; Johnson, A.E.; Posada, J.; Aftandilian, C.; Shah, N.; Sung, L. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Sci. Rep. 2022, 12, 2726.
  19. Liang, W.; Zou, J. MetaShift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts. arXiv 2022, arXiv:2202.06523.
  20. Qian, H.; Wang, B.; Ma, P.; Peng, L.; Gao, S.; Song, Y. Managing dataset shift by adversarial validation for credit scoring. In PRICAI 2022: Trends in Artificial Intelligence, Proceedings of the 19th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2022, Shanghai, China, 10–13 November 2022; Springer: Cham, Switzerland, 2022; pp. 477–488.
  21. Pan, J.; Pham, V.; Dorairaj, M.; Chen, H.; Lee, J.Y. Adversarial validation approach to concept drift problem in user targeting automation systems at Uber. arXiv 2020, arXiv:2004.03045.
  22. Ishihara, S.; Goda, S.; Arai, H. Adversarial validation to select validation data for evaluating performance in e-commerce purchase intent prediction. In Proceedings of the SIGIR eCom’21, Virtual Event, 15 July 2021.
  23. Branchaud-Charron, F.; Achkar, A.; Jodoin, P.M. Spectral Metric for Dataset Complexity Assessment. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3210–3219.
  24. Pachon, C.G.; Renza, D.; Ballesteros, D. Is My Pruned Model Trustworthy? PE-Score: A New CAM-Based Evaluation Metric. Big Data Cogn. Comput. 2023, 7, 111.
  25. LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 4 April 2024).
  26. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009.
  27. Mu, N.; Gilmer, J. MNIST-C: A robustness benchmark for computer vision. arXiv 2019, arXiv:1906.02337.
  28. Hendrycks, D.; Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019.
  29. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
  30. Chavarro, A.; Renza, D.; Moya-Albor, E. ConvNext as a Basis for Interpretability in Coffee Leaf Rust Classification. Mathematics 2024, 12, 2668.
  31. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142.
  32. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Figure 1. Outline of the proposed adversarial validation methodology.
Figure 2. Examples of the 15 corruptions included in the MNIST-C dataset.
Figure 3. Examples of the 19 corruptions included in the CIFAR-10-C and CIFAR-100-C datasets.
Figure 4. Flowchart diagram of the adversarial validation methodology. The blue lines represent the original datasets, and the red line represents the corrupted datasets.
Figure 5. Class-level adversarial validation on the MNIST dataset and its corrupted version (MNIST-C). (a) CSG metric ordered by the average value of all classes (dataset-level adversarial validation). Lower values relate to data that deviate from the original distribution. (b) Variability of the CSG metric between classes in each of the corruption types (corruption-level adversarial validation).
Figure 6. Class-level adversarial validation on the CIFAR-10 dataset and its corrupted version (CIFAR-10-C). (a) CSG metric ordered by the average value of all classes (dataset-level adversarial validation). Lower values relate to data that deviate from the original distribution. (b) Variability of the CSG metric between classes in each of the corruption types (corruption-level adversarial validation).
Figure 7. Class-level adversarial validation on the CIFAR-100 dataset and its corrupted version (CIFAR-100-C). CSG metric ordered by the average value of all classes (dataset-level adversarial validation). Lower values relate to data that deviate from the original distribution.
Figure 8. Class-level adversarial validation on the CIFAR-100 dataset and its corrupted version (CIFAR-100-C). Variability of the CSG metric between classes in each of the corruption types (corruption-level adversarial validation).
Figure 9. Example of a CIFAR image with the three types of modification that have the greatest impact on the data distribution (b–d) and the four types with the least impact (e–h).
Table 1. Comparison of the distance given by MetaShift and the distance obtained with the proposed method for four subsets extracted from MetaShift [19]. Each subset is compared with the subset Dog (shelf). In MetaShift, a higher distance (d) indicates a more challenging problem (maximum value not given). The CSG metric in adversarial validation ranges between 0.0 and 1.0 (lower values indicate a more challenging problem).

Subset                 | MetaShift d | MetaShift Exec Time (s) | Proposed Method CSG | Proposed Method Exec Time (s)
Dog (cabinet + bed)    | 0.4400      | Not given               | 0.9832              | 5.4874
Dog (bag + box)        | 0.7100      | Not given               | 0.9664              | 5.9382
Dog (bench + bike)     | 1.120       | Not given               | 0.8590              | 6.6487
Dog (boat + surfboard) | 1.430       | Not given               | 0.6854              | 5.5696