CISA: Context Substitution for Image Semantics Augmentation

Nesteruk, Sergey; Zherebtsov, Ilya; Illarionova, Svetlana; Shadrin, Dmitrii; Somov, Andrey; Bezzateev, Sergey V.; Yelina, Tatiana; Denisenko, Vladimir; Oseledets, Ivan

doi:10.3390/math11081818

Open AccessArticle

CISA: Context Substitution for Image Semantics Augmentation

by

Sergey Nesteruk

¹

,

Ilya Zherebtsov

²,

Svetlana Illarionova

¹

,

Dmitrii Shadrin

^1,3

,

Andrey Somov

^1,*

,

Sergey V. Bezzateev

⁴

,

Tatiana Yelina

⁴,

Vladimir Denisenko

²

and

Ivan Oseledets

¹

Skolkovo Institute of Science and Technology (Skoltech), 121205 Moscow, Russia

²

Voronezh State University of Engineering Technology (VSUET), 394036 Voronezh, Russia

³

Irkutsk National Research Technical University (INRTU), 664074 Irkutsk, Russia

⁴

Saint-Petrsburg State University of Aerospace Instrumentation (SUAI), 190000 Saint Petersburg, Russia

^*

Author to whom correspondence should be addressed.

Mathematics 2023, 11(8), 1818; https://doi.org/10.3390/math11081818

Submission received: 28 February 2023 / Revised: 4 April 2023 / Accepted: 6 April 2023 / Published: 11 April 2023

(This article belongs to the Special Issue Mathematics in Robot Control for Theoretical and Applied Problems)

Download

Browse Figures

Versions Notes

Abstract

:

Large datasets catalyze the rapid expansion of deep learning and computer vision. At the same time, in many domains, there is a lack of training data, which may become an obstacle for the practical application of deep computer vision models. To overcome this problem, it is popular to apply image augmentation. When a dataset contains instance segmentation masks, it is possible to apply instance-level augmentation. It operates by cutting an instance from the original image and pasting to new backgrounds. This article challenges a dataset with the same objects present in various domains. We introduce the Context Substitution for Image Semantics Augmentation framework (CISA), which is focused on choosing good background images. We compare several ways to find backgrounds that match the context of the test set, including Contrastive Language–Image Pre-Training (CLIP) image retrieval and diffusion image generation. We prove that our augmentation method is effective for classification, segmentation, and object detection with different dataset complexity and different model types. The average percentage increase in accuracy across all the tasks on a fruits and vegetables recognition dataset is

4.95 %

. Moreover, we show that the Fréchet Inception Distance (FID) metrics has a strong correlation with model accuracy, and it can help to choose better backgrounds without model training. The average negative correlation between model accuracy and the FID between the augmented and test datasets is

0.55

in our experiments.

Keywords:

image augmentation; computer vision; data collection; image retrieval; image generation; few-shot learning

MSC:

65D19; 51N05; 68U05

1. Introduction

Deep learning and computer vision (CV) algorithms have recently shown their capabilities in addressing various challenging industrial and scientific problems [1]. Successful application of machine learning and computer vision algorithms for solving complex tasks is impossible without relying on comprehensive and high-quality training and testing data [2,3]. CV algorithms for solving classification, object detection, and semantic and instance segmentation require a huge variety of input data to ensure robust work of the trained models [4,5,6]. There are two major ways to enlarge a training dataset. The first one is obvious and implies physical collection of the dataset samples in various conditions to ensure high diversity of the training data. There is a set of huge datasets that have been collected for solving computer vision problems. These datasets are commonly used as the benchmark [7,8,9,10]. One of the specifics of these datasets is that they are general-domain sets. Unfortunately, general-domain-labeled data can be almost useless for solving specific industrial problems. One of the feasible applications of such well-known datasets is that they can serve as a good basis for pre-training of neural networks (transfer learning) [11,12]. Using these pre-trained neural networks, it is possible to fine-tune them and adapt them to address specific problems. However, in some cases, even for fine-tuning, a comprehensive dataset is in high demand. Some events are rare, and it is possible to collect only a few data samples [13,14,15]. Thus, a second approach for enhancing the characteristics of the dataset can help. This approach is based on artificial manipulations with the initial dataset [16,17]. One of the well-developed techniques is data augmentation, where original images are transformed according to special rules [18]. Usually, the goal of image augmentation is to make the training dataset more diverse. However, augmentation can be used to deliberately shift the data distribution. If the distribution of the original training dataset differs from the distribution of the test set, it is important to equalize them as much as possible.

The agricultural domain is part of the industrial and research areas for which the development of artificial methods for improvement of training datasets is vital [19,20,21]. This demand appears due to the high complexity and variability of the investigated system (plant) that has to be characterized by computer vision algorithms [22]. The difficulty of the agricultural domain makes it a good candidate for testing augmentation algorithms.

There are many different plant species, and plants grow slowly. Thus, collecting and labeling huge datasets for each specific plant growing in each specific stage is a complex task [23]. Overall, it is difficult to collect datasets [24], especially for plants, and it is expensive to annotate them [25]. Therefore, we propose a method to multiply the number of training samples. It does not require many computational resources, and it can be performed on the fly. The idea behind the algorithm is to cut instances from the original images and add them onto the new backgrounds (Figure 1).

The contribution of this study is the following:

we describe an efficient algorithm for instance-level image augmentation and measure its performance;
we prove that the context is vital for instance-level augmentation;
we propose several efficient ways to find representative background images if the test environment context is known;
we show that it is possible to estimate which dataset variant will provide better accuracy before model training, calculating the FID between the test dataset and the training dataset variants;
we share the dataset and generate background images and source code for augmentation.

The novelty of this study is as follows:

extensive experiments with instance-level augmentation for different computer vision tasks;
experiments with different model types;
application of FID to choose the augmentation approach.

1.1. Image Augmentation

Computer vision models require many training data. Therefore, it becomes challenging to obtain a good model with limited datasets. Namely, a small-capacity model might not capture complex patterns, while a big capacity model tends to overfit if small datasets are used [26]. Slight changes in test data connected with surrounding and environmental conditions might also lead to a decrease in model performance [27].

To overcome this issue, we use various image augmentation techniques. Data augmentation aims to add diversity to the training set and to complicate the task for a model [28]. Among these plant image augmentation approaches, we can distinguish: basic computer vision augmentations, learned augmentation, graphical modeling, augmentation policy learning, collaging, and compositions of the ones above.

Basic computer vision augmentations are the default methods preventing overfitting in most computer vision tasks. They include image cropping, scaling, flipping, rotating, and adding noise [29]. There are also advanced augmentation methods, connected with distortion techniques and coordinate system changes [30]. Since these operations are quite generic, most popular ML frameworks support them. However, although helpful, these methods demonstrate limited use, as they bring insufficient diversity to the training data for few-shot learning cases.

Learned augmentation stands for generating training samples with an ML model. For this purpose, conditional generative adversarial networks (cGANs) and variational autoencoders (VAEs) are frequently used. In the agricultural domain, there are examples of applying GANs to Arabidopsis plant images for the leaf counting task [31,32]. The main drawback of this approach is that generating an image with a neural network is quite resource-intensive. Another disadvantage is the overall pipeline complexity: the errors of a model that generates training samples are accumulated with the errors of a model that solves the target task.

Learned augmentation policy is a series of techniques used to find combinations of basic augmentations that maximize model generalization. This implies hard binding of the learned policy to the ML model, the dataset, and the task. Although it is shown to provide systematic generalization improvement on object detection [33] and classification [34], its universal character as well as the ability to be performed along with multi-task learning are not supported with solid evidence.

Collaging presupposes cropping an object from an input image with the help of a manually annotated mask and pasting it to a new background with basic augmentations of each object [19]. In [35], a scene generation technique using object mask was successfully implemented for an instance detection task. It boosted model performance significantly compared with the use of only original images. The study on image augmentation for instance segmentation using a copy–paste technique with object mask was extended in [36]. The importance of scene context for image augmentation is explored in [37,38].

1.2. Image Synthesis

Graphical modeling is another popular method in plant phenomics. It involves creating a 3D model of the object of interest and rendering it. The advantage of this process is that it permits the generation of large datasets [39] with precise annotations, as the labels of each pixel are known. However, this technique is highly resource-intensive; moreover, the results obtained using the existing solutions [40,41] seem artificial. More realistic synthesis is very time-consuming. This approach is suitable when there are not many variations of the modeled object. If there are many different object types, it can be easier to collect and annotate new images.

1.3. Neural Image Generation and Image Retrieval

To gain new training images for CV tasks, one can implement GAN-based or diffusion-based models. Currently, they allow for the creation of rather realistic images and meet the demands of different domains, such as agricultural [42], manufacturing processes [43], remote sensing [44], or medical [45]. Such models can be considered as a part of an image recognition pipeline. Moreover, recent results in Natural Language Processing (NLP) offer opportunities to extend image generation applications via textual description. For instance, an image can be generated based on a proposed prompt, namely, a phrase or a word. Such synthetic images help to extend the initial dataset. The same target image can be described by a broad variety of words and phrases that lead to diverse visual results. Another way to obtain additional training images is a data retrieval approach. It supposes to search for existing images from the Internet or some database according to a user’s prompt. For instance, the CLIP model can be used to compute embedding of a text and to find images that match it better based on distance in a special embedding space [46].

2. Materials and Methods

The notation that we use in this section for describing the augmentation framework parameters is listed in Table 1.

2.1. Method Development and Description

In this paper, we introduce a method of image augmentation for a semantic segmentation task. When instance-level annotation areas are available, one can apply our method for other tasks such as classification, object detection, object counting, and semantic segmentation. Our method takes image–mask pairs and transforms them to obtain various scenes. Having a set of image–mask pairs, we can place many of them on a new background. Transformation of input data and background, accompanied by adding noise, gives the possibility for us to synthesize an infinite number of compound scenes.

This section first describes the overall augmentation pipeline and then describes the tested approaches for background image generation.

We distinguish between several types of image masks:

Single (S)—single-channel mask that shows the object presence.
Multi-object (MO)—multi-channel mask with a special color for each object (for each plant).
Multi-part (MP)—multi-channel mask with a special color for each object part (for each plant leaf).
Semantic (Sema)—multi-channel mask with a special color for each type of object (leaf, root, flower).
Class (C)—multi-channel mask with a special color for each class (plant variety).

A single-input mask type allows us to produce more than one output mask type. Hence, multiple tasks can be solved using any dataset, even the one that was not originally designed for these tasks (see Table 2 for the possible mask transitions).

For example, an image with a multipart mask as input enables us to produce: the S mask, which is a Boolean representation of any other mask, the MO mask with unique colors for every object, the MP mask with a unique color for each part across all the present objects, and the C mask that distinguishes the classes (Figure 2). Additionally, for every generated sample, we provide bounding boxes for all objects and the number of objects of each class.

Note that we assume that each input image–mask pair includes a single object. Therefore, we can produce the MO mask based on any other mask. To create the C mask, information about input objects must be provided.

2.2. System Architecture

The library with the code will be shared as an open source code with the community. The core of the presented system is the Augmentor. This class implements all the image and mask transformations. Such transformations as flipping or rotating are mutual for both the image and the mask. We add noise for images only.

From the main Augmentor class, we inherit SingleAugmentor, MultiPartAugmentor and SemanticAugmentor classes, helping to apply different input mask types and to treat them separately. To be more precise, SingleAugmentor is exploited for S input mask type, MultiPartAugmentor is for MP mask type, and SemanticAugmentor is for Sema mask type.

The described above classes are used in the DataGen class, which chooses images for each scene and balances classes if needed. Two principal ways of new scene generation are offline and online. We implement them in SavingDataGen and StreamingDataGen accordingly. Both of the classes take the path to images with corresponding masks as input. The offline data generator produces a new folder with created scenes while the online generator can be used to load data directly to a neural network.

Offline generation is more time-consuming because of additional disk access operations; at the same time, it is performed in advance and thus does not affect model training time. It also makes it easier to manually look through the obtained samples to tune the transformation parameters.

Meanwhile, the online data generator streams its results immediately to the model without saving images on the disk. Furthermore, this type of generator allows us to change parameters on the fly: for instance, the model is trained on easy samples, and then, the complexity may be manipulated based on the loss function.

2.3. Implementation Details

The present section discusses the main transformation pipeline (Figure 3).

The first step is to select the required number of image–mask pairs from a dataset. By default, we pick objects with repetitions that enable us to create scenes with a larger number of objects than present in the input data.

After that, we prepare images and masks before combining them into a single scene. The procedure is as follows:

adjust the masks to exclude large margins;
perform the same random transformations to both the image and mask;
obtain all required mask types and auxiliary data.

Once all the transformations are performed and we know the sizes of all objects, the size of the output scene is calculated. Note that input objects can have different sizes and orientations; therefore, we cannot simply place objects by grid because it will lead to inefficient space usage. It is also not a good idea to place objects randomly in most cases because it will lead to uncontrollable overlapping of objects.

Within the framework of our approach, the objects are packed using the Maximal Rectangles Best Long Side Fit (MAXRECTS-BLSF) algorithm. It is a greedy algorithm that is aimed at packing rectangles of different sizes into a bin using the smallest possible area. The maximum theoretical packaging space overhead of the MAXRECTS-BLSF algorithm is

0.087

. The BLSF modification of the algorithm tries to avoid a significant difference between side lengths. However, similar ot other rectangular packing algorithms, this one also tends to abuse the height dimension of the output scene, yielding a column-oriented result.

In order to control both overlapping of the objects and the orientation of output scenes, we introduce two modifications to the MAXRECTS-BLSF algorithm.

Control of the overlapping is achieved via substituting the objects’ real sizes with the shrinked ones when passing them through the packing algorithm. The height and width are modified according to Equation (1):

\tilde{H} = (1 - s) H; \tilde{W} = (1 - s) W,

(1)

where s ranges from 0 to 1 inclusively.

The bigger the shrinkage ratio, the smaller the substituted images. It is applied to both height and width and to all of the input objects. The real overlapping area in practice will vary depending on each objects’ shape and position. To perceive the overlap percentage, see Figure 4. Here, we consider the case where all input objects are squares without any holes. In other words, it is the maximum possible overlap percentage for the defined shrinkage ratio. We show this value for an object in the corner of a scene, an object on the side, and an object in the middle, separately.

We recommend choosing s between 0 and

0.3

; however, taking into consideration sparse input masks, it can be slightly higher.

To control the orientation of the output scene, we set a hard limit of the scene height for the packing algorithm. Assuming that input objects will have different sizes in practice, we cannot obtain optimal packing with the fixed output image size or width-to-height ratio. To calculate the hard height limit, we use Equation (2).

\hat{h} = m a x (m a x H, θ \frac{\sum_{i = 1}^{n} {\tilde{H}}_{i}}{⌈\sqrt{n}⌉})

(2)

The fraction in Equation (2) estimates the required value of height to make a square scene. We choose a maximum between it and the biggest objects’ height to ensure that it is enough space for any input object. The orientation coefficient

θ

can be treated as the target width-to-height ratio. It will not produce the scenes with the fixed ratio, but with many samples, the average value will approach the target one.

θ = 1

will try to obtain square scenes.

θ > 1

will generate landscape scenes. In our experiments, we set

θ

to 1.2 to obtain close to square images with landscape preference. The average resulting width-to-height ratio over ten thousand samples was 1.1955.

To adjust the background image size to the obtained scene size, we resize the background if it is smaller than the scene or randomly crop it if it is bigger.

We generate the required number of colors, excluding black and white, and find their Cartesian product according to Algorithm 1 for coloring the MO and MP masks.

Algorithm 1: Color generation.

Input: Number of objects n;

Output: The set of colors C;

L = ⌈ \sqrt[3]{n + 2} ⌉

s = \frac{1}{L}

for

l = 0, . . ., L - 1

do

T \leftarrow 1 - (s * l)

end

return

C = {(c_{1}, c_{2}, c_{3}) | c_{1}, c_{2}, c_{3} \in T}

To preserve the correspondence between the input objects and their representation on the final scene, we color the objects in order of their occurrence.

2.4. Time Performance

In this section, we measure the average time that is required to generate scenes of various complexity. For this experiment, we use Intel(R) Core(TM) i7-7700HQ CPU 2.80 GHz without multiprocessing. The average height of objects in the dataset is 385 pixels; the average width is 390 pixels. The results are averaged on a thousand scenes for each parameter combination and are reflected in Figure 5a for MultiPartAugmentor and Figure 5b for SemanticAugmentor.

SA (the red bar on the left) stands for Simple Augmentor with one type of output mask; NA (the blue bar in center) means adding noise and smoothing to scenes; NMA (the green bar on the right) means adding noise, smoothing, calculating bounding boxes, and generating all possible types of output masks. To recall possible mask types for each augmentor, refer to Table 2. The filled area in the bottom shows the time for loading input images and masks from disks. The shaded area in the middle shows the time for actual transformation. The empty area in the top shows the time for saving all the results to the disk. If every bar is accumulated with all the bars below it, the top of the shaded bar will show the time for StreamingDataGen, and the top of the empty area will show the time for SavingDataGen.

From the bar plots, you can see linear dependence between the number of input objects and the time for generating a scene.

2.5. System Parameters

Two main classes of the system where we can choose parameters are Augmentor and DataGen, or classes inherited.

The Augmentor parameters that define the transformations are shown in Table 3.

The rest of the Augmentor parameters define output mask types, bounding box presence, and mask preprocessing steps.

The data generator parameters define the rules to pick samples for scenes: the number of samples per scene, picking samples for a single scene from the same class or randomly, class balancing rule, the input file structure, the output file structure.

2.6. Background Image Choosing

Making many augmented copies of objects is a very powerful tool used to increase dataset variability. However, many previous works underestimate the role of image context. The role of the context in an image plays a role in its background. In this paper, we show that the proper choice of a background is vital. For this, we experiment with methods that produce images that are similar to the test set backgrounds.

In the test set, we have five types of background. It includes: grass, floor tiles, wooden table, color blanket, and shop shelves. Therefore, we want to obtain suitable images that represent every surrounding type. The corresponding text prompts are:

grass: grass, green grass, grass on the Earth, photo of grass, grass grown on the Earth;
floor tiles: tile, ceramic tile, beige tile, grey tile, metal, photo of metal sheet, metal sheet, tile on the floor, close photo of tile, close photo of grey tile;
wooden table: wood, wooden, wooden table, dark wooden table, light wooden table, close photo of wooden table, close photo of table in the room;
color blanket: veil, cover, blanket, color blanket, dark blanket, blanket spread, bed linen, close photo of veil (cover, blanket), blanket on the bed, towel, green towel, close photo of towel on the table;
shop shelves: shelves, shop shelves, close photo of shop shelves, white shop shelves, shop shelves close, table in shop, empty shelves in the shop, table with scales in front of shop shelves, scales in the shop.

We also split backgrounds into easy: wooden table, floor tiles; and complex: grass, color blanket, shop shelves. This split is manual and serves to demonstrate the difference in performance between more and less realistic images. More precisely, complex backgrounds are ones where visual augmentation looks unrealistic. Various background properties are significant not only in the agriculture domain, they represent different environmental conditions in the remote sensing domain and can be considered to boost model performance through geographical regions [47]. Background complexity in CV tasks for self-driving cars depends on urban area complexity and lighting conditions and has to be taken into account to develop robust algorithms [48]. To capture observed scenes for aerial vehicle navigation, surrounding properties are also crucial [49].

We use the described above text prompts with ruDALL-E [50] and stable diffusion [51] models to generate similar images, and with the CLIP [52] model to retrieve similar images from the LAION-400M [53] dataset. There are 100 collected backgrounds for each prompt.

For the comparison, we also add the worst-case and the best-case backgrounds. As the worst case, we propose to use random pattern images. The best case is to have real images from the same place, where a CV model will be inferenced.

Dataset

To verify the proposed approach, we conduct experiments using a set of images of various fruits and vegetables. We collect a unique dataset that comprises the following species: apple, cabbage, grape, tomato, pepper sweet, and onion. The dataset has hierarchical structure where each species includes three varieties, as is depicted in Figure 6. All species and varieties are presented in Table 4. Overall, each individual fruit or vegetable variety is represented by 150 images gained in different environmental and lighting conditions. We create a manual instance segmentation annotation for the images. Each image contains several fruits or vegetables of a single variety. Therefore, instance segmentation markup can be easily automatically converted into image classification labels. We can also obtain bounding boxes for object detection based on instance segmentation masks. Hence, we create annotations for three CV tasks, namely, semantic segmentation, image classification, and object detection. For each task, the dataset is split into training and testing in an 80/20 ratio.

Figure 7 depicts generated images using the original dataset with instance segmentation masks.

2.7. Experiments

The experiment setup is as follows. We have to test the stability of our approach under various conditions. For this, we experiment with three CV tasks:

image classification;
semantic segmentation;
object detection.

For each task, we compare:

easy 6-species setup;
complex 18-varieties setup.

For the classification task, we also compare different type of models:

convolutional model (ResNet50 [54]);
transformer model (SWIN [55]).

As well as models with different capacities:

medium (ResNet50);
small (MobileNetv3 [56]).

We set the following hyperparameters: For the ResNet50 training, we choose: a learning rate of

10^{- 3}

, cross-entropy loss function, SGD optimizer, exponential learning rate decay with gamma set to 0.95, and weight decay

2 \times 10^{- 3}

.

For the MobileNetv3 training, we choose: a learning rate of

10^{- 2}

, cross-entropy loss function, SGD optimizer, exponential learning rate decay with gamma set to 0.95, and weight decay

3 \times 10^{- 4}

.

For the SWIN training, we choose: a learning rate of

5 \times 10^{- 4}

, cross-entropy loss function, Adam optimizer, cosine annealing learning rate decay, and weight decay

10^{- 5}

.

For the UNET++ training, we choose: a learning rate of

3 \times 10^{- 5}

, binary cross-entropy with logits loss function, Adam optimizer, cosine annealing learning rate decay, and weight decay

10^{- 5}

. Images were resized to 512 × 512 px.

For YOLOv8 training, we choose: a learning rate of

10^{- 3}

, SGD optimizer, exponential learning rate decay with gamma set to 0.95, and weight decay

5 \times 10^{- 4}

. Images were resized to 640 × 640 px.

We explicitly compare convolutional [57] and transformer [58] models. These are the two most popular types of computer vision models today. They differ in receptive field. Convolutions operate locally (Equation (3)), while transformers look at the greater scale (Equation (4)). The success of augmentation with one model type does not guarantee success with another.

O [x, y] = (I * K) (i, j) = \sum_{j = 1} \sum_{i = 1} I [x - i, y - j] K [i, j],

(3)

where O is the resulting feature map; K is a kernel.

A (Q, K, V) = s o f t m a x (\frac{Q K^{T}}{\sqrt{d_{k}}}) V,

(4)

where Q, K and V are weight matrices; d is the dimensionality of an attention head.

In each experiment, we measure the model performance using five-fold cross-validation. We use early stopping to terminate model training; therefore, the number of training epochs for different models varies. Classification models are pre-trained on the ImageNet dataset. Segmentation and detection models are pre-trained on the COCO dataset.

We compare several ways to find backgrounds that match the context of the test set, including Contrastive Language–Image Pre-Training (CLIP) [52] image retrieval, VQGAN (ruDALL-E [50]) image generation, and diffusion (Stable Diffusion [51]) image generation.

In each experiment excluding the baseline, we first pre-train a model on the CISA-augmented dataset and then fine-tune the original dataset.

2.8. Evaluation Metrics

To determine the suitability of the training dataset prior to the training procedure, we propose to use the Fréchet Inception Distance (FID) metrics [59]. It is a commonly used choice to evaluate the performance of GAN models. FID measures distance between the distribution of generated images and the original natural samples. However, in our case, the idea behind FID computation is to determine the similarity and feasibility of the generated training samples and test data. A low FID value depicts the better case when we manage to obtain an artificially realistic dataset close to the original test dataset distribution. To compute FID, we use Equation (5).

F I D = | | μ_{r} - μ_{g} {| |}^{2} + T_{r} (\sum_{r} + \sum_{g} - 2 \sqrt{(\sum_{r} \sum_{g})}),

(5)

where r and g indexes denote real and generated datasets, correspondingly;

μ

is the mean of the Inceptionv3 model [60] features of a dataset;

\sum_{d a t a s e t}

is the variance matrix of a dataset;

T_{r}

is the trace operator.

For assessing classification results, we use accuracy, because the dataset is balanced.

To evaluate semantic segmentation, we calculate pixel-wise intersection over union (IoU, Equation (6)).

I o U = \frac{T P}{T P + F P + F N},

(6)

where

T P

is the number of true positive samples;

F P

is the number of false positive samples;

F N

is the number of false negative samples.

To evaluate object detection results, we calculate

m A P @ 0.5

(Equation (7)). It means that for the prediction, we use the threshold

I o U = 0.5

.

m A P @ 0.5 = \frac{1}{# c l a s s e s} \sum_{c \in c l a s s e s} \frac{T P (c)}{T P (c) + F P (c)},

(7)

To measure the statistical significance of our results, we calculate the Spearman rank-order correlation coefficient (Equation (8)). We choose Spearman’s over Pearson’s correlation because the relation between the FID and accuracy is monotonous but non-linear.

ρ = 1 - \frac{6 \sum d_{i}^{2}}{n (n^{2} - 1)},

(8)

where

ρ

is the Spearman’s correlation coefficient;

d_{i}

is the distance between two ranks of each observation; n is the number of observations.

3. Results

The results of the experiments are shown in Table 5, Table 6, Table 7, Table 8, Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14.

In Table 5, one can find the results of the classification of six species with the ResNet50 model. CISA with stable diffusion backgrounds shows a

2.3 %