Article

Fractals as Pre-Training Datasets for Anomaly Detection and Localization

Department of Engineering, Free University of Bozen-Bolzano, 39100 Bozen-Bolzano, Italy
* Author to whom correspondence should be addressed.
Fractal Fract. 2024, 8(11), 661; https://doi.org/10.3390/fractalfract8110661
Submission received: 11 October 2024 / Revised: 29 October 2024 / Accepted: 5 November 2024 / Published: 13 November 2024

Abstract

Anomaly detection is crucial in large-scale industrial manufacturing as it helps to detect and localize defective parts. Pre-training feature extractors on large-scale datasets is a popular approach for this task. Stringent data security and privacy regulations, high costs, and long acquisition times hinder the development of large-scale datasets for training and benchmarking. Although recent work has focused primarily on the development of new anomaly detection methods based on such extractors, little attention has been paid to the importance of the data used for pre-training. This study compares representative models pre-trained with fractal images against those pre-trained with ImageNet, without subsequent task-specific fine-tuning. We evaluated the performance of eleven state-of-the-art methods on MVTecAD, MVTec LOCO AD, and VisA, well-known benchmark datasets inspired by real-world industrial inspection scenarios. Further, we propose a novel method that combines dynamically generated fractal images to create a “Multi-Formula” dataset. Even though pre-training with ImageNet leads to better results, fractals can achieve performance close to that of ImageNet under proper parametrization. This opens up the possibility of a new research direction in which feature extractors could be trained on synthetically generated abstract datasets, mitigating the ever-increasing demand for data in machine learning while circumventing privacy and security concerns.

1. Introduction

Identifying unusual structures in images is a challenging problem in computer vision with numerous applications, including industrial inspection [1,2], healthcare monitoring [3,4], autonomous driving [5,6], and video surveillance [7,8]. Due to the rarity of encountering defects and the complexity of determining the full specification of defect variations, most of the literature addresses the anomaly detection (AD) problem in unsupervised settings, where a model is only trained on anomaly-free images. However, obtaining training data is expensive and time-consuming, and privacy concerns limit availability, especially in industrial and medical scenarios. Recently, computer vision systems have expanded greatly as large-scale datasets, such as ImageNet, have led to a shift from model-driven to data-driven approaches [9]. For example, in AD, many current state-of-the-art methods rely on deep feature extractors pre-trained on a proxy task on large-scale datasets. In addition to the technical challenges and high costs associated with acquiring and labelling these large datasets, questions have arisen over privacy, ownership, inappropriate content, and unfair biases. This has resulted in ImageNet being restricted to non-commercial applications, the 80M Tiny Images dataset [10] being withdrawn, promising datasets such as JFT-300M [11] or Instagram-3.5B [12] being unavailable for public use, and LAION-5B [13], which was used to train the famous Stable Diffusion [14], being withdrawn due to ethical concerns. What if we had a way to harness the power of large image datasets with few or none of the major issues and concerns currently faced? [15].
Fractals are complex geometric structures generated by mathematical equations; thus, anyone can produce the images, making them open-source and bypassing massive manual labelling as well as ethical or bias concerns. The work of Kataoka et al. [9] was the first to introduce the possibility of using fractals as an alternative pre-training method for image recognition tasks. In light of the promising results shown in image classification [15,16,17] and 3D scene understanding [18], in this paper, we conducted extensive experiments to examine the potential utility of a synthetically generated dataset composed of fractals for the detection and localization of anomalies, such as those representing defects in industrial production processes. We generated two- and three-dimensional fractals following the implementations in [15,17] to create standard classification datasets. Moreover, inspired by the semantic segmentation task based on formula-driven supervised learning [19], we propose a novel approach that combines multiple fractal images. This approach uses groups of fractals to represent different characteristics of each class. A sample of a given class may include a variable number of fractals, which results in a “Multi-Formula” classification dataset. This introduces more complexity into the classification task, enabling the model to learn more discriminative features. Moreover, contrary to the existing literature, which mainly focuses on fractals’ transfer-learning ability for supervised classification, we compared AD methods pre-trained with fractals against those pre-trained with ImageNet, without fine-tuning.
This work builds upon the foundation of our previous study [20], which primarily focused on comparing the performance of ImageNet with a single type of fractal dataset, using eight AD methods across two benchmark datasets, and analyzing the impact of feature hierarchy and object categories. The contributions of this extended work are summarized as follows:
We conducted the first systematic analysis, comparing the performance of eleven AD models pre-trained with fractals against those pre-trained with ImageNet on three benchmark datasets specifically designed for real-world industrial inspection scenarios, demonstrating that synthetically generated abstract images could be a valid alternative for defect detection.
We analysed the influence of feature hierarchy and object categories in addressing the anomaly detection (AD) task, demonstrating that the effectiveness of fractal-based features is closely tied to the type of anomaly. Nevertheless, we found that low-level fractal features performed better than high-level ones.
We introduced a novel procedure for the generation of classification datasets dubbed “Multi-Formula” that integrates multiple fractals, increasing the number of characteristics for each class, and showed that this strategy led to improved performance compared to the standard classification (“Single-Formula”) dataset under the same training condition.
We demonstrated that the learned weights are influenced by the specific fractals used during pre-training, showing that the presence of filters with complex patterns in the early network layers does not necessarily reflect well-learned weights across the entire architecture. On the contrary, higher-level weights may still be poorly optimized. Thus, we emphasized the importance of conducting a comprehensive analysis of the latent space structure to accurately assess the quality of the learned weights.
We evaluated the performance variations when training a model using different dataset configurations, such as fractal structure types, the number of samples, and training settings. Additionally, we observed that the careful tuning of model selection is crucial, and reducing the number of samples and classes led to improved anomaly detection performance.
The code is available at https://github.com/cugwu/fractal4AD (last accessed on 4 November 2024).

2. Related Works

2.1. Formula-Driven Supervised Learning

Formula-driven supervised learning (FDSL) has recently gained interest in the research community in the context of visual representation learning without real images. By exploiting mathematical formulas and rendering software, FDSL allows for the automatic construction of large-scale synthetic datasets, preventing the need for manual labelling and producing a nearly limitless range of images. In this context, several FDSL datasets have been proposed to pre-train computer vision models [15,17,21,22]. FractalDB [9] is one of the first datasets presented for image recognition tasks. It consists of colour or greyscale 2D fractal images obtained through an iterated affine function system. The dataset has proven to be effective for CNNs [9] and ViTs [16], showing that the pre-trained models tend to focus exclusively on contours to solve the classification task. This discovery led the authors of [22] to further study the impact of contours on the efficacy of ViT learning by proposing ExFractalDB and the Radial Contour DataBase (RCDB). The former is composed of greyscale images obtained as projections of 3D fractals. The latter consists of greyscale images of 2D contours and enables a better pre-training performance than ImageNet-21k when fine-tuned on ImageNet-1k. The RCDB pre-training performance on ViTs was surpassed by VisualAtom [21], a dataset of greyscale images of 2D contours with a larger design space. Yamada et al. [18] proposed PC-FractalDB, a point cloud fractal database, for 3D object detection. The more recent work of [17] surpassed the previously mentioned studies on the classification task by proposing MandelbulbVAR-1k and MandelbulbVAR-Hybrid-21k, two datasets based on a 3D variant of fractals called the Mandelbulb, which are projected onto RGB images.

2.2. Anomaly Detection

The majority of unsupervised AD models fall into two main categories: reconstruction-based and feature embedding-based. This paper focuses on the latter since most state-of-the-art AD methods are feature embedding-based. These methods rely on learning the distribution of anomaly-free data by extracting descriptors from a pre-trained backbone (feature extractor), which is typically kept frozen during the entire AD process. Anomalies are detected during inference as deviations from these anomaly-free features, assuming that the feature extractor produces different features for anomalous images. According to [23], feature embedding-based methods can be divided into four categories: teacher–student ([24,25,26,27]), memory bank ([28,29,30]), normalizing flow ([31,32]), and one-class classification ([33,34]). In teacher–student models, during the training phase, the teacher is the feature extractor and distils its knowledge to the student model. When an abnormal image is passed, the teacher produces features that the student was not trained on, so the student network is not able to replicate them; this feature difference between teacher and student is the fundamental principle for detecting anomalies during inference. Regarding memory bank-based approaches, features of normal images are extracted from a pre-trained network and stored in a memory bank. Test samples are classified as anomalous if the distance between the extracted test feature and the nearest feature point inside the memory bank exceeds a certain threshold. Normalizing flows are used to learn transformations between data distributions. In AD, anomaly-free features are extracted from a pre-trained network and projected by the trainable flow model onto an isotropic Gaussian distribution; in other words, the model applies a change-of-variables formula to fit an arbitrary density to a tractable base distribution. During inference, the normalizing flow is used to estimate the precise likelihood of a test image; anomalous images should be out of distribution and have a lower likelihood than normal images. For one-class classification, the goal is to identify instances belonging to a single class, without explicitly defining the boundaries between classes as in traditional binary classification.
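To make the memory-bank principle concrete, the following is a minimal Python sketch (not the PatchCore or PaDiM algorithm itself): anomaly-free descriptors from a frozen backbone are stored, and a test descriptor is scored by its distance to the nearest stored feature. All names and the threshold are illustrative assumptions.

import numpy as np

def build_memory_bank(normal_features: np.ndarray) -> np.ndarray:
    """normal_features: (n_samples, feat_dim) descriptors from a frozen backbone."""
    return normal_features  # real methods subsample, e.g., with coreset selection

def anomaly_score(test_feature: np.ndarray, memory_bank: np.ndarray) -> float:
    """Distance to the nearest anomaly-free descriptor; larger means more anomalous."""
    dists = np.linalg.norm(memory_bank - test_feature, axis=1)
    return float(dists.min())

# Usage: flag a test image if its score exceeds a threshold tuned on validation data.
bank = build_memory_bank(np.random.randn(1000, 512))
score = anomaly_score(np.random.randn(512), bank)
is_anomalous = score > 30.0  # threshold value chosen purely for illustration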

3. Dataset Generation Methods

3.1. Fractals Images

Fractal images are generated using an Iterated Function System (IFS), composed of two or more functions (called systems or codes), each associated with a sampling probability. An affine IFS involves affine transformations $\omega(x) = Ax + b$, where $A$ is a linear map and $b$ is a translation vector. The set of functions has an associated set of points with a particular geometric structure called the attractor. Following the definition of [9,15], an IFS system $S$ with cardinality $N \sim U(\{2, 3, \ldots, 8\})$, defined on the complete metric space $X = (\mathbb{R}^2, \|\cdot\|_2)$, is a set of transformations $\omega_i : X \to X$ and their associated probabilities $p_i$:
$S = \{(\omega_i, p_i) : i = 1, 2, \ldots, N\},$
which satisfies the average contractivity condition
$\prod_{i=1}^{N} s_i^{p_i} < 1,$
where $s_i$ is the Lipschitz constant of $\omega_i$. The attractor $A_S$ is a unique geometric structure, a subset of $X$ defined by $S$. The shape of $A_S$ depends on the functions $\omega_i$, while the sampling probabilities $p_i \propto |\det A_i|$ influence the distribution of points on the attractor that are visited during the iterations. The affine transform parameters are associated with the categories of the synthetic dataset.
Anderson and Farrell [15] improved the sampling strategy to always guarantee the contractivity condition of $S$ and to produce fractals with “good” geometric properties. Fractal images with good geometry are not too sparse, containing complex and varied structures with few empty spaces. The linear operator $A$ of the affine transform must have singular values less than 1 to be a contraction, which can be imposed by construction. Thus, the authors used the singular-value decomposition $A = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix containing the singular values $\sigma_1$ and $\sigma_2$. By sampling $\sigma_1$ and $\sigma_2$ in the range $(0, 1)$, we ensure that the system is a contraction. If we take $U = R_\theta$ and $V^T = R_\phi D$, where $R_\theta$ and $R_\phi$ are rotation matrices of angles $\theta$ and $\phi$ and $D$ is a diagonal matrix with diagonal elements $d_1, d_2 \in \{-1, 1\}$, then $A = U \Sigma V^T = R_\theta \Sigma R_\phi D$. In this way, we can obtain different $A$ by sampling $(\theta, \phi, \sigma_1, \sigma_2, d_1, d_2)$. The translation vector is instead sampled as $b \sim U(-1, 1)$. Regarding good geometry, the authors empirically demonstrated that the magnitudes of the singular values dictate how quickly an affine contraction map converges to its fixed point: small values cause a quick collapse, while values near 1 lead to “wandering” trajectories that do not converge to a clear geometric structure. They empirically found that, with $\sigma_{i,1}$ and $\sigma_{i,2}$ denoting the singular values of $A_i$, the $i$th function in the system, the majority of systems with good geometry satisfy $\frac{1}{2}(5 + N) < \alpha < \frac{1}{2}(6 + N)$, with $\alpha$ given by
$\alpha = \sum_{i=1}^{N} (\sigma_{i,1} + 2\sigma_{i,2}).$
For $N = 2, \ldots, 8$, this range works well, although it may also hold for a wider range. Throughout this paper, this dataset will be called Fractals; Figure 1 shows some examples of generated samples.
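To make the sampling procedure concrete, the following is a minimal Python sketch of the construction described above: it samples the affine maps through the $(\theta, \phi, \sigma_1, \sigma_2, d_1, d_2)$ parametrization, checks the good-geometry constraint on $\alpha$, and renders the attractor with the chaos game. Function names, the iteration budget, and the rasterization step are our own illustrative choices, not the released implementation of [15].

import numpy as np

def rotation(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def sample_system(N, rng):
    """Sample N affine maps A x + b with A = R_theta Sigma R_phi D (singular values < 1)."""
    maps = []
    for _ in range(N):
        theta, phi = rng.uniform(0, 2 * np.pi, size=2)
        s1, s2 = rng.uniform(0, 1, size=2)
        d = np.diag(rng.choice([-1, 1], size=2))
        A = rotation(theta) @ np.diag([s1, s2]) @ rotation(phi) @ d
        b = rng.uniform(-1, 1, size=2)
        maps.append((A, b))
    return maps

def has_good_geometry(maps):
    """alpha = sum(sigma_1 + 2*sigma_2) must fall in (0.5*(5+N), 0.5*(6+N))."""
    N = len(maps)
    alpha = sum(s[0] + 2 * s[1] for s in (np.linalg.svd(A, compute_uv=False) for A, _ in maps))
    return 0.5 * (5 + N) < alpha < 0.5 * (6 + N)

def render(maps, n_points=100_000, rng=None):
    """Chaos game: iterate randomly chosen maps, weighting by p_i proportional to |det A_i|."""
    rng = rng or np.random.default_rng(0)
    p = np.array([abs(np.linalg.det(A)) for A, _ in maps])
    p /= p.sum()
    x, pts = np.zeros(2), []
    for _ in range(n_points):
        A, b = maps[rng.choice(len(maps), p=p)]
        x = A @ x + b
        pts.append(x.copy())
    return np.array(pts)  # points would be rasterized to a 256x256 image downstream

rng = np.random.default_rng(0)
system = sample_system(N=int(rng.integers(2, 5)), rng=rng)  # N ~ U({2, 3, 4})
points = render(system) if has_good_geometry(system) else None  # resample on failure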

3.2. Mandelbulb Variations

To generate their dataset, the authors of [17] use the Mandelbulb, an extension to 3D space of the 2D Mandelbrot set, a particular set of fractals. Since fractals are objects with smaller portions similar to larger ones, they can represent infinite levels of smaller detail, and the ability to display them is limited only by computational constraints. This motivated the authors to focus mostly on the rendering part by modelling a parametrized 3D mathematical object, augmenting it by randomly generating colour patterns and shading through a simulated external light source to enhance as many details as possible. Following the formulation in [17], given $n \in \mathbb{N}$ and the function $\omega_n : \mathbb{R}^3 \to \mathbb{R}^3$:
$\omega_n : \begin{pmatrix} x \\ y \\ z \end{pmatrix} \mapsto r^n \begin{pmatrix} \sin(n\theta)\cos(n\phi) \\ \sin(n\theta)\sin(n\phi) \\ \cos(n\theta) \end{pmatrix}$
where
$r = \sqrt{x^2 + y^2 + z^2}, \quad \theta = \arccos\frac{z}{r}, \quad \phi = \arctan\left(\frac{y}{x}\right),$
a 3D Mandelbulb is defined as the set of points $c \in \mathbb{R}^3$ for which the sequence $v_{k+1} = \omega_n(v_k) + c$ does not diverge, i.e., $\sup_k \|v_k\| < +\infty$ when starting from $v_0 = 0$. In this formulation, the only parameter is $n$, while for Fractals, the set of parameters is $(\theta, \phi, \sigma_1, \sigma_2, d_1, d_2)$.
To increase the diversity of the generated data by increasing the number of parameters, the authors used the Mandelbulb Variant $V(n, b)$ [35], based on a new function $f_{n,b} : \mathbb{R}^3 \to \mathbb{R}^3$ parametrized by a boolean vector $b = (b_1, \ldots, b_9) \in \{0, 1\}^9$ of length 9. The resulting dataset, MandelbulbVAR-1k, is composed of 1037 variations; for more details, see [17,35]. To overcome the limited number of variants, the authors used Hybrid Mandelbulb Variations, obtained by combining two Mandelbulb Variations. With this hybrid version, they obtained a dataset with 21k variations, MandelbulbVAR-Hybrid-21k. Throughout this paper, we will refer to the two datasets collectively as Mandelbulbs, using specific names when we need to precisely identify one of the two; Figure 2 shows some examples of generated samples.
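As an illustration of the escape-time definition above, the sketch below tests whether a 3D point belongs to the degree-n Mandelbulb. The iteration budget and escape radius are assumptions, arctan2 is used instead of $\arctan(y/x)$ for numerical robustness, and the colour and lighting pipeline of MandelbulbVAR is not reproduced here.

import numpy as np

def omega_n(v, n):
    """Apply the degree-n Mandelbulb map to a 3D point v."""
    x, y, z = v
    r = np.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return np.zeros(3)
    theta = np.arccos(z / r)
    phi = np.arctan2(y, x)
    return r**n * np.array([np.sin(n * theta) * np.cos(n * phi),
                            np.sin(n * theta) * np.sin(n * phi),
                            np.cos(n * theta)])

def in_mandelbulb(c, n=8, max_iter=32, escape_radius=2.0):
    """c is kept if v_{k+1} = omega_n(v_k) + c stays bounded starting from v_0 = 0."""
    v = np.zeros(3)
    for _ in range(max_iter):
        v = omega_n(v, n) + c
        if np.linalg.norm(v) > escape_radius:
            return False
    return True

print(in_mandelbulb(np.array([0.1, 0.1, 0.1])))  # a point close to the origin stays bounded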

3.3. Multi-Formula Dataset

We took inspiration from [15,19], which use FDSL for semantic segmentation and multi-instance classification, respectively. The two approaches create an abstract scene with several different geometric structures within the same image. We adapted the idea of placing objects of different classes in the same scene to placing characteristics of the same class in the same image. For example, to classify a house, we may have images of various homes, but what defines a home? Having walls, windows, doors, and a roof are individual characteristics or features that together define a home. We applied the same concept to fractals. Consider an original dataset composed of $n$ classes, where each class $C_i$ (for $i = 1, 2, \ldots, n$) consists of images with a single fractal representing that class. We rearranged the classes to create a new dataset with $m$ classes, where each new class $C'_j$ (for $j = 1, 2, \ldots, m$) is composed of fractals from multiple classes $C_i$. Each sample in this new dataset contains $k \sim U(k_{min}, k_{max})$ fractals from class $C'_j$.
Formally, let the original dataset be $D = \{(x_i, y_i)\}$, where $x_i$ is an image containing a fractal obtained from the mathematical equation $f_i$, and $y_i \in \{1, 2, \ldots, n\}$ is its corresponding class label. We construct a new dataset $D' = \{(x'_j, y'_j)\}$, where $x'_j$ is an image containing $k$ fractals, randomly rescaled and inserted into the image at random locations, and $y'_j \in \{1, 2, \ldots, m\}$ is the new class label (see Figure 3). Thus, for each new class $C'_j$, we have $C'_j = \{(x'_j, y'_j) \mid x'_j = \bigcup_{i=0}^{k} f_{j_i}\}$. In other words, each class can be considered a combination of different attributes that may or may not be present. By composing multiple fractals into a single image, we generate multi-formula samples, since each class is formed by multiple mathematical formulas that produce distinct fractals. This approach contrasts with the previous datasets, Fractals and Mandelbulbs, where each class was generated by iterating a single formula, yielding single-formula samples. Throughout this paper, the “Multi-Formula” datasets will be called MultiFractals or MultiMandelbulbs when using, respectively, Fractals or Mandelbulbs as the source dataset.
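The composition step can be summarized by the following hedged Python sketch, which pastes k randomly rescaled single-fractal images at random locations onto one canvas; the scaling range, canvas size, and function names are illustrative assumptions rather than the released implementation.

import random
from PIL import Image

def compose_multi_formula(fractal_paths, k_min=2, k_max=5, canvas_size=224):
    """fractal_paths: image files of single fractals assigned to the same new class C'_j."""
    canvas = Image.new("RGB", (canvas_size, canvas_size))   # black background
    k = random.randint(k_min, k_max)                        # k ~ U(k_min, k_max)
    for path in random.sample(fractal_paths, k):            # assumes len(fractal_paths) >= k
        fractal = Image.open(path).convert("RGB")
        scale = random.uniform(0.3, 0.8)                    # random rescaling (assumed range)
        size = max(1, int(canvas_size * scale))
        fractal = fractal.resize((size, size))
        x = random.randint(0, canvas_size - size)           # random location on the canvas
        y = random.randint(0, canvas_size - size)
        canvas.paste(fractal, (x, y))
    return canvas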

4. Implementation Details

The Fractals dataset consists of single-fractal RGB images without background, with an image resolution of 256 × 256, obtained by grouping 100k IFS codes into 1000 classes. The IFS codes were uniformly sampled for $N \sim U(\{2, 3, 4\})$. We also employed the parameter augmentation method, which randomly selects one of the affine transforms $(A_k, b_k)$ in the system and scales it by a factor $\gamma \sim U(0.8, 1.1)$ to obtain $(\gamma A_k, \gamma b_k)$. We used the official code (https://github.com/catalys1/fractal-pretraining, accessed on 4 November 2024) from [15] for training since the authors did not release their pre-trained weights. Chiche et al. [17] released the code (https://github.com/RistoranteRist/MandelbulbVariationsGenerator, accessed on 4 November 2024) for the dataset generation, which we used to create MandelbulbVAR-1k and MandelbulbVAR-Hybrid-21k. However, they did not provide their training code for the anomaly detection task. Therefore, we used the pre-trained weights of WideResNet50 that the authors made available for the analysis of performance with different AD methods.
For the ablation analysis in Section 7, we implemented our own training function, where, for the Mandelbulbs dataset, the images of size 512 × 512 were randomly cropped to size 224 × 224, following the original implementation. We created MultiFractals and MultiMandelbulbs with 1000 classes using, respectively, Fractals and MandelbulbVAR-Hybrid-21k as origin datasets. All the mentioned datasets have around 1M samples. We also created a smaller version of the datasets with 200 classes and 400k samples using Fractals and MandelbulbVAR-1k as the source datasets. For a fair comparison, except for Mandelbulbs, all the ablation datasets have images of size 224 × 224. We trained the models with the standard cross-entropy objective function for 100 epochs, with batch sizes of 512 for Fractals and of 512 and 1024 for the ablation study, using stochastic gradient descent (SGD) with a momentum of 0.9, a learning rate of 0.1, and a weight decay of 1 × 10^{-4}. The images were normalized to the range [0, 1]. A graphical representation of our overall framework is shown in Figure 4.
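For reference, a minimal PyTorch sketch of the pre-training recipe reported above (cross-entropy, SGD with momentum 0.9, learning rate 0.1, weight decay 1 × 10^{-4}, 100 epochs) is given below; the data loader for the fractal dataset is a placeholder, and the exact augmentations of the released code are omitted.

import torch
import torch.nn as nn
import torchvision

def pretrain(train_loader, num_classes=1000, epochs=100, device="cuda"):
    # WideResNet-50 trained from scratch on the synthetic classification task
    model = torchvision.models.wide_resnet50_2(weights=None, num_classes=num_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model  # the backbone weights are later frozen and reused by the AD methods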

4.1. Datasets

To investigate the types of performances on the industrial anomaly detection task, our experiments were performed on the MVTecAD [1], VisA [36], and MVTec LOCO AD [2]. MVTecAD contains 15 sub-datasets of industrially manufactured objects. For each object class, the test sets contain both normal and abnormal samples with various defect types. The dataset has a relatively small scale, specifically, the number of training images for each sub-dataset varies from 60 to 391, posing a unique challenge for learning deep representations. VisA contains 12 sub-datasets. The objects range from different types of printed circuit boards to samples with multiple or single instances in a view. MVTec LOCO AD contains 3644 images from five different categories inspired by real-world industrial inspection scenarios with representative samples for structural anomalies such as scratches, dents, or contaminations and logical anomalies like violations of logical constraints, for example, permissible objects occurring in invalid locations.

4.2. Anomaly Detection Methods

For assessing the anomaly detection performance on MVTecAD and VisA, we used the teacher–student methods RD [25] and STFPM [24], the memory-based methods PatchCore [28] and PaDiM [29], the flow models FastFlow [32] and C-Flow [31], and the one-class classification methods PANDA [33] and CutPaste [34]. To facilitate reproducibility, we used Anomalib (https://github.com/openvinotoolkit/anomalib, accessed on 4 November 2024) [37] to train all the methods, except for PANDA (https://github.com/talreiss/PANDA, accessed on 4 November 2024) [33] and CutPaste (https://github.com/Runinho/pytorch-cutpaste, accessed on 4 November 2024) [34], as well as all the methods used for MVTec LOCO AD, which were deployed through the official code implementations. For MVTec LOCO AD, we used the state-of-the-art methods EfficientAD (https://github.com/nelson1425/EfficientAD, accessed on 4 November 2024) [38], PUAD (https://github.com/LeapMind/PUAD, accessed on 4 November 2024) [39], and SINBAD (https://github.com/NivC/SINBAD, accessed on 4 November 2024) [40]. All AD methods use WideResNet50, except for CutPaste, which is implemented with ResNet18.

4.3. Evaluation Metrics

Image-level metrics were used to assess the AD algorithms’ classification performance, whereas pixel-level metrics were used to assess their localization performance. These two types of metrics represent distinct capabilities of AD algorithms, and they are both extremely important. Following prior work, we used the area under the receiver operating characteristic curve (AUROC) for both image-level and pixel-level anomaly detection. To measure localization performance, we also used the area under the per-region overlap (AUPRO). A significant difference between the PRO score and the ROC measure is that the PRO score weights ground-truth regions of different sizes equally, so that it better accommodates varying anomaly sizes; see [26] for details. For MVTec LOCO AD, we also used the saturated per-region overlap (sPRO) introduced in [2]. It is a generalization of the PRO metric that takes into account a saturation threshold s to provide a fairer performance evaluation in the case of logical anomalies. Similar to PRO, sPRO does not take into account the false positive rate (FPR). Hence, unless specified otherwise, the PRO and sPRO values are associated with an FPR of 0.3.
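For clarity, the two AUROC computations can be sketched with scikit-learn as follows; AUPRO and sPRO require per-region-overlap curves and saturation thresholds and are therefore not reproduced in this simplified example.

import numpy as np
from sklearn.metrics import roc_auc_score

def image_level_auroc(image_labels, image_scores):
    """image_labels: 0 = normal, 1 = anomalous; image_scores: one anomaly score per image."""
    return roc_auc_score(image_labels, image_scores)

def pixel_level_auroc(gt_masks, anomaly_maps):
    """gt_masks and anomaly_maps: arrays of shape (n_images, H, W), compared per pixel."""
    return roc_auc_score(np.asarray(gt_masks).ravel(), np.asarray(anomaly_maps).ravel())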

5. Performance of Different Anomaly Detection Methods on Fractals Dataset

In this section, we analyse the experimental results on MVTecAD and VisA in depth when using Fractals as the pre-training dataset. Note that, except for CutPaste, none of the algorithms had their model weights fine-tuned. In each table, the reported metrics are expressed as percentages, and the best result for each method is marked in blue for ImageNet and red for Fractals pre-training; in addition, each cell reports the results as ImageNet/Fractals. All the results were obtained with version 0.* of Anomalib.
For MVTecAD, the corresponding results are in Table 1, Table 2 and Table 3. In Table 1, we observe that PatchCore is the winning approach followed by RD when using ImageNet as they both solved 7 of the 15 sub-datasets. When using Fractals, CutPaste solved 7 of the 15 classes, achieving the highest average image-level AUROC of 80.9%. For some classes, Fractals surpassed the performance of ImageNet: grid when using CutPaste and PANDA; wood with FastFlow; and toothbrush with C-Flow, PaDiM, RD, and CutPaste. In Table 2, we can see that PatchCore reached the highest pixel-level AUROC for both ImageNet and Fractals, followed by PaDiM. When using the AUPRO, Fractals’ performance dropped. As shown in Table 3, C-Flow is the method that had the biggest drop in localization performance when compared with the results in Table 2. PaDiM reached the highest AUPRO of 68.9%. Note that the AUPRO metric with the carpet class for the FastFlow pre-trained with ImageNet was missing. Anomalib [37], the repository used for the evaluation, led to the invalid score of 1.21; thus, we did not report any value. For VisA, the corresponding result tables are Table 4, Table 5 and Table 6. As shown in Table 4, when using Fractals, PatchCore reached the best score of 80.4% followed by CutPaste with 79.2%. We have some cases where Fractals surpassed ImageNet results: capsules with PatchCore and PANDA; macaroni2 with CutPaste and PANDA; pcb1 with PaDiM and CutPaste; and pcb2 with PatchCore, PaDiM, and CutPaste. Table 5 shows the pixel-level AUROC. For ImageNet, the best approach is RD, while for Fractals, it is PatchCore. On average, the pixel-level performance differs around 11% between ImageNet and Fractals. Here, too, using the AUPRO metric results in a performance drop, as shown in Table 6.
Overall, for both datasets, it is clear that memory-based methods seem to be the most suitable when using Fractals, while flow-based methods are the ones with the lowest performance. CutPaste works well with fractals, reaching the first position on MVTecAD and the second on VisA. For ImageNet, the best results remain between teacher–student and memory-based methods. With the AUROC metric, using fractal images leads to promising results at both the image and pixel level. The performance drops when using AUPRO, indicating that small defects are not well localized. Meanwhile, ImageNet weights maintain good performance across all metrics.

5.1. Comparison Between Object Categories

For MVTecAD, the 15 sub-datasets can be grouped into textures (carpet, grid, leather, tile, wood) and objects (bottle, cable, capsule, hazelnut, metal_nut, pill, screw, toothbrush, transistor, zipper). For VisA, the 12 sub-classes are grouped into pcb (pcb1, pcb2, pcb3, pcb4), images with multiple instances in a view, multi-in (candle, capsules, macaroni1, macaroni2), and images with a single instance in a view, single-in (cashew, chewinggum, fryum, pipe_fryum). In Figure 5, we can see a qualitative visualization of the image-level AUROC accuracy grouped by object categories. Focusing on MVTecAD, ImageNet leads to good performance for both textures (in blue) and objects (in red) for all the methods. The larger blue area shows a higher performance for the texture category. With Fractals, we observe the same behaviour except for RD and PANDA, where objects score, respectively, +1.9% and +0.7% higher than textures. The biggest difference between textures and objects can be seen for the flow-based methods, with +7.2% and +6% for FastFlow and C-Flow. With VisA, all the methods underperform for both ImageNet and Fractals on the multi-in category. Our intuition is that this behaviour is more method-related than weight-related: the proposed methods are specialized to perform well on MVTecAD, which is composed of images with single objects in a view. For ImageNet, pcb and single-in have comparable performance, while for Fractals, the results are quite variable.

5.2. Impact of Feature Hierarchy

Hierarchical feature learning in deep neural networks, especially in CNNs, plays a crucial role in improving model performance and interpretability. Zeiler and Fergus [41] showed that CNNs learn simple features like edges and textures in the lower layers and complex, task-specific patterns in the higher ones. Later, Yosinski et al. [42] found that these lower-layer features are more transferable to new tasks, highlighting the importance of hierarchical learning for transfer learning and model flexibility. The work of Neyshabur et al. [43] demonstrated that lower layers capture more general features, while higher layers are more sensitive to parameter perturbations. Feature maps from ResNet-like architectures can be divided into hierarchy levels $j \in \{1, 2, 3, 4\}$. For example, using the last level for feature representation introduces two problems: (i) the loss of more localized information, as the last layers of the network extract more high-level features, and (ii) a feature bias towards the task of natural image classification, which has only little overlap with industrial anomaly detection [28]. Given the large impact of hierarchical learning on model flexibility, we studied the impact of feature hierarchy when using Fractals.
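As a concrete illustration of these hierarchy levels, the sketch below extracts the feature maps of stages layer1–layer4 (corresponding to $j \in \{1, 2, 3, 4\}$) from a WideResNet-50 using torchvision's feature-extraction utility; methods such as PaDiM and PatchCore then resize and combine a subset of these maps. The snippet only illustrates the idea and is not the implementation of any specific AD method.

import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

# weights=None here; in practice the fractal- or ImageNet-pre-trained weights would be loaded
backbone = torchvision.models.wide_resnet50_2(weights=None).eval()
levels = {"layer1": "j1", "layer2": "j2", "layer3": "j3", "layer4": "j4"}
extractor = create_feature_extractor(backbone, return_nodes=levels)

with torch.no_grad():
    feats = extractor(torch.randn(1, 3, 224, 224))
# e.g., a PaDiM-style choice j in {1, 2, 3} would resize and concatenate feats["j1"].."j3"
for name, f in feats.items():
    print(name, tuple(f.shape))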
Figure 6 shows the average image- and pixel-level scores on MVTecAD for PatchCore and PaDiM, which rely on $j \in \{2, 3\}$ and $j \in \{1, 2, 3\}$, respectively, for feature representation. For the image-level AUROC, when focusing on ImageNet (blue), the results with different hierarchies are quite stable for both methods. Nevertheless, there is not a huge boost in performance when combining three hierarchies instead of two for either ImageNet or Fractals. This outcome is significant, as memory-based methods necessitate a large amount of memory during initialization, which increases with the number of features involved. For the pixel-level AUROC, both ImageNet and Fractals show a clear performance drop with $j \in \{3, 4\}$. When using Fractals, the best results are obtained with $j \in \{2, 3\}$ for PatchCore as well as $j \in \{1, 2\}$ and $j \in \{1, 2, 3\}$ for PaDiM. We can see that the difference between the results using ImageNet and the results using Fractals is relatively small. This difference increases when considering the AUPRO metric. When using layers $j \in \{3, 4\}$ with Fractals, we reach an accuracy of 49.9% for PatchCore and 51.7% for PaDiM.
For Fractals, it is clear that the best performance is obtained when using initial hierarchies j { 1 , 2 } where features that can extract simple patterns like edges and corners are learned, as described in [41]. Nakashima et al. [16] have shown that models pre-trained on fractals tend to focus primarily on contours to solve classification tasks. This suggests that the strong performance of low-level features may be due to the complex edges and patterns in fractals, which cover a broader variety of real-world textures than those found in ImageNet [18]. Low-level features capture these simpler, nature-like patterns, contributing to the quality of features learned at these levels. In contrast, the significant drop in performance when using only high-level features can be attributed to their task-specific nature. Since fractals are abstract geometric structures, they are disconnected from the anomaly detection task. ImageNet, by comparison, consists of natural images and has a little overlap with industrial anomaly detection [28]. Additionally, FDSL datasets lack semantic content, making them less suitable for semantic-related representation learning, which remains useful in anomaly detection [17]. We specifically selected AD methods that do not require fine-tuning, so our high-level weights are not adjusted for the downstream task. This lack of task-specific tuning likely explains the significant decrease in high-level feature performance.

6. Performance of Different Anomaly Detection Methods on MandelbulbVAR-1k Dataset

The results presented for MVTecAD (Table 7, Table 8 and Table 9) and VisA (Table 10, Table 11 and Table 12) demonstrate that pre-training on MandelbulbVAR-1k achieves performance comparable to ImageNet. In all tables, the metrics are reported as percentages, with the best results for each method highlighted in blue for ImageNet and purple for MandelbulbVAR-1k pre-training; in addition, each cell contains the results for ImageNet/MandelbulbVAR-1k. All the results were obtained with version 1.0 of Anomalib.
From the tables, it is interesting to notice that while flow-based methods, such as FastFlow and C-Flow, experienced the largest performance drop when pre-trained on Fractals, they maintained high scores at both the image and pixel levels when using MandelbulbVAR-1k as the pre-training dataset. AUPRO, the metric most negatively impacted by Fractals pre-training, remains close to ImageNet-level performance when using MandelbulbVAR-1k. STFPM exhibits the highest average model standard deviation, making it the least stable method overall. In particular, in Table 12, for the class pcb4, STFPM yields a notably low AUPRO score of 2.5%, raising uncertainty about whether this is an actual result or a potential bug in the library. In general, the leading methods in this evaluation are PatchCore and RD, while STFPM shows highly variable results.
We investigated the impact of feature hierarchy on performance using MandelbulbVAR-1k. In Figure 7, we present the average pixel-level AUROC and AUPRO scores for ImageNet, Fractals, and MandelbulbVAR-1k with PaDiM. The results show that, similarly to Fractals, MandelbulbVAR-1k experiences a performance drop when high-level features are used. However, for low- and mid-level hierarchies, MandelbulbVAR-1k achieves performance very close to ImageNet. When analyzing higher-level hierarchies (i.e., $j \in \{3, 4\}$ and $j \in \{2, 3, 4\}$), both Fractals and MandelbulbVAR-1k exhibit a consistent decline in performance. Interestingly, for the AUPRO metric, MandelbulbVAR-1k performs worse than Fractals at higher feature hierarchies, indicating that Mandelbulb pre-training struggles to localize small anomalies precisely when using high-level features.
Overall, it is clear that for FDSL, the model trained with MandelbulbVAR-1k outperforms Fractals. However, ImageNet remains the best pre-training dataset, particularly due to its stable performance across different feature hierarchies. Nevertheless, MandelbulbVAR-1k produces results that are very close to ImageNet, which is promising, since abstract images have a significantly different distribution from natural images used for anomaly detection.

6.1. Visualization of Learned Low-Level Filters

Figure 8 shows the filters from the first layer of WideResNet-50 backbones that give the results reported in the previous sections. We can see that the weights learned using Fractals show simple patterns, such as solid vertical or horizontal lines. This means that the model has learned more basic features in the input data. When using MandelbulbVAR-1k, some filters exhibit similarities to those in ImageNet, such as Gabor-like and coloured Gaussian-like filters, along with similar grey-toned backgrounds. This suggests that MandelbulbVAR-1k can capture complex structures in the data, much like ImageNet, enabling the extraction of more intricate features.
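The visualization in Figure 8 can be reproduced, in spirit, with a short snippet that plots the 64 RGB kernels of the first convolution of WideResNet-50; the checkpoint path is a placeholder.

import torch
import torchvision
import matplotlib.pyplot as plt

model = torchvision.models.wide_resnet50_2(weights=None)
# model.load_state_dict(torch.load("fractal_pretrained.pth"))   # illustrative checkpoint path
filters = model.conv1.weight.detach().clone()                    # shape (64, 3, 7, 7)
filters = (filters - filters.min()) / (filters.max() - filters.min())  # rescale to [0, 1]

fig, axes = plt.subplots(8, 8, figsize=(6, 6))
for ax, kernel in zip(axes.flat, filters):
    ax.imshow(kernel.permute(1, 2, 0).numpy())  # channels-last RGB kernel
    ax.axis("off")
plt.show()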

6.2. Qualitative Results

In Figure 9, we can see some qualitative results on MVTecAD’s classes bottle, cable, carpet, hazelnut, and wood. In the coloured boxes, we have the predicted heat maps and segmentation masks for ImageNet, Fractals, and MandelbulbVAR-1k. It is interesting to notice that for cable, the anomaly type is called cable_swap, so rather than a structural defect such as scratches, dents, colour spots, or cracks, we are faced with a misplacement, a violation of the position of an object, which can be seen as a logical anomaly like those found in MVTec LOCO AD. We can see from the figure that with none of the pre-training datasets can the AD methods predict the correct segmentation mask. We also observe that Fractals tends to struggle with localizing anomalies that have low contrast with the background, such as in the carpet class. In contrast, when using PatchCore with ImageNet or MandelbulbVAR-1k, we can successfully localize the colour stain. From a qualitative perspective, we observe that using MandelbulbVAR-1k produces predictions that closely resemble those from ImageNet. In certain cases, such as the hazelnut class or carpet with PatchCore, MandelbulbVAR-1k generates more refined and accurate segmentation masks.

7. Ablation Study

In the FDSL scenario, based on the results of Section 5 and Section 6, we can conclude that PatchCore using MandelbulbVAR-1k offers the best solution, achieving results that closely match those of ImageNet pre-training. This section explores key dataset characteristics in depth by comparing various generated datasets. Since Chiche et al. [17] did not provide an official training function, we performed the ablation using our own code. We used PatchCore’s official code (https://github.com/amazon-science/patchcore-inspection, accessed on 4 November 2024) for the experiments, utilizing the default configuration, and conducted evaluations on the MVTec AD dataset. We stopped the training once the model achieved 100% validation accuracy, as the classification task was fully resolved and continuing training would no longer provide any benefit. The average results for the different pre-training datasets are reported in Table 13, where, for each dataset, the number of classes is reported in brackets. As mentioned in Section 4, when using 1000 classes (1000 cl.), the training samples number around 1M; when using 200 classes (200 cl.), they number 400k. For convenience, we will attach the suffix “-small” to the datasets with 200 classes from now on.
For a fair comparison and to establish a baseline reference, we trained the model with MandelbulbVAR-1k, referred to as “Mandelbulbs” for simplicity. This distinction helps to differentiate the results obtained from our custom training functions versus those using the official pre-trained weights. Unexpectedly, Mandelbulbs led to worse performance than a randomly initialized network. The surprisingly good results from the random network are closely linked to PatchCore’s memory-based strategy: normality is determined by comparing stored features, so even with random weights, the descriptors remain meaningful for anomaly detection, since the process relies on distinguishing between stored and non-stored features. Our pre-training with Mandelbulbs resulted in an image-level AUROC of 67.8% and a pixel-level AUROC of 78.4%, which is significantly lower than the results reported in the original paper [17] (97.2% and 96.8%, respectively). The multi-formula strategy substantially boosted performance, with MultiMandelbulbs achieving +13.1% and +13.5% compared to Mandelbulbs in terms of image- and pixel-level AUROC. MultiMandelbulbs-small reached the second-best performance after ImageNet. This result is significant, as it demonstrates that we can achieve better performance with half the samples and only one-fifth of the classes. We investigated whether the performance boost was due to the reduced number of samples and classes or to the multi-formula strategy, so we trained the feature extractor with Mandelbulbs-small; the results showed a slight improvement over Mandelbulbs, with a +1.7% increase in image- and a +1.8% increase in pixel-level accuracy. Moreover, when comparing Fractals-small and MultiFractals-small, we observed performance increases of +5.1% and +7.7%, confirming that the performance gains were due to the multi-formula strategy, not the type of abstract figure or the number of samples used for generating the dataset. We also compared the performance when using a background (see “MultiMandelbulbs-back” in Table 13), following the same background creation procedure as [15]. Contrary to the background impact reported in [15], we obtained a reduction in performance compared to the case without background with both 1000 and 200 classes. Using the best FDSL dataset as a baseline (“MultiMandelbulbs (200 cl.)”), we analysed the impact of adding some random transforms (“MultiMandelbulbs-transforms (200 cl.)”) as in [15], but this led to a slight reduction in performance of −2.6% and −0.9%. Finally, we trained the model using greyscale images (“MultiMandelbulbs-gray (200 cl.)”) to prevent the model from relying on colour patterns to differentiate between classes. This resulted in reduced performance, most likely because the MVTec AD dataset contains coloured images, so some colour information was still required to distinguish defects.

7.1. Convergence Speed in FDSL Training

We observed that training speed varies with the dataset. For instance, when using Multi-Formula datasets, the classification task achieved 100% validation accuracy more quickly than with standard datasets. Our deduction is that, since each image is a mixture of samples, the network is exposed to samples from the same classes more frequently during each epoch. Consider, in Figure 3, the sample corresponding to the “Old class” with label 5 (in green); it is present in all three generated images related to the “New class” with label 2, meaning that the network has three chances to see that sample during training. This behaviour likely explains the difference in training speed. For instance, Mandelbulbs achieved 100% validation accuracy only towards the end of a 20-day training period on 4 NVIDIA A100 GPUs with a total batch size of 1024 and 100 epochs. MultiMandelbulbs-small completed 100 epochs in less than 6 days, reaching 100% accuracy between epochs 30 and 40. In our hardware configuration, stopping the training around epoch 30 would reduce the training time to around 2 days. In Figure 10, we show the validation accuracy across different pre-training datasets, with MultiMandelbulbs showing the fastest convergence. Interestingly, incorporating transformations during training significantly affects both convergence speed and overall performance (as seen in Table 13).

7.2. Low- and High-Level Feature Analysis

Filter visualization helps to examine the learned weights in the first layers, providing insights into the low-level features extracted by the pre-trained model. Figure 11 shows the filters from the first layer of the WideResNet-50 backbones that give the results reported in Table 13. We can see that the learned weights have patterns similar to the original weights shown in Figure 8.
As discussed in Section 5.2, we found that FDSL struggles to learn effective high-level features. This observation aligns with the findings reported in [17]. This aspect was further explored by analyzing the t-SNE plot on CIFAR-10. In Figure 12, we present the plots for the different pre-trained backbones listed in Table 13 without fine-tuning. The figure shows that pre-training with ImageNet results in a backbone that effectively clusters CIFAR-10 images, even without fine-tuning. This is likely due to ImageNet being a large-scale dataset with diverse classes, which may overlap with the types of objects in CIFAR-10. In the case of random initialization (Random), it is interesting to observe the shape of the plot. ImageNet’s t-SNE plot shows a spherical shape with a more spread-out sample distribution, indicating that the network effectively projects data features evenly across the high-dimensional space. In contrast, the random network exhibits an elongated shape, suggesting that features are concentrated in a specific region of the latent space. Surprisingly, Mandelbulbs also shows a latent space with a similar “sinusoidal” pattern to the random network. This pattern indicates that, in the final layers where semantic understanding happens, Mandelbulbs behaves similarly to a random network. This might suggest either that the data have a dominant direction or trend, or that the network struggles to find an effective set of weights to evenly distribute points in the latent space. Moreover, no cluster is formed, which is expected behaviour. This sinusoidal shape is also maintained for MultiMandelbulbs with and without background. With MultiMandelbulbs-small and MultiFractals-small, we start to see a more spread-out shape.
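The latent-space inspection follows a standard recipe, sketched below: embed CIFAR-10 validation images with the frozen backbone (after replacing the classification head with an identity, yielding the 2048-dimensional pooled feature) and project the embeddings with t-SNE. The subset size, data path, and t-SNE settings are assumptions.

import torch
import torchvision
import torchvision.transforms as T
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

backbone = torchvision.models.wide_resnet50_2(weights=None)  # load pre-trained weights in practice
backbone.fc = torch.nn.Identity()                            # keep the 2048-d pooled feature
backbone.eval()

transform = T.Compose([T.Resize(224), T.ToTensor()])
cifar = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
loader = torch.utils.data.DataLoader(torch.utils.data.Subset(cifar, range(2000)), batch_size=64)

feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(backbone(x))
        labels.append(y)
emb = TSNE(n_components=2, init="pca").fit_transform(torch.cat(feats).numpy())
plt.scatter(emb[:, 0], emb[:, 1], c=torch.cat(labels).numpy(), s=3, cmap="tab10")
plt.show()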
These results highlight that relying solely on visualizing the filters of the first convolution layer may lead to the premature conclusion that the learned weights are effective. However, this does not guarantee a comprehensive understanding of the model’s learning process. A thorough analysis of the latent space should be incorporated to gain deeper insights into the network’s representations.

7.3. Impact of Training Configuration

In line with other FDSL implementations, we used the same dataset for both training and validation, as the primary goal is to improve downstream task performance. However, the literature has not thoroughly explored key aspects like optimal training and validation accuracy, the relationship between training–validation splits, data transformations, and how these factors influence model selection, leaving this critical area largely underinvestigated. We opted to follow the code configuration outlined in [9], also followed by [17] for training the network in the downstream classification task (not for anomaly detection). The key difference between our implementation and the one in [9] lies in the data transformations applied during training and validation. In [9], the training set consists of images sized at 512 × 512 , randomly cropped to 224 × 224 and normalized to a mean of 0.2 and a standard deviation of 0.5. In contrast, the validation set, which is identical to the training set, undergoes resizing to 224 × 224 and is subjected to the same normalization process. The batch size is 512; moreover, a scheduler is applied where the learning rate is divided by 10 at epochs 30 and 60. The rest of the training configuration was kept the same as in the ablation study discussed in Section 7. We want to emphasize that, while the authors claim to have trained MandelbulbVAR-1k for 600 epochs, we trained for only 100 epochs to maintain consistency with the ablation results presented in Table 13.
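The scheduler described above corresponds to a standard multi-step decay; a minimal sketch of how it would be wired into the SGD optimizer used for pre-training is shown below (the per-epoch training loop is elided).

import torch
import torchvision

model = torchvision.models.wide_resnet50_2(weights=None, num_classes=200)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# learning rate divided by 10 at epochs 30 and 60
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(100):
    # ... one training epoch over the FDSL dataset would run here ...
    scheduler.step()  # called once per epoch, after the optimizer updates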
We evaluated the impact of different training configurations using the MultiMandelbulbs-small dataset, which emerged as the most effective pre-training dataset in our ablation study (Table 13). Three configurations were tested: the first, referred to as “VAR1”, follows the setup described earlier, inspired by [9]. In the second configuration, “VAR1-BATCH”, we increased the batch size from 512 to 1024, matching the original batch size used in the ablation study, while keeping the same scheduler. The third configuration, “VAR1-noSCHEDULER”, retains the batch size of 512 but omits the scheduler, aligning with the setup used in the ablation study. In Figure 13, we report how the image- and pixel-level AUROC, when using PatchCore as the AD method, change at different training epochs. We can see that “VAR1” is the best-performing configuration, while “VAR1-BATCH” is the worst one. It is interesting to observe that the AUROC scores are not directly related to validation accuracy; for example, in “VAR1”, the highest image-level AUROC is at epoch 40 with a score of 0.865 and a validation accuracy of 4.94%, while the highest pixel-level AUROC is achieved at epoch 80 with a score of 0.942 and a validation accuracy of 8.95%. The best model selected during training is the one at epoch 83, with the highest validation accuracy of 9.33%, resulting in an image- and pixel-level AUROC of 0.857 and 0.941 (see Table 14). If we observe the image-level AUROC, the one obtained with the highest validation accuracy (0.857) is lower than the one obtained at epoch 40 (0.865). This highlights that, with FDSL, clarity in training configuration and model selection is crucial.
In Figure 14, we present the t-SNE plot of the CIFAR-10 validation set. It is noteworthy that, compared to the baseline latent space for MultiMandelbulbs-small shown in Figure 12, the latent space structure exhibits an even more spherical organization. This shift in structure highlights the impact of the training configuration on the representation of the data. For example, in “VAR1-noSCHEDULER”, which mirrors our ablation training settings except for the initial transform applied to the training and validation sets and the batch size, we observe significant improvements in the latent space. This structure closely resembles what we achieve using ImageNet. For “VAR1”, the best-performing method in our experiments, we observe a spherical latent space with an elongated region on the right side of the t-SNE plot. This suggests that while “VAR1” performs well for PatchCore, the quality of high-level features in certain areas of the latent space may not be fully optimized.
In Figure 14, we can also see the filters from the first layer of WideResNet-50. The filters of “VAR1-noSCHEDULER” are closer to the filters of MandelbulbVAR-1k (see Figure 8), in particular the “dot-like” pattern.

7.4. Results on MVTec LOCO AD

We further investigated the influence of training configuration and the significance of the latent space by comparing pre-training on ImageNet and MultiMandelbulbs-small. Table 15 and Table 16 present the results for the MultiMandelbulbs-small dataset, where we compare the original training setup from Section 7 (Baseline) with the three variations introduced in Section 7.3 (VAR1, VAR1-BATCH, VAR1-noSCHEDULER). These configurations are evaluated on the MVTec LOCO AD dataset. Unlike MVTecAD and VisA, MVTec LOCO AD poses a greater challenge as it contains both structural and logical anomalies. In particular, addressing logical anomalies requires learned weights that can extract semantic representations, which aligns with learning meaningful high-level features. However, a latent space with an elongated structure may compromise the overall quality of these learned representations. The tables below highlight the best results for the different training configurations for MultiMandelbulbs-small in orange. In Table 15, we report the image-level AUROC for three different AD methods, where all the variations adopted in Section 7.3 lead to an improvement in performance compared to the “Baseline”. For EfficientAD, the optimal configuration is “VAR1”, achieving a score of 0.811. For PUAD, the best-performing setup is “VAR1-BATCH” with a score of 0.837, while for SINBAD, the top result is obtained with “VAR1-noSCHEDULER”, scoring 0.788. The common factor of the different configurations compared to the “Baseline” is the different set of transforms applied to the training and validation sets, which follows [9]. Other training configurations, such as the scheduler and batch size, also affect performance depending on the specific AD model. For instance, for SINBAD, the score increases from 0.734 to 0.788 when the scheduler is removed. This highlights the strong correlation between model selection in FDSL and the AD method used in downstream tasks. In Table 16, we report the results for EfficientAD using the pixel-level sPRO obtained with the official evaluation code (https://www.mvtec.com/company/research/datasets/mvtec-loco, accessed on 4 November 2024) of MVTec LOCO AD. The sPRO score provides more precise localization results. The table shows that “VAR1” remains the best-performing configuration for EfficientAD.

8. Conclusions

We investigated the impact of pre-training with fractals compared to ImageNet using established AD methods. Unlike conventional FDSL approaches that rely on fine-tuning, we deliberately excluded this step to maintain consistency with the original AD methods, which do not involve fine-tuning (except for CutPaste, which includes fine-tuning by default). We conducted a systematic analysis of 11 state-of-the-art AD methods and tested their performance on 32 object classes, each having different types of anomalies. Our experiments reveal that memory-based methods and CutPaste seem statistically better than the others and that their results vary depending on the object class, emphasizing the importance of anomaly-type selection when considering fractal images. We observed that low-level fractal features are more effective and that an in-depth analysis of the latent space should be considered to understand the quality of the learned weights. We also observed that MandelbulbVAR-1k can reach results very close to ImageNet from both qualitative and quantitative points of view. We introduced a new way of combining dynamically generated abstract images into a “Multi-Formula” dataset, showing a marked improvement in performance compared to the standard classification datasets under the same training conditions. Additionally, we observed that training configuration and model selection require careful tuning and investigation, as they can completely change the quality of the learned weights, particularly the high-level features. Although pre-training with ImageNet remains a clear winner on this task, our findings motivate a new research direction in AD, where there is the potential to replace large-scale natural datasets with completely synthetic abstract datasets, reducing annotation labour, protecting fairness, and preserving privacy.

8.1. Discussion

We focused on industrial defect detection, a specialized field often constrained by data scarcity due to strict privacy regulations and the high cost of labelling. With limited access to high-quality labelled data, relying solely on supervised methods becomes increasingly impractical due to the data-hungry nature of deep learning models. Large-scale datasets are highly valuable for pre-training models in cold-start defect detection. There may, however, be a suspension of functionality, ownership shifts, or abrupt removal of these datasets if they violate ethical or privacy standards. Such scenarios can complicate the tracking of lineages, raise privacy concerns, compromise data integrity, and make it difficult to attribute credit to data sources. As an example, ethical issues have been reported about dataset biases and privacy violations for ImageNet [19]; additionally, it is currently restricted to academic or educational usage. This is an issue for industrial applications whose goal is to commercialize a product trained with such a large dataset. In this context, FDSL has proven to be a valuable substitute for natural-image pre-training, also for defect detection, which requires robust features capable of discerning even minor visual variations, as normal and abnormal samples look similar but differ in local appearance. Apart from commercial restrictions, using fractals as a pre-training dataset could benefit fields where there are no corresponding large-scale datasets comparable to ImageNet. For instance, MVTec 3D-AD [44] is a 3D dataset tailored to unsupervised anomaly detection and defect localization in objects’ geometric structures. Most methods addressing this task rely on 2D models pre-trained on ImageNet and self-supervised methods on the training data [45,46,47] due to the lack of large-scale 3D datasets for pre-training. Three-dimensional fractals have already demonstrated strong pre-training capacity in 3D scene understanding [18], and our work shows their effectiveness for anomaly detection without fine-tuning. Thus, fractals could serve as a large-scale pre-training dataset for point clouds and other domains lacking specific large-scale datasets, offering a performance level close to that of ImageNet without ownership, fairness, ethical, or bias concerns.

8.2. Future Work

In future work, our study can be extended in several directions. First, FDSL requires further investigation, especially the development of more advanced pre-training strategies that enhance high-level features. Closing this gap could bring FDSL closer to the performance of ImageNet-based feature extractors, improving off-the-shelf models for cold-start anomaly detection. Second, key aspects such as the optimal training and validation accuracy, the choice of training–validation split, the data transformations, and how these factors influence model selection have not been thoroughly explored in the literature, even though our study shows that they strongly affect the final performance. Third, our “Multi-Formula” datasets were created by randomly grouping fractal images into new classes after the random generation of the “Single-Formula” datasets. Future research could explore alternative strategies for combining fractal images, for instance by developing more advanced generation methods for the source datasets, i.e., Fractals and Mandelbulbs, and by systematically examining how different grouping techniques affect model performance; a minimal sketch of such a grouping is given below. The effect of the number of classes on performance should also be examined. Fourth, limited effort has been made to synthesize abnormal samples through data augmentation, a challenging but crucial task; more focus should be given to self-supervised methods such as CutPaste, which generate synthetic anomalies. Finally, the performance of fractals under few-shot learning should be investigated: fractal pre-trained weights could reduce the amount of data needed for fine-tuning in fields with limited data, such as medicine.
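A minimal sketch of the random-grouping idea mentioned above (directory layout, function name, and group sizes are illustrative assumptions, not our released code): single-formula class folders are shuffled and partitioned into new multi-formula classes.

```python
import random
from pathlib import Path

def group_source_classes(source_root: str, n_new_classes: int, seed: int = 0):
    """Partition single-formula class folders into n_new_classes random groups."""
    rng = random.Random(seed)
    class_dirs = sorted(p for p in Path(source_root).iterdir() if p.is_dir())
    rng.shuffle(class_dirs)
    # Round-robin assignment so the groups have (almost) equal size.
    groups = [class_dirs[i::n_new_classes] for i in range(n_new_classes)]
    return {f"multi_{k:04d}": [d.name for d in dirs] for k, dirs in enumerate(groups)}

# Example: regroup 1000 single-formula classes into 200 multi-formula classes.
# mapping = group_source_classes("data/fractals_single", n_new_classes=200)
```

Alternative strategies could replace the round-robin step, e.g., grouping by visual or parametric similarity of the source formulas.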

Author Contributions

Conceptualization, C.I.U., E.C. and O.L.; methodology, C.I.U.; software, C.I.U.; validation, C.I.U.; formal analysis, C.I.U.; investigation, C.I.U.; writing—original draft preparation, C.I.U.; writing—review and editing, C.I.U., E.C. and O.L.; visualization, C.I.U.; funding acquisition, O.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research presented in this paper was funded by CovisionLab, Schaeffler and the Italian National Operative Program Research and Innovation 2014–2020 (CCI2014IT16M2OP005), resources FSE REACT-EU, Action IV.4 “Doctorates on innovation topics” and Action IV.5 “Doctorates on green topics”.

Data Availability Statement

The data used in this article can be generated from https://github.com/RistoranteRist/MandelbulbVariationsGenerator and from our official GitHub repository https://github.com/cugwu/fractal4AD (both links accessed on 4 November 2024).

Acknowledgments

Large language models (LLMs), specifically ChatGPT, and the AI tools Grammarly and Wordtune were used to check for grammatical errors and to make minor rephrasing adjustments in the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AD: Anomaly Detection
FDSL: Formula-Driven Supervised Learning
IFS: Iterated Function System

References

  1. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
  2. Bergmann, P.; Batzner, K.; Fauser, M.; Sattlegger, D.; Steger, C. Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization. Int. J. Comput. Vis. 2022, 130, 947–969. [Google Scholar] [CrossRef]
  3. Zimmerer, D.; Full, P.M.; Isensee, F.; Jäger, P.; Adler, T.; Petersen, J.; Köhler, G.; Ross, T.; Reinke, A.; Kascenas, A.; et al. Mood 2020: A public benchmark for out-of-distribution detection and localization on medical images. IEEE Trans. Med. Imaging 2022, 41, 2728–2738. [Google Scholar] [CrossRef] [PubMed]
  4. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 2014, 34, 1993–2024. [Google Scholar] [CrossRef] [PubMed]
  5. Blum, H.; Sarlin, P.E.; Nieto, J.; Siegwart, R.; Cadena, C. Fishyscapes: A benchmark for safe semantic segmentation in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  6. Hendrycks, D.; Basart, S.; Mazeika, M.; Zou, A.; Kwon, J.; Mostajabi, M.; Steinhardt, J.; Song, D. Scaling out-of-distribution detection for real-world settings. arXiv 2019, arXiv:1911.11132. [Google Scholar]
  7. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545. [Google Scholar]
  8. Nazare, T.S.; de Mello, R.F.; Ponti, M.A. Are pre-trained CNNs good feature extractors for anomaly detection in surveillance videos? arXiv 2018, arXiv:1811.08495. [Google Scholar]
  9. Kataoka, H.; Okayasu, K.; Matsumoto, A.; Yamagata, E.; Yamada, R.; Inoue, N.; Nakamura, A.; Satoh, Y. Pre-training without natural images. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  10. Torralba, A.; Fergus, R.; Freeman, W.T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1958–1970. [Google Scholar] [CrossRef] [PubMed]
  11. Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 843–852. [Google Scholar]
  12. Mahajan, D.; Girshick, R.; Ramanathan, V.; He, K.; Paluri, M.; Li, Y.; Bharambe, A.; Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 181–196. [Google Scholar]
  13. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 2022, 35, 25278–25294. [Google Scholar]
  14. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2021, arXiv:2112.10752. [Google Scholar] [CrossRef]
  15. Anderson, C.; Farrell, R. Improving fractal pre-training. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1300–1309. [Google Scholar]
  16. Nakashima, K.; Kataoka, H.; Matsumoto, A.; Iwata, K.; Inoue, N.; Satoh, Y. Can vision transformers learn without natural images? In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1990–1998. [Google Scholar]
  17. Chiche, B.N.; Horikawa, Y.; Fujita, R. Pre-training Vision Models with Mandelbulb Variations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 22062–22071. [Google Scholar]
  18. Yamada, R.; Kataoka, H.; Chiba, N.; Domae, Y.; Ogata, T. Point cloud pre-training with natural 3D structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 21283–21293. [Google Scholar]
  19. Shinoda, R.; Hayamizu, R.; Nakashima, K.; Inoue, N.; Yokota, R.; Kataoka, H. Segrcdb: Semantic segmentation via formula-driven supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20054–20063. [Google Scholar]
  20. Ugwu, C.I.; Casarin, S.; Lanz, O. Fractals as Pre-training Datasets for Anomaly Detection and Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 163–172. [Google Scholar]
  21. Takashima, S.; Hayamizu, R.; Inoue, N.; Kataoka, H.; Yokota, R. Visual atoms: Pre-training vision transformers with sinusoidal waves. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18579–18588. [Google Scholar]
  22. Kataoka, H.; Hayamizu, R.; Yamada, R.; Nakashima, K.; Takashima, S.; Zhang, X.; Martinez-Noriega, E.J.; Inoue, N.; Yokota, R. Replacing labeled real-image datasets with auto-generated contours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 21232–21241. [Google Scholar]
  23. Xie, G.; Wang, J.; Liu, J.; Lyu, J.; Liu, Y.; Wang, C.; Zheng, F.; Jin, Y. Im-iad: Industrial image anomaly detection benchmark in manufacturing. arXiv 2023, arXiv:2301.13359. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, G.; Han, S.; Ding, E.; Huang, D. Student-teacher feature pyramid matching for anomaly detection. arXiv 2021, arXiv:2103.04257. [Google Scholar]
  25. Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746. [Google Scholar]
  26. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4183–4192. [Google Scholar]
  27. Guo, J.; Lu, S.; Jia, L.; Zhang, W.; Li, H. ReContrast: Domain-Specific Anomaly Detection via Contrastive Reconstruction. arXiv 2023, arXiv:2306.02602. [Google Scholar]
  28. Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328. [Google Scholar]
  29. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the International Conference on Pattern Recognition, Virtual, 10–15 January 2021; pp. 475–489. [Google Scholar]
  30. Lee, S.; Lee, S.; Song, B.C. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access 2022, 10, 78446–78454. [Google Scholar] [CrossRef]
  31. Gudovskiy, D.; Ishizaka, S.; Kozuka, K. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 98–107. [Google Scholar]
  32. Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; Wu, L. Fastflow: Unsupervised anomaly detection and localization via 2D normalizing flows. arXiv 2021, arXiv:2111.07677. [Google Scholar]
  33. Reiss, T.; Cohen, N.; Bergman, L.; Hoshen, Y. Panda: Adapting pretrained features for anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2806–2814. [Google Scholar]
  34. Li, C.L.; Sohn, K.; Yoon, J.; Pfister, T. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9664–9674. [Google Scholar]
  35. Rampe, J. New Mandelbulb Variations. 2011. Available online: https://softologyblog.wordpress.com/2011/07/21/ (accessed on 22 July 2024).
  36. Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; Dabeer, O. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 392–408. [Google Scholar]
  37. Akcay, S.; Ameln, D.; Vaidya, A.; Lakshmanan, B.; Ahuja, N.; Genc, U. Anomalib: A deep learning library for anomaly detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1706–1710. [Google Scholar]
  38. Batzner, K.; Heckler, L.; König, R. Efficientad: Accurate visual anomaly detection at millisecond-level latencies. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 128–138. [Google Scholar]
  39. Sugawara, S.; Imamura, R. PUAD: Frustratingly Simple Method for Robust Anomaly Detection. arXiv 2024, arXiv:2402.15143. [Google Scholar]
  40. Cohen, N.; Tzachor, I.; Hoshen, Y. Set Features for Anomaly Detection. arXiv 2023, arXiv:2311.14773. [Google Scholar]
  41. Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  42. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  43. Neyshabur, B.; Sedghi, H.; Zhang, C. What is being transferred in transfer learning? Adv. Neural Inf. Process. Syst. 2020, 33, 512–523. [Google Scholar]
  44. Bergmann, P.; Jin, X.; Sattlegger, D.; Steger, C. The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization. arXiv 2021, arXiv:2112.09045. [Google Scholar]
  45. Chu, Y.M.; Liu, C.; Hsieh, T.I.; Chen, H.T.; Liu, T.L. Shape-Guided Dual-Memory Learning for 3D Anomaly Detection. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 6185–6194. [Google Scholar]
  46. Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; Wang, C. Multimodal industrial anomaly detection via hybrid fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8032–8041. [Google Scholar]
  47. Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; Wandt, B. Asymmetric student-teacher networks for industrial anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2592–2602. [Google Scholar]
Figure 1. Examples of samples we generated from a class of fractals. Note that in [9], fractals belonging to the same class share similar geometric properties, as they are sampled by slightly perturbing one of the parameters of the linear operator A. Contrary to [15], here different fractals are grouped under the same class, so samples of the same class lack geometric continuity.
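For readers unfamiliar with IFS-based generation, the following is a minimal, illustrative chaos-game sketch (not the generator used for our datasets) of how samples can be produced from a set of affine maps and how perturbing one linear operator A_i yields a geometrically similar sibling; the parameter values are arbitrary.

```python
import numpy as np

def ifs_points(A, b, n_points=100_000, seed=0):
    """Run the chaos game: repeatedly apply a randomly chosen affine map x -> A_i x + b_i."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    pts = np.empty((n_points, 2))
    for t in range(n_points):
        i = rng.integers(len(A))
        x = A[i] @ x + b[i]
        pts[t] = x
    return pts

# A simple three-map system (Sierpinski-like) ...
A = np.array([[[0.5, 0.0], [0.0, 0.5]]] * 3)
b = np.array([[0.0, 0.0], [0.5, 0.0], [0.25, 0.5]])
# ... and a sibling obtained by slightly perturbing one linear operator A_i,
# which is how intra-class variation can be produced.
A_perturbed = A.copy()
A_perturbed[0] += 0.02 * np.random.default_rng(1).standard_normal((2, 2))

base_sample = ifs_points(A, b)
sibling_sample = ifs_points(A_perturbed, b)
```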
Figure 2. Examples of samples we generated from a class of MandelbulbVAR-1k. We can observe that a class is composed of the same Mandelbulb taken from different perspectives and with various colour patterns, ensuring geometric continuity between objects of the same class.
Figure 3. Overview of the proposed “Multi-Formula” dataset. Fractals from different classes of the source dataset are grouped to form the features of new classes; each sample of a class contains a variable number of fractals.
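A minimal sketch of the compositing idea behind Figure 3, assuming pre-rendered fractal images on a dark background (file names, canvas size, and scale range are illustrative assumptions, not our generation pipeline):

```python
import random
from PIL import Image

def compose_multi_formula_sample(source_paths, canvas_size=512, max_instances=4, seed=None):
    """Paste a variable number of pre-rendered fractals onto one canvas."""
    rng = random.Random(seed)
    canvas = Image.new("RGB", (canvas_size, canvas_size))  # black background
    k = rng.randint(1, min(max_instances, len(source_paths)))
    for path in rng.sample(source_paths, k):
        fractal = Image.open(path).convert("RGB")
        size = int(canvas_size * rng.uniform(0.3, 0.7))     # random scale
        fractal = fractal.resize((size, size))
        x = rng.randint(0, canvas_size - size)              # random placement
        y = rng.randint(0, canvas_size - size)
        canvas.paste(fractal, (x, y))
    return canvas

# sample = compose_multi_formula_sample(["f_a.png", "f_b.png", "f_c.png"], seed=0)
```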
Figure 4. The left-hand box (“Dataset Generation”) illustrates two distinct IFSs, each defined by unique codes obtained by sampling the parameters of the system, which are used to generate the Fractal and Mandelbulb datasets. In the middle box (“Pre-Training”), a computer-vision model is trained for multi-class classification on the generated images, with either a single sample or multiple samples per image. Finally, in the last box (“Anomaly Detection”), the model is used as a feature extractor for unsupervised anomaly detection.
Figure 5. Spider chart of the average image-level AUROC, with MVTecAD and VisA classes grouped into different object categories.
Figure 6. Comparison between ImageNet and Fractals pre-training when using different feature hierarchies.
Figure 7. Comparison between ImageNet, Fractals, and MandelbulbVAR-1k pre-training when using different feature hierarchies on PaDiM.
Figure 8. Comparison of the filters from the first convolutional layer of WideResNet50 pre-trained with different datasets.
Figure 9. Qualitative visualization for the MVTecAD classes bottle, cable, carpet, hazelnut, and wood. The first column shows the original image and the ground truth. The blue box shows the anomaly score and predicted segmentation mask for ImageNet pre-training, the red box for Fractals, and the purple box for MandelbulbVAR-1k.
Figure 10. Top-1 classification accuracy during training for different generated datasets.
Figure 11. Comparison of the filters from the first convolutional layer of WideResNet50 pre-trained with different dataset configurations.
Figure 12. t-SNE plots of the CIFAR-10 validation set using WideResNet-50 pre-trained on different datasets. Feature vectors were extracted from the penultimate layer, prior to the final classification layer, without any fine-tuning. (Note: the legend in each t-SNE plot is intentionally small, as our focus is on the structure of the latent space rather than the classification of each individual point.)
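The latent-space inspection of Figure 12 can be reproduced with standard tooling; the snippet below is a minimal sketch using scikit-learn's t-SNE on placeholder features (in practice, the feature matrix would be the pooled penultimate-layer activations of the frozen backbone on the CIFAR-10 validation set, and the labels the true classes).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder features and labels, standing in for real backbone activations.
feats = np.random.randn(2000, 2048).astype(np.float32)
labels = np.random.randint(0, 10, size=2000)

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab10")
plt.title("t-SNE of penultimate features (illustrative)")
plt.show()
```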
Figure 13. Image- (left) and pixel-level (right) AUROC scores achieved with PatchCore at various epochs of the pre-training stage using different training configurations.
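For reference, the image- and pixel-level AUROC scores plotted in Figure 13 and reported in the tables can be computed as sketched below; the labels and scores here are dummy values, not results from our experiments or the benchmarks' official evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Image level: one anomaly score per test image, label 1 = anomalous (dummy values).
image_labels = np.array([0, 0, 1, 1, 0, 1])
image_scores = np.array([0.12, 0.08, 0.77, 0.65, 0.20, 0.91])
print("image AUROC:", roc_auc_score(image_labels, image_scores))

# Pixel level: flatten ground-truth masks and anomaly maps over all test images (dummy values).
masks = np.random.randint(0, 2, size=(6, 224, 224))
anomaly_maps = np.random.rand(6, 224, 224)
print("pixel AUROC:", roc_auc_score(masks.ravel(), anomaly_maps.ravel()))
```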
Figure 14. Comparison of the filters from the first convolutional layer of WideResNet-50 that give the results reported in Table 14. Some of the “dot-like” filters are framed in red.
Table 1. MVTecAD image-level AUROC. Each cell carries the results for ImageNet/Fractals.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM | CutPaste | PANDA
carpet | 98.6/64.5 | 92.7/49.6 | 98.0/40.9 | 99.0/42.5 | 98.9/30.6 | 98.0/53.9 | 85.9/69.2 | 93.4/31.2
grid | 99.8/58.0 | 96.1/82.0 | 97.5/93.7 | 96.9/78.5 | 100.0/68.3 | 98.3/46.0 | 98.3/100.0 | 52.0/54.4
leather | 99.7/88.5 | 96.1/63.3 | 100.0/82.0 | 99.7/81.9 | 100.0/75.2 | 99.8/67.9 | 100.0/87.3 | 96.5/54.4
tile | 99.9/95.6 | 99.9/92.7 | 98.8/95.6 | 99.5/97.3 | 100.0/60.9 | 98.6/74.0 | 94.7/84.8 | 96.8/65.1
wood | 99.2/99.6 | 95.6/93.8 | 99.4/97.9 | 99.1/97.1 | 99.4/84.3 | 99.7/75.5 | 99.7/95.7 | 95.9/56.8
bottle | 100.0/97.6 | 100.0/56.7 | 100.0/88.2 | 99.8/95.9 | 99.9/93.2 | 100.0/54.9 | 99.8/97.9 | 96.8/65.1
cable | 92.9/55.6 | 92.0/45.9 | 98.8/52.2 | 93.2/61.4 | 96.2/58.6 | 91.3/43.9 | 90.6/85.8 | 84.5/54.9
capsule | 94.7/42.1 | 90.4/61.6 | 97.8/73.4 | 91.9/70.8 | 97.6/78.3 | 57.9/56.5 | 83.5/78.1 | 91.8/71.8
hazelnut | 97.9/97.6 | 99.6/85.7 | 100.0/92.0 | 94.1/93.9 | 100.0/89.5 | 100.0/90.8 | 97.2/71.3 | 88.5/61.3
metal_nut | 98.7/57.8 | 96.4/34.4 | 99.8/38.1 | 98.7/47.9 | 100.0/69.8 | 96.6/66.2 | 94.2/80.7 | 72.9/41.5
pill | 96.4/79.5 | 82.4/76.5 | 93.1/75.9 | 92.3/77.2 | 96.7/72.4 | 81.0/77.4 | 89.1/71.0 | 81.0/65.3
screw | 85.0/27.5 | 89.1/69.0 | 97.9/61.7 | 85.2/40.0 | 98.1/69.1 | 90.3/60.4 | 79.0/42.75 | 70.5/41.3
toothbrush | 77.5/60.8 | 71.4/78.3 | 100.0/99.2 | 87.2/98.6 | 93.9/96.7 | 85.0/79.2 | 87.8/97.8 | 88.1/68.9
transistor | 89.7/59.7 | 87.8/33.0 | 99.9/55.2 | 98.5/78.6 | 97.4/66.8 | 94.9/37.5 | 92.8/79.8 | 91.0/71.2
zipper | 89.3/74.4 | 91.6/44.6 | 99.3/81.2 | 88.3/76.8 | 98.3/83.2 | 81.5/46.5 | 99.8/70.9 | 97.0/57.6
Model Avg | 94.6/70.6 | 92.1/64.5 | 98.7/75.1 | 94.9/75.9 | 98.4/73.1 | 91.5/62.0 | 92.8/80.9 | 86.4/57.4
Model STD | 6.6/22.1 | 7.5/20.2 | 1.8/20.8 | 5.0/20.0 | 1.8/16.4 | 11.5/15.4 | 6.7/15.0 | 12.8/11.8
Table 2. MVTecAD pixel-level AUROC. Each cell carries the results for ImageNet/Fractals.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
carpet | 98.2/78.4 | 98.8/71.2 | 98.7/72.7 | 98.8/73.2 | 98.8/56.2 | 99.2/76.7
grid | 98.6/85.0 | 97.4/72.2 | 98.0/82.3 | 96.7/69.6 | 99.3/88.3 | 99.2/69.5
leather | 98.9/96.6 | 97.4/84.2 | 98.9/95.6 | 98.9/90.5 | 99.1/92.4 | 99.6/83.5
tile | 95.7/87.1 | 95.8/76.0 | 94.9/85.9 | 94.9/74.2 | 95.4/69.0 | 97.1/76.0
wood | 90.8/84.9 | 95.0/82.0 | 93.2/84.0 | 93.9/84.5 | 94.9/84.9 | 96.9/85.2
bottle | 97.8/92.3 | 98.5/59.3 | 98.0/84.4 | 98.3/92.2 | 98.3/76.4 | 98.7/59.9
cable | 93.8/78.2 | 95.6/68.3 | 98.0/84.3 | 97.2/89.0 | 96.4/53.9 | 94.9/73.8
capsule | 98.7/85.5 | 98.7/90.8 | 98.8/95.2 | 98.5/95.0 | 98.7/94.3 | 97.6/95.1
hazelnut | 95.3/95.9 | 98.2/95.7 | 98.4/97.1 | 98.6/97.9 | 98.8/96.5 | 99.1/95.2
metal_nut | 98.6/82.7 | 97.4/76.1 | 98.5/84.4 | 96.1/86.5 | 97.0/82.4 | 98.2/81.8
pill | 97.5/85.3 | 98.0/90.7 | 97.5/94.6 | 95.2/92.7 | 97.4/91.2 | 95.8/88.0
screw | 98.1/85.0 | 97.4/93.9 | 99.2/95.7 | 98.7/94.8 | 99.6/97.0 | 98.9/93.6
toothbrush | 95.2/72.6 | 98.2/88.2 | 98.7/97.1 | 99.0/97.6 | 98.9/93.2 | 99.0/91.9
transistor | 92.6/78.3 | 85.9/53.7 | 96.7/75.2 | 97.6/86.5 | 89.1/66.6 | 82.3/59.5
zipper | 95.9/74.3 | 96.3/70.7 | 98.1/86.6 | 97.2/88.0 | 98.5/78.0 | 98.1/78.6
Model AVG | 96.4/84.1 | 96.6/78.2 | 97.7/87.7 | 97.3/87.5 | 97.3/81.4 | 97.0/80.6
Model STD | 2.5/7.1 | 3.2/12.6 | 1.6/7.9 | 1.6/8.8 | 2.7/14.3 | 4.3/11.6
Table 3. MVTecAD AUPRO. Each cell carries the results for ImageNet/Fractals.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
carpet | –/51.3 | 93.8/33.1 | 92.7/31.4 | 95.3/39.6 | 94.8/24.8 | 97.0/51.9
grid | 95.1/63.2 | 90.8/40.3 | 90.1/60.7 | 89.0/41.1 | 97.3/70.2 | 97.0/31.6
leather | 98.3/89.8 | 90.8/47.9 | 96.3/76.7 | 98.0/68.9 | 97.9/69.0 | 99.0/51.6
tile | 87.4/72.1 | 90.2/63.3 | 79.6/69.0 | 86.3/64.3 | 87.5/45.1 | 92.4/49.5
wood | 89.3/75.0 | 88.6/50.7 | 84.6/54.9 | 91.6/65.5 | 91.3/70.3 | 95.7/62.7
bottle | 88.7/76.1 | 93.5/28.1 | 92.3/64.7 | 95.1/77.4 | 95.3/53.2 | 96.2/22.5
cable | 80.3/38.6 | 84.8/29.9 | 91.1/46.8 | 88.5/62.5 | 90.1/41.4 | 89.0/30.4
capsule | 92.4/59.3 | 91.0/73.9 | 92.3/75.1 | 91.1/77.6 | 93.0/81.8 | 91.1/81.9
hazelnut | 95.2/89.7 | 95.1/86.2 | 94.4/87.0 | 95.0/90.1 | 96.3/90.1 | 97.6/87.6
metal_nut | 92.8/47.5 | 87.2/27.4 | 91.9/49.4 | 91.9/54.1 | 93.8/40.0 | 95.4/36.8
pill | 91.3/68.9 | 93.4/65.0 | 93.8/83.8 | 94.4/85.6 | 96.2/82.2 | 95.1/72.7
screw | 91.2/59.9 | 89.2/80.3 | 95.5/84.0 | 94.7/83.6 | 97.7/88.5 | 95.0/78.8
toothbrush | 77.8/28.3 | 82.9/64.1 | 86.2/82.7 | 93.2/91.6 | 91.6/79.4 | 92.9/70.4
transistor | 79.1/44.4 | 73.8/21.8 | 94.0/42.3 | 94.0/62.4 | 79.2/41.1 | 69.4/16.0
zipper | 87.8/41.8 | 87.7/30.2 | 92.5/67.7 | 91.3/64.2 | 95.3/50.4 | 94.2/38.3
Model AVG | 89.1/60.4 | 88.9/49.5 | 91.2/65.1 | 92.6/68.6 | 93.2/61.8 | 93.1/52.2
Model STD | 23.8/18.4 | 5.4/21.3 | 4.5/17.1 | 3.1/16.0 | 4.9/20.7 | 7.1/22.7
Table 4. VisA image-level AUROC. Each cell carries the results for ImageNet/Fractals.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM | CutPaste | PANDA
candle | 94.2/69.7 | 92.2/69.1 | 97.9/83.1 | 92.6/79.7 | 94.0/76.2 | 80.7/70.7 | 96.6/77.9 | 88.4/67.9
capsules | 85.6/49.8 | 79.4/69.1 | 68.4/79.6 | 65.6/62.7 | 84.6/62.7 | 88.4/68.4 | 83.7/71.4 | 57.1/68.2
cashew | 89.0/90.9 | 91.9/78.6 | 95.6/91.8 | 88.1/82.3 | 96.3/65.0 | 86.1/80.2 | 82.7/73.1 | 91.6/90.2
chewinggum | 95.8/91.6 | 98.4/80.1 | 99.4/81.9 | 98.3/71.7 | 99.4/67.8 | 98.2/73.5 | 96.6/86.0 | 92.2/69.0
fryum | 78.0/61.1 | 78.0/71.4 | 91.6/82.6 | 84.6/80.7 | 91.9/70.8 | 89.2/60.7 | 93.4/75.8 | 84.5/74.8
macaroni1 | 95.0/84.8 | 87.7/66.2 | 89.7/75.9 | 81.1/71.5 | 96.3/73.1 | 92.2/72.9 | 85.1/67.1 | 77.2/68.0
macaroni2 | 86.9/52.4 | 76.8/58.0 | 71.7/59.6 | 62.0/60.8 | 80.8/62.7 | 84.3/59.1 | 63.1/75.5 | 58.7/67.3
pcb1 | 95.2/72.4 | 90.9/54.6 | 95.1/89.8 | 83.2/83.3 | 97.0/62.9 | 87.6/36.0 | 89.4/92.7 | 87.0/59.5
pcb2 | 95.2/80.7 | 80.0/29.8 | 93.5/94.7 | 82.7/88.3 | 96.8/85.6 | 90.3/30.2 | 93.6/95.5 | 91.3/83.7
pcb3 | 94.4/50.5 | 85.6/56.6 | 91.9/71.1 | 78.9/76.5 | 96.5/93.2 | 90.0/64.0 | 89.7/72.6 | 78.1/64.3
pcb4 | 97.0/69.8 | 97.1/83.9 | 99.5/90.6 | 93.2/94.0 | 99.8/96.5 | 95.5/81.4 | 97.4/95.0 | 96.5/83.0
pipe_fryum | 99.5/64.8 | 94.8/64.5 | 98.5/64.4 | 96.7/66.1 | 97.3/74.6 | 92.6/64.3 | 76.3/67.3 | 80.1/59.8
Model AVG | 92.1/69.9 | 87.7/65.2 | 91.1/80.4 | 83.9/76.5 | 94.2/74.3 | 89.6/63.4 | 87.3/79.2 | 81.9/71.3
Model STD | 6.1/14.9 | 7.7/14.5 | 10.3/11.0 | 11.3/10.2 | 5.8/11.8 | 4.8/15.8 | 10.0/10.4 | 12.7/9.7
Table 5. VisA pixel-level AUROC. Each cell carries the results for ImageNet/Fractals.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
candle | 99.2/80.7 | 98.7/74.6 | 98.9/82.6 | 98.7/77.4 | 99.0/85.9 | 98.9/86.5
capsules | 98.2/84.2 | 97.0/82.2 | 97.6/90.9 | 96.3/90.2 | 99.6/92.5 | 99.3/76.8
cashew | 98.2/89.6 | 99.1/91.8 | 99.0/75.1 | 98.6/74.3 | 95.1/41.4 | 97.0/92.8
chewinggum | 99.2/96.9 | 98.8/94.1 | 98.9/87.3 | 98.9/69.1 | 98.7/86.7 | 99.1/93.3
fryum | 89.0/88.5 | 96.5/89.0 | 94.9/94.2 | 95.5/94.1 | 96.3/92.1 | 95.4/87.0
macaroni1 | 96.3/98.0 | 98.6/91.3 | 98.2/95.2 | 97.4/93.8 | 99.5/98.6 | 99.4/97.3
macaroni2 | 98.7/94.9 | 97.5/90.9 | 96.9/91.8 | 94.9/91.0 | 99.2/96.2 | 99.6/95.5
pcb1 | 99.7/94.0 | 99.1/87.2 | 99.5/98.4 | 98.7/89.6 | 99.6/31.1 | 99.4/47.7
pcb2 | 98.7/91.0 | 96.1/84.0 | 97.8/92.8 | 97.3/94.3 | 98.5/89.5 | 97.3/76.8
pcb3 | 93.5/85.4 | 97.3/86.2 | 98.2/92.7 | 97.2/96.1 | 99.0/95.0 | 98.1/89.3
pcb4 | 98.4/77.0 | 97.8/81.9 | 97.7/83.2 | 96.5/88.4 | 98.1/94.3 | 98.2/89.6
pipe_fryum | 98.3/90.7 | 98.6/95.8 | 98.8/96.0 | 98.9/96.9 | 98.7/97.2 | 97.9/96.7
Model AVG | 97.3/89.2 | 97.9/87.4 | 98.0/90.0 | 97.4/87.9 | 98.4/83.4 | 98.3/85.8
Model STD | 3.0/6.5 | 1.0/6.0 | 1.2/6.7 | 1.4/0.2 | 1.4/22.5 | 1.3/13.8
Table 6. VisA AUPRO. Each cell carries the results for ImageNet/Fractals.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
candle | 94.8/42.5 | 92.7/43.2 | 94.3/72.8 | 94.0/49.4 | 94.1/71.4 | 94.5/61.8
capsules | 90.6/45.9 | 75.3/51.3 | 67.8/61.9 | 68.7/56.8 | 93.1/51.7 | 95.3/44.6
cashew | 81.1/81.3 | 92.5/74.3 | 89.4/42.6 | 84.6/37.7 | 87.4/38.1 | 92.1/77.0
chewinggum | 84.4/62.7 | 88.9/53.7 | 84.7/43.0 | 86.5/29.8 | 80.5/48.0 | 83.0/68.6
fryum | 69.7/68.7 | 81.0/69.7 | 80.2/72.2 | 70.1/70.6 | 88.4/77.8 | 85.9/65.3
macaroni1 | 87.1/95.1 | 90.7/79.1 | 91.8/81.8 | 87.6/67.3 | 95.0/87.3 | 94.8/88.0
macaroni2 | 93.9/69.4 | 83.4/60.9 | 86.9/58.3 | 71.5/54.9 | 92.7/75.4 | 95.5/76.2
pcb1 | 92.5/64.9 | 88.1/49.7 | 89.9/77.8 | 87.5/74.4 | 95.6/18.0 | 92.3/14.4
pcb2 | 85.7/68.5 | 76.7/54.4 | 83.7/78.9 | 77.6/78.8 | 90.4/67.2 | 85.3/33.7
pcb3 | 79.6/42.1 | 73.5/64.9 | 80.4/78.5 | 70.6/80.7 | 91.0/88.4 | 89.6/77.1
pcb4 | 89.0/30.6 | 86.2/42.8 | 84.6/44.1 | 79.1/52.6 | 88.1/75.7 | 89.7/66.1
pipe_fryum | 86.1/78.0 | 92.9/87.0 | 93.4/78.5 | 90.5/79.2 | 95.0/88.9 | 93.7/88.9
Model AVG | 86.2/62.5 | 85.2/60.9 | 85.6/65.9 | 80.7/61.0 | 90.9/65.7 | 91.0/63.5
Model STD | 7.0/18.8 | 7.0/14.3 | 7.3/15.3 | 8.9/16.9 | 4.4/22.2 | 4.3/22.2
Table 7. MVTecAD image-level AUROC. Each cell carries the results for ImageNet/MandelbulbVAR-1k.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
carpet | 99.0/94.9 | 95.1/77.9 | 98.9/93.3 | 99.6/88.7 | 99.0/92.8 | 98.5/86.3
grid | 99.8/99.2 | 95.2/71.5 | 98.0/98.4 | 95.4/94.9 | 95.3/97.5 | 97.4/78.6
leather | 99.9/99.6 | 98.3/87.5 | 100.0/97.1 | 100.0/97.6 | 100.0/91.2 | 99.9/98.4
tile | 99.7/99.6 | 99.8/98.4 | 98.8/98.9 | 99.7/84.0 | 99.9/99.9 | 99.2/98.4
wood | 99.3/97.6 | 93.7/95.6 | 99.1/98.5 | 99.2/98.6 | 99.4/98.8 | 99.6/98.5
bottle | 99.7/99.6 | 100.0/97.7 | 100.0/99.6 | 100.0/100.0 | 100.0/100.0 | 97.5/95.4
cable | 95.8/90.1 | 84.7/77.8 | 98.8/98.5 | 89.5/92.1 | 95.5/83.7 | 81.5/63.5
capsule | 90.5/79.2 | 88.2/81.1 | 97.9/91.5 | 93.1/88.1 | 96.9/91.7 | 58.5/53.2
hazelnut | 95.6/83.3 | 96.7/84.8 | 100.0/98.1 | 92.3/71.3 | 100.0/94.9 | 98.2/93.5
metal_nut | 98.9/93.4 | 91.8/69.0 | 99.8/95.3 | 99.8/93.3 | 100.0/94.7 | 95.9/82.8
pill | 95.1/71.7 | 82.0/80.4 | 94.1/88.2 | 92.5/78.3 | 97.9/91.0 | 51.0/41.3
screw | 74.0/88.1 | 82.4/50.7 | 98.0/83.2 | 85.7/70.2 | 97.7/90.9 | 45.8/55.5
toothbrush | 85.2/65.5 | 85.8/90.8 | 99.7/99.4 | 90.2/96.6 | 93.6/99.9 | 81.6/68.3
transistor | 96.6/84.2 | 96.5/83.9 | 99.9/98.9 | 98.5/96.2 | 97.3/92.7 | 80.1/69.9
zipper | 92.3/94.3 | 93.1/93.9 | 99.1/99.1 | 88.6/79.1 | 97.5/95.7 | 79.2/55.6
Model AVG | 94.8/89.4 | 92.2/82.7 | 98.8/95.9 | 94.9/88.6 | 98.0/94.4 | 84.3/75.9
Model STD | 7.1/10.0 | 6.1/12.6 | 1.5/4.8 | 5.9/9.9 | 2.0/4.5 | 18.7/19.2
Table 8. MVTecAD pixel-level AUROC. Each cell carries the results for ImageNet/MandelbulbVAR-1k.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
carpet | 96.8/94.4 | 98.9/96.1 | 98.8/98.4 | 99.0/97.4 | 99.0/98.5 | 99.3/96.9
grid | 98.6/98.7 | 96.8/87.2 | 98.0/93.5 | 97.1/96.3 | 99.0/99.0 | 98.9/92.4
leather | 98.9/99.3 | 99.5/98.2 | 98.9/98.9 | 98.9/99.4 | 99.2/99.2 | 99.6/99.5
tile | 91.3/94.4 | 95.6/89.6 | 94.7/89.8 | 94.4/83.9 | 94.9/91.2 | 96.9/89.9
wood | 85.0/85.7 | 93.3/87.3 | 92.9/90.2 | 94.4/92.1 | 94.7/92.6 | 96.3/92.8
bottle | 97.4/98.2 | 98.2/97.7 | 98.1/98.4 | 98.4/98.8 | 98.5/98.2 | 94.9/88.4
cable | 94.2/93.8 | 94.4/88.8 | 98.0/96.3 | 97.0/96.1 | 96.7/91.8 | 92.0/88.1
capsule | 98.7/96.5 | 98.8/97.1 | 98.7/98.2 | 98.6/98.5 | 98.7/98.7 | 92.9/88.9
hazelnut | 95.5/97.5 | 98.6/97.9 | 98.4/98.7 | 98.0/98.5 | 98.7/99.1 | 97.3/97.9
metal_nut | 97.5/97.1 | 97.5/96.1 | 98.2/98.8 | 96.2/98.6 | 96.6/97.1 | 97.5/93.9
pill | 97.1/81.2 | 97.6/84.0 | 97.6/93.6 | 94.5/90.8 | 97.5/93.7 | 90.1/84.1
screw | 88.4/92.5 | 97.4/95.1 | 98.9/98.2 | 98.5/97.5 | 99.4/99.1 | 94.6/95.7
toothbrush | 94.8/91.2 | 98.2/97.9 | 98.6/98.3 | 99.0/98.7 | 99.0/99.0 | 98.4/85.2
transistor | 96.0/92.5 | 86.5/86.5 | 97.1/97.0 | 97.7/97.5 | 90.4/88.7 | 77.1/71.1
zipper | 92.0/98.1 | 96.8/96.2 | 97.9/98.7 | 97.1/98.3 | 98.2/98.8 | 96.0/73.5
Model AVG | 94.8/94.1 | 96.5/93.0 | 97.7/96.5 | 97.3/96.2 | 97.4/96.3 | 94.8/89.2
Model STD | 4.0/5.1 | 3.3/5.1 | 1.7/3.2 | 1.7/4.2 | 2.4/3.6 | 5.6/8.2
Table 9. MVTecAD AUPRO. Each cell carries the results for ImageNet/MandelbulbVAR-1k.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
carpet | 89.0/86.5 | 95.0/77.7 | 93.8/89.9 | 96.1/91.1 | 95.7/92.7 | 97.7/90.2
grid | 94.9/94.9 | 89.9/65.5 | 90.7/80.4 | 90.1/89.2 | 96.6/97.2 | 96.7/83.4
leather | 97.8/95.1 | 98.4/87.0 | 96.7/92.8 | 98.0/97.4 | 98.1/96.6 | 99.1/97.7
tile | 77.4/84.6 | 89.7/74.3 | 79.1/70.6 | 85.4/68.7 | 86.3/81.4 | 91.7/79.7
wood | 86.8/78.9 | 88.8/64.6 | 84.5/70.9 | 92.6/82.8 | 91.0/84.1 | 95.1/88.3
bottle | 89.1/90.7 | 92.4/85.6 | 92.8/90.6 | 95.2/95.0 | 95.9/93.9 | 85.4/71.8
cable | 75.6/85.9 | 79.4/66.8 | 91.2/88.9 | 86.4/88.7 | 90.7/78.0 | 80.4/61.7
capsule | 93.8/85.9 | 91.2/83.0 | 91.9/88.4 | 91.4/91.0 | 93.3/92.8 | 74.5/64.3
hazelnut | 95.3/92.7 | 95.3/85.6 | 93.9/92.0 | 93.4/93.7 | 96.0/94.6 | 95.3/93.4
metal_nut | 89.4/83.9 | 86.1/74.8 | 92.0/87.8 | 92.7/91.0 | 93.7/91.5 | 94.8/81.1
pill | 93.4/74.6 | 91.4/62.6 | 93.7/86.4 | 94.2/86.9 | 96.2/92.7 | 81.5/78.8
screw | 67.7/76.4 | 88.6/81.8 | 94.1/91.8 | 94.0/91.0 | 96.2/92.7 | 81.6/78.8
toothbrush | 73.6/68.6 | 84.4/78.8 | 85.5/85.1 | 93.0/92.6 | 92.5/93.1 | 88.2/46.5
transistor | 91.1/80.7 | 79.0/60.9 | 94.5/93.2 | 94.0/91.9 | 80.9/77.5 | 60.5/49.6
zipper | 77.2/94.0 | 88.7/86.0 | 92.0/94.2 | 91.3/93.9 | 95.0/95.6 | 89.2/27.4
Model AVG | 86.1/84.9 | 89.2/75.6 | 91.1/86.9 | 92.5/89.7 | 93.2/90.5 | 87.4/73.2
Model STD | 9.4/7.9 | 5.4/9.4 | 4.6/7.4 | 3.3/6.8 | 4.5/6.7 | 10.5/19.8
Table 10. VisA image-level AUROC. Each cell carries the results for ImageNet/MandelbulbVAR-1k.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
candle | 94.0/89.6 | 90.2/76.1 | 98.3/87.0 | 91.7/76.1 | 94.7/89.1 | 76.4/66.0
capsules | 87.3/86.8 | 87.1/67.2 | 70.6/76.0 | 66.9/63.8 | 88.8/81.9 | 86.6/78.2
cashew | 92.7/90.0 | 92.7/88.5 | 96.8/96.4 | 89.1/84.4 | 95.8/93.9 | 87.3/65.8
chewinggum | 98.8/97.4 | 99.5/81.8 | 98.8/92.8 | 98.8/89.4 | 98.6/93.4 | 96.0/86.1
fryum | 96.5/95.2 | 65.1/85.1 | 95.0/95.2 | 88.1/89.2 | 88.6/94.6 | 80.4/89.9
macaroni1 | 94.2/87.2 | 77.1/70.7 | 87.0/83.1 | 79.9/74.7 | 96.4/92.1 | 88.3/86.3
macaroni2 | 87.6/79.5 | 71.2/53.3 | 69.7/63.4 | 61.4/66.3 | 82.6/82.2 | 75.0/54.1
pcb1 | 96.5/95.2 | 94.3/94.7 | 94.2/95.7 | 85.2/93.8 | 96.5/97.6 | 87.9/93.4
pcb2 | 96.5/93.6 | 84.2/83.3 | 93.9/97.0 | 82.7/85.9 | 96.3/95.5 | 86.1/82.6
pcb3 | 97.3/87.1 | 77.3/84.1 | 92.6/91.9 | 78.6/70.1 | 96.5/97.7 | 78.5/53.8
pcb4 | 99.6/94.2 | 97.1/94.5 | 99.2/98.7 | 92.9/93.9 | 99.7/99.3 | 94.0/65.8
pipe_fryum | 99.6/96.7 | 98.6/84.0 | 99.3/94.6 | 92.2/85.0 | 99.4/97.6 | 91.6/86.4
Model AVG | 95.1/91.0 | 86.2/80.3 | 91.3/89.3 | 84.0/81.1 | 94.5/92.9 | 85.7/75.7
Model STD | 4.2/5.3 | 11.3/11.9 | 10.5/10.5 | 11.0/10.5 | 5.2/5.8 | 6.8/13.9
Table 11. VisA pixel-level AUROC. Each cell carries the results for ImageNet/MandelbulbVAR-1k.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
candle | 98.9/95.2 | 98.7/90.1 | 99.0/95.1 | 98.8/92.7 | 98.9/95.0 | 96.7/78.1
capsules | 99.0/98.1 | 97.2/79.0 | 97.6/96.9 | 95.7/93.8 | 99.6/99.0 | 99.1/96.0
cashew | 97.8/95.3 | 98.5/96.1 | 98.9/96.2 | 98.2/94.8 | 94.8/72.3 | 95.1/81.8
chewinggum | 98.6/97.6 | 98.8/93.5 | 98.8/96.0 | 99.0/94.0 | 98.6/91.8 | 98.4/94.0
fryum | 84.8/93.2 | 95.2/95.1 | 94.4/94.7 | 94.9/96.0 | 96.3/95.7 | 94.0/91.3
macaroni1 | 99.0/95.0 | 97.4/94.6 | 97.5/96.3 | 96.8/97.2 | 99.4/99.3 | 97.6/98.7
macaroni2 | 98.2/97.4 | 96.3/93.0 | 96.3/93.4 | 94.9/94.8 | 99.0/98.9 | 98.5/96.0
pcb1 | 99.5/99.3 | 99.3/99.1 | 99.5/99.3 | 99.0/99.3 | 99.7/99.6 | 99.3/99.1
pcb2 | 98.7/96.9 | 97.2/95.1 | 97.8/97.2 | 97.3/98.1 | 98.7/97.0 | 97.4/95.8
pcb3 | 98.9/97.5 | 96.7/97.2 | 98.0/98.2 | 97.2/98.2 | 99.1/99.0 | 95.9/95.7
pcb4 | 97.8/95.1 | 97.6/97.8 | 98.0/98.6 | 96.8/97.0 | 98.4/98.5 | 98.1/53.5
pipe_fryum | 96.5/98.7 | 98.5/99.0 | 98.9/99.0 | 99.0/99.1 | 98.9/99.0 | 97.3/96.9
Model AVG | 97.3/96.6 | 97.6/94.1 | 97.9/96.7 | 97.3/96.3 | 98.5/95.4 | 97.3/89.7
Model STD | 4.0/1.8 | 1.2/5.4 | 1.4/1.8 | 1.5/2.2 | 1.4/7.6 | 1.6/13.2
Table 12. VisA AUPRO. Each cell carries the results for ImageNet/MandelbulbVAR-1k.
Class | FastFlow | C-Flow | PatchCore | PaDiM | RD | STFPM
candle | 95.1/91.6 | 93.4/78.4 | 95.1/86.7 | 94.8/83.6 | 94.3/90.9 | 91.4/61.3
capsules | 93.8/84.4 | 76.8/47.2 | 69.1/62.1 | 66.7/62.0 | 95.1/89.0 | 95.4/82.3
cashew | 84.1/80.5 | 92.8/63.5 | 90.4/60.5 | 82.0/64.8 | 89.2/58.7 | 91.7/66.9
chewinggum | 85.2/75.1 | 89.3/39.1 | 84.9/50.8 | 87.3/43.6 | 77.8/39.1 | 77.4/52.0
fryum | 74.7/70.2 | 69.5/54.3 | 78.4/66.1 | 70.3/71.5 | 88.7/84.7 | 85.2/79.9
macaroni1 | 94.3/84.9 | 87.7/75.1 | 89.9/84.3 | 86.5/82.9 | 92.8/92.9 | 82.8/88.3
macaroni2 | 94.2/89.2 | 71.9/72.1 | 87.3/77.1 | 71.4/71.4 | 91.7/91.2 | 86.4/79.8
pcb1 | 92.2/89.3 | 89.9/87.5 | 88.8/86.8 | 88.3/89.3 | 95.1/95.2 | 90.8/91.3
pcb2 | 89.1/80.4 | 81.5/74.2 | 82.6/79.6 | 77.5/85.2 | 89.1/86.7 | 81.3/82.6
pcb3 | 85.7/71.9 | 68.0/80.3 | 78.4/82.4 | 71.0/81.8 | 90.7/91.1 | 59.2/79.7
pcb4 | 85.5/66.5 | 86.8/85.7 | 86.3/85.9 | 80.5/81.0 | 89.1/88.7 | 89.6/2.5
pipe_fryum | 85.3/84.3 | 93.5/86.7 | 93.3/88.7 | 89.6/87.2 | 95.8/95.5 | 91.0/90.3
Model AVG | 88.3/80.7 | 83.4/70.3 | 85.4/75.9 | 80.5/75.4 | 90.8/83.6 | 85.2/71.4
Model STD | 6.0/8.1 | 9.6/16.0 | 7.3/12.7 | 9.1/13.3 | 4.9/17.0 | 9.7/24.8
Table 13. Average image- and pixel-level AUROC for PatchCore [28] using the WideResNet-50 feature extractor on MVTec AD. The “Pre-Training” column indicates which dataset was used for pre-training; the number of classes in each dataset is given in brackets. In the published version, best and second-best scores are shown in underlined bold and bold, respectively.
Pre-Training | Image AUROC | Pixel AUROC
Random initialization | 0.772 | 0.860
ImageNet (1000 cl.) | 0.991 | 0.981
Mandelbulbs (1000 cl.) | 0.678 | 0.784
MultiMandelbulbs (1000 cl.) | 0.809 | 0.919
MultiMandelbulbs-back (1000 cl.) | 0.719 | 0.833
Fractals (200 cl.) | 0.720 | 0.823
MultiFractals (200 cl.) | 0.771 | 0.900
Mandelbulbs (200 cl.) | 0.695 | 0.802
MultiMandelbulbs (200 cl.) | 0.817 | 0.921
MultiMandelbulbs-back (200 cl.) | 0.699 | 0.781
MultiMandelbulbs-transforms (200 cl.) | 0.791 | 0.912
MultiMandelbulbs-gray (200 cl.) | 0.793 | 0.908
Table 14. Results of the best-performing model for each training configuration. The table shows the epoch at which the best model was selected, along with the corresponding validation accuracy and the AUROC scores achieved with PatchCore.
Train Config. | Best Epoch | Best Val. Acc. | Image AUROC | Pixel AUROC
VAR1 | 8 | 39.33 | 0.857 | 0.941
VAR1-BATCH | 25 | 10.01 | 0.836 | 0.932
VAR1-noSCHEDULER | 97 | 16.69 | 0.844 | 0.931
Table 15. MVTec LOCO AD image-level AUROC obtained via the original code implementation of the different methods. Each cell carries the results for ImageNet/MultiMandelbulbs-small.
Pre-Training | EfficientAD | PUAD | SINBAD
ImageNet | 0.898 | 0.925 | 0.841
Baseline | 0.773 | 0.818 | 0.733
VAR1 | 0.811 | 0.822 | 0.734
VAR1-BATCH | 0.789 | 0.837 | 0.780
VAR1-noSCHEDULER | 0.788 | 0.832 | 0.788
Table 16. Pixel-level sPRO results on the MVTec LOCO AD dataset, obtained using the official dataset’s evaluation code. “Log.” and “Stru.” stand for logical and structural anomalies, and “fpr” is the false positive rate used for the sPRO calculation.
EfficientAD Pre-Training | Log. (fpr = 0.05) | Stru. (fpr = 0.05) | Log. (fpr = 0.3) | Stru. (fpr = 0.3)
ImageNet | 0.691 | 0.682 | 0.889 | 0.866
Baseline | 0.574 | 0.483 | 0.819 | 0.718
VAR1 | 0.574 | 0.518 | 0.832 | 0.756
VAR1-BATCH | 0.454 | 0.504 | 0.731 | 0.735
VAR1-noSCHEDULER | 0.484 | 0.493 | 0.752 | 0.721
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
