1. Introduction
The textile industry occupies a significant proportion of China's industry, and its products are widely used in homes, clothing, construction, and even aerospace. In the textile industry, the surface quality of products is an important factor that affects their price and grade evaluation [1]. Therefore, in the manufacturing process, the industry arranges an inspection process to ensure that flawless products are delivered to merchants or consumers. Traditional detection methods rely on human inspectors to find surface defects, which is not only slow but also cannot ensure consistent detection results. With the popularization of automated product lines, automatic fabric defect detection equipment based on machine vision is increasingly being applied in fabric defect detection.
The core of surface defect detection based on industrial vision is to extract defect-related features from fabric images. However, the visual inspection of surface defects remains difficult due to the diverse sizes, varying brightness, low contrast, complex features, and inadequate defect samples of different products. In the past few decades, researchers have proposed many surface defect inspection methods to overcome these difficulties. These traditional methods rely heavily on handcrafted features; consequently, as the number of defect types increases, methods that rely on prior experience cannot satisfy the automation needs of the industry.
At present, data-driven deep-learning methods are widely used in many defect detection fields. These methods can be divided into three categories: classification, detection, and segmentation [2]. Moreover, researchers have attempted to apply methods based on convolutional neural networks to fabric defect detection in industrial settings. Jing et al. [3] proposed an extremely efficient convolutional neural network, Mobile-Unet [4], to achieve end-to-end fabric defect segmentation. Mobile-Unet introduced depthwise separable convolution, which dramatically reduced the complexity and model size of the network. Zhu et al. [5] designed a modified DenseNet [6] for automatic fabric defect classification with edge computing. Wu et al. [7] proposed a wide-and-light network structure based on Faster R-CNN for detecting common fabric defects, in which the feature extraction capability for fabric defects was enhanced by a multiscale dilated convolution kernel. In [8], a cascaded mixed feature pyramid network was proposed for guiding the localization of fabric defects. Although the above methods achieved good detection results, training these models requires supervision in the form of annotations. These annotations can be at the image level, bounding box level, or even pixel level, which leads to several disadvantages. First, annotations in supervised methods often require an experienced inspector, and pixel-level annotations demand considerable time and cost. Second, these methods can only map existing defect types to known defect types, and cannot handle unknown types of anomalies. Finally, collecting flawless images at industrial sites is much easier than collecting defective ones. Hence, fabric defect detection methods based on unsupervised learning have also attracted the interest of researchers, such as Markov random fields [9], low-rank decomposition [10], and sparse dictionaries [11]. However, these shallow-feature models limit the detection ability for surface defects, and thus cannot adapt to complex, real industrial environments.
In this paper, we propose an unsupervised fabric defect detection method based on feature comparison, trained only on defect-free samples. To avoid requiring a large number of training samples, feature extraction based on a pretrained model was applied, which could directly capture the normal variability of the training data. The obtained features then went through the SFC architecture, which contained the self-feature reconstruction module (SRM) and the self-feature distillation (SFD) network. The surface defects were obtained by using combined judgment criteria based on feature reconstruction errors and feature distillation errors. Compared with traditional methods operating in the image space, comparison in the feature space can better locate anomalies on fiber texture surfaces. In this work, feature reconstruction and feature distillation were combined, and the normal features were learned simultaneously through a direct network and an indirect network.
The rest of this paper is organized as follows. In Section 2, the related works regarding unsupervised learning for anomaly detection are introduced and summarized. In Section 3, the proposed network is described and discussed in detail. In Section 4, experiments are designed to demonstrate the performance of our method, including the fabric datasets, experimental setups, results with a comparative study, and an extended application. Finally, we conclude the paper in Section 5.
2. Related Works
In this section, related works on popular unsupervised learning methods for anomaly detection are introduced and discussed.
Anomaly detection is used to find outlier abnormal samples among a group of normal samples. Such methods attempt to generate defect-free images with autoencoders (AEs) or generative adversarial networks (GANs), and determine the abnormal area by comparing input images with output images. Bergmann et al. [12] proposed a modified AE by introducing the SSIM index into the loss function, which enforced the network to yield a more realistic image. Mei et al. [13] designed a multiscale denoising encoder based on unsupervised learning to reconstruct normal features in textile images; the reconstruction errors were utilized to realize automatic defect detection. Liu et al. [14] designed a multilevel unsupervised model to reconstruct realistic surface fabric defects by using the generation characteristics of GANs. In [15], a GAN-based anomaly detection technique named AnoGAN was proposed. Some methods based on variational AEs were also proposed for anomaly detection, including FAVAE [16], VAE-grad [17], and VE-VAE [18]. In industrial scenarios, these anomaly detection methods based on generative models are limited by the restrictions on training samples. They often have difficulty mining deep-level features and only pay attention to local pixel-level features, which makes it hard for them to achieve good results.
Some researchers have shifted the direction of anomaly detection from comparison at the image level to comparison at the feature level. Cohen et al. [19] first attempted to extract features from pretrained models and used KNN to compare features. In [20], anomaly detection based on feature reconstruction was proposed; however, it required additional feature fusion modules. Bergmann et al. [21] proposed the largest industrial anomaly detection dataset, MVTec, and designed a method based on knowledge distillation for feature comparison. In order to improve the accuracy of anomaly localization, the input image in that model had to be divided into patches, which greatly increased the time cost.
Different from current conventional anomaly detection methods, our approach mainly focused on anomaly localization (i.e., segmentation) instead of anomaly detection at the classification level. Moreover, it put forward the idea of two anomaly comparison modules operating in the feature space instead of the image space. Our method was evaluated on the carpet dataset and showed the abilities of unsupervised methods in surface defect detection. The extensive experimental results demonstrated the effectiveness of the proposed method in other industrial applications as well.
3. The Proposed SFC Framework
The proposed SFC framework had two parts, which are presented in Figure 1: the pretrained feature extraction and the feature comparison. In the pretrained feature extraction part, normal samples without defects were input into the network for feature extraction, and the extracted features were employed as the input for the next part. The feature comparison part contained the SRM and the SFD. In the training stage, the purpose of these two modules was to model the feature representation of normal samples, ensuring that the input and output of each module were the same. The SRM was an AE-based reconstruction network that reconstructed the input feature. The SFD network was based on a knowledge-distillation network; its structure was a simplified version of the pretrained network, and it was also used to mimic the input features. In the inference stage, the difference between the input and output features of the SRM represented one anomaly score, and the difference between the pretrained extracted features and the output of the SFD represented another. Both components of the SFC framework were trained simultaneously.
Next, we introduce the component modules of the model in detail, including the feature extraction, the SRM, the SFD, and the final module for obtaining the anomaly score, which are presented in Section 3.1, Section 3.2, Section 3.3, and Section 3.5, respectively. The loss function of the network is described in Section 3.4.
3.1. Pretrained Deep Feature Extraction
In the image-reconstruction-based methods [15,16,17], the features were trained by the GAN or by the encoder in the AE. Because few training samples are available in industrial scenarios, directly training a network to extract features was not a sensible approach. An image-based pretrained model is a very effective feature extractor, and its features generalize across different scenes. We used unlabeled training data X = {x_n}, n = 1, 2, ..., N, a collection of N fabric images that exclusively depicted normal appearance, where each x_n ∈ R^(u × v) is an intensity image of size u × v. We defined the feature extractor as F. When a training image x_k is input, the corresponding extracted features can be expressed as:

$$\hat{F}(x_k) = F(x_k), \qquad \hat{F}(x_k) \in \mathbb{R}^{h \times w \times d}$$
The resolution of the obtained features is h × w × d, where d represents the dimension of the feature channel. The feature maps were processed by extensive convolutional and nonlinear operations. As the pretrained weights were fixed, the h × w × d features could be quickly obtained, and no additional network training was required. In the pretrained network, the sizes of the extracted critical layers F_i were inconsistent, so to ensure that the features extracted from each critical layer F_i could be fused, a resize operation was necessary. Generally, the feature maps after the pooling operations were selected as F_i. The final feature tensor F̂ was generated by upsampling each F_i and concatenating them. Moreover, each position in the original input image x_k corresponded to a position (i, j) in F̂ after feature extraction; there was only a scaling relationship between them, which was convenient for exact anomaly localization.
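To make this concrete, the following is a minimal sketch of such a fixed pretrained extractor, assuming a Keras VGG19 backbone with ImageNet weights and the three layer names given in Section 3.3; the common target resolution (64 × 64) and the fusion details are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf

# Pretrained VGG19 as a fixed feature extractor; the weights stay frozen,
# so no additional network training is required.
base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(256, 256, 3))
base.trainable = False

# Critical layers (named in Section 3.3) whose feature maps are fused.
layer_names = ["block3_conv4", "block4_conv4", "block5_conv4"]
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=[base.get_layer(n).output for n in layer_names])

def extract_features(images, target_hw=(64, 64)):
    """Return the fused feature tensor F_hat of shape (batch, h, w, d):
    each critical-layer map is resized to a common grid and concatenated
    along the channel axis."""
    x = tf.keras.applications.vgg19.preprocess_input(images)
    maps = extractor(x, training=False)
    maps = [tf.image.resize(m, target_hw, method="bilinear") for m in maps]
    return tf.concat(maps, axis=-1)
```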
3.2. Self-Feature Reconstruction Module
Motivated by the image-reconstruction methods [18,20], we trained a feature reconstruction network to automatically reconstruct F̂. F̂ represented a feature distribution that described the normal features of the defect-free samples X. During AE training, the network was optimized to minimize the difference between the input and output. Given an input F̂ from the feature space R^(h × w × d) of dimension d, the autoencoder was employed to produce features that were as close as possible to the input. In this step, the SRM attempted to reconstruct features of the normal sample distribution.
The detailed network structure of the SRM is shown in Figure 2. In order to ensure that the size information of the image was not lost, 1 × 1 convolutions were utilized in the network. The encoding unit in the SRM contained three convolution layers that compressed the features into the bottleneck layer. The spatial size of the input feature remained constant, but the number of channels was gradually compressed from d to d_3. In the decoding unit, three convolution layers were also employed, and their channel numbers mirrored those of the encoding unit. This ensured that the SRM could restore the input features F̂.
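As a rough sketch, one possible SRM built from 1 × 1 convolutions could look as follows; the intermediate channel widths d_1 and d_2 are assumptions for illustration, since the paper only specifies the compression from d down to d_3.

```python
import tensorflow as tf

def build_srm(d, d1=512, d2=256, d3=128):
    """Self-feature reconstruction module: a 1x1-convolutional autoencoder
    that preserves the spatial size while compressing channels d -> d3
    and restoring them back to d. d1/d2 are illustrative assumptions."""
    inp = tf.keras.Input(shape=(None, None, d))
    # Encoding unit: three 1x1 convolutions, channels d -> d1 -> d2 -> d3.
    h = tf.keras.layers.Conv2D(d1, 1, activation="relu")(inp)
    h = tf.keras.layers.Conv2D(d2, 1, activation="relu")(h)
    z = tf.keras.layers.Conv2D(d3, 1, activation="relu")(h)  # bottleneck
    # Decoding unit: channel counts mirror the encoding unit.
    h = tf.keras.layers.Conv2D(d2, 1, activation="relu")(z)
    h = tf.keras.layers.Conv2D(d1, 1, activation="relu")(h)
    out = tf.keras.layers.Conv2D(d, 1)(h)  # reconstructed feature F_hat_r
    return tf.keras.Model(inp, out, name="srm")
```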
3.3. Self-Feature Distillation
The SFD aims to train a student network S that uses a pretrained network to locate anomalies in images. It is trained to mimic the comprehensive behavior of a teacher network T. In order to reduce network parameters, most early knowledge distillation methods transfer the information of a complex network into a network with fewer parameters while keeping its performance unchanged. These methods are essentially based on the consistent final outputs of the large network T and the small network S. In the SFD, by contrast, the intermediate features of T on the normal training data are transferred to the student network instead of the final output.

Since F̂ was obtained from several critical blocks, the SFD encouraged the student network to gain T's knowledge of normal samples by conforming its intermediate representations to T's representations. Here, i denotes the i-th intermediate layer in the networks, with the teacher's features of that layer denoted as T_i and the student's features as S_i. The feature maps after the pooling operations were selected as S_i. The final Ŝ was generated by concatenating the branch features S_i. Our knowledge distillation concept was not to ensure that the final outputs of T and S were consistent, but to make their intermediate feature outputs consistent.
In this paper, VGG19 was selected as the pretrained feature extraction network. As can be seen in Figure 1, the three feature maps from different layers constituted the final extracted feature F̂. The features of different layers represented different semantic information: compared with the front layers, the semantic information of the rear layers was more abundant. The structure of the SFD is shown in Figure 3.
In order to better simulate the output of T, the structure of S did not need to be consistent with that of T. S used a simplified convolutional neural network; it only needed to ensure that the outputs of the key intermediate feature layers were consistent with those of the teacher network. As the teacher network was VGG19, we selected three convolutional block layers to conduct the knowledge distillation; their names in VGG19 were block3_conv4, block4_conv4, and block5_conv4. As shown in Figure 3, each layer feature was upsampled and then concatenated into the final feature tensor. The detailed structure of the student network is shown in Table 1.
BatchNormalization was used after each convolution layer for convergence stability. Three layers in the student network (Max pool2, Max pool3, and Conv10) were used for matching the teacher network. The parameters of T came from the pretrained model and remained unchanged; in the training process of the SFD, only the parameters of S were optimized.
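Below is a hedged sketch of what such a student network might look like; the matched outputs mirror the spatial and channel shapes of the teacher's block3_conv4 (64 × 64 × 256), block4_conv4 (32 × 32 × 512), and block5_conv4 (16 × 16 × 512) features for a 256 × 256 input, while the depths and filter widths are illustrative rather than the exact configuration of Table 1.

```python
import tensorflow as tf

def conv_bn_relu(x, filters):
    # Convolution followed by BatchNormalization for convergence stability.
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def build_student():
    """Simplified student network S with three matched outputs whose
    shapes mirror the teacher's distilled layers (widths are assumed)."""
    inp = tf.keras.Input(shape=(256, 256, 3))
    x = conv_bn_relu(inp, 64)
    x = tf.keras.layers.MaxPool2D()(x)       # Max pool1 -> 128 x 128
    x = conv_bn_relu(x, 128)
    x = conv_bn_relu(x, 256)
    s1 = tf.keras.layers.MaxPool2D()(x)      # Max pool2 -> 64 x 64 x 256
    x = conv_bn_relu(s1, 512)
    s2 = tf.keras.layers.MaxPool2D()(x)      # Max pool3 -> 32 x 32 x 512
    x = tf.keras.layers.MaxPool2D()(s2)      # -> 16 x 16
    s3 = conv_bn_relu(x, 512)                # Conv10   -> 16 x 16 x 512
    return tf.keras.Model(inp, [s1, s2, s3], name="student")
```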
3.4. Training Loss
Two losses were used to optimize the proposed network: L_srm and L_sfd, from the SRM and SFD, respectively. The first, L_srm, optimized the similarity between F̂ and its reconstructed feature F̂_r. The loss function for guiding the training of the SRM is:

$$L_{srm} = \left\| \hat{F}(x_i) - \hat{F}_r(x_i) \right\|_2^2$$

where ‖·‖_2 is the L2 norm and x_i is the input image.
The SFD was used to optimize the matched intermediate layers of T and S, ensuring that the outputs of S and T were as close as possible. Hence, L_sfd is expressed as:

$$L_{sfd} = \sum_{i} \left\| T_i(x) - S_i(x) \right\|_2^2$$

where the sum runs over the selected intermediate layers.
By considering the losses of the two parts, L_total is formulated as:

$$L_{total} = \lambda L_{srm} + (1 - \lambda) L_{sfd}$$

where λ adjusts the contributions of the different parts. In Section 4.4, we analyze the influence of different λ values on the anomaly segmentation performance on the dataset.
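Assuming the weighted form reconstructed above (the paper states only that λ adjusts the contribution of each part), the combined objective could be computed as in the following sketch, where f_hat, f_hat_r, t_feats, and s_feats follow the notation of Sections 3.1, 3.2, and 3.3:

```python
import tensorflow as tf

def total_loss(f_hat, f_hat_r, t_feats, s_feats, lam=0.8):
    """Assumed form: L_total = lam * L_srm + (1 - lam) * L_sfd.
    lam=0.8 is a placeholder; its effect is studied in Section 4.4."""
    # SRM branch: squared L2 distance between F_hat and its reconstruction.
    l_srm = tf.reduce_mean(tf.square(f_hat - f_hat_r))
    # SFD branch: summed distance over the matched intermediate layers.
    l_sfd = tf.add_n([tf.reduce_mean(tf.square(t - s))
                      for t, s in zip(t_feats, s_feats)])
    return lam * l_srm + (1.0 - lam) * l_sfd
```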
3.5. Detection of Fabric Anomalies
After the training phase, the SFC had learned to map the features of normal samples onto themselves in the SRM; likewise, the SFD had learned to reproduce the features of normal samples. In the test phase, the abnormal score could be obtained by calculating the affinity between F̂ and its reconstruction or distillation results. Anomaly scores were obtained from both the reconstruction branch and the distillation branch, and the final anomaly heatmap was produced by fusing the two:

$$A(x_i) = \beta_1 \cdot \mathrm{UP}\!\left( \left\| \hat{F}(x_i) - \hat{F}_r(x_i) \right\|_2 \right) + \beta_2 \cdot \mathrm{UP}\!\left( \left\| \hat{F}(x_i) - \hat{S}(x_i) \right\|_2 \right) \tag{5}$$

where x_i is the test sample; β_1 and β_2 represent the influence coefficients of the different branches on the final anomaly score; and UP(·) is the upsampling process, which employed bilinear interpolation to enlarge the difference map to the size of the input.
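A sketch of this fusion at inference time, assuming per-pixel L2 distances in feature space and a sum over the matched distillation layers; the default β values are placeholders (their actual settings are studied in Section 4.5):

```python
import tensorflow as tf

def anomaly_map(f_hat, f_hat_r, t_feats, s_feats,
                beta1=0.5, beta2=0.5, out_hw=(256, 256)):
    """Fuse the reconstruction error and the distillation error into one
    per-pixel anomaly heatmap, as in Equation (5)."""
    # Reconstruction branch: per-pixel feature distance.
    e_rec = tf.norm(f_hat - f_hat_r, axis=-1, keepdims=True)
    # Distillation branch: per-layer distances resized to a common grid.
    hw = tf.shape(e_rec)[1:3]
    e_dis = tf.add_n([
        tf.image.resize(tf.norm(t - s, axis=-1, keepdims=True), hw)
        for t, s in zip(t_feats, s_feats)])
    # UP(.): bilinear upsampling back to the input resolution.
    fused = beta1 * e_rec + beta2 * e_dis
    return tf.image.resize(fused, out_hw, method="bilinear")
```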
4. Experiments
To evaluate the performance of the proposed method, several sets of experiments were conducted. In this section, the experimental datasets and implementation details are given first. Then, the overall performance of our method is compared with several state-of-the-art models. Next, the ablation analysis of the model is discussed. Finally, we deploy the model in a real industrial environment for practical usage.
In our experiments, the classical carpet fabric dataset in MVTec AD was utilized. The carpet dataset is a real-world dataset composed of five different industrial defect types, with defects manually generated to match those occurring in real-world industrial inspection scenarios; it has been widely used for performance verification. The training set contained 280 defect-free images. The test set included 89 defective images covering five defect types (color, cut, hole, metal_contamination, and thread) and 28 defect-free images. An example image for each defect type is shown in Figure 4. For quantitative comparison of the performance of the different methods, a threshold-independent evaluation indicator, the pixel-level area under the receiver operating characteristic curve (AUROC), was adopted; this indicator is widely used for the performance evaluation of anomaly localization. All experiments were conducted with the deep-learning toolbox Keras on an NVIDIA 3090 GPU. The main architecture of the SFC in the experiments is shown in Figure 1. All input images were scaled to 256 × 256 pixels.
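For reference, the pixel-level AUROC can be computed by treating every pixel as one sample; a minimal sketch with scikit-learn (array names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_auroc(gt_masks, heatmaps):
    """gt_masks: (N, H, W) binary ground truth; heatmaps: (N, H, W) anomaly
    scores. Flattens all pixels into one ROC computation, as is standard
    for threshold-independent localization evaluation."""
    y_true = np.asarray(gt_masks).astype(int).ravel()
    y_score = np.asarray(heatmaps).ravel()
    return roc_auc_score(y_true, y_score)
```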
4.1. Overall Performance Comparison
To verify the effect of the proposed method more fully, the segmentation performance of the proposed method was compared with several state-of-the-art methods, including autoencoders (AE-SSIM [12] and multiscale AE [13]), a generative adversarial network (AnoGAN [15]), variational methods (FAVAE [16], VAE-GRAD [17], and VE-VAE [18]), and other superior unsupervised algorithms (SPADE [19], DFR [20], and US [21]). The results of the methods [12,13,15,16,17,18,19,20,21] were taken from the literature.
The segmentation results of these methods on the carpet dataset are exhibited in Table 2. As the table shows, our method outperformed the second-best and third-best methods by margins of 0.80% and 1.30%, respectively. The proposed method handled the textured fabric surfaces well, demonstrating its superiority and stability, as well as its potential to serve as a unified model for defect inspection in industrial applications.
4.2. Visual Inspection Result
The visual inspection results of our proposed method on the carpet dataset are shown in Figure 5. The proposed method could locate the anomalies accurately on the five classical defect types. It can be seen that thin texture changes could be captured well by our model, indicating that our approach has application prospects in real scenarios.
4.3. Influence of a Single Module
Previous studies showed that a single reconstruction module or distillation module already has fine discrimination ability for anomaly segmentation. We experimentally explored the effect of combining the modules for fine industrial anomaly segmentation. Table 3 shows the influence of a single module on anomaly localization.

As shown in Table 3, the single SFD showed the weakest discriminability, while the combined modules had the highest discriminability. The effect of the SRM alone exceeded that of the SFD because the reconstruction branch was more discriminative than the distillation branch: the distillation approach only mimicked the input features to some extent and ignored subtle abnormalities.
4.4. Influence of the Hyperparameter λ
The hyperparameter λ was used to balance the relative contributions of the SRM and SFD in the training phase. The best value was achieved when the normal features of the data could be well modeled. To illustrate the influence of λ intuitively, the carpet dataset was employed for verification. Table 4 shows the influence of the hyperparameter λ. When λ increased, the inspection performance on the carpet increased smoothly. This occurred because the loss of the SRM dominated the total loss, while the loss of the SFD can be regarded as an auxiliary loss.
4.5. Setting of the Hyperparameter βi in Equation (5)
The fusion mechanism aimed to take advantage of the anomaly representations in different feature spaces. The values of β_i, which represented the significance of the SRM and SFD, were used to obtain the fusion in Equation (5). This was a critical parameter that affected the inspection result, so we investigated the sensitivity of different combinations of β_i. Table 5 shows the outcomes of the various combinations. It can be seen in Table 5 that the fused results were generally better than those of a single branch, showing that fusion could improve segmentation performance.
4.6. Applications
To further assess its practical performance, we applied our method to two complicated fabric datasets from real industrial environments, established with automated optical inspection (AOI) equipment.
The star-patterned fabric dataset [22] was published by H.Y.T. Ngan and G.K.H. Pang of the University of Hong Kong, and is composed of 50 images acquired with an AGFA Scan1236 scanner in 2003. The images in the dataset can be regarded as a composite texture of stars arranged periodically. The training set contains 25 normal images without defects, while the test set contains 25 defect images in five categories. The size, shape, and type of the defects differed between images and contrasted strongly with the periodic texture of the background. The anomalies in the target images included broken-end, hole, netting-multiple, thick-bar, and thin-bar. All images used in the evaluations are 256 × 256 pixel grayscale images. The details are shown in Table 6.
The color fabric dataset [23] was released by the Tianchi platform [24]. Since the original dataset was built for the object detection task, we produced a new dataset for unsupervised fabric defect localization. It consisted of 168 defect-free samples for training; the test set included 200 defective samples and 24 defect-free samples. The image size in all datasets was 256 × 256 pixels. The anomalies in the target images were varied, including long-span scratches and small dots. Pixel-level labeled ground-truth images were provided for evaluation. Accurately locating anomalies on a complicated real texture background was a difficult challenge. The details are shown in Table 7.
The anomaly maps for the two datasets are presented in Figure 6 and Figure 7, demonstrating the great potential of our method in industrial applications. Our approach was not only suitable for a uniform texture background, but also adapted to defect detection with a challenging nonuniform color background.
In addition, a comparison of different methods (AE-SSIM [12], multiscale AE [13], FAVAE [16], SPADE [19], and US [21]) on the above two datasets is shown in Figure 8. We compared our method with AE-SSIM [12] and multiscale AE [13] using their officially released code. Since official implementations are not publicly available for FAVAE [16], SPADE [19], and US [21], we used third-party implementations.
On the star-patterned fabric dataset, three methods, FAVAE [16], SPADE [19], and US [21], also achieved pixel AUROC values of more than 97%. On the color fabric dataset, where the variations between images were large, our method achieved the best results. Although FAVAE [16] could capture the internal defect-free features, when the imaging differences between defect-free images were large, tiny defects were covered up by the complex textures, resulting in false detections; the pixel AUROC of FAVAE [16] did not exceed 75%. SPADE [19] and US [21] scored 2.1% and 3.3% lower than our method on the color fabric dataset, respectively. Overall, the proposed SFC method achieved results comparable to or better than the state-of-the-art methods.