1. Introduction
In recent years, artificial intelligence (AI) has been applied in many fields and is widely recognized for its excellent results on many traditionally difficult problems. AI techniques now achieve near-human-level performance in a variety of computer vision tasks, such as image classification and object recognition [1], and have been deployed successfully. The models in current use achieve excellent accuracy, but they provide no proper description of their reasoning and act as black boxes. As AI technology develops, machine learning models are becoming increasingly complex: accuracy improves while transparency decreases, and a significant part of the decision process remains unclear and difficult to explain to users, making it impossible to fully understand the function or logic behind a prediction. This is considered one of the biggest obstacles to the application of AI technology. As black-box machine learning models are increasingly used to make important predictions in critical contexts, the demand for transparency from the various stakeholders in AI is growing [2]. The opacity of machine decision making reduces human trust in artificial intelligence. For example, in the medical field, a deep learning system may determine from medical images that a patient has cancer [3] while a human medical expert holds the opposite opinion. Because the system cannot provide an explanation, the expert may not accept its conclusion, and a wrong judgment may lead to medical errors. Whatever the application, it is clear that if the explainability problem of deep learning is not solved, the future development of its applications will be limited.
Intelligent systems are being developed in application areas such as autonomous robots and vehicles [4,5,6], health care [7,8,9], including soft tissue sarcoma segmentation [10], skin lesion segmentation [11], and coronavirus (COVID-19) classification [12], and classification and detection in image processing [13,14,15]. For practical, social, and legal reasons, automated systems must provide users, developers, and regulators with explanations when making decisions or recommendations. Making the black-box model transparent and explainable is therefore an important research topic in the field of AI. The techniques for making models transparent and explainable are collectively called explainable AI (XAI); XAI techniques improve the reliability of models by giving users confidence that the models make good decisions [16]. Many existing methods compute the importance of a given base model and output class [17,18,19], but they require access to intermediate feature maps, network weights, and other internal factors of the underlying model. The explainability of an AI model is illustrated in Figure 1.
A black-box model can be interpreted as follows:
- (1) Model properties: presentation of specific properties of the model or its predictions, such as sensitivity to changes in those properties, or identification of the model components responsible for a given decision.
- (2) Local logic: presentation of the internal logic behind an individual decision or prediction.
- (3) Global logic: representation of the model's entire internal logic.
The black-box model problem specifically involves the following three points [20]:
- (1) Inability to establish causality: the internal structure of a black-box model is complicated. When making predictions, we judge the model by evaluation metrics such as the AUC, but even if the AUC is high, it remains unclear whether the black-box model bases its predictions on the correct evidence. If the model cannot provide a reasonable causal relationship, its results are difficult to find convincing.
- (2) Insecurity of the black-box model: for modelers, the complicated internal structure makes external attacks on the model difficult to detect; for users, who do not understand the operating mechanism of the model and only use its outputs to make decisions, it is difficult to detect anomalies from the results. This can make relying on the model's outputs unsafe.
- (3) Possible bias in the black-box model: when making predictions, the model can reinforce data imbalance introduced during the data collection process, which leads it to biased results.
This, too, is a problem that needs to be overcome in the future.
Here, we propose a new black-box approach for estimating pixel saliency. By inserting and removing pixels, we estimate the weights corresponding to different pixels and visualize the resulting saliency range; the saliency of different pixels is presented as a heat map that highlights the key pixels, providing an explanatory illustration for humans. With OISE, we can see clearly which region of the image the network focuses on, and, compared with traditional explainable deep learning methods, OISE overcomes their inability to handle multiple targets in the same image. Unlike traditional CAM-series methods, which require changing the network structure, OISE needs no internal access to the network and no reimplementation for each architecture; it applies to existing image networks and treats the model as a complete black box, with no assumptions about parameters, features, or classifiers.
Our contributions are as follows:
- (1) We improve the Randomized Input Sampling for Explanation (RISE) method by generating the masks in an optimized way, which reduces the computational effort and makes the generated saliency regions more accurate.
- (2) We introduce a new black-box explanation method that addresses the shortcomings of perturbation-based approaches and provides an intuitive, understandable representation of the weight of each activation.
- (3) We evaluate the generated saliency maps, show that the fairness of the method can be verified, and demonstrate that it finds better evidence for the target category.
In Section 2, we introduce related work, summarize the shortcomings of existing methods, and describe how we improve on them; in Section 3, we give a detailed description of the implementation of the OISE method; in Section 4, the experimental process and results are described, and the results are evaluated using pixel deletion and insertion to confirm the practicality and accuracy of OISE. Finally, Section 5 summarizes the main ideas of OISE and discusses the advantages and disadvantages of the method, its application areas, and future research directions.
2. Related Works
Researchers have explored many directions in the field of explainable artificial intelligence, and the importance of interpretation has been widely studied in various fields of machine learning and deep learning.
The Randomized Input Sampling for Explanation (RISE) method [21] introduced by Petsiuk et al. perturbs an input image by multiplying it with randomized masks. RISE treats the model as a black box, unlike white-box methods, which infer pixel importance from internal network states; instead, it uses mask-based visualization, estimating the importance of each input image region for the model's prediction. The method first samples small binary masks and then upsamples them with bilinear interpolation to the input resolution. After interpolation, each mask Mi is no longer binary but takes values in [0, 1]. To make the masks more flexible, all masks are shifted by a random number of pixels in both spatial directions. The saliency of the pixels is then estimated by randomly dimming combinations of pixels toward zero intensity, which is achieved by multiplying the image with a mask of [0, 1] values, and the saliency map is obtained by empirically estimating the resulting expectation with Monte Carlo sampling. Because this black-box explanation method generates many masks by random or Monte Carlo sampling and computes the saliency contribution of each mask, it usually requires a large number of masks and computations, making it complex and costly in time and resources.
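As a concrete illustration, a minimal sketch of this mask-generation and Monte Carlo estimation procedure might look as follows in PyTorch; the `model` callable, the mask parameters, and the normalization are illustrative assumptions rather than the reference implementation of RISE.

```python
import numpy as np
import torch
import torch.nn.functional as F

def generate_rise_masks(n_masks, small_size=7, input_size=224, p_keep=0.5):
    """Sample small binary masks, upsample with bilinear interpolation,
    and apply a random spatial shift, as described for RISE."""
    cell = int(np.ceil(input_size / small_size))
    up_size = (small_size + 1) * cell
    masks = np.empty((n_masks, input_size, input_size), dtype=np.float32)
    for i in range(n_masks):
        grid = (np.random.rand(small_size, small_size) < p_keep).astype(np.float32)
        grid_t = torch.from_numpy(grid)[None, None]           # 1 x 1 x s x s
        up = F.interpolate(grid_t, size=(up_size, up_size),
                           mode="bilinear", align_corners=False)[0, 0].numpy()
        dx, dy = np.random.randint(0, cell, size=2)           # random shift
        masks[i] = up[dx:dx + input_size, dy:dy + input_size]
    return masks                                              # values in [0, 1]

def rise_saliency(model, image, masks, target_class, batch_size=32):
    """Monte Carlo estimate: saliency = weighted sum of masks, weighted by
    the black-box model's score on each masked image."""
    scores = []
    with torch.no_grad():
        for start in range(0, len(masks), batch_size):
            m = torch.from_numpy(masks[start:start + batch_size])   # B x H x W
            masked = image[None] * m[:, None]                       # B x C x H x W
            probs = torch.softmax(model(masked), dim=1)
            scores.append(probs[:, target_class])
    weights = torch.cat(scores)                                      # N
    sal = (weights[:, None, None] * torch.from_numpy(masks)).sum(0)
    return sal / (len(masks) * masks.mean())                         # normalize by E[M]
```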
To localize visual evidence in images, Class Activation Mapping (CAM) [22] emerged in 2016. In CAM, the authors argue that the global average pooling layer has localization capability: the original pooling and fully connected layers after the convolutional backbone are replaced with a global average pooling layer followed by a fully connected layer, the model is retrained to obtain the class weights, and the weighted sum of the deep feature maps builds the saliency map. The class activation map is simply a weighted linear sum of the presence of these visual patterns at different spatial locations. By upsampling the class activation map to the size of the input image, the image regions most relevant to a particular class can be identified, providing a new perspective on the explainability of convolutional neural networks. CAM can also be used in many other ways. However, the method applies only to a specific CNN architecture, and the importance of each feature map is obtained by retraining the model to learn the corresponding weights in the fully connected layer. The technique is useful but has drawbacks: first, it requires changing the network structure, for example replacing the fully connected layer with a global average pooling layer, which complicates training; second, it is a visualization technique designed for classification problems and is less effective for regression problems.
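To make the weighted-sum construction concrete, here is a brief sketch, assuming the final convolutional feature maps and the learned fully connected weights have already been extracted (the variable names and normalization are illustrative):

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weights, class_idx, output_size=(224, 224)):
    """CAM sketch: the saliency map is the weighted sum of the last convolutional
    feature maps, using the fully connected weights (learned after global average
    pooling) of the target class, upsampled to the input resolution."""
    # feature_maps: K x h x w (last conv layer); fc_weights: num_classes x K
    w = fc_weights[class_idx]                                  # weights for the target class
    cam = (w[:, None, None] * feature_maps).sum(dim=0)         # h x w weighted sum
    cam = F.interpolate(cam[None, None], size=output_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize for display
```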
To address the shortcomings of CAM, an improved technique, Gradient-weighted Class Activation Mapping (Grad-CAM) [23], emerged in 2017; it allows visualization without changing the network structure. Grad-CAM extends CAM by weighting each feature map channel with the class-specific average gradient of its activations. First, given an image and a target class as input, the image is propagated through the CNN part of the model, and the raw score for that class is obtained by task-specific computation. The gradients for all classes are set to zero, except for the target class, whose gradient is set to one. This signal is then backpropagated to the rectified convolutional feature maps, which are combined to compute the coarse Grad-CAM localization (a heat map) indicating where the model looks when making its decision. Finally, the heat map is multiplied point-wise with Guided Backpropagation to obtain Guided Grad-CAM, a high-resolution and class-discriminative visualization. The difference from CAM is that Grad-CAM applies a ReLU to the final weighted sum: we only care about pixels that have a positive influence on the target class, and without the ReLU we might include pixels belonging to other classes, which would degrade the explanation.
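A sketch of this gradient-weighted computation using PyTorch hooks is given below; the hook-based layer access and the normalization are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Grad-CAM sketch: channel weights are the spatially averaged gradients of
    the target-class score w.r.t. the chosen conv layer; the map is their
    weighted sum followed by ReLU and upsampling."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        score = model(image[None])[0, class_idx]           # raw score of the target class
        model.zero_grad()
        score.backward()                                    # backpropagate target class only
        acts = activations["value"][0].detach()             # K x h x w
        grads = gradients["value"][0].detach()              # K x h x w
        weights = grads.mean(dim=(1, 2))                     # global-average-pooled gradients
        cam = F.relu((weights[:, None, None] * acts).sum(dim=0))
        cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                            mode="bilinear", align_corners=False)[0, 0]
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove()
        h2.remove()
```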
Haofan Wang, Zifan Wang et al. proposed Score-CAM [24], which follows the main idea of CAM (a linear weighting of the feature maps). Compared with the earlier CAM-series methods, the main difference lies in how the linear weights are obtained. The original CAM uses the fully connected weights learned after retraining; Grad-CAM and Grad-CAM++ [25] both use local gradients on the corresponding feature maps (they differ in how the gradients are processed). Unlike these gradient-based approaches, Score-CAM obtains the weight of each activation map from the forward-pass score of that activation map on the target class, eliminating the dependence on gradients; the final result is a linear combination of the weights and the activation maps. The study shows that Score-CAM provides better visualization and fairness in explaining the decision process. Score-CAM not only locates a single object accurately, but also outperforms previous work in locating multiple objects of the same class. Grad-CAM tends to capture only one target in an image; Grad-CAM++ and Score-CAM can both locate multiple targets, but Score-CAM's saliency map is more focused than Grad-CAM++'s.
Marco Tulio Ribeiro et al. proposed Local Interpretable Model-agnostic Explanations (LIME) in 2016 [26]. LIME interprets the predictions of a classifier or regressor by fitting an interpretable model as a local approximation. It perturbs a single data sample by adjusting feature values and observing the effect on the output. The output of LIME is a set of explanations representing the contribution of each feature to the prediction for that single sample, i.e., a local interpretation. However, the LIME algorithm is slow, because every sampled perturbation must be passed through the original model for prediction once sampling is complete.
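In the same spirit, a much-simplified local surrogate for images can be sketched as follows; the superpixel segmentation, kernel width, ridge surrogate, and the batch interface of `model_fn` are illustrative choices, not the API of the official lime package.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_image_sketch(model_fn, image, segments, target_class,
                      n_samples=1000, kernel_width=0.25):
    """Simplified LIME-style explanation: perturb superpixels on/off, query the
    black-box model, and fit a locally weighted linear surrogate whose
    coefficients give per-superpixel contributions."""
    n_seg = segments.max() + 1
    z = np.random.randint(0, 2, size=(n_samples, n_seg))      # binary superpixel masks
    z[0] = 1                                                   # include the original image
    preds = np.empty(n_samples)
    for i, row in enumerate(z):
        perturbed = image.copy()
        perturbed[~np.isin(segments, np.where(row)[0])] = 0    # zero out absent segments
        preds[i] = model_fn(perturbed[None])[0, target_class]
    # Proximity weights: samples closer to the original (all segments on) count more
    dist = np.linalg.norm(1 - z, axis=1) / np.sqrt(n_seg)
    sample_weight = np.exp(-(dist ** 2) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(z, preds, sample_weight=sample_weight)
    return surrogate.coef_                                     # contribution of each superpixel
```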
In summary, the RISE method generates many masks by random or Monte Carlo sampling and computes the importance of each mask region, which usually requires a large number of masks and computations; it is complex and wastes time and resources. The existing CAM-series methods can localize image targets fairly accurately, but their evaluation metrics must be estimated manually, the model architecture must be changed, and they may fail to locate multiple objects or locate them inaccurately. The LIME algorithm is highly general and does not require changes to the model internals, but its result is a local rather than a global approximation of the black-box model; when the input is perturbed, the samples follow a Gaussian distribution that ignores correlations between features, and the method is not stable, so repeated explanations with the same parameters and the same method can yield completely different results.
3. Proposed Method
3.1. Framework of OISE
OISE is a new black-box interpretation method for better visual explanation. By optimizing the loss function and continuously updating the mask, we reduce the computation required for randomly generated masks and minimize the area of the generated mask that influences the decision score, making the generated saliency region more accurate. The model framework is shown in Figure 2.
3.2. Definition of the OISE-Based Black-Box Method
Using random or Monte Carlo sampling to generate multiple masks and compute the significance of each mask typically requires a large number of masks and computations. It is complex, wastes time and resources, and the range of significant features in the generated saliency map is not particularly accurate.
The OISE method can generate a saliency map without access to the internals of the network and does not need to rebuild the network architecture; it is applicable to any existing image network.
The mask is generated in an optimized way: the loss function is optimized and the mask is continuously updated so as to minimize the mask area that affects the decision score. Computing the mask this way reduces the amount of computation, lowers the complexity, and narrows the highlighted region in the final saliency map. The main procedure is to sub-sample the input image with the masks, record the response of the base model to each masked image, and derive the weights from the output probabilities predicted for the masked images. The final saliency map is the linear combination obtained by multiplying the weights with the masks and summing.
The importance of a pixel $\lambda$ is its expected score over the masks that preserve it:
$$S_{I,f}(\lambda) = \mathbb{E}_{M}\big[f(I \odot M)\mid M(\lambda) = 1\big],$$
where $f$ is a black-box model that produces a scalar confidence score for a given input, $I$ is the input image, $M$ is the binary mask, and $\odot$ denotes element-wise multiplication. The more important the pixel $\lambda$, the higher $S_{I,f}(\lambda)$ is.
First, the image I is input and a mask for the input image is generated.
Mask generation process:
- (1) Initialize a random mask.
- (2) Use gradient descent to optimize the MSE (mean squared error) loss function $J(\theta)$, where $X$ and $Y$ denote the horizontal and vertical coordinates of the input image. To learn the desired model, we find the optimal parameters $\theta$ by minimizing the cost function $J(\theta)$: choose initial parameter values (for example, $\theta = 0$) and a step size (learning rate) $\alpha$, compute the partial derivatives of the loss function, and update
$$\theta_j \leftarrow \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}$$
until convergence.
- (3) Add regularization to keep the parameters close to 0 but not equal to 0, which reduces the parameter magnitude, the complexity, and the mask area that affects the decision score:
$$J_{\mathrm{reg}}(\theta) = J(\theta) + \lambda \sum_{j} \theta_j^{2}.$$
- (4) Update the mask M according to the optimized loss function and use it to mask the input image I (a sketch of this optimization loop is given after this list).
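The sketch below shows how steps (1)-(4) might be instantiated for a single mask; the exact form of the MSE term and the regularization weight are assumptions, since the text above specifies them only at a high level.

```python
import torch

def optimize_mask(model, image, target_class, n_steps=100, lr=0.05, reg_lambda=0.01):
    """Illustrative mask optimization (steps (1)-(4) above): start from a random
    mask, minimize an MSE-style loss on the target-class score plus an L2 term
    that keeps mask values small but nonzero, and update by gradient descent."""
    mask = torch.rand_like(image[:1])            # 1 x H x W, random initialization
    mask.requires_grad_(True)
    optimizer = torch.optim.SGD([mask], lr=lr)
    for _ in range(n_steps):
        optimizer.zero_grad()
        masked = image[None] * mask.clamp(0, 1)[None]
        score = torch.softmax(model(masked), dim=1)[0, target_class]
        # MSE-style penalty pushing the masked-image score toward the full score of 1,
        # plus L2 regularization that shrinks the mask area without zeroing it
        loss = (score - 1.0) ** 2 + reg_lambda * (mask ** 2).sum()
        loss.backward()
        optimizer.step()
    return mask.detach().clamp(0, 1)
```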
Then, the input image I is multiplied element-wise by each mask $M_i$ to obtain the masked images $I \odot M_i$, $i = 1, \ldots, N$.
Each masked image is fed into the base model f, which outputs its weight value.
The weights are the probability scores produced for the masked images and are adjusted according to the distribution of the masks. The final saliency map is generated as a linear combination of the weights and the masks: they are multiplied and then summed.
Finally, the weighted sum of the masks gives the saliency map. The complete process is shown in Algorithm 1.
Algorithm 1 OISE
Input: image I, mask M, model f
Output: saliency map of I (linear combination of masks)
1: for i = 1 to N do
2:  initialize a random mask $M_i$
3:  compute the MSE loss
4:  add the regularization term
5:  update $M_i$ by gradient descent on the regularized loss
6:  obtain the optimized mask $M_i$
7:  keep the mask parameters close to 0, but not equal to 0
8: end for
9: compute the masked images $I \odot M_i$
10: $w_i \leftarrow$ output probability of the masked image predicted by f
11: saliency map $\leftarrow \sum_{i} w_i\, M_i$
Time complexity: O(N)
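Putting Algorithm 1 together, a compact sketch might look as follows; `optimize_fn` is a placeholder for the mask-optimization step of Section 3.2 (for instance, the `optimize_mask` sketch above), and the final normalization is an illustrative choice.

```python
import torch

def oise_saliency(model, image, target_class, n_masks=50, optimize_fn=None):
    """Sketch of Algorithm 1: obtain N (optimized) masks, score each masked
    image with the black-box model f, and return the weighted linear
    combination of the masks as the saliency map."""
    masks, weights = [], []
    for _ in range(n_masks):
        m = optimize_fn(image) if optimize_fn else torch.rand(1, *image.shape[-2:])
        with torch.no_grad():
            masked = image[None] * m[None]                     # 1 x C x H x W
            w = torch.softmax(model(masked), dim=1)[0, target_class]
        masks.append(m[0])
        weights.append(w)
    masks = torch.stack(masks)                                  # N x H x W
    weights = torch.stack(weights)                              # N
    sal = (weights[:, None, None] * masks).sum(dim=0) / n_masks
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)   # normalize for display
```

For example, `oise_saliency(model, img, cls, optimize_fn=lambda im: optimize_mask(model, im, cls))` would chain the two sketches.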
4. Experimental Results and Discussion
4.1. Experimental Results
In this section, we conduct experiments to evaluate the effectiveness of the proposed explanation method. First, we qualitatively evaluate the method with visualizations on ImageNet to demonstrate the effectiveness of class-conditional localization of objects in a given image. In our experiments, we use the publicly available object classification datasets ILSVRC2012 and PASCAL VOC 2007, and we use H = W = 224 throughout.
The ILSVRC2012_img_val dataset from ImageNet contains 50,000 images, 50 for each category. These categories correspond to a set of 1000 synsets in WordNet: if an image contains an object of synset x, it belongs to category x. The PASCAL VOC 2007 standard dataset is a benchmark for measuring image classification ability; it contains 5011 images in the training set and 4952 images in the test set, for a total of 9963 images in 20 categories.
The evaluation was performed on the top-1 and top-5 predicted categories, with 5000 images selected from the dataset. Given an image, we first obtain category predictions from the network and then generate OISE saliency maps for each predicted category. We use pre-trained VGG-16 [27], GoogLeNet [28], and ResNet-50 [29] models to evaluate OISE. After evaluating on ILSVRC2012 and PASCAL VOC 2007, we report the top-1 and top-5 localization errors in Table 1. On all three classical neural networks, OISE has a lower localization error than CAM and Grad-CAM. CAM and Grad-CAM require changes to the model structure and retraining, which worsens the classification error, whereas OISE improves classification performance.
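For reference, the pretrained backbones can be loaded with torchvision roughly as follows; the preprocessing pipeline and weight identifiers shown are the standard ImageNet defaults and are given only to illustrate the evaluation setup (older torchvision versions use `pretrained=True` instead of the `weights` argument).

```python
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                       # H = W = 224, as in the experiments
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

backbones = {
    "vgg16": models.vgg16(weights="IMAGENET1K_V1"),
    "googlenet": models.googlenet(weights="IMAGENET1K_V1"),
    "resnet50": models.resnet50(weights="IMAGENET1K_V1"),
}

def top_k_predictions(model, image_path, k=5):
    """Return the top-k ImageNet class indices for one image."""
    model.eval()
    x = preprocess(Image.open(image_path).convert("RGB"))[None]
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return probs.topk(k, dim=1).indices[0].tolist()
```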
The accuracy on the ILSVRC2012 and PASCAL VOC 2007 datasets is shown in Table 2. OISE achieves consistently high accuracy.
As shown in Table 3, OISE achieves an average drop of 47.4% and an average increase of 19.7%, which is better than the other CAM-based methods. The original input is masked by point-wise multiplication with the saliency map, and the score change on the target class is observed. We follow the metrics used in [25] to measure quality: the average drop is expressed as
$$\text{Average Drop} = \frac{1}{N}\sum_{i=1}^{N}\frac{\max\!\big(0,\,Y_i^{c} - O_i^{c}\big)}{Y_i^{c}} \times 100,$$
and the average increase as
$$\text{Average Increase} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\big[Y_i^{c} < O_i^{c}\big] \times 100,$$
where $Y_i^{c}$ is the predicted score for class c on image i and $O_i^{c}$ is the predicted score for class c when only the explanation-map region is given as input. The experiment is performed on 2000 randomly selected images from the ILSVRC2012 validation set.
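These two metrics can be computed directly from the per-image scores; a small sketch, assuming the full-image and masked-image scores have already been collected:

```python
import numpy as np

def average_drop_increase(full_scores, masked_scores):
    """Average Drop and Average Increase over N images for one class:
    full_scores[i]   = Y_i^c, score on the full image,
    masked_scores[i] = O_i^c, score when only the explanation-map region is kept."""
    Y = np.asarray(full_scores, dtype=float)
    O = np.asarray(masked_scores, dtype=float)
    avg_drop = float(np.mean(np.maximum(0.0, Y - O) / (Y + 1e-8)) * 100)
    avg_increase = float(np.mean(O > Y) * 100)
    return avg_drop, avg_increase
```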
OISE performs well in the recognition task and can successfully detect the distinguishable regions of target objects. The results of the recognition task show that OISE reflects the decision process of the original CNN model better than previous methods. A statistical comparison of the results is shown in Figure 3.
We qualitatively compared the saliency maps generated by the three methods (CAM, Grad-CAM, and OISE). Our method produces a more intuitively explainable saliency map with less random noise. The results are shown in Figure 4.
The method explains its reasoning to people more clearly and is more convenient than white-box methods: it works on arbitrary networks without internal access, does not require reimplementing each network architecture, applies to existing image networks, and treats the model as a complete black box. It also provides more precise localization than traditional CAM-series methods. Grad-CAM cannot perform multi-target detection; Score-CAM improves on this shortcoming with more accurate localization but performs poorly on the classification task, whereas OISE performs well in both the localization and classification tasks. However, the method also has limitations: it requires considerable computation to generate and update the masks, and because gradient descent is used, the optimal parameters of the mask optimization may differ between updates, so the final saliency range can vary from run to run. More recognition results are shown in Figure 5.
4.2. Multi-Target Positioning
Compared with previous methods, this work significantly narrows the salient region in the recognition results, allowing more accurate localization. For example, when an image is identified as a bull mastiff, both the face and the body contribute to the recognition result, but the facial features are more indicative of a bull mastiff than the body. OISE also overcomes the drawback that previous methods could not perform the classification task, or performed it poorly.
Figure 6 shows how OISE recognizes different target objects in the same image and the evidence it highlights for each, demonstrating that OISE can classify multiple targets. The color of the saliency map indicates the importance of each pixel, with red indicating the most important regions.
4.3. Evaluation
There are two automatic evaluation metrics: deletion and insertion. The deletion metric measures how the decision of the underlying model changes as pixels are removed: as the more important pixels (as ranked by their saliency scores) are removed from the image, the predicted probability of the category should decrease, and a sharp drop in the probability curve indicates a good explanation. The insertion metric takes the complementary approach: as pixels are inserted, the predicted probability increases, and the higher the area under the curve (AUC), the better the explanation. When a pixel is removed from an image, its value can be set to 0 or some other constant; similarly, insertion can start from a highly blurred image and gradually reveal regions. The results are shown in Figure 7.
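A sketch of the deletion metric follows; the step size, the zero fill value, and the AUC normalization are illustrative choices (insertion is analogous, starting from a blurred image and revealing the most salient pixels first, where a higher AUC is better).

```python
import numpy as np
import torch

def deletion_auc(model, image, saliency, target_class, step=224):
    """Deletion metric: remove pixels from most to least salient, track the
    target-class probability, and report the area under the curve
    (a lower AUC indicates a better explanation)."""
    c, h, w = image.shape
    order = torch.argsort(saliency.flatten(), descending=True)   # most salient first
    img = image.clone().reshape(c, -1)
    probs = []
    with torch.no_grad():
        for start in range(0, h * w + 1, step):
            p = torch.softmax(model(img.reshape(1, c, h, w)), dim=1)[0, target_class]
            probs.append(p.item())
            img[:, order[start:start + step]] = 0.0              # delete next batch of pixels
    return float(np.trapz(probs, dx=1.0 / len(probs)))
```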
4.4. Discussion
The main idea of the algorithm is to summarize statistics over different features and visualize their saliency in order to establish a causal relationship between features and predictions. Many explainability methods compute summary statistics for each feature from the decision results and return a quantitative metric, such as feature importance, that measures the influence of different features on the prediction; visualizing these statistics produces a saliency map of the important features. The method needs to be pre-trained on a large number of labeled images, and the more images and categories used in training, the more accurate the results.
5. Conclusions
We propose Optimized Input Sampling Explanation (OISE), a new black-box explanation method for visual interpretation. OISE reduces the computational cost of randomly generated masks by optimizing the loss function, continuously updating the masks, and minimizing the mask area that influences the decision score; the resulting saliency maps show why the model makes its final decision, providing a better explanation to the user. We analyze and compare the evaluation results, and the method outperforms previous CAM-based methods, giving better visual explanations in multi-target classification tasks and making machine decisions more transparent and credible. The method can be applied in many domains, such as the XAI problem for visual object detection, where it can be integrated into a model to generate an explanation for each detection, i.e., for each bounding box; at the detection level, an attention map is computed to assess what information leads to a particular decision.
A good visual explanation increases people's confidence in a black-box model, and as science and technology continue to develop, more accurate explainable models can be applied more widely in fields such as medicine, automobiles, and industry to reduce human workload. However, the accuracy of current methods still needs to improve, and this remains an important issue to explore. In future research on deep learning explainability, we can focus on merging different model interpretation techniques to build a more powerful interpretation method; on developing metrics for interpretation methods so that interpretation results can be measured more rigorously; and on interpretation for unsupervised and self-supervised methods, to give models stronger explainability, ensure their fairness, improve privacy protection and robustness, and increase users' trust in explainable systems.