1. Introduction
In recent years, machine and deep learning models have been proposed to address complex tasks, such as anomaly detection. Anomaly detection methods aim to find abnormal instances in data that deviate from the normal or expected behavior. These methods are useful for providing actionable information that can be critical to a system [
1,
2]. Anomaly detection serves a vital role in various domains and applications, including the military surveillance of enemy activity, intrusion detection in cyber-security, fraud detection in insurance, and fault detection in safety-critical systems [
3]. Despite the high accuracy achieved by these methods, their applicability in real-life settings is still considered limited due to their opaqueness and black box nature in the eyes of end users, stakeholders, and even researchers. The explainable artificial intelligence (XAI) field of research seeks to balance the trade-off between accuracy and interpretability. Explanations have the potential to not only provide transparency but also reveal hidden biases and errors in machine and deep learning models.
When an anomaly is detected, domain experts must understand the features contributing to its detection. Without a clear explanation, it is challenging to investigate and address the root cause effectively [
4,
5]. Trustworthy explanations are crucial, as they increase user confidence and can improve the performance of anomaly detectors [
6]. Correct explanations are necessary for identifying the source of an anomaly, as effective mitigation relies on fully understanding its origin [
7]. Ground truth explanations, which reveal the actual reasons behind model decisions, are essential for assessing the accuracy of explanation methods. These can be obtained either through costly expert labeling or algorithmically.
A limited number of studies have addressed anomaly explanations, and even fewer have concentrated on evaluating the explanations. Of those, only a few measured the explanations’ correctness. This is not surprising considering there is no consensus on what constitutes a proper explanation. This work focuses on the evaluation of the correctness and robustness of anomaly explanations using the ground truth. For this purpose, we propose a novel anomaly benchmark data set, based on digital circuits, with local ground truth explanations produced by an algorithm we developed based on game theory. We also present a methodology for evaluating explanations using quantitative metrics. To demonstrate the utilization of the data set and methodology, we adapted a method for explaining anomalies revealed by an autoencoder [
8]; the transition from an autoencoder to a digital circuit, through a decision tree, is legitimate in light of previous studies [
9,
10,
11,
12]. We evaluated three model-agnostic explanation methods, Kernel SHAP, Sampling SHAP [
13], and LIME [
14], by comparing the local explanations to the ground truth.
Since the approach of evaluating the correctness of explanations is in its infancy, we chose to create a binary data set as a basis. Binary data sets have been used for decades in various fields. One of the most well known data sets in the image processing field is MNIST, a large data set of handwritten digits, which was first introduced by LeCun et al. [
15] in 1998. This data set has been used as a worldwide machine learning benchmark for more than 2 decades, although the original images were black and white, i.e., binary. Our contribution is threefold:
(1) We provide a data set with labeled anomalies, based on a benchmark data set of digital circuits [16]; the data set captures both linear and nonlinear relationships between features to represent complex real-world scenarios, and the anomalies are created by modifying different gates in the circuits.
(2) The data set is accompanied by local ground truth explanations for the anomalies, generated using a game theory technique for assigning influence to relevant features in Boolean functions.
(3) We provide a methodology for evaluating the correctness and robustness of local explanations. The methodology utilizes several correctness and robustness metrics, which are calculated using the ground truth explanations.
We have made our data set with local ground truth explanations (
https://doi.org/10.7910/DVN/W4FPPN, accessed on 5 September 2021) and evaluation methodology code (
https://github.com/XAI-Lab/CREM, accessed on 1 December 2022) public for the use of other researchers. Although we use specific settings for the evaluation, the methodology can also be used with other anomaly detectors and explanation methods.
The rest of the paper is organized as follows: In
Section 2, we introduce the field of explanation methods and provide an overview of evaluation metrics for explanations.
Section 3 reviews related studies that proposed evaluation methods using ground truth, methods for explaining anomalies, and works revealing techniques we rely on in our algorithm. In
Section 4, we describe the proposed data set and evaluation methodology for explanations.
Section 5 demonstrates how the data set and methodology can be used to assess the correctness and robustness of local model-agnostic explainers. In
Section 6, we discuss our conclusions and future research directions.
2. Background
In this section, we introduce existing explanation methods and elaborate on commonly used metrics for evaluating explanations.
2.1. Explanation Methods
Existing explanation methods can be divided into three categories [
17]: (1) deep explanation—techniques adapted from deep learning used to learn explainable features; (2) interpretable models—techniques used to learn models that are interpretable; and (3) model induction—techniques used to infer an explainable model from any model as if it were a black box. These techniques are also known as model-agnostic or post hoc explanations. A black box model’s internals may either be exposed but uninterpretable by humans or unexposed [
18]. Post hoc interpretations often do not elucidate exactly how a model works; however, they can provide useful information to end users [
19]. Influence methods are a subcategory of post hoc explanations that quantify the contribution of each feature to a model’s predictions [
6]. Explanation methods that are part of this group, such as feature importance methods, estimate the importance of a feature by altering the input or internal components to assess the extent to which the changes affect the model’s decision.
In this work, we focus on feature importance-based explanations due to their relevance to anomaly explanations aimed at identifying the source of an anomaly. Feature-importance-based methods provide a magnitude and direction for each feature based on its contribution to a model prediction. Several feature-importance-based methods have been proposed over the last decade [
13,
14,
20,
21]. In this paper, the use of our data set and evaluation methodology is demonstrated with the following methods:
LIME. Local Interpretable Model-Agnostic Explanation (LIME) is a model-agnostic method for explaining a prediction, which uses a local model to approximate the original model [
14]. LIME refers to simplified inputs x’ as “interpretable inputs”, and the mapping of x’ to x converts a binary vector of interpretable inputs to the original input space.
SHAP. Lundberg and Lee [
13] proposed a unified framework for interpreting predictions called Shapley Additive exPlanation (SHAP), which combines six methods within the class of additive feature attribution methods. SHAP uses Shapley values from game theory [
22] to explain a particular prediction of a complex model by assigning each feature an importance value (SHAP value). In our research, we use the following two methods from the SHAP framework: Kernel SHAP, which is a model-agnostic explanation method that uses LIME [
14] and Shapley values to build a local explanation model, in which the local model is a weighted regression built using a background set from the data, and Sampling SHAP, which is an extension of the Shapley sampling values explanation algorithm proposed by [
23]. This method is similar to Kernel SHAP, but only samples from the background set are considered.
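To make the interfaces concrete, the sketch below shows how the three explainers can be invoked on a generic tabular model; the model, data, and parameter values are illustrative placeholders rather than the settings used later in this paper.

```python
# Minimal sketch of invoking the three explainers on a generic tabular model.
# `model_predict`, `X_background`, and `x_instance` are hypothetical placeholders,
# not the settings used in this paper.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X_background = rng.integers(0, 2, size=(100, 10)).astype(float)  # reference data
x_instance = X_background[0]                                      # instance to explain

def model_predict(X):
    # Placeholder black-box prediction function (e.g., one reconstructed output).
    return X.sum(axis=1)

# Kernel SHAP: weighted local regression over sampled feature coalitions.
kernel_explainer = shap.KernelExplainer(model_predict, X_background)
kernel_values = kernel_explainer.shap_values(x_instance, nsamples=500)

# Sampling SHAP: Shapley sampling values estimated over the background set.
sampling_explainer = shap.SamplingExplainer(model_predict, X_background)
sampling_values = sampling_explainer.shap_values(x_instance, nsamples=500)

# LIME: a local surrogate model fit on perturbed samples around the instance.
lime_explainer = LimeTabularExplainer(X_background, mode="regression")
lime_explanation = lime_explainer.explain_instance(x_instance, model_predict, num_features=10)
print(kernel_values, sampling_values, lime_explanation.as_list())
```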
2.2. Explanation Evaluation Metrics
While the amount of published research presenting explanation methods is growing, the field of evaluating explanations still lacks proper evaluation methodologies. There are two ways of evaluating the results of explanation methods: (1) evaluation that uses the ground truth to measure the accuracy or correctness of the explanation, and (2) evaluation that does not use the ground truth but rather measures other properties, such as consistency or robustness. Although evaluating explanations using the first approach is challenging due to the subjective nature of explanations and the rarity of a ground truth to compare against, evaluating explanation quality is important for realizing its benefits for end users in practical settings [
24]. Markus et al. [
25] state that evaluation methods have one of two purposes: the first is for comparing against available explanation methods, and the second is to determine whether the explanation achieves the defined objective. Doshi-Velez and Kim [
26] divide evaluation metrics into three groups: application-grounded, which involve real humans and real tasks; human-grounded, which involve real humans and simplified tasks; and functionally grounded, which involve no humans and proxy tasks. In this paper, we focus on the third group.
Various studies have proposed properties for evaluating all kinds of explanations. For example, Hoffman et al. [
27] defined key concepts for measuring the explanations of an AI system: the goodness of explanations, user satisfaction, users’ understanding of the AI system, the effect of curiosity on the search for explanations, user trust, and performance. Melis and Jaakkola [
28] used explicitness, faithfulness, and stability for evaluation. Gunning [
17] divided explanation effectiveness into five categories: mental model, task performance, trust assessment, correctability, and user satisfaction. Yang et al. [
24] proposed the following properties: generalizability, fidelity, persuasibility, robustness, capability, and certainty. Mohseni et al. [
29] presented a survey that maps between design goals for different XAI user groups and their respective evaluation methods. The evaluation measures in the survey include explanations’ usefulness and satisfaction, fidelity, task performance and user trust. Sokol and Flach [
30] suggested fact sheets to evaluate an explainability method, which include five dimensions: (1) functional requirements, (2) operational requirements, (3) usefulness from a user’s perspective, (4) security and privacy, and (5) validation using user studies or synthetic experiments. Many other studies have adopted such properties as their evaluation objective.
3. Related Work
In this section, we first review related studies that proposed evaluation methods using ground truth and methods that explain anomalies. Then, we present studies that proposed game theory techniques that are applied in our algorithm.
3.1. Explaining Anomalies
Kopp et al. [
2] suggested an approach for explaining an anomaly using a random forest classifier. Evaluation of the explanations was performed by measuring the change in the detector’s AUC. The tabular data sets used in their experiments [
31] were adapted for the purpose of anomaly detection. Haldar et al. [
32] proposed an algorithm that generates a diverse set of counterfactual explanations [
33,
34] for an anomaly identified by an autoencoder. The authors evaluated the explanations by validating that the new instances suggested as explanations belong to the normal class. Dang et al. [
35] proposed an algorithm that addresses both outlier detection and explanations. The algorithm uses a mathematical approach from spectral graph theory to learn an optimal subset in which an anomaly is well separated from normal objects.
Giurgiu and Schumann [
36] extended SHAP with influence weighting in order to explain anomalies detected from multivariate time series using a GRU-based autoencoder. Nguyen et al. [
37] proposed a framework to detect anomalies in network traffic using a variational autoencoder (VAE) and explain them using a gradient-based fingerprinting technique. They changed a feature of an anomalous instance and examined how it affected the model’s objective function. Explanations are evaluated by plotting the receiver operating characteristic curve (ROC). Takeishi [
38] compared Shapley values to the reconstruction error (RE) of features in principal component analysis (PCA) to explain anomalies. The authors changed a feature to make the instance anomalous and then compared the Shapley values to the RE. Amarasinghe et al. [
39] presented a framework for explaining anomalies detected using DNN. The framework provides the features that were relevant in making the prediction using Layer-Wise Relevance Propagation (LRP) [
40]. An evaluation of the relevant features was made by comparing the relevant features across different DNN models.
Liu et al. [
41] suggested a Contextual Outlier INterpretation (COIN) framework to explain anomalies detected using important features, the abnormality score, and the contrastive context of the anomaly. Takeishi and Kawahara [
42] proposed a method for anomaly interpretation via Shapley values. They evaluated the method on both real and synthetic anomalies generated by perturbing features in normal records. The ground truth was obtained using the known perturbed features. The authors compared the ground truth to the explanations in order to calculate metrics that indicate whether the interpretation is correct. While this is the only study that evaluated anomaly explanations using the ground truth, the authors’ method of perturbing features and identifying them as the ground truth is somewhat problematic, since it does not consider any relationships between the features. Creating anomalies based on a model, as was performed in our data set, addresses this issue.
Table 1 summarizes the recent methods for evaluating anomaly explanations. For more methods, refer to Yepmo et al. [
43], who provided an extensive review of the anomaly explanation field. Many methods were reported, but none of them included model-based ground truth explanations to evaluate the correctness of the explanations.
3.2. Evaluating Explanations Using Ground Truth
Tritscher et al. [
44] suggested a setting for evaluating XAI approaches, using binary synthetic data sets with ground truth explanations. The explanations are based on a relevance definition for features in Boolean functions. Features that were not used in the Boolean function acted as noise, although no analysis of the influence of the noise was reported. They considered a single explanation to be correct if the top-scoring features provided by the method matched the ground truth features. The evaluation fails to consider partial matches and does not differentiate between false positive and false negative errors. Yalcin et al. [
45] developed a method to quantitatively evaluate the correctness of XAI algorithms for binary classification by constructing data sets using language derived from a grammar and ground truth explanations using repeated application of production rules. Barr et al. [
46] provided a synthetic data generation method inspired by Yang and Kim [
47]. The method allows the generation of arbitrarily complex data designed for binary classification that utilizes symbolic expressions. The authors demonstrated their method using data sets with and without feature correlation and provided local attributions using SHAP. They added redundant features and observed the influence of noise on the SHAP values.
Guidotti [
48] proposed a ground-truth-based evaluation framework that focuses on evaluating the correctness of model-agnostic explanations. It includes several methods for generating synthetic transparent classifiers that are accompanied by synthetic ground truth explanations. The methods described above are not aimed specifically at anomalies and thus might not be suitable for evaluating anomaly explanations. Antwarg et al. [
8] created autoencoders for which the connections between the features are known and thus had a ground truth to explain the anomalies. Then, they created an artificial anomaly data set to examine whether their method uses the correct set of features to explain the anomalies.
Arras et al. [
49] developed a visual question-answering dataset containing questions and pixel-level ground truth masks that can be used to evaluate visual explanations. In Agarwal et al. [
50], a synthetic graph data generator is presented that can be used to generate benchmark datasets with varying graph sizes, degree distributions, etc., accompanied by ground truth explanations. The last two papers use ground truth explanations, but for other types of data and data representations than those used in our research.
The crucial difference between the above works and the evaluation methodology we propose is that we offer a unique data set with anomalies that is based on a real-world benchmark data set; most other works are based on synthetic data sets. Our data set captures both linear and nonlinear relationships between features to represent real-world scenarios. In addition, by padding the data with attribute noise, we allow evaluation of both the correctness and robustness of explanations.
3.3. Influence on Boolean Functions
The influence of a single vote on a decision made by a majority vote was first discovered by Penrose [
51] and was later re-introduced by Banzhaf III [
52] and Shapley and Shubik [
53] as the “power index”. Both methods are based on a technique from game theory applied to “simple games” and “weighted majority games” [
54]. According to their definition, an individual’s power in a decision is determined by the individual’s chance of becoming critical to the success of a winning coalition. The “power index” can be generalized as a definition of the influence of coordinate i in a Boolean function, since participating in a vote or game may result in two possible outcomes. O’Donnell [
55] defines the influence of coordinate $i$ on a Boolean function $f$ for an instance $x$ as the probability that $f(x) \neq f(x^{\oplus i})$, where $x^{\oplus i}$ denotes $x$ with the $i$-th bit flipped. We can apply this definition to determine that input $x_i$ should be considered influential on output $o$ if flipping the value of the $i$-th feature of input $x$ (the feature corresponding to $x_i$) results in changing the value of $o$. We extend these definitions to create local ground truth explanations (see Section 4.4).
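The sketch below illustrates both the global and the local (instance-level) versions of this definition on a toy Boolean function; the function and instance are illustrative and not drawn from the benchmark circuits.

```python
# Toy illustration of the influence definitions above (global and local).
from itertools import product

def f(x):
    # Example Boolean function: majority vote over three bits.
    return int(sum(x) >= 2)

def flip(x, i):
    # Return x with the i-th bit flipped.
    y = list(x)
    y[i] = 1 - y[i]
    return tuple(y)

def global_influence(f, i, n):
    # Pr_x[f(x) != f(x with bit i flipped)], taken over all 2^n inputs.
    inputs = list(product([0, 1], repeat=n))
    return sum(f(x) != f(flip(x, i)) for x in inputs) / len(inputs)

def locally_influential(f, x, i):
    # Coordinate i is influential on f at instance x if flipping it changes f(x).
    return f(x) != f(flip(x, i))

print(global_influence(f, 0, 3))              # 0.5 for the majority function
print(locally_influential(f, (1, 0, 0), 1))   # True: flipping bit 1 changes the majority
```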
4. Anomaly Data Set and Ground Truth Explanation Based Evaluation Methodology
In this section, we describe the proposed data set and methodology for evaluating the correctness and robustness of anomaly explanations, which is presented in
Figure 1.
4.1. Original Data Set
The data set proposed in this study is based on four digital circuits included in the ISCAS ’85 [
16] and 74x series benchmarks. ISCAS ’85 is an accepted benchmark data set that has been in wide use ever since being introduced at the International Symposium of Circuits and Systems in 1985. The original descriptions of the benchmark circuits were provided in netlist format, which does not include any functions or high-level designs; however, high-level models have been developed over the years [
56] to allow gate-level understanding.
We chose to include the four smallest circuits in the benchmark in our data set, since we wanted to enable other researchers to run experiments using the data set in a reasonable amount of time. The circuits are as follows: (1) C17 is the smallest circuit in the ISCAS benchmark, containing just six NAND gates; it implements a very simple two-output circuit with five inputs. (2) ’74283’ is a fast adder composed of three modules; it contains nine inputs and five outputs. (3) ’74182’ uses the carry-lookahead (CLA) realization of the carry function; it contains nine inputs and five outputs. (4) ’74181’ is a four-bit arithmetic logic unit (ALU) and function generator; this is the largest circuit of the four, containing 14 inputs and eight outputs. All of the digital circuits include different types of logic operators, both linear, such as AND, and nonlinear, such as XOR.
4.2. Generating Anomalies in the Data Set
To create a data set containing anomalies (
Table A1), we used .sys format files that were published in a diagnostic competition [
57]. Each circuit is represented by inputs $x$ and a series of logic operations that produce inner layers $z$ and the outputs $o$. A digital circuit, from a system’s perspective, may include faults leading to abnormal behavior. A system’s observed behavior that conflicts with its expected behavior is considered anomalous. Identifying the faulty system components that explain the anomaly is a diagnostic problem [
58].
In this work, we aim to detect the inputs that contribute to each anomaly rather than diagnosing the faulty components, i.e., the operator whose output is not as expected. To generate anomalies for each circuit, we replaced one logic operator at a time with its negated operator. The new behavior of that gate makes its functionality abnormal. We created four anomalous versions of each digital circuit by negating four logic operators in different locations in the circuit (an inner or final operator) to reflect a variety of anomaly complexities. For each version of the circuit, both original and anomalous, we created a truth table consisting of all $2^n$ input combinations (where $n$ is the number of inputs) with their inputs and outputs. We refer to each row in the truth table as an instance
R. To facilitate the robustness evaluation, we added attribute noise to each version of the circuit (see
Section 4.3). An instance from the modified (anomalous) truth table is considered anomalous if it differs from the corresponding instance in the original truth table. Refer to Appendix A for a complete list of the circuits with the anomalies, which includes the circuit’s name, the number of input and output nodes it has, the name of the altered gate, the altered operator, the attribute noise level, and the number of anomalous instances produced.
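As an illustration of this procedure (see also Example 1 below), the following sketch negates one gate of a C17-like circuit and marks the truth-table rows whose outputs change; the gate structure follows the standard C17 netlist, and the choice of which gate to negate is arbitrary.

```python
# Sketch of the anomaly-generation procedure: negate one logic gate and mark the
# truth-table rows whose outputs change. The gate structure below follows the
# standard C17 netlist and is used here only for illustration.
from itertools import product

def nand(a, b):
    return 1 - (a & b)

def c17(x, negate_gate=None):
    # Evaluate C17 for inputs x = (x1..x5); optionally negate one inner gate.
    x1, x2, x3, x4, x5 = x
    z1 = nand(x1, x3)
    z2 = nand(x3, x4)
    if negate_gate == "z2":          # replace NAND with its negation (AND)
        z2 = 1 - z2
    z3 = nand(x2, z2)
    z4 = nand(z2, x5)
    o1 = nand(z1, z3)
    o2 = nand(z3, z4)
    return (o1, o2)

def truth_table(negate_gate=None):
    return {x: c17(x, negate_gate) for x in product([0, 1], repeat=5)}

original = truth_table()
anomalous = truth_table(negate_gate="z2")

# An instance is anomalous if its outputs differ from the original truth table.
anomalies = [x for x in original if original[x] != anomalous[x]]
print(f"{len(anomalies)} anomalous instances out of {len(original)}")
```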
Example 1. To create an anomaly in a gate $z = (x_a \text{ NAND } x_b)$, we change the operator NAND to AND, so that $z = (x_a \text{ AND } x_b)$. Then, we create a modified truth table, where the inputs’ values remain the same (all $2^n$ combinations), but for some instances, depending on the altered operator, an output is different. Such instances are considered anomalous, since negating the logic operator results in changing at least one output. Figure 2 provides a comparison between the original and anomalous diagrams and truth tables for circuit C17.
4.3. Attribute Noise
The inputs of the digital circuits, which serve as features, were padded with uninformative features that play the role of attribute noise. Adding attribute noise to the data is a typical way of augmenting a data set to enrich it with more examples and consequently increase the model’s generalizability [
59]. However, if the model lacks robustness, adding noise could harm its performance [
60,
61]. The amount of attribute noise we added, selected in proportion to the number of features in the data, varied from zero redundant features (no noise) to six redundant features. By introducing controlled levels of noise, we assess the stability of the explanation methods under various conditions, aiming to improve the reliability of explanations in the presence of redundant features.
4.4. Creating Ground Truth
A local ground truth explanation is the reason why a model returned a certain prediction for a specific instance. It can be represented as the set of features that led the model to make such a prediction. A ground truth explanation is not easy to obtain. Ground truth explanations are useful for evaluating the correctness and robustness of explanations produced by an explanation method. The correctness of explanations can be examined by comparing the ground truth to the explanation method’s output. Robustness can be examined by verifying that the explanation includes no noise, meaning noisy features are not considered as part of the influential features.
In our setting, we explain the outputs influenced by the anomaly; thus, the ground truth explanation is represented as a set of inputs that contribute to the anomalous outputs. To generate local ground truth explanations, we adopt the concept of assigning influence to relevant features in Boolean functions [
55,
62], as described in
Section 3.3. Since our digital circuits are composed of multiple Boolean logic operators, we can extend this definition to assign influence to features for each explained instance. After we create truth tables for the original and anomalous versions of the circuit, we transfer the logic of the anomalous .sys file into a diagram. This diagram is created based on concepts proposed by Akers [
63] and Lee [
64] for building a binary decision diagram (BDD).
Figure 2 presents an original diagram of the C17 circuit, before changing any logic operators, and an anomalous version in which one of its gates has been modified. A diagram provides a means of identifying the outputs of the circuit for any given initialization of the inputs, meaning each row in the truth table can be represented by a diagram.
For a given observation (input and output values) and the output to explain, we use the diagram to find the set of influencing features according to Algorithm 1, where D represents the circuit diagram, O is the output we wish to explain, and R is the specific instance. Algorithm 1 finds the set of influencing features, i.e., the local ground truth, for one instance R in the truth table. The influencing set is reached by backtracking from O to the initial input nodes (denoted $x$), through inner nodes (denoted $z$) and output nodes (denoted $o$). Starting from O, we focus on one logic operator at a time and follow the definition of the influence of one coordinate on a Boolean function, where the Boolean function is the logic operator applied to a set of inputs to produce an output. We extend the definition to consider the dependency between features and to allow sets of features to be considered influential.
First, we use the diagram
D to calculate the value of each circuit node for the given instance
R (line 1). We propagate the known values of the inputs throughout the diagram to calculate the values of all nodes, including the inner nodes $z$. Then, we initialize a queue $Q$ (line 2) that will allow us to backtrack the nodes in the path from an output (prediction)
O to the bottom of the diagram where the original influencing features are found. We also initialize a list
I (line 3), which will contain the final influencing inputs. As long as the queue contains nodes, we extract the current node (line 6) and obtain the nodes that serve as inputs to that node (line 7), meaning they are the inputs to the logic operator producing this output. We create all combinations of subsets of the current output’s entered nodes (current inputs) to find the minimal subset of features that influence that node (line 9). The smallest subset includes individual features, and the largest includes all of the current inputs. We then examine the influence for every group of subsets in order, from the smallest to the largest (line 10). We aim to find a minimal subset to avoid redundancy. A subset is considered influential if flipping all of the features in the subset and feeding the logic operator with the flipped subset results in changing the output (lines 14–17). An influential node is included in the final list
I if it is an input node $x$, or in the queue $Q$ if it is an inner node $z$ (lines 19–22). If an influencing subset was found in a group of a certain size, the search does not proceed to larger subsets (lines 12–13). Finally, we return the list containing the influential inputs (line 24).
Algorithm 1 Generating Local Ground Truth
Input: Circuit diagram D, explained output O, instance R
Output: Influencing inputs list I
1: values = CalcValueForAllNodes(D, R)
2: Q = Queue()
3: I = List()
4: Q.enqueue(O)
5: while not Q.empty() do:
6:     currOut = Q.pop()
7:     currInputs = InputsForOutput(currOut)  ▹ get all the nodes that have an edge in the diagram leading to currOut
8:     nInputs = length(currInputs)
9:     subsetGroups = GetCombinations(currInputs, nInputs)  ▹ creates subsets of the current inputs with sizes from 1 to nInputs
10:    found = False
11:    for all group in subsetGroups do:
12:        if found is True then:
13:            break
14:        for all subset in group do:
15:            flippedInputs = FlipInputs(subset)
16:            newOut = CalcOut(currInputs, flippedInputs)
17:            if newOut ≠ values[currOut] then:
18:                for all node in subset do:
19:                    if node.isInputNode() then:
20:                        I.append(node)
21:                    else:
22:                        Q.enqueue(node)
23:                found = True
24: return I
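The following Python sketch mirrors Algorithm 1 under an assumed diagram representation in which each non-input node maps to its logic operator and the nodes feeding it; the function and variable names are illustrative rather than those of our implementation.

```python
# Python sketch of Algorithm 1. The diagram is assumed to be a dict that maps each
# non-input node to (logic operator, list of nodes feeding it); node values for the
# explained instance are assumed to be precomputed (line 1 of the algorithm).
from collections import deque
from itertools import combinations

def local_ground_truth(diagram, values, output, input_nodes):
    # Return the circuit inputs that influence `output` for the given node values.
    queue = deque([output])
    influencing_inputs = []
    while queue:
        curr_out = queue.popleft()
        op, curr_inputs = diagram[curr_out]              # nodes feeding curr_out
        found = False
        for size in range(1, len(curr_inputs) + 1):      # smallest subsets first
            if found:
                break                                    # stop before larger subsets
            for subset in combinations(curr_inputs, size):
                # Flip every node in the subset and re-evaluate the logic operator.
                flipped = {n: (1 - values[n]) if n in subset else values[n]
                           for n in curr_inputs}
                if op(*[flipped[n] for n in curr_inputs]) != values[curr_out]:
                    for node in subset:                  # subset is influential
                        if node in input_nodes:
                            influencing_inputs.append(node)
                        else:
                            queue.append(node)
                    found = True
    return influencing_inputs

# Example: anomalous C17-like structure in which z2 has been negated to AND.
nand = lambda a, b: 1 - (a & b)
and_ = lambda a, b: a & b
diagram = {"z1": (nand, ["x1", "x3"]), "z2": (and_, ["x3", "x4"]),
           "z3": (nand, ["x2", "z2"]), "z4": (nand, ["z2", "x5"]),
           "o1": (nand, ["z1", "z3"]), "o2": (nand, ["z3", "z4"])}
values = {"x1": 1, "x2": 0, "x3": 1, "x4": 1, "x5": 0}
for node, (op, ins) in diagram.items():                  # propagate node values
    values[node] = op(*[values[n] for n in ins])
print(local_ground_truth(diagram, values, "o1", {"x1", "x2", "x3", "x4", "x5"}))  # ['x1', 'x3']
```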
Example 2. Consider the anomalous version of circuit C17 (see Figure 2), in which one NAND gate has been negated to AND, and let the output to be explained be $o = (z_1 \text{ NAND } z_2)$, where $z_1$ and $z_2$ are inner nodes computed from the circuit inputs. We create the combinations $\{\{\{z_1\}, \{z_2\}\}, \{\{z_1, z_2\}\}\}$, which are the subsets of $o$’s entered nodes. We then check all subsets of the same size, starting from the smallest size (meaning $\{z_1\}$ and $\{z_2\}$). We flip the value of $z_1$ and examine whether the value of $o$ changes; this is carried out for $z_2$ as well, to conclude that only $\{z_1\}$ has an influence on the output. Next, we examine the branch leading to $\{z_1\}$ in the same manner: we create the combinations of subsets of $z_1$’s entered nodes and find that flipping the value of $\{x_1\}$ changes $z_1$, so it is considered influential. The final set $\{x_1\}$ is then returned by the algorithm as the local ground truth.
4.5. Evaluation Metrics
The evaluation methodology enables the evaluation of the correctness and robustness of local explanations. The explanation produced is a set of feature importance scores representing the contribution of each feature to the prediction. The set is sorted by descending absolute value and then compared to the ground truth explanation, considering not only the presence and absence of features but also their rank in the explanation.
Correctness. The evaluation utilizes three metrics, where the correctness of the explanation is reflected by a high metric value.
Mean Reciprocal Rank (MRR). This expresses the mean of the reciprocal ranks of the first relevant feature in the produced explanation across all explained instances. A relevant feature is a feature that appears in the local ground truth. MRR is defined as $\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$, where $\mathrm{rank}_i$ refers to the rank position of the first relevant feature in the $i$-th explanation. Here and in the metrics below, $N$ refers to the number of explanations.
Mean Average Precision (MAP). This expresses the mean of all average precision values across the explained instances. MAP is defined as $\mathrm{MAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i$, where $\mathrm{AP}_i$ is the average precision of the $i$-th explanation.
Mean R-Precision (MR-Precision). This expresses the mean of the precision value at the recall point across all explained instances. The recall point is determined by the length of the corresponding local ground truth. MR-Precision is defined as $\text{MR-Precision} = \frac{1}{N}\sum_{i=1}^{N} \frac{r_i}{R_i}$, where $r_i$ is the number of relevant features returned within the top $R_i$ positions of the $i$-th explanation and $R_i$ is the length of the corresponding local ground truth (the total number of relevant features).
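A minimal sketch of how these three correctness metrics can be computed from ranked explanations and local ground truth sets is given below; the explanations and ground truth values are toy examples.

```python
# Sketch of the three correctness metrics, given per-instance explanations (features
# ranked by descending absolute importance) and local ground truth feature sets.
import numpy as np

def reciprocal_rank(ranked, ground_truth):
    for pos, feat in enumerate(ranked, start=1):
        if feat in ground_truth:
            return 1.0 / pos
    return 0.0

def average_precision(ranked, ground_truth):
    hits, precisions = 0, []
    for pos, feat in enumerate(ranked, start=1):
        if feat in ground_truth:
            hits += 1
            precisions.append(hits / pos)   # precision at each relevant position
    return float(np.mean(precisions)) if precisions else 0.0

def r_precision(ranked, ground_truth):
    r = len(ground_truth)                   # recall point = ground truth length
    return len(set(ranked[:r]) & set(ground_truth)) / r

explanations = [["x3", "x1", "x7"], ["x2", "x5", "x1"]]   # toy ranked explanations
ground_truths = [{"x1", "x3"}, {"x1"}]                    # toy local ground truth

print("MRR:", np.mean([reciprocal_rank(e, g) for e, g in zip(explanations, ground_truths)]))
print("MAP:", np.mean([average_precision(e, g) for e, g in zip(explanations, ground_truths)]))
print("MR-Precision:", np.mean([r_precision(e, g) for e, g in zip(explanations, ground_truths)]))
```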
Robustness. Huber [65] defines robustness as the insensitivity to minor deviations from the expected behavior. In terms of machine learning, a model is considered more robust than another if it suffers less from the impact of noise. The robustness evaluation utilizes the Equalized Loss of Accuracy (ELA) metric suggested by Sáez et al. [66], which establishes the expected behavior of a model with noisy data. ELA takes into account the performance without noise ($A_0$) and the loss of accuracy ($100 - A_x$). The lower the ELA value, the more robust the model. $ELA_x$ is defined as $ELA_x = \frac{100 - A_x}{A_0}$, where $x$ is the level of noise, $A_x$ is the accuracy of the model with attribute noise level $x$, and $A_0$ is the accuracy of the model without noise. In our methodology, we use the R-precision instead of the accuracy metric used in the original work, since it represents the ideal output of the explanation method.
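A minimal sketch of the adapted ELA computation is shown below, assuming the formulation of Sáez et al. with performance expressed as a percentage and R-precision substituted for accuracy; the values are illustrative.

```python
# Sketch of the Equalized Loss of Accuracy (ELA) computation as adapted here,
# with R-precision (in percent) replacing the accuracy used in the original metric.
def ela(performance_with_noise, performance_without_noise):
    # ELA_x = (100 - A_x) / A_0, where performances are given as percentages.
    return (100.0 - performance_with_noise) / performance_without_noise

# Toy values: R-precision (%) without noise and with two attribute noise levels.
a0, a2, a4 = 78.0, 74.0, 71.0
print(ela(a2, a0), ela(a4, a0))   # lower ELA indicates a more robust explainer
```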
5. Experiments
In this section, we demonstrate the utilization of our anomaly data set and evaluation methodology. The case study presented involves the evaluation of an autoencoder-based anomaly detector, explained using local model-agnostic explanation methods. We conducted experiments to show how the local ground truth explanations are used to evaluate the correctness and robustness of the chosen explanation methods. Note that the experimental results relate to the specific settings used for this demonstration. Other settings can also be applied.
5.1. Anomaly Detector
We adapted a method of explaining anomalies revealed by an autoencoder presented by Antwarg et al. [
8]. Autoencoders are one of the most common approaches for outlier detection for cases where labels are not available [
67]. An autoencoder is an unsupervised neural network that represents normal data in a low dimension and reconstructs input data in the original dimension. Consequently, abnormal instances, which are not properly reconstructed, stand out [
68].
To apply the suggested method, we organized the instances in the data set to resemble an autoencoder, where the inputs and outputs follow the same structure. We created normal and anomalous instances by concatenating the inputs and outputs of the original truth table and the anomalous truth table, respectively. According to the method, we provided an explanation for output features that have a high reconstruction error. Since the features here are binary, we explained the outputs whose value differs from the reconstructed value (i.e., those with a nonzero reconstruction error). The model created to detect the anomalies is a custom model derived from the
base package of the Python
scikit-learn library [
69]. The model is a simplified version of an autoencoder-based anomaly detector, in which the
fit function creates a mapping between the original truth table and anomalous truth table (the truth table after modifying a circuit’s logic operator), which serves as the tabular data. The
predict function receives a truth table instance and returns the reconstructed instance.
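The sketch below outlines one way such a simplified detector can be implemented; the class and method names are illustrative and do not reproduce our exact implementation.

```python
# Sketch of the simplified detector described above: `fit` memorizes the mapping from
# the anomalous truth table to the original one, and `predict` returns the
# "reconstructed" (i.e., original) rows. Names are illustrative.
import numpy as np
from sklearn.base import BaseEstimator

class TruthTableReconstructor(BaseEstimator):
    def fit(self, X_anomalous, X_original):
        # Map each anomalous row (inputs + outputs) to its original counterpart.
        self.mapping_ = {tuple(a): np.asarray(o) for a, o in zip(X_anomalous, X_original)}
        return self

    def predict(self, X):
        # Anomalous instances differ from their reconstruction on the affected outputs.
        return np.vstack([self.mapping_[tuple(row)] for row in X])

# Toy example: two rows of concatenated inputs and outputs; the second output is flipped.
X_original = np.array([[0, 1, 1, 0], [1, 1, 0, 1]])
X_anomalous = np.array([[0, 1, 1, 0], [1, 1, 0, 0]])
model = TruthTableReconstructor().fit(X_anomalous, X_original)
errors = np.abs(X_anomalous - model.predict(X_anomalous))
print(errors)  # nonzero only for the flipped output of the second instance
```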
5.2. Explanation Methods
We used three model-agnostic explanation methods: Kernel SHAP, Sampling SHAP [
13], and LIME [
14] (respectively, the shap (
https://github.com/slundberg/shap/blob/master/shap/explainers, accessed on 25 September 2024) and lime (
https://github.com/marcotcr/lime/blob/master/lime, accessed on 25 September 2024) Python packages). For both SHAP methods, we set the number of samples used for coalitions of features (the nsamples parameter) to the default value for the C17 circuit and to 500 for the other circuits. This number was selected to avoid a long run time while still providing enough coalitions to approximate the Shapley values.
Background Set Tuning
All three explanation methods rely on a background set, which serves as a reference for building a local explanation model. The choice and design of this background set are crucial, as they can significantly influence both the accuracy and efficiency of the explanations. In LIME, the background set is used to perturb features by sampling from a standard normal distribution; the mean and standard deviation of the background set are employed for mean-centering and scaling features, allowing LIME to approximate how changes in feature values affect the model’s predictions. For SHAP, the background set provides a basis for approximating Shapley values. SHAP methods replace feature values with values from the background set to compute marginal contributions, simulating the absence of specific features and assessing their importance in the model’s decision. This approach aligns with the theoretical grounding of Shapley values in cooperative game theory, where the background set acts as the “coalition” of references.
The composition and size of the background set directly impact the fidelity of explanations. A larger background set allows for more accurate approximations of feature contributions, as it provides a richer representation of the data distribution. However, this increased accuracy comes at the cost of computational efficiency: larger background sets increase the number of model evaluations required to compute explanations, which can significantly slow down the explanation process. This trade-off between accuracy and computational efficiency is particularly relevant for complex models or large datasets. To explore this balance, we tested different background set proportions, ranging from 10% to 80% of the dataset: 0.1, 0.2, 0.4, 0.6, and 0.8. For instance, with a dataset of 100 instances and a proportion of 0.8, the background set consists of 80 instances. By varying the proportion, we aimed to identify the background set size that best balances explanation accuracy with computational feasibility. The selection of instances for the background set was carried out randomly using the Python NumPy library (https://numpy.org/), with a fixed random seed of 27 to ensure reproducibility.
Additionally, the diversity within the background set is an important consideration. A background set that accurately reflects the full range of data variability can lead to more reliable explanations, as it better captures the conditions under which features contribute to model predictions. Conversely, a background set with limited variability may lead to biased or incomplete explanations, as it may not represent the complete distribution of feature values. Therefore, careful sampling that includes representative examples across different data clusters can enhance explanation robustness.
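The sketch below illustrates how a background set proportion can be drawn with a fixed seed and passed to an explainer; the data, model, and use of NumPy’s default_rng are illustrative assumptions.

```python
# Sketch of background set selection: draw a fixed proportion of the data with a
# fixed seed and pass it to an explainer. `X` and `model_predict` are placeholders.
import numpy as np
import shap

def sample_background(X, proportion, seed=27):
    rng = np.random.default_rng(seed)
    n = int(len(X) * proportion)
    idx = rng.choice(len(X), size=n, replace=False)
    return X[idx]

X = np.random.default_rng(27).integers(0, 2, size=(100, 8)).astype(float)
model_predict = lambda data: data.sum(axis=1)

background = sample_background(X, proportion=0.4)          # 40 of 100 instances
explainer = shap.KernelExplainer(model_predict, background)
shap_values = explainer.shap_values(X[0], nsamples=500)
```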
In summary, the design of the background set involves balancing three key factors: (1) the size of the background set, which affects computational cost, (2) the diversity of the background set, which influences the accuracy of feature importance estimates, and (3) the relevance of the set to the data distribution, ensuring that it reflects the conditions under which explanations are required. Future work could further explore adaptive background set selection techniques, which dynamically adjust the set based on data characteristics to optimize both performance and computational efficiency.
5.3. Results
We explained the anomalous instances of each digital circuit with Kernel SHAP, Sampling SHAP, and LIME. The results were averaged across all four anomalous versions of each circuit. Each experiment was conducted several times with different background set proportions. For circuit 74181, which is the largest and most complex circuit, we tested all five background set proportions. For 74283, 74182, and C17, we tested background set proportions of {0.4, 0.6, 0.8}. The best proportion for each circuit was selected based on the results for each metric.
5.3.1. Correctness Evaluation
Table 2 reports the correctness of the explanations produced by each method, evaluated with the MRR, MAP, and MR-Precision metrics calculated based on the local ground truth explanations. The background proportions selected after tuning for 74181, 74182, 74283, and C17 were 0.2, 0.4, 0.6, and 0.6, respectively. We used the adjusted Friedman test to reject the null hypothesis that all methods have the same MRR measure. Using the post hoc Nemenyi test, we can conclude that LIME performs significantly worse than both SHAP methods. However, we could not reject the null hypothesis that Kernel SHAP and Sampling SHAP perform the same in terms of MRR. With respect to the MAP results, Kernel SHAP achieved the best performance. Overall, the MAP results seem to decrease as the circuit becomes more complex and more features are involved, meaning that the features’ rank becomes less accurate. As for the R-precision results, Kernel SHAP and Sampling SHAP achieved comparably high performance. The highest values are those of C17 (0.803 and 0.787, respectively), while the R-precision values of the other circuits are lower but stable. Specifically, LIME performed poorly on circuit 74181, which might be due to the complex relations between features and the large number of inputs and outputs in comparison to the smaller circuits.
5.3.2. Robustness Evaluation
Table 3 reports the robustness of the explanation methods evaluated with the ELA metric. The background proportions selected after tuning for circuits 74182, 74283, and C17 were 0.6, 0.4, and 0.4, respectively. The best (lowest) ELA values were achieved for circuit C17 and the worst (highest) for circuit 74283. The decrease in the ELA value as the attribute noise level increases in circuits 74182 and C17 indicates that Kernel SHAP and Sampling SHAP benefited when more noise was introduced in smaller data sets, as these circuits are smaller and less complex than circuits 74283 and 74181. In contrast, LIME seems to be affected by the noise in all cases. We used the adjusted Friedman test to reject the null hypothesis that all methods have the same robustness measures. Using the post hoc Nemenyi test, we can conclude that LIME performs significantly worse than both SHAP methods. However, we could not reject the null hypothesis that Kernel SHAP and Sampling SHAP yield similar robustness measures.
6. Discussion
This work presents a benchmark dataset for anomaly detection in digital circuits, complete with ground truth explanations that allow for a rigorous evaluation of explanation methods. By simulating a range of linear and nonlinear relationships among features, our dataset supports both correctness and robustness assessments of anomaly explanations.
Our evaluation methodology goes beyond correctness metrics to examine robustness, offering insights into the stability of explanations under attribute noise. Experimental results highlight that Kernel SHAP and Sampling SHAP consistently rank influential features effectively across different circuit complexities, while LIME’s performance is more variable, especially in noisy environments.
An important aspect to discuss is how to insert noise into the data. In this paper, we added attributes with noise, but robustness can be further evaluated by introducing perturbative and correlated noise specifically tailored for binary data. For binary features, perturbative noise can be applied by randomly flipping values, changing some 0s to 1s and vice versa. This type of noise simulates potential errors in data collection and tests the stability of explanation methods under minor disruptions in feature values. Correlated noise, on the other hand, introduces changes that either maintain or slightly alter dependencies between features. For instance, if two binary features often appear together, one could be selectively flipped to test if the explanation method can still capture the interdependence between them. By examining robustness under these additional types of noise, we gain deeper insights into the stability of explanations in real-world binary data scenarios.
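The following sketch illustrates these two noise types for binary data; the flip rates and the chosen feature pair are illustrative.

```python
# Sketch of the two additional noise types discussed above for binary data:
# random bit flips (perturbative noise) and selective flips of a feature that
# usually co-occurs with another (a simple form of correlated noise).
import numpy as np

def perturbative_noise(X, flip_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < flip_rate
    return np.where(mask, 1 - X, X)          # flip a random fraction of the bits

def correlated_noise(X, feat_a, feat_b, flip_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    rows = rng.random(len(X)) < flip_rate
    # Flip feature `feat_a` only where it co-occurs with `feat_b`, perturbing the dependency.
    co_occur = (X[:, feat_a] == 1) & (X[:, feat_b] == 1)
    X_noisy[rows & co_occur, feat_a] = 0
    return X_noisy

X = np.random.default_rng(1).integers(0, 2, size=(8, 4))
print(perturbative_noise(X))
print(correlated_noise(X, feat_a=0, feat_b=1))
```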
An essential aspect of this methodology is the selection of a well-tuned background set, which balances accuracy and computational efficiency. Our findings indicate that larger background sets generally enhance explanation reliability by reducing the influence of irrelevant features, but this comes with higher computational demands. The robustness and adaptability of explanation methods shown here are essential for real-world applications with diverse data quality.
7. Conclusions
In this paper, we addressed a key gap in the evaluation of anomaly explanation methods by presenting a benchmark dataset with ground truth explanations, specifically designed to support the rigorous assessment of correctness and robustness in anomaly detection. By leveraging digital circuit data, our dataset captures both linear and non-linear feature interactions, providing a realistic basis for testing the efficacy of various model-agnostic explanation methods. Additionally, we introduced a novel evaluation methodology that incorporates correctness and robustness metrics, enabling a structured, evidence-based approach to assess explanation quality. Our findings highlight that model-agnostic explanation methods, particularly Kernel SHAP and Sampling SHAP, perform reliably in identifying influential features across different anomaly scenarios, showing resilience to noise. In contrast, LIME’s sensitivity to noise emphasizes the need for robustness testing in practical applications. These results demonstrate that our methodology provides valuable insights into the stability and reliability of explanations, which is crucial for transparent AI systems in domains such as cybersecurity, industrial diagnostics, and beyond.
This work contributes to evaluation theory by adapting traditional metrics and introducing new, domain-specific measures tailored to the unique requirements of anomaly detection in AI. In doing so, it bridges the gap between traditional evaluation frameworks and the emerging needs of explainable AI (XAI), offering a foundation for future research that seeks to evaluate and improve transparency in AI-driven decision-making. Furthermore, our methodology aligns with evidence-based evaluation principles, providing a data-driven framework that can support informed decisions about the utility and reliability of anomaly explanations.
Looking forward, this research can be extended by developing similar datasets and evaluation methodologies for non-binary and more complex anomaly types, broadening the applicability of our framework to a wider range of real-world scenarios. By continuing to build on the intersection of evaluation theory and XAI, future work can contribute to a robust and transparent foundation for evaluating AI explanations, helping to establish standards that promote trust and accountability in AI systems.