1. Introduction
In recent years, machine and deep learning models have been proposed to address complex tasks, such as anomaly detection. Anomaly detection methods aim to find abnormal instances in data that deviate from the normal or expected behavior. These methods are useful for providing actionable information that can be critical to a system [
1,
2]. Anomaly detection serves a vital role in various domains and applications, including the military surveillance of enemy activity, intrusion detection in cyber-security, fraud detection in insurance, and fault detection in safety-critical systems [
3]. Despite the high accuracy achieved by these methods, their applicability in real-life settings is still considered limited due to their opaqueness and black box nature in the eyes of end users, stakeholders, and even researchers. The explainable artificial intelligence (XAI) field of research seeks to balance the trade-off between accuracy and interpretability. Explanations have the potential to not only provide transparency but also reveal hidden biases and errors in machine and deep learning models.
When an anomaly is detected, domain experts must understand the features contributing to its detection. Without a clear explanation, it is challenging to investigate and address the root cause effectively [
4,
5]. Trustworthy explanations are crucial, as they increase user confidence and can improve the performance of anomaly detectors [
6]. Correct explanations are necessary for identifying the source of an anomaly, as effective mitigation relies on fully understanding its origin [
7]. Ground truth explanations, which reveal the actual reasons behind model decisions, are essential for assessing the accuracy of explanation methods. These can be obtained either through costly expert labeling or algorithmically.
A limited number of studies have addressed anomaly explanations, and even fewer have concentrated on evaluating the explanations. Of those, only a few measured the explanations’ correctness. This is not surprising considering there is no consensus on what constitutes a proper explanation. This work focuses on the evaluation of the correctness and robustness of anomaly explanations using the ground truth. For this purpose, we propose a novel anomaly benchmark data set, based on digital circuits, with local ground truth explanations produced by an algorithm we developed based on game theory. We also present a methodology for evaluating explanations using quantitative metrics. To demonstrate the utilization of the data set and methodology, we adapted a method for explaining anomalies revealed by an autoencoder [
8]; the transition from an autoencoder to a digital circuit, through a decision tree, is legitimate in light of previous studies [
9,
10,
11,
12]. We evaluated three model-agnostic explanation methods, Kernel SHAP, Sampling SHAP [
13], and LIME [
14], by comparing the local explanations to the ground truth.
Since the approach of evaluating the correctness of explanations is in its infancy, we chose to create a binary data set as a basis. Binary data sets have been used for decades in various fields. One of the most well known data sets in the image processing field is MNIST, a large data set of handwritten digits, which was first introduced by LeCun et al. [
15] in 1998. This data set has been used as a worldwide machine learning benchmark for more than 2 decades, although the original images were black and white, i.e., binary. Our contribution is threefold:
(1) We provide a data set with labeled anomalies, based on a benchmark data set of digital circuits [16]; the data set captures both linear and nonlinear relationships between features to represent complex real-world scenarios, and the anomalies are created by modifying different gates in the circuits.
(2) The data set is accompanied by local ground truth explanations for the anomalies, generated using a game theory technique for assigning influence to relevant features in Boolean functions.
(3) We provide a methodology for evaluating the correctness and robustness of local explanations. The methodology utilizes several correctness and robustness metrics, which are calculated using the ground truth explanations.
We have made our data set with local ground truth explanations (
https://doi.org/10.7910/DVN/W4FPPN, accessed on 5 September 2021) and evaluation methodology code (
https://github.com/XAI-Lab/CREM, accessed on 1 December 2022) public for the use of other researchers. Although we use specific settings for the evaluation, the methodology can also be used with other anomaly detectors and explanation methods.
The rest of the paper is organized as follows: In
Section 2, we introduce the field of explanation methods and provide an overview of evaluation metrics for explanations.
Section 3 reviews related studies that proposed evaluation methods using ground truth, methods for explaining anomalies, and works revealing techniques we rely on in our algorithm. In
Section 4, we describe the proposed data set and evaluation methodology for explanations.
Section 5 demonstrates how the data set and methodology can be used to assess the correctness and robustness of local model-agnostic explainers. In
Section 6, we discuss our conclusions and future research directions.
2. Background
In this section, we introduce existing explanation methods and elaborate on commonly used metrics for evaluating explanations.
2.1. Explanation Methods
Existing explanation methods can be divided into three categories [
17]: (1) deep explanation—techniques adapted from deep learning used to learn explainable features; (2) interpretable models—techniques used to learn models that are interpretable; and (3) model induction—techniques used to infer an explainable model from any model as if it were a black box. These techniques are also known as model-agnostic or post hoc explanations. A black box model’s internals may either be exposed but uninterpretable by humans or unexposed [
18]. Post hoc interpretations often do not elucidate exactly how a model works; however, they can provide useful information to end users [
19]. Influence methods are a subcategory of post hoc explanations that quantify the contribution of each feature to a model’s predictions [
6]. Explanation methods that are part of this group, such as feature importance methods, estimate the importance of a feature by altering the input or internal components to assess the extent to which the changes affect the model’s decision.
In this work, we focus on feature importance-based explanations due to their relevance to anomaly explanations aimed at identifying the source of an anomaly. Feature-importance-based methods provide a magnitude and direction for each feature based on its contribution to a model prediction. Several feature-importance-based methods have been proposed over the last decade [
13,
14,
20,
21]. In this paper, the use of our data set and evaluation methodology is demonstrated with the following methods:
LIME. Local Interpretable Model-Agnostic Explanation (LIME) is a model-agnostic method for explaining a prediction, which uses a local model to approximate the original model [
14]. LIME refers to simplified inputs x’ as “interpretable inputs”, and the mapping of x’ to x converts a binary vector of interpretable inputs to the original input space.
SHAP. Lundberg and Lee [
13] proposed a unified framework for interpreting predictions called Shapley Additive exPlanation (SHAP), which combines six methods within the class of additive feature attribution methods. SHAP uses Shapley values from game theory [
22] to explain a particular prediction of a complex model by assigning each feature an importance value (SHAP value). In our research, we use the following two methods from the SHAP framework: Kernel SHAP, which is a model-agnostic explanation method that uses LIME [
14] and Shapley values to build a local explanation model, in which the local model is a weighted regression built using a background set from the data, and Sampling SHAP, which is an extension of the Shapley sampling values explanation algorithm proposed by [
23]. This method is similar to Kernel SHAP, but only samples from the background set are considered.
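To make the interfaces concrete, the sketch below shows how the three explainers can be invoked on a generic tabular model; the model, data, and parameter values are illustrative placeholders rather than the settings used later in this paper.

```python
# Minimal sketch of invoking the three explainers on a generic tabular model.
# `model_predict`, `X_background`, and `x_instance` are hypothetical placeholders,
# not the settings used in this paper.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X_background = rng.integers(0, 2, size=(100, 10)).astype(float)  # reference data
x_instance = X_background[0]                                      # instance to explain

def model_predict(X):
    # Placeholder black-box prediction function (e.g., one reconstructed output).
    return X.sum(axis=1)

# Kernel SHAP: weighted local regression over sampled feature coalitions.
kernel_explainer = shap.KernelExplainer(model_predict, X_background)
kernel_values = kernel_explainer.shap_values(x_instance, nsamples=500)

# Sampling SHAP: Shapley sampling values estimated over the background set.
sampling_explainer = shap.SamplingExplainer(model_predict, X_background)
sampling_values = sampling_explainer.shap_values(x_instance, nsamples=500)

# LIME: a local surrogate model fit on perturbed samples around the instance.
lime_explainer = LimeTabularExplainer(X_background, mode="regression")
lime_explanation = lime_explainer.explain_instance(x_instance, model_predict, num_features=10)
print(kernel_values, sampling_values, lime_explanation.as_list())
```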
2.2. Explanation Evaluation Metrics
While the amount of published research presenting explanation methods is growing, the field of evaluating explanations still lacks proper evaluation methodologies. There are two ways of evaluating the results of explanation methods: (1) evaluation that uses the ground truth to measure the accuracy or correctness of the explanation, and (2) evaluation that does not use the ground truth but rather measures other properties, such as consistency or robustness. Although evaluating explanations using the first approach is challenging due to the subjective nature of explanations and the rarity of a ground truth to compare against, evaluating explanation quality is important for realizing its benefits for end users in practical settings [
24]. Markus et al. [
25] state that evaluation methods have one of two purposes: the first is for comparing against available explanation methods, and the second is to determine whether the explanation achieves the defined objective. Doshi-Velez and Kim [
26] divide evaluation metrics into three groups: application-grounded, which involve real humans and real tasks; human-grounded, which involve real humans and simplified tasks; and functionally grounded, which involve no humans and proxy tasks. In this paper, we focus on the third group.
Various studies have proposed properties for evaluating all kinds of explanations. For example, Hoffman et al. [
27] defined key concepts for measuring the explanations of an AI system: the goodness of explanations, user satisfaction, users’ understanding of the AI system, the effect of curiosity on the search for explanations, user trust, and performance. Melis and Jaakkola [
28] used explicitness, faithfulness, and stability for evaluation. Gunning [
17] divided explanation effectiveness into five categories: mental model, task performance, trust assessment, correctability, and user satisfaction. Yang et al. [
24] proposed the following properties: generalizability, fidelity, persuasibility, robustness, capability, and certainty. Mohseni et al. [
29] presented a survey that maps between design goals for different XAI user groups and their respective evaluation methods. The evaluation measures in the survey include explanations’ usefulness and satisfaction, fidelity, task performance and user trust. Sokol and Flach [
30] suggested fact sheets to evaluate an explainability method, which include five dimensions: (1) functional requirements, (2) operational requirements, (3) usefulness from a user’s perspective, (4) security and privacy, and (5) validation using user studies or synthetic experiments. Many other studies have adopted such properties as their evaluation objective.
3. Related Work
In this section, we first review related studies that proposed evaluation methods using ground truth and methods that explain anomalies. Then, we present studies that proposed game theory techniques that are applied in our algorithm.
3.1. Explaining Anomalies
Kopp et al. [
2] suggested an approach for explaining an anomaly using a random forest classifier. Evaluation of the explanations was performed by measuring the change in the detector’s AUC. The tabular data sets used in their experiments [
31] were adapted for the purpose of anomaly detection. Haldar et al. [
32] proposed an algorithm that generates a diverse set of counterfactual explanations [
33,
34] for an anomaly identified by an autoencoder. The authors evaluated the explanations by validating that the new instances suggested as explanations belong to the normal class. Dang et al. [
35] proposed an algorithm that addresses both outlier detection and explanations. The algorithm uses a mathematical approach from spectral graph theory to learn an optimal subset in which an anomaly is well separated from normal objects.
Giurgiu and Schumann [
36] extended SHAP with influence weighting in order to explain anomalies detected from multivariate time series using a GRU-based autoencoder. Nguyen et al. [
37] proposed a framework to detect anomalies in network traffic using a variational autoencoder (VAE) and explain them using a gradient-based fingerprinting technique. They changed a feature of an anomalous instance and examined how it affected the model’s objective function. Explanations are evaluated by plotting the receiver operating characteristic curve (ROC). Takeishi [
38] compared Shapley values to the reconstruction error (RE) of features in principal component analysis (PCA) to explain anomalies. The authors changed a feature to make the instance anomalous and then compared the Shapley values to the RE. Amarasinghe et al. [
39] presented a framework for explaining anomalies detected using DNN. The framework provides the features that were relevant in making the prediction using Layer-Wise Relevance Propagation (LRP) [
40]. An evaluation of the relevant features was made by comparing the relevant features across different DNN models.
Liu et al. [
41] suggested a Contextual Outlier INterpretation (COIN) framework to explain anomalies detected using important features, the abnormality score, and the contrastive context of the anomaly. Takeishi and Kawahara [
42] proposed a method for anomaly interpretation via Shapley values. They evaluated the method on both real and synthetic anomalies generated by perturbing features in normal records. The ground truth was obtained using the known perturbed features. The authors compared the ground truth to the explanations in order to calculate metrics that indicate whether the interpretation is correct. While this is the only study that evaluated anomaly explanations using the ground truth, the authors’ method of perturbing features and identifying them as the ground truth is somewhat problematic, since it does not consider any relationships between the features. Creating anomalies based on a model, as was performed in our data set, addresses this issue.
Table 1 summarizes the recent methods for evaluating anomaly explanations. For more methods, refer to Yepmo et al. [
43], who provided an extensive review of the anomaly explanation field. Many methods were reported, but none of them included model-based ground truth explanations to evaluate the correctness of the explanations.
3.2. Evaluating Explanations Using Ground Truth
Tritscher et al. [
44] suggested a setting for evaluating XAI approaches, using binary synthetic data sets with ground truth explanations. The explanations are based on a relevance definition for features in Boolean functions. Features that were not used in the Boolean function acted as noise, although no analysis of the influence of the noise was reported. They considered a single explanation to be correct if the top-scoring features provided by the method matched the ground truth features. The evaluation fails to consider partial matches and does not differentiate between false positive and false negative errors. Yalcin et al. [
45] developed a method to quantitatively evaluate the correctness of XAI algorithms for binary classification by constructing data sets using language derived from a grammar and ground truth explanations using repeated application of production rules. Barr et al. [
46] provided a synthetic data generation method inspired by Yang and Kim [
47]. The method allows the generation of arbitrarily complex data designed for binary classification that utilizes symbolic expressions. The authors demonstrated their method using data sets with and without feature correlation and provided local attributions using SHAP. They added redundant features and observed the influence of noise on the SHAP values.
Guidotti [
48] proposed a ground-truth-based evaluation framework that focuses on evaluating the correctness of model-agnostic explanations. It includes several methods for generating synthetic transparent classifiers that are accompanied by synthetic ground truth explanations. The methods described above are not aimed specifically at anomalies and thus might not be suitable for evaluating anomaly explanations. Antwarg et al. [
8] created autoencoders for which the connections between the features are known and thus had a ground truth to explain the anomalies. Then, they created an artificial anomaly data set to examine whether their method uses the correct set of features to explain the anomalies.
Arras et al. [
49] developed a visual question-answering dataset containing questions and pixel-level ground truth masks that can be used to evaluate visual explanations. In Agarwal et al. [
50], a synthetic graph data generator is presented that can be used to generate benchmark datasets with varying graph sizes, degree distributions, etc., accompanied by ground truth explanations. The last two papers use ground truth explanations, but for other types of data and data representations than those used in our research.
The crucial difference between the above works and the evaluation methodology we propose is that we offer a unique data set with anomalies that is based on a real-world benchmark data set; most other works are based on synthetic data sets. Our data set captures both linear and nonlinear relationships between features to represent real-world scenarios. In addition, by padding the data with attribute noise, we allow evaluation of both the correctness and robustness of explanations.
3.3. Influence on Boolean Functions
The influence of a single vote on a decision made by a majority vote was first discovered by Penrose [
51] and was later re-introduced by Banzhaf III [
52] and Shapley and Shubik [
53] as the “power index”. Both methods are based on a technique from game theory applied to “simple games” and “weighted majority games” [
54]. According to their definition, an individual’s power in a decision is determined by the individual’s chance of becoming critical to the success of a winning coalition. The “power index” can be generalized as a definition of the influence of coordinate i in a Boolean function, since participating in a vote or game may result in two possible outcomes. O’Donnell [
55] defines the influence of coordinate $i$ on a Boolean function $f$ for an instance $x$ as the probability that $f(x) \neq f(x^{\oplus i})$, where $x^{\oplus i}$ denotes $x$ with the $i$-th bit flipped. We can apply this definition to determine that input $x_i$ should be considered influential on output $o$ if flipping the value of the $i$-th feature of input $x$ (the feature corresponding to $x_i$) results in changing the value of $o$. We extend these definitions to create local ground truth explanations (see Section 4.4).
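The sketch below illustrates both the global and the local (instance-level) versions of this definition on a toy Boolean function; the function and instance are illustrative and not drawn from the benchmark circuits.

```python
# Toy illustration of the influence definitions above (global and local).
from itertools import product

def f(x):
    # Example Boolean function: majority vote over three bits.
    return int(sum(x) >= 2)

def flip(x, i):
    # Return x with the i-th bit flipped.
    y = list(x)
    y[i] = 1 - y[i]
    return tuple(y)

def global_influence(f, i, n):
    # Pr_x[f(x) != f(x with bit i flipped)], taken over all 2^n inputs.
    inputs = list(product([0, 1], repeat=n))
    return sum(f(x) != f(flip(x, i)) for x in inputs) / len(inputs)

def locally_influential(f, x, i):
    # Coordinate i is influential on f at instance x if flipping it changes f(x).
    return f(x) != f(flip(x, i))

print(global_influence(f, 0, 3))              # 0.5 for the majority function
print(locally_influential(f, (1, 0, 0), 1))   # True: flipping bit 1 changes the majority
```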
4. Anomaly Data Set and Ground Truth Explanation Based Evaluation Methodology
In this section, we describe the proposed data set and methodology for evaluating the correctness and robustness of anomaly explanations, which is presented in
Figure 1.
4.1. Original Data Set
The data set proposed in this study is based on four digital circuits included in the ISCAS ’85 [
16] and 74x series benchmarks. ISCAS ’85 is an accepted benchmark data set that has been in wide use ever since being introduced at the International Symposium of Circuits and Systems in 1985. The original descriptions of the benchmark circuits were provided in netlist format, which does not include any functions or high-level designs; however, high-level models have been developed over the years [
56] to allow gate-level understanding.
We chose to include the four smallest circuits in the benchmark in our data set, since we wanted to enable other researchers to run experiments using the data set in a reasonable amount of time. The circuits are as follows: (1) C17 is the smallest circuit in the ISCAS benchmark, containing just six NAND gates; it implements a very simple two-output circuit with five inputs. (2) ’74283’ is a fast adder composed of three modules; it contains nine inputs and five outputs. (3) ’74182’ uses the carry-lookahead (CLA) realization of the carry function; it contains nine inputs and five outputs. (4) ’74181’ is a four-bit arithmetic logic unit (ALU) and function generator; this is the largest circuit of the four, containing 14 inputs and eight outputs. All of the digital circuits include different types of logic operators, both linear, such as AND, and nonlinear, such as XOR.
4.2. Generating Anomalies in the Data Set
To create a data set containing anomalies (
Table A1), we used .sys format files that were published in a diagnostic competition [
57]. Each circuit is represented by inputs $x$ and a series of logic operations that produce inner layers $z$ and the outputs $o$. A digital circuit, from a system’s perspective, may include faults leading to abnormal behavior. A system’s observed behavior that conflicts with its expected behavior is considered anomalous. Identifying the faulty system components that explain the anomaly is a diagnostic problem [
58].
In this work, we aim to detect the inputs that contribute to each anomaly rather than diagnosing the faulty components, i.e., the operator whose output is not as expected. To generate anomalies for each circuit, we replaced one logic operator at a time with its negated operator. The new behavior of that gate makes its functionality abnormal. We created four anomalous versions of each digital circuit by negating four logic operators in different locations in the circuit (an inner or final operator) to reflect a variety of anomaly complexities. For each version of the circuit, both original and anomalous, we created a truth table consisting of all $2^n$ input combinations (where $n$ is the number of inputs) with their inputs and outputs. We refer to each row in the truth table as an instance
R. To facilitate the robustness evaluation, we added attribute noise to each version of the circuit (see
Section 4.3). An instance from the modified (anomalous) truth table is considered anomalous if it differs from the corresponding instance in the original truth table. Refer to Appendix A for a complete list of the circuits with the anomalies, which includes the circuit’s name, the number of input and output nodes it has, the name of the altered gate, the altered operator, the attribute noise level, and the number of anomalous instances produced.
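As an illustration of this procedure (see also Example 1 below), the following sketch negates one gate of a C17-like circuit and marks the truth-table rows whose outputs change; the gate structure follows the standard C17 netlist, and the choice of which gate to negate is arbitrary.

```python
# Sketch of the anomaly-generation procedure: negate one logic gate and mark the
# truth-table rows whose outputs change. The gate structure below follows the
# standard C17 netlist and is used here only for illustration.
from itertools import product

def nand(a, b):
    return 1 - (a & b)

def c17(x, negate_gate=None):
    # Evaluate C17 for inputs x = (x1..x5); optionally negate one inner gate.
    x1, x2, x3, x4, x5 = x
    z1 = nand(x1, x3)
    z2 = nand(x3, x4)
    if negate_gate == "z2":          # replace NAND with its negation (AND)
        z2 = 1 - z2
    z3 = nand(x2, z2)
    z4 = nand(z2, x5)
    o1 = nand(z1, z3)
    o2 = nand(z3, z4)
    return (o1, o2)

def truth_table(negate_gate=None):
    return {x: c17(x, negate_gate) for x in product([0, 1], repeat=5)}

original = truth_table()
anomalous = truth_table(negate_gate="z2")

# An instance is anomalous if its outputs differ from the original truth table.
anomalies = [x for x in original if original[x] != anomalous[x]]
print(f"{len(anomalies)} anomalous instances out of {len(original)}")
```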
Example 1. To create an anomaly in a gate $z = (x_a \text{ NAND } x_b)$, we change the operator NAND to AND, so that $z = (x_a \text{ AND } x_b)$. Then, we create a modified truth table, where the inputs’ values remain the same (all $2^n$ combinations), but for some instances, depending on the altered operator, an output is different. Such instances are considered anomalous, since negating the logic operator results in changing at least one output. Figure 2 provides a comparison between the original and anomalous diagrams and truth tables for circuit C17.
4.3. Attribute Noise
The inputs of the digital circuits, which serve as features, were padded with uninformative features that play the role of attribute noise. Adding attribute noise to the data is a typical way of augmenting a data set to enrich it with more examples and consequently increase the model’s generalizability [
59]. However, if the model lacks robustness, adding noise could harm its performance [
60,
61]. The amount of attribute noise we added, selected in proportion to the number of features in the data, varied from zero redundant features (no noise) to six redundant features. By introducing controlled levels of noise, we assess the stability of the explanation methods under various conditions, aiming to improve the reliability of explanations in the presence of redundant features.
4.4. Creating Ground Truth
A local ground truth explanation is the reason why a model returned a certain prediction for a specific instance. It can be represented as the set of features that led the model to make such a prediction. A ground truth explanation is not easy to obtain. Ground truth explanations are useful for evaluating the correctness and robustness of explanations produced by an explanation method. The correctness of explanations can be examined by comparing the ground truth to the explanation method’s output. Robustness can be examined by verifying that the explanation includes no noise, meaning noisy features are not considered as part of the influential features.
In our setting, we explain the outputs influenced by the anomaly; thus, the ground truth explanation is represented as a set of inputs that contribute to the anomalous outputs. To generate local ground truth explanations, we adopt the concept of assigning influence to relevant features in Boolean functions [
55,
62], as described in
Section 3.3. Since our digital circuits are composed of multiple Boolean logic operators, we can extend this definition to assign influence to features for each explained instance. After we create truth tables for the original and anomalous versions of the circuit, we transfer the logic of the anomalous .sys file into a diagram. This diagram is created based on concepts proposed by Akers [
63] and Lee [
64] for building a binary decision diagram (BDD).
Figure 2 presents an original diagram of the C17 circuit, before changing any logic operators, and an anomalous version in which one of its gates has been modified. A diagram provides a means of identifying the outputs of the circuit for any given initialization of the inputs, meaning each row in the truth table can be represented by a diagram.
For a given observation (input and output values) and the output to explain, we use the diagram to find the set of influencing features according to Algorithm 1, where D represents the circuit diagram, O is the output we wish to explain, and R is the specific instance. Algorithm 1 finds the set of influencing features, i.e., the local ground truth, for one instance R in the truth table. The influencing set is reached by backtracking from O to the initial input nodes (denoted $x$), through inner nodes (denoted $z$) and output nodes (denoted $o$). Starting from O, we focus on one logic operator at a time and follow the definition of the influence of one coordinate on a Boolean function, where the Boolean function is the logic operator applied to a set of inputs to produce an output. We extend the definition to consider the dependency between features and to allow sets of features to be considered influential.
First, we use the diagram
D to calculate the value of each circuit node for the given instance
R (line 1). We propagate the known values of the inputs throughout the diagram to calculate the values of all nodes, including the inner nodes $z$. Then, we initialize a queue $Q$ (line 2) that will allow us to backtrack the nodes in the path from an output (prediction)
O to the bottom of the diagram where the original influencing features are found. We also initialize a list
I (line 3), which will contain the final influencing inputs. As long as the queue contains nodes, we extract the current node (line 6) and obtain the nodes that serve as inputs to that node (line 7), meaning they are the inputs to the logic operator producing this output. We create all combinations of subsets of the current output’s entered nodes (current inputs) to find the minimal subset of features that influence that node (line 9). The smallest subset includes individual features, and the largest includes all of the current inputs. We then examine the influence for every group of subsets in order, from the smallest to the largest (line 10). We aim to find a minimal subset to avoid redundancy. A subset is considered influential if flipping all of the features in the subset and feeding the logic operator with the flipped subset results in changing the output (lines 14–17). An influential node is included in the final list
I if it is an input node $x$, or in the queue $Q$ if it is an inner node $z$ (lines 19–22). If an influencing subset was found in a group of a certain size, the search does not proceed to larger subsets (lines 12–13). Finally, we return the list containing the influential inputs (line 24).
Algorithm 1 Generating Local Ground Truth
Input: Circuit diagram D, explained output O, instance R
Output: Influencing inputs list I
1: values = CalcValueForAllNodes(D, R)
2: Q = Queue()
3: I = List()
4: Q.enqueue(O)
5: while not Q.empty() do:
6:     currOut = Q.pop()
7:     currInputs = InputsForOutput(currOut)  ▹ get all the nodes that have an edge in the diagram leading to currOut
8:     nInputs = length(currInputs)
9:     subsetGroups = GetCombinations(currInputs, nInputs)  ▹ creates subsets of the current inputs with sizes from 1 to nInputs
10:    found = False
11:    for all group in subsetGroups do:
12:        if found is True then:
13:            break
14:        for all subset in group do:
15:            flippedInputs = FlipInputs(subset)
16:            newOut = CalcOut(currInputs, flippedInputs)
17:            if newOut ≠ values[currOut] then:
18:                for all node in subset do:
19:                    if node.isInputNode() then:
20:                        I.append(node)
21:                    else:
22:                        Q.enqueue(node)
23:                found = True
24: return I
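The following Python sketch mirrors Algorithm 1 under an assumed diagram representation in which each non-input node maps to its logic operator and the nodes feeding it; the function and variable names are illustrative rather than those of our implementation.

```python
# Python sketch of Algorithm 1. The diagram is assumed to be a dict that maps each
# non-input node to (logic operator, list of nodes feeding it); node values for the
# explained instance are assumed to be precomputed (line 1 of the algorithm).
from collections import deque
from itertools import combinations

def local_ground_truth(diagram, values, output, input_nodes):
    # Return the circuit inputs that influence `output` for the given node values.
    queue = deque([output])
    influencing_inputs = []
    while queue:
        curr_out = queue.popleft()
        op, curr_inputs = diagram[curr_out]              # nodes feeding curr_out
        found = False
        for size in range(1, len(curr_inputs) + 1):      # smallest subsets first
            if found:
                break                                    # stop before larger subsets
            for subset in combinations(curr_inputs, size):
                # Flip every node in the subset and re-evaluate the logic operator.
                flipped = {n: (1 - values[n]) if n in subset else values[n]
                           for n in curr_inputs}
                if op(*[flipped[n] for n in curr_inputs]) != values[curr_out]:
                    for node in subset:                  # subset is influential
                        if node in input_nodes:
                            influencing_inputs.append(node)
                        else:
                            queue.append(node)
                    found = True
    return influencing_inputs

# Example: anomalous C17-like structure in which z2 has been negated to AND.
nand = lambda a, b: 1 - (a & b)
and_ = lambda a, b: a & b
diagram = {"z1": (nand, ["x1", "x3"]), "z2": (and_, ["x3", "x4"]),
           "z3": (nand, ["x2", "z2"]), "z4": (nand, ["z2", "x5"]),
           "o1": (nand, ["z1", "z3"]), "o2": (nand, ["z3", "z4"])}
values = {"x1": 1, "x2": 0, "x3": 1, "x4": 1, "x5": 0}
for node, (op, ins) in diagram.items():                  # propagate node values
    values[node] = op(*[values[n] for n in ins])
print(local_ground_truth(diagram, values, "o1", {"x1", "x2", "x3", "x4", "x5"}))  # ['x1', 'x3']
```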
Example 2. Consider the anomalous version of circuit C17 (see Figure 2), in which one NAND gate has been negated to AND, and let the output to be explained be $o = (z_1 \text{ NAND } z_2)$, where $z_1$ and $z_2$ are inner nodes computed from the circuit inputs. We create the combinations $\{\{\{z_1\}, \{z_2\}\}, \{\{z_1, z_2\}\}\}$, which are the subsets of $o$’s entered nodes. We then check all subsets of the same size, starting from the smallest size (meaning $\{z_1\}$ and $\{z_2\}$). We flip the value of $z_1$ and examine whether the value of $o$ changes; this is carried out for $z_2$ as well, to conclude that only $\{z_1\}$ has an influence on the output. Next, we examine the branch leading to $\{z_1\}$ in the same manner: we create the combinations of subsets of $z_1$’s entered nodes and find that flipping the value of $\{x_1\}$ changes $z_1$, so it is considered influential. The final set $\{x_1\}$ is then returned by the algorithm as the local ground truth.
4.5. Evaluation Metrics
The evaluation methodology enables the evaluation of the correctness and robustness of local explanations. The explanation produced is a set of feature importance scores representing the contribution of each feature to the prediction. The set is sorted by descending absolute value and then compared to the ground truth explanation, considering not only the presence and absence of features but also their rank in the explanation.
Correctness. The evaluation utilizes three metrics, where the correctness of the explanation is reflected by a high metric value.
Mean Reciprocal Rank (MRR). This expresses the mean of the reciprocal ranks of the first relevant feature in the produced explanation across all explained instances. A relevant feature is a feature that appears in the local ground truth. MRR is defined as $\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$, where $\mathrm{rank}_i$ refers to the rank position of the first relevant feature in the $i$-th explanation. Here and in the metrics below, $N$ refers to the number of explanations.
Mean Average Precision (MAP). This expresses the mean of all average precision values across the explained instances. MAP is defined as $\mathrm{MAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i$, where $\mathrm{AP}_i$ is the average precision of the $i$-th explanation.
Mean R-Precision (MR-Precision). This expresses the mean of the precision value at the recall point across all explained instances. The recall point is determined by the length of the corresponding local ground truth. MR-Precision is defined as $\text{MR-Precision} = \frac{1}{N}\sum_{i=1}^{N} \frac{r_i}{R_i}$, where $r_i$ is the number of relevant features returned within the top $R_i$ positions of the $i$-th explanation and $R_i$ is the length of the corresponding local ground truth (the total number of relevant features).
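A minimal sketch of how these three correctness metrics can be computed from ranked explanations and local ground truth sets is given below; the explanations and ground truth values are toy examples.

```python
# Sketch of the three correctness metrics, given per-instance explanations (features
# ranked by descending absolute importance) and local ground truth feature sets.
import numpy as np

def reciprocal_rank(ranked, ground_truth):
    for pos, feat in enumerate(ranked, start=1):
        if feat in ground_truth:
            return 1.0 / pos
    return 0.0

def average_precision(ranked, ground_truth):
    hits, precisions = 0, []
    for pos, feat in enumerate(ranked, start=1):
        if feat in ground_truth:
            hits += 1
            precisions.append(hits / pos)   # precision at each relevant position
    return float(np.mean(precisions)) if precisions else 0.0

def r_precision(ranked, ground_truth):
    r = len(ground_truth)                   # recall point = ground truth length
    return len(set(ranked[:r]) & set(ground_truth)) / r

explanations = [["x3", "x1", "x7"], ["x2", "x5", "x1"]]   # toy ranked explanations
ground_truths = [{"x1", "x3"}, {"x1"}]                    # toy local ground truth

print("MRR:", np.mean([reciprocal_rank(e, g) for e, g in zip(explanations, ground_truths)]))
print("MAP:", np.mean([average_precision(e, g) for e, g in zip(explanations, ground_truths)]))
print("MR-Precision:", np.mean([r_precision(e, g) for e, g in zip(explanations, ground_truths)]))
```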
Robustness. Huber [65] defines robustness as the insensitivity to minor deviations from the expected behavior. In terms of machine learning, a model is considered more robust than another if it suffers less from the impact of noise. The robustness evaluation utilizes the Equalized Loss of Accuracy (ELA) metric suggested by Sáez et al. [66], which establishes the expected behavior of a model with noisy data. ELA takes into account the performance without noise ($A_0$) and the loss of accuracy ($100 - A_x$). The lower the ELA value, the more robust the model. $ELA_x$ is defined as $ELA_x = \frac{100 - A_x}{A_0}$, where $x$ is the level of noise, $A_x$ is the accuracy of the model with attribute noise level $x$, and $A_0$ is the accuracy of the model without noise. In our methodology, we use the R-precision instead of the accuracy metric used in the original work, since it represents the ideal output of the explanation method.
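A minimal sketch of the adapted ELA computation is shown below, assuming the formulation of Sáez et al. with performance expressed as a percentage and R-precision substituted for accuracy; the values are illustrative.

```python
# Sketch of the Equalized Loss of Accuracy (ELA) computation as adapted here,
# with R-precision (in percent) replacing the accuracy used in the original metric.
def ela(performance_with_noise, performance_without_noise):
    # ELA_x = (100 - A_x) / A_0, where performances are given as percentages.
    return (100.0 - performance_with_noise) / performance_without_noise

# Toy values: R-precision (%) without noise and with two attribute noise levels.
a0, a2, a4 = 78.0, 74.0, 71.0
print(ela(a2, a0), ela(a4, a0))   # lower ELA indicates a more robust explainer
```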
5. Experiments
In this section, we demonstrate the utilization of our anomaly data set and evaluation methodology. The case study presented involves the evaluation of an autoencoder-based anomaly detector, explained using local model-agnostic explanation methods. We conducted experiments to show how the local ground truth explanations are used to evaluate the correctness and robustness of the chosen explanation methods. Note that the experimental results relate to the specific settings used for this demonstration. Other settings can also be applied.
5.1. Anomaly Detector
We adapted a method of explaining anomalies revealed by an autoencoder presented by Antwarg et al. [
8]. Autoencoders are one of the most common approaches for outlier detection for cases where labels are not available [
67]. An autoencoder is an unsupervised neural network that represents normal data in a low dimension and reconstructs input data in the original dimension. Consequently, abnormal instances, which are not properly reconstructed, stand out [
68].
To apply the suggested method, we organized the instances in the data set to resemble an autoencoder, where the inputs and outputs follow the same structure. We created normal and anomalous instances by concatenating the inputs and outputs of the original truth table and the anomalous truth table, respectively. According to the method, we provided an explanation for output features that have a high reconstruction error. Since the features here are binary, we explained the outputs whose value differs from the reconstructed value (i.e., those with a nonzero reconstruction error). The model created to detect the anomalies is a custom model derived from the
base package of the Python
scikit-learn library [
69]. The model is a simplified version of an autoencoder-based anomaly detector, in which the
fit function creates a mapping between the original truth table and anomalous truth table (the truth table after modifying a circuit’s logic operator), which serves as the tabular data. The
predict function receives a truth table instance and returns the reconstructed instance.
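The sketch below outlines one way such a simplified detector can be implemented; the class and method names are illustrative and do not reproduce our exact implementation.

```python
# Sketch of the simplified detector described above: `fit` memorizes the mapping from
# the anomalous truth table to the original one, and `predict` returns the
# "reconstructed" (i.e., original) rows. Names are illustrative.
import numpy as np
from sklearn.base import BaseEstimator

class TruthTableReconstructor(BaseEstimator):
    def fit(self, X_anomalous, X_original):
        # Map each anomalous row (inputs + outputs) to its original counterpart.
        self.mapping_ = {tuple(a): np.asarray(o) for a, o in zip(X_anomalous, X_original)}
        return self

    def predict(self, X):
        # Anomalous instances differ from their reconstruction on the affected outputs.
        return np.vstack([self.mapping_[tuple(row)] for row in X])

# Toy example: two rows of concatenated inputs and outputs; the second output is flipped.
X_original = np.array([[0, 1, 1, 0], [1, 1, 0, 1]])
X_anomalous = np.array([[0, 1, 1, 0], [1, 1, 0, 0]])
model = TruthTableReconstructor().fit(X_anomalous, X_original)
errors = np.abs(X_anomalous - model.predict(X_anomalous))
print(errors)  # nonzero only for the flipped output of the second instance
```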
5.2. Explanation Methods
We used three model-agnostic explanation methods: Kernel SHAP, Sampling SHAP [
13], and LIME [
14] (respectively, the shap (
https://github.com/slundberg/shap/blob/master/shap/explainers, accessed on 25 September 2024) and lime (
https://github.com/marcotcr/lime/blob/master/lime, accessed on 25 September 2024) Python packages). For both SHAP methods, we set the number of samples used for coalitions of features (the nsamples parameter) to the default value for the C17 circuit and to 500 for the other circuits. This number was selected to avoid a long run time while still providing enough coalitions to approximate the Shapley values.
Background Set Tuning
All three explanation methods rely on a background set, which serves as a reference for building a local explanation model. The choice and design of this background set are crucial, as they can significantly influence both the accuracy and efficiency of the explanations. In LIME, the background set is used to perturb features by sampling from a standard normal distribution; the mean and standard deviation of the background set are employed for mean-centering and scaling features, allowing LIME to approximate how changes in feature values affect the model’s predictions. For SHAP, the background set provides a basis for approximating Shapley values. SHAP methods replace feature values with values from the background set to compute marginal contributions, simulating the absence of specific features and assessing their importance in the model’s decision. This approach aligns with the theoretical grounding of Shapley values in cooperative game theory, where the background set acts as the “coalition” of references.
The composition and size of the background set directly impact the fidelity of explanations. A larger background set allows for more accurate approximations of feature contributions, as it provides a richer representation of the data distribution. However, this increased accuracy comes at the cost of computational efficiency: larger background sets increase the number of model evaluations required to compute explanations, which can significantly slow down the explanation process. This trade-off between accuracy and computational efficiency is particularly relevant for complex models or large datasets. To explore this balance, we tested different background set proportions, ranging from 10% to 80% of the dataset: 0.1, 0.2, 0.4, 0.6, and 0.8. For instance, with a dataset of 100 instances and a proportion of 0.8, the background set consists of 80 instances. By varying the proportion, we aimed to identify the background set size that best balances explanation accuracy with computational feasibility. The selection of instances for the background set was carried out randomly using the Python NumPy library (https://numpy.org/), with a fixed random seed of 27 to ensure reproducibility.
Additionally, the diversity within the background set is an important consideration. A background set that accurately reflects the full range of data variability can lead to more reliable explanations, as it better captures the conditions under which features contribute to model predictions. Conversely, a background set with limited variability may lead to biased or incomplete explanations, as it may not represent the complete distribution of feature values. Therefore, careful sampling that includes representative examples across different data clusters can enhance explanation robustness.
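The sketch below illustrates how a background set proportion can be drawn with a fixed seed and passed to an explainer; the data, model, and use of NumPy’s default_rng are illustrative assumptions.

```python
# Sketch of background set selection: draw a fixed proportion of the data with a
# fixed seed and pass it to an explainer. `X` and `model_predict` are placeholders.
import numpy as np
import shap

def sample_background(X, proportion, seed=27):
    rng = np.random.default_rng(seed)
    n = int(len(X) * proportion)
    idx = rng.choice(len(X), size=n, replace=False)
    return X[idx]

X = np.random.default_rng(27).integers(0, 2, size=(100, 8)).astype(float)
model_predict = lambda data: data.sum(axis=1)

background = sample_background(X, proportion=0.4)          # 40 of 100 instances
explainer = shap.KernelExplainer(model_predict, background)
shap_values = explainer.shap_values(X[0], nsamples=500)
```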
In summary, the design of the background set involves balancing three key factors: (1) the size of the background set, which affects computational cost, (2) the diversity of the background set, which influences the accuracy of feature importance estimates, and (3) the relevance of the set to the data distribution, ensuring that it reflects the conditions under which explanations are required. Future work could further explore adaptive background set selection techniques, which dynamically adjust the set based on data characteristics to optimize both performance and computational efficiency.
5.3. Results
We explained the anomalous instances of each digital circuit with Kernel SHAP, Sampling SHAP, and LIME. The results were averaged across all four anomalous versions of each circuit. Each experiment was conducted several times with different background set proportions. For circuit 74181, which is the largest and most complex circuit, we tested all five background set proportions. For 74283, 74182, and C17, we tested background set proportions of {0.4, 0.6, 0.8}. The best proportion for each circuit was selected based on the results for each metric.
5.3.1. Correctness Evaluation
Table 2 reports the correctness of the explanations produced by each method, evaluated with the MRR, MAP, and MR-Precision metrics calculated based on the local ground truth explanations. The background proportions selected after tuning for 74181, 74182, 74283, and C17 were 0.2, 0.4, 0.6, and 0.6, respectively. We used the adjusted Friedman test to reject the null hypothesis that all methods have the same MRR measure. Using the post hoc Nemenyi test, we can conclude that LIME performs significantly worse than both SHAP methods. However, we could not reject the null hypothesis that Kernel SHAP and Sampling SHAP perform the same in terms of MRR. With respect to the MAP results, Kernel SHAP achieved the best performance. Overall, the MAP results seem to decrease as the circuit becomes more complex and more features are involved, meaning that the features’ rank becomes less accurate. As for the R-precision results, Kernel SHAP and Sampling SHAP achieved comparably high performance. The highest values are those of C17 (0.803 and 0.787, respectively), while the R-precision values of the other circuits are lower but stable. Specifically, LIME performed poorly on circuit 74181, which might be due to the complex relations between features and the large number of inputs and outputs in comparison to the smaller circuits.
5.3.2. Robustness Evaluation
Table 3 reports the robustness of the explanation methods evaluated with the ELA metric. The background proportions selected after tuning for circuits 74182, 74283, and C17 were 0.6, 0.4, and 0.4, respectively. The best (lowest) ELA values were achieved for circuit C17 and the worst (highest) for circuit 74283. The decrease in the ELA value as the attribute noise level increases in circuits 74182 and C17 indicates that Kernel SHAP and Sampling SHAP benefited when more noise was introduced in smaller data sets, as these circuits are smaller and less complex than circuits 74283 and 74181. In contrast, LIME seems to be affected by the noise in all cases. We used the adjusted Friedman test to reject the null hypothesis that all methods have the same robustness measures. Using the post hoc Nemenyi test, we can conclude that LIME performs significantly worse than both SHAP methods. However, we could not reject the null hypothesis that Kernel SHAP and Sampling SHAP yield similar robustness measures.
6. Discussion
This work presents a benchmark dataset for anomaly detection in digital circuits, complete with ground truth explanations that allow for a rigorous evaluation of explanation methods. By simulating a range of linear and nonlinear relationships among features, our dataset supports both correctness and robustness assessments of anomaly explanations.
Our evaluation methodology goes beyond correctness metrics to examine robustness, offering insights into the stability of explanations under attribute noise. Experimental results highlight that Kernel SHAP and Sampling SHAP consistently rank influential features effectively across different circuit complexities, while LIME’s performance is more variable, especially in noisy environments.
An important aspect to discuss is how to insert noise into the data. In this paper, we added attributes with noise, but robustness can be further evaluated by introducing perturbative and correlated noise specifically tailored for binary data. For binary features, perturbative noise can be applied by randomly flipping values, changing some 0s to 1s and vice versa. This type of noise simulates potential errors in data collection and tests the stability of explanation methods under minor disruptions in feature values. Correlated noise, on the other hand, introduces changes that either maintain or slightly alter dependencies between features. For instance, if two binary features often appear together, one could be selectively flipped to test if the explanation method can still capture the interdependence between them. By examining robustness under these additional types of noise, we gain deeper insights into the stability of explanations in real-world binary data scenarios.
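The following sketch illustrates these two noise types for binary data; the flip rates and the chosen feature pair are illustrative.

```python
# Sketch of the two additional noise types discussed above for binary data:
# random bit flips (perturbative noise) and selective flips of a feature that
# usually co-occurs with another (a simple form of correlated noise).
import numpy as np

def perturbative_noise(X, flip_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < flip_rate
    return np.where(mask, 1 - X, X)          # flip a random fraction of the bits

def correlated_noise(X, feat_a, feat_b, flip_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    rows = rng.random(len(X)) < flip_rate
    # Flip feature `feat_a` only where it co-occurs with `feat_b`, perturbing the dependency.
    co_occur = (X[:, feat_a] == 1) & (X[:, feat_b] == 1)
    X_noisy[rows & co_occur, feat_a] = 0
    return X_noisy

X = np.random.default_rng(1).integers(0, 2, size=(8, 4))
print(perturbative_noise(X))
print(correlated_noise(X, feat_a=0, feat_b=1))
```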
An essential aspect of this methodology is the selection of a well-tuned background set, which balances accuracy and computational efficiency. Our findings indicate that larger background sets generally enhance explanation reliability by reducing the influence of irrelevant features, but this comes with higher computational demands. The robustness and adaptability of explanation methods shown here are essential for real-world applications with diverse data quality.
7. Conclusions
In this paper, we addressed a key gap in the evaluation of anomaly explanation methods by presenting a benchmark dataset with ground truth explanations, specifically designed to support the rigorous assessment of correctness and robustness in anomaly detection. By leveraging digital circuit data, our dataset captures both linear and non-linear feature interactions, providing a realistic basis for testing the efficacy of various model-agnostic explanation methods. Additionally, we introduced a novel evaluation methodology that incorporates correctness and robustness metrics, enabling a structured, evidence-based approach to assess explanation quality. Our findings highlight that model-agnostic explanation methods, particularly Kernel SHAP and Sampling SHAP, perform reliably in identifying influential features across different anomaly scenarios, showing resilience to noise. In contrast, LIME’s sensitivity to noise emphasizes the need for robustness testing in practical applications. These results demonstrate that our methodology provides valuable insights into the stability and reliability of explanations, which is crucial for transparent AI systems in domains such as cybersecurity, industrial diagnostics, and beyond.
This work contributes to evaluation theory by adapting traditional metrics and introducing new, domain-specific measures tailored to the unique requirements of anomaly detection in AI. In doing so, it bridges the gap between traditional evaluation frameworks and the emerging needs of explainable AI (XAI), offering a foundation for future research that seeks to evaluate and improve transparency in AI-driven decision-making. Furthermore, our methodology aligns with evidence-based evaluation principles, providing a data-driven framework that can support informed decisions about the utility and reliability of anomaly explanations.
Looking forward, this research can be extended by developing similar datasets and evaluation methodologies for non-binary and more complex anomaly types, broadening the applicability of our framework to a wider range of real-world scenarios. By continuing to build on the intersection of evaluation theory and XAI, future work can contribute to a robust and transparent foundation for evaluating AI explanations, helping to establish standards that promote trust and accountability in AI systems.