The present paper concerns specification mining for temporal data, where properties to be verified are expressed in linear temporal logic for finite traces (LTL$_f$) or its extensions. The LTL$_f$ formalism [3] extends modal logic to include linear, discrete, and future-oriented time, where each event represents a single relevant instant of time. LTL$_f$ quantifies only over events reported in the trace, so each event of interest is fully observable, as the events represented in each trace are totally ordered. The syntax of a well-formed LTL$_f$ formula $\varphi$ is defined as follows:

$$\varphi ::= a \mid \neg\varphi \mid \varphi \wedge \varphi' \mid \varphi \vee \varphi' \mid \varphi \Rightarrow \varphi' \mid \bigcirc\varphi \mid \Box\varphi \mid \Diamond\varphi \mid \varphi\,\mathcal{U}\,\varphi'$$

where $a$ represents an atomic proposition, $\bigcirc\varphi$ (next) determines that the condition $\varphi$ must hold from the subsequent temporal step, $\Box\varphi$ (globally) requires that such a condition must always hold from the current instant of time, $\Diamond\varphi$ (future) specifies that the current condition must happen at any time in the future, and $\varphi\,\mathcal{U}\,\varphi'$ (until) shows that $\varphi$ must hold until the first occurrence of $\varphi'$ does. The remaining operators are interpreted as classical logical connectives, where the implication connective $\varphi \Rightarrow \varphi'$ can be interpreted as $\neg\varphi \vee \varphi'$ [30]. The semantics are usually defined in terms of first-order logic [3] for a given trace $\sigma$ at a current time $j$ (e.g., $\sigma, j \models a$ for an event satisfying $a$).
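To ground these operators, the following is a minimal Python sketch of an LTL$_f$ evaluator over finite traces (our illustration, not the cited works' implementation); the tuple-based formula encoding and the holds function are assumptions made for the example.

```python
# Minimal LTLf evaluator over finite traces (a sketch). A trace is a list
# of events; here, each event is a set of atomic propositions holding at
# that instant. Formulas are nested tuples, e.g., ("atom", "A").

def holds(phi, trace, j=0):
    """Check trace, j |= phi by direct recursion on the LTLf semantics."""
    op = phi[0]
    if op == "atom":                       # a: atomic proposition
        return j < len(trace) and phi[1] in trace[j]
    if op == "not":
        return not holds(phi[1], trace, j)
    if op == "and":
        return holds(phi[1], trace, j) and holds(phi[2], trace, j)
    if op == "or":
        return holds(phi[1], trace, j) or holds(phi[2], trace, j)
    if op == "next":                       # O(phi): phi holds at j+1
        return j + 1 < len(trace) and holds(phi[1], trace, j + 1)
    if op == "globally":                   # []phi: phi holds from j onwards
        return all(holds(phi[1], trace, k) for k in range(j, len(trace)))
    if op == "future":                     # <>phi: phi holds at some k >= j
        return any(holds(phi[1], trace, k) for k in range(j, len(trace)))
    if op == "until":                      # phi U psi: psi eventually holds,
        return any(holds(phi[2], trace, k) and    # phi holding until then
                   all(holds(phi[1], trace, h) for h in range(j, k))
                   for k in range(j, len(trace)))
    raise ValueError(f"unknown operator {op}")

# Response(A, B) in LTLf: [](A -> <>B), with the implication rewritten
# as !A or <>B, per the classical-connective reading above.
response = ("globally", ("or", ("not", ("atom", "A")),
                         ("future", ("atom", "B"))))
print(holds(response, [{"A"}, {"C"}, {"B"}]))   # True
print(holds(response, [{"A"}, {"C"}]))          # False
```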
2.3.1. Formal Temporal Verification
In the context of LTL$_f$ and temporal data, a counterexample of temporal verification is usually expressed as a non-empty edit sequence for a specific trace invalidating $\varphi$, thus showing which events have to be removed or added to make a trace compliant with $\varphi$ [31]. On the other hand, the certificate of correctness associated with traces often reflects the association between the atomic propositions in the formula and the events satisfying such conditions [2,30]; these are often referred to as activation, target, and correlation conditions (see the following paragraph). As the present paper focuses on positive specification mining, we are only going to consider the latter aspect.
‘Declare’ [12] expresses properties as conjunctions of clauses providing a specification (or model) $\mathcal{M}$, where each clause is an instantiation of a specific temporal template. A Declare template differs from customary association rules, as the latter enforce no temporal occurrence pattern between the conditions at the head and the tail of the rule. Declare can be assessed over a finite multiset of finite traces.
Templates express abstract parameterised temporal properties (first column of Table 1), whose semantics can be expressed in terms of LTL$_f$ (third column). The provided definitions align with the prevailing literature’s interpretations, subject to $\alpha$-renaming (Section 6.4). We assert that a trace $\sigma$ satisfies a Declare clause $c$ if it meets the associated semantic representation in LTL$_f$; we denote this as $\sigma \models c$. We also state that a trace $\sigma$ conforms to a (conjunctive) Declare specification (or model) $\mathcal{M}$ if the trace satisfies each of the clauses composing the aforementioned specification, i.e., $\sigma \models \bigwedge_{c \in \mathcal{M}} c$. This verification process is also referred to as conformance checking in the business process management literature. We also state that an LTL$_f$ formula $\varphi$ implies another formula $\varphi'$ if all the traces satisfying the former also satisfy the latter. We denote this as $\varphi \models \varphi'$, where we state that $\varphi'$ is more general than $\varphi$ by interpreting the coverage set as the set of all traces in the log satisfying a given declarative clause (Lemma A3).
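As a concrete illustration of conformance checking, the following Python sketch (ours; the clause encodings are hypothetical) tests a trace, rendered as a string of single-character activity labels as in Example 1 below, against a conjunctive Declare model.

```python
# Conformance checking against a conjunctive Declare specification:
# a trace conforms to a model M iff it satisfies every clause in M.
# Clauses are encoded as Python predicates over traces.

def response(a, b):
    """Response(a, b): every a is eventually followed by a b."""
    return lambda t: all(b in t[i + 1:] for i, e in enumerate(t) if e == a)

def precedence(a, b):
    """Precedence(a, b): no b occurs before the first a."""
    return lambda t: b not in t[:t.index(a)] if a in t else b not in t

def conforms(trace, model):
    return all(clause(trace) for clause in model)

model = [response("A", "B"), precedence("A", "C")]
print(conforms("ACB", model))   # True: the A is followed by B, no C before A
print(conforms("CAB", model))   # False: a C precedes the first A
```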
The implication notion can also be extended to declarative clauses in terms of their associated LTL$_f$ semantics; Figure 2 represents the entailment lattice of such declarative clauses, expressed as a Hasse diagram [32] depicting the relationship among such clauses under the assumption of non-zero confidence (Section 6.1).
If a Declare specification contains one clause that predicates over event payloads as an activation/target/correlation condition, we refer to it as a data-aware specification; otherwise, we refer to it as dataless. We state that a log is dataless if each event has an empty payload; otherwise, it is data-aware. For data-aware specifications, we can identify correlation conditions denoting disjunction conditions across target/activation conditions, as well as correlations between them. In this paper, we only consider dataless specifications and logs; thus, we omit the true data predicates associated with the activation/target conditions and only provide the activity labels. Therefore, throughout the rest of the paper, we denote each event by only its associated activity label A; when all activity labels are rendered as single characters, we might represent each trace as a string of activity labels.
Example 1. Our current logs of interest are dataless; thus, having events that carry an empty payload $\emptyset$, we use ABC as the shorthand for $\langle A,\emptyset\rangle\langle B,\emptyset\rangle\langle C,\emptyset\rangle$.
Declarative clauses, therefore, involve the instantiation of declarative templates with activation and target conditions; by restricting our consideration to dataless specifications, such conditions restrict the possible events to the activity labels expressed by the activation (and target) conditions. For example, Precedence(A,B) has A as an activation condition, as an occurrence of the first A requires checking whether no preceding B occurs. We denote $\mathcal{A}(c)$ as the set of traces containing at least one event satisfying an activation condition in $c$ [2]. The clause activation might require the investigation of a possible target condition, whose temporal occurrence is regulated by the template semantics. To demonstrate this, consider the following example: while Response(A,B) requires that the target conditions always follow the activations, RespExistence(A,B) only requires the target's existence across the trace, holding no temporal constraint as the former does. Some of the clauses listed in Table 1 entail others, due to stricter variations in “temporal strictness” or to the composition of multiple clauses. The previous example focuses on a stricter temporal variation, while Succession(A,B) entails both Response(A,B) and Precedence(A,B).
We state that a trace vacuously satisfies a Declare clause $c$ if the trace satisfies such a clause without providing an event that satisfies the clause’s activation condition (i.e., $\sigma \models c$ and $\sigma \notin \mathcal{A}(c)$). We can similarly frame the concept of vacuous violation, occurring when a trace fails to satisfy a declarative clause while providing no activation (i.e., $\sigma \not\models c$ and $\sigma \notin \mathcal{A}(c)$).
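The four resulting cases can be sketched as follows in Python (our illustration; classify, activates, and satisfies are hypothetical names), here instantiated for Response(A,B):

```python
# Classifying a trace w.r.t. a Declare clause: a clause carries an
# activation predicate over events and a satisfaction predicate over traces.

def classify(trace, activates, satisfies):
    activated = any(activates(e) for e in trace)   # trace in A(c)?
    sat = satisfies(trace)                         # trace |= c?
    if sat and activated:     return "non-vacuous satisfaction"
    if sat and not activated: return "vacuous satisfaction"
    if not sat and activated: return "violation"
    return "vacuous violation"   # arises, e.g., for Exists clauses

# Response(A, B): activated by any A; satisfied if each A is followed by B.
activates = lambda e: e == "A"
satisfies = lambda t: all("B" in t[i + 1:] for i, e in enumerate(t) if e == "A")

print(classify("AB", activates, satisfies))   # non-vacuous satisfaction
print(classify("CC", activates, satisfies))   # vacuous satisfaction
print(classify("BA", activates, satisfies))   # violation
```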
Declare walks in the footsteps of association rule mining by providing definitions of confidence and support metrics associated with declarative clauses. Still, the current literature provides disparate definitions of those metrics, depending on the specific context of interest [2,20], exploiting different clause “applicability” notions. Our previous work [2] considers “applicability” in terms of mere clause satisfiability, thus considering traces either vacuously or non-vacuously satisfying a given declarative clause, while straightforwardly considering the traces containing the activation conditions as the denominator for the confidence score:

$$\mathrm{Supp}(c) = \frac{|\{\sigma \in \mathcal{L} \mid \sigma \models c\}|}{|\mathcal{L}|} \qquad \mathrm{Conf}(c) = \frac{|\{\sigma \in \mathcal{L} \mid \sigma \models c\}|}{|\mathcal{A}(c)|}$$

As the notion of Supp is directly related to the satisfiability of a given declarative clause, despite its inclusion of the vacuously satisfied traces, this makes it the ideal candidate for the basis of an anti-monotonic quality criterion (Lemma A4) for implementing the lattice search within a general-to-specific heuristic algorithm for the hypothesis search. Still, this definition of confidence is not normalised between zero and one, as the number of traces satisfying the clause, possibly vacuously, can be much larger than the number of traces both satisfying it and providing an activation condition for it.
By restricting the notion of “applicability” to the sole traces containing activation conditions, we obtain the notions of restrictive support and confidence [20], ensuring that the clause fully represents the data if the evidence for both activation and (possibly) target conditions is given:

$$\mathrm{RSupp}(c) = \frac{|\{\sigma \in \mathcal{A}(c) \mid \sigma \models c\}|}{|\mathcal{L}|} \qquad \mathrm{RConf}(c) = \frac{|\{\sigma \in \mathcal{A}(c) \mid \sigma \models c\}|}{|\mathcal{A}(c)|}$$
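Under the definitions above, the following Python sketch (ours) computes all four metrics for a toy log and shows how vacuously satisfied traces push Conf beyond 1 while RConf stays within [0, 1]:

```python
# Support/confidence vs. restrictive support/confidence. log is a list of
# traces; sat and act are the sets of traces satisfying the clause and
# containing an activation, respectively.

def metrics(log, satisfies, activates):
    sat = {t for t in log if satisfies(t)}
    act = {t for t in log if any(activates(e) for e in t)}
    supp = len(sat) / len(log)                    # counts vacuous satisfaction
    conf = len(sat) / len(act) if act else 0.0    # may exceed 1 with vacuity
    rsupp = len(sat & act) / len(log)             # activated satisfaction only
    rconf = len(sat & act) / len(act) if act else 0.0   # always in [0, 1]
    return supp, conf, rsupp, rconf

# Response(A, B) over a toy log: two vacuous satisfactions inflate Conf.
satisfies = lambda t: all("B" in t[i+1:] for i, e in enumerate(t) if e == "A")
activates = lambda e: e == "A"
log = ["AB", "CC", "CD", "AC"]
print(metrics(log, satisfies, activates))   # (0.75, 1.5, 0.25, 0.5)
```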
Still, these two definitions each come with an intrinsic limitation. First, we can easily observe that RSupp is not anti-monotonic in the same way as Supp and, therefore, cannot be chosen as a quality criterion candidate for traversing the declarative hypothesis space (Section 6.2). Second, the authors showcasing these metrics assumed that, for Precedence(A,B), B should have been an activation condition, while this should never have been the case, as the clause only postulates that B shall not appear before the first A, but nothing is stated on whether Bs shall occur after an occurrence of A. For this particular clause, there is a requirement for a redefined declarative confidence that ranges between 0 and 1. This new definition aims to provide a more restrictive understanding of confidence, especially in the presence of arbitrary activations or vacuous conditions within the trace. These considerations force us to propose a novel specification of this restrictive confidence, a topic we will delve into in Section 3.1.
KnoBAB [2] is a columnar-based relational database, built with the prime purpose of optimizing formal verification for temporal data represented as event logs. This was achieved through the provision of relational semantics to the LTL$_f$ operators, namely xtLTL$_f$, while providing a one-to-one mapping between the provided temporal operators and the LTL$_f$ logical operators. The encoding of specific temporal operators for the relational model was mainly due to the inadequacy of the relational query plans and physical operators to efficiently compute such tasks over customary relational models [30]. The definition of xtLTL$_f$ operators facilitates the expression of formal verification tasks as relational database queries. These queries not only return the traces satisfying the specification but also provide details on the events that activate, target, and correlate with one another. xtLTL$_f$ also allows the identification of query plan optimisation strategies.
First, we minimise access to the data tables by fusing entire sub-expressions providing the same temporal behaviour, including the query plan leaf nodes that directly access the temporal information stored as relational tables.
Second, we can further pre-process this query plan to achieve trivial intra-query parallelism. As the aforementioned query plan transformation converts an abstract syntax tree into a directed acyclic graph, we can run a topological sort algorithm to determine the operators’ order of execution after representing them as nodes within the query plan. This allows the additional transformation of the aforementioned directed acyclic graph into a dependency graph, represented as a layered graph. As the nodes belonging to a given layer share no computational dependencies with one another, we can also define a scheduling order for the xtLTL$_f$ operators, where the execution can be trivially parallelised, achieving straightforward intra-query parallelism.
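A minimal Python sketch of this layering strategy follows (our illustration, not KnoBAB's scheduler; the deps encoding is an assumption): each operator is placed in the earliest layer strictly after all of its dependencies, so that operators within a layer can run in parallel.

```python
# Layering a DAG-shaped query plan for intra-query parallelism: nodes
# sharing a layer have no mutual dependencies and may run concurrently.

from collections import defaultdict

def layers(deps):
    """deps maps each operator to the operators it depends on."""
    memo = {}
    def layer(n):   # earliest layer = 1 + deepest dependency (memoised)
        if n not in memo:
            memo[n] = 1 + max((layer(d) for d in deps[n]), default=-1)
        return memo[n]
    grouped = defaultdict(list)
    for n in deps:
        grouped[layer(n)].append(n)
    return [grouped[i] for i in sorted(grouped)]

# Two leaf table scans feed an Until operator; a conjunction reuses scan_A.
plan = {"scan_A": [], "scan_B": [], "until": ["scan_A", "scan_B"],
        "and": ["until", "scan_A"]}
print(layers(plan))   # [['scan_A', 'scan_B'], ['until'], ['and']]
```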
Third, we guarantee efficient data access through the provision of indexed relational tables containing temporal, payload, and event-counting information. The results provided in our previous research [2] show that the system achieves significant performance gains against state-of-the-art dataless and data-aware temporal formal verification systems, whether ad hoc solutions or competing relational representations and queries expressing formal verification tasks directly in SQL.
We will now describe in greater detail the relational tables being exploited to store the log traces’ temporal behaviour, as both Bolt and the current algorithm leverage such an architecture. Temporal data are loaded and processed into three types of tables depicted in Figure 3: the CountingTable, the ActivityTable, and the AttributeTable(s).
The CountingTable compactly stores the number of occurrences (count cell) of each activity label (activityId cell) occurring in each trace (trace cell) as a 64-bit unsigned integer, which is also used to sort the table in ascending order. We also show the absence of events associated with a given activity label by setting the count field for a specific trace to zero. The space complexity associated with this table is $O(|\Sigma| \cdot |\mathcal{L}|)$ for an activity alphabet $\Sigma$ and log $\mathcal{L}$, while the occurrence count of a given activity for a specific trace can be retrieved in $O(1)$ time complexity. This table facilitates the efficient computation of dataless ‘exists’ and ‘absence’ Declare clauses.
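The following Python sketch illustrates the idea (a hypothetical layout, not KnoBAB's actual storage format): by materialising zero counts and sorting rows by (activityId, trace), the count lookup reduces to a direct offset computation.

```python
# Counting-table sketch: one row per (activity, trace) pair; rows sorted by
# (activityId, traceId) allow O(1) lookup of an occurrence count by offset.

def build_count_table(log, alphabet):
    """Zero counts are materialised so every pair has a row."""
    return [sum(1 for e in trace if e == a)
            for a in sorted(alphabet) for trace in log]

def count(table, n_traces, activity_idx, trace_idx):
    return table[activity_idx * n_traces + trace_idx]   # O(1) offset lookup

log = ["AAB", "BC", "A"]
table = build_count_table(log, "ABC")
print(count(table, len(log), 0, 0))   # occurrences of 'A' in trace 0 -> 2
print(count(table, len(log), 2, 0))   # occurrences of 'C' in trace 0 -> 0
```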
Example 2. With reference to Figure 3, showcasing a KnoBAB relational representation for a healthcare log, determining whether a mastectomy event occurred at least once in a given trace boils down to checking whether the associated count cell is non-zero. As this is the case, the aforementioned trace satisfies the desired condition.

As this table lacks any temporal information, we need to introduce a novel table that preserves it. Each row in an ActivityTable states that the i-th event of the j-th trace is labelled with a given activity, while the Prev (or Next) pointer references the immediately preceding (or following) event within the trace, if any.
The NULLs from Figure 3 in the ActivityTable highlight the start (finish) event of each trace, where there is no possible reference to past (future) events. The primary index of the ActivityTable efficiently points to the first and last record associated with a given activity label in $O(1)$ time complexity. As the table’s records are sorted by activity label, trace ID, and event ID, stored as 64-bit integers, we can retrieve all records containing a given activity across the log. Pointers provide efficient access to the events immediately preceding and following a given event without additional accesses to the table and a subsequent join across the accessed data.
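The following Python sketch mimics this layout (field and function names are our assumptions; for simplicity, the primary-index lookup is rendered as a binary search over the sorted records rather than KnoBAB's constant-time index):

```python
# Activity-table sketch: one record per event, sorted by (activity, trace,
# position), with Prev/Next pointers to the adjacent events of the same
# trace; records for a given label form a contiguous run.

from bisect import bisect_left, bisect_right

def build_activity_table(log):
    recs = sorted((act, t, i) for t, trace in enumerate(log)
                  for i, act in enumerate(trace))
    pos = {(t, i): r for r, (_, t, i) in enumerate(recs)}
    return [{"act": a, "trace": t, "pos": i,
             "prev": pos.get((t, i - 1)),          # NULL at trace start
             "next": pos.get((t, i + 1))}          # NULL at trace end
            for a, t, i in recs]

def records_for(rows, activity):
    """Primary-index lookup: the contiguous run of records for a label."""
    keys = [r["act"] for r in rows]
    return rows[bisect_left(keys, activity):bisect_right(keys, activity)]

rows = build_activity_table(["AB", "BA"])
for r in records_for(rows, "B"):
    prev = rows[r["prev"]]["act"] if r["prev"] is not None else None
    print(r["trace"], r["pos"], "preceded by", prev)
# trace 0: B at pos 1 preceded by A; trace 1: B at pos 0 preceded by None
```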
Example 3. ChainPrecedence(FollowUp,Mastectomy) in our running example requires accessing all the activation conditions through the primary index of such a table, thus obtaining the range of records from no. 8 to no. 9. Now, we can iterate over the table’s records, starting from no. 8 until no. 9 included. By exploiting the Prev pointers, we can determine that no. 8 has a preceding mastectomy event, while no. 9, also belonging to a different trace, has a preceding lumpectomy event, thus violating the aforementioned clause. This leads to the main optimisation for mining ChainPrecedence and ChainSuccession clauses; this assumption will be used implicitly in the code for mining such clauses in Section 4.3.3. The secondary index of the ActivityTable maps each trace ID to its first and last events. This is especially beneficial for declarative mining, as we can directly return the first and last events of each trace, thus quickly mining Init/End clauses.
Example 4. With reference to our running example, determining whether End(FollowUp) holds requires accessing the last event of each trace: among the retrieved events, only the first and last traces end with a FollowUp event, while the second ends with a Referral. This leads to the main optimisation associated with Algorithm 5 for mining Init/End clauses.
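A minimal Python sketch of this optimisation follows (our illustration; mine_init_end and the threshold handling are assumptions): Init/End candidates are mined with a single pass over the first and last events of each trace.

```python
# Mining Init/End clauses through a secondary index: mapping each trace to
# its first and last events lets us count candidates in one pass.

from collections import Counter

def mine_init_end(log, min_support):
    n = len(log)
    first = Counter(t[0] for t in log if t)    # first event of each trace
    last = Counter(t[-1] for t in log if t)    # last event of each trace
    init = [f"Init({a})" for a, c in first.items() if c / n >= min_support]
    end = [f"End({a})" for a, c in last.items() if c / n >= min_support]
    return init + end

print(mine_init_end(["AF", "BF", "AR"], 2 / 3))   # ['Init(A)', 'End(F)']
```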
AttributeTables guarantee the faithful reconstruction of the entire log information by associating the payload information with each event in the ActivityTable. In particular, we generate one table per key in the data payload. As our current algorithm does not rely on the payload information stored in these tables, we refer to [2] for further information.
2.3.2. Temporal Specification Mining
In traditional machine learning (Section 2.1), data provide the common ground from which specifications are extracted. Data points are usually associated with a classification label, directly associating the data with the expected class of specifications. This is very different from customary formal methods, where a formula defines the correct expected behaviours over all possible inputs. When we are only interested in distinguishing between behaviour that adheres to a formal specification and that which does not, it is adequate to identify two distinct classes: one for traces conforming to the expected temporal property, and another for those that deviate from it [21]. Still, doing so when data are biased, limited, or labelled by non-experts might be challenging [1]. Specification mining attempts to bridge the gap between specifications and data by inferring the former from the latter.
In the business process mining literature, specification mining is often referred to as declarative process (or model) mining. As in specification mining, we are interested in extracting specifications, where the language for the specification is expressed as Declare clauses. Apriori declarative mining (ADM) [20] represents an advancement over state-of-the-art declarative temporal specification mining algorithms [33] by limiting the instantiation of all possible declarative clauses to the itemsets returned by an Apriori algorithm; such clauses are then further pruned through declarative minimum support thresholds, while prior solutions [33] instantiated every Declare clause with all conceivable templates, leading to a quadratic combination set. This constraint is also evident in the available implementation of Apriori declarative mining + Fisher score (ADM+S) in Python [21], where the actual implementation instantiates all the templates for all the activity labels available in the current data. In fact, both [33] and [21] easily become intractable on real-world logs describing thousands of activities/events. ADM also further creams off infrequent declarative clauses through confidence and support. Notwithstanding the aforementioned improvements for candidate clause pruning, such approaches exploit no efficient heuristics for mining clauses. By only exploiting frequent itemset mining algorithms with support filtering for generating and testing all possible declarative clauses on log traces, the authors obtain suboptimal algorithms that generate and test all the candidate clauses (propositionalisation [23]), which are further creamed off through an additional filtering step.
Figure 4 provides a clear picture concerning the impossibility of generating, testing, and returning all possible clauses for realistic datasets containing many distinct events through exhaustive search and propositionalisation within a reasonable amount of time. As an upper bound to the total number of possible clauses to be returned, we can consider instantiating all possible pairs of activity labels from the alphabet $\Sigma$ for the 11 binary clauses and instantiating the 4 unary clauses (e.g., Absence or Exists) with every single activity label, thus obtaining the following:

$$11 \cdot |\Sigma|^2 + 4 \cdot |\Sigma|$$

For the lower bound, we can limit the number of clauses for the 3 symmetric clauses of Choice, ExclChoice, and CoExistence to $\binom{|\Sigma|}{2}$ combinations, thus obtaining $8 \cdot |\Sigma|^2 + 3 \cdot \binom{|\Sigma|}{2} + 4 \cdot |\Sigma|$. We can easily see that, for hundreds of activity labels, as in realistic cybersecurity datasets, naïve algorithms are required to test on the order of one million clauses, which cannot be analysed in a feasible amount of time. This clearly postulates the need for an efficient search algorithm limiting the search to the clauses effectively representative of the data of interest.
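The following Python sketch (ours) instantiates these bounds, confirming that a few hundred activity labels already yield on the order of one million candidate clauses:

```python
# Clause-count bounds for growing alphabets; the constants follow the
# template counts stated in the text (11 binary, 4 unary, 3 symmetric).

def upper(n):   # all ordered pairs for binary templates + unary templates
    return 11 * n * n + 4 * n

def lower(n):   # the 3 symmetric templates need unordered pairs only
    return 8 * n * n + 3 * (n * (n - 1) // 2) + 4 * n

for n in (10, 100, 300):
    print(f"|Sigma| = {n:4d}: between {lower(n):,} and {upper(n):,} clauses")
# |Sigma| = 300 already yields on the order of a million candidate clauses.
```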
Our previous implementation of dataless declarative process mining, i.e., Bolt [5], was designed to efficiently mine twelve of the clauses listed in Table 1; however, it excluded the two newly introduced clauses, Succession and ChainSuccession, as well as our novel proposal introduced in the present paper, Surround (Section 3.2). Bolt outperforms all state-of-the-art dataless declarative mining competitors by leveraging KnoBAB’s aforementioned indexed tables while avoiding costly formal verification computations: it only detects vacuity and violation conditions without the need for combining multiple xtLTL$_f$ operators, as these can generate considerable overhead due to the intermediate results generated by the implementation of the relational operators. Moreover, declarative clauses are generated using a general-to-specific search, starting from frequent clauses directly inferable from association rule mining, such as RespExistence and CoExistence, and temporally refining them by visiting the lattice in Figure 5. To avoid generating uninteresting atemporal behaviour when temporal behaviour might be identified, we generate Choice and ExclChoice clauses by combining unary infrequent patterns. Our current solution further improves this by discarding the return of such clauses if any more informative temporal behaviour is returned (Section 4.3.5).
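The following Python sketch (our illustration over a toy three-clause lattice, not Bolt's actual code) conveys the branch-and-bound descent: since Supp is anti-monotonic along the lattice, a clause below the threshold prunes its entire subtree, and only the most specific surviving clauses are returned.

```python
# General-to-specific lattice descent with anti-monotonic support pruning.
# refinements maps each clause to its temporally stricter children.

def descend(clause, support, refinements, threshold, out):
    if support(clause) < threshold:
        return                                   # prune the entire subtree
    kept = False
    for child in refinements.get(clause, ()):
        before = len(out)
        descend(child, support, refinements, threshold, out)
        kept |= len(out) > before                # a stricter child survived
    if not kept:
        out.append(clause)                       # most specific surviving clause

# Toy lattice: RespExistence > Response > ChainResponse, with toy supports.
lattice = {"RespExistence(A,B)": ["Response(A,B)"],
           "Response(A,B)": ["ChainResponse(A,B)"]}
supports = {"RespExistence(A,B)": 0.9, "Response(A,B)": 0.8,
            "ChainResponse(A,B)": 0.3}

result = []
descend("RespExistence(A,B)", supports.get, lattice, 0.5, result)
print(result)   # ['Response(A,B)']: the strictest clause above threshold
```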
In our previous contribution [5], we compared our proposed algorithm to TopN Declare [30] using KnoBAB’s formal verification pipeline, Apriori declarative mining (ADM) re-implemented on top of KnoBAB [20], and Apriori declarative mining + Fisher score (ADM+S) implemented in Python [21]. Through experiments, we demonstrated that our solution was scalable with regard to both trace length and log size, while consistently outperforming all former competitors in running time. TopN Declare was poor at mining all expected clauses from a log, as just considering all possible pairs among frequently occurring activity labels does not guarantee returning all the relevant clauses, while the others suffered greatly by producing explosive and redundant specifications of hundreds of clauses, even with a strict support threshold, as these solutions are affected by the curse of propositionalisation required to instantiate and test all possible clauses that might be generated from some mined binary frequent patterns. Bolt avoids the computational overhead of prior approaches thanks to the reduction of the specification mining task to a conformance checking one, and reduces the associated data allocation footprint by directly scanning relational tables without the need for generating intermediate results, as enforced by the adoption of query operators. Differently from previous mining approaches that required instantiating all possible declarative clauses over all possible frequent itemsets through extensive clause propositionalisation, Bolt exploits a preliminary branch-and-bound solution for pruning irrelevant or “redundant” clauses, avoiding testing clauses that are not represented within the data (Example 13). The support and confidence of frequent itemsets serve as heuristics for determining whether specific clauses are frequent or appear in the specification, thus narrowing down the testing of temporal patterns to only the relevant instances.
Despite Bolt outperforming state-of-the-art temporal specification mining algorithms while mining less superfluous and more descriptive specifications, this approach can still be significantly improved. First, despite Bolt being able to prune many redundant clauses that do not aptly describe the log compared to other solutions, the output specification is still far from minimal, as there still exist redundant clauses that imply each other. Improving this requires avoiding the return of Choice clauses for temporal declarative clauses that have already been considered if appearing with greater or equal support. Second, this preliminary solution still returns multiple symmetric clauses, which are deemed equivalent; e.g., Choice(A,B) is completely equivalent to Choice(B,A), as they refer to the same set of log traces. Third, we still return multiple clauses entailing each other, not fully exploiting the lattice descent for ensuring specification size minimality via the computation of an S-border. Any improvement of Bolt tackling all these limitations with minimal performance costs will provide specifications returning more specific clauses while being less verbose than Bolt, while outperforming all other competitors by at least one order of magnitude. Our methodology section (Section 4) shows how our novel solution addresses all such limitations.