Article

Quickening Data-Aware Conformance Checking through Temporal Algebras †

School of Computing, Faculty of Science, Agriculture and Engineering, Newcastle University, Newcastle Upon Tyne NE4 5TG, UK
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper: Appleby, S.; Bergami, G.; Morgan, G. Running Temporal Logical Queries on the Relational Model. In Proceedings of the IDEAS’22, 26th International Database Engineered Applications Symposium, Budapest, Hungary, 22–24 August 2022.
Information 2023, 14(3), 173; https://doi.org/10.3390/info14030173
Submission received: 14 November 2022 / Revised: 3 March 2023 / Accepted: 5 March 2023 / Published: 8 March 2023
(This article belongs to the Special Issue Best IDEAS: International Database Engineered Applications Symposium)

Abstract

A temporal model describes processes as a sequence of observable events characterised by distinguishable actions in time. Conformance checking allows these models to determine whether any sequence of temporally ordered and fully-observable events complies with their prescriptions. The latter aspect leads to Explainable and Trustworthy AI, as we can immediately assess the flaws in the recorded behaviours while suggesting any possible way to amend the wrongdoings. Recent findings on conformance checking and temporal learning have led to an interest in temporal models beyond the usual business process management community, thus including other domain areas such as Cyber Security, Industry 4.0, and e-Health. As current technologies for conformance checking are purely formal and not ready for real-world settings that return large data volumes, the need to improve existing conformance checking and temporal model mining algorithms to make Explainable and Trustworthy AI more efficient and competitive is increasingly pressing. To effectively meet such demands, this paper offers KnoBAB, a novel business process management system for efficient Conformance Checking computations performed on top of a customised relational model. This architecture was implemented from scratch following common practices in the design of relational database management systems. After defining our proposed temporal algebra for temporal queries (xtLTLf), we show that this can express existing temporal languages over finite and non-empty traces such as LTLf. This paper also proposes a parallelisation strategy for such queries, thus reducing conformance checking to an embarrassingly parallel problem leading to super-linear speed-up. This paper also presents how a single xtLTLf operator (or even entire sub-expressions) might be efficiently implemented via different algorithms, thus paving the way to future algorithmic improvements. Finally, our benchmarks highlight that our proposed implementation of xtLTLf (KnoBAB) outperforms state-of-the-art conformance checking software running on LTLf logic.

1. Introduction

(Temporal) conformance checking is increasingly at the heart of Artificial Intelligence activities: due to its logical foundation, assessing whether a sequence of distinguishable events (i.e., a trace) does not conform to the expected process behaviour (process model) reduces to the identification of the specific unfulfilled temporal patterns, represented as logical clauses. This leads to Explainable AI, as the process model’s violation becomes apparent. Clauses are the instantiation of a specific behavioural pattern (i.e., template) that expresses temporal correlation between actions being carried out (activations) and their expected results (targets). These, therefore, differ from traditional association rules [1], as they can also describe complex temporal requirements: e.g., whether the target should immediately follow (ChainResponse) or precede (ChainPrecedence) the activation, if the former might happen any time in the future (Response), or if the target should have never happened in the past (Precedence). These temporal constraints can be fully expressed in Linear Temporal Logic over Finite Traces (LTLf) and its extensions; this logic is referred to as linear as it assumes that, in a given sequence of events of interest, only one possible future event exists immediately following a given one. This differs from stochastic process modelling, where each event is associated with a probability distribution of possibly following events [2,3]. Such a formal language can be extended to express correlations between activations and targets through binary predicates correlating data payloads. Events are also associated with either an action or a piece of state information represented as an activity label. A collection of traces is usually referred to as a log.
Despite its theoretical foundations, state-of-the-art conformance checking techniques for entire logs exhibit sub-optimal run-time behaviour [4]. The reasons are the following: while performing conformance checking over relational databases requires computing costly aggregation conditions [5], tailored solutions do not exploit efficient query planning and data access minimisation, thus requiring scanning the traces multiple times [6]. Efficiency becomes of the utmost importance after observing that conformance checking’s run-time enhancement has a strong impact on a wide range of practical use case scenarios (Section 1.1). To make conformance checking computations efficient, we synthesise temporal data derived from a system (be it digital or real) to a simplified representation in the Relational Database model. In this instance, we use xtLTLf, our proposed extension of LTLf, to represent process models. While the original LTLf merely asserts whether a trace is conformant to the model, our proposed algebra returns all of the traces satisfying the temporal behaviour, as well as activated and targeted events. As a temporal representation in the declarative model provides a point-of-relativity in the context of correctness (i.e., time itself may dictate if traces maintain correctness throughout the logical declarations expressed by the model), the consideration of such temporal issues significantly increases the checking requirement. This is due to a need to visit logical declarations for correctness in the context of each temporal instance.
This paper extends our previous work [4], where we showed the effectiveness of the relational model for efficiently running temporal queries, outperforming state-of-the-art model checking systems. The system's name, KnoBAB, stands for KNOwledge Base for Alignments and Business process modelling. The Business Process Mining literature often uses the term Knowledge Base differently from customary database literature: while in the former the intended meaning is a customary relational representation for trace data, in the latter we often require that such a representation provides a machine-readable representation of data in order to infer novel facts or to detect inconsistencies. While our original work [4] provided just a brief rationale behind the success of KnoBAB, this paper examines in depth each contribution leading to our implementation's success.
  • As an extension of our previous work, we fully formalise the logical data model (Section 3.1) and characterise the physical one (Section 4) in order to faithfully represent our log. This is a prelude to the full formalisation of the xtLTLf algebra;
  • Contextually, we also show for the first time that the xtLTLf algebra (Section 3.2) can not only express declarative languages such as Declare [7], as in our previous work, but can also express the semantics of LTLf formulae by returning any non-empty finite trace satisfying the latter if loaded in our relational representation (see Appendix A.2). We also show for the first time a formalisation for data correlation conditions associated with binary temporal operators;
  • Differently from our previous work, where we just hinted at the implementation of each operator at a high level, we now propose different possible algorithms for some xtLTLf operators (Section 6), and we then discuss both theoretically (Supplement II.2) and empirically (Section 7.1) which might be preferred under different trace length $\epsilon$ or log size $|\mathcal{L}|$ conditions. This leads to the definition of hybrid algorithms [8];
  • Our benchmarks demonstrate that our implementation outperforms conformance checking techniques running on both relational databases (Section 7.2) and on tailored solutions (Section 7.5) when customary algorithms are chosen for implementing xtLTLf operators;
  • Finally, this paper considerably extends the experimental section from our previous work. First, we show (Section 7.3) how the query plan’s execution might be parallelised, thus further improving with super-linear speed-up our previous running time results. Then, we also discuss (Section 7.4) how different data accessing strategies achievable through query rewriting might affect the query’s running time.
Figure 1 provides a graphical depiction of this paper’s table of contents, with the mutual dependencies between its sections. Appendices and Supplements start from p. 50.
Figure 2 provides a bird's-eye view of the overall KnoBAB architecture: in the upper half, we show how a log is loaded in our business process management system as a series of distinct tables providing some activity statistics (CountingTable) and full payload information (AttributeTable) in addition to reconstructing the unravelling of the events as described by their traces (ActivityTable). On the other hand, the lower half shows the main steps of the query engine transforming a declarative model into a DAG query plan accessing the previously-loaded relational tables. The most recent version of our system is on GitHub (https://github.com/datagram-db/knobab, accessed on 5 March 2023). When not explicitly stated, all the links were last accessed on 5 March 2023.

1.1. Case Studies

The present section shows a broad-ranging set of real case studies requiring efficient conformance checking computations in LTLf. This, therefore, motivates the need for our proposed approach in a practical sense.

1.1.1. Cyber Security

Intrusion detection for cyber security aims at auditing an environment for identifying potential flaws that can be remedied and fixed later. While anomaly-based approaches raise an alarm if the observed behaviour differs significantly from the expected one, signature-based approaches check whether attack patterns might be recognised from the environment. The latter are often used to mitigate the high false-alarm rates of the former [10]. Expected behaviour might be encoded as process models expressed in LTLf, which, when violated, lead to the detection of an attack: such a language can be directly exploited to represent several different kinds of attacks, such as Denial Of Service, Buffer Overflows, and Password Guessing [10]. In his dissertation [11], Ray shows how malware can be detected by determining LTLf formulae discriminating system-call patterns generated by malicious software from expected run-time behaviour. Recent developments [12,13] showed that it is possible to perform prediction (and therefore reasoning) on potentially infinite sequences by analysing a finite subsequence of the overall behaviour within a sliding window; Buschjäger et al. [12] predict future events not covered by the sliding window by correlating them to the patterns observed in such a window. By associating a positive label to each finite subsequence preceding or containing an attack, and a negative one otherwise, we can also extract temporal models detecting subsequences containing attacks [14]. This entails that real-time verification boils down, to some extent, to offline monitoring, as we guarantee that it is sufficient to analyse currently-observed behaviours to predict and detect an attack. The learned model, once validated, can be exploited in the aforementioned real-time verification systems [10].
Example 1.
The Cyber Kill Chain® framework (https://www.lockheedmartin.com/en-us/capabilities/cyber/cyber-kill-chain.html, accessed on 5 March 2023) allows the identification and prevention of intrusion activities on computer systems. This framework is based on a military tactic simply known as a kill chain (https://en.wikipedia.org/wiki/Kill_chain, 5 March 2023), which breaks down the attack into the following phases: target identification, marshalling and organizing forces towards the target, starting an attack, and target neutralisation. Lockheed Martin transferred these steps to the IT world, redirecting the attack against a virtual target, and reformulated the phases as follows:
  • Reconnaissance (rec): An attacker observes the situation from the outside in order to identify targets and tactics. As the attacker mainly collects information regarding the system’s vulnerabilities, this is the hardest part to detect.
  • Weaponisation (weap): After gathering the information, the cybercriminal implements his strategy through a software artefact. This detection will have greater chances of success in the future after post-mortem analysis, when either a temporal model is mined over the collected attack data or the strategy is directly inferred from available artefacts (e.g., malware).
  • Payload or Delivery (del): The cybercriminal devises a way to infiltrate the host system that hides the previously produced artefact (e.g., a Trojan). This must appear as harmless as possible to fool the system.
  • Exploitation (expl): The cybercriminal exploits the system’s vulnerabilities and infiltrates it through the previous “cover”. At this stage, the defensive system should raise the alarm if any kind of unusual behaviour is detected while increasing the security level.
  • Installation (inst): The weapon escapes the payload and gets installed into the host computer system. At this point, suspicious behaviour might be detected through malicious system calls.
  • Command & Control (comm): The weapon establishes a communication with the cybercriminal for receiving orders from the attacker. The system should detect any kind of suspicious network communication and should attempt to break the communication channel.
  • Action (act): The intruder starts the attack on the system. At this stage, the attack should be more evident, and the Industrial IoT Shields (iiot_sh), such as network devices protection, should be activated.
Figure 3a describes the actions (and therefore activity labels) of interest. Having defined the actions that should be monitored, records of activities can be stored as traces within a log. This is represented in Figure 3b, where we define three distinct attacks as distinct traces ($\sigma^1$, $\sigma^2$, $\sigma^3$). Each trace contains the event information leading up to the completion of an attack attempt (which may be (un)successful). Data payload information is also considered, and here this is provided as the unique timestamp (ts) associated with each event. Trace payload information is not simulated here but is described and applied in Example 2.
A temporal model might describe a completely successful attack. The occurrence of the aforementioned phases can be described through a temporal declarative language Declare [7], where each constraint is an instantiated Declare clause (see Table 1). Our declarative language should be able to state the following requirements: Ⓐ all reconnaissance events should be followed by a weaponisation, Ⓑ there should be no IoT shields in place, and Ⓒ either command and control or action should occur.
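The three requirements Ⓐ–Ⓒ can be illustrated with a minimal Python sketch. It assumes that Ⓐ maps to a Response-like clause, Ⓑ to an Absence-like clause, and Ⓒ to a Choice-like clause; this mapping, the function names, and the two sample traces are illustrative assumptions, not the content of Table 1 or Figure 3b. Traces are reduced to lists of activity labels, ignoring payloads.

```python
def response(trace, activation, target):
    """Every `activation` is followed (at or after its position) by `target`."""
    return all(target in trace[j:]
               for j, label in enumerate(trace) if label == activation)


def absence(trace, label):
    """`label` never occurs in the trace."""
    return label not in trace


def choice(trace, first, second):
    """At least one of the two labels occurs in the trace."""
    return first in trace or second in trace


def successful_attack(trace):
    """Conjunction of the three requirements (A), (B), and (C)."""
    return (response(trace, "rec", "weap")
            and absence(trace, "iiot_sh")
            and choice(trace, "comm", "act"))


# Made-up traces for illustration only.
full_attack = ["rec", "weap", "del", "expl", "inst", "comm", "act"]
blocked_attack = ["rec", "weap", "del", "expl", "iiot_sh"]

print(successful_attack(full_attack))     # True
print(successful_attack(blocked_attack))  # False
```

Each function scans the whole trace independently, which is the kind of repeated scanning that the relational approach described in this paper aims to avoid.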
On blockchains, each trace event represents a proper blockchain event, thus including function or event invocations issued by one or more smart contracts. In particular, smart contracts are sets of conditions specified in self-executing programs [15], which include protocols within which the parties will fulfil some promises [16]. Given that smart contracts can also be seen as postconditions activated upon the occurrence of specified pre-conditions [17], they are also exploited as security measures reducing malicious and accidental exceptions [15]. As per previous considerations, we can directly encode the smart contract premises in LTLf, as well as represent the whole smart contract as a single LTLf formula under the assumption that the blockchain guarantees its execution [17]. Therefore, we can perform post-mortem analysis checking whether a given run-time abides by the rules imposed by the system.

1.1.2. Industry 4.0

Smart factories enable the collection and analysis of data through advanced sensors and embedded software for better decision-making. These enable monitoring each phase of the entire production process in both real-time and domain-specific applications where the safety of both autonomous cyber-physical systems as well as human workers is at stake [18]. This is of the utmost importance, as both humans and machines cooperate in the same environment where a minimal violation of safety requirements might damage the overall production process, thus reflecting in maintenance costs. This calls for logical-based formal methods providing correctness guarantees [19]. Run-time verification [19] and prediction [13] have started gaining momentum against customary static analysis tools: in fact, real complex systems such as factories are often hard to predict and analyse before execution. As run-time verification can be deployed as a permanent testing condition on the environment, Mao et al. [19] show that this approach is complete, thus reducing the complicated model-checking problem into a simpler conformance checking one. Programmable Logic Controllers (PLCs) are at the heart of this mechanism, where controllers can make decisions over previously-observed events. PLCs work similarly to smart contracts in the previous scenario: at each “scan cycle”, the controllers perceive through sensors the status change of the environment (e.g., variations of temperature and pressure). This information is then fed to the internal logic, which, on the other hand, might decide to intervene directly in the environment by sending signals to some actuators (e.g., controlling the pressure and temperature on the system). Due to the similarity of PLCs to smart contracts, these might also exploit LTLf for determining security requirements: when a safety condition is violated, the PLC might activate an alarm while ensuring that the system works within safe operation ranges [19].
Please observe that ptLTL, also defined in [19] as a version of LTL allowing reasoning on past events so as to avoid semi-decidable computations for traces of infinite length, might still be represented through an equivalent LTLf formula evaluated over a finite sliding window [13] bounded by the first and the latest event. The difference between LTL and LTLf is that only the latter considers traces of finite length.
In some other industrial scenarios, we might be interested in detecting unexpected variations in time series reflecting the fluctuation of some perceived variables (e.g., variations in temperature and pressure). The latest developments [13] showed that (industrial) time series could also be represented as traces: we might assign to each event an activity label $\varpi$ if the current event has a data payload whose values upper bound the ones from the immediately preceding event’s payload, and $\neg\varpi$ otherwise. Consequently, we can encode disparate data variation patterns in LTLf reflecting different types of data volatility or steep increases/decreases [13]. This shows how LTLf can also represent anomaly-based problems by reducing them to the identification of anomaly patterns [20].
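The time-series encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single numeric variable per reading, "upper bounds" read as greater-than-or-equal, the first reading skipped because it has no predecessor, and the label names `w` and `not_w` standing in for $\varpi$ and $\neg\varpi$. The helper `has_run` is a hypothetical example of a variation pattern (a sustained increase), not a pattern taken from [13].

```python
def series_to_trace(values, up="w", not_up="not_w"):
    """Turn a numeric series into a trace of (label, payload) events.

    An event is labelled `up` when its value is at least the previous one,
    and `not_up` otherwise.
    """
    trace = []
    for previous, current in zip(values, values[1:]):
        label = up if current >= previous else not_up
        trace.append((label, {"value": current}))
    return trace


def has_run(trace, label, length):
    """True if `length` consecutive events carry `label` somewhere in the trace."""
    run = 0
    for event_label, _ in trace:
        run = run + 1 if event_label == label else 0
        if run >= length:
            return True
    return False


temperatures = [20.0, 20.5, 21.0, 19.0, 19.5]
trace = series_to_trace(temperatures)
print([label for label, _ in trace])  # ['w', 'w', 'not_w', 'w']
print(has_run(trace, "w", 2))         # True
print(has_run(trace, "w", 3))         # False
```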

1.1.3. Healthcare

A medical process describes clinical-related procedures as well as organisational management ones (e.g., registration, admission, and discharge) [21]. The renowned openEHR (https://www.openehr.org/, accessed on 5 March 2023) standard distinguishes the former into four main archetypes: an observation, recording patients’ clinical symptoms (e.g., body temperature, blood pressure); an evaluation, providing a preliminary diagnosis and assessing the patient’s health based on the former results; an instruction, the execution of the treatment plan proposed by a physician (e.g., prescribing, examining, and testing); and an action, describing the way to intervene or treat medical patients according to the treatment plan (e.g., drug administration, blood matching). Once encoded as such, each instantiation of a medical process, i.e., a patient’s clinical course, can then be collected and represented in a log. As such, each action is going to be represented as a distinct activity label of a given event [22] that might contain relevant payload information recording the outcome of the clinical procedures, as well as demographic information related to the patient [21] for future socio-clinical analyses [23].
Declarative temporal languages such as Declare can then be exploited to provide a descriptive approach specifying temporal constraints among activities without strictly enforcing their order of completion, thus restricting the order of application of a specific set of activities [21]. As these models come with temporal semantics expressed in LTLf, these are, for all intents and purposes, process models. As such, these might be applied to detect discrepancies between clinical guidelines, expressed by the aforementioned model, and the actual process executions collected in a log. This is of the utmost concern, as deviations often represent errors compromising the patient’s recovery [22], which, if efficiently identified in advance, lead to increased patient satisfaction as well as a reduction of healthcare costs (e.g., due to mismanagement) [21].
Example 2.
To minimise costs and unnecessary procedures, only ill patients should receive treatment. Thus, sufferers not receiving treatment (false negatives) and non-sufferers receiving treatment (false positives) need to be minimised. Figure 2 proposes a simplified scenario where we consider two event payload keys: CA 15-3 (cancer antigen concentration in a patient’s blood) and biopsy (biopsies should be taken before any procedure is acted upon). Our model, which describes a medical protocol and the desired patients’ health condition at each step, targets only breast cancer patients with successful therapies. Ⓒ states that two possible surgical operations for breast tumours are mastectomy or lumpectomy if the biopsy is positive and the CA-13.5 is way above (50) the guard level, being 23.5 units per mL, and Ⓐ–Ⓑ any successful treatment should decrease the CA-13.5 levels, which should be below the guard level; such a correlation data condition is expressed via a $\Theta$ condition (introduced by a where). A twinned negative model (not shown in the figure) might better discriminate healthy patients from patients where the therapy was unsuccessful. Novel situations can be represented as a log. For example, in Figure 2, we have three patients: ➀ a cancer patient with a successful mastectomy, ➁ a healthy patient, and ➂ an unsuccessful lumpectomy, thus suggesting that the patient might still have some cancerous cells. Given the aforementioned model, patient ➀ will satisfy the model as the surgical operation was successful, ➁ will not satisfy the model because neither a mastectomy nor a lumpectomy was required ($\mathcal{M}$ is only fulfilled for successful procedures), and ➂ will not satisfy the target condition, even though the correlation condition was met. Our model of interest should only return ➀ as an outcome of the conformance checking process.

2. Preliminaries

eXtensible Event Stream (XES). This paper relies on temporal data represented as temporally ordered sequences of events (traces or streams), where events are associated with at most one action described by a single activity label [24]. In this paper, we formally characterize payloads as part of both events and traces while, in our previous work, we only considered payloads from events [25].
Given an arbitrarily ordered set of keys $K$ and a set of values $V$, a tuple [26] is a finite function $p \colon K \to V$ (also $p \in V^K$), where each key is either associated with a value in $V$ or is undefined. After denoting $\bot$ as a null element missing from the set of values ($\bot \notin V$), we can express that $\kappa$ is not associated with a value in $p$ as $p(\kappa) = \bot$, thus $\kappa \notin \mathrm{dom}(p)$. An empty tuple $\varepsilon$ has an empty domain.
(Data) payloads are tuples, where values can represent either categorical data or numerical data. An event $\sigma_j^i$ is a pair $\langle a, p \rangle \in \Sigma \times V^K$, where $\Sigma$ is a finite set of activity labels, and $p$ is a finite function describing the data payload. A trace $\sigma^i$ is an ordered sequence of distinct events $\sigma_1^i, \dots, \sigma_n^i$, which is distinguished from the other traces by a case id $i$; $n$ represents the trace’s length ($n = |\sigma^i|$). If a payload is also associated with the whole trace, this can be easily mimicked by adding an extra initial event containing such a payload with an associated label of __trace_payload. A log $\mathcal{L}$ is a finite set of traces $\{\sigma^1, \dots, \sigma^m\}$. In this paper, we further restrict our interest to the traces containing at least one event, since empty traces do not describe any temporal behaviour of interest. Finally, we denote as $\beta \colon \Sigma \to \{1, \dots, |\Sigma|\}$ the bijection mapping each activity label occurring in the log to a unique id.
Example 3.
The log $\mathcal{L}$ in Figure 2 comprises three distinct traces $\mathcal{L} = \{\sigma^1, \sigma^2, \sigma^3\}$. In particular, the second trace comprises two events $\sigma^2 = \sigma_1^2 \sigma_2^2$, where the first event represents the trace payload, and therefore $\sigma_1^2 = \langle \texttt{\_\_trace\_payload}, p \rangle$ having $p(\texttt{loc\_po}) = \textrm{NE}$ and $p(\texttt{p\_id}) = \textrm{002A}$. The other event is $\sigma_2^2 = \langle \textrm{Referral}, \tilde{p} \rangle$, where payload $\tilde{p}$ is only associated with the CA-13.5 levels as $\tilde{p}(\textrm{CA-13.5}) = 20$. Similar considerations can be carried out for the other log traces.
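The definitions above can be mirrored by a small in-memory data model. The following Python sketch is an illustration only and is unrelated to KnoBAB's actual relational storage: the class and function names are hypothetical, `None` plays the role of the null element $\bot$, a log is assumed to be a dictionary from case ids to traces, and the bijection $\beta$ is obtained by sorting labels alphabetically, which is one arbitrary choice among many.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Value = Union[str, int, float]


@dataclass
class Event:
    """An event <a, p>: an activity label plus a payload tuple."""
    label: str
    payload: Dict[str, Value] = field(default_factory=dict)

    def get(self, key):
        """Return p(key); None stands for the null element (key undefined)."""
        return self.payload.get(key)


Trace = List[Event]


def with_trace_payload(events, trace_payload):
    """Mimic a trace-level payload by prepending a __trace_payload event."""
    return [Event("__trace_payload", dict(trace_payload))] + list(events)


def activity_ids(log):
    """One possible bijection beta from activity labels to 1..|Sigma|."""
    labels = sorted({e.label for trace in log.values() for e in trace})
    return {label: i for i, label in enumerate(labels, start=1)}


# The second trace of Example 3, with the values reported there.
sigma2 = with_trace_payload(
    [Event("Referral", {"CA-13.5": 20})],
    {"loc_po": "NE", "p_id": "002A"},
)
log = {2: sigma2}

print(len(sigma2))              # 2
print(sigma2[1].get("biopsy"))  # None
print(activity_ids(log))        # {'Referral': 1, '__trace_payload': 2}
```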
Linear Temporal Logic over finite traces (LTLf). LTLf is a well-established extension of modal logic considering the possible worlds as finite traces, where each event represents a single relevant instant of time. The time is thereby linear, discrete, and future-oriented. This entails that the events represented in each trace are totally ordered and, as LTLf quantifies only on events reported in the trace, all the events of interest are fully observable. The syntax of a well-formed LTLf formula $\varphi$ is defined as follows:
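$$\varphi ::= a \mid \neg\varphi \mid \varphi \lor \varphi' \mid \varphi \land \varphi' \mid \bigcirc\varphi \mid \Box\varphi \mid \Diamond\varphi \mid \varphi \;\mathcal{U}\; \varphi'$$
where $a \in \Sigma$. Its semantics is usually defined in terms of First Order Logic [27] for a given trace $\sigma^i$ at a current time $j$ (e.g., for event $\sigma_j^i$) as follows:
  • An event satisfies the activity label $a$ iff. its activity label is $a$: $\sigma_j^i \vDash a \Leftrightarrow \sigma_j^i = \langle a, p \rangle$;
  • An event satisfies the negated formula iff. the same event does not satisfy the non-negated formula: $\sigma_j^i \vDash \neg\varphi \Leftrightarrow \sigma_j^i \nvDash \varphi$;
  • An event satisfies the disjunction of LTLf sub-formulæ iff. the event satisfies one of the two sub-formulæ: $\sigma_j^i \vDash \varphi \lor \varphi' \Leftrightarrow \sigma_j^i \vDash \varphi \lor \sigma_j^i \vDash \varphi'$;
  • An event satisfies the conjunction of LTLf formulæ iff. the event satisfies all of the sub-formulæ: $\sigma_j^i \vDash \varphi \land \varphi' \Leftrightarrow \sigma_j^i \vDash \varphi \land \sigma_j^i \vDash \varphi'$;
  • An event satisfies a formula at the next step iff. the formula is satisfied in the incoming event if present: $\sigma_j^i \vDash \bigcirc\varphi \Leftrightarrow j < |\sigma^i| \land \sigma_{j+1}^i \vDash \varphi$;
  • An event globally satisfies a formula iff. the formula is satisfied in all the following events, including the current one: $\sigma_j^i \vDash \Box\varphi \Leftrightarrow \forall j \le x \le |\sigma^i|.\ \sigma_x^i \vDash \varphi$;
  • An event eventually satisfies a formula iff. the formula is satisfied in either the present or in any future event: $\sigma_j^i \vDash \Diamond\varphi \Leftrightarrow \exists j \le x \le |\sigma^i|.\ \sigma_x^i \vDash \varphi$;
  • An event satisfies $\varphi$ until $\varphi'$ holds iff. $\varphi$ holds at least until $\varphi'$ becomes true, which must hold at the current or a future position: $\sigma_j^i \vDash \varphi \;\mathcal{U}\; \varphi' \Leftrightarrow \exists j \le y \le |\sigma^i|.\ \sigma_y^i \vDash \varphi' \land \forall j \le z < y.\ \sigma_z^i \vDash \varphi$.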
Other operators can be seen as syntactic sugar: Weak-Until is denoted as $\varphi \;\mathcal{W}\; \varphi' := \varphi \;\mathcal{U}\; \varphi' \lor \Box\varphi$, while the implication can be rewritten as $\varphi \Rightarrow \varphi' := (\neg\varphi) \lor (\varphi \land \varphi')$. Generally, binary operators bridge activation and target conditions appearing in two distinct sub-formulæ. The semantics associated with activity labels, consistently with the literature on business process execution traces [25], assumes that, in each point of the sequence, one and only one element from $\Sigma$ holds. We state that a trace $\sigma^i$ is conformant to an LTLf formula iff. it satisfies it starting from the first event: $\sigma^i \vDash \varphi \Leftrightarrow \sigma_1^i \vDash \varphi$, and is deviant otherwise [25]. The Declare language described in the next paragraph provides an application for such logic. Just as relational algebra describes the semantics for SQL [28,29], LTLf is extensively applied [30] as a semantics for formally expressing temporal and human-readable declarative constraints such as Declare.
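The semantics above translates almost literally into a recursive evaluator. The Python sketch below is a naive illustration, not the xtLTLf algebra nor KnoBAB's query engine: formulae are encoded as nested tuples with hypothetical operator tags, traces are lists of activity labels (payloads are ignored), and positions are 0-based, so the first event sits at index 0. It assumes the strong reading of the next operator, where a next event must exist.

```python
def holds(trace, j, phi):
    """Check whether the event at 0-based position j satisfies phi."""
    op, n = phi[0], len(trace)
    if op == "atom":
        return trace[j] == phi[1]
    if op == "not":
        return not holds(trace, j, phi[1])
    if op == "or":
        return holds(trace, j, phi[1]) or holds(trace, j, phi[2])
    if op == "and":
        return holds(trace, j, phi[1]) and holds(trace, j, phi[2])
    if op == "next":
        return j + 1 < n and holds(trace, j + 1, phi[1])
    if op == "globally":
        return all(holds(trace, x, phi[1]) for x in range(j, n))
    if op == "eventually":
        return any(holds(trace, x, phi[1]) for x in range(j, n))
    if op == "until":
        return any(holds(trace, y, phi[2])
                   and all(holds(trace, z, phi[1]) for z in range(j, y))
                   for y in range(j, n))
    raise ValueError(f"unknown operator: {op!r}")


def conformant(trace, phi):
    """A non-empty trace is conformant iff its first event satisfies phi."""
    if not trace:
        raise ValueError("empty traces are not considered")
    return holds(trace, 0, phi)


def implies(f, g):
    """Implication as syntactic sugar: (not f) or (f and g)."""
    return ("or", ("not", f), ("and", f, g))


def weak_until(f, g):
    """Weak-Until as syntactic sugar: (f until g) or (globally f)."""
    return ("or", ("until", f, g), ("globally", f))


a, b = ("atom", "a"), ("atom", "b")
response = ("globally", implies(a, ("eventually", b)))
chain_response = ("globally", implies(a, ("next", b)))

print(conformant(["a", "c", "b"], response))        # True
print(conformant(["a", "b", "a"], response))        # False
print(conformant(["a", "c", "b"], chain_response))  # False
print(conformant(["c", "c"], weak_until(("atom", "c"), b)))  # True
```

The evaluator revisits the same positions many times for nested temporal operators, which illustrates why tailored solutions that scan traces repeatedly behave sub-optimally on large logs.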
At the time of writing, different authors have proposed several extensions for representing data conditions in LTLf. The simplest extensions are compound conditions $a \land q$, which are the conjunction of a data predicate $q \in \mathrm{Prop}$ with the activity label $a$ [25]. Nevertheless, this straightforward solution is not able to express correlation conditions in the data which might be relevant in business scenarios [31], as representing correlations as single atoms requires decomposing the former into disjunctions of formulae [32]. Despite prior attempts to define a temporal logic expressing correlation conditions, no explicit formal semantics on how this can be evaluated was provided [6]. This poses a problem to the current practitioner, as this hinders both formally checking the equivalence between two languages expressing correlation conditions and providing a correct implementation of such an operator. We, on the other hand, propose a relational representation of xtLTLf, where the semantics of all of the operators, thus including the ones requiring correlation conditions, is clearly laid out and implemented.
Declare. Temporal declarative languages pinpoint highly variable scenarios, where state machines provide complicated graph models that are hardly understandable by the common business stakeholder [33]. Among all possible temporal declarative languages, we constrain our interest to Declare, originally proposed in [7]. Every single temporal pattern is expressed through templates (i.e., an abstract parameterised property: Table 1 column 2), which are parametrised over activation, target, or correlation conditions. Template names induce the semantic representation in LTLf, $\llbracket c_l \rrbracket$, of each model clause $c_l$. Therefore, a trace $\sigma^i$ is conformant to a Declare clause iff. it satisfies its associated semantic representation in LTLf ($\sigma^i \vDash c_l \Leftrightarrow \sigma^i \vDash \llbracket c_l \rrbracket$). At this stage, activation (and target) conditions are predicates $A \land p$ (and $B \land q$) in such a clause asserting required properties for the events’ activity label ($A$ and $B$) and payload ($p$ and $q$). An event in a given trace activates (or targets) a given clause if it satisfies the activation (or target) condition. Please observe that neither activation nor target conditions postulate the temporal (co)occurrence between activating or targeting events, as this duty is transferred to the specific LTLf semantics of the clause. A trace vacuously satisfies a clause if the trace satisfies the clause even though no event in the trace satisfies the activation condition. After this, we state that a trace non-vacuously satisfies the declarative clause if the trace satisfies the clause and one of the following conditions is satisfied:
  • The clause provides no target condition and there exists at least one activating event;
  • The clause provides a target condition but no binary (payload) predicate Θ , and the declarative clause establishes a temporal correlation between (at least one) activating event and (at least one) targeting one;
  • The clause provides both a target condition and a binary predicate Θ , while the activating and targeting events satisfying the temporal correlation as in the previous case also satisfy a binary Θ predicate over their payloads; in this situation, we state that the activating and targeting event match as they jointly satisfy the correlation condition Θ .
Finally, the presence of activating events is a necessary condition for non-vacuous satisfiability.
We can then categorize each Declare template from [30] through these conditions and the ability to express correlations between two temporally distant events happening in one trace: simple templates (Table 1, rows 1–3) only involving activation conditions; (mutual) correlation templates (rows from 4 to 15), which describe a dependency between activation and target conditions, thus including correlations between the two; and negative relation templates (last 2 rows), which describe a negative dependency between two events in correlation. Despite these templates possibly appearing quite similar, they generate completely different finite state machines, thus suggesting that these conditions are not interchangeable (http://ltlf2dfa.diag.uniroma1.it/, 5 March 2023). Figure 4 exemplifies the behavioural difference between two clauses differing only on the template of choice.
A Declare model is composed of a set of clauses $\mathcal{M} = \{c_l\}_{l \le n,\, n \in \mathbb{N}}$, which have to be simultaneously satisfied for the model to hold. A trace $\sigma^i$ is conformant to a model $\mathcal{M}$ iff. such a trace satisfies each LTLf formula $\llbracket c_l \rrbracket$ associated with the model clause $c_l \in \mathcal{M}$. Consequently, a Declare model can be represented as a finitary conjunction of the LTLf representation of each of its clauses, $\llbracket \mathcal{M} \rrbracket := \bigwedge_{c_l \in \mathcal{M}} \llbracket c_l \rrbracket$. For this, the Maximum-SATisfiability problem (Max-SAT) computes, for each trace, the ratio of satisfied clauses to the whole model size. This consideration can later be extended to data predicates through predicate atomisation [25], as discussed later in this section.
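As an illustration of the difference between conformance and the Max-SAT ratio, the following minimal Python sketch treats each clause as a Boolean predicate over a trace of activity labels. The helper names (`exists`, `response`, `is_conformant`, `max_sat_ratio`) are hypothetical and payloads are ignored; this is not KnoBAB's implementation.

```python
from typing import Callable, Sequence

Trace = Sequence[str]              # a trace reduced to its activity labels
Clause = Callable[[Trace], bool]   # hypothetical per-clause checker


def exists(a: str) -> Clause:
    """Exists(a, 1): the trace contains at least one a-labelled event."""
    return lambda trace: a in trace


def response(a: str, b: str) -> Clause:
    """Response(a, b): every a-labelled event is eventually followed by a b."""
    return lambda trace: all(
        b in trace[j:] for j, label in enumerate(trace) if label == a
    )


def is_conformant(trace: Trace, model: Sequence[Clause]) -> bool:
    """A trace conforms to the model iff. it satisfies every clause."""
    return all(clause(trace) for clause in model)


def max_sat_ratio(trace: Trace, model: Sequence[Clause]) -> float:
    """Fraction of the model clauses satisfied by the trace."""
    return sum(clause(trace) for clause in model) / len(model)


model = [exists("a"), response("a", "b"), response("b", "c")]
deviant = ["a", "b", "a"]    # the last "a" is never followed by a "b"
compliant = ["a", "b", "c"]
```

Under these assumptions, `deviant` is not conformant yet still satisfies one clause out of three, which is the information the Max-SAT ratio retains and plain conformance discards.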
Relational Models and Algebras. The relational model was first introduced by Codd [34] to compactly operate over tuples grouped into tables. Such tables are represented as mathematical n-ary relations that can be handled through a relational algebra. Upon the effective implementation of the first Relational Database Management Systems (RDBMS), such algebra expressed the semantics of the well-known declarative query language, SQL. The rewriting of SQL in algebraic terms allowed the efficient execution of the declarative queries through abstract syntax tree manipulations [28]. Our proposed xtLTLf (Section 3.2) takes inspiration from this historical precedent, in order to run conformance checking and temporal model mining queries over a relational representation of the log via relational tables (Section 3.1).
More recently, column-oriented DBMS such as MonetDB [35] proposed a new way to store data tables: instead of representing these per row, these were stored by column. There are several advantages to this approach, including better access to data when querying only a subset of columns (by eliminating the need to read columns that are not relevant) as well as discarding null-valued cells. This is achieved by representing each relation $\mathcal{R}(id, A_1, \dots, A_n)$ in the database schema as distinct binary relations $\mathcal{R}_{A_i}(id, A_i)$ for each attribute $A_i$ in $\mathcal{R}$. As this decomposition guarantees that the full-outer natural join $\bowtie_{1 \le i \le n} \mathcal{R}_{A_i}$ over the decomposed tables is equivalent to the initial relation $\mathcal{R}$, we can avoid representing NULL values in each single binary relation, thus limiting our space allocation to the values effectively present in the data. We therefore took inspiration from this intuition for representing the payload information, thus storing one single table per payload attribute. To further optimise the query engine, it is also possible to boost the query performance by guaranteeing that the results always have a fixed schema, mainly listing the record ids satisfying the query conditions [36]. As we will see while introducing our temporal operators (Section 3.2), we will also guarantee that each operator returns the output in the same schema, thus guaranteeing time and memory optimality.
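The decomposition argument can be checked on a toy relation. The sketch below is only an illustration in Python with hypothetical helper names (`decompose`, `full_outer_join`), representing NULL as `None`; it assumes every record carries at least one non-NULL attribute, since a record with only NULLs would leave no trace in any binary relation.

```python
def decompose(rows, attributes):
    """Split records {id, A1..An} into one binary relation per attribute,
    omitting NULL (None) cells altogether."""
    columns = {a: {} for a in attributes}
    for row in rows:
        for a in attributes:
            if row.get(a) is not None:
                columns[a][row["id"]] = row[a]
    return columns


def full_outer_join(columns):
    """Rebuild the wide relation by joining the binary relations on id;
    missing cells are filled back with None."""
    ids = sorted(set().union(*(col.keys() for col in columns.values())))
    return [{"id": i, **{a: col.get(i) for a, col in columns.items()}} for i in ids]


rows = [
    {"id": 1, "x": 5, "y": None},
    {"id": 2, "x": None, "y": "a"},
    {"id": 3, "x": 7, "y": "b"},
]
columns = decompose(rows, ["x", "y"])
```

Here `columns` stores four cells instead of six, and joining the two binary relations back returns the original rows.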
Finally, the nested relational model [37] extends the relational model by relaxing its first normal form (1NF), thus allowing table cells to contain tables and relations as values. Relaxing this 1NF allows for storing data in a hierarchical way in order to access an entire sub-tree with a single read operation. We will leverage this representation for our intermediate result representation, in order to associate multiple activation, target, or correlation conditions to one single event, thus including any relevant future event occurring after it.
Common Subquery Problem. Query caching mechanisms [38] are customary solutions for improving query runtime by holding partially-computed results in temporary tables referred to as materialised views, under the assumption that the queries sharing common data are pipelined [39]. Recently, Kechar et al. [9] proposed a novel approach that also applies when queries are run concurrently: it is sufficient to find the shared subqueries before actually running them so that, when they are run, their result is stored into materialised views, thus guaranteeing that these are computed at most once.
Example 4.
Figure 2 shows how this idea might be transferred to our use case scenario, which requires running multiple declarative clauses: Response is both a subquery of Succession and a distinct declarative clause of interest. Green arrows indicate operators' output shared among operators expressed in our proposed xtLTLf extension of LTLf. Please also observe that operators with the same name and arguments but marked either with activation, target, or no specification are considered different as they provide different results, and therefore are not merged together. This includes distinctions between timed and untimed operators, which will be discussed in greater detail in Section 3.2.
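The "compute each shared subquery at most once" idea can be sketched with a memoised bottom-up evaluator. The Python fragment below is a simplified illustration and not the cited approach nor KnoBAB's query planner: operator trees are hypothetical nested tuples, the set operations `and`/`or` are placeholders rather than xtLTLf semantics, and the activation/target mark is part of the leaf so that differently marked operators are never merged.

```python
def evaluate(expr, leaf_eval, cache):
    """Evaluate an operator tree bottom-up, reusing the cached result of
    every sub-expression that has already been computed."""
    if expr in cache:
        return cache[expr]
    op, *args = expr
    if op == "leaf":
        result = leaf_eval(*args)
    else:
        left, right = (evaluate(arg, leaf_eval, cache) for arg in args)
        result = left & right if op == "and" else left | right
    cache[expr] = result
    return result


calls = []  # records which leaves actually touched the (hypothetical) tables


def leaf_eval(label, mark):
    calls.append((label, mark))
    return {"a": {1, 2}, "b": {2, 3}}[label]  # toy sets of trace ids


shared = ("and", ("leaf", "a", "A"), ("leaf", "b", "T"))
larger = ("or", shared, ("leaf", "a", "T"))   # contains `shared` as a subquery

cache = {}
first = evaluate(shared, leaf_eval, cache)
second = evaluate(larger, leaf_eval, cache)
```

Evaluating `larger` after `shared` only accesses the one leaf that was not seen before; note that the leaf for `a` marked as target is evaluated separately from the one marked as activation.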
To further minimise the tables' access times, it is possible to take this reasoning to its extreme by minimising the data access per data predicate in order to avoid accessing the same table multiple times. In order to do so, we need to partition the data space according to the queries at our disposal as in our previous work [25]. This process can be eased if we assume that each payload condition $p$ and $p'$ for the declarative clauses within a model $\mathcal{M}$ is represented in Disjunctive Normal Form (DNF) [40]: in this scenario, data predicates $q$ are in DNF if they are a disjunction of one or more conjunctions of one or more data intervals referring to just one payload key.
Example 5.
The model illustrated in Figure 3a and discussed in former Example 1 comes with data conditions associated with neither activation nor target conditions. Therefore, no atomisation process is performed. Thus, each event in a log might just be distinguished by its activity label [25].
Given an LTLf expression $\varphi$ containing compound conditions, we denote $\mathcal{D}_\varphi$ as the set of distinct compound conditions in $\varphi$. We refer to the items in $\mathcal{D}_\varphi$ as atoms iff., for each pair of distinct compound conditions in it, they never jointly satisfy any possible payload $p$ (more formally, $\forall p.\ \forall a \in \Sigma.\ \forall a \wedge q, a \wedge q' \in \mathcal{D}_\varphi.\ (q \neq q') \Rightarrow (q(p) \Rightarrow \neg q'(p))$). Ref. [25] provides a procedure showing how any formula $\varphi$ can be rewritten into an equivalent one $\varphi'$ by ensuring that $\mathcal{D}_{\varphi'}$ contains atoms. This can be achieved by constructing $\mathcal{D}_{\varphi'}$ first from $\varphi$ (Algorithm 1), and then converting each compound condition in $\varphi$ into a disjunction of atoms in $\mathcal{D}_{\varphi'}$, thus obtaining $\varphi'$.
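Algorithm 1 Atomisation: $\mathcal{D}_\varphi$-encoding pipeline.
 1: global $\mu \leftarrow \{\}$; $ad \leftarrow \{\}$; $ak \leftarrow \{\}$

 2: procedure CollectIntervals($a$, DNF)   ▹ DNF $:= \bigvee_{1 \le i \le n} \bigwedge_{1 \le k \le m(i)} \textit{low}_{i,k} \le \kappa_{i,k} \le \textit{up}_{i,k}$
 3:     for all $\textit{conj} \in \text{DNF}$ and $\textit{low} \le \kappa \le \textit{up} \in \textit{conj}$ do
 4:         $\mu(a, \kappa).\text{put}([\textit{low}, \textit{up}])$
 5:     end for

 6: procedure CollectIntervals($\mathcal{M}$)   ▹ $\mathcal{M} := \bigwedge_{1 \le i \le |\mathcal{M}|} \textit{clause}_i(A, p, B, p')$
 7:     for all $\textit{clause}_i(A, p, B, p') \in \mathcal{M}$ do
 8:         if $p \neq \textbf{True}$ then CollectIntervals($A$, $p$)
 9:         if $p' \neq \textbf{True}$ then CollectIntervals($B$, $p'$)
10:     end for

11: procedure $\mathcal{D}_\varphi$-encoding( )
12:     for all $a \in \Sigma$ do
13:         for all $\kappa \in K$ do
14:             $\mu(a, \kappa) \leftarrow \text{SegmentTree}(\mu(a, \kappa))$
15:         end for
16:         for all $\textit{partition} \in \times_{\kappa \in K}\, \mu(a, \kappa).\text{elementaryIntervals}()$ do   ▹ $\textit{partition} := \langle \textit{low}_\kappa \le \kappa \le \textit{up}_\kappa \rangle_{\kappa \in K}$
17:             $p_i \leftarrow$ new atom()
18:             $p_i := a \wedge \textit{partition}$
19:             $ak(a).\text{put}(p_i)$
20:             for all $\textit{low}_\kappa \le \kappa \le \textit{up}_\kappa \in \textit{partition}$ do
21:                 $ad(a, \textit{low}_\kappa \le \kappa \le \textit{up}_\kappa).\text{put}(p_i)$
22:             end for
23:         end for
24:         if $ak(a) = \emptyset$ then
25:             $ak(a) \leftarrow \{a\}$
26:         end if
27:     end for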
We collect all the intervals from the conjunctions referring to the same payload key into a map $\mu(a, \kappa)$ (Line 4). After doing so, we can construct a Segment Tree [41] from the intervals in $\mu(a, \kappa)$, thus identifying the elementary intervals partitioning the collected intervals (Line 14). These elementary intervals also partition the payload data space associated with events for each activity label $a$. This can be achieved by combining each elementary interval in each dimension $\kappa$ for $a$ (Line 16) and then associating it with a new atom representing such a partition (Line 18), which is then guaranteed to be an atom by construction. This entails that each interval $\textit{low}_\kappa \le \kappa \le \textit{up}_\kappa$ will be characterised by the disjunction of all of the atoms $p_i$ comprising such an interval (Line 21). Given this, we can then associate to each activation condition $A$ that comes with an activation payload condition $p$ the disjunction of atoms that are collected by the following formula:
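$$\text{Atom}_{\mu, ad}(A, p) := \bigvee_{\textit{conj} \in p}\ \bigwedge_{\textit{low} \le \kappa \le \textit{up} \in \textit{conj}}\ \bigvee_{I \in \mu(A, \kappa).\text{findElementaryIntervals}(\textit{low}, \textit{up})} ad(A, I)$$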
If we assume that the dimension of $\mu(a, \kappa)$ for each $a \in \Sigma$ and $\kappa \in K$ is at most $m$, our implementation available at https://github.com/datagram-db/knobab/blob/main/include/yaucl/structures/query_interval_set/structures/segment_partition_tree.h (5 March 2023) builds such trees in $\sum_{1 \le i \le m} \log(i) + m \in O(m \cdot \log(m))$ time, as we first insert the intervals into the data structure and then we guarantee to minimise the tree representation, requiring a linear visit cost to the whole tree data structure. The time complexity of $\mathcal{D}_\varphi$-Encoding() is $m|K|(1 + \log m + |\Sigma|) \in O(m|K||\Sigma|)$.
Example 6.
Each distinct payload condition associated with either activation or target conditions in Figure 2 can be expressed as one single atom, as there are no overlapping data conditions associated with the same activity label, and each data condition can be mapped into one single elementary interval associated with an activity label. The next example provides another use case and a different model on the same dataset, leading to a decomposition of payload conditions into a disjunction of several atoms. Table 2 shows the partitioning of the data payloads associated with each activity label in the log by the elementary interval of interest.
Example 7.
Let us suppose that we want to return all the false negative and false positive Mastectomy cases that are not caused by data imputation errors. For this, we want to obtain all of the negative biopsies having CA15.3 levels at or above the guard level of 50 and positive biopsies having CA15.3 below the same threshold. Under the assumption that biopsy values were imputed through numerical values, thus leading to more imputation errors, we are ignoring cases where CA15.3 and biopsy values are out of scale, that is, we want to ignore the data where CA15.3 levels are negative or above 1000, and where the biopsy values are neither true ($1.0$) nor false ($0.0$). For this, we can outline the following model:
$$\mathcal{M} = \{\, \mathsf{Choice}(\mathsf{Mastectomy},\ \textit{biopsy} = 0.0 \wedge \textit{CA15.3} \ge 50,\ \mathsf{Mastectomy},\ \textit{biopsy} = 1.0 \wedge \textit{CA15.3} < 50),\ \mathsf{Absence}(\mathsf{Mastectomy},\ \textit{CA15.3} > 1000 \vee \textit{CA15.3} < 0),\ \mathsf{Absence}(\mathsf{Mastectomy},\ \textit{biopsy} \neq 1.0 \wedge \textit{biopsy} \neq 0.0) \,\}$$
This implies that we are interested in decomposing the intervals pertaining to both CA15.3 and biopsy into elementary intervals: Table 3a shows that only $\textit{CA15.3} < 50$ and $\textit{CA15.3} \ge 50$ are decomposed into two elementary intervals, as the former also includes the range $\textit{CA15.3} < 0$, while the latter also includes $\textit{CA15.3} > 1000$. Elementary intervals not occurring in the initially collected ones are not reported in this graphical representation. Table 3b shows the partitioning of the Mastectomy data payload induced by the elementary intervals of interest; the former data conditions can be now rewritten after Equation (2) in the Supplement as follows:
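1. $\text{Atom}_{\mu, ad}(\mathsf{Mastectomy},\ \textit{biopsy} = 0.0 \wedge \textit{CA15.3} \ge 50) = p_{12} \vee p_{17}$
2. $\text{Atom}_{\mu, ad}(\mathsf{Mastectomy},\ \textit{biopsy} = 1.0 \wedge \textit{CA15.3} < 50) = p_4 \vee p_9$
3. $\text{Atom}_{\mu, ad}(\mathsf{Mastectomy},\ \textit{CA15.3} > 1000 \vee \textit{CA15.3} < 0) = p_1 \vee \dots \vee p_5 \vee p_{16} \vee \dots \vee p_{20}$
4. $\text{Atom}_{\mu, ad}(\mathsf{Mastectomy},\ \textit{biopsy} \neq 0.0 \wedge \textit{biopsy} \neq 1.0) = p_1 \vee p_3 \vee p_5 \vee p_6 \vee p_8 \vee p_{10} \vee p_{11} \vee p_{13} \vee p_{15} \vee p_{16} \vee p_{18} \vee p_{20}$
where each atom is defined as a conjunction of compound conditions defined upon the previously collected elementary intervals. Some examples are then the following:
  • $p_1 := \textit{biopsy} < 0 \wedge \textit{CA15.3} < 0$
  • $p_2 := \textit{biopsy} = 0 \wedge \textit{CA15.3} < 0$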
This decomposition will enable us to reduce the data access time while scanning the tables efficiently.
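To make the notion of elementary intervals concrete, the following Python sketch recomputes the CA15.3 decomposition of Example 7. It is a simplification under loudly stated assumptions: values are integers, intervals are closed, the domain bounds `LO` and `HI` are hypothetical, and a plain sort over the interval endpoints stands in for the Segment Tree used by the actual implementation.

```python
def elementary_intervals(intervals, lo, hi):
    """Partition the integer domain [lo, hi] into the coarsest closed
    intervals such that every collected interval is a union of them."""
    cuts = {lo, hi + 1}                 # each cut marks where a new piece starts
    for low, up in intervals:
        cuts.add(max(low, lo))
        cuts.add(min(up, hi) + 1)
    points = sorted(cuts)
    return [(start, nxt - 1) for start, nxt in zip(points, points[1:])]


def find_elementary(elementary, low, up):
    """Elementary intervals entirely covered by the query interval [low, up]."""
    return [(s, e) for s, e in elementary if low <= s and e <= up]


LO, HI = -10, 1100                      # hypothetical bounds of the CA15.3 domain
collected = [
    (50, HI),     # CA15.3 >= 50
    (LO, 49),     # CA15.3 < 50
    (1001, HI),   # CA15.3 > 1000
    (LO, -1),     # CA15.3 < 0
]
elems = elementary_intervals(collected, LO, HI)
```

Under these assumptions `elems` contains four elementary intervals, and both $\textit{CA15.3} \ge 50$ and $\textit{CA15.3} < 50$ are covered by two of them each, in line with the decomposition described for Table 3a.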
Further Notation. We represent relational tables as a sequence of records indexed by id as per the physical relational model: given a relational table $T$, $T[i]$ represents the $i$-th record in $T$ counting from 1. We denote $f = [x \mapsto y, z \mapsto t]$ as a finite function such that $f(x) = y$ and $f(z) = t$. Table 4 collects the notation used throughout the paper.

3. Logical Model

Unlike our previous work [4], we provide a full definition of the (logical) model, thus describing the relational schema and how such tables are instantiated in order to fully represent the original log $\mathcal{L}$ (Section 3.1). This is a required preliminary step, as it provides the background needed to understand the definitions of the xtLTLf operators (Section 3.2). These operators, unlike the LTLf ones, are defined over the aforementioned model and assess the satisfiability of multiple traces loaded in such a model.
The discussion on how such tables are loaded and indexed is postponed to the description of the physical model (Section 4), and the different algorithms associated with the different operators are discussed in Section 6.

3.1. Model Definition

KnoBAB provides a tabular (i.e., relational) representation of the log L , in order to efficiently query it through tailored relational operators (xtLTLf). If the log does not contain data payloads, the entire log can be represented in two relational tables, CountingTable L (Activity,Trace,Count) and ActivityTable L (Activity,Trace,Event,Prev,Next). While the former can efficiently assess how many events in the same given trace share the same activity label, the latter allows a faithful reconstruction of the activity label associated with the traces. In particular, we use the former to assess whether a trace contains a given activity label at all. Such tables are then defined as follows:
Definition 1
(CountingTable). Given a log L , the CountingTable L (Activity,Trace,Count) counts for each trace in L how many times each activity label occurs. More formally:
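$$\mathsf{CountingTable}_{\mathcal{L}} = \left\{ \langle \beta(a),\ i,\ |\{\sigma^i_j \in \sigma^i \mid \sigma^i_j = \langle a, p \rangle\}| \rangle \;\middle|\; a \in \Sigma,\ \sigma^i \in \mathcal{L} \right\}$$
A record $\langle \beta(a), i, n \rangle$ states that the $i$-th trace from the log, $\sigma^i \in \mathcal{L}$, contains $n$ occurrences of $a$-labelled events, where the activity label has id $\beta(a)$.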
Definition 2
(ActivityTable). Given a log L , the ActivityTable(Activity,Trace,Event,Prev,Next) lists all of the possible events occurring in each log trace, where Prev (π) and Next (ϕ) are offsets pointing to the row representing the immediately preceding or following event in the trace if any. More formally:
$$\mathsf{ActivityTable}_{\mathcal{L}} = \left\{ \langle \beta(a), i, j, \pi, \phi \rangle \;\middle|\; a \in \Sigma,\ \sigma^i \in \mathcal{L},\ \sigma^i_j \in \sigma^i,\ \sigma^i_j = \langle a, p \rangle \right\}$$
A record $\langle \beta(a), i, j, \pi, \phi \rangle$ states that the $j$-th event of the $i$-th log trace ($\sigma^i_j \in \sigma^i$, $\sigma^i \in \mathcal{L}$) has an activity label $a$ and that its preceding and following events (if any) are respectively located on the $\pi$-th and $\phi$-th record of the same table. Each record of this table should also satisfy the following integrity constraints:
  • $(j = 1 \wedge \pi = \bot) \vee (\exists h, \pi', \phi'.\ \langle h, i, j - 1, \pi', \phi' \rangle = \mathsf{ActivityTable}_{\mathcal{L}}[\pi])$;
  • $(j = |\sigma^i| \wedge \phi = \bot) \vee (\exists h, \pi', \phi'.\ \langle h, i, j + 1, \pi', \phi' \rangle = \mathsf{ActivityTable}_{\mathcal{L}}[\phi])$
Please observe that Prev and Next are computed after bulk inserting while loading and indexing the data (see LoadingAndIndexing from Algorithm 2). If a log is associated with either trace or event payloads, we must store for each payload the values associated with keys $\kappa$ in an $\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}$(Activity,Value,Offset), where Offset points to the event described in the $\mathsf{ActivityTable}_{\mathcal{L}}$.
Definition 3
(AttributeTable). Given a log $\mathcal{L}$, for each attribute $\kappa \in K$ associated with at least one value in a payload, we define a table $\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}$(Activity,Value,Offset) associating each value to the pertaining event's payload as follows:
$$\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}} = \left\{ \langle \beta(a), p(\kappa), \pi \rangle \;\middle|\; \sigma^i \in \mathcal{L},\ \sigma^i_j \in \sigma^i,\ \sigma^i_j = \langle a, p \rangle,\ p(\kappa) \neq \bot \right\}$$
A record $\langle \beta(a), v, \pi \rangle$ states that the event $\sigma^i_j = \langle a, p \rangle$ stored in the ActivityTable at the $\pi$-th offset contains a payload $p$ associating $\kappa$ to a value $v$ ($p(\kappa) = v$).
Please observe that, similarly to the former table, the offset π is also computed while loading and indexing the data: this is discussed in greater detail in Section 4.2.2.
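Definitions 1 to 3 can be instantiated on a toy log as follows. This Python sketch is only meant to show the shape of the three tables; it is not the bulk-loading procedure of Algorithm 2. The function name `build_tables`, the use of `None` for a missing Prev/Next offset, and the alphabetical assignment of the ids $\beta(a)$ are assumptions made for the example.

```python
def build_tables(log):
    """log: list of traces, each a list of (activity_label, payload_dict).
    Returns beta, CountingTable, ActivityTable, and one AttributeTable per key.
    Trace ids, event ids, and offsets are all 1-based."""
    labels = sorted({a for trace in log for a, _ in trace})
    beta = {a: n for n, a in enumerate(labels, start=1)}

    # One counting record per (activity, trace) pair, zero counts included.
    counting = [(beta[a], i, sum(1 for x, _ in trace if x == a))
                for a in labels for i, trace in enumerate(log, start=1)]

    # ActivityTable records sorted by (activity, trace, event).
    keys = sorted((beta[a], i, j)
                  for i, trace in enumerate(log, start=1)
                  for j, (a, _) in enumerate(trace, start=1))
    offset = {(i, j): n for n, (_, i, j) in enumerate(keys, start=1)}
    activity = [(b, i, j, offset.get((i, j - 1)), offset.get((i, j + 1)))
                for b, i, j in keys]

    # One (Activity, Value, Offset) table per payload key.
    attribute = {}
    for i, trace in enumerate(log, start=1):
        for j, (a, payload) in enumerate(trace, start=1):
            for key, value in payload.items():
                attribute.setdefault(key, []).append((beta[a], value, offset[(i, j)]))
    for rows in attribute.values():
        rows.sort()
    return beta, counting, activity, attribute


log = [
    [("a", {"x": 1}), ("b", {}), ("a", {"x": 3})],   # trace 1
    [("b", {"x": 2})],                                # trace 2
]
beta, counting, activity, attribute = build_tables(log)
```

In the resulting ActivityTable the second event of the first trace is stored at offset 3, so the two `a`-labelled events of that trace point to it through their Next and Prev fields respectively.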
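Algorithm 2 Populating the Knowledge Base (Section 4.2)
 1: procedure BulkInsertion($\mathcal{L}$)
 2:     $\Sigma \leftarrow \emptyset$, $K \leftarrow \emptyset$
 3:     for all $\sigma^i \in \mathcal{L}$ do
 4:         for all $\sigma^i_j = \langle a, p \rangle \in \sigma^i$ do
 5:             $\Sigma \leftarrow \Sigma \cup \{a\}$
 6:             $\text{CountBulkMap}[\beta(a)][i] = \text{CountBulkMap}[\beta(a)][i] + 1$
 7:             $\text{ActToEventBulkVector}[\beta(a)].\text{put}(\langle i, j \rangle)$
 8:             $\text{TraceToEventBulkVector}[i][j] = j$
 9:             for all $\kappa \in \text{dom}(p)$ do
10:                 $K \leftarrow K \cup \{\kappa\}$
11:                 $\text{AttBulkMap}^{\kappa}[\beta(a)][p(\kappa)].\text{put}(\langle i, j \rangle)$
12:             end for
13:         end for
14:     end for

15: procedure LoadingAndIndexing($\mathcal{L}$)
16:     $\text{actTableOffset} \leftarrow 1$
17:     for all $\beta(a) \in \{1, \dots, |\Sigma|\}$ do
18:         $\mathsf{ActivityTable}_{\mathcal{L}}.\text{primary\_index}[\beta(a)] \leftarrow \text{actTableOffset}$
19:         for all $\sigma^i \in \mathcal{L}$ do
20:             $\mathsf{CountingTable}_{\mathcal{L}}.\text{load}(\langle \beta(a), i, \text{CountBulkMap}[\beta(a)][i] \rangle)$
21:         end for
22:         for all $\langle i, j \rangle \in \text{ActToEventBulkVector}[\beta(a)]$ do
23:             $\mathsf{ActivityTable}_{\mathcal{L}}.\text{load}(\langle \beta(a), i, j, \bot, \bot \rangle)$
24:             $\text{TraceToEventBulkVector}[i][j] = \text{actTableOffset}$
25:             $\text{actTableOffset} \leftarrow \text{actTableOffset} + 1$
26:         end for
27:     end for
28:     for all $\kappa \in K$ and $\beta(a) \in \{1, \dots, |\Sigma|\}$ do
29:         $\text{begin} \leftarrow |\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}|$, $\text{map} \leftarrow \{\}$
30:         for all $\langle \nu, \text{lst} \rangle \in \text{AttBulkMap}^{\kappa}[\beta(a)]$ and $\langle i, j \rangle \in \text{lst}$ do   ▹ $\sigma^i_j = \langle a, p \rangle$ with $\nu = p(\kappa)$
31:             $\text{offset} \leftarrow \text{TraceToEventBulkVector}[i][j]$
32:             $\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}.\text{load}(\langle \beta(a), \nu, \text{offset} \rangle)$
33:             $\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}.\text{secondary\_index}[\text{offset}] \leftarrow |\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}|$
34:         end for
35:         $\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}.\text{primary\_index}[\beta(a)] \leftarrow \langle \text{begin}, |\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}| \rangle$
36:     end for
37:     for all $\sigma^i \in \mathcal{L}$ and $\sigma^i_j \in \sigma^i$ do
38:         $\text{curr} \leftarrow \text{TraceToEventBulkVector}[i][j]$
39:         if $j = 1$ then
40:             $\mathsf{ActivityTable}_{\mathcal{L}}.\text{secondary\_index}[i] \leftarrow \langle \text{curr}, \text{TraceToEventBulkVector}[i][|\sigma^i|] \rangle$
41:         else
42:             $\mathsf{ActivityTable}_{\mathcal{L}}[\text{curr}](\text{Prev}) \leftarrow \text{TraceToEventBulkVector}[i][j - 1]$
43:         end if
44:         if $j < |\sigma^i|$ then
45:             $\mathsf{ActivityTable}_{\mathcal{L}}[\text{curr}](\text{Next}) \leftarrow \text{TraceToEventBulkVector}[i][j + 1]$
46:         end if
47:     end for

48: function ReconstructLog($\mathcal{L}$)
49:     $\mathcal{L}' \leftarrow \emptyset$
50:     for all $\langle i, \langle \text{begin}, \text{end} \rangle \rangle \in \mathsf{ActivityTable}_{\mathcal{L}}.\text{secondary\_index}$ do
51:         $\varsigma^i \leftarrow [\,]$; $j \leftarrow 1$
52:         repeat
53:             $r \leftarrow \mathsf{ActivityTable}_{\mathcal{L}}[\text{begin}]$
54:             $a \leftarrow \beta^{-1}(r(\text{Activity}))$
55:             $p \leftarrow \{\}$
56:             for all $\kappa \in K$ s.t. $\exists o.\ \langle \text{begin}, o \rangle \in \mathsf{AttributeTable}^{\kappa}.\text{secondary\_index}$ do
57:                 $p(\kappa) \leftarrow \mathsf{AttributeTable}^{\kappa}[o](\text{Value})$   ▹ $\mathsf{AttributeTable}^{\kappa}[o](\text{Offset}) = \text{begin}$
58:             end for
59:             $\varsigma^i_j \leftarrow \langle a, p \rangle$; $\varsigma^i.\text{put}(\varsigma^i_j)$
60:             $\text{begin} \leftarrow r(\text{Next})$; $j \leftarrow j + 1$
61:         until $\text{begin} = \bot$
62:         $\mathcal{L}'.\text{put}(\varsigma^i)$
63:     end for
64:     return $\mathcal{L}'$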
Example 8.
Figure 2 provides a graphical depiction of the tables storing our data. The records are also sorted by ascending order induced by the first three cells of each record, as required by our Physical Database Design (Section 4). For representation purposes, the first cell of each row shows the activity label a rather than its associated unique id β ( a ) .

3.2. eXTended LTLf Algebra (xt LTLf)

We extend the operators provided in our previous work [4] into more minimal ones, thus better describing the data access on the relational model. Furthermore, we provide a full formal characterisation for each of these operators via their access to the aforementioned relational tables. Please observe that, similarly to the relational algebra, each xt LTLf operator might come with different possible algorithms [42], which are discussed in Section 6.
Our operators, assessing the behaviour of non-empty traces, come in two flavours: timed and untimed. While the former are marked by a τ and return all of the traces’ events for which a given condition holds, the latter guarantee that such a condition will hold any time from the beginning of the trace. Furthermore, these operators assess the satisfiability of all the log traces simultaneously and not only one trace at a time as per LTLf.
Each xtLTLf operator returns a nested relational table $\rho$ with schema IntermediateResult(Trace, Event, MarkList(Mark)), implemented as an ordered set of triplets $\langle i, j, L \rangle$, where each triplet states that an event $\sigma^i_j$ from trace $\sigma^i$ satisfies a condition specified by the returning operator. If $L$ (MarkList(Mark)) is not empty, the current event $\sigma^i_j$ might have observed events $\sigma^i_k$ and $\sigma^i_h$ satisfying either an activation ($A(k) \in L$, $k \ge j$), a target ($T(k) \in L$, $k \ge j$), or a correlation condition ($M(h, k) \in L$, $k, h \ge j$). The nested relation $L$ is implemented as a vector ordered by mark type and referenced event id. $\rho$ is implemented as a vector and sorted by increasing Trace and Event id, as sorted vectors guarantee efficient intersection and union operations, as well as efficient event counting within the same trace through linear scanning. Binary operators associated with a non-True binary predicate $\Theta$ return matching/correlation conditions $M(h, k) \in L$ if at least one activation and one target condition were matched, depending on the definition of the operator. As we are going to see next, if the output comes from a base operator, as defined in the next section, $L$ might contain a single activation or target corresponding to the immediately returned event.

3.2.1. Base Operators

First, we discuss the base operators directly accessing the tables. These might have an associated marker specifying whether the event of interest is considered an activation (A) or a target (T) condition; if none is required, the mark can be omitted from the operator. The $\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)$ operator lists all of the events associated with an activity label $a$. As the $\mathsf{ActivityTable}_{\mathcal{L}}$ directly provides this information, this operator is defined as follows:
$$\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a) = \left\{ \langle i, j, \{A/T(j)\} \rangle \;\middle|\; \exists \pi, \phi.\ \langle \beta(a), i, j, \pi, \phi \rangle \in \mathsf{ActivityTable}_{\mathcal{L}} \right\}$$
We can also make similar considerations for a single elementary interval representable as an LTLf compound condition $a \wedge \textit{lower} \le \kappa \le \textit{upper}$, which can be run as a single range query over an $\mathsf{AttributeTable}^{\kappa}_{\mathcal{L}}$. As each of its records has an offset $\pi$ to the $\mathsf{ActivityTable}_{\mathcal{L}}$, this resolves the trace id and event id information required for the intermediate result. This operator can therefore be formalised as follows:
$$\mathsf{Compound}^{\mathcal{L},\tau}_{A/T}(a, \kappa, [\textit{lower}, \textit{upper}]) = \left\{ \langle i, j, \{A/T(j)\} \rangle \;\middle|\; \exists \pi, \pi', \phi, v.\ \textit{lower} \le v \le \textit{upper},\ \langle \beta(a), v, \pi \rangle \in \mathsf{AttributeTable}^{\kappa}_{\mathcal{L}},\ \mathsf{ActivityTable}_{\mathcal{L}}[\pi] = \langle \beta(a), i, j, \pi', \phi \rangle \right\}$$
If we want to list all of the initial (or terminal) events of a trace, we can directly access the ActivityTable and provide a linear scan over the number of the possible traces through its associated secondary index. If we are not interested in whether the trace starts with a specific activity label, then we can define the $\mathsf{First}^{\mathcal{L},\tau}_{A}$ (and $\mathsf{Last}^{\mathcal{L},\tau}_{A}$) operators as follows:
$$\mathsf{First}^{\mathcal{L},\tau}_{A} = \left\{ \langle i, 1, \{A(1)\} \rangle \;\middle|\; \exists a, \phi.\ \langle \beta(a), i, 1, \bot, \phi \rangle \in \mathsf{ActivityTable}_{\mathcal{L}} \right\}$$
$$\mathsf{Last}^{\mathcal{L},\tau}_{A} = \left\{ \langle i, |\sigma^i|, \{A(|\sigma^i|)\} \rangle \;\middle|\; \exists a, \pi.\ \langle \beta(a), i, |\sigma^i|, \pi, \bot \rangle \in \mathsf{ActivityTable}_{\mathcal{L}} \right\}$$
On the other hand, $\mathsf{Init}$ (and $\mathsf{Ends}$) are the specific refinements of the former operators if we are also interested in retrieving events with a specific activity label. These can be defined as follows:
$$\mathsf{Init}^{\mathcal{L}}_{A}(a) = \left\{ \langle i, 1, \{A(1)\} \rangle \;\middle|\; \exists \phi.\ \langle \beta(a), i, 1, \bot, \phi \rangle \in \mathsf{ActivityTable}_{\mathcal{L}} \right\}$$
$$\mathsf{Ends}^{\mathcal{L}}_{A}(a) = \left\{ \langle i, 1, \{A(|\sigma^i|)\} \rangle \;\middle|\; \exists \pi.\ \langle \beta(a), i, |\sigma^i|, \pi, \bot \rangle \in \mathsf{ActivityTable}_{\mathcal{L}} \right\}$$
Given a natural number $n$, $\mathsf{Exists}^{\mathcal{L}}_{A}(a, n)$ lists the traces containing at least $n$ events with an activity label $a$. As $\mathsf{Absence}^{\mathcal{L}}_{A}(a, n)$ is the substantial negation of the former, this lists the traces containing at most $n - 1$ events with an activity label $a$. Please observe that these operators directly provide the formal semantics for the homonymous Declare templates. As the CountingTable precisely contains the counting information required to solve this query, these operators can be formalised as follows for $n \in \mathbb{N}_{>0}$:
$$\mathsf{Exists}^{\mathcal{L}}_{A}(a, n) = \left\{ \langle i, 1, \{A(1)\} \rangle \;\middle|\; \exists m \ge n.\ \langle \beta(a), i, m \rangle \in \mathsf{CountingTable}_{\mathcal{L}} \right\}$$
$$\mathsf{Absence}^{\mathcal{L}}_{A}(a, n) = \left\{ \langle i, 1, \{A(1)\} \rangle \;\middle|\; \exists m < n.\ \langle \beta(a), i, m \rangle \in \mathsf{CountingTable}_{\mathcal{L}} \right\}$$
The following paragraph shows how these last two operators can be generalised for counting the salient event information returned by any sub-expression returning an operand ρ .
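The base operators can be read as plain filters over the three tables. The Python sketch below is a hedged illustration rather than KnoBAB's code: the tables are hypothetical literals for a two-trace log ⟨a, b, a⟩, ⟨b⟩ with $\beta(a) = 1$ and $\beta(b) = 2$, marks are encoded as tuples such as `("A", j)`, and linear scans replace the indexed range queries of the physical model.

```python
# Hypothetical tables; offsets are 1-based and None stands for a missing Prev/Next.
ACTIVITY_TABLE = [(1, 1, 1, None, 3), (1, 1, 3, 3, None),
                  (2, 1, 2, 1, 2), (2, 2, 1, None, None)]
COUNTING_TABLE = [(1, 1, 2), (1, 2, 0), (2, 1, 1), (2, 2, 1)]
ATTRIBUTE_TABLE_X = [(1, 1, 1), (1, 3, 2), (2, 2, 4)]   # payload key "x"


def activity(table, act, mark="A"):
    """Timed Activity: every event labelled `act`, marked with itself."""
    return sorted((i, j, ((mark, j),)) for b, i, j, _, _ in table if b == act)


def compound(attribute_table, activity_table, act, lower, upper, mark="A"):
    """Timed Compound: events labelled `act` whose value lies in [lower, upper];
    the Offset field resolves the trace and event ids."""
    out = []
    for b, value, offset in attribute_table:
        if b == act and lower <= value <= upper:
            _, i, j, _, _ = activity_table[offset - 1]
            out.append((i, j, ((mark, j),)))
    return sorted(out)


def exists(counting, act, n):
    """Traces with at least n events labelled `act`."""
    return sorted((i, 1, (("A", 1),)) for b, i, m in counting if b == act and m >= n)


def absence(counting, act, n):
    """Traces with fewer than n events labelled `act`."""
    return sorted((i, 1, (("A", 1),)) for b, i, m in counting if b == act and m < n)
```

For instance, `exists(COUNTING_TABLE, 1, 2)` keeps only the first trace, while `absence(COUNTING_TABLE, 1, 1)` keeps only the second, which contains no `a`-labelled event.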

3.2.2. Unary Operators

The unary xtLTLf operators come in two flavours: the first ones extend some of the former operators for compound conditions or atoms not necessarily associated with activity labels, while the second ones directly extend the unary operators from LTLf.
Base Operators' generalisations. We extend the definition of Init/Ends or Exists/Absence for any possible set of events of interest listed in an intermediate result $\rho$, not necessarily associated with the same activity label. We first define the Exists and Absence operators as such: instead of exploiting the counting table, we now actually need to count the events returned in $\rho$ for each trace and return an intermediate result triplet iff. they satisfy the counting condition. These can then be defined as follows for $n \in \mathbb{N}_{>0}$:
$$\mathsf{Exists}_n(\rho) = \left\{ \left\langle i, 1, \bigcup_{\langle i, j, L_j \rangle \in \rho} L_j \right\rangle \;\middle|\; n \le |\{\langle i, j, L \rangle \in \rho\}| \right\}$$
$$\mathsf{Absence}_n(\rho) = \left\{ \left\langle i, 1, \bigcup_{\langle i, j, L_j \rangle \in \rho} L_j \right\rangle \;\middle|\; n > |\{\langle i, j, L \rangle \in \rho\}| \right\}$$
Similarly, while the operators accessing the CountingTable (Exists/Absence) return the result by linearly scanning such a table, their generalised counterparts require scanning their operand $\rho$ as returned from a subexpression of choice, and then filtering the traces depending on how many events per trace were in $\rho$. As a consequence, we might exploit the base operators when we want to evaluate conditions only associated with activity labels, while we might need the generalised ones if we are interested in results associated with compound conditions whose evaluation is returned in $\rho$.
Finally, we refine Init and Ends for a given operand $\rho$, to keep only the events at the beginning or end of a given trace:
$$\mathsf{Init}(\rho) = \{ \langle i, j, L \rangle \in \rho \mid j = 1 \}$$
$$\mathsf{Ends}(\rho) = \{ \langle i, 1, L \rangle \mid \langle i, |\sigma^i|, L \rangle \in \rho \}$$
Further details on our intended notion of these operators’ generality if compared to the corresponding base operators can be found in Appendix A.1.
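The generalised operators can be sketched as scans over an intermediate result. In the Python fragment below, $\rho$ is a list of `(trace, event, marks)` triplets with marks encoded as tuples; `trace_ids` and `trace_len` are hypothetical extra arguments standing in for information that the actual system reads from its tables. This is an illustration under those assumptions, not the implementation discussed in Section 6.

```python
from itertools import groupby


def exists_n(rho, n):
    """Traces with at least n events in rho; marks are merged per trace."""
    out = []
    for i, group in groupby(sorted(rho), key=lambda row: row[0]):
        rows = list(group)
        if len(rows) >= n:
            marks = sorted({m for _, _, L in rows for m in L})
            out.append((i, 1, tuple(marks)))
    return out


def absence_n(rho, n, trace_ids):
    """Traces with fewer than n events in rho, including traces absent from rho."""
    counts = {i: 0 for i in trace_ids}
    marks = {i: set() for i in trace_ids}
    for i, _, L in rho:
        counts[i] += 1
        marks[i].update(L)
    return [(i, 1, tuple(sorted(marks[i]))) for i in sorted(trace_ids) if counts[i] < n]


def init(rho):
    """Keep only the triplets referring to the first event of a trace."""
    return [row for row in rho if row[1] == 1]


def ends(rho, trace_len):
    """Keep the triplets referring to the last event, reported on event 1."""
    return [(i, 1, L) for i, j, L in rho if j == trace_len[i]]


rho = [(1, 1, (("A", 1),)), (1, 3, (("A", 3),)), (2, 1, (("A", 1),))]
```

With three traces of which the third has no event in `rho`, `absence_n(rho, 2, [1, 2, 3])` returns the second and third trace, the latter with an empty mark list.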
LTLf extensions. The unary xtLTLf operators work differently from the corresponding ones in LTLf: while the latter compute the semantics from the first occurring operator appearing in the formula towards the leaves, the former assume to receive intermediate results from the leaves.
This structural difference also imposes an explicit distinction between timed and untimed operators. This is required as each operator is completely agnostic from the semantics associated with the upstream operator, and therefore the downstream operator has to combine the incoming intermediate results appropriately. This motivates why LTLf operators do not have to provide such an explicit distinction from their syntactical standpoint.
Such a premise motivates the counter-intuitive definition of the timed $\mathsf{Next}^{\tau}$ operator if compared to the homonymous operator in LTLf: as this needs to return the events for which the desired temporal constraints happen immediately after them, it needs to assume that the desired forthcoming temporal behaviour is the one received as an input $\rho$, for which all the events preceding the ones listed in $\rho$ are the ones of interest. As per the previous statement, it also follows that this operator shall never possess an equivalent untimed flavour. From these considerations, $\mathsf{Next}^{\tau}$ is formally defined as follows:
$$\mathsf{Next}^{\tau}(\rho) = \{ \langle i, j - 1, L \rangle \mid \langle i, j, L \rangle \in \rho,\ j > 1 \}$$
where L fulfils the role of preserving the information of the events satisfying an activation, target, or correlation condition independently from the event stated in the second component of the intermediate representation record. Therefore, i , j , L shall be interpreted as follows: σ j i witnesses the satisfaction of any activation, target, or correlation condition by the events collected in L.
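The definition above amounts to shifting every returned event one position back while leaving the marks untouched. A minimal Python sketch, using the same hypothetical triplet encoding `(trace, event, marks)` for $\rho$:

```python
def next_tau(rho):
    """Timed Next: the event preceding each event in rho witnesses its marks;
    first events have no predecessor and are dropped."""
    return [(i, j - 1, L) for i, j, L in rho if j > 1]


rho = [(1, 1, ()), (1, 3, (("A", 3),)), (2, 1, ())]
shifted = next_tau(rho)
```

Only the third event of the first trace survives, reported on the second event while its mark still refers to event 3.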
We now discuss the definition of “globally”. As per previous considerations, checking that all of the events in a trace satisfy a given condition corresponds to retrieving all of the events satisfying such a condition, for then counting if the length of the returned events corresponds to the trace length. Similarly, the timed version of the same operator shall test the same condition for each possible event and return the points in the trace after which the desired condition always happens in the future. These operators are therefore defined as follows:
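$$\mathsf{Globally}^{\tau}(\rho) = \left\{ \left\langle i, j, \bigcup_{j \le k \le |\sigma^i|,\ \langle i, k, L_k \rangle \in \rho} L_k \right\rangle \;\middle|\; \langle i, j, L_j \rangle \in \rho,\ |\sigma^i| - j + 1 = |\{\langle i, k, L_k \rangle \in \rho \mid j \le k \le |\sigma^i|\}| \right\}$$
$$\mathsf{Globally}(\rho) = \left\{ \left\langle i, 1, \bigcup_{\langle i, k, L_k \rangle \in \rho} L_k \right\rangle \;\middle|\; |\sigma^i| = |\{\langle i, k, L_k \rangle \in \rho\}| \right\}$$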
The operators expressing the eventuality that a condition shall happen in the future undergo similar considerations, with the only difference that these do not require to test that all of the trace events from a given point in time will satisfy a given condition, as it suffices that at least one event will satisfy it. The Future operator with its timed counterpart are then formally defined as follows:
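$$\mathsf{Future}^{\tau}(\rho) = \left\{ \left\langle i, j, \bigcup_{j \le k \le |\sigma^i|,\ \langle i, k, L_k \rangle \in \rho} L_k \right\rangle \;\middle|\; \exists h \ge j, L.\ \langle i, h, L \rangle \in \rho \right\}$$
$$\mathsf{Future}(\rho) = \left\{ \left\langle i, 1, \bigcup_{\langle i, k, L_k \rangle \in \rho} L_k \right\rangle \;\middle|\; \exists j, L.\ \langle i, j, L \rangle \in \rho \right\}$$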
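Both the Globally and Future operators can be evaluated with one backward scan per trace, accumulating the marks of the later events. The following Python sketch illustrates this reading of the definitions; it is one possible algorithm among those the paper defers to Section 6, with $\rho$ encoded as `(trace, event, marks)` triplets and `trace_len` a hypothetical map from trace id to $|\sigma^i|$.

```python
from collections import defaultdict


def _by_trace(rho):
    """Group the triplets of rho as {trace: {event: set_of_marks}}."""
    events = defaultdict(dict)
    for i, j, L in rho:
        events[i][j] = set(L)
    return events


def future_tau(rho):
    """Timed Future: every event j with some event h >= j in rho."""
    out = []
    for i, ev in sorted(_by_trace(rho).items()):
        marks, rows = set(), []
        for j in range(max(ev), 0, -1):       # backward scan collects marks of k >= j
            marks |= ev.get(j, set())
            rows.append((i, j, tuple(sorted(marks))))
        out.extend(reversed(rows))
    return out


def future(rho):
    """Untimed Future: traces having at least one event in rho."""
    return [(i, 1, tuple(sorted(set().union(*ev.values()))))
            for i, ev in sorted(_by_trace(rho).items())]


def globally_tau(rho, trace_len):
    """Timed Globally: events of rho from which all later events are in rho."""
    out = []
    for i, ev in sorted(_by_trace(rho).items()):
        marks, rows, j = set(), [], trace_len[i]
        while j >= 1 and j in ev:             # stop at the first gap from the end
            marks |= ev[j]
            rows.append((i, j, tuple(sorted(marks))))
            j -= 1
        out.extend(reversed(rows))
    return out


def globally(rho, trace_len):
    """Untimed Globally: traces whose events all appear in rho."""
    return [(i, 1, tuple(sorted(set().union(*ev.values()))))
            for i, ev in sorted(_by_trace(rho).items()) if len(ev) == trace_len[i]]


trace_len = {1: 3, 2: 2}
rho = [(1, 2, (("A", 2),)), (1, 3, (("A", 3),)), (2, 1, (("A", 1),))]
```

In this toy input the first trace satisfies the timed Globally from its second event onwards, whereas neither trace satisfies the untimed version because each misses at least one event in `rho`.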
Timed and untimed negations are implemented dissimilarly by design. While the timed negation returns all of the events that are in the log but which were not returned in the previous computation ρ , the untimed version returns the traces containing no events associated with the provided input. These operators are therefore defined as follows:
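$$\mathsf{Not}^{\tau}(\rho) = \left\{ \langle i, j, \emptyset \rangle \;\middle|\; (\nexists L.\ \langle i, j, L \rangle \in \rho) \wedge \exists \alpha, \pi, \phi.\ \langle \alpha, i, j, \pi, \phi \rangle \in \mathsf{ActivityTable}_{\mathcal{L}} \right\}$$
$$\mathsf{Not}(\rho) = \left\{ \langle i, 1, \emptyset \rangle \;\middle|\; (\nexists j, L.\ \langle i, j, L \rangle \in \rho) \wedge \exists \alpha, j, \pi, \phi.\ \langle \alpha, i, j, \pi, \phi \rangle \in \mathsf{ActivityTable}_{\mathcal{L}} \right\}$$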
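The two negations can be sketched as complements taken against the events (or traces) stored in the log. In the Python fragment below, the hypothetical map `trace_len` from trace id to $|\sigma^i|$ replaces the scan over the ActivityTable, and $\rho$ uses the same `(trace, event, marks)` encoding as before; both outputs carry empty mark lists, as in the definitions.

```python
def not_tau(rho, trace_len):
    """Timed negation: every log event that does not appear in rho."""
    present = {(i, j) for i, j, _ in rho}
    return [(i, j, ())
            for i in sorted(trace_len)
            for j in range(1, trace_len[i] + 1)
            if (i, j) not in present]


def not_untimed(rho, trace_len):
    """Untimed negation: every trace having no event at all in rho."""
    hit = {i for i, _, _ in rho}
    return [(i, 1, ()) for i in sorted(trace_len) if i not in hit]


trace_len = {1: 3, 2: 2}
rho = [(1, 2, (("A", 2),)), (1, 3, (("A", 3),)), (2, 1, (("A", 1),))]
```

Note that the two flavours are not interchangeable: here the timed negation returns two events, while the untimed one returns no trace, since each trace has at least one event in `rho`.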

3.2.3. Binary Operators

Differently from the LTLf binary operators, the xtLTLf binary operators are specifically tailored to express data correlation conditions $\Theta$ between activation and target payloads. This requires that one of the two operands, either $\rho$ or $\rho'$, returns activated events while the other provides targeted ones. Supplement I discusses the formal definition of predicates assessing whether an event $\langle i, j, L\rangle \in \rho$ matches another event $\langle i, j', L'\rangle \in \rho'$ on the basis of their matched and activated events in $L$ and $L'$. After this, we have the definition of our required binary operators.
The until operators work similarly to the other LTLf-derived unary operators. The timed until returns all of the events within the trace satisfying the until condition, expressed by returning all of the “activated” events $\sigma^i_j$ listed in the right operand (as they trivially satisfy the until condition) alongside all of the “targeted” events $\sigma^i_k$ from the left operand with $k < j$ at a distance $j - k + 1$ from the second operand’s event, while guaranteeing that all the events in $\sigma^i_k, \dots, \sigma^i_{j-1}$ appear in the first operand and satisfy the matching condition within this temporal window. The untimed version of this operator performs such considerations only from the beginning of the trace. These are defined as follows:
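$$\mathsf{Until}^{\tau}_{\Theta}(\rho_1, \rho_2) = \rho_2 \cup \left\{ \langle i, k, \tau\rangle \;\middle|\; \exists j > k.\ \langle i,j,L\rangle \in \rho_2,\ (\forall k \le h < j.\ \langle i,h,L'\rangle \in \rho_1),\ \tau := T^{A,i}_{\Theta}\!\left([k \mapsto L]_{k \le h < j},\ [h \mapsto L_h]_{k \le h < j,\ \langle i,h,L_h\rangle \in \rho_1}\right),\ \tau \ne \textbf{False} \right\}$$
$$\mathsf{Until}_{\Theta}(\rho_1, \rho_2) = \{\langle i,j,L\rangle \in \rho_2 \mid j = 1\} \cup \left\{ \langle i, 1, \tau\rangle \;\middle|\; \exists j > 1, L.\ \langle i,j,L\rangle \in \rho_2,\ (\forall 1 \le k < j.\ \langle i,k,L_k\rangle \in \rho_1),\ \tau := T^{A,i}_{\Theta}\!\left([k \mapsto L]_{1 \le k < j},\ [k \mapsto L_k]_{1 \le k < j,\ \langle i,k,L_k\rangle \in \rho_1}\right),\ \tau \ne \textbf{False} \right\}$$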
where $T^{A,i}_{\Theta}$ performs the correlation tests (see Supplement I for more details) and returns the set of the matches, if any; if no match is successful, it returns False. Differently from $\mathsf{Until}^{\tau}_{\Theta}$ and $\mathsf{Until}_{\Theta}$, the rest of the binary operators assume that they receive “activated” (or “targeted”) events from the left (right) operand. The timed conjunction states that a join condition effectively happens in a given event $\sigma^i_j$ if both operands return such an event and their associated activation and target conditions match. Thus, we only care for activation and target conditions at the same event $\sigma^i_j$. For its untimed counterpart, we state that a trace satisfies the conjunction of events if there exists at least one activation condition from the left operand matching a target from the right operand; this corresponds to coalescing the activation and target conditions on the first event while requiring that at least one of them occurs. These two operators can then be defined as follows:
$$\mathsf{And}^{\tau}_{\Theta}(\rho_1, \rho_2) = \left\{ \langle i, j, \tau\rangle \;\middle|\; \exists L_1, L_2.\ \langle i,j,L_1\rangle \in \rho_1,\ \langle i,j,L_2\rangle \in \rho_2,\ \tau := T^{E,i}_{\Theta}([j \mapsto L_1], [j \mapsto L_2]),\ \tau \ne \textbf{False} \right\}$$
$$\mathsf{And}_{\Theta}(\rho_1, \rho_2) = \left\{ \langle i, 1, \tau\rangle \;\middle|\; \exists j, j', L, L'.\ (\langle i,j,L\rangle \in \rho_1 \wedge \langle i,j',L'\rangle \in \rho_2),\ \tau := T^{E,i}_{\Theta}\!\left(\left[1 \mapsto \bigcup \{L_j \mid \langle i,j,L_j\rangle \in \rho_1\}\right], \left[1 \mapsto \bigcup \{L_j \mid \langle i,j,L_j\rangle \in \rho_2\}\right]\right),\ \tau \ne \textbf{False} \right\}$$
The disjunctive version of the timed conjunctive operator returns either the result of the conjunctive operator or the events that did not temporally match from each respective operator. The only difference with its untimed version is that the latter merges all potential activation or target conditions from either of the two operands:
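$$\mathsf{Or}^{\tau}_{\Theta}(\rho_1, \rho_2) = \mathsf{And}^{\tau}_{\Theta}(\rho_1, \rho_2) \cup \{\langle i,j,L\rangle \in \rho_1 \mid \nexists L'.\ \langle i,j,L'\rangle \in \rho_2\} \cup \{\langle i,j,L\rangle \in \rho_2 \mid \nexists L'.\ \langle i,j,L'\rangle \in \rho_1\}$$
$$\mathsf{Or}_{\Theta}(\rho_1, \rho_2) = \mathsf{And}_{\Theta}(\rho_1, \rho_2) \cup \left\{ \left\langle i, 1, \bigcup \{L \mid \exists j.\ \langle i,j,L\rangle \in \rho_1\} \right\rangle \;\middle|\; \nexists j, L'.\ \langle i,j,L'\rangle \in \rho_2 \right\} \cup \left\{ \left\langle i, 1, \bigcup \{L \mid \exists j.\ \langle i,j,L\rangle \in \rho_2\} \right\rangle \;\middle|\; \nexists j, L'.\ \langle i,j,L'\rangle \in \rho_1 \right\}$$
As we will see, the choice of characterizing Or with an $E^{i}_{\Theta}$ match while coalescing the activation and target conditions on the first trace event allows us to express the Choice template from Declare with one single operator while preserving its expected LTLf semantics.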

3.2.4. Derived Operators

Similarly to relational algebra, we can now fuse some frequently co-occurring operators into single derived operators, so as to improve the overall time complexity of executing the subqueries that appear most frequently in Declare.
Appendix A.3 will show that computing these operators is equivalent to computing their semantically equivalent xtLTLf expression containing multiple operators.
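$$\mathsf{AndFuture}^{\tau}_{\Theta}(\rho_1, \rho_2) = \left\{ \langle i, j, \tau\rangle \;\middle|\; \exists L.\ \langle i,j,L\rangle \in \rho_1,\ (\exists L', k \ge j.\ \langle i,k,L'\rangle \in \rho_2),\ \tau := T^{E,i}_{\Theta}\!\left([j \mapsto L], \left[j \mapsto \bigcup_{h \ge j,\ \langle i,h,L_h\rangle \in \rho_2} L_h\right]\right),\ \tau \ne \textbf{False} \right\}$$
$$\mathsf{AndGlobally}^{\tau}_{\Theta}(\rho_1, \rho_2) = \left\{ \langle i, j, \tau\rangle \;\middle|\; \exists L.\ \langle i,j,L\rangle \in \rho_1,\ (\forall |\sigma^i| \ge k \ge j.\ \exists L'.\ \langle i,k,L'\rangle \in \rho_2),\ \tau := T^{A,i}_{\Theta}\!\left([j \mapsto L], \left[j \mapsto \bigcup_{h \ge j,\ \langle i,h,L_h\rangle \in \rho_2} L_h\right]\right),\ \tau \ne \textbf{False} \right\}$$
To ease the readability of the pseudocode, we can also define an $\mathsf{Atom}^{\mathcal{L},\tau}_{A/T}(p_i)$ operator computing the conjunction of all of the compound conditions characterizing each atom:
$$\mathsf{Atom}^{\mathcal{L},\tau}_{A/T}(p_i) = \mathop{\mathsf{And}^{\tau}_{\textbf{True}}}_{\kappa \in K} \mathsf{Compound}^{\mathcal{L},\tau}_{A/T}(a, \kappa, [\textit{low}_\kappa, \textit{up}_\kappa]) \quad \text{s.t.}\quad p_i := a \wedge \bigwedge_{\kappa \in K} \textit{low}_\kappa \le \kappa \le \textit{up}_\kappa$$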
Properties of the xtLTLf Algebra. We furnish the previous definitions with some formal proofs, which, so as not to burden the reader, are postponed to Appendix A. We show that xtLTLf is as expressive as traditional LTLf, as each LTLf expression evaluated over a finite and non-empty trace $\sigma$ corresponds to an xtLTLf expression evaluated over the representation of such a trace within the proposed logical model; as the proofs of Lemmas A5 and A6 in Appendix A.2 are constructive, they show the translation process from LTLf formulæ to equivalent xtLTLf expressions.
Next, we also show that the timed and untimed operators correspond to the intended semantics: that is, for each timed operator having a corresponding untimed operator, if the former states that the timed formula is satisfied by the i-th trace starting from time j, then the sub-trace of i starting from time j satisfies the corresponding untimed formula. This shows the correctness of the untimed operators with respect to their timed definitions (Lemma A7).
In Appendix A.3, we show that the Declare template Choice can be fully implemented by exploiting an untimed Or operator (Corollary A1) while the latter still abides by the rules of LTLf semantics. We also motivate the need for the derived operators in terms of equivalence to the intended xtLTLf expressions (Lemmas A9 and A10) as well as in terms of improved computational complexity (Section 6.4) and run time (Section 7.1). The latter is discussed after describing the physical model in more detail alongside the algorithms associated with each operator, which are introduced in the following sections.

4. Physical Database Design

This section shows how the defined model (Section 3.1) is represented in primary memory in terms of indices and data structures (Section 4.1). We also illustrate the algorithm loading a log in such representation of choice (Section 4.2).

4.1. Primary Memory Data Structures

At the time of writing, KnoBAB is primarily an in-memory database. This is a common assumption in the conformance checking domain, where most of the log datasets are quite compact and fit comfortably in primary memory.
In order to be both memory and time efficient in our operations, the sub-record comprising the first three columns of both the $\textsf{CountingTable}_{\mathcal{L}}$ and the $\textsf{ActivityTable}_{\mathcal{L}}$ is fully stored in primary memory as an unsigned 64-bit integer, while Prev and Next are more efficiently stored as pointers to the table records rather than as offsets. After sorting the $\textsf{CountingTable}_{\mathcal{L}}$, we directly obtain the occurrence of each activity label $a$ within the log by accessing the records in the range $[\,|\mathcal{L}| \cdot (\beta(a) - 1) + 1,\ |\mathcal{L}| \cdot \beta(a)\,]$.
Indexing data structures, on the other hand, ease the access to the $\textsf{ActivityTable}_{\mathcal{L}}$, as different traces might have different lengths, and activity labels might be differently distributed among the traces. Therefore, we exploit a clustered and sparse primary index for determining which is the first event associated with a given activity label; as the traces in such a table are represented as a doubly linked list, its secondary index maps each trace-id to a block that, in turn, points to the head (first event of the trace) and the tail (last event of the trace) of such a doubly linked list.
The deduplication of trace and event payloads in distinct $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$ for each key $\kappa$ follows the prescriptions of the query and memory-efficient representation of columnar-based storages [35]. In our implementation, such tables are sorted in ascending order by their first three columns. Each $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$ is also associated with two indices: the clustered and sparse primary index maps each activity label’s id $\beta(a)$ to the records referring to values contained in $a$-labelled events, and a dense secondary index associates an $\textsf{ActivityTable}_{\mathcal{L}}$ record offset to an $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$ record offset if and only if the event described in $\textsf{ActivityTable}_{\mathcal{L}}$ has a payload containing a value associated with a key $\kappa$. While data range queries leverage the former, the latter is used for reconstructing the payload associated with a given event when identified by its offset in the $\textsf{ActivityTable}_{\mathcal{L}}$. Relevant use cases are the reconstruction of the event payload information while testing the $\Theta$ correlation condition, and the reconstruction of the original log from which the internal database was loaded. The ReconstructLog function in Algorithm 2 shows the computation of the latter.
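The offset arithmetic on the sorted counting table can be sketched in a few lines of Python. The function names are hypothetical; the second function additionally assumes that the table holds exactly one record per (activity label, trace) pair, sorted by activity id and then by trace id, which is what the range formula above suggests.

```python
def counting_table_range(num_traces: int, activity_id: int) -> tuple[int, int]:
    """1-based inclusive range of the records for the given activity id."""
    return (num_traces * (activity_id - 1) + 1, num_traces * activity_id)

def counting_table_offset(num_traces: int, activity_id: int, trace_id: int) -> int:
    """Offset of the record counting activity_id in trace_id (assumed layout)."""
    return num_traces * (activity_id - 1) + trace_id

# A log with three traces and an activity label with id 3.
first, last = counting_table_range(3, 3)
```

No search is needed: the range is computed in constant time from the label id and the number of traces.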
Example 9.
With reference to Figure 2, let us consider some events with activity label Mastectomy associated with a unique id $\beta(\textsf{Mastectomy}) = 3$. The offsets for accessing the records in the $\textsf{CountingTable}_{\mathcal{L}}$ defining the number of events per trace with such a label are $[3 \cdot (3 - 1) + 1,\ 3 \cdot 3] = [7, 9]$.
The $\textsf{ActivityTable}_{\mathcal{L}}$’s primary index allows the access to the first record within the table recording a Mastectomy event, i.e., $\textsf{ActivityTable}_{\mathcal{L}}.\textsf{primary\_index}[\beta(\textsf{Mastectomy})] = 7$; the index implicitly returns the last event associated with such an activity label by decreasing the offset of the following activity label by one, i.e., $\textsf{ActivityTable}_{\mathcal{L}}.\textsf{primary\_index}[\beta(\textsf{Mastectomy}) + 1] - 1 = 7$ (if the activity label is such that $\beta(a) = |\Sigma|$, the final offset to be considered corresponds to the $\textsf{ActivityTable}_{\mathcal{L}}$ size). As the two offsets coincide, there exists only one event throughout the whole log associated with such an activity label. We will exploit this mechanism for returning the events associated with $\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(\textsf{Mastectomy})$. As the seventh record of such a table refers to the third event of the first trace, $\mathsf{Activity}^{\mathcal{L},\tau}_{A}(\textsf{Mastectomy})$ will then return $\langle 1, 3, \{A(3)\}\rangle$.
Finally, we discuss how we can leverage $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$’s primary indices for returning results associated with an $\mathsf{Atom}^{\mathcal{L},\tau}_{A/T}$ operator. Let us consider atom $p_{12}$: we can see that this is associated with the elementary intervals $\textsf{Lumpectomy} \wedge \textsf{biopsy} = \textbf{true}$ and $\textsf{Lumpectomy} \wedge 50 \le \textsf{CA\_15.3} < +\infty$. By definition of the operator of interest, we then have that:
$$\mathsf{Atom}^{\mathcal{L},\tau}_{A}(p_{12}) = \mathsf{And}^{\tau}_{\textbf{True}}\left(\mathsf{Compound}^{\mathcal{L},\tau}_{A}(\textsf{Lumpectomy} \wedge \textsf{biopsy} = \textbf{true}),\ \mathsf{Compound}^{\mathcal{L},\tau}_{A}(\textsf{Lumpectomy} \wedge \textsf{CA\_15.3} \ge 50)\right)$$
The first Compound operator will access the primary index from $\textsf{AttributeTable}^{\textsf{biopsy}}_{\mathcal{L}}$ while the second one will access the one from $\textsf{AttributeTable}^{\textsf{CA\_15.3}}_{\mathcal{L}}$. Then, the primary index of each table maps each activity to the offsets of the first and last record: $\textsf{AttributeTable}^{\textsf{biopsy}}_{\mathcal{L}}.\textsf{primary\_index}[\beta(\textsf{Lumpectomy})] = \langle 2, 2\rangle$ and $\textsf{AttributeTable}^{\textsf{CA\_15.3}}_{\mathcal{L}}.\textsf{primary\_index}[\beta(\textsf{Lumpectomy})] = \langle 7, 7\rangle$. Then, within these returned record offsets, we perform range queries respectively looking for records satisfying $\textsf{biopsy} = \textbf{true}$ and $50 \le \textsf{CA\_15.3} < +\infty$. All of the $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$’s records satisfying these conditions point to the tenth record of the $\textsf{ActivityTable}_{\mathcal{L}}$, referring to the third event of the third trace. Therefore, $\mathsf{Atom}^{\mathcal{L},\tau}_{A}(p_{12})$ returns $\langle 3, 3, \{A(3)\}\rangle$.
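The way a clustered and sparse primary index yields the first and last record of an activity label, as used in Example 9, can be sketched as follows. The function name is hypothetical, and so are the index entries for every label other than Mastectomy (id 3) as well as the table size; they are made up for the illustration. The sketch also assumes that every label occurs at least once in the log.

```python
def activity_range(primary_index: dict[int, int], activity_id: int,
                   num_labels: int, table_size: int) -> tuple[int, int]:
    """1-based offsets of the first and last record with the given label id."""
    first = primary_index[activity_id]
    if activity_id == num_labels:
        last = table_size                      # last label: run to the table end
    else:
        last = primary_index[activity_id + 1] - 1
    return first, last

# Hypothetical index over four labels and a ten-record table.
primary_index = {1: 1, 2: 4, 3: 7, 4: 8}
mastectomy = activity_range(primary_index, 3, num_labels=4, table_size=10)
last_label = activity_range(primary_index, 4, num_labels=4, table_size=10)
```

Because the table is sorted by label id, the index only needs one entry per label rather than one entry per record.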

4.2. Populating the Database

We discuss two subsequent steps for loading a log in our proposed relational model: we preliminarily sort the data by activity label id, event id, and values (Section 4.2.1), and then load the sorted records in the tables while generating their primary and secondary indices (Section 4.2.2). These are computed in quasi-linear time with respect to the full log size.

4.2.1. Bulk Insertion

KnoBAB uses BulkInsertion to pre-load the tables’ data into an intermediate representation by pre-sorting it according to the ascending order induced by the first column of the tables of interest. Algorithm 2 shows the loading of the following three maps referring to the aforementioned tables. (i) CountBulkMap counts the occurrence of each activity label per trace, implying that the absence of a trace identifier for a given $\beta(a)$ value presupposes the absence of a given activity label $a$ within a trace; as the name suggests, we later use this to populate the CountingTable. (ii) The ActToEventBulkVector prepares the insertion of sorted data in $\textsf{ActivityTable}_{\mathcal{L}}$ by associating an activity label to each event and to the trace containing it. (iii) Similarly to the ActToEventBulkVector, the $\textsf{AttBulkMap}_{\kappa}$ associates to each key $\kappa$ the values $p(\kappa)$ for each event $\sigma^i_j$ with payload $p$ and activity label $a$, in order to prepare the insertion of sorted records in $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$. Please observe that, by construction, the set of pairs associated with each activity id $\beta(a)$ is already sorted by increasing trace and event id.
We also pre-allocate a TraceToEventBulkVector map (represented as a vector of vectors) which will later associate each trace event to the offset in the $\textsf{ActivityTable}_{\mathcal{L}}$ where such an event is stored. KnoBAB will later use this to calculate Prev and Next in the ActivityTable. After this, KnoBAB knows the number of the traces within the log $|\mathcal{L}|$, the length $|\sigma^j|$ of each trace $\sigma^j$, and the number of distinct activity labels $|\Sigma|$, as well as the unique id $\beta(a)$ associated with each $a \in \Sigma$. We can show that this procedure can be computed in quasi-linear time with respect to the full log size (Lemma S1).

4.2.2. Loading and Indexing

We continue our discussion with LoadingAndIndexing. First, we can iterate over the activity labels in ascending order of appearance (Line 17). All the tables including the $\textsf{CountingTable}_{\mathcal{L}}$ have activity ids $\beta(a)$ as their first cell: by further iterating by increasing trace id, we can immediately store the records in $\textsf{CountingTable}_{\mathcal{L}}$ in sorted order (Line 20).
Second, we start populating the $\textsf{ActivityTable}_{\mathcal{L}}$ (Line 23) where each record is associated with an increasing offset of the table (Line 25). We can populate its primary index in order to point at the record representing the first event of the first trace with the currently considered activity label. We store such information in the pre-allocated TraceToEventBulkVector (Line 24), in order to later set the currently null pointer (↑) Next and Prev fields.
Third, we start populating each $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$ for each key $\kappa \in K$ associated with at least one value in a payload: as per the previous discussion, each record associates the offset of event $\sigma^i_j = \langle a, p\rangle$ in the $\textsf{ActivityTable}_{\mathcal{L}}$ with a value $\nu = p(\kappa)$ and the activity label id $\beta(a)$ (Line 32). We also populate its secondary index by associating each event offset in the $\textsf{ActivityTable}_{\mathcal{L}}$ to the current position in the $\textsf{AttributeTable}^{\kappa}_{\mathcal{L}}$ (Line 33). The last iteration finally populates the $\textsf{ActivityTable}_{\mathcal{L}}$’s secondary index (Line 50) and sets the Next (Line 45) and Prev (Line 42) fields through the offsets stored in TraceToEventBulkVector. After this, the relational database is fully loaded in primary memory. The overall time complexity grows linearly with the size of the whole log representation (Lemma S2).
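The two loading steps can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 2: payloads and the attribute tables are omitted, Prev and Next are stored as 1-based offsets rather than pointers, activity ids are assigned by order of first appearance, and every name is hypothetical. Traces are assumed to be non-empty.

```python
from collections import defaultdict

def load_log(log):
    """log: list of non-empty traces, each a list of (label, payload) pairs."""
    # Bulk insertion: group (trace_id, event_id) pairs by activity id.
    beta, act_to_events = {}, defaultdict(list)
    count_bulk = defaultdict(lambda: defaultdict(int))
    for i, trace in enumerate(log, start=1):
        for j, (label, _payload) in enumerate(trace, start=1):
            b = beta.setdefault(label, len(beta) + 1)
            act_to_events[b].append((i, j))   # already sorted by (i, j)
            count_bulk[b][i] += 1

    # Loading and indexing: emit records in ascending activity id order.
    counting_table, activity_table, primary_index = [], [], {}
    trace_to_offset = [[None] * len(trace) for trace in log]
    for b in range(1, len(beta) + 1):
        for i in range(1, len(log) + 1):
            counting_table.append((b, i, count_bulk[b][i]))
        primary_index[b] = len(activity_table) + 1
        for i, j in act_to_events[b]:
            activity_table.append({"act": b, "trace": i, "event": j,
                                   "prev": None, "next": None})
            trace_to_offset[i - 1][j - 1] = len(activity_table)

    # Final pass: link each trace as a doubly linked list and index its ends.
    secondary_index = {}
    for i, offsets in enumerate(trace_to_offset, start=1):
        for pos, off in enumerate(offsets):
            record = activity_table[off - 1]
            record["prev"] = offsets[pos - 1] if pos > 0 else None
            record["next"] = offsets[pos + 1] if pos + 1 < len(offsets) else None
        secondary_index[i] = (offsets[0], offsets[-1])
    return counting_table, activity_table, primary_index, secondary_index

log = [[("a", {}), ("b", {}), ("a", {})], [("b", {})]]
counting, activity, primary, secondary = load_log(log)
```

In this toy log, the first trace is stored at offsets 1, 3, and 2 of the activity table, so its head and tail are offsets 1 and 2 respectively.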

5. Query Processing and Optimisation

This section shows how a declarative model $\mathcal{M}$ is compiled to a query plan consisting of xtLTLf operators (Section 5.1) so it can be run (Section 5.2) on top of the primary memory data described in the previous section.

5.1. Query Compiler

The conversion of a declarative model $\mathcal{M}$ into its corresponding xtLTLf query plan is structured into three main phases. First, the atomisation pipeline calls the preliminary $D_\varphi$-encoding from [25] for rewriting the data predicates appearing in each declarative clause as a disjunction of mutually exclusive atoms (Section 5.1.1). Second, we rewrite each Declare constraint as an xtLTLf formula from which we obtain a preliminary query plan represented as a Directed Acyclic Graph (DAG) (Section 5.1.2). Third, we compute the scheduling order for the operators’ execution over the DAG, thus preparing the query for a potentially parallel evaluation (Section 5.1.3).

5.1.1. Atomisation Pipeline

The atomisation pipeline (Algorithm 3) represents each activation and target condition as a set of disjunct atoms or activity labels. KnoBAB can be configured in two ways: to either fully represent each possible activation (or target) condition with activity label $a$ as a disjunction of atoms (or activity labels) if there exists at least one declarative clause where $a$ is also associated with a non-trivial payload condition (strategy=AtomizeEverything), or to restrict atomisation to data conditions appearing in a clause (strategy=AtomizeOnlyOnDataPredicates). Both can be set through the AtomisationPipeline procedure in Algorithm 3. The $D_\varphi$-encoding step guarantees that each activation or target condition will be associated with at least one atom or activity label. While the former approach will maximise the access to the $\textsf{AttributeTable}_{\mathcal{L}}$, the latter will maximise the access to the $\textsf{ActivityTable}_{\mathcal{L}}$. Correlation conditions do not undergo this rewriting step. We discuss the effects of each different strategy on the query runtime via empirical benchmarks in Section 7.4. We can show that this step has a polynomial complexity with respect to the model, key set, and element intervals’ maximum size (Lemma S3).
Algorithm 3 Atomisation Pipeline (Section 5.1.1)
 1: procedure AtomisationPipeline($\mathcal{M}$, strategy)
 2:     CollectIntervals($\mathcal{M}$)   ▹ See Algorithm 1
 3:     $D_\varphi$-encoding()   ▹ See Algorithm 1
 4:     for all $\textsf{clause}_l(A, p, B, p')$ where $\Theta$ $\in \mathcal{M}$ do
 5:         if $p$ = True and (strategy = AtomizeOnlyOnDataPredicates or $ak(A) = \{A\}$) then   ▹ Defining $S_A$ for $\textsf{clause}_l$
 6:             $\textsf{clause}_l.\textsf{left} \leftarrow \{A\}$
 7:         else
 8:             $\textsf{clause}_l.\textsf{left} \leftarrow ak(A) \cap \textsf{Atom}_{\mu,ad}(A, p)$   ▹ Equation (2)
 9:         end if
10:         if $p'$ = True and (strategy = AtomizeOnlyOnDataPredicates or $ak(B) = \{B\}$) then   ▹ Defining $S_T$ for $\textsf{clause}_l$
11:             $\textsf{clause}_l.\textsf{right} \leftarrow \{B\}$
12:         else
13:             $\textsf{clause}_l.\textsf{right} \leftarrow ak(B) \cap \textsf{Atom}_{\mu,ad}(B, p')$   ▹ Equation (2)
14:         end if
15:     end for
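The choice between the two atomisation strategies can be sketched for a single activation (or target) condition. The Python fragment below is a hedged illustration: `ak` is assumed to map each activity label to the atoms generated for it (or to the label itself when no atom exists), `atoms_satisfying` stands in for the atoms selected by Equation (2), and the set intersection combining them is an assumption of this sketch rather than a confirmed detail of the pipeline.

```python
def select_atoms(label, predicate_is_true, strategy, ak, atoms_satisfying):
    """Return the atoms or the activity label representing one condition."""
    no_atoms_for_label = ak[label] == {label}
    if predicate_is_true and (strategy == "AtomizeOnlyOnDataPredicates"
                              or no_atoms_for_label):
        return {label}                     # answered via the activity table
    # Assumption: keep the label's atoms that satisfy the data predicate.
    return ak[label] & atoms_satisfying

# Hypothetical encoding: label A was split into two atoms, label B into none.
ak = {"A": {"p1", "p2"}, "B": {"B"}}
only_data = select_atoms("A", True, "AtomizeOnlyOnDataPredicates", ak, {"p1", "p2"})
everything = select_atoms("A", True, "AtomizeEverything", ak, {"p1", "p2"})
with_predicate = select_atoms("A", False, "AtomizeOnlyOnDataPredicates", ak, {"p1"})
untouched = select_atoms("B", True, "AtomizeEverything", ak, {"B"})
```

With a trivially true predicate, the first strategy keeps the plain activity label, whereas the second one expands it into all of its atoms.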
Example 10.
With reference to Figure 3a, we might observe that, as no activation or target is ever associated with payload conditions, the atomisation pipeline will never express each activation or target condition as a disjunction of atoms, as no elementary interval is collected. Therefore, these will be only associated with activity labels.
Example 11.
With reference to Example 6, Figure 2 shows the atomised version of the declarative model, where each activation and target condition is associated, in this case, with just one atom.
Example 12.
Continuing with Example 7, where we discussed the outcome of the $D_\varphi$-encoding phase for a model $\mathcal{M}$ in Equation (3), we obtain the following atomisation:
$$\{\textsf{Choice}_1(\textsf{left} = \{p_{12}, p_{17}\},\ \textsf{right} = \{p_4, p_9\}),\ \textsf{Absence}_2(\textsf{left} = \{p_1, \dots, p_5, p_{16}, \dots, p_{20}\},\ n = 1),\ \textsf{Absence}_3(\textsf{left} = \{p_1, p_3, p_5, p_6, p_8, p_{10}, p_{11}, p_{13}, p_{15}, p_{16}, p_{18}, p_{20}\},\ n = 1)\}$$

5.1.2. Query Optimiser

The query optimiser consists of three steps: (i) loading the xtLTLf formulæ associated with each declarative clause at warm-up, (ii) exploiting the outcome of the Atomisation Pipeline to instantiate the xtLTLf formulæ, and (iii) coalescing the single xtLTLf formulæ into one compact abstract syntax DAG. Our query plan will not be represented as a tree, as we merge as many nodes computing the same result as possible, thus computing the same sub-expression at most once.
First, we load the translation map xtTemplates (Table 5) at warm-up through an external script providing the temporal semantics associated with the clauses of interest via partially-instantiated xtLTLf expressions. Such representation also supports negated activation or target conditions, thus avoiding the need to compute a Not operator stripping the information of either activation or target conditions. These are marked in the previous table via set complementation, ∁. At the time of writing, the scripts provide the xtLTLf semantics for Declare templates. Future investigations will express other temporal declarative languages such as [43] in xtLTLf, as well as other LTLf extensions including “past” operators [17,19].
Second, we exploit the aforementioned map to convert each declarative clause into its xtLTLf semantics $\psi$. If the clause is met for the first time, we proceed with its instantiation by recursively visiting $\psi$ until the leaves are reached: at this level, we potentially replace the activation and target placeholders with the associated set of atoms. Disjunctions of atoms and activity labels associated with leaf nodes as returned by the previous pipeline are minimised by ensuring that each shared $\mathsf{Or}_{\textbf{True}}$ computation across all of the atoms and activity labels is computed at most once. If an atom is met, we decompose it into its defining compound conditions (Line 14), thus guaranteeing that each compound condition is evaluated via $\mathsf{Compound}^{\mathcal{L},\tau}_{A/T}$ at most once across all of the atoms occurring in the xtLTLf formula when running the query (Section 5.2.1).
Third, we complete the process by coalescing shared disjunct sub-expressions via a map (queryCache) guaranteeing that all equivalent sub-expressions are replaced by just one instance of these. Finally, we associate each sub-expression referring to each clause to the final query operator representing the expression’s root (queryRoot), either representing an aggregation or a conjunctive query.
Example 13.
The model in Figure 3a, when compiled and associated with a conjunctive query, might produce the following xtLTLf expression:
$$\mathsf{And}_{\textbf{True}}\Big(\mathsf{Globally}\big(\mathsf{Or}^{\tau}\big(\mathsf{Not}^{\tau}(\mathsf{Activity}^{\mathcal{L},\tau}(\textsf{rec})),\ \mathsf{AndFuture}^{\tau}_{\textbf{True}}(\mathsf{Activity}^{\mathcal{L},\tau}_{A}(\textsf{rec}),\ \mathsf{Activity}^{\mathcal{L},\tau}_{T}(\textsf{weap}))\big)\big),\ \mathsf{And}_{\textbf{True}}\big(\mathsf{Absence}(\textsf{iiot\_sh}, 1),\ \mathsf{Or}_{\textbf{True}}(\mathsf{Activity}^{\mathcal{L}}(\textsf{comm}),\ \mathsf{Activity}^{\mathcal{L}}(\textsf{act}))\big)\Big)$$
We might observe that this expression cannot be further minimised, as there are neither shared atoms nor sub-expressions in common. Nor can this be achieved by rewriting $\mathsf{Not}^{\tau}(\mathsf{Activity}^{\mathcal{L},\tau}(\textsf{rec}))$ as $\mathsf{Or}^{\tau}_{\textbf{True}}$ over all $\mathsf{Activity}^{\mathcal{L},\tau}(a)$ with $a \in \Sigma$, $a \ne \textsf{rec}$, as the comm and act atoms associated with the choice clause are untimed, while the former rewriting only includes timed Activity operators. As these two different flavours of operators do not necessarily return the same result, these nodes are not merged.
Example 14.
With reference to Example 11 and Table 1, as the Response clause was associated with the same activation and target condition as Succession, the former is indeed a subquery of the latter. For this reason, these queries are fused together, thus guaranteeing that the result for Response is computed at most once. As the query root requires the computation of Max-SAT, this one is always going to be linked to the sub-expression representing an original declarative clause. Green arrows in Figure 2 indicate operators’ output shared among operators.
Example 15.
This last example shows the effect of the reduction of the number of shared timed union operators at the leaf level. By recalling the atomised model discussed in Example 12, we need to represent each set of atoms as a timed disjunction of Atom operators. While doing so, we observe that Choice and the first Absence condition share atoms $p_4$ and $p_{17}$, while the two Absence clauses share all the atoms in $\{p_1, p_3, p_5, p_{16}, p_{18}, p_{20}\}$. Not ensuring that the timed unions associated with these last elements are computed only once will result both in multiple data accesses to our relational tables and in a considerable increase in run time, as union operations are run twice. The detection and minimisation of this kind of shared sub-query cannot be merely computed through a simple caching mechanism, thus requiring a more sophisticated algorithm for determining the maximal common subset shared among all of the possible sets of atoms (and potentially activity labels).
Algorithm 4 provides additional details on the implementation of such an approach. Line 11 refers to the first phase and shows the point in the code where we associate each negated leaf with the complementary set of atoms appearing after the decomposition process. With respect to the second phase, Line 65 shows the rewriting of the Declare clause into an intermediate xtLTLf expression by recursively visiting it in each of its operands until the leaves are reached (Line 5). If during this visit we meet a binary operator marked as being the “tester” for the correlation condition, we associate to it the $\Theta$ coming from the declarative clause (Line 4); otherwise, the operator keeps the default True. Concerning the leaves, for unary clauses, we consider the sole activation condition, while for binary clauses, we might also consider target conditions. If the leaf node is associated with an $S_A$ (or $S_T$) containing more than one activity label or atom, we need to keep track of all of these while representing such a leaf as a disjunction of such atoms (Lines 18–25). Next, we optimize each disjunction of atoms and activity labels in order to minimize the number of shared union computations (Line 48); such optimisation is performed after fully visiting the xtLTLf expression, thus ensuring that each appearing disjunction is actually collected (Line 69).
Line 14 shows where we collect atoms representing compound conditions while guaranteeing that its associated $\mathsf{Compound}^{\mathcal{L},\tau}_{A/T}$ operator is computed only once, as well as decomposing it into its constituent compound conditions.
Finally, the method PutInCache extends the queryCache map by guaranteeing that each distinct disjunction of atoms is also represented at most once within the query plan.
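The effect of sharing equivalent sub-expressions can be illustrated with a generic hash-consing sketch in Python. This is a swapped-in, simplified technique and not a transcription of PutInCache: operators are modelled as immutable values compared structurally, and the names `Op` and `put_in_cache` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    """An operator node; frozen dataclasses compare and hash structurally."""
    name: str
    args: tuple = ()

def put_in_cache(expr: Op, cache: dict) -> Op:
    """Return the single shared instance that is structurally equal to expr."""
    if expr in cache:
        return cache[expr]
    shared = Op(expr.name, tuple(put_in_cache(arg, cache) for arg in expr.args))
    cache[shared] = shared
    return shared

# Two clauses: the second one contains a copy of the first as a sub-expression.
first = Op("And", (Op("x"), Op("y")))
second = Op("Or", (Op("And", (Op("x"), Op("y"))), Op("z")))

cache = {}
shared_first = put_in_cache(first, cache)
shared_second = put_in_cache(second, cache)
```

After both insertions the plan holds five distinct nodes instead of eight, and the `And` node is physically shared between the two clauses, which turns the syntax tree into a DAG.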
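Algorithm 4 Query Optimiser
 1: global $\textit{declare2xtLTLf} \leftarrow \{\}$; $\textsf{queryCache} \leftarrow \{\}$; $\textsf{collectUnions} \leftarrow \{\}$; $Q \leftarrow \{\}$; $\textsf{atomQ} \leftarrow \emptyset$
 2: global $\textsf{keyToLabelToSortedIntervals} \leftarrow \{\}$; $S_\Sigma \leftarrow \{\}$; $\textsf{Results} \leftarrow \{\}$

 3: function Instantiate($\psi$, $\Theta$, $S_A$, $S_T$)
 4:     if $\psi.\textsf{hasTheta}$ then $\psi.\textsf{theta} \leftarrow \Theta$
 5:     if $\psi.\textsf{arg} = \emptyset$ then   ▹ $\psi$ is a leaf
 6:         if $\psi.\textsf{isActivation}$ or $\psi.\textsf{isNeither}$ then
 7:             $\psi.\textsf{atom} \leftarrow S_A$
 8:         else if $\psi.\textsf{isTarget}$ then
 9:             $\psi.\textsf{atom} \leftarrow S_T$
10:         end if
11:         if $\psi.\textsf{negated}$ then $\psi.\textsf{atom} \leftarrow \complement\,\psi.\textsf{atom}$   ▹ Complementing the atoms from the universe set upon negation
12:         for all $\textsf{atom} \in \psi.\textsf{atom}$ do
13:             if $\textsf{atom} \in \bigcup_{a \in \Sigma} ak(a)$ then   ▹ The atom is generated from $D_\varphi$-encoding
14:                 RetrieveIntervals(atom)
15:             else $\textsf{atomQ}.\textsf{put}(\textsf{atom})$
16:             end if
17:         end for
18:         if $|\psi.\textsf{atom}| > 1$ then
19:             $\textsf{disj} \leftarrow \emptyset$
20:             for all $\textsf{atom} \in \psi.\textsf{atom}$ do
21:                 $\psi' \leftarrow$ new xtLTLf()
22:                 $\psi'.\textsf{atom} = \{\textsf{atom}\}$
23:                 $\textsf{disj}.\textsf{put}(\textsf{atom})$
24:             end for
25:             $\textsf{collectUnions}[\textsf{disj}].\textsf{put}(\psi)$
26:         else
27:         end if
28:     else
29:         for all arg $\in \psi$ do
30:             arg $\leftarrow$ Instantiate(arg, $\Theta$, $S_A$, $S_T$)
31:         end for
32:     end if

33: procedure CollectUnions()   ▹ DAG over the leaves undergoing union operations.
34:     for all $\langle \textsf{atomSet}, \psi\rangle \in \textsf{FinitarySetOperations}(\textsf{collectUnions}, \mathsf{Or}_{\textbf{True}})$ do   ▹ Algorithm S4
35:         for all $\psi' \in \textsf{collectUnions}[\textsf{atomSet}]$ do
36:             $\textsf{queryCache}[\psi'] \leftarrow \psi$
37:         end for
38:     end for

39: procedure PutInCache($\psi$)
40:     if $\exists \psi'.\ \langle \psi, \psi'\rangle \in \textsf{queryCache}$ then
41:         return $\psi'$
42:     else
43:         for all arg $\in \psi.\textsf{args}$ do
44:             arg $\leftarrow$ PutInCache(arg)
45:         end for
46:         $\psi' \leftarrow$ new xtLTLf()
47:         $\psi' \leftarrow \psi$
48:         $\textsf{queryCache}[\psi] \leftarrow \psi'$
49:         return $\psi'$
50:     end if

51: procedure RetrieveIntervals($p_i$)   ▹ $p_i := a \wedge \textsf{partition}$
52:     for all $\textit{low}_\kappa \le \kappa \le \textit{up}_\kappa \in \textsf{partition}$ do   ▹ $p_i = \bigwedge_{\kappa \in K} \textit{low}_\kappa \le \kappa \le \textit{up}_\kappa$
53:         if $\exists h.\ \langle \textit{low}_\kappa \le \kappa \le \textit{up}_\kappa, h\rangle \in \textsf{keyToLabelToSortedIntervals}[\kappa][a]$ then
54:             $S_\Sigma[p_i].\textsf{put}(h)$
55:         else
56:             $\textsf{Results}.\textsf{put}(\emptyset)$
57:             $S_\Sigma[p_i].\textsf{put}(|\textsf{Results}|)$
58:             $\textsf{keyToLabelToSortedIntervals}[\kappa][a].\textsf{put}(\langle \textit{low}_\kappa \le \kappa \le \textit{up}_\kappa, |\textsf{Results}|\rangle)$
59:         end if
60:     end for

61: function QueryOptimiser($\mathcal{M}$, queryRoot)
62:     for all $\textsf{clause}_l(A, p, B, q)$ where $\Theta$ $\in \mathcal{M}$ do
63:         if $\exists \psi : \text{xtLTLf}.\ \langle \textsf{clause}_l(A, p, B, q) \text{ where } \Theta, \psi\rangle \in \textit{declare2xtLTLf}$ then $Q.\textsf{push}(\psi)$
64:         else
65:             $\psi \leftarrow$ Instantiate(xtTemplates[$\textsf{clause}_l$], $\Theta$, $\textsf{clause}_l.\textsf{left}$, $\textsf{clause}_l.\textsf{right}$)
66:             $Q.\textsf{push}(\psi)$
67:         end if
68:     end for
69:     CollectUnions()
70:     queryRoot.args $\leftarrow \{\text{PutInCache}(\psi) \mid \psi \in Q\}$
71:     return queryRoot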
Example 16.
Figure 5 showcases the result of the application of such an algorithm while generating unique xtLTLf expressions. Such an algorithm also guarantees the non-repetition of single-leaf operators appearing in different clauses. Its upper box shows a query plan where common union operations are shared across sub-trees by representing each sub-tree at most once. These are actually represented in the query plan as opposed to the evaluation associated with the atoms, which is discussed in Supplement III.1.

5.1.3. Enabling Intraquery Parallelism

The query scheduler (Algorithm 5) takes as an input the query compiled in the previous phase and returns the scheduling order for achieving intraquery parallelism [42]. The previously generated expression should not be considered an abstract syntax tree but rather an abstract syntax Directed Acyclic Graph (DAG) rooted in the entry-point operator queryRoot, as we guarantee that sub-expressions appearing multiple times are replaced by unique instances of them.
Therefore, we can freely represent the query plan as a DAG $G$ in our pseudocode notation, where each root operator in $\psi$ is a single node while edges connect parent operators to the root operators of their children ($\psi.\textsf{args}$). Graph edges induce the execution order, where any ancestor node needs to be run after all of its immediate children. A reversed topological sort (Line 3) induces the order in which the operations should be run. To know which of these operators can be run concurrently (i.e., scheduled together [44]) as they share no interdependencies, we compute for each node its maximum distance from queryRoot (Line 6). This generates a layering [45] guaranteeing that all of the nodes at the same level share no mutual dependencies (Line 10). This enables the level-wise parallelisation of the tasks’ execution (also referred to as Intraquery Parallelism [42]), thus showing how such a problem can be reduced to an embarrassingly parallel problem by parallelising the computation of each operator in the same given layer. This procedure runs in linear time with respect to the number of operators appearing in the xtLTLf query plan (Lemma S4). We benchmark query plan parallelisation with different task scheduling policies in Section 7.3.
Algorithm 5 Query Scheduler (Section 5.1.3)
 1: function QueryScheduler($G$)
 2:     $\textsf{layer} \leftarrow \{\}$
 3:     $V \leftarrow \textit{Revert}(\textit{TopologicalSort}(G))$
 4:     for all $\psi \in V$ do
 5:         for all $\psi' \in \psi.\textsf{args}$ do
 6:             $\psi'.\textsf{distance} \leftarrow \max(\psi'.\textsf{distance},\ \psi.\textsf{distance} + 1)$
 7:         end for
 8:     end for
 9:     for all $\psi \in V$ do
10:         $\textsf{layer}[\psi.\textsf{distance}].\textsf{put}(\psi)$
11:     end for
12:     return layer
Example 17.
The DAG in Figure 2 depicts a query plan, where operators’ dependencies are suggested as arrows starting from the ancestors. The graph is also already represented as a layered graph, as all of the nodes having the same maximum distance from the query root are aligned horizontally. We might observe that none of the nodes within each layer shares dependencies.
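The layering described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not KnoBAB's implementation: the dictionary-based plan, the operator names, and the helper `execution_order` are hypothetical, and the sketch relaxes distances while visiting parents before their operands (Kahn's algorithm), which is one standard way of obtaining longest-path distances from the root.

```python
from collections import defaultdict, deque


def query_scheduler(args):
    """Group the operators of a query-plan DAG into layers.

    `args` maps every operator to the list of its operands (leaves map to
    an empty list). The result maps each distance to the operators whose
    longest path from the root has that length.
    """
    indegree = {node: 0 for node in args}
    for children in args.values():
        for child in children:
            indegree[child] += 1
    # Kahn's algorithm: a parent is always emitted before its operands.
    ready = deque(node for node, deg in indegree.items() if deg == 0)
    distance = {node: 0 for node in args}
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in args[node]:
            distance[child] = max(distance[child], distance[node] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(args):
        raise ValueError("the query plan contains a cycle")
    layers = defaultdict(list)
    for node in order:
        layers[distance[node]].append(node)
    return dict(layers)


def execution_order(layers):
    """Deepest layer first, so operands run before the operators using them."""
    return [layers[d] for d in sorted(layers, reverse=True)]


plan = {
    "root": ["and", "or"],
    "and": ["a", "b"],
    "or": ["b", "c"],   # the leaf "b" is shared by two operators
    "a": [], "b": [], "c": [],
}
layers = query_scheduler(plan)
```

Operators in the same entry of `execution_order(layers)` have no mutual dependencies, so each entry can be handed to a pool of workers as a batch.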

5.2. Execution Engine

The execution engine (Algorithm 6) runs the previously compiled query (Section 5.1) on top of the relational model populated from the XES log (Section 4.2). The computation starts from the leaves of the query DAG, which directly access the relational database (Section 5.2.1), and then propagates the results until the root of the DAG is reached (Section 5.2.2). At this point, we can perform the final conjunctive or aggregation queries (Section 5.2.3).
Algorithm 6 Execution Engine (Section 5.2)
  1: function ExecutionEngine(layer, L, A)
  2:     for all ψ ∈ atomQ (parallel) do ψ.result ← A(ψ)
  3:     Run D_φ-encodingAtoms(L)   ▹ Algorithm S5
  4:     for all ⟨distance, Ψ⟩ ∈ layer do
  5:         for all ψ ∈ Ψ (parallel) do
  6:             if ψ.atom = {p_i} and p_i ∈ ⋃_{a ∈ Σ} ak(a) then
  7:                 ψ.result ← A(Atom^{·}_{L,τ})(ψ)   ▹ Algorithm S5
  8:             else if ψ.atom = {a} and a ∈ Σ then
  9:                 continue   ▹ Already run in Line 2
 10:             else
 11:                 ψ.result ← A(ψ)({ψ′.result | ψ′ ∈ ψ.args})
 12:             end if
 13:         end for
 14:     end for
 15:     queryRoot ← layer[0]
 16:     return queryRoot.result
At each stage, we exploit a functor A associating each xtLTLf operator with an algorithm that takes the results of ψ's operands as input and returns, as an intermediate result ρ, the output expected by the formal definition. This abstraction separates the xtLTLf syntax from its multiple possible algorithmic implementations. Some algorithmic implementations for such operators are discussed in Section 6.
For this step, we do not discuss the computational complexity of evaluating the query plan, as this is heavily dominated by the computation of every single operator, the model of choice, and the log size. For this reason, we only conducted an empirical analysis by benchmarking the run time of the whole execution engine, where models contain either only Activity^{A/T}_{L,τ} operators (Section 7.2) or mainly Atom^{A/T}_{L,τ} ones (Section 7.5).
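The level-wise evaluation performed by the execution engine can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function name `run_layers`, the dictionary `implementations` standing in for the functor A, plain Python sets standing in for intermediate results, and a thread pool standing in for whichever scheduling policy is actually used.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers(layers, args, implementations, leaf_results):
    """Evaluate a layered query plan from the leaves up to the root.

    layers: {distance from root: [operators]}
    args: {operator: [operands]}
    implementations: {operator: callable taking the operands' results}
    leaf_results: results already computed for the leaf operators
    """
    results = dict(leaf_results)
    with ThreadPoolExecutor() as pool:
        for distance in sorted(layers, reverse=True):
            # Leaves were evaluated beforehand, so they are skipped here.
            pending = [op for op in layers[distance] if op not in results]

            def evaluate(op):
                operands = [results[child] for child in args[op]]
                return implementations[op](*operands)

            # Operators of one layer are independent: run them as a batch
            # and publish their results only once the batch has finished.
            values = list(pool.map(evaluate, pending))
            results.update(zip(pending, values))
    (root,) = layers[0]
    return results[root]


layers = {0: ["root"], 1: ["and", "or"], 2: ["a", "b", "c"]}
args = {"root": ["and", "or"], "and": ["a", "b"], "or": ["b", "c"]}
implementations = {
    "root": lambda x, y: x & y,
    "and": lambda x, y: x & y,
    "or": lambda x, y: x | y,
}
leaves = {"a": {1, 2}, "b": {2, 3}, "c": {3}}
answer = run_layers(layers, args, implementations, leaves)
```

Swapping the entries of `implementations` is enough to change the algorithm behind an operator without touching the plan, which mirrors the separation between syntax and implementation described above.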

5.2.1. Basic Operators’ Execution

Among all of the possible DAG leaves, we first (Line 2) execute the leaves that are either (i) directly associated with an activity label, or (ii) First and Last. For the former (i), each activity label a is run through its corresponding Activity^{A/T}_{L,τ}(a) operator, where A, T, or neither is set depending on whether the atom refers to an activation (ψ.isActivation) or target (ψ.isTarget) condition, or whether the associated result should be ignored as a whole (ψ.isNeither). For the latter (ii), we directly access the data tables and retrieve the data from them. As the tables are already sorted by trace and event id, no further post-processing is required besides the insertion of the activation or target label in the nested component L of the intermediate representation.
Next, we evaluate the intermediate result associated with each atom generated by the D_φ-encoding (Line 3). Intuitively, this requires three subsequent phases (please refer to Supplement III.1 for a more in-depth discussion with pseudocode). First, we obtain the compound conditions grouped by key and activity label as collected at query compile time, and we exploit them to pipeline multiple range queries over each AttributeTable^k_L. The associated results are cached. Second, we compute the results for each atom by intersecting the previously cached results before computing the actual Atom^{A/T}_{L,τ}. This also guarantees that shared intersections are run at most once across all of the previously cached results. Third, we exploit the former result to compute the Atom^{A/T}_{L,τ} operator at the leaf level of our DAG, while associating either an activation or a target mark in L depending on the prior definition of our leaf-level operator.

5.2.2. Results Propagation

After running the basic operators and their derived counterparts (e.g., Atom^{A/T}_{L,τ}), the only xtLTLf operators left for KnoBAB to run are those not accessing the relational tables. KnoBAB implements three different A-s, which share only the implementation of the aforementioned operators. The first either strictly abides by the formal definition, completely ignoring the fact that the intermediate results are provided as an ordered set of tuples, or otherwise provides slower algorithms overall. The second leverages the appropriate data representation, thus outperforming the former. The third implements hybrid algorithms that select the best-performing implementation depending on the data conditions. An in-depth discussion of how different operators might have different algorithmic implementations is deferred to Section 6.
While computing these, we associate a temporary primary-memory cache with each intermediate representation being computed (ψ.result). We can completely free an intermediate cache if we are not computing a Confidence query and its furthest ancestor has already accessed it, or if the cache is not associated with any activation required by Confidence.

5.2.3. Conjunctive and Aggregation Queries

The first version of KnoBAB supports the Conjunctive Query of the model as well as three aggregation queries: Max-SAT, Confidence, and Support. While the former requires a further untimed And_True among all the intermediate results associated with the computation of each clause, the aggregations just require an iteration over the provided results. The conjunctive query is formulated as follows:
\[ \textsc{ConjunctiveQuery}(\rho_1, \dots, \rho_n) = \mathrm{And}_{\mathrm{True}}(\rho_1, \dots\, \mathrm{And}_{\mathrm{True}}(\rho_{n-1}, \rho_n) \dots) \]
For each trace σ^i, Max-SAT calculates the ratio between the number of clauses c_l whose intermediate result ρ_l contains that trace and the total number of model clauses |M|. ActLeaves(ρ_l) is the untimed union of the intermediate results yielded by the activation conditions for the Declare clause c_l ∈ M. For c_l, the Confidence represents the ratio between the number of traces returned by ρ_l and the total number of traces that contain activation conditions. When the same numerator is instead divided by the total number of log traces, we have Support. Following the computation of each ρ_l per clause c_l, the aggregation functions can be expressed as follows:
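\[ \textsc{Max-SAT}(\rho_1, \dots, \rho_n) = \left( \frac{|\{\, l \mid \exists j, L.\ \langle i, j, L \rangle \in \rho_l \,\}|}{|M|} \right)_{\sigma^i \in \mathcal{L}} \]
\[ \textsc{Confidence}(\rho_1, \dots, \rho_n) = \left( \frac{|\{\, i \mid \exists j, L.\ \langle i, j, L \rangle \in \rho_l \,\}|}{|\mathrm{ActLeaves}(\rho_l)|} \right)_{c_l \in M} \]
\[ \textsc{Support}(\rho_1, \dots, \rho_n) = \left( \frac{|\{\, i \mid \exists j, L.\ \langle i, j, L \rangle \in \rho_l \,\}|}{|\mathcal{L}|} \right)_{c_l \in M} \]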
The execution of such queries is performed in a non-parallel way, as each aggregation query will appear at the top of the query plan, and this will be associated with the latest execution run of the scheduler (Line 15). We then return and prompt the result associated with the root node of our query plan (Line 16).
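The queries of this section can be mimicked with a few lines of Python. This is only a sketch under simplifying assumptions: each intermediate result is a list of `(trace id, event id, marks)` triples, the conjunctive query returns trace ids only (the marks merged by And_True are dropped), the `activated` argument stands in for ActLeaves, and a clause with no activated trace is given a Confidence of 0.0 by convention. All function names are hypothetical.

```python
def satisfying_traces(rho):
    """Trace ids appearing in an intermediate result."""
    return {i for (i, _j, _marks) in rho}


def conjunctive_query(results):
    """Traces satisfying every clause of the model."""
    per_clause = [satisfying_traces(rho) for rho in results]
    return set.intersection(*per_clause) if per_clause else set()


def max_sat(results, trace_ids):
    """Per trace: fraction of the model clauses that the trace satisfies."""
    per_clause = [satisfying_traces(rho) for rho in results]
    return {i: sum(i in s for s in per_clause) / len(per_clause)
            for i in trace_ids}


def support(results, n_traces):
    """Per clause: satisfying traces over all the traces in the log."""
    return [len(satisfying_traces(rho)) / n_traces for rho in results]


def confidence(results, activated):
    """Per clause: satisfying traces over the traces with an activation."""
    return [len(satisfying_traces(rho)) / len(act) if act else 0.0
            for rho, act in zip(results, activated)]


# Toy log with three traces and a model made of two clauses.
rho_first = [(1, 1, ()), (2, 1, ()), (3, 1, ())]   # satisfied by all traces
rho_second = [(1, 1, ())]                          # satisfied by trace 1 only
results = [rho_first, rho_second]
activated = [{1, 2, 3}, {1}]                       # activated traces per clause
```

In this toy setting, the second clause has low Support but full Confidence: its activation is rare, yet the clause holds whenever it is activated.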
Example 18.
As per previous discussions, the satisfaction of a model requires the satisfaction of all of its constituent clauses. The model described in the bottom table of Figure 6 is the result of further elaborating on the requirements from Example 1. This is only one of a myriad of possible solutions, which can either be manually defined (as here) or generated through mining/learning techniques. Such a model can now be used to compute the degree of satisfaction either of the model or of each trace, each requiring different metrics. An example of a trace-wise metric is Max-SAT, while Support and Confidence values can be computed per clause. By providing the trace metrics, we are able to analyse the scenarios with respect to the model, and therefore help provide insight into whether the software exhibits any backdoors. Conversely, providing model metrics allows us to establish the suitability of a model and its constituent clauses; for example, clauses with low Support but high Confidence may indicate a correlation between events. Finally, a conjunctive query returns all the traces satisfying all the model clauses. From Figure 6, it is evident that the only trace where a successful attack occurred is σ^1, as returned by the Conjunctive Query, which supports the suitability of our model. By exploiting the previous formulæ, we can compute the metrics reported in Table 6. These metrics may provide some insight into correlations between events. For example, clause Ⓑ had both Support and Confidence equal to 1.0, while clause Ⓒ had a Support of 1/3 and a Confidence of 1.0. This indicates that the activation of the latter occurred much less often than that of the former; however, every time the activation occurred, the clause was fulfilled. Conclusions such as these can help to identify any weaknesses/strengths within the model and the system itself (here, the metrics obtained from Ⓒ may suggest that comm/act contain a correlation that needs investigating).

6. Algorithmic Implementations

In this section, we show how the relational model and the proposed intermediate result representation enable the definition of different operators that boost query performance compared to an equivalent xtLTLf expression obtained through the straightforward translation procedure entailed by the lemmas in Appendix A.2 (LTLf-rewriting). Each subsection discusses different possible algorithms for implementing some operators, together with their pseudocode and computational complexity.

6.1. Timed and Untimed Or/And

Algorithm 7 shows the implementation of the timed And_Θ^τ (Line 27) and Or_Θ^τ (Line 28) operators, and then generalises this concept to the implementation of the untimed And_Θ. We omit the discussion of the implementation of the untimed Or_Θ operator for the sake of conciseness.
As we see from their formal definition, any binary xtLTLf operator supports Θ conditions. And (Or) resembles a sorted set intersection (union, Line 11), where we use both the trace (i) and event (j) id from the intermediate result triplet as a preliminary equality condition for the match. We also test a binary predicate Θ over the activated and targeted events in the third component (L). The event shared among the operands is returned if either Θ is always true (Line 7) or, from this point in time, there exists one future activated event (in an L coming from the left operand) as well as a targeted one (in an L coming from the right operand) satisfying the correlation (Line 4). The match is then represented as a marked correlation condition M(h, k), which is collected in the L associated with the returned event (Line 5).
For the untimed And_Θ operator, we are required to return one single trace i as ⟨i, 1, L⟩ if either Θ is true and each operand has an event from σ^i, or if there exists at least one event per operand from the same trace performing the match. This can be implemented in two different ways: we can either group the records by trace id (Lines 31 and 32) and then scan the intermediate results' records (Line 38) associated with the same trace id (Line 36, SlowUntimedAnd), or straightforwardly scan them by trace id without exploiting the preliminary aggregation (FastUntimedAnd). The latter implementation is possible because the intermediate results' records are already sorted, which allows the results to be aggregated while scanning them, with no need for any preliminary aggregation. We show in Corollary S1 that the fast version always outperforms its slower counterpart.
Algorithm 7 xtLTLf pseudocode implementation for And_Θ and Or_Θ operators
  1: function T_Θ^{E,i}(L′, L″)
  2:     L ← ∅; hasMatch ← (Θ = True)   ▹ (Explicitly) computing T_Θ^{E,i}
  3:     if Θ ≠ True and L′ ≠ ∅ and L″ ≠ ∅ then
  4:         for all A(m) ∈ L′ and T(n) ∈ L″ s.t. Θ(m, n) do
  5:             L ← L ∪ {M(m, n)}; hasMatch ← true
  6:         end for
  7:     else
  8:         L ← L ∪ L′ ∪ L″
  9:     end if
 10:     if hasMatch then return L else return False
 
 
 11: function TimedIntersection_Θ(ρ, ρ′, isUnion)
 12:     it ← Iterator(ρ), it′ ← Iterator(ρ′)
 13:     while it ≠ end and it′ ≠ end do
 14:         ⟨i, j, L⟩ ← current(it), ⟨i′, j′, L′⟩ ← current(it′)
 15:         if i = i′ and j = j′ then
 16:             tmp ← T_Θ^{E,i}(L, L′)
 17:             if tmp ≠ False then yield ⟨i, j, tmp⟩
 18:             next(it); next(it′);
 19:         else if i < i′ or (i = i′ and j < j′) then
 20:             if isUnion then yield ⟨i, j, L⟩ end if
 21:             next(it)
 22:         else
 23:             if isUnion then yield ⟨i′, j′, L′⟩ end if
 24:             next(it′)
 25:         end if
 26:     end while
 
 
 27: function And_Θ^τ(ρ, ρ′) = TimedIntersection_Θ(ρ, ρ′, false)
 
 
 28: function Or_Θ^τ(ρ, ρ′) = TimedIntersection_Θ(ρ, ρ′, true)
 
 
 29: function SlowUntimedAnd_Θ(ρ, ρ′)
 30:     leftOperand ← {}; rightOperand ← {}
 31:     for all ⟨i, j, L⟩ ∈ ρ do leftOperand[i].put(⟨i, j, L⟩)
 32:     for all ⟨i, j, L⟩ ∈ ρ′ do rightOperand[i].put(⟨i, j, L⟩)
 33:     it ← Iterator(leftOperand), it′ ← Iterator(rightOperand)
 34:     while it ≠ end and it′ ≠ end do
 35:         ⟨i, R⟩ ← current(it); ⟨i′, R′⟩ ← current(it′)
 36:         if i = i′ then
 37:             L ← ∅; hasMatch ← (Θ = True)
 38:             for all ⟨i, j, L′⟩ ∈ R and ⟨i, j′, L″⟩ ∈ R′ do
 39:                 tmp ← T_Θ^{E,i}(L′, L″)
 40:                 if tmp ≠ False then
 41:                     hasMatch ← true; L ← L ∪ tmp
 42:                 end if
 43:             end for
 44:             if hasMatch then yield ⟨i, 1, L⟩;
 45:         else if i < i′ then next(it)
 46:         else next(it′)
 47:         end if
 48:     end while
 
 
 49: function FastUntimedAnd_Θ(ρ, ρ′)
 50:     it ← Iterator(ρ), it′ ← Iterator(ρ′)
 51:     while it ≠ end and it′ ≠ end do
 52:         ⟨i, ι, λ⟩ ← current(it); ⟨i′, ι′, λ′⟩ ← current(it′)
 53:         if i = i′ then
 54:             L ← ∅; canOptimize ← false
 55:             it* ← it
 56:             while it* ≠ end do
 57:                 ⟨i, j, L′⟩ ← current(it*); it′* ← it′
 58:                 if not canOptimize then
 59:                     while it′* ≠ end do
 60:                         ⟨i′, j′, L″⟩ ← current(it′*)
 61:                         tmp ← T_Θ^{E,i}(L′, L″)
 62:                         if tmp ≠ False then
 63:                             hasMatch ← true; L ← L ∪ tmp
 64:                         end if
 65:                         next(it′*)
 66:                     end while
 67:                     if Θ = True then canOptimize ← true
 68:                 else L ← L ∪ L′
 69:                 end if
 70:                 next(it*)
 71:             end while
 72:             if hasMatch then yield ⟨i, 1, L⟩;
 73:             it ← it*; it′ ← it′*;
 74:         else if i < i′ then next(it)
 75:         else next(it′)
 76:         end if
 77:     end while
Similar considerations can be also applied for the untimed Or operation, for which we implemented equivalent SlowUntimedOr and FastUntimedOr, as we only need to pay an additional linear scan for the unmatched traces.
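The sorted-merge idea behind the timed operators and the single-pass untimed And can be sketched in Python. The sketch deliberately covers only the case where Θ is always true, so that marks are simply merged; records are `(trace id, event id, marks)` triples sorted by the first two components, marks are frozensets, and the function names are hypothetical.

```python
def timed_merge(left, right, is_union):
    """Timed And (is_union=False) or Or (is_union=True) for a trivial Θ."""
    out, a, b = [], 0, 0
    while a < len(left) and b < len(right):
        (i, j, marks_l), (i2, j2, marks_r) = left[a], right[b]
        if (i, j) == (i2, j2):           # same trace and same event
            out.append((i, j, marks_l | marks_r))
            a += 1
            b += 1
        elif (i, j) < (i2, j2):
            if is_union:
                out.append(left[a])
            a += 1
        else:
            if is_union:
                out.append(right[b])
            b += 1
    if is_union:                         # flush the unmatched tails
        out.extend(left[a:])
        out.extend(right[b:])
    return out


def fast_untimed_and(left, right):
    """One pass over both inputs: one (i, 1, marks) per shared trace."""
    out, a, b = [], 0, 0
    while a < len(left) and b < len(right):
        i, i2 = left[a][0], right[b][0]
        if i < i2:
            a += 1
        elif i2 < i:
            b += 1
        else:
            marks = set()
            while a < len(left) and left[a][0] == i:
                marks |= left[a][2]
                a += 1
            while b < len(right) and right[b][0] == i:
                marks |= right[b][2]
                b += 1
            out.append((i, 1, frozenset(marks)))
    return out


fs = frozenset
left = [(1, 1, fs({"A1"})), (1, 3, fs({"A3"})), (2, 2, fs({"A2"}))]
right = [(1, 3, fs({"T3"})), (2, 1, fs({"T1"})), (3, 1, fs({"T1"}))]
```

Because both inputs are already sorted, neither function groups records by trace beforehand; this is the aggregation step that the slower variant pays for.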

6.2. Choice and Untimed Or

We preface our analysis of derived operators by first discussing the difference in computational complexity between the straightforward translation from LTLf to xtLTLf and the exploitation of equivalent expression rewritings in xtLTLf. We remind the reader that the definition of Choice (see Table 1) states that either one condition or another should occur at any time in the trace.
This requirement can be interpreted in two distinct ways: by either returning separately all the traces satisfying the first condition and those satisfying the second and then merging them, or by collecting all of the events satisfying either the former or the latter condition while jointly scanning both operands, and then returning the traces where any one of them is met. Having observed that SlowUntimedOr is actually slower than FastUntimedOr (please also refer to the experiments in Section 7.1 for the empirical evidence of such theoretical claims) and that the latter actually implements the Choice declarative clause (Corollary A1), we conclude that the time complexity of computing the LTLf-rewriting of Choice is almost equivalent to the time complexity of FastUntimedOr, as we can have a constant asymptotic speed-up in the best-case scenario (Corollary S3). As the untimed Or_Θ already behaves as if computing a Future operator (Algorithm 8) on each of its operands, the computation of an additional Future operator for each operand becomes an avoidable overhead.
Algorithm 8 xtLTLf pseudocode implementation for Future and Globally
  1: function Future(ρ)   ▹ O(|L| ϵ^2)
  2:     for all ⟨i, j, L⟩ ∈ ρ do yield ⟨i, j, ⋃{L′ | ⟨i, j′, L′⟩ ∈ ρ and j′ ≥ j}⟩
  3:     end for
 
 
  4: function Globally(ρ)
  5:     for all ⟨i, j, L⟩ ∈ ρ do
  6:         E ← {j′ | ⟨i, j′, L′⟩ ∈ ρ and j′ ≥ j}
  7:         if |E| = t − j then yield ⟨i, j, ⋃{L′ | ⟨i, j′, L′⟩ ∈ ρ and j′ ∈ E}⟩ end if
  8:     end for
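As an illustration of how the sorted representation helps, the following Python sketch evaluates Future and Globally with a single backward pass over the records instead of the nested scans of the formal definition. This is an assumption of the sketch, not necessarily how KnoBAB implements these operators. Records are `(trace id, event id, marks)` triples sorted by trace and event id, event ids are assumed to start at 1, each `(trace id, event id)` pair is assumed to occur at most once, and `trace_len` is a hypothetical dictionary giving the length of each trace.

```python
def future(rho):
    """For each record, collect the marks of all later records of its trace."""
    out, acc, current = [], frozenset(), None
    for i, j, marks in reversed(rho):
        if i != current:                 # entering a new trace from its end
            current, acc = i, frozenset()
        acc = acc | marks
        out.append((i, j, acc))
    out.reverse()
    return out


def globally(rho, trace_len):
    """Keep (i, j, .) only if rho holds every event of trace i from j on."""
    out, acc, current = [], frozenset(), None
    expected, unbroken = None, False
    for i, j, marks in reversed(rho):
        if i != current:
            current, acc = i, frozenset()
            expected, unbroken = trace_len[i], True
        if unbroken and j == expected:   # still a gap-free suffix
            acc = acc | marks
            out.append((i, j, acc))
            expected -= 1
        else:
            unbroken = False             # a missing event breaks the suffix
    out.reverse()
    return out


fs = frozenset
rho = [(1, 2, fs("a")), (1, 3, fs("b")), (2, 1, fs("c")), (2, 3, fs("d"))]
lengths = {1: 3, 2: 3}
```

Here the record `(2, 1, ...)` survives Future but not Globally, because event 2 of trace 2 is missing from the input.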

6.3. Untimed Until(s)

We show how different data access policies for scanning the intermediate results affect the overall computational complexity as well as their associated run time. Algorithm 9 provides two possible variants for the untimed until:
All optimisations happen when the activation condition coming from the second operand does not occur at the beginning of a trace (Lines 34 and 61). In the first variant, we calculate, for all of the events in the first operand starting from the beginning of the trace (Line 29, and Line 51 for the second variant), the position of the last activated event preceding the current target condition with a logarithmic scan with respect to the length of the first operand (Line 34). On the other hand, the second variant directly discards the traces not starting with a target condition (Line 59) and, otherwise, it moves the scan of the first operand—from that initial position—by an offset equal to the distance from the event preceding activation (Line 61): if that position does not correspond to an activation condition preceding the current activation condition, then we completely discard the trace (Line 65). The matching conditions between activations and target are implemented similarly (Lines 37–40 and 67–69). Lemma S7 shows that the second variant is better asymptotically only for bigger datasets.
Algorithm 9 Two implementations for the untimed xtLTLf Until_Θ.
  1: function A_Θ^i(it′, bEnd, it, aEnd)
  2:     ⟨i, j, L⟩ ← current(it′); L″ ← ∅;
  3:     if Θ ≠ True and L ≠ ∅ then
  4:         for all A(k), M(k, k′) ∈ L do
  5:             aBeg ← it
  6:             while aBeg ≠ aEnd do
  7:                 ⟨i, j′, L′⟩ ← current(aBeg)
  8:                 if L′ = ∅ then L″ ← L″ ∪ L
  9:                 else
 10:                     anyMatch ← false
 11:                     for all T(h) ∈ L′ s.t. Θ(σ^i_k, σ^i_h) do anyMatch ← true; L″ ← L″ ∪ {M(k, h)}
 12:                     end for
 13:                     if not anyMatch then return False
 14:                 end if
 15:             end while
 16:         end for
 17:     else
 18:         while aBeg ≠ aEnd do
 19:             ⟨i, j′, L′⟩ ← current(aBeg++); L″ ← L″ ∪ L′
 20:         end while
 21:         L″ ← L″ ∪ L
 22:     end if
 23:     return L″
 
 
 24: function UntimedUntil_Θ^1(ρ, ρ′)
 25:     it ← Iterator(ρ), it′ ← Iterator(ρ′)
 26:     while it′ ≠ end do
 27:         ⟨i, j, L⟩ ← current(it′); bEnd ← UpperBound(ρ′, it′, end, ⟨i, |σ^i| + 1, ∅⟩)
 28:         it ← LowerBound(ρ, it, end, ⟨i, 1, ∅⟩)
 29:         atLeastOneResult ← false; L″ ← ∅
 30:         while it′ < bEnd do
 31:             if j = 1 then
 32:                 atLeastOneResult ← true; L″ ← L″ ∪ L; it′++
 33:             else
 34:                 aEnd ← UpperBound(ρ, it, end, ⟨i, j − 1, Ω⟩)
 35:                 if it = aEnd or Distance(aEnd − 1, it) + 1 ≠ j − 1 then break
 36:                 else   ▹ i = i′. Computing partial T_Θ^{A,i}
 37:                     tmp ← A_Θ^i(it′, bEnd, it, aEnd)
 38:                     atLeastOneResult ← atLeastOneResult or tmp ≠ False
 39:                     if tmp ≠ False then L″ ← L″ ∪ tmp;
 40:                     it′++
 41:                 end if
 42:             end if
 43:         end while
 44:         if atLeastOneResult then yield ⟨i, 1, L″⟩
 45:         it′ ← bEnd
 46:     end while
 
 
 47: function UntimedUntil_Θ^2(ρ, ρ′)
 48:     it ← Iterator(ρ), it′ ← Iterator(ρ′)
 49:     while it′ ≠ end do
 50:         ⟨i, j, L⟩ ← current(it′); bEnd ← UpperBound(ρ′, it′, end, ⟨i, |σ^i| + 1, ∅⟩)
 51:         it ← LowerBound(ρ, it, end, ⟨i, 1, ∅⟩)
 52:         atLeastOneResult ← false; L″ ← ∅
 53:         while it′ < bEnd do
 54:             if j = 1 then
 55:                 atLeastOneResult ← true; L″ ← L″ ∪ L; it′++
 56:             else if it = end then break
 57:             else
 58:                 ⟨i′, j′, L′⟩ ← current(it);
 59:                 if j′ > 1 then break
 60:                 else
 61:                     aEnd ← MoveForward(it, j − 1);   ▹ (it) + j − 1
 62:                     if aEnd = end then break
 63:                     else
 64:                         ⟨i_e, j_e, L_e⟩ ← current(aEnd)
 65:                         if i_e > i or j_e ≠ j − 1 then break
 66:                         else   ▹ i = i′ = i_e. Computing partial T_Θ^{A,i}
 67:                             tmp ← A_Θ^i(it′, bEnd, it, aEnd)
 68:                             atLeastOneResult ← atLeastOneResult or tmp ≠ False
 69:                             if tmp ≠ False then L″ ← L″ ∪ tmp;
 70:                             it′++
 71:                         end if
 72:                     end if
 73:                 end if
 74:             end if
 75:         end while
 76:         if atLeastOneResult then yield ⟨i, 1, L″⟩
 77:         it′ ← bEnd
 78:     end while
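The logarithmic scans that both variants rely on can be reproduced with a binary search over the sorted records. The sketch below uses Python's standard `bisect` module on a list of `(trace id, event id)` keys; the helper names are hypothetical, ids are assumed to be integers with event ids starting at 1, and the helpers are only loosely analogous to the LowerBound and UpperBound calls of the pseudocode.

```python
from bisect import bisect_left, bisect_right


def trace_range(keys, i):
    """Half-open index range [lo, hi) of the records belonging to trace i."""
    lo = bisect_left(keys, (i, 0))
    hi = bisect_left(keys, (i + 1, 0))
    return lo, hi


def end_of_prefix(keys, i, j):
    """Index one past the last record that is <= (trace i, event j)."""
    return bisect_right(keys, (i, j))


keys = [(1, 1), (1, 4), (2, 2), (2, 3), (2, 5), (4, 1)]
lo, hi = trace_range(keys, 2)
```

Each call costs a logarithmic number of comparisons in the size of the operand, whereas moving an iterator forward by a fixed offset, as the second variant does, avoids the search altogether.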

6.4. Derived Operators

Our previous observation for the untimed Or_Θ led us to define additional derived operators in the hope of easing the overall computational complexity. We followed the footsteps of relational algebra, where it is customary to merge multiple operators into one single new operator if the latter can be implemented through a more performant algorithm than computing the equivalent expression obtained from the straightforward translation of LTLf formulae into xtLTLf (LTLf rewriting).
For example, we can implement TimedAndFuture by extending the fast implementation of the timed And operator, and considering all of the trace events from the second operand succeeding the events from the first operand within the same trace. Similar considerations can be carried out with TimedAndGlobally, where in the former we need to count whether all of the events from the current time until the end of the trace are present in the rightmost operand, while in the latter we also need to skip the matched event from the rightmost operand and start scanning from the following ones.
For simplicity's sake, we defer the pseudocode of these operators, as well as the discussion of their computational complexity, to Supplement II.2, where we show that each of these two operators comes with two different algorithms, one of which always has a lower running time than the equivalent xtLTLf expression containing no derived operators. We can show formally that, while the first implementation (variant) works better for smaller datasets, the second works better for reasonably long traces when the number of traces is upper bounded by an exponential number of events (Corollary S2).
Table 7 shows the range of datasets used for benchmarking.

7. Results and Discussion

Our benchmarks were run on a Razer Blade Pro with Ubuntu 20.04: Intel Core i7-10875H CPU @ 2.30 GHz–5.10 GHz, 16 GB DDR4 2933 MHz RAM, 450 GB free disk space. All of the datasets used for benchmarking (synthetic data generation (Section 7.1), BPIC_2011 (Section 7.2 and Section 7.3), BPIC_2012 (Section 7.4 and Section 7.5), and our proposed cancer example (Section 1.1)) are publicly available (https://dx.doi.org/10.17605/OSF.IO/2CXR7). Table 7 summarises these datasets' features.

7.1. Comparing Different Operators’ Algorithms

We argue that representing the intermediate results as an ordered record set allows the exploitation of efficient algorithms through which we might avoid costly counting and aggregation operations [46]. In these comparisons, the operators fully assuming that the data are sorted greatly outperform naïve operators. Following the footsteps of relational algebra, we show that the so-called derived operators outperform the computation of an equivalent expression evaluated through either naïve or fast algorithms. The experiments are discussed in the order in which the algorithms were presented in the previous section.
To create a suitable testing environment, we synthetically generate data-less logs, where the trace and log lengths are increased 10-fold at a time from 10^1 to 10^4, with the resulting sets |L| ∈ {10, 100, 1000, 10,000} and ϵ ∈ {10, 100, 1000, 10,000}, and with the most extreme log consisting of 10^8 events. In some cases, we exceeded 16 GB of primary memory on the testing machine; in the following results (Figure 7, Figure 8, Figure 9 and Figure 10), M+ denotes an out-of-memory exception. We chose to generate our data in place of using existing real-world logs (https://dx.doi.org/10.17605/OSF.IO/2CXR7), as the controlled scenario allows for identifying the location and extent of any possible speed-ups. These data were up-sampled, guaranteeing that a given log configuration was always a subset of the larger ones. The data generation randomly assigned events from the universal alphabet (Σ = {A, B, C, D, E}), up to the maximum length for the set in consideration, and we stored the resulting logs as tab-separated files.
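A generator with the nesting property described above might look like the following Python sketch. It is not the script used for the benchmarks: the function names, the fixed seed, and the choice of nesting only across log sizes (for a fixed trace length) are assumptions made for illustration.

```python
import random


def generate_nested_logs(sizes, trace_len, alphabet="ABCDE", seed=0):
    """Return {log size: log}, where every smaller log is a prefix, and
    therefore a subset, of every larger one."""
    rng = random.Random(seed)            # seeded for reproducibility
    largest = [[rng.choice(alphabet) for _ in range(trace_len)]
               for _ in range(max(sizes))]
    return {n: largest[:n] for n in sorted(sizes)}


def to_tab_separated(log):
    """Serialise a log with one trace per line and tab-separated events."""
    return "\n".join("\t".join(trace) for trace in log)


logs = generate_nested_logs([2, 4, 8], trace_len=5)
text = to_tab_separated(logs[2])
```

Generating the largest log once and slicing it keeps the smaller configurations comparable, since any difference in running time then comes from the added traces alone.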
Our operators consider correlations between timed events A and B, where the computed speed-up is per operator. Given this, we denote ρ_1 = Activity^A_{L,τ}(A) and ρ_2 = Activity^T_{L,τ}(B), both computed prior to benchmarking, and we ignore the time required for accessing the data in the knowledge base, as the focus of the present benchmarks is solely on the operators. Details of how the custom clauses/derived operators are run are given in Table 8, while singular operators are run sequentially.
Untimed Or/And. The first group of experiments compares different possible algorithms for the same xtLTLf operators, And_True and Or_True, as discussed in Section 6.1. The outcome of such experiments is given in Figure 7: in every case, the Fast- operators are more performant than their logical counterparts. Our benchmark confirms the overhead incurred by the Slow- implementation, which grows linearly with log size and almost polynomially with trace length. This aggregation is upper bounded by a quadratic cost with respect to the trace length ϵ (Lines 31 and 32); in the most extreme case (ϵ = 10^4), the cost is over one order of magnitude higher than that of the algorithm without aggregation. From now on, we always use our Fast- operators in place of the Slow- equivalents for representing non-derived xtLTLf operators, as the latter suffer the cost caused by the preliminary aggregation, as per the previous experiments.
Choice and Untimed Or. The next set of experiments evaluates the customary declarative clause implementation, where we hypothesise that reformulating the semantics associated with Choice provides performance gains owing to the absence of the preliminary aggregations performed via the untimed Future operator. In fact, the proposed optimisation derives from the omission of the Future operators for ρ_1 and ρ_2, which formally comply with the logical definition. For the untimed Future operator (Section 3.2.2), bounded scans can be exploited, as the data are sorted with respect to trace id, and all the events that satisfy ρ for the current trace id are included in the result. Therefore, we expect an overhead that grows linearly with log size. Figure 8 shows that, in the best case (ϵ = 10), we gain approximately 0.5 orders of magnitude in performance. The findings affirm that log size has a greater influence on computational overhead than trace length. For ϵ ≥ 10^3, the overhead resulting from the Future operators steadily increases as both the trace length ϵ and the log size |L| grow, although this is negligible on the logarithmic scale.
Untimed Until(s). Benchmarks from Figure 9 show that the first variant is almost always more performant than the second one for considerably short traces, while the latter becomes more efficient when ϵ increases. With significant increases in log size, the latter becomes more performant; when |L| = 10^4, all cases show improved running times, regardless of ϵ. The plots also show that the operator's running time is polynomial with respect to the number of traces in the log, as a consequence of the increased scans within every single trace.
Derived Operators. The final set of experiments tests whether the newly proposed derived operators perform better than their LTLf-rewriting counterparts (Table 8). For example, TimedAndGlobally can be optimised with a custom algorithm in which one single operator replaces the execution of multiple pipelined operators. Computations from LTLf rewriting demonstrate worse performance than the derived counterparts across all operators; in the most extreme case, TimedAndGlobally, there is over a 10^1.5 speed-up for ϵ = 10^4. We were able to conclude that different approaches to the internal data storage of the optimised algorithm may provide better results depending on the log size. As for UntimedUntil, we provide two implementations for TimedAndGlobally and TimedAndFuture, VARIANT-1 (Algorithm S1) and VARIANT-2 (Algorithm S2), with the latter exploiting bounded reversed scans on the data.
TimedAndGlobally: by merging the And join operation with Globally, we only consider elements within the same trace after the first operand. The logical implementation performs these operations separately, and so cannot reap the benefits of a merged join [47]. Figure 10 shows that, in most cases, there is a linear performance gain with log size. VARIANT-2 aims to exploit potential gains from a reversed scan of a trace, while VARIANT-1 performs a forward scan for every activation. By performing a reverse scan, VARIANT-2 is able to prune further events from any activations happening in the past, as the condition did not hold for the current time. For smaller trace lengths (ϵ ≤ 10^1), VARIANT-1 demonstrates better performance than VARIANT-2. With increased trace length, the latter outperforms the former, sometimes by over an order of magnitude (ϵ = 10^4). In some cases, VARIANT-1 is slower than its LTLf-rewriting counterpart (ϵ ≥ 10^3).
TimedAndFuture: the principal optimisation gains from this operator follow the same reasoning as TimedAndGlobally; however, the implementations of the variants follow a unique approach. By exploiting the allocation of intermediate data structures in reverse, VARIANT-2 also provides improved performance for larger |L|. As with TimedAndGlobally, VARIANT-1 outperforms VARIANT-2 for smaller trace lengths.
We conclude that VARIANT-1 (VARIANT-2) of TimedAndFuture and TimedAndGlobally performs best for small (large) trace lengths. In addition, the first variant of Until proves to be more performant than our second variant for smaller log sizes. We therefore design a mechanism for always running the fastest algorithm under the previously observed circumstances. To do so, we calculate the average trace length and the log size at data loading time (this only needs to happen once per log). Then, at query time, the best operator is chosen based on these values. We define a HYBRID TRACE QUERY THRESHOLD γ of 10^2/2 (Lines 5 and 9) and a HYBRID LOG QUERY THRESHOLD η of 10^3/2 (Line 1); values exceeding these thresholds will execute the operators more tailored towards large trace (log) sizes. The pseudocode provided as Algorithm 10 demonstrates how two different variants can be wrapped into one single parametric algorithm.
Algorithm 10 Hybrid Algorithms
  1: function $\textsc{HybridUntimedUntil}^{\eta}_{\Theta}(\rho, \rho')$
  2:     if $|\mathcal{L}| > \eta$ then return $\textsc{UntimedUntil}^{2}_{\Theta}(\rho, \rho')$    ▹ Algorithm 9
  3:     else return $\textsc{UntimedUntil}^{1}_{\Theta}(\rho, \rho')$    ▹ Algorithm 9
  4:     end if

  5: function $\textsc{HybridAndFuture}^{\gamma}_{\Theta}(\rho, \rho')$
  6:     if $\epsilon > \gamma$ then return $\textsc{AndFuture}^{2}_{\Theta}(\rho, \rho')$    ▹ Algorithm S1
  7:     else return $\textsc{AndFuture}^{1}_{\Theta}(\rho, \rho')$    ▹ Algorithm S2
  8:     end if

  9: function $\textsc{HybridAndGlobally}^{\gamma}_{\Theta}(\rho, \rho')$
 10:     if $\epsilon > \gamma$ then return $\textsc{AndGlobally}^{2}_{\Theta}(\rho, \rho')$    ▹ Algorithm S1
 11:     else return $\textsc{AndGlobally}^{1}_{\Theta}(\rho, \rho')$    ▹ Algorithm S2
 12:     end if
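The dispatching idea behind Algorithm 10 can be sketched as a small higher-order function. This is only an illustration: `make_hybrid`, the stand-in variants, and the threshold value used in the example are hypothetical names and placeholders, not part of KnoBAB.

```python
from typing import Callable, TypeVar

R = TypeVar("R")


def make_hybrid(small_variant: Callable[..., R],
                large_variant: Callable[..., R],
                threshold: float) -> Callable[..., R]:
    """Wrap two implementations of one operator into a single callable.

    `statistic` is the value collected once at loading time (average trace
    length or log size); the large-input variant is selected only when the
    statistic exceeds the threshold.
    """
    def hybrid(statistic: float, *operands):
        chosen = large_variant if statistic > threshold else small_variant
        return chosen(*operands)
    return hybrid


if __name__ == "__main__":
    # Stand-ins for the two variants of one operator (placeholders only).
    variant_1 = lambda rho, rho2: ("VARIANT-1", rho, rho2)
    variant_2 = lambda rho, rho2: ("VARIANT-2", rho, rho2)
    # The threshold below is an arbitrary illustrative value.
    hybrid_and_future = make_hybrid(variant_1, variant_2, threshold=50)
    print(hybrid_and_future(10, "rho", "rho'"))
    print(hybrid_and_future(220, "rho", "rho'"))
```

Keeping the statistic as an explicit argument reflects the fact that it is computed once per log and reused by every query.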

7.2. Relational Temporal Mining

We now move from synthetic data, required to tune hybrid algorithms and thoroughly test our operators, towards real data benchmarks with no data payload conditions. We contextualise our experiments for data-intensive model mining operations that can also be run on a relational model. While doing so, we compare the run times of our hybrid operators both with those from our previous paper [4] and with those obtained on the relational model through traditional SQL queries.
SQLMiner, provided by Schönig et al. [5], utilises database architectures for declarative process mining. We chose to test our hypothesis of engineering a custom database architecture against a state-of-the-art traditional relational database (PostgreSQL 14.2). For this set of experiments, we exploited the BPIC 2011 (Dutch academic hospital log) dataset (https://dx.doi.org/10.17605/OSF.IO/2CXR7), as used in [5]. This log contains data payload information, though the queries executed as in [5] comprise data-less events only. The original dataset was sampled into sub-logs containing 10, 100, and 1000 traces, and the sampling approach adopted the same behaviour as the synthetic dataset from the previous set-up, where each sub-log is guaranteed to be a subset of the greater ones. Larger datasets exhibited exponential increases in primary memory requirements, which justifies our sampling approach. Schönig [48] provides the templated implementations for mining eight declarative clauses. As these are only templates, the models were instantiated from the resulting combinations of the five most occurring events. Therefore, we generated eight models, each consisting of 25 clauses. SQLMiner simulated this by creating a secondary Actions table, with each row containing the instantiated Declare template. SQLMiner provides the Support values associated with each clause. We extend this to also provide trace information, where each clause also contains the traces satisfying it. We also want to test our hypothesis that our proposed hybrid operator pipeline (Section 7.1) can outperform the pipeline set up in our previous work [4], which only uses our defined VARIANT-1 operators and therefore does not exploit the potential gains from picking the best algorithm according to the data conditions.
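The model instantiation described above can be sketched as follows. This is a minimal illustration assuming that a log is a list of traces and a trace is a list of activity labels; `top_k_labels` and `instantiate` are hypothetical helper names, and each clause is represented as a plain tuple rather than in any KnoBAB or SQLMiner format.

```python
from collections import Counter
from itertools import product
from typing import List, Tuple


def top_k_labels(log: List[List[str]], k: int) -> List[str]:
    """Return the k most frequently occurring activity labels in the log."""
    counts = Counter(label for trace in log for label in trace)
    return [label for label, _ in counts.most_common(k)]


def instantiate(template: str, labels: List[str]) -> List[Tuple[str, str, str]]:
    """Instantiate a binary template over every ordered pair of labels."""
    return [(template, a, b) for a, b in product(labels, repeat=2)]


if __name__ == "__main__":
    log = [["a", "b", "a"], ["a", "c", "b"]]
    labels = top_k_labels(log, 2)
    for clause in instantiate("Response", labels):
        print(clause)
```

With five labels, each template yields 5 × 5 = 25 clauses, which matches the model size reported above.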
The outcomes of these experiments are shown in Figure 11, where each plot represents the execution times for a given elected template, with the more complex queries located on the first row.
SQLMiner results. In the worst case, our running time is comparable with SQLMiner (Response). Even for this case, SQLMiner returns only the Support information, while KnoBAB also returns (for the same execution time) trace information. In SQL, applying the fewest possible query alterations to provide the trace information causes a $10^{1.5}$-fold run time increase, thus demonstrating that we are more performant under the same conditions. Conversely, in the best case, we outperform SQLMiner by over five orders of magnitude. By exploiting efficient database design, our custom query plan can minimise data access, and our computation avoids explicit computations of aggregations. In addition, guaranteeing that the intermediate results are always sorted allows for a linear scanning cost for counting operations. Responded Existence is a clear candidate for demonstrating the gains from custom database design: with access to our proposed $\mathit{CountingTable}_{\mathcal{L}}$, our solution requires only a table look-up, while SQLMiner requires an aggregation involving an entire scan of the Log table. Combining this with the extended xtLTLf operators allows for much more optimised query times; this is shown in the results, where KnoBAB is consistently at least two orders of magnitude more performant with queries returning trace information. As $|\mathcal{L}|$ increases beyond $10^{2}$, the more complex queries were unable to complete for SQLMiner, exceeding the 16 GB primary memory of the benchmarking machine.
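The look-up argument for Responded Existence can be sketched with a toy counting table. This is a simplified illustration: the table is modelled as a Python `Counter` keyed by (label, trace id), which is an assumption and not the physical layout of $\mathit{CountingTable}_{\mathcal{L}}$, and Support is taken here as the fraction of traces satisfying the clause.

```python
from collections import Counter
from typing import List


def build_counting_table(log: List[List[str]]) -> Counter:
    """Count, once at loading time, how often each label occurs per trace."""
    counts = Counter()
    for trace_id, trace in enumerate(log):
        for label in trace:
            counts[(label, trace_id)] += 1
    return counts


def responded_existence(counts: Counter, n_traces: int,
                        a: str, b: str) -> List[int]:
    """Traces where 'if a occurs, then b occurs too', using look-ups only."""
    return [t for t in range(n_traces)
            if counts[(a, t)] == 0 or counts[(b, t)] > 0]


def support(satisfying: List[int], n_traces: int) -> float:
    """Fraction of traces satisfying the clause (assumed definition)."""
    return len(satisfying) / n_traces


if __name__ == "__main__":
    log = [["a", "b"], ["a"], ["c"], ["b", "a"]]
    table = build_counting_table(log)
    traces = responded_existence(table, len(log), "a", "b")
    print(traces, support(traces, len(log)))
```

At query time no event is scanned again: each trace is decided by two look-ups, and the satisfying traces come for free alongside the Support value.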
Pipeline results. The execution times for KnoBAB + Support and KnoBAB + Max-SAT are comparable, while there is much greater variation between SQLMiner + Support and SQLMiner + Trace Info. As Support requires only an aggregation over intermediate results (Section 5.2.3), we guarantee that we suffer at most a cost proportional to the model size, so we expect a constant overhead based on model size. The large fluctuation in the results for SQLMiner is a consequence of the query rewriting performed by the PostgreSQL query engine; in some cases, returning trace information yielded better results. In these experiments, we combined the alternate ensemble methods with our proposed HYBRID operators. The results demonstrate that, for most operators, there is a marginal improvement in running time. For NotSuccession and Response, the improvement is more apparent, with the former providing a 20% improvement against VARIANT-1 for $|\mathcal{L}| = 10$. The reader is encouraged to refer back to Figure 10 to explain this. The faster operators thrive with $|\mathcal{L}| > 10^{3}$, while, for traces within the region of $10^{2}$, the gain is much less apparent. The BPIC_2011 dataset has a corresponding average trace length of ∼220: exploiting the VARIANT-2 operators within this region will therefore yield a lesser benefit than for much larger $|\mathcal{L}|$.

7.3. Query Plan Parallelisation

By keeping the immediately preceding experimental setting while considering the whole log as well as extending the model size, we now benchmark our solution in a multithreaded environment, where we perform intra-query parallelism by running the operators lying in the same layer in parallel, as per previous discussions.
The correctness of our proposed parallelisation approach is guaranteed by the fact that each thread in a given layer can operate independently, with no interdependencies requiring costly mutual exclusions. Instead of directly using the pthread library from C++ on multiple tasks, we utilised a thread pool proposed by [49] to minimise the thread creation overhead, while feeding the pool with the tasks denoted by for ... (parallel) do statements in our pseudocode (Algorithm 6). We extended the library to support both static and dynamic scheduling approaches proposed by the OpenMP specifications [50]; these are:
  • BLOCKED STATIC: aims to balance the chunk sizes per thread by distributing any leftover iterations;
  • BLOCK-CYCLIC STATIC: does not balance chunk sizes as the former does; instead, work blocks are cyclically allocated over the threads;
  • GUIDED DYNAMIC: aims to distribute large chunks when there is a lot of work still to be completed; tasks are split into smaller chunks as the work load diminishes;
  • MONOTONIC DYNAMIC: uses a single centralised counter that is incremented when a thread performs an iteration of work. The schedule issues iterations to threads in an increasing manner.
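The two static policies above can be sketched as index partitioning functions. This is a minimal illustration of the general scheduling ideas, with hypothetical function names; it is not the code of the thread pool library nor of an OpenMP runtime.

```python
from typing import List


def blocked_static(n_items: int, n_threads: int) -> List[List[int]]:
    """One contiguous chunk per thread; leftovers go to the first threads."""
    base, extra = divmod(n_items, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def block_cyclic_static(n_items: int, n_threads: int,
                        block: int) -> List[List[int]]:
    """Fixed-size blocks handed out to the threads in round-robin order."""
    chunks: List[List[int]] = [[] for _ in range(n_threads)]
    for b, start in enumerate(range(0, n_items, block)):
        chunks[b % n_threads].extend(range(start, min(start + block, n_items)))
    return chunks


if __name__ == "__main__":
    print(blocked_static(10, 4))
    print(block_cyclic_static(10, 4, block=2))
```

In the blocked policy, chunk sizes differ by at most one item, whereas in the block-cyclic policy a thread may receive one whole block more than another.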
In addition to these, we also implemented two different scheduling policies splitting the tasks to be run in parallel while estimating the running time that each operator will take depending on the size of its associated operands (if any).
  • TASK SIZE PREDICTION BLOCK STATIC: provides an estimation of the work required per chunk. These chunks are then sorted by ascending work load, with the last requiring the greatest amount of computation. Threads are then assigned chunks through a distribution algorithm, which gives the first and last chunk of the sorted work to the first thread, the second and penultimate to the second, etc. The algorithm aims to distribute equal amounts of work to each thread, though it assumes that the workload is strictly increasing while workload sizes are evenly distributed;
  • TASK SIZE PREDICTION UNBALANCED DYNAMIC: unlike the former, we assume that the incoming work is not balanced. Instead, a chunk is taken, its work size is estimated, and it is assigned to a thread. Then, the next thread will recursively receive chunks until the summed work load is approximately equal to that of the former. The next thread is then pulled from the pool and the process is repeated until all chunks are assigned.
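The pairing step of TASK SIZE PREDICTION BLOCK STATIC can be sketched as follows. This is an illustrative sketch only: the cost estimates are taken as given numbers, the function name is hypothetical, and wrapping around to the first thread when there are more pairs than threads is an assumption that the text above does not spell out.

```python
from typing import List, Sequence


def paired_block_static(costs: Sequence[float],
                        n_threads: int) -> List[List[int]]:
    """Assign chunk indices to threads by pairing cheap and expensive chunks."""
    # Chunk indices sorted by ascending estimated work load.
    order = sorted(range(len(costs)), key=lambda c: costs[c])
    assignment: List[List[int]] = [[] for _ in range(n_threads)]
    lo, hi, t = 0, len(order) - 1, 0
    while lo <= hi:
        thread = t % n_threads  # wrap-around is an assumption of this sketch
        assignment[thread].append(order[lo])
        if lo != hi:
            assignment[thread].append(order[hi])
        lo, hi, t = lo + 1, hi - 1, t + 1
    return assignment


if __name__ == "__main__":
    costs = [5, 1, 8, 3, 2, 6, 7, 4]
    plan = paired_block_static(costs, n_threads=4)
    for thread, chunks in enumerate(plan):
        print(thread, chunks, sum(costs[c] for c in chunks))
```

When the estimated costs grow evenly, as assumed by the policy, every thread ends up with the same total load.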
For this set of experiments, we exploited the full BPIC 2011 (Dutch hospital log) dataset. We want to determine how varying the total number of threads affects execution time, and therefore use only the original dataset with no sampling. This also demonstrates the performance in a real-world scenario. Similarly to the previous mining approach in Section 7.2, we generated models from the most occurring event labels. Here, we extended the model size to consider the top 15 events for the same eight Declare templates, thus resulting in 225 clauses per template. Extending the model size in this way allows a better scalability analysis at scale; in fact, a smaller model size would not be able to reap the benefits of the dissected query plan, as it becomes more likely that there will not be enough work to allocate; as more threads might be left idle in the pool, no speed-up can be achieved.
The results of our experiments are shown in Figure 12. Across all instances, the parallelisation pipeline (line with data-points) proves more performant than any single-threaded execution (horizontal bar). There also appears to be a great variation in speed-up for different scheduling policies; MONOTONIC DYNAMIC, TASK SIZE PREDICTION UNBALANCED DYNAMIC, and GUIDED DYNAMIC consistently perform worse than all others. In addition to this, these schedules grant almost no gain as the number of threads grows, indicating that dynamic scheduling is not only less performant than static scheduling in our use case scenario, but also bears no potential gains through thread scalability. This is especially true in the case of Alternate Precedence, where all static policies improve performance by at least an order of magnitude. Schedules also show different degrees of speed-up. For the dynamic and BLOCK-CYCLIC STATIC schedules, increasing the number of threads has little effect on performance. In fact, adding threads proves to be detrimental in some cases (BLOCK-CYCLIC STATIC & Chain Precedence). Conversely, the other static schedules (BLOCKED STATIC and TASK SIZE PREDICTION BLOCK STATIC) achieve a super-linear speed-up [51,52,53] as the thread count increases. The greatest gains in performance were found for Alternate Precedence and Alternate Response with eight threads; there are over two orders of magnitude improvement against a single-threaded instance, and almost the same speed-up compared with the static schedules. As our problem is heavily bound by data access and by the size of the data, reducing the task allocation size will create an overall increase in cache misses, while these are minimised by associating each thread with a greater number of tasks.
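For reference, the notion of super-linear speed-up can be stated with the usual definitions, where $T_1$ and $T_p$ are our own symbols for the running times with one and with $p$ threads:

$$S(p) = \frac{T_1}{T_p}, \qquad \text{linear: } S(p) = p, \qquad \text{super-linear: } S(p) > p.$$

Under these definitions, an improvement of over two orders of magnitude with eight threads corresponds to $S(8) > 10^{2} > 8$.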

7.4. $D_{\varphi}$-Encoding Atomisation Strategies

We now want to test how distinct query atomisation strategies affect the query run time. For this, we exploit a different dataset and hardcode some models suitable for highlighting such differences.
While the AtomizeEverything strategy guarantees that all activation and target conditions undergo the atomisation step if a clause is found that contains a data payload predicate, the AtomizeOnlyOnDataPredicate strategy atomises only those conditions containing a data payload and considers the others as activity labels. As a consequence, the former is expected to access $\mathit{AttributeTable}_{\mathcal{L}}$ more heavily, and the latter $\mathit{ActivityTable}_{\mathcal{L}}$. We analyse the execution times over the same models $M_1, \dots, M_5$, where each model differs from the others in the number of clauses as well as in data conditions.
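The difference between the two strategies can be sketched as a routing decision per condition. This sketch rests on one reading of the description above (under AtomizeEverything, a single data predicate anywhere in the model triggers atomisation of every condition); the data types, the function name `route`, and the example predicate are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A condition is (activity label, payload predicate or None).
Condition = Tuple[str, Optional[str]]
# A clause is (activation condition, target condition).
Clause = Tuple[Condition, Condition]


def route(model: List[Clause], strategy: str) -> Dict[Condition, str]:
    """Return, for each condition in the model, the table it would access."""
    has_data = any(pred is not None for clause in model for _, pred in clause)
    routing: Dict[Condition, str] = {}
    for clause in model:
        for cond in clause:
            _, pred = cond
            if strategy == "AtomizeEverything":
                atomise = has_data  # assumed reading of the strategy
            elif strategy == "AtomizeOnlyOnDataPredicate":
                atomise = pred is not None
            else:
                raise ValueError(f"unknown strategy: {strategy}")
            routing[cond] = "AttributeTable" if atomise else "ActivityTable"
    return routing


if __name__ == "__main__":
    model = [
        (("A_SUBMITTED", None), ("O_CANCELLED", None)),
        (("A_SUBMITTED", "x > 0"), ("O_CANCELLED", None)),  # made-up predicate
    ]
    print(route(model, "AtomizeEverything"))
    print(route(model, "AtomizeOnlyOnDataPredicate"))
```

In the second strategy, the same activity label may be routed to both tables, once with and once without a payload predicate.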
For these experiments, we exploited the full BPIC 2012 (Dutch loan company) dataset. This contains event/trace payload information and comprises the activities occurring for a loan transaction. The models exploited are visualised in Supplement Table S1a. We define four models, increasing by five clauses each, where each is a sub-model of the next. These clauses consist of both data and data-less payload conditions, in order to adhere to our benchmarking hypothesis.
Results are shown in Figure 13 for both configurations, where there is a positive correlation between model size and execution time, with a constant increase with each additional set of clauses. For the smaller model size, AtomizeEverything outperforms AtomizeOnlyOnDataPredicate, though the former exhibits greater increases in running time as more clauses are added. This therefore suggests that accessing $\mathit{ActivityTable}_{\mathcal{L}}$ becomes more expensive than accessing $\mathit{AttributeTable}_{\mathcal{L}}$ as the number of activation/target conditions increases. To explain this, the reader is encouraged to refer back to Supplement Table S1a, which defines the clauses that are added to each model, and therefore the new activities and atoms that may require decomposition. With increased model sizes, AtomizeOnlyOnDataPredicate suffers from duplicated memory access, as some events (e.g., A_SUBMITTED) are accessed in both tables: while returning the events satisfying an atom requires access to $\mathit{AttributeTable}^{k}_{\mathcal{L}}$ for any given attribute $k$ of interest, returning all of the events having a given activity label requires accessing $\mathit{ActivityTable}_{\mathcal{L}}$. The data access for the atomised queries may duplicate access to $\mathit{ActivityTable}_{\mathcal{L}}$, which becomes more costly as our model size increases. Conversely, AtomizeEverything will atomise A_SUBMITTED from $q_1$, as clauses $q_2$ and $q_3$ contain payload conditions. Therefore, these queries only ever access $\mathit{AttributeTable}_{\mathcal{L}}$, and the duplication of data access is removed. For the smaller model size $M_1$, this gain is less apparent, as the duplicated data access becomes negligible.

7.5. Data-Aware Conformance Checking

We now consider another state-of-the-art solution, Declare Analyzer [6], for conformance checking with payload information. This solution is tested against two different sets of models of increasing sizes, each providing either the worst or the best case scenario for KnoBAB. These experiments exploit the same dataset as in the former experimental set-up, which was also used in [6].
We represented the log for Declare Analyzer via MapDB (https://mapdb.org/), thus reflecting a relational model representation. The authors do not consider trace payloads, and therefore propose injecting the trace payload as an extension of each event payload. On the other hand, KnoBAB injects the trace payload as a unique event at the beginning of the trace (Section 2), thus reducing the overhead of testing an activation/target condition per event while minimising data loading time. We wanted to investigate our solution’s performance in the best and worst cases regarding the clauses of choice. Therefore, we provide two scenarios. The first scenario (SCENARIO 1), also described in our previous paper [4], provides our worst case scenario models (Table S1a), where each additional set of clauses consists of entirely novel activity labels and clauses and, within each sub-model, each clause is distinguished by data payload conditions. Consequently, the query plan cannot exploit gains made from data access minimisation, as every condition is considered a unique disjunction of atoms. Conversely, the second, novel scenario (SCENARIO 2) describes our best case, where activation and target conditions appear several times in different clauses; we encourage the reader to refer to Table S2 for these models. Thus, there are many more instances where data access can be minimised; for example, the model $q_1 \wedge q_2 \wedge q_3 \wedge q_4 \wedge q_5$ considers the activity label A_SUBMITTED across five instances. Following strategies such as in [9], this can be reduced to one access. SCENARIO 1 (SCENARIO 2) results are shown in Figure 14a (Figure 14b). For either scenario, we are on average 2–3 orders of magnitude more performant than Declare Analyzer; even in the worst case ($M_4$), we are over an order of magnitude more performant. For both scenarios, we compute the following metrics: Conjunctive Query (CQ) and Support, to analyse any variations between the ensemble methods.
KnoBAB + CQ outperforms KnoBAB + Support in all cases, where the cost increase is linear with model size.
SCENARIO 1. For Declare Analyzer, increases in model size result in a constant slope of $3.47 \times 10^{2}$ ms per model size, while our solution demonstrates an initial slope of $2 \times 10^{1}$ ms per model size, followed by a constant slope of $6 \times 10^{0}$ ms per model size. To explain this abrupt behaviour, the reader is encouraged to refer to Supplement Table S1a and the query plan from Figure 2. KnoBAB thrives when data access is minimised; if this cannot be achieved (due to the addition of novel activation/target conditions), potential gains cannot be exploited. Every clause from $M_2$ contains new activation/target labels/payload conditions compared to $M_1$. As a result, the number of atoms and leaves in the query plan is doubled. However, $M_3$ contains the activity label O_CANCELLED. This atom has already been considered in the previous model, and so data access is optimised. Therefore, the time increase from $M_2$ to $M_3$ is much less than that of the former. Subsequently, as $M_3$ is a sub-model of $M_4$, the same gains are seen here ($M_4$ contains entirely novel conditions). Overall, the results show that we are not bounded by model size unlike Declare Analyzer, which must perform an entire log scan per clause, while we can ignore irrelevant traces via bounding/indexing across our tabular representation available to the relational model. Still, our running times reflect the formal definition stated in Section 5.2.3, where queries still need to scan each model clause and therefore their expected running time is proportional to the model size.
SCENARIO 2. We now want to test whether clauses providing similar queries lead to lower running times. Here, the model sizes are smaller than the previous example, so as to demonstrate the potential optimisation from even small examples. The former contains only a single clause, while the latter consists of seven clauses. The slope between these models is $3.3 \times 10^{0}$ ms per model size, an order of magnitude less than the worst case scenario. To clarify the results, the reader is encouraged to compare the models $q_1$ vs. $q_1 \wedge q_2 \wedge q_3 \wedge q_4 \wedge q_5$. All atoms in the former are included in the latter, so we can have much greater data access minimisation, which these results confirm. Of course, a hand-made model is unlikely to contain such overlapping elements, but these results demonstrate the potential gains to be made, even for less bespoke scenarios such as data mining, where a huge amount of overlap might still occur while testing multiple clauses’ combinations.
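The data access minimisation discussed for both scenarios can be sketched as sharing the result of each distinct atom across clauses. This is a simplified illustration, not the KnoBAB query plan: `fetch` stands in for any table access, and the function name and the model encoding as (activation, target) pairs are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Tuple


def evaluate_with_shared_atoms(
        model: List[Tuple[Hashable, Hashable]],
        fetch: Callable[[Hashable], object]) -> Tuple[list, int]:
    """Evaluate every clause while accessing each distinct atom only once."""
    cache: Dict[Hashable, object] = {}

    def cached(atom: Hashable) -> object:
        if atom not in cache:
            cache[atom] = fetch(atom)  # the only place where data is accessed
        return cache[atom]

    results = [(cached(activation), cached(target))
               for activation, target in model]
    return results, len(cache)


if __name__ == "__main__":
    accesses = []

    def fetch(atom):
        accesses.append(atom)
        return f"events({atom})"

    model = [("A_SUBMITTED", "B"), ("A_SUBMITTED", "C"), ("A_SUBMITTED", "B")]
    _, distinct = evaluate_with_shared_atoms(model, fetch)
    print(distinct, accesses)
```

When clauses share no atoms, as in the worst case, the cache brings no benefit and the number of accesses grows with the model size.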

8. Conclusions

To summarise the contributions of our paper, we showed how to express temporal logic through an ad hoc temporal algebra (xtLTLf) based on the relational model. The latter, defined both in its logical and physical model, has been suitably extended for representing logs and operators’ results. We showed how it is possible to load data into this model using suitable algorithms and how it is possible to represent a sequence of operations with a parallelisable query plan providing super-linear speed-up. As a new contribution with respect to our previous work, we have also shown different implementations for the xtLTLf operators, thus showing how there is always a faster non-trivial implementation exploiting both the properties of the intermediate result representation as well as query rewriting. Our proposed solution, KnoBAB, leverages all of the aforementioned features, thus providing higher performance than current conformance checking and mining solutions, be they data-aware or data-less.
This work encourages future KnoBAB developments and implementations, including more efficient data model mining algorithms and the use of views to further reduce the cost of allocating intermediate results. Furthermore, the secondary memory representation of the log according to the precepts of Near Data Processing is in its infancy. Future developments will explore the possibility of using KnoBAB to learn temporal models from data and the ability to fully support trace repair operations in order to make deviant traces compliant with the given model. For this, we will consider the possibility of integrating our relational system with the BCDM relational model [54], thus fully supporting operations such as insertions, updates, and deletions required for trace repairs in conformance checking [25]. Our future work will also consider vectorial data as a specific data representation [32,55]: this will enable KnoBAB to fully support spatial data representation, thus aiming for full spatio-temporal representation [56,57]. This, along with more advanced model mining algorithms, will enable us to efficiently mine spatio-temporal patterns from logs. Finally, we will also investigate the possibility of transferring the definition of such algebraic operators to logs represented as graphs [58,59], thus further improving the efficiency of graph-based query languages.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info14030173/s1.

Author Contributions

Conceptualisation, S.A. and G.B.; methodology, G.B.; software, S.A. and G.B.; validation, S.A. and G.B.; formal analysis, G.B.; investigation, S.A.; resources, G.B. and G.M.; data curation, S.A.; writing—original draft preparation, S.A. and G.B.; writing—review and editing, G.B. and G.M.; visualisation, S.A.; supervision, G.B. and G.M.; project administration, G.B.; funding acquisition, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

Samuel Appleby’s work is supported by Newcastle University.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset associated with the presented experiments was available online on 5 March 2022: https://dx.doi.org/10.17605/OSF.IO/2CXR7.

Conflicts of Interest

The authors declare no conflict of interest.

Sample Availability

The most up-to-date version of KnoBAB is available on GitHub: https://github.com/datagram-db/knobab (accessed on 5 March 2022).

Abbreviations

    The following abbreviations are used in this manuscript:
DAG       Directed Acyclic Graph
KnoBAB    KNOwledge Base for Alignments and Business process modelling
LTLf      Linear Temporal Logic over finite traces
RDBMS     Relational Database Management System
XES       eXtensible Event Stream
xtLTLf    eXTended Linear Temporal Logic over finite traces

Appendix A

We now show some equivalence and correctness lemmas.

Appendix A.1

First, we want to show the equivalence of some unary operators as generalisations of some of the base operators. We now show that $\mathsf{Init}^{\mathcal{L}}_{A/T}$ or $\mathsf{Ends}^{\mathcal{L}}_{A/T}$ can be subsumed by appropriate combinations of $\mathsf{Init}$ or $\mathsf{Ends}$ with $\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}$. As the former set of operators cannot express data conditions for the events, while the latter can by replacing $\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}$ with an arbitrary sub-expression with $\mathsf{Atom}^{\mathcal{L},\tau}_{A/T}$, we can trivially conclude that the former are less general than the latter.
Lemma A1.
$$\forall a \in \Sigma.\ \mathsf{Init}^{\mathcal{L}}_{A/T}(a) = \mathsf{Init}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big)$$
Proof. 
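We can expand the definition of the left-hand side of the equation for any $a \in \Sigma$ as follows:
$$\mathsf{Init}^{\mathcal{L}}_{A/T}(a) = \big\{ \langle i, 1, \{A/T(1)\} \rangle \;\big|\; \exists \phi.\ \langle \beta(a), i, 1, \bot, \phi \rangle \in \mathit{ActivityTable}_{\mathcal{L}} \big\}$$
The right-hand side of the equation can be rewritten as follows:
$$\mathsf{Init}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big) = \big\{ \langle i, 1, \{A/T(1)\} \rangle \;\big|\; \exists \pi, \phi.\ \langle \beta(a), i, 1, \pi, \phi \rangle \in \mathit{ActivityTable}_{\mathcal{L}} \big\}$$
The goal is immediately closed by choosing $\pi = \bot$, as any first event will always have an empty Prev pointer.    □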
Lemma A2.
$$\forall a \in \Sigma.\ \mathsf{Ends}^{\mathcal{L}}_{A/T}(a) = \mathsf{Ends}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big)$$
Proof. 
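We can expand the definition of the left-hand side of the equation for any $a \in \Sigma$ as follows:
$$\mathsf{Ends}^{\mathcal{L}}_{A/T}(a) = \big\{ \langle i, 1, \{A/T(|\sigma^i|)\} \rangle \;\big|\; \exists \pi.\ \langle \beta(a), i, |\sigma^i|, \pi, \bot \rangle \in \mathit{ActivityTable}_{\mathcal{L}} \big\}$$
The right-hand side of the equation can be rewritten as follows:
$$\begin{aligned}
\mathsf{Ends}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big) &= \big\{ \langle i, 1, L \rangle \;\big|\; \langle i, |\sigma^i|, L \rangle \in \mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a) \big\} \\
&= \big\{ \langle i, 1, \{A/T(|\sigma^i|)\} \rangle \;\big|\; \exists \pi.\ \langle \beta(a), i, |\sigma^i|, \pi, \bot \rangle \in \mathit{ActivityTable}_{\mathcal{L}} \big\}
\end{aligned}$$
The goal is immediately closed by observing that any last event will always have an empty pointer to the next event.    □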
On the other hand, as the $\mathsf{Exists}^{\mathcal{L}}_{A/T}$ and $\mathsf{Absence}^{\mathcal{L}}_{A/T}$ operators discard the activation and target marks of the associated events for the purposes of efficiency, we need to relax their notion of equivalence by ignoring the result being provided by the third component. Still, we can observe that they compute the same result trace-wise. Even in this scenario, as the former operators merely access the counting table for the purposes of efficiency, they cannot be generally exploited when the expression of data conditions is also required.
Lemma A3.
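$$\forall a \in \Sigma.\ \forall \sigma^i \in \mathcal{L}.\ \exists L, L'.\ \langle i, 1, L \rangle \in \mathsf{Exists}^{\mathcal{L}}_{A/T}(a, n) \Leftrightarrow \langle i, 1, L' \rangle \in \mathsf{Exists}_{n}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big)$$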
Proof. 
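$$\begin{aligned}
& \langle i, 1, L \rangle \in \mathsf{Exists}^{\mathcal{L}}_{A/T}(a, n) \Leftrightarrow \langle i, 1, L' \rangle \in \mathsf{Exists}_{n}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big) \\
\Leftrightarrow{}\ & \exists m \geq n.\ \langle \beta(a), i, m \rangle \in \mathit{CountingTable}_{\mathcal{L}} \Leftrightarrow n \leq \big| \{ \langle i, j, L \rangle \in \mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a) \} \big| \\
\Leftrightarrow{}\ & \big| \{ \sigma^i_j \in \sigma^i \mid \sigma^i_j = \langle a, p \rangle \} \big| \geq n \Leftrightarrow n \leq \big| \{ \langle \beta(a), i, j, \pi, \phi \rangle \in \mathit{ActivityTable}_{\mathcal{L}} \} \big|
\end{aligned}$$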
   □
Lemma A4.
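$$\forall a \in \Sigma.\ \forall \sigma^i \in \mathcal{L}.\ \exists L, L'.\ \langle i, 1, L \rangle \in \mathsf{Absence}^{\mathcal{L}}_{A/T}(a, n) \Leftrightarrow \langle i, 1, L' \rangle \in \mathsf{Absence}_{n}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big)$$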
Proof. 
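The proof follows from the previous lemma by simply replacing the conditions $m \geq n$ and $n \leq |S|$, for any set $S$, with $m < n$ and $n > |S|$. This boils down to:
$$\begin{aligned}
& \langle i, 1, L \rangle \in \mathsf{Absence}^{\mathcal{L}}_{A/T}(a, n) \Leftrightarrow \langle i, 1, L' \rangle \in \mathsf{Absence}_{n}\big(\mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a)\big) \\
\Leftrightarrow{}\ & \exists m < n.\ \langle \beta(a), i, m \rangle \in \mathit{CountingTable}_{\mathcal{L}} \Leftrightarrow n > \big| \{ \langle i, j, L \rangle \in \mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(a) \} \big| \\
\Leftrightarrow{}\ & \big| \{ \sigma^i_j \in \sigma^i \mid \sigma^i_j = \langle a, p \rangle \} \big| < n \Leftrightarrow n > \big| \{ \langle \beta(a), i, j, \pi, \phi \rangle \in \mathit{ActivityTable}_{\mathcal{L}} \} \big|
\end{aligned}$$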
   □

Appendix A.2

Next, we want to show that xtLTLf is at least as expressive as LTLf. To support this claim, we need to prove the two following lemmas where, as LTLf does not support explicit activation and target conditions with Θ correlation conditions over the payload data, we are always going to assume Θ = True and that the atomic operators are never associated with an activation/target label, thus always returning an empty third component of the intermediate result. As we might observe, the following lemma entails that, differently from standard LTLf semantics applied to each event trace at a time, xtLTLf semantics returns all of the events for which the given temporal condition holds. This becomes very relevant for minimising the data access while scanning our relational representation of the log, as well as allowing better intermediate result reuse for any incoming sub-expression. The following lemma also entails a correspondence between timed xtLTLf operators and LTLf formulae.
Lemma A5.
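For each LTLf formula $\varphi$, a timed xtLTLf expression $\psi^{\tau}$ evaluated over an intended relational model representing a log $\mathcal{L}$ of finite and non-empty traces exists for which the latter returns $\langle i, j, L \rangle$ iff. $\sigma^i_j \vDash \varphi$. More formally:
$$\forall \sigma^i_j \in \sigma^i,\ \sigma^i \in \mathcal{L}.\ \forall \varphi \in \mathrm{LTL}_f.\ \exists \psi^{\tau} : \text{timed xtLTL}_f.\ \big( \langle i, j, L \rangle \in \psi^{\tau} \Leftrightarrow \sigma^i_j \vDash \varphi \big)$$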
Proof. 
The constructive proof proceeds by structural induction over $\psi^{\tau}$. We first need to consider a rewriting lemma stating that $\langle \beta(a), i, j, \pi, \chi \rangle \in \mathit{ActivityTable}_{\mathcal{L}}$ iff. a $p$ exists such that $\sigma^i_j = \langle a, p \rangle$. Now, we can start the proof by induction.
$\varphi = a$
By applying the aforementioned rewriting lemma (from now on simply referred to as by construction of ActivityTable), we can immediately close the goal by choosing $\psi^{\tau} = \mathsf{Activity}^{\tau}(a)$, as the model will only return data associated with the log of choice:
$$\langle i, j, L \rangle \in \mathsf{Activity}^{\tau}(a) \Leftrightarrow \exists p.\ \sigma^i_j = \langle a, p \rangle \Leftrightarrow \sigma^i_j \vDash a$$
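$\varphi = a \wedge q$
If the compound condition is also atomic, so that $q$ can be expressed as an interval query $\mathit{low} \leq \kappa \leq \mathit{up}$ for some payload key $\kappa$, we can follow a similar proof from the former case and choose the atom $\psi^{\tau} = \mathsf{Compound}^{\tau}(a, \kappa, [\mathit{low}, \mathit{up}])$, thus closing the goal as follows:
$$\langle i, j, L \rangle \in \mathsf{Compound}^{\tau}(a, \kappa, [\mathit{low}, \mathit{up}]) \Leftrightarrow \exists p.\ \sigma^i_j = \langle a, p \rangle \wedge \mathit{low} \leq p(\kappa) \leq \mathit{up} \Leftrightarrow \sigma^i_j \vDash a \wedge (\mathit{low} \leq \kappa \wedge \kappa \leq \mathit{up})$$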
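$\varphi = \bigcirc \varphi'$
By inductive hypothesis, we know the xtLTLf expression returning $\rho$, which contains $\langle i, j+1, L \rangle$ when $\sigma^i_{j+1} \vDash \varphi'$. For this, we choose $\psi^{\tau} = \mathsf{Next}^{\tau}(\rho)$, which also guarantees that $j$ never exceeds the trace’s length ($j \leq |\sigma^i|$). We can therefore expand the definition of our proposed operator by obtaining:
$$\langle i, j, L \rangle \in \mathsf{Next}^{\tau}(\rho) \Leftrightarrow \langle i, j+1, L \rangle \in \rho \wedge 1 < j+1 \leq |\sigma^i| \overset{IH}{\Leftrightarrow} \sigma^i_{j+1} \vDash \varphi' \wedge 0 < j < |\sigma^i| \Leftrightarrow \sigma^i_j \vDash \bigcirc \varphi'$$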
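$\varphi = \square \varphi'$
The application of the induction is similar to the former and, similarly to the former case, we also proceed by expanding the definition of the relational operator. We can hereby choose $\psi^{\tau} = \mathsf{Globally}^{\tau}(\rho)$, where the induction is applied over $\rho$ and $\varphi'$. We can close the goal as follows:
$$\begin{aligned}
\langle i, j, L \rangle \in \mathsf{Globally}^{\tau}(\rho) &\Leftrightarrow \langle i, j, L_j \rangle \in \rho \wedge |\sigma^i| - j + 1 = \big| \{ \langle i, k, L_k \rangle \in \rho \mid j \leq k \leq |\sigma^i| \} \big| \\
&\Leftrightarrow \forall j \leq k \leq |\sigma^i|.\ \langle i, k, L_k \rangle \in \rho \\
&\overset{IH}{\Leftrightarrow} \forall j \leq k \leq |\sigma^i|.\ \sigma^i_k \vDash \varphi' \Leftrightarrow \sigma^i_j \vDash \square \varphi'
\end{aligned}$$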
$\varphi = \Diamond \varphi'$
Similarly to globally, we obtain $\psi^{\tau} = \mathsf{Future}^{\tau}(\rho)$ for a $\rho$ corresponding to $\varphi'$ by inductive hypothesis.
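$\varphi = \neg \varphi'$
Similarly to the previous unary operators, we choose as xtLTLf operator $\psi^{\tau} = \mathsf{Not}^{\tau}(\rho)$, where the inductive hypothesis links $\rho$ to $\varphi'$. We can therefore close the goal as follows:
$$\langle i, j, L \rangle \in \mathsf{Not}^{\tau}(\rho) \Leftrightarrow \sigma^i_j \in \sigma^i \wedge \sigma^i \in \mathcal{L} \wedge \langle i, j, L \rangle \notin \rho \overset{IH}{\Leftrightarrow} \sigma^i_j \in \sigma^i \wedge \sigma^i \in \mathcal{L} \wedge \sigma^i_j \vDash \neg \varphi' \Leftrightarrow \sigma^i_j \vDash \neg \varphi'$$
This is doable as stating $\langle i, j, L \rangle \in \psi^{\tau} \Leftrightarrow \sigma^i_j \vDash \varphi$ is equivalent to $\langle i, j, L \rangle \notin \psi^{\tau} \Leftrightarrow \sigma^i_j \nvDash \varphi$, where the latter can be rewritten as $\sigma^i_j \vDash \neg \varphi$.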
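$\varphi = \varphi' \wedge \varphi''$
As we have two inductive hypotheses associating $\rho$ and $\rho'$ respectively to $\varphi'$ and $\varphi''$, we choose the xtLTLf formula $\psi^{\tau} = \mathsf{And}^{\tau}_{\mathbf{True}}(\rho, \rho')$ to be associated with $\varphi' \wedge \varphi''$. For this xtLTLf operator, we can state that a result $\langle i, j, \emptyset \rangle$ is returned by such an operator if and only if $\langle i, j, \emptyset \rangle \in \rho$ and $\langle i, j, \emptyset \rangle \in \rho'$, per definition of operators never returning explicit activation or target conditions. We close the goal as follows:
$$\langle i, j, L \rangle \in \mathsf{And}^{\tau}_{\mathbf{True}}(\rho, \rho') \Leftrightarrow \langle i, j, \emptyset \rangle \in \rho \wedge \langle i, j, \emptyset \rangle \in \rho' \overset{IH}{\Leftrightarrow} \sigma^i_j \vDash \varphi' \wedge \sigma^i_j \vDash \varphi'' \Leftrightarrow \sigma^i_j \vDash \varphi' \wedge \varphi''$$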
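$\varphi = \varphi' \vee \varphi''$
We can firstly observe that $(A \wedge B) \vee (A \wedge \neg B) \vee (\neg A \wedge B)$ in the classical semantics is equivalent to $A \vee B$ for any possible propositions $A$ and $B$ (OrRwLem). After observing that the current operator is defined by extension of the previously proved one, we can exploit the previous one as a rewriting lemma. As we have two inductive hypotheses associating $\rho$ and $\rho'$ respectively to $\varphi'$ and $\varphi''$, we choose the xtLTLf formula $\psi^{\tau} = \mathsf{Or}^{\tau}_{\mathbf{True}}(\rho, \rho')$ to be associated with $\varphi' \vee \varphi''$. We close the goal as follows:
$$\begin{aligned}
\langle i, j, L \rangle \in \mathsf{Or}^{\tau}_{\mathbf{True}}(\rho, \rho') &\overset{IH}{\Leftrightarrow} \langle i, j, L \rangle \in \mathsf{And}^{\tau}(\rho, \rho') \vee (\sigma^i_j \vDash \varphi' \wedge \sigma^i_j \nvDash \varphi'') \vee (\sigma^i_j \nvDash \varphi' \wedge \sigma^i_j \vDash \varphi'') \\
&\Leftrightarrow \sigma^i_j \vDash \varphi' \wedge \varphi'' \vee (\sigma^i_j \vDash \varphi' \wedge \sigma^i_j \nvDash \varphi'') \vee (\sigma^i_j \nvDash \varphi' \wedge \sigma^i_j \vDash \varphi'') \\
&\overset{\text{OrRwLem}}{\Leftrightarrow} \sigma^i_j \vDash \varphi' \vee \varphi''
\end{aligned}$$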
$\varphi = \varphi' \,\mathcal{U}\, \varphi''$
As both the results from the third element of the intermediate results are always empty by construction and preliminary assumption, and as we have inductive hypotheses associating $\rho$ and $\rho'$ respectively to $\varphi'$ and $\varphi''$, we can immediately close the goal after choosing the xtLTLf formula $\psi^{\tau} = \mathsf{Until}^{\tau}_{\mathbf{True}}(\rho, \rho')$ to be associated with $\varphi = \varphi' \,\mathcal{U}\, \varphi''$.
   □
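The correspondence stated by the lemma can be illustrated with a small evaluator that, like the timed operators, returns every position of a trace at which a formula holds rather than a single Boolean. This is a didactic sketch over one data-less trace with standard finite-trace LTLf semantics; the tuple encoding of formulae and the function name are hypothetical, and no activation/target information is tracked.

```python
from typing import List, Set, Tuple

Formula = Tuple  # e.g. ("atom", "a"), ("next", f), ("until", f, g)


def holds_at(formula: Formula, trace: List[str]) -> Set[int]:
    """Return the 1-based positions j of the trace where the formula holds."""
    n = len(trace)
    op = formula[0]
    if op == "atom":
        return {j for j in range(1, n + 1) if trace[j - 1] == formula[1]}
    if op == "not":
        return set(range(1, n + 1)) - holds_at(formula[1], trace)
    if op == "and":
        return holds_at(formula[1], trace) & holds_at(formula[2], trace)
    if op == "or":
        return holds_at(formula[1], trace) | holds_at(formula[2], trace)
    if op == "next":
        inner = holds_at(formula[1], trace)
        return {j for j in range(1, n) if j + 1 in inner}
    if op == "globally":
        inner, out = holds_at(formula[1], trace), set()
        for j in range(n, 0, -1):  # longest suffix fully contained in inner
            if j not in inner:
                break
            out.add(j)
        return out
    if op == "future":
        inner = holds_at(formula[1], trace)
        return set(range(1, max(inner) + 1)) if inner else set()
    if op == "until":
        left = holds_at(formula[1], trace)
        right = holds_at(formula[2], trace)
        out: Set[int] = set()
        for j in range(n, 0, -1):  # U(j) = right(j) or (left(j) and U(j+1))
            if j in right or (j in left and j + 1 in out):
                out.add(j)
        return out
    raise ValueError(f"unknown operator: {op}")


if __name__ == "__main__":
    trace = ["a", "b", "a", "c"]
    print(holds_at(("globally", ("not", ("atom", "b"))), trace))
    print(holds_at(("until", ("atom", "a"), ("atom", "c")), trace))
```

Checking whether a formula holds from the beginning of the trace then amounts to testing whether position 1 belongs to the returned set.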
The next lemma is required for closing the generic lemma stated at the beginning of this sub-section, as LTLf starts assessing the formulae from the beginning of each trace. We need to show that the former lemma applies to xtLTLf operators in a stricter version, which is the following one:
Lemma A6.
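For each LTLf formula $\varphi$ satisfied from the beginning of the trace, there exists an xtLTLf expression $\psi$ returning a $\langle i, 1, L \rangle$, thus highlighting that the condition holds from the beginning of the trace. More formally:
$$\forall \sigma^i \in \mathcal{L}.\ \forall \varphi : \mathrm{LTL}_f.\ \exists \psi \in \text{xtLTL}_f.\ \big( \sigma^i \vDash \varphi \Leftrightarrow \exists L.\ \langle i, 1, L \rangle \in \psi \big)$$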
Proof. 
Similarly to the previous lemma, as LTLf cannot express activation and target conditions to be tested in $\Theta$ correlation conditions, we always choose $\Theta = \mathbf{True}$ and use base xtLTLf operators that return none of these conditions. Unlike the previous lemma, we now proceed by structural induction over the LTLf formulae rather than over the xtLTLf ones. We therefore consider the following inductive cases:
φ = a
By definition of the Init operator, it is sufficient to consider ψ = Init ( a ) ;
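$\varphi = \mathsf{a} \wedge p$
Under the assumption that the compound condition corresponds to an atomic query with $p := \mathit{low} \le \kappa \le \mathit{up}$, we can formulate the former as follows: $\psi = \mathsf{Init}(\mathsf{Compound}^{\mathcal{L},\tau}(\mathsf{a}, \kappa, \mathit{low}, \mathit{up}))$;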
$\varphi = \bigcirc \varphi'$
By rewriting the definition, this requires proving that $\sigma^i_2 \vDash \varphi'$. As $\mathsf{Next}^\tau$ is a timed operator and we cannot assess $\varphi'$ from the beginning of the trace, we cannot exploit the inductive hypothesis for $\varphi'$; instead, we need to apply the previously proven lemma for conditions holding at any point in the trace. From the application of the previous lemma, we have that $\sigma^i_2 \vDash \varphi' \Leftrightarrow \langle i, 2, L \rangle \in \rho$ for some xtLTLf expression returning $\rho$. From this, it follows that $\langle i, 1, L \rangle \in \mathsf{Next}^\tau(\rho)$. By its definition, $\mathsf{Next}^\tau$ returns all events preceding the ones stated in $\rho$, while, for $\sigma^i \vDash \bigcirc \varphi'$, we are only interested in restricting the possible results of $\mathsf{Next}^\tau$ to the ones corresponding to the beginning of the trace. For this reason, we need to consider $\psi$ as $\mathsf{And}^\tau(\mathsf{First}^\tau, \mathsf{Next}^\tau(\rho))$;
$\varphi = \Box \varphi'$
Similarly to the previous operator, $\varphi'$ is timed and should be checked for all events $\sigma^i_j$ of interest within the trace $\sigma^i$. Even in this case, we need to apply the previous lemma for $\varphi'$, thus guaranteeing that an xtLTLf expression $\rho$ exists containing $\langle i, j, L \rangle$ whenever $\sigma^i_j \vDash \varphi'$. As globally requires that all of the events satisfy $\varphi'$, $\mathsf{Globally}(\rho)$ matches the intended semantics, and we therefore choose it as our $\psi$;
$\varphi = \Diamond \varphi'$
Similarly to the previous operator, we choose $\mathsf{Future}(\rho)$, where $\rho$ is linked to the evaluation of $\varphi'$ for any possible trace event by the previous lemma;
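$\varphi = \neg \varphi'$
In this scenario, we can directly apply the inductive hypothesis, as the evaluation of $\varphi'$ will always start from the beginning of the trace. After recalling that $\nexists x. P(x) \Leftrightarrow \forall x. \neg P(x)$, we rewrite the definition of $\varphi$ while applying the inductive hypothesis for the present lemma over some $\rho$ semantically linked to $\varphi'$ as follows:
$$\sigma^i \vDash \neg \varphi' \Leftrightarrow \sigma^i_1 \nvDash \varphi' \overset{IH}{\Leftrightarrow} \nexists L.\ \langle i, 1, L \rangle \in \rho$$
By inductive hypothesis, $\rho$ contains all of the records $\langle i, 1, L \rangle$ for which $\sigma^i_1 \vDash \varphi'$; as the untimed negation returns a record $\langle \iota, 1, \emptyset \rangle$ if and only if there is no event associated with the trace $\iota$ in the provided operand, we can choose $\psi = \mathsf{Not}(\rho)$ and close the goal as follows:
$$\langle i, 1, \emptyset \rangle \in \mathsf{Not}(\rho) \Leftrightarrow \nexists j, L.\ \langle i, j, L \rangle \in \rho \Leftrightarrow \nexists L.\ \langle i, 1, L \rangle \in \rho$$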
$\varphi = \varphi' \,\mathcal{U}\, \varphi''$
Similarly to the former operators, both $\varphi'$ and $\varphi''$ require a timed evaluation of the events along the trace of interest, for which we need to exploit the former lemma, thus obtaining timed xtLTLf expressions $\rho'$ and $\rho''$. We can immediately close the lemma by choosing $\psi = \mathsf{Until}_{\mathbf{True}}(\rho', \rho'')$;
$\varphi = \varphi' \wedge \varphi''$
Similarly to the negation operator, we can directly apply the inductive hypothesis on $\varphi'$ and $\varphi''$, as these sub-formulae will also be assessed from the beginning of a trace; these will be associated respectively with the xtLTLf expressions $\rho'$ and $\rho''$ having $\langle i, 1, \emptyset \rangle \in \rho'$ and $\langle i, 1, \emptyset \rangle \in \rho''$, as we exploit neither activation nor target conditions. As, by construction, $\rho'$ and $\rho''$ contain no record $\langle i, j + 2, L \rangle$ for any natural number $j \ge 0$, we choose $\psi = \mathsf{And}_{\mathbf{True}}(\rho', \rho'')$;
$\varphi = \varphi' \vee \varphi''$
By considerations similar to those for the former operator, we choose $\psi = \mathsf{Or}_{\mathbf{True}}(\rho', \rho'')$ for some $\rho'$ and $\rho''$ respectively associated by inductive hypothesis with $\varphi'$ and $\varphi''$.
   □
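The case analysis of the lemma can be read as a recursive translation from LTLf formulae to xtLTLf expressions. The following Python sketch renders that reading as strings; it is an illustration under stated assumptions rather than KnoBAB's implementation. The class names, the `^t` suffix marking timed operators, and the mapping of timed atoms and timed negation to `Activity^t` and `Not^t` are our assumptions, and data conditions (the Compound case) are left out.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    label: str


@dataclass(frozen=True)
class Unary:
    op: str  # "next", "globally", "future" or "not"
    arg: "Formula"


@dataclass(frozen=True)
class Binary:
    op: str  # "until", "and" or "or"
    left: "Formula"
    right: "Formula"


Formula = Union[Atom, Unary, Binary]


def timed(phi: Formula) -> str:
    """Timed expression for phi holding at any event (role of the previous lemma)."""
    if isinstance(phi, Atom):
        return f"Activity^t({phi.label})"
    if isinstance(phi, Unary):
        # Assumption: each unary LTLf operator maps to its timed namesake.
        name = {"next": "Next", "globally": "Globally",
                "future": "Future", "not": "Not"}[phi.op]
        return f"{name}^t({timed(phi.arg)})"
    name = {"until": "Until", "and": "And", "or": "Or"}[phi.op]
    return f"{name}^t_True({timed(phi.left)}, {timed(phi.right)})"


def from_start(phi: Formula) -> str:
    """Expression for phi holding from the beginning of the trace."""
    if isinstance(phi, Atom):
        return f"Init({phi.label})"
    if isinstance(phi, Unary):
        if phi.op == "next":
            # Restrict the timed Next to the first event of each trace.
            return f"And^t(First^t, Next^t({timed(phi.arg)}))"
        if phi.op == "globally":
            return f"Globally({timed(phi.arg)})"
        if phi.op == "future":
            return f"Future({timed(phi.arg)})"
        if phi.op == "not":
            # Negation applies the inductive hypothesis directly.
            return f"Not({from_start(phi.arg)})"
        raise ValueError(f"unknown unary operator: {phi.op}")
    if phi.op == "until":
        return f"Until_True({timed(phi.left)}, {timed(phi.right)})"
    if phi.op in ("and", "or"):
        name = phi.op.capitalize()
        return f"{name}_True({from_start(phi.left)}, {from_start(phi.right)})"
    raise ValueError(f"unknown binary operator: {phi.op}")
```

For instance, `from_start(Binary("and", Unary("not", Atom("a")), Unary("next", Unary("future", Atom("b")))))` yields `And_True(Not(Init(a)), And^t(First^t, Next^t(Future^t(Activity^t(b)))))`, showing that only the temporal cases switch to the timed translation of their operands.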
As a corollary of the two given lemmas, xtLTLf is at least as expressive as LTLf, as any LTLf formula can always be computed through an equivalent xtLTLf formula. This validates the decision from our previous work [4], where we expressed the semantics of each Declare template through a corresponding xtLTLf expression. These were also checked through automated testing (Appendix A.2). At this stage, we also want to ascertain that the untimed and timed operators work as expected, that is, that we can mimic the outcome of the timed operators through the untimed ones if, for each result $\langle i, j, L \rangle$, we evaluate the corresponding untimed operator over the suffix $\sigma^i_j, \dots, \sigma^i_{|\sigma^i|}$. This can be proven as follows:
Lemma A7.
For each timed xtLTLf operator $\psi^\tau$ containing a result $\langle i, j, L \rangle$ over a relational representation of $\mathcal{L}$, generate a log of suffixes $\mathcal{L}' = \{\sigma^{i_j}\}$, where $\sigma^{i_j} := \sigma^i_j, \dots, \sigma^i_{|\sigma^i|}$ is a suffix of $\sigma^i$, and each event is defined as $\sigma^{i_j}_k := \sigma^i_{j+k-1}$ for each $1 \le k \le |\sigma^i| - j + 1$. For this, an xtLTLf expression $\psi$ evaluated over the relational representation of $\mathcal{L}'$ always exists such that $\langle i_j, 1, L \rangle \in \psi$.
Proof. 
We prove the lemma by induction over ψ τ by considering all of the timed operators having an untimed counterpart. Please observe that we discard the negation Not from our considerations, as we have previously mentioned that the timed and untimed versions of this serve different purposes. We also provide an implementation (https://github.com/datagram-db/knobab/blob/main/tests/ltlf_operators_test.cpp, 5 March 2023) of such proofs via automated testing.
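$\psi^\tau = \mathsf{Activity}^{\mathcal{L},\tau}_{A/T}(\mathsf{a})$:
This can be trivially closed by choosing $\mathsf{Init}^{\mathcal{L}'}_{A/T}(\mathsf{a})$;
$\psi^\tau = \mathsf{Compound}^{\mathcal{L},\tau}_{A/T}(\mathsf{a}, k, \mathit{low}, \mathit{up})$:
This can be trivially closed by choosing $\mathsf{Init}(\mathsf{Compound}^{\mathcal{L}'}_{A/T}(\mathsf{a}, k, [\mathit{low}, \mathit{up}]))$;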
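$\psi^\tau = \mathsf{Globally}^\tau(\rho)$
After observing that $|\sigma^{i_j}| = |\sigma^i| - j + 1$, we obtain the following condition by expanding the operator, where $\rho'$ denotes the counterpart of $\rho$ evaluated over $\mathcal{L}'$ as per inductive hypothesis:
$$\begin{aligned}
\langle i, j, L \rangle \in \mathsf{Globally}^\tau(\rho) &\Leftrightarrow L := \bigcup_{j \le k \le |\sigma^i|,\ \langle i, k, L_k \rangle \in \rho} L_k \ \wedge\ |\sigma^i| - j + 1 = \left|\{\langle i, k, L_k \rangle \in \rho \mid j \le k \le |\sigma^i|\}\right| \\
&\Leftrightarrow L := \bigcup_{1 \le k \le |\sigma^{i_j}|,\ \langle i_j, k, L_k \rangle \in \rho'} L_k \ \wedge\ |\sigma^{i_j}| = \left|\{\langle i_j, k, L_k \rangle \in \rho' \mid 1 \le k \le |\sigma^{i_j}|\}\right| \\
&\Leftrightarrow L := \bigcup_{\langle i_j, k, L_k \rangle \in \rho'} L_k \ \wedge\ |\sigma^{i_j}| = \left|\{\langle i_j, k, L_k \rangle \in \rho'\}\right| \\
&\Leftrightarrow \langle i_j, 1, L \rangle \in \mathsf{Globally}(\rho');
\end{aligned}$$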
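$\psi^\tau = \mathsf{Future}^\tau(\rho)$
By considerations similar to those for the former operator, and writing again $\rho'$ for the counterpart of $\rho$ evaluated over $\mathcal{L}'$, we have:
$$\begin{aligned}
\langle i, j, L \rangle \in \mathsf{Future}^\tau(\rho) &\Leftrightarrow L := \bigcup_{j \le k \le |\sigma^i|,\ \langle i, k, L_k \rangle \in \rho} L_k \ \wedge\ \exists h \ge j, L_h.\ \langle i, h, L_h \rangle \in \rho \\
&\Leftrightarrow L := \bigcup_{1 \le k \le |\sigma^{i_j}|,\ \langle i_j, k, L_k \rangle \in \rho'} L_k \ \wedge\ \exists h \ge 1, L_h.\ \langle i_j, h, L_h \rangle \in \rho' \\
&\Leftrightarrow L := \bigcup_{\langle i_j, k, L_k \rangle \in \rho'} L_k \ \wedge\ \exists h, L_h.\ \langle i_j, h, L_h \rangle \in \rho' \\
&\Leftrightarrow \langle i_j, 1, L \rangle \in \mathsf{Future}(\rho');
\end{aligned}$$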
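$\psi^\tau = \mathsf{And}^\tau_\Theta(\rho_1, \rho_2)$
By rewriting the definition of the timed And operator, we obtain the following:
$$\langle i, j, L \rangle \in \mathsf{And}^\tau_\Theta(\rho_1, \rho_2) \Leftrightarrow \exists L_1, L_2.\ \langle i, j, L_1 \rangle \in \rho_1 \wedge \langle i, j, L_2 \rangle \in \rho_2 \wedge L := T^{E,i}_\Theta([j \mapsto L_1], [j \mapsto L_2]) \wedge L \ne \mathbf{False}$$
If the timed And contains an event $\sigma^i_j$ for both of its operands, it follows that there should be at least one match on $\sigma^{i_j}_1$ for the corresponding untimed operator $\mathsf{And}_\Theta(\rho', \rho'')$ evaluated over $\mathcal{L}'$, where $\rho'$ and $\rho''$ are the counterparts of $\rho_1$ and $\rho_2$ over $\mathcal{L}'$. For the latter operator, we can therefore ensure that there exist a $j'$ and a $j''$ with $j' = j'' = 1$, as well as $L_1$ and $L_2$, for which the following condition holds:
$$\begin{aligned}
\langle i, j, L \rangle \in \mathsf{And}^\tau_\Theta(\rho_1, \rho_2) &\Rightarrow \exists L_1, L_2.\ \langle i_j, 1, L_1 \rangle \in \rho' \wedge \langle i_j, 1, L_2 \rangle \in \rho'' \\
&\qquad \wedge L := T^{E,i}_\Theta\left([1 \mapsto \{L_{j'} \mid \langle i_j, j', L_{j'} \rangle \in \rho'\}], [1 \mapsto \{L_{j''} \mid \langle i_j, j'', L_{j''} \rangle \in \rho''\}]\right) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \langle i_j, 1, L \rangle \in \mathsf{And}_\Theta(\rho', \rho'');
\end{aligned}$$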
$\psi^\tau = \mathsf{Or}^\tau_\Theta(\rho_1, \rho_2)$
As this operator is derived from the definition of $\mathsf{And}^\tau_\Theta$, we can directly close the goal by the previous inductive step if the result represents a match between the elements of the first and second operand. If no events could be matched, the data come either from the first or from the second operand. As the two cases are symmetric, we only provide the proof for the former. In this situation, we have a $\langle i, j, L \rangle \in \mathsf{Or}^\tau_\Theta(\rho_1, \rho_2)$ corresponding to a $\langle i, j, L \rangle \in \rho_1$ for which there is no $L'$ such that $\langle i, j, L' \rangle \in \rho_2$. If there still exist a $j'$ and an $L'$ such that $\langle i, j', L' \rangle \in \rho_2$ for which there might be a match between $L$ and $L'$, then this case falls under the untimed $\mathsf{And}_\Theta$ over $\mathcal{L}'$, and we still have some $\tau$ for which the latter returns $\langle i_j, 1, \tau \rangle$; if a match is never possible or no such $j'$ exists, then the untimed $\mathsf{Or}_\Theta$ operator will return a $\langle i_j, 1, \{L \mid \exists k.\ \langle i_j, k, L \rangle \in \rho_2\} \rangle$ by definition;
$\psi^\tau = \mathsf{Until}^\tau(\rho_1, \rho_2)$
This is a rewriting exercise, as the untimed version of Until is an instantiation of the timed one where only the case $k = 1$ is considered.
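The suffix-log construction of Lemma A7 can be exercised on a toy log. The sketch below is a simplified Python model and not the tested C++ implementation: results are reduced to (trace id, position) pairs, so the third component $L$ is omitted, traces are plain lists of activity labels, and all function names are ours.

```python
from typing import Dict, Hashable, List, Set, Tuple

Log = Dict[Hashable, List[str]]   # trace id -> non-empty list of activity labels
Rho = Set[Tuple[Hashable, int]]   # (trace id, 1-based position); payloads omitted


def activity(log: Log, label: str) -> Rho:
    """Timed base operator: every event carrying the given label."""
    return {(i, j) for i, trace in log.items()
            for j, a in enumerate(trace, start=1) if a == label}


def suffix_log(log: Log) -> Log:
    """Log of suffixes: trace (i, j) is the suffix of trace i starting at event j."""
    return {(i, j): trace[j - 1:] for i, trace in log.items()
            for j in range(1, len(trace) + 1)}


def future_timed(log: Log, rho: Rho) -> Rho:
    return {(i, j) for i, trace in log.items() for j in range(1, len(trace) + 1)
            if any((i, h) in rho for h in range(j, len(trace) + 1))}


def globally_timed(log: Log, rho: Rho) -> Rho:
    return {(i, j) for i, trace in log.items() for j in range(1, len(trace) + 1)
            if all((i, k) in rho for k in range(j, len(trace) + 1))}


def future_untimed(log: Log, rho: Rho) -> Rho:
    return {(i, 1) for i, trace in log.items()
            if any((i, h) in rho for h in range(1, len(trace) + 1))}


def globally_untimed(log: Log, rho: Rho) -> Rho:
    return {(i, 1) for i, trace in log.items()
            if all((i, k) in rho for k in range(1, len(trace) + 1))}


def agrees(log: Log, label: str) -> bool:
    """Timed results over the log match untimed results over the suffix log."""
    rho = activity(log, label)
    suffixes = suffix_log(log)
    rho_prime = activity(suffixes, label)  # same base operator, evaluated on suffixes

    def lift(result: Rho) -> Rho:
        # A timed result (i, j) corresponds to position 1 of suffix trace (i, j).
        return {(ij, 1) for ij in result}

    return (lift(future_timed(log, rho)) == future_untimed(suffixes, rho_prime)
            and lift(globally_timed(log, rho)) == globally_untimed(suffixes, rho_prime))
```

With `log = {1: ["a", "b", "a"], 2: ["b", "b"]}`, `agrees(log, "a")` and `agrees(log, "b")` both evaluate to `True`: each timed result at position $j$ of trace $i$ reappears at position 1 of the suffix trace $i_j$.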

Appendix A.3

At this stage, we provide some rewriting lemmas motivating the introduction of derived operators. First, we show that the untimed $\mathsf{And}_\Theta(\rho, \rho')$ operator can also be exploited to compute $\mathsf{And}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho'))$, thus motivating the peculiar definition of this operator with an existential interpretation over all of the possible matches in the future. We can formally prove this as follows:
Lemma A8.
$$\forall \rho, \rho'.\ \mathsf{And}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho')) = \mathsf{And}_\Theta(\rho, \rho')$$
Proof. 
By expanding the definition of the operators, we obtain:
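$$\begin{aligned}
\langle i, 1, L \rangle \in \mathsf{And}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho')) &\Leftrightarrow \exists L', L''.\ \langle i, 1, L' \rangle \in \mathsf{Future}(\rho) \wedge \langle i, 1, L'' \rangle \in \mathsf{Future}(\rho') \\
&\qquad \wedge L := T^{E,i}_\Theta\left([1 \mapsto \{L_j \mid \langle i, j, L_j \rangle \in \rho\}], [1 \mapsto \{L_j \mid \langle i, j, L_j \rangle \in \rho'\}]\right) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \exists j, j', L', L''.\ \langle i, j, L' \rangle \in \rho \wedge \langle i, j', L'' \rangle \in \rho' \\
&\qquad \wedge L := T^{E,i}_\Theta\left([1 \mapsto \{L_j \mid \langle i, j, L_j \rangle \in \rho\}], [1 \mapsto \{L_j \mid \langle i, j, L_j \rangle \in \rho'\}]\right) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \langle i, 1, L \rangle \in \mathsf{And}_\Theta(\rho, \rho')
\end{aligned}$$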
Please remember that the untimed And operator is also compliant with the LTLf semantics as per our previous lemmas. We can therefore exploit the versatile definition of this operator to avoid the computational overhead of the additional and unnecessary aggregation performed by Future. Given the previous lemma, we have as a corollary that the semantics associated with the Choice Declare clause, i.e., $\mathsf{Or}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho'))$, can equivalently be computed by $\mathsf{Or}_\Theta(\rho, \rho')$. The following proof motivates the choice of exploiting $E^i_\Theta$ as a correlation matching semantics for both $\mathsf{And}_\Theta$ and $\mathsf{Or}_\Theta$.
Corollary A1.
$$\forall \rho, \rho'.\ \mathsf{Or}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho')) = \mathsf{Or}_\Theta(\rho, \rho')$$
Proof. 
By expanding the definition of the untimed Or Θ , we obtain:
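$$\begin{aligned}
\mathsf{Or}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho')) = {} & \mathsf{And}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho')) \\
& \cup \left\{\langle i, 1, \{L \mid \exists j.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho)\} \rangle \ \middle|\ \nexists j, L.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho')\right\} \\
& \cup \left\{\langle i, 1, \{L \mid \exists j.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho')\} \rangle \ \middle|\ \nexists j, L.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho)\right\}
\end{aligned}$$
By the previous lemma, this becomes:
$$\begin{aligned}
\mathsf{Or}_\Theta(\mathsf{Future}(\rho), \mathsf{Future}(\rho')) = {} & \mathsf{And}_\Theta(\rho, \rho') \\
& \cup \left\{\langle i, 1, \{L \mid \exists j.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho)\} \rangle \ \middle|\ \nexists j, L.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho')\right\} \\
& \cup \left\{\langle i, 1, \{L \mid \exists j.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho')\} \rangle \ \middle|\ \nexists j, L.\ \langle i, j, L \rangle \in \mathsf{Future}(\rho)\right\}
\end{aligned}$$
At this stage, we only need to test the contribution of the second component of the union, as the third one is symmetrical ($\rho$ and $\rho'$ are just swapped). As the elements of the second component of the union come from Future operators, which only return results at position 1, we can rewrite it as follows:
$$\left\{\langle i, 1, \{L \mid \langle i, 1, L \rangle \in \mathsf{Future}(\rho)\} \rangle \ \middle|\ \nexists L.\ \langle i, 1, L \rangle \in \mathsf{Future}(\rho')\right\}$$
We can also observe that $\langle i, 1, L \rangle \in \mathsf{Future}(\rho)$ for a given $L$ if there exist a $j$ and an $L'$ for which $\langle i, j, L' \rangle \in \rho$. Similar considerations hold for the negated counterpart ($\langle i, 1, L \rangle \notin \mathsf{Future}(\rho')$). By this expansion, we can therefore close our goal. □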
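Lemma A8 and Corollary A1 can be illustrated on a deliberately reduced model. The Python sketch below assumes $\Theta = \mathbf{True}$ and drops the third component of each record, so that only the existential matching over trace positions remains; the function names are ours, and the exhaustive check covers a hypothetical two-trace universe only.

```python
from itertools import chain, combinations


def traces(rho):
    """Trace ids having at least one record in rho."""
    return {i for i, _ in rho}


def future(rho):
    """Untimed Future: one record at position 1 per trace occurring in rho."""
    return {(i, 1) for i in traces(rho)}


def and_untimed(rho1, rho2):
    """Untimed And with Theta = True: some record of the trace in both operands."""
    return {(i, 1) for i in traces(rho1) & traces(rho2)}


def or_untimed(rho1, rho2):
    """Untimed Or as the three-way union used in the proof of Corollary A1."""
    only_first = {(i, 1) for i in traces(rho1) - traces(rho2)}
    only_second = {(i, 1) for i in traces(rho2) - traces(rho1)}
    return and_untimed(rho1, rho2) | only_first | only_second


def rewriting_holds():
    """Check both rewritings for every pair of operands over three events."""
    events = [(1, 1), (1, 2), (2, 1)]
    subsets = [set(c) for c in chain.from_iterable(
        combinations(events, k) for k in range(len(events) + 1))]
    return all(
        and_untimed(future(a), future(b)) == and_untimed(a, b)
        and or_untimed(future(a), future(b)) == or_untimed(a, b)
        for a in subsets for b in subsets)
```

In this reduced setting, `rewriting_holds()` returns `True`, which mirrors the observation that the aggregation performed by Future is redundant under the existential interpretation of the untimed And and Or.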
The remaining lemmas show the correctness of the logical formulation of the derived operators, thus motivating their adoption when possible. These lemmas were also tested in our implementation (See the end of https://github.com/datagram-db/knobab/blob/main/tests/until_test.cpp, 5 March 2023). The supplementary materials (Section II) show that it is possible to implement such derived operators so that they are faster than their corresponding LTLf rewriting counterpart.
Lemma A9.
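$$\forall \rho_1, \rho_2.\ \mathsf{And}^\tau_\Theta(\rho_1, \mathsf{Future}^\tau(\rho_2)) = \mathsf{AndFuture}^\tau_\Theta(\rho_1, \rho_2)$$
Proof. 
$$\begin{aligned}
\langle i, j, L \rangle \in \mathsf{And}^\tau_\Theta(\rho_1, \mathsf{Future}^\tau(\rho_2)) &\Leftrightarrow \exists L_1, L_2.\ \langle i, j, L_1 \rangle \in \rho_1 \wedge \langle i, j, L_2 \rangle \in \mathsf{Future}^\tau(\rho_2) \\
&\qquad \wedge L := T^{E,i}_\Theta([j \mapsto L_1], [j \mapsto L_2]) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \exists L_1, L_2.\ \langle i, j, L_1 \rangle \in \rho_1 \wedge \exists h \ge j, L'.\ \langle i, h, L' \rangle \in \rho_2 \\
&\qquad \wedge L := T^{E,i}_\Theta\Big([j \mapsto L_1], \Big[j \mapsto \bigcup_{j \le k \le |\sigma^i|,\ \langle i, k, L_k \rangle \in \rho_2} L_k\Big]\Big) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \langle i, j, L \rangle \in \mathsf{AndFuture}^\tau_\Theta(\rho_1, \rho_2)
\end{aligned}$$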
Lemma A10.
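$$\forall \rho_1, \rho_2.\ \mathsf{And}^\tau_\Theta(\rho_1, \mathsf{Globally}^\tau(\rho_2)) = \mathsf{AndGlobally}^\tau_\Theta(\rho_1, \rho_2)$$
Proof. 
$$\begin{aligned}
\langle i, j, L \rangle \in \mathsf{And}^\tau_\Theta(\rho_1, \mathsf{Globally}^\tau(\rho_2)) &\Leftrightarrow \exists L_1, L_2.\ \langle i, j, L_1 \rangle \in \rho_1 \wedge \langle i, j, L_2 \rangle \in \mathsf{Globally}^\tau(\rho_2) \\
&\qquad \wedge L := T^{E,i}_\Theta([j \mapsto L_1], [j \mapsto L_2]) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \exists L_1.\ \langle i, j, L_1 \rangle \in \rho_1 \wedge \langle i, j, L_j \rangle \in \rho_2 \wedge |\sigma^i| - j + 1 = \left|\{\langle i, k, L_k \rangle \in \rho_2 \mid j \le k \le |\sigma^i|\}\right| \\
&\qquad \wedge L := T^{E,i}_\Theta\Big([j \mapsto L_1], \Big[j \mapsto \bigcup_{j \le k \le |\sigma^i|,\ \langle i, k, L_k \rangle \in \rho_2} L_k\Big]\Big) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \exists L_1.\ \langle i, j, L_1 \rangle \in \rho_1 \wedge \forall j \le k \le |\sigma^i|.\ \exists L_k.\ \langle i, k, L_k \rangle \in \rho_2 \\
&\qquad \wedge L := T^{E,i}_\Theta\Big([j \mapsto L_1], \Big[j \mapsto \bigcup_{j \le k \le |\sigma^i|,\ \langle i, k, L_k \rangle \in \rho_2} L_k\Big]\Big) \wedge L \ne \mathbf{False} \\
&\Leftrightarrow \langle i, j, L \rangle \in \mathsf{AndGlobally}^\tau_\Theta(\rho_1, \rho_2)
\end{aligned}$$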
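To make the purpose of Lemmas A9 and A10 concrete, the following Python sketch compares the composed expressions with fused counterparts that never materialise the intermediate Future or Globally result. It is a simplified model and not the algorithm from the supplementary materials: we assume $\Theta = \mathbf{True}$, omit the third component of each record, and describe the log only through a hypothetical `lengths` dictionary mapping each trace id to its length.

```python
from itertools import chain, combinations


def future_timed(lengths, rho):
    return {(i, j) for i, n in lengths.items() for j in range(1, n + 1)
            if any((i, h) in rho for h in range(j, n + 1))}


def globally_timed(lengths, rho):
    return {(i, j) for i, n in lengths.items() for j in range(1, n + 1)
            if all((i, k) in rho for k in range(j, n + 1))}


def and_timed(rho1, rho2):
    """Timed And with Theta = True and no payloads: same trace, same position."""
    return rho1 & rho2


def and_future(rho1, rho2):
    """Fused variant: keep (i, j) from rho1 if rho2 has some (i, h) with h >= j."""
    latest = {}  # trace id -> largest position occurring in rho2
    for i, h in rho2:
        latest[i] = max(latest.get(i, 0), h)
    return {(i, j) for i, j in rho1 if latest.get(i, 0) >= j}


def and_globally(lengths, rho1, rho2):
    """Fused variant: keep (i, j) from rho1 if rho2 covers positions j..|trace i|."""
    return {(i, j) for i, j in rho1
            if all((i, k) in rho2 for k in range(j, lengths[i] + 1))}


def fused_matches_composed(lengths):
    """Exhaustively compare fused and composed operators over all operand pairs."""
    events = [(i, j) for i, n in lengths.items() for j in range(1, n + 1)]
    subsets = [set(c) for c in chain.from_iterable(
        combinations(events, k) for k in range(len(events) + 1))]
    return all(
        and_timed(r1, future_timed(lengths, r2)) == and_future(r1, r2)
        and and_timed(r1, globally_timed(lengths, r2)) == and_globally(lengths, r1, r2)
        for r1 in subsets for r2 in subsets)
```

For a single trace of length three, `fused_matches_composed({1: 3})` returns `True`. The fused `and_future` scans each operand once, whereas the composed form first builds the whole timed Future result; this is the kind of saving the derived operators aim at, although the actual speed-up depends on the implementation.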

References

  1. Agrawal, R.; Imieliński, T.; Swami, A. Mining Association Rules between Sets of Items in Large Databases. SIGMOD Rec. 1993, 22, 207–216.
  2. Bergami, G.; Maggi, F.M.; Montali, M.; Peñaloza, R. Probabilistic Trace Alignment. In Proceedings of the 2021 3rd International Conference on Process Mining (ICPM), Eindhoven, The Netherlands, 31 October–4 November 2021; pp. 9–16.
  3. Schön, O.; van Huijgevoort, B.; Haesaert, S.; Soudjani, S. Correct-by-Design Control of Parametric Stochastic Systems. In Proceedings of the 2022 IEEE 61st Conference on Decision and Control, Cancun, Mexico, 6–9 December 2022.
  4. Appleby, S.; Bergami, G.; Morgan, G. Running Temporal Logical Queries on the Relational Model. In Proceedings of the International Database Engineered Applications Symposium (IDEAS’22), Budapest, Hungary, 22–24 August 2022; pp. 222–231.
  5. Schönig, S.; Rogge-Solti, A.; Cabanillas, C.; Jablonski, S.; Mendling, J. Efficient and Customisable Declarative Process Mining with SQL. In Advanced Information Systems Engineering, Proceedings of the 28th International Conference, CAiSE 2016, Ljubljana, Slovenia, 13–17 June 2016; Springer: Berlin/Heidelberg, Germany, 2016.
  6. Burattin, A.; Maggi, F.M.; Sperduti, A. Conformance checking based on multi-perspective declarative process models. Expert Syst. Appl. 2016, 65, 194–211.
  7. Pesic, M.; Schonenberg, H.; van der Aalst, W.M.P. DECLARE: Full Support for Loosely-Structured Processes. In Proceedings of the 11th IEEE International Enterprise Distributed Object Computing Conference, Annapolis, MA, USA, 15–19 October 2007; pp. 287–300.
  8. Musser, D.R. Introspective Sorting and Selection Algorithms. Softw. Pract. Exp. 1997, 27, 983–993.
  9. Bellatreche, L.; Kechar, M.; Bahloul, S.N. Bringing Common Subexpression Problem from the Dark to Light: Towards Large-Scale Workload Optimizations. In Proceedings of the 25th International Database Engineering & Applications Symposium, Montreal, QC, Canada, 14–16 July 2021.
  10. Naldurg, P.; Sen, K.; Thati, P. A Temporal Logic Based Framework for Intrusion Detection. In Proceedings of the Formal Techniques for Networked and Distributed Systems—FORTE 2004: 24th IFIP WG 6.1 International Conference, Madrid, Spain, 27–30 September 2004; Núñez, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3235, pp. 359–376.
  11. Ray, I. Security Vulnerabilities in Smart Contracts as Specifications in Linear Temporal Logic. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2021.
  12. Buschjäger, S.; Hess, S.; Morik, K. Shrub Ensembles for Online Classification. In Proceedings of the AAAI Conference on Artificial Intelligence 2022, Virtual, 22 February–1 March 2022; pp. 6123–6131.
  13. Huo, X.; Hao, K.; Chen, L.; song Tang, X.; Wang, T.; Cai, X. A dynamic soft sensor of industrial fuzzy time series with propositional linear temporal logic. Expert Syst. Appl. 2022, 201, 117176.
  14. Bergami, G.; Francescomarino, C.D.; Ghidini, C.; Maggi, F.M.; Puura, J. Exploring Business Process Deviance with Sequential and Declarative Patterns. arXiv 2021, arXiv:2111.12454.
  15. Zhou, H.; Milani Fard, A.; Makanju, A. The State of Ethereum Smart Contracts Security: Vulnerabilities, Countermeasures, and Tool Support. J. Cybersecur. Priv. 2022, 2, 358–378.
  16. Szabo, N. Smart contracts: Building blocks for digital markets. Extropy J. Transhumanist Thought 1996, 18, 28.
  17. Fionda, V.; Greco, G.; Mastratisi, M.A. Reasoning About Smart Contracts Encoded in LTL. In Proceedings of the AIxIA 2021—Advances in Artificial Intelligence: 20th International Conference of the Italian Association for Artificial Intelligence, Virtual Event, 1–3 December 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 123–136.
  18. Bank, H.S.; D’souza, S.; Rasam, A. Temporal Logic (TL)-Based Autonomy for Smart Manufacturing Systems. Procedia Manuf. 2018, 26, 1221–1229.
  19. Mao, X.; Li, X.; Huang, Y.; Shi, J.; Zhang, Y. Programmable Logic Controllers Past Linear Temporal Logic for Monitoring Applications in Industrial Control Systems. IEEE Trans. Ind. Informatics 2022, 18, 4393–4405.
  20. Boniol, P.; Linardi, M.; Roncallo, F.; Palpanas, T.; Meftah, M.; Remy, E. Unsupervised and scalable subsequence anomaly detection in large data series. VLDB J. 2021, 30, 909–931.
  21. Xu, H.; Pang, J.; Yang, X.; Yu, J.; Li, X.; Zhao, D. Modeling clinical activities based on multi-perspective declarative process mining with openEHR’s characteristic. BMC Med. Inform. Decis. Mak. 2020, 20-S, 303.
  22. Rovani, M.; Maggi, F.M.; de Leoni, M.; van der Aalst, W.M.P. Declarative process mining in healthcare. Expert Syst. Appl. 2015, 42, 9236–9251.
  23. Bertini, F.; Bergami, G.; Montesi, D.; Veronese, G.; Marchesini, G.; Pandolfi, P. Predicting Frailty Condition in Elderly Using Multidimensional Socioclinical Databases. Proc. IEEE 2018, 106, 723–737.
  24. De Giacomo, G.; Maggi, F.M.; Marrella, A.; Patrizi, F. On the Disruptive Effectiveness of Automated Planning for LTLf-Based Trace Alignment. In Proceedings of the AAAI Conference on Artificial Intelligence 2017, San Francisco, CA, USA, 4–9 February 2017.
  25. Bergami, G.; Maggi, F.M.; Marrella, A.; Montali, M. Aligning Data-Aware Declarative Process Models and Event Logs. In Business Process Management; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 235–251.
  26. Bergami, G. A Logical Model for joining Property Graphs. arXiv 2021, arXiv:2106.14766.
  27. Zhu, S.; Pu, G.; Vardi, M.Y. First-Order vs. Second-Order Encodings for LTLf-to-Automata Translation. arXiv 2019, arXiv:1901.06108.
  28. Ceri, S.; Gottlob, G. Translating SQL Into Relational Algebra: Optimization, Semantics, and Equivalence of SQL Queries. IEEE Trans. Software Eng. 1985, 11, 324–345.
  29. Calders, T.; Lakshmanan, L.V.S.; Ng, R.T.; Paredaens, J. Expressive power of an algebra for data mining. ACM Trans. Database Syst. 2006, 31, 1169–1214.
  30. Li, J.; Pu, G.; Zhang, Y.; Vardi, M.Y.; Rozier, K.Y. SAT-based explicit LTLf satisfiability checking. Artif. Intell. 2020, 289, 103369.
  31. Petermann, A.; Junghanns, M.; Müller, R.; Rahm, E. FoodBroker-Generating Synthetic Datasets for Graph-Based Business Analytics. In Proceedings of the 5th International Workshop, WBDB 2014, Potsdam, Germany, 5–6 August 2014.
  32. Bergami, G. On Declare MAX-SAT and a finite Herbrand Base for data-aware logs. arXiv 2021, arXiv:2106.07781.
  33. Pichler, P.; Weber, B.; Zugal, S.; Pinggera, J.; Mendling, J.; Reijers, H.A. Imperative versus Declarative Process Modeling Languages: An Empirical Investigation. In Proceedings of the BPM 2011 International Workshops, Clermont-Ferrand, France, 29 August 2011; pp. 383–394.
  34. Codd, E.F. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 1970, 13, 377–387.
  35. Idreos, S.; Groffen, F.; Nes, N.; Manegold, S.; Mullender, K.S.; Kersten, M.L. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng. Bull. 2012, 35, 40–45.
  36. Boncz, P.A.; Manegold, S.; Kersten, M.L. Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct. Proc. VLDB Endow. 2009, 2, 1648–1653.
  37. Roth, M.A.; Korth, H.F.; Silberschatz, A. Extended Algebra and Calculus for Nested Relational Databases. ACM Trans. Database Syst. 1988, 13, 389–417.
  38. Wang, J.; Ntarmos, N.; Triantafillou, P. GraphCache: A Caching System for Graph Queries. In Proceedings of the International Conference on Extending Database Technology (EDBT) 2017, Venice, Italy, 21–24 March 2017; pp. 13–24.
  39. Keller, A.M.; Basu, J. A Predicate-based Caching Scheme for Client-Server Database Architectures. VLDB J. 1996, 5, 35–47.
  40. Davey, B.A.; Priestley, H.A. Introduction to Lattices and Order, 2nd ed.; Cambridge University Press: Cambridge, UK, 2002.
  41. de Berg, M.; Cheong, O.; van Kreveld, M.J.; Overmars, M.H. Computational Geometry: Algorithms and Applications, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 2008.
  42. Elmasri, R.; Navathe, S.B. Fundamentals of Database Systems, 7th ed.; Pearson: Upper Saddle River, NJ, USA, 2015.
  43. Polyvyanyy, A.; ter Hofstede, A.H.M.; Rosa, M.L.; Ouyang, C.; Pika, A. Process Query Language: Design, Implementation, and Evaluation. arXiv 2019, arXiv:1909.09543.
  44. Coffman, E.G.; Graham, R.L. Optimal Scheduling for Two-Processor Systems. Acta Inform. 1972, 1, 200–213.
  45. Sugiyama, K.; Tagawa, S.; Toda, M. Methods for Visual Understanding of Hierarchical System Structures. IEEE Trans. Syst. Man. Cybern. 1981, 11, 109–125.
  46. Bergami, G. On Efficiently Equi-Joining Graphs. In Proceedings of the 25th International Database Engineering & Applications Symposium 2021, Montreal, QC, Canada, 14–16 July 2021.
  47. Dittrich, J. Patterns in Data Management: A Flipped Textbook; CreateSpace Independent Publishing Platform: Charleston, SC, USA, 2016.
  48. Schönig, S. SQL Queries for Declarative Process Mining on Event Logs of Relational Databases. arXiv 2015, arXiv:1512.00196.
  49. Shoshany, B. A C++17 Thread Pool for High-Performance Scientific Computing. arXiv 2021, arXiv:2105.00613.
  50. Klemm, M.; Cownie, J. 8 Scheduling parallel loops. In High Performance Parallel Runtimes; De Gruyter Oldenbourg: Berlin, Germany; Boston, MA, USA, 2021; pp. 228–258.
  51. Ristov, S.; Prodan, R.; Gusev, M.; Skala, K. Superlinear speedup in HPC systems: Why and when? In Proceedings of the 2016 Federated Conference on Computer Science and Information Systems (FedCSIS), Gdańsk, Poland, 11–14 September 2016; pp. 889–898.
  52. Yan, B.; Regueiro, R.A. Superlinear speedup phenomenon in parallel 3D Discrete Element Method (DEM) simulations of complex-shaped particles. Parallel Comput. 2018, 75, 61–87.
  53. Nagashima, U.; Hyugaji, S.; Sekiguchi, S.; Sato, M.; Hosoya, H. An experience with super-linear speedup achieved by parallel computing on a workstation cluster: Parallel calculation of density of states of large scale cyclic polyacenes. Parallel Comput. 1995, 21, 1491–1504.
  54. Anselma, L.; Bottrighi, A.; Montani, S.; Terenziani, P. Extending BCDM to Cope with Proposals and Evaluations of Updates. IEEE Trans. Knowl. Data Eng. 2013, 25, 556–570.
  55. Bergami, G.; Bertini, F.; Montesi, D. Hierarchical embedding for DAG reachability queries. In Proceedings of the IDEAS 2020: 24th International Database Engineering & Applications Symposium, Seoul, Republic of Korea, 12–14 August 2020; Desai, B.C., Cho, W., Eds.; ACM: New York, NY, USA, 2020; pp. 24:1–24:10.
  56. Revesz, P.Z. Introduction to Databases—From Biological to Spatio-Temporal; Texts in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010.
  57. Revesz, P. Geographic Databases. In Introduction to Databases: From Biological to Spatio-Temporal; Springer: London, UK, 2010; pp. 81–109.
  58. Zaki, N.M.; Helal, I.M.A.; Awad, A.; Hassanein, E.E. Efficient Checking of Timed Order Compliance Rules over Graph-encoded Event Logs. arXiv 2022, arXiv:2206.09336.
  59. Rost, C.; Gómez, K.; Täschner, M.; Fritzsche, P.; Schons, L.; Christ, L.; Adameit, T.; Junghanns, M.; Rahm, E. Distributed temporal graph analytics with GRADOOP. VLDB J. 2022, 31, 375–401.
Figure 1. Table of Contents.
Figure 2. KnoBAB Architecture for Breast Cancer patients. Each trace ➀–➂ represents one single patient’s clinical history, represented with unique colouring, while each Declare clause Ⓐ–Ⓒ prescribes a temporal condition that such traces shall satisfy. Please observe that the atomisation process does not consider data distribution but rather partitions the data space as described by the data activation and target conditions. In the query plan, green arrows indicate access to shared sub-queries as in [9], and thick red ellipses indicate which operators are untimed.
Figure 3. We can express a cyber-security scenario by considering (a) possible situations in a Cyber Kill Chain, which are then (b) represented in the names of the activity labels associated with the events.
Figure 4. Two exemplifying clauses distinguishing Response and Precedence behaviours. Traces are represented as temporally ordered events associated with activity labels (boxed). Activation (or target) conditions are circled here (or ticked/crossed). Ticks (or crosses) indicate a (un)successful match of a target condition. For all activations, there must be an un-failing target condition; for precedence, we shall consider at most one activation. These conditions require the usage of multiple join tests per trace.
Figure 5. In-depth representation of the query plan associated with the model described in Example 15.
Figure 6. Assessing a high-level use case of an intrusion attack on a software system through a declarative model.
Figure 7. Results for the fast set operations (Section 6.1) against the traditional logical implementation.
Figure 8. Results for the custom declarative clause implementations (Section 6.2) against the traditional logical implementation.
Figure 9. Results for the Until operator (Section 6.3).
Figure 10. Results for the derived operators TimedAndFuture and TimedAndGlobally (Section 6.4). We include both variants of the fast implementations to analyse the environments in which each thrives.
Figure 11. Results for relational temporal mining (Section 7.2).
Figure 12. Results for parallelisation (Section 7.3). ω indicates the set of threads in the thread pool, and the red dashed horizontal lines indicate running times for single-threaded instances.
Figure 13. Running times over different models (Table S1a) for different atomisation strategies.
Figure 14. Running times for data-aware conformance checking.
Table 1. Declare templates illustrated as exemplifying clauses. A p ( B q ) represents the activation (target) condition, A (B) denotes the activity label, and p (q) is the data payload condition.
Table 1. Declare templates illustrated as exemplifying clauses. A p ( B q ) represents the activation (target) condition, A (B) denotes the activity label, and p (q) is the data payload condition.
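Type | Exemplifying Clause ($c_l$) | Natural Language Specification for Traces | LTLf Semantics ($\llbracket c_l \rrbracket$)
Simple | Init($A, p$) | The trace should start with an activation. | $A \wedge p$
Simple | Exists($A, p, n$) | Activations should occur at least $n$ times. | $\Diamond(A \wedge p \wedge \bigcirc(\llbracket \mathrm{Exists}(A, p, n-1) \rrbracket))$ if $n > 1$; $\Diamond(A \wedge p)$ if $n = 1$
Simple | Absence($A, p, n+1$) | Activations should occur at most $n$ times. | $\neg \llbracket \mathrm{Exists}(A, p, n+1) \rrbracket$
Simple | Precedence($A, p, B, q$) | Events preceding the activations should not satisfy the target. | $\neg(B \wedge p) \;\mathcal{W}\; (A \wedge p)$
(Mutual) Correlation | ChainPrecedence($A, p, B, q$) | The activation is immediately preceded by the target. | $\Box(\bigcirc(A \wedge p) \Rightarrow (B \wedge q))$
(Mutual) Correlation | Choice($A, p, A', p'$) | At least one of the two activation conditions must appear. | $\Diamond(A \wedge p) \vee \Diamond(A' \wedge p')$
(Mutual) Correlation | Response($A, p, B, q$) | The activation is either followed by or simultaneous to the target. | $\Box((A \wedge p) \Rightarrow \Diamond(B \wedge q))$
(Mutual) Correlation | ChainResponse($A, p, B, q$) | The activation is immediately followed by the target. | $\Box((A \wedge p) \Rightarrow \bigcirc(B \wedge q))$
(Mutual) Correlation | RespExistence($A, p, B, q$) | The activation requires the existence of the target. | $\Diamond(A \wedge p) \Rightarrow \Diamond(B \wedge q)$
(Mutual) Correlation | ExclChoice($A, p, A', p'$) | Only one activation condition must happen. | $\llbracket \mathrm{Choice}(A, p, A', p') \rrbracket \wedge \llbracket \mathrm{NotCoExistence}(A, p, A', p') \rrbracket$
(Mutual) Correlation | CoExistence($A, p, B, q$) | RespExistence, and vice versa. | $\llbracket \mathrm{RespExistence}(A, p, B, q) \rrbracket \wedge \llbracket \mathrm{RespExistence}(B, q, A, p) \rrbracket$
(Mutual) Correlation | Succession($A, p, B, q$) | The target should only follow the activation. | $\llbracket \mathrm{Precedence}(A, p, B, q) \rrbracket \wedge \llbracket \mathrm{Response}(A, p, B, q) \rrbracket$
(Mutual) Correlation | ChainSuccession($A, p, B, q$) | The activation is immediately followed by the target, and the target is immediately preceded by the activation. | $\Box((A \wedge p) \Leftrightarrow \bigcirc(B \wedge q))$
(Mutual) Correlation | AltResponse($A, p, B, q$) | If an activation occurs, no other activations must happen until the target occurs. | $\Box((A \wedge p) \Rightarrow \bigcirc(\neg(A \wedge p) \;\mathcal{U}\; (B \wedge q)))$
(Mutual) Correlation | AltPrecedence($A, p, B, q$) | Every activation must be preceded by a target, without any other activation in between. | $\llbracket \mathrm{Precedence}(A, p, B, q) \rrbracket \wedge \Box((A \wedge p) \Rightarrow \bigcirc(\neg(A \wedge p) \;\mathcal{W}\; (B \wedge q)))$
Not. | NotCoExistence($A, p, B, q$) | The activation and the target do not both happen (nand). | $\neg(\Diamond(A \wedge p) \wedge \Diamond(B \wedge q))$
Not. | NotSuccession($A, p, B, q$) | The activation requires that no target condition should follow. | $\Box((A \wedge p) \Rightarrow \neg\Diamond(B \wedge q))$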
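As a reading aid for Table 1, the following minimal sketch checks a few of the templates by scanning a single finite trace directly. It is not how KnoBAB evaluates clauses (the paper runs xtLTLf operators over a relational representation); the names `atom`, `response`, `chain_response` and the example trace are hypothetical, and correlation conditions between activation and target payloads are not modelled.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, Dict[str, float]]   # (activity label, data payload)
Cond = Callable[[Event], bool]         # activation or target condition
Trace = List[Event]                    # finite, non-empty, temporally ordered


def atom(label: str, pred: Callable[[Dict[str, float]], bool] = lambda p: True) -> Cond:
    """Build the condition `label AND pred(payload)` (hypothetical helper)."""
    return lambda e: e[0] == label and pred(e[1])


def init(t: Trace, a: Cond) -> bool:
    return a(t[0])                                  # first event is an activation


def exists(t: Trace, a: Cond, n: int) -> bool:
    return sum(1 for e in t if a(e)) >= n           # at least n activations


def absence(t: Trace, a: Cond, n: int) -> bool:
    return not exists(t, a, n)                      # negation of Exists


def response(t: Trace, a: Cond, b: Cond) -> bool:
    # Every activation has a target at the same position or later.
    return all(any(b(f) for f in t[i:]) for i, e in enumerate(t) if a(e))


def chain_response(t: Trace, a: Cond, b: Cond) -> bool:
    # Every activation has a next event, and that event is a target.
    return all(i + 1 < len(t) and b(t[i + 1]) for i, e in enumerate(t) if a(e))


def resp_existence(t: Trace, a: Cond, b: Cond) -> bool:
    # If an activation occurs anywhere, a target occurs somewhere.
    return not any(a(e) for e in t) or any(b(e) for e in t)


trace: Trace = [
    ("Referral", {"CA15.3": 60.0}),
    ("FollowUp", {"CA15.3": 55.0}),
    ("Mastectomy", {"CA15.3": 52.0, "biopsy": 1.0}),
]
high = atom("Referral", lambda p: p["CA15.3"] >= 50)
surgery = atom("Mastectomy", lambda p: p["biopsy"] == 1)

print(response(trace, high, surgery))        # True
print(chain_response(trace, high, surgery))  # False
```

Each function rescans the trace, which is the cost that the relational and algebraic approach described in the paper aims to avoid.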
Table 2. Definition of the atoms from Figure 2 in terms of partitioning over the elementary intervals.
Referral | CA15.3 < 23.5 | CA15.3 ≥ 23.5
 | $p_1$ | $p_2$
Mastectomy | CA15.3 < 50 | CA15.3 ≥ 50
biopsy = false | $p_5$ | $p_6$
biopsy = true | $p_7$ | $p_8$
FollowUp | CA15.3 < 23.5 | CA15.3 ≥ 23.5
 | $p_3$ | $p_4$
Lumpectomy | CA15.3 < 50 | CA15.3 ≥ 50
biopsy = false | $p_9$ | $p_{10}$
biopsy = true | $p_{11}$ | $p_{12}$
Table 3. Intermediate steps to generate distinct atoms for the Referral data predicates from Example 7.
(a) Interval decomposition in basic intervals $\mu(\mathrm{Mastectomy}, \cdot)$.
$\mu(\mathrm{Mastectomy}, \mathrm{CA15.3})$
Predicate | Basic intervals
CA15.3 < 0 | CA15.3 < 0
CA15.3 < 50 | CA15.3 < 0, 0 ≤ CA15.3 < 50
CA15.3 ≥ 50 | 50 ≤ CA15.3 ≤ 1000, CA15.3 > 1000
CA15.3 > 1000 | CA15.3 > 1000
$\mu(\mathrm{Mastectomy}, \mathrm{biopsy})$
Predicate | Basic intervals
biopsy = 0 | biopsy = 0
biopsy = 1 | biopsy = 1
biopsy ≠ 0 | biopsy < 0, 0 < biopsy < 1, biopsy = 1, biopsy > 1
biopsy ≠ 1 | biopsy < 0, biopsy = 0, 0 < biopsy < 1, biopsy > 1
(b) Atom generation by partitioning the data space $\times_{\kappa \in K}\, \mu(\mathrm{Mastectomy}, \kappa).\mathrm{elementaryIntervals}()$ with $K = \{\mathrm{biopsy}, \mathrm{CA15.3}\}$.
 | biopsy < 0 | biopsy = 0 | 0 < biopsy < 1 | biopsy = 1 | biopsy > 1
CA15.3 < 0 | $p_1$ | $p_2$ | $p_3$ | $p_4$ | $p_5$
0 ≤ CA15.3 < 50 | $p_6$ | $p_7$ | $p_8$ | $p_9$ | $p_{10}$
50 ≤ CA15.3 ≤ 1000 | $p_{11}$ | $p_{12}$ | $p_{13}$ | $p_{14}$ | $p_{15}$
CA15.3 > 1000 | $p_{16}$ | $p_{17}$ | $p_{18}$ | $p_{19}$ | $p_{20}$
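The two steps in Table 3 can be sketched as follows: the elementary intervals of each payload key are combined through a cross product to obtain the atoms, and a data predicate is then rewritten as the set of atoms it covers. This is an illustrative sketch only; intervals are kept as plain strings, and the names `ATOMS`, `MU_CA`, `MU_BIOPSY`, and `atoms_for` are hypothetical rather than part of KnoBAB.

```python
from itertools import product

# Elementary intervals per payload key, in the order used by Table 3(b).
CA = ["CA15.3 < 0", "0 <= CA15.3 < 50", "50 <= CA15.3 <= 1000", "CA15.3 > 1000"]
BIOPSY = ["biopsy < 0", "biopsy = 0", "0 < biopsy < 1", "biopsy = 1", "biopsy > 1"]

# Cross product of the elementary intervals: one atom per cell, numbered
# row by row (CA15.3 varies slowest, biopsy fastest), giving p1..p20.
ATOMS = {f"p{i}": cell for i, cell in enumerate(product(CA, BIOPSY), start=1)}

# Interval decomposition of some original predicates, as in Table 3(a).
MU_CA = {
    "CA15.3 < 50": {CA[0], CA[1]},
    "CA15.3 >= 50": {CA[2], CA[3]},
}
MU_BIOPSY = {
    "biopsy = 0": {BIOPSY[1]},
    "biopsy = 1": {BIOPSY[3]},
}


def atoms_for(ca_pred: str, biopsy_pred: str) -> list:
    """Atoms whose cell lies inside both predicates (their conjunction)."""
    return [
        name
        for name, (ca, b) in ATOMS.items()
        if ca in MU_CA[ca_pred] and b in MU_BIOPSY[biopsy_pred]
    ]


print(len(ATOMS))                               # 20
print(atoms_for("CA15.3 < 50", "biopsy = 1"))   # ['p4', 'p9']
```

Because the atoms are pairwise disjoint, a conjunction of payload conditions becomes a plain disjunction (set) of atoms.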
Table 4. Table of notation for symbols $\chi$ of type $T$ that are either defined as $E$ ($\chi := E$) or characterised by $E$ ($E(\chi)$).
Symbol ($\chi$) | Definition ($E$) | Type ($T$) | Comments
Set Theory
$\emptyset$ | | Set | An empty set contains no items.
$(\preceq, S)$ | | | A partially ordered set (poset) is a relational structure for which $\preceq$ is a partial ordering over $S$ [40]. $\preceq$ over $S$ might be represented as a lattice, referred to as the Hasse diagram.
$\top_S$ | $\forall a \in S.\; a \preceq \top_S$ | $S$ | Given a poset $(\preceq, S)$, $\top_S$ is the unique greatest element of $S$.
$\overline{C}$ | $U \setminus C$ | Set | Complement set: given a universe $U$, the complement returns all of the elements that do not belong to $C$.
$\times_{\kappa \in K} f(\kappa)$ | $f(\kappa_1) \times \dots \times f(\kappa_n)$ | $\mathrm{dom}(f)^{\lvert K \rvert}$ | Generalised cross product for ordered sets $K$ where $\kappa_1 \preceq \dots \preceq \kappa_n$.
$\lvert C \rvert$ | $\sum_{c \in C} 1$ | $\mathbb{N}$ | The cardinality of a finite set indicates the number of contained items.
$\wp(C)$ | $\{T \mid T \subseteq C\}$ | Set | The powerset of $C$ is the set whose elements are all of the subsets of $C$.
XES Model & LTLf
$\Sigma$ | | Set | Finite set of activity labels.
$K$ | | Set | Finite set of ordered (payload) keys $\kappa$.
$V$ | | Set | Finite set of (payload) values.
$p$ | $[\kappa_1 \mapsto v_1, \dots]$ | $V^K$ | Tuple (or finite function) mapping keys $\kappa_1 \in K$ to values $v_1 \in V$.
 | | $V$ | NULL value.
$\sigma^i_j$ | $\langle p, a \rangle$ | $\Sigma \times V^K$ | Event.
$\sigma^i$ | $\sigma^i_1, \dots, \sigma^i_n$ | Sequence | Trace, sequence of temporally ordered events.
$\mathcal{L}$ | $\{\sigma^1, \dots, \sigma^m\}$ | Set | Log, set of traces.
$\beta$ | | $\Sigma \to \{1, \dots, \lvert\Sigma\rvert\}$ | Bijection mapping each activity label to its unique identifier.
$\varphi$ | Equation (1) | Expression | An LTLf expression.
$\Gamma \models \varphi$ | | | Denotes that $\varphi$ is satisfied for the world/environment $\Gamma$.
xtLTLf
$\psi$ | Section 3.2 | Expression | eXTended LTLf Algebra expression.
$A(k)$ / $T(k)$ / $M(h, k)$ | | $\omega$ | Marks associated with activation/target/matching conditions.
$\rho$ | $\langle i, j, L \rangle, \dots$ | $\Omega = \{\wp(\mathbb{N} \times \mathbb{N} \times S) \mid S \in \wp(\omega)\}$ | Intermediate representation returned by each xtLTLf operator.
$T[i]$ | | $T[i] \in T$ | Accessing the $i$-th record of a sequence $T$.
$\Theta(x, y)$ | | Binary Predicate | Correlation condition between activated and targeted events.
$\Theta^{-1}(y, x)$ | $\Theta(x, y)$ | Binary Predicate | Inverted/flipped correlation condition.
True | | Binary Predicate | Always-true binary predicate.
$E^{i}_{\Theta}(M_1, M_2)$ | Equation (S1) | Algorithm 7 | Existential matching condition for which there exists at least one event in $M_1, M_2$ providing a match.
$A^{i}_{\Theta}(M_1, M_2)$ | Equation (S2) | Algorithm 9 | Universal matching condition returning a non-empty set if each event expressed in the maps $M_1, M_2$ provides a match.
$T^{F,i}_{\Theta}(M_1, M_2)$ | Equation (S3) | $T^{F,i}_{\Theta}(M_1, M_2) \in \wp(\omega) \cup \{\mathrm{False}\}$ | Testing functor returning False iff, despite the maps containing activated and targeted events, the matching condition $F^{i}_{\Theta}(M_1, M_2)$ is empty. It returns $F^{i}_{\Theta}(M_1, M_2)$ otherwise.
Pseudocode
 | | | Null pointer or terminated iterator.
Iterator($\rho$) | | Pointer | On non-empty $\rho$, it returns the iterator pointing to the first record in $\rho$.
current(it) | | Dereference | Element pointed to by the pointer/iterator it.
LowerBound($d, b, e, \nu$) | | Binary Search | Given an iterator range from beginning $b$ to end $e$ within a sequential data structure $d$ sorted in increasing order, LowerBound returns either the first location in this range pointing at a value greater than or equal to $\nu$, or $e$ otherwise.
UpperBound($d, b, e, \nu$) | | Binary Search | Given an iterator range from beginning $b$ to end $e$ within a sequential data structure $d$ sorted in increasing order, UpperBound returns either the first location in this range pointing to a value strictly greater than $\nu$, or $e$ otherwise.
Time Complexity
$\epsilon$ | | $\mathbb{N}$ | Maximum trace length.
 | | $\mathbb{N}$ | Maximum length of the third component of the intermediate representation.
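The two binary-search primitives listed under Pseudocode can be illustrated with Python's standard `bisect` module. This sketch assumes the conventional semantics of C++ `std::lower_bound` and `std::upper_bound` (first element not less than, respectively strictly greater than, the searched value), uses integer indices in place of iterators, and takes a hypothetical list of (trace id, event id) pairs as the sorted structure `d`.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Record = Tuple[int, int]  # (trace id, event id), hypothetical record layout


def lower_bound(d: List[Record], b: int, e: int, v: Record) -> int:
    """First index in [b, e) whose value is >= v, or e if there is none."""
    return bisect_left(d, v, b, e)


def upper_bound(d: List[Record], b: int, e: int, v: Record) -> int:
    """First index in [b, e) whose value is > v, or e if there is none."""
    return bisect_right(d, v, b, e)


# Records sorted by increasing (trace id, event id).
records = [(0, 1), (0, 3), (1, 0), (1, 2), (1, 2), (2, 5)]

lo = lower_bound(records, 0, len(records), (1, 2))
hi = upper_bound(records, 0, len(records), (1, 2))
print(lo, hi)            # 3 5
print(records[lo:hi])    # [(1, 2), (1, 2)]

# A value larger than everything in the range yields the end index e.
print(lower_bound(records, 0, len(records), (9, 9)))  # 6
```

Both calls run in logarithmic time over the searched range, which is why operators over sorted intermediate results can skip to the records of a given trace instead of scanning linearly.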
Table 5. Declare templates illustrated as their associated xtLTLf semantics. $S_A$ (and $S_T$) denote the disjunction of collected atoms and activity labels (represented as sets) associated with the activation (and target) condition. The Atomisation Pipeline will return these sets. For declarative clauses that can be directly represented as xtLTLf operators, we might have two different possible operators depending on the atomisation result.
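Exemplifying Clause ($c_l$) | xtLTLf Semantics
Init($S_A$) | $\mathsf{Init}^{A}_{\mathcal{L}}(A)$ if $S_A = \{A\}$, $S_T = \{B\}$ and $A, B \in \Sigma$; $\mathsf{Init}(S_A)$ otherwise (e.g., Atomisation)
Exists($S_A, n$) | $\mathsf{Exists}^{A}_{\mathcal{L}}(A, n)$ if $S_A = \{A\}$, $S_T = \{B\}$ and $A, B \in \Sigma$; $\mathsf{Exists}_{n}(S_A)$ otherwise
Absence($S_A, n+1$) | $\mathsf{Absence}^{A}_{\mathcal{L}}(A, n)$ if $S_A = \{A\}$, $S_T = \{B\}$ and $A, B \in \Sigma$; $\mathsf{Absence}_{n+1}(S_A)$ otherwise
Precedence($S_A, S$) | $\mathsf{Or}_{\mathsf{True}}(\mathsf{Until}(S, S_A), \mathsf{Absence}(S, 1))$
ChainPrecedence($S_A, S_T$) where $\Theta$ | $\mathsf{Globally}(\mathsf{Or}^{\tau}_{\mathsf{True}}(\mathsf{Or}^{\tau}_{\mathsf{True}}(\mathsf{Last}_{\mathcal{L},\tau}, \mathsf{Next}^{\tau}(S_T)), \mathsf{And}^{\tau}_{\Theta}(\mathsf{Next}^{\tau}(S_A), S_T)))$
Choice($S_A, S_A'$) | $\mathsf{Or}_{\mathsf{True}}(S_A, S_A')$
Response($S_A, S_T$) where $\Theta$ | $\mathsf{Globally}(\mathsf{Or}^{\tau}_{\mathsf{True}}(S_A, \mathsf{AndFuture}^{\tau}_{\Theta}(S_A, S_T)))$
ChainResponse($S_A, S_T$) where $\Theta$ | $\mathsf{Globally}(\mathsf{Or}^{\tau}_{\mathsf{True}}(S_A, \mathsf{And}^{\tau}_{\Theta}(S_A, \mathsf{Next}^{\tau}(S_T))))$
RespExistence($S_A, S_T$) where $\Theta$ | $\mathsf{Or}_{\mathsf{True}}(\mathsf{Absence}(S_A, 1), \mathsf{And}_{\Theta}(S_A, S_T))$
ExclChoice($S_A, S_A'$) | $\mathsf{And}_{\mathsf{True}}(\mathsf{Or}_{\mathsf{True}}(\mathsf{Exists}(S_A, 1), \mathsf{Exists}(S_A', 1)), \mathsf{Or}_{\mathsf{True}}(\mathsf{Absence}(S_A, 1), \mathsf{Absence}(S_A', 1)))$
CoExistence($S_A, S_T$) where $\Theta$ | $\mathsf{And}_{\mathsf{True}}(\mathrm{RespExistence}(S_A, S_T) \text{ where } \Theta, \mathrm{RespExistence}(S_A', S_T') \text{ where } \Theta^{-1})$ s.t. $S_A' = S_T$ and $S_T' = S_A$
Succession($S_A, S_T$) where $\Theta$ | $\mathsf{And}_{\mathsf{True}}(\mathrm{Precedence}(S_A, S), \mathrm{Response}(S_A, S_T) \text{ where } \Theta)$ s.t. $S = S_T$
ChainSuccession($S_A, S_T$) where $\Theta$ | $\mathsf{Globally}(\mathsf{And}^{\tau}_{\mathsf{True}}(\mathsf{Or}^{\tau}_{\mathsf{True}}(\mathsf{Or}^{\tau}_{\mathsf{True}}(\mathsf{Last}_{\mathcal{L},\tau}, \mathsf{Next}^{\tau}(S_T')), \mathsf{And}^{\tau}_{\Theta^{-1}}(\mathsf{Next}^{\tau}(S_A'), S_T')), \mathsf{Or}^{\tau}_{\mathsf{True}}(S_A, \mathsf{And}^{\tau}_{\Theta}(S_A, \mathsf{Next}^{\tau}(S_T)))))$ s.t. $S_A' = S_T$ and $S_T' = S_A$
AltResponse($S_A, S_T$) where $\Theta$ | $\mathsf{Globally}(\mathsf{Or}^{\tau}_{\mathsf{True}}(S_A, \mathsf{And}^{\tau}_{\Theta}(S_A, \mathsf{Next}^{\tau}(\mathsf{Until}^{\tau}_{\mathsf{True}}(S_A, S_T)))))$
AltPrecedence($S_A, S_T$) where $\Theta$ | $\mathsf{And}_{\mathsf{True}}(\mathrm{Precedence}(S_A, S_T), \mathsf{Globally}(\mathsf{Or}^{\tau}_{\mathsf{True}}(S_A, \mathsf{And}^{\tau}_{\Theta}(S_A, \mathsf{Next}^{\tau}(\mathsf{Or}^{\tau}_{\mathsf{True}}(\mathsf{Until}^{\tau}(S_A, S_T), \mathsf{Globally}^{\tau}(S_A)))))))$
NotCoExistence($S_A, S_T$) where $\Theta$ | $\mathsf{Not}(\mathsf{And}_{\Theta}(S_A, S_T))$
NotSuccession($S_A, S$) | $\mathsf{Globally}(\mathsf{Or}_{\mathsf{True}}(S_A, \mathsf{AndGlobally}^{\tau}_{\mathsf{True}}(S_A, S_T)))$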
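Table 6. Conjunctive and Aggregation queries for Figure 6.
(a) Metric calculations per trace.
Trace | MAX-SAT | in Conjunctive Query
$\sigma_1$ | $\lvert\{c_1, c_2, c_3\}\rvert / \lvert\mathcal{M}\rvert = 1.0$ | true
$\sigma_2$ | $\lvert\{c_2\}\rvert / \lvert\mathcal{M}\rvert = 1/3$ | false
$\sigma_3$ | $\lvert\{c_1, c_2\}\rvert / \lvert\mathcal{M}\rvert = 2/3$ | false
(b) Metric calculations per clause.
Clause | Support | Confidence
$c_1$ | $\lvert\{\sigma_1, \sigma_3\}\rvert / \lvert\mathcal{L}\rvert = 2/3$ | $\lvert\{\sigma_1, \sigma_3\}\rvert / \lvert\{\sigma_1, \sigma_2, \sigma_3\}\rvert = 2/3$
$c_2$ | $\lvert\{\sigma_1, \sigma_2, \sigma_3\}\rvert / \lvert\mathcal{L}\rvert = 1.0$ | $\lvert\{\sigma_1, \sigma_2, \sigma_3\}\rvert / \lvert\{\sigma_1, \sigma_2, \sigma_3\}\rvert = 1.0$
$c_3$ | $\lvert\{\sigma_1\}\rvert / \lvert\mathcal{L}\rvert = 1/3$ | $\lvert\{\sigma_1\}\rvert / \lvert\{\sigma_1\}\rvert = 1.0$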
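The figures in Table 6 can be reproduced with a few lines of set arithmetic. The sketch below assumes the definitions that the table suggests: MAX-SAT is the fraction of model clauses a trace satisfies, the conjunctive query holds when a trace satisfies every clause, Support is the fraction of log traces satisfying a clause, and Confidence divides the satisfying traces by the traces in which the clause is activated. The `satisfied` and `activated` sets are read off the table's numerators and denominators; all identifiers are hypothetical.

```python
from fractions import Fraction

LOG = ["s1", "s2", "s3"]  # traces sigma_1..sigma_3

# Traces satisfying each clause (numerators in Table 6b).
satisfied = {"c1": {"s1", "s3"}, "c2": {"s1", "s2", "s3"}, "c3": {"s1"}}
# Traces in which each clause is activated (Confidence denominators).
activated = {"c1": {"s1", "s2", "s3"}, "c2": {"s1", "s2", "s3"}, "c3": {"s1"}}


def max_sat(trace: str) -> Fraction:
    """Fraction of the model's clauses that the trace satisfies."""
    return Fraction(sum(trace in s for s in satisfied.values()), len(satisfied))


def conjunctive(trace: str) -> bool:
    """True iff the trace satisfies every clause of the model."""
    return all(trace in s for s in satisfied.values())


def support(clause: str) -> Fraction:
    """Fraction of log traces satisfying the clause."""
    return Fraction(len(satisfied[clause]), len(LOG))


def confidence(clause: str) -> Fraction:
    """Satisfying traces over traces where the clause is activated."""
    return Fraction(len(satisfied[clause]), len(activated[clause]))


for t in LOG:
    print(t, max_sat(t), conjunctive(t))
for c in satisfied:
    print(c, support(c), confidence(c))
```

Using `Fraction` keeps the ratios exact, so the output matches the table entries (1, 1/3, 2/3) without rounding.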
Table 7. Range of datasets used for benchmarking.
Competitor | Dataset | Traces ($\lvert\mathcal{L}\rvert$) | Events | Distinct Activities ($\lvert\Sigma\rvert$)
SQL Miner | BPIC 2011 (original) | 1143 | 150,291 | 624
SQL Miner | BPIC 2011 (10) | 10 | 2613 | 158
SQL Miner | BPIC 2011 (100) | 100 | 12,195 | 276
SQL Miner | BPIC 2011 (1000) | 1000 | 133,935 | 607
Declare Analyzer | BPIC 2012 (original) | 13,087 | 262,200 | 24
Table 8. Proposed operator semantics vs. traditional.
Operator | LTLf Rewriting | Optimised
Choice | $\mathsf{Or}_{\Theta}(\mathsf{Future}(\rho_1), \mathsf{Future}(\rho_2))$ | $\mathsf{Or}_{\Theta}(\rho_1, \rho_2)$
TimedAndFuture | $\mathsf{And}_{\Theta}(\rho_1, \mathsf{Future}^{\tau}(\rho_2))$ | $\mathsf{AndFuture}^{\tau}_{\Theta}(\rho_1, \rho_2)$
TimedAndGlobally | $\mathsf{And}_{\Theta}(\rho_1, \mathsf{Globally}^{\tau}(\rho_2))$ | $\mathsf{AndGlobally}^{\tau}_{\Theta}(\rho_1, \rho_2)$
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Bergami, G.; Appleby, S.; Morgan, G. Quickening Data-Aware Conformance Checking through Temporal Algebras. Information 2023, 14, 173. https://doi.org/10.3390/info14030173