1. Introduction
This paper deals with the increasingly important topic of the discovery of frequent temporal patterns when given as input a set of symbolic time intervals, i.e., time periods over which hold one or more propositions, such as, in the medical domain, “The dose of the medication was High” or “The blood pressure was Low”, and the temporal relationships among these periods. The discovered temporal patterns can then be exploited for clustering, classification, and prediction.
Analyzing time-oriented, multivariate clinical data enables researchers to discover new temporal knowledge and gain understanding regarding the temporal behavior and temporal associations of these data [
1,
2,
3,
4,
5,
6,
7,
8,
9]. The main methods for the discovery of new knowledge in longitudinal multivariate data include multiple
Temporal Data Mining (TDM) approaches, although an alternative,
Business Process Mining (BPM) approach has also been used successfully when the data describe actual activities and processes; in such cases, temporal relations at a finer level of resolution are often less emphasized [
10,
11,
12,
13]. Unlike most TDM methods, which typically focus mainly on the analysis of the
raw time-
stamped data, the use of
symbolic time intervals can reduce inherent random noise in the data, avoid problems resulting from different sampling frequencies and at various temporal granularities, and often alleviate the problem of missing data [
3,
7,
9,
14,
15]. Thus, to significantly enhance the capabilities for analysis of time-stamped data, a preprocessing step of meaningful summarization and interpretation of the time-stamped raw data (e.g., a series of hemoglobin values) into a set of interval-based abstractions or symbolic time intervals (e.g., periods of moderate anemia), known as
temporal abstractions, can be used [
16,
17,
18,
19,
20]. The resulting interval-based summary can have multiple uses, such as to create natural-language free-text summaries (e.g., discharge letters) of large numbers of digital time-oriented clinical data [
21], to visualize the data of individual patients [
22], or to interactively explore associations among the time-oriented data and their abstractions [
2].
Once a set of interval-based abstractions of the time-stamped raw data exists, a set of [sufficiently] frequent temporal patterns, incorporating these symbolic time intervals as components, can be discovered. We refer to these patterns as
time-interval-relation patterns (
TIRPs) [
3]. Within TIRPs, all of Allen’s seven basic temporal relations and their respective inverse relations [
23] might hold, such as Before, During, Overlaps, Finishes, etc.
Recently, time-interval patterns have been increasingly used as features to classify multivariate temporal data [
1,
20,
24,
25,
26,
27]. Using that approach, the [sufficiently] frequent interval-based temporal patterns are used as the base features to induce a classifier. Furthermore, we have shown in an earlier study that frequent TIRPs can be consistently discovered, and in similar proportions, in different subsets of the same data set within three different medical domains, especially as the minimal threshold for frequency is raised, thus increasing their value for potential classification and prediction tasks [
28]. This repeated discovery suggests that discovered TIRPs might indeed be good candidates for use as classification or prediction features.
However, our work in multiple clinical domains had suggested that many of the discovered frequent temporal patterns, although correct from the purely syntactic aspect, do not conform to the basic semantics of medical experts, who often assume a certain type of temporal adjacency among the temporal-pattern’s components. Thus, such patterns are not transparent characterizations of the data. This semantic temporal adjacency, which domain experts seem to implicitly assume, means that no instance of the pattern includes any [additional] intermediate intervals within the scope of the pattern, which might contradict a potential interpretation of causality, or at least direct temporal association, among the pattern’s components. Our new principle injects semantics into what are usually purely syntactic algorithms for discovering frequent temporal patterns in large data sets and demonstrates their effectiveness for classification and prediction. For example, this can be achieved by pruning away most of the potential machine-learning temporal features, without losing any classification or prediction accuracy.
An example of the idea at the core of our new semantic principle, which exploits also the semantics of the symbolic interval-based predicates and not just their temporal relations, is the discovery of the following frequent temporal pattern: <“A period during which a High dose of the medication” occurs
Before “A period during which the Hemoglobin level was Low”> (instance No. 1 in
Figure 1). A domain expert might assume that perhaps there is a causal association between the two symbolic intervals since they frequently seem to be found together in this specific order.
However, what if, after examining the symbolic intervals abstracted from the raw data, the expert finds that between these two symbolic intervals, there often exists, within the patient’s original longitudinal record, an additional interval, during which a
High level of Hemoglobin exists (Instance No. 2a in
Figure 1). Alternatively, what if this expert finds one or more instances of the pattern in which, between the two components defining it, there exists in the patient’s record an interval during which dose of the medication was actually
Low (instance No. 2b in
Figure 1)?
Although technically the original temporal relation still holds, its significance now, from a medical expert’s point of view, might change considerably. That would be the case whether the discovered frequent pattern is used for human explanatory purposes, or for succinctly summarizing the data, or for machine-learning purposes.
As we formally define in the Methods Section, we distinguish a TIRP (an abstract pattern with certain temporal qualitative constraints) from its TIRP
Instances, which are found in the longitudinal records that are being analyzed. A TIRP is
frequent if the proportion of the records within which its TIRP instances are discovered is higher than some threshold. However, within the temporal scope of some of the instances of frequent TIRPs, there might exist, in some of the patients’ longitudinal records, additional symbolic intervals (which are not part of that instance), as shown in
Figure 1, which seem to contradict the TIRP’s intuitive semantics.
Note that experts, especially medical experts, often expect a meaningful frequent temporal pattern to convey some potential causal relationship, such as a High dose of a medication reducing the level of Hemoglobin, in the case of the temporal pattern depicted in
Figure 1, Instance #1 (SAC-obeying TIRP). The fact that the true state of affairs is such that it rules that possibility out, as in the case of Instance #2, since in the patient’s record there exists a High-Hemoglobin period after the administration of the medication, would not be expected by a clinician when hearing the description of the frequently discovered pattern as “<Medication-dose-level = High> Before <Hemoglobin [HGB]-level = Low>”. Nor would this clinician expect that there might be an additional episode of medication administration, but with a Low dose, before the High Hemoglobin value. Given that description, from the point of view of a clinician, Instance #2 and Instance #3 lack
semantic coherence.
In the current study, we are not exploring the purely psychological issue of the potential lack of transparency, to medical experts, of different temporal-pattern semantics. (Although such a lack might considerably reduce, for example, the patterns’ explanatory value, their data-summarization value, and the efficacy of the experts in suggesting additional patterns to explore). What we do conjecture in this study concerns a purely quantitative functional issue: We believe that such “semantically incoherent” patterns, beyond being potentially less transparent to medical experts, might also be less useful, and perhaps even unnecessary, as classification and prediction features for a machine-learning process, precisely due to their potential lack of semantic coherence. At the same time, discovering such redundant, “semantically incoherent” temporal patterns might require significant effort during the discovery time, as well as during classifier-induction time, without enhancing the accuracy of the resultant classifiers.
Of course, what precisely is and is not
semantically coherent within a complete multiple-interval temporal pattern needs to be carefully and formally defined, as we do in
Section 3. We shall see that in fact several options exist for exploiting the basic semantic intuition demonstrated in
Figure 1, depending on which constraints exactly must hold in such a temporal interval triad, so as to comply with the notion of semantic coherence.
Although several of the earlier studies have noticed a potential redundancy during pattern discovery (in particular, the discovery of patterns containing repeating symbolic intervals as components, such as discovery of the pattern AAB in addition to the discovery of the pattern AB), or even considered patterns characterized by the
absence of certain symbols [
29], they have only considered the issue from a purely
computational point of view (i.e., the complexity of the temporal-pattern discovery process) [
1,
3]. For example, no attention was paid to the relationship between pattern components denoted by
different symbols, each of which represents a different proposition, which in a medical domain’s ontology might in fact represent different values of the
same concept. For example, both of the symbols “
Low blood pressure” and “
Hypertension” (i.e., High or Very-High values of the blood pressure) are propositions that assign different values to the same concept, namely, the concept that denotes the abstraction of the raw-data Blood Pressure measurement concept into a discrete symbolic value. In contrast to these studies, in the current study we shall refer to that potential problem from a
semantic point of view (i.e., the potential
meaning of the discovered pattern and of each of its components), as well as from a
functional point of view (i.e., the implications for the
effectiveness of the classification).
Thus, in the current study, we explored the application of
semantic considerations to symbolic time-intervals mining, and to classification and prediction tasks, in medical domains. The current study complements and significantly extends into new grounds a previous study of ours in which, in addition to the main methods, we had very briefly introduced, as a secondary method, the SAC principle and had shown, among other related results, that pruning discovered temporal patterns in clinical data using the SAC constraint enhanced the consistency of discovering the same temporal patterns in similar rates in similar patient populations [
28]. The current study examines, for the first time, through the use of several machine learning methods, the implications of the use of the SAC principle for the tasks of classification and prediction, and the exploitation of the SAC principle for a considerable reduction of the number of temporal-pattern features used by these machine-learning methods, without any loss of performance.
As we shall see, the use of domain-specific semantics (which is explained in detail in
Section 3.3) can constrain the discovery of temporal patterns in symbolic time intervals data to only those patterns that include certain semantically meaningful relations amongst the symbolic time intervals of which they are composed, in the sense of not violating certain semantic constraints that we have formally defined. Note that
no new temporal patterns are discovered; rather, a large number of candidate patterns are
pruned [filtered] out during the discovery process. Thus, our main contribution in this study, beyond introducing, for the first time, a highly detailed and formal definition of the SAC principle and several of its variations, is the rigorous evaluation for classification and prediction purposes of a new pruning constraint for mining time intervals, the
Semantic Adjacency Criterion (SAC). In fact, we have defined and explored three versions of the SAC criterion.
Consequentially, our core double-pronged hypothesis in this study is that:
- (a)
It is more efficient, during the temporal data mining process, to discover only semantically coherent patterns [coherent in a sense that we shall formally define].
But nevertheless,
- (b)
Imposing such semantic constraints, leading to the discovery of a significantly smaller set of patterns, will not cause any harm with respect to the performance of the discovered patterns as features for classification or prediction purposes, and might even enhance that performance.
As our current study demonstrates, using any of the SAC versions results in the discovery of temporal patterns whose overall cardinality, as well as time needed for discovery, are smaller by at least an order of magnitude than the respective resulting cardinality and required running time when not using the SAC constraint; but that whose value to classification and prediction tasks that use the discovered patterns as features is at least as good as the original full set of discoverable patterns.
The outline of this paper is the following:
Section 2 provides the necessary background for the rest of the paper.
Section 3 describes our computational framework, including the use of temporal abstractions, the discovery of TIRPs, and the formal definition of the three SAC versions.
Section 3.9 describes the evaluation and the experiments we performed to assess the effect of using the three SAC versions to discover frequent temporal patterns and exploit them as features for classification and prediction purposes, using four different classifier-induction methods within each of three different clinical domains.
Section 4 describes the results of our empirical evaluation.
Section 5 discusses the results.
Section 6 presents the main conclusions of this study.
3. Materials and Methods
We start by first defining the basic interval-based TDM terminology. We then describe briefly the high-level overview of the general TDM algorithm we chose to deploy for this study, before formally introducing the SAC criterion and its semantics and several versions.
3.1. The Time Intervals Mining Process
To formally define the problem of mining symbolic time intervals, and to better comprehend how the SAC constraint can be introduced into an interval-based frequent-pattern discovery algorithm, we present several basic definitions used by common TDM algorithms, such as by the KarmaLego frequent-pattern discovery algorithm [
3], which we are also using within the evaluation.
We define a symbolic time interval, I = <s, e, sym>, as an ordered pair of time points, start time (s) and end time (e), and a symbol (sym) which represents one of the domain’s symbols from the domain-specific set Ṡ. A non-ambiguous lexicographic TIRP P = {Ḯ, Ṙ} is defined as a set Ḯ of k symbolic time intervals (I1…Ik) ordered lexicographically by start time, then end time, then symbol, and a set Ṙ of all of the pairwise temporal relations among each of the (k2 − k)/2 pairs of symbolic time intervals in Ḯ. Note that each TIRP is an abstract temporal pattern that represents a class, or a set, of specific instances of that TIRP (The input interval-based database is also ordered lexicographically).
The goal of discovering frequent TIRPs is to discover [sufficiently] frequently occurring abstract patterns within the instances of the given database; each pattern instance found in the database is a TIRP instance.
The fact that the database is ordered lexicographically enables us to use only seven of Allen’s temporal relations (or even the three abstract relations) as defined in
Section 2.2.
Figure 2 presents a typical TIRP, represented as a half-matrix of temporal relations.
The vertical support for a TIRP is defined as follows: given an input database of |E| distinct entities (e.g., different patients), each represented as a set of symbolic time intervals (in which one or more symbols might repeat over different symbolic time intervals), the vertical support of a TIRP P that was discovered in the input data is denoted by the cardinality of the set of distinct entities (within which at least one instance of P was discovered) divided by the cardinality of |E|.
When a TIRP has a vertical support above a given minimal predefined threshold
min_ver_sup, it is referred to as
frequent. Furthermore, the
horizontal support of a TIRP
P for an entity e (e.g., a single patient’s record), hor_sup(P, e), is the number of instances of P found in e. Accordingly, we also define the mean duration of the supporting instances of the same TIRP P within an entity e as the average time of the instances of P within a specific entity e, from its earliest start-time to its last end-time. Thus, given a minimum vertical support
min_ver_sup, the goal of the mining task is to find the complete set of the frequent TIRPs, including all of their supporting instances vertically and horizontally [
26].
The KarmaLego algorithm [
3], which we chose to use in this study’s evaluation to discover frequent temporal patterns due to its proven efficiency comparing it to several of the best-known alternatives (and it is provably complete in its discovery of all frequent TIRPs [
26]), consists of two main phases. The first phase is called Karma, in which all of the frequent two-sized TIRPs, having two symbolic time intervals
and
and a temporal relation
r among them, are discovered and indexed. In the second phase, called Lego, a recursive process extends the frequent two-sized TIRPs, referred to as
, through efficient candidate generation, into a tree of longer frequent TIRPs consisting of conjunctions of the two-sized TIRPs that were discovered in the Karma phase. The final output is an enumeration tree of all the frequent TIRPs discovered in the given database.
3.2. Adding Semantic Considerations to Time Intervals Mining
As explained in
Section 1, there is a potential drawback inherent in the interval-based pattern mining task. Many of the discovered patterns are
syntactically true, but
semantically misleading. Addressing this problem requires the addition of semantic considerations to the time intervals mining task.
3.3. The Semantic Adjacency Criterion
Since frequent TIRPs mining algorithms generate
all of the feasible TIRP candidates and search for them in the data, certain discovered TIRP instances (e.g., a period of High-dose medication of a certain type, is followed by a period of Low blood pressure), although
syntactically accurate, i.e., corresponding to a TIRP formal definition, might not represent in a transparent fashion the common-sense semantics that a domain expert might assign to the real data (see
Figure 1). Thus, many of the discovered TIRPs might not be sufficiently transparent to the expert. We shall now explore this observation in depth.
Recall that a symbol, and in particular a symbol that holds during the duration of a symbolic time interval, is composed of a [raw or abstract] concept and its value (
Section 3.1). In
Section 1, we presented a frequent TIRP that might be discovered (
Figure 1), “<Medication-dose = High> occurs before <HGB-level = Low>”. The TIRP seems to imply that administering a medication at a High dose is often [temporally] followed by a Low value of the Hemoglobin-value abstract concept (which is a State abstraction of the HGB-value raw-data concept; see
Section 2.1). Such an association is not necessarily causal, of course, but it certainly might be, and justifies additional exploration.
To facilitate our discussion, for each symbol we will refer to its two components, the
concept and its
value, following purely for consistency and clarity reasons the KBTA theory’s nomenclature [
16] (see
Section 2.1). In the example we just discussed, the
state abstractions “Medication-dose-level” and “HGB-level” are the [abstract]
concepts, and “High” or “Low” are their
values. A concept can only have one value at any point in time; different values during the same time are considered mutually exclusive and therefore
contradictory. However, the TIRP shown in abstract fashion in
Figure 2 might in fact include, within its supporting instances group [for a given database] that defines its
vertical (or
horizontal)
support (see
Section 3.1), instances that include, somewhere within their overall temporal scope (although not at the same time), symbolic time intervals that represent values that are semantically
contradicting (see explanation below) to those appearing in the formal TIRP definition. Two such cases were shown [as the grayed-out symbolic intervals of Instances #2a and #2b] in
Figure 1. Contradictions are instances in which, between two of the TIRP’s symbols, there is a symbol, composed of a concept and a value, such as, in this case, “HGB-level = High” (or “Medication-dose-level = Low”), in which the
concept (which implicitly includes its abstraction type, such as State or Gradient, and its context, such as gender = Female) is
identical to the concept of either of these two symbols, but its
value is
different. Either of these contradictory associations (with the first or the second symbols) might change, and in this case even reverse, the semantic meaning of the original temporal association, since it now seems that a
High medication-dose level might be actually associated with a
High Hemoglobin value (or conversely, that a
Low medication-dose level is associated with
Low hemoglobin level).
Note that even if the meaning of the intermediate symbol does not directly contradict the meaning of either of the temporal relation’s symbolic time intervals, for example if both its concept name and abstraction type and its concept’s value are the same as the concept name, type, and value of one of the two symbols, it might nevertheless change the overall pattern’s semantics, or might simply be redundant. Thus, we might not wish to encounter even a copy of one of these two symbolic intervals between them.
It is important to note at this point that most time intervals mining algorithms, including KarmaLego, do not consider the semantics of the deeper structure of the symbols that hold over the symbolic time intervals, and thus view them as a single, non-decomposable symbol
sym. However, as we have explained in
Section 2.1, these symbols are in fact typically composed of a
concept of some type and its
value (e.g., the abstract concept “HGB-level” and its value “High”). Furthermore, using Shahar’s KBTA ontology [
16] (see
Section 2.1), an abstract concept would include also the
abstraction type (e.g.,
State) and usually also a
context (e.g., gender = Female). For example, the following predicate is an example of an abstraction that might hold for some patient: “The
State of the
HGB-value, in the context of a
Female, is
High”. (Note that the definition of the value
High might vary in different contexts.). We often refer to the full representation of the concept embodied in each symbol (which typically also includes an abstraction type and a context) as its
semantic type. In the case mentioned above, the semantic type would be “The
State of the
HGB-value, in the context of a
Female”. The value of that semantic type (i.e., of the symbol’s concept), would be “
High”.
Definition 1. Symbolic time intervals
and are of the same semantic type if each of the two symbols that hold over and represents some value for the same concept (which, as we recall, might include also a temporal abstraction type and a context). We denote that equivalence by the notation .
The semantic types (as defined above) of the symbols that might hold over all symbolic time intervals need to be pre-defined and used throughout the TIRPs discovery process. Such semantic types and the set of the allowed values for each abstraction of each concept in each context are a part of each domain’s temporal-abstraction ontology [
16] (see
Section 2.1). Thus, there might be, for example, exactly five values for the state abstraction of blood-glucose values in the context of a patient who has diabetes, ranging from hypoglycemia to hyperglycemia. Our intent is to discover only TIRPs in which adjacent symbolic time intervals are semantically coherent, i.e., their symbols are composed of concepts and values that fulfil a new, semantically oriented criterion, the
Semantic Adjacency Criterion (SAC).
A Semi-formal Definition: The SAC guarantees that between two symbolic time intervals within a TIRP, there can exist no other symbolic time interval of the same semantic type as either of the two symbolic time intervals.
In particular, over such an intermediate “forbidden” symbolic interval there cannot hold a symbol whose semantic type, i.e., its conceptual aspect (e.g., Hemoglobin-State in a Female, or the State of the Medication-dose), is the same as the semantic type of one of the two symbolic time intervals, but with a different value (e.g., a LOW value of the Hemoglobin State abstraction, instead of a HIGH value; or a LOW value of the Medication-dose State abstraction). Our motivation in defining and using the SAC is that symbolic time intervals that appear between a pair of two other symbolic time intervals, but are of the same semantic type as that of one of the two symbolic intervals, and in particular, those that contradict the value of one of the symbols that hold over the two symbolic time intervals, are not easily understandable to a domain expert and thus, the discovered TIRP might not really represent the associations the expert expects to find in the data.
The expert will be notified that the temporal data mining algorithm discovered a frequent relation such as “A before B”, as in the case of “a high medication-dose before a low value of the hemoglobin level”, not realizing that there might be another symbolic time interval between them that contains a concept of the same type as that of A or B (i.e., a medication dose, or a hemoglobin level), but with a similar, or even a different value.
Instances of such a potentially contradictory intermediate symbolic time interval might also interfere with the learning (training) phase of an algorithm that induces a classifier, and reduce its classification power, which relies on features that are TIRPs that were discovered while using the (potentially deceptive) temporal relation between that pair of symbolic time intervals. The reason is that the real meaning of that temporal relation might change in a radical fashion, depending on what other symbolic intervals exist between the two members of the pair, for some of the TIRP’s supporting instances. Thus, detecting in the patient’s record an instance of Low blood pressure, and that two weeks ago she had taken a High dose of a medication that is similar to the medication that was taken yesterday at a Low dose, i.e., a temporal pattern that syntactically also complies with the temporal relation of being before a current instance of Low blood pressure, might potentially mislead the classifier looking for the temporal pattern “High medication-dose before Low blood pressure” as a feature. However, a medical domain expert analyzing the same data set might consider as meaningful only the last instance of the medication administration before the blood pressure measurement.
We conjecture that using the SAC constraint, in addition to the discovery of only semantically meaningful patterns, will also significantly reduce the number of potential frequent TIRPs to consider, thus reducing the computational requirements of TIRPs discovery.
The SAC constraint was inspired by the temporal interpolation mechanism from the KBTA methodology theory [
16]. The
temporal interpolation mechanism uses, in each domain, a domain-specific
interpolation function (found as part of that domain’s temporal-abstraction ontology). The interpolation function is provided, as input, with two symbolic time intervals, both of which hold similar temporal abstraction types (see
Section 2.1) (e.g., Gradient) of the same concept (e.g., HGB level), such as two Increasing HGB-level periods, each lasting for two weeks, with a gap of one week between them, and that returns an abstraction interpreted over an interval that joins the two intervals while bridging the gap between them, i.e., “five weeks of Increasing HGB level”. The temporal interpolation mechanism [
16] allows for a certain value-sensitive and context-sensitive maximal time gap to be bridged between the two symbolic time intervals.
It also ascertains that the values of the symbols that hold over the symbolic time intervals within the gap do not contradict in any way the values of the symbols that hold over the two symbolic intervals that are to be joined.
Figure 3 closely examines the possible contradictions that might be hidden within a two-sized TIRP, thus revealing several possible versions of the SAC. For example, in the simple case of sequential data mining, a TIRP can be treated as a sequence of symbols (considering the
before and
meets relations only). The symbol that holds over the intermediate symbolic time interval cannot represent the same concept (i.e., have the same semantic type) as one of the two symbolic intervals, even if it has the same value.
3.4. The Sequential Semantic Adjacency Criterion: A Formal Definition
Using this simple version of the SAC constraint, any two
successive symbolic time intervals of each temporal relation pair within a TIRP, when the symbolic intervals of the TIRP are ordered
lexicographically in a canonical fashion, by start time, then end time, then symbol (see
Section 3.1), must be temporally adjacent in the sense of our semi-formal definition. That is, no symbolic time interval whose symbol has a semantic type equal to one of them can exist between them. The two symbols of the successive symbolic intervals themselves might include the same concept. This version of SAC is called the
Sequential Semantic Adjacency Criterion.
Definition 2. The Sequential Semantic Adjacency Criterion (Sequential SAC or SSAC) holds over a TIRP P = {Ḯ,Ṙ}, where Ḯ=
iff: (Note that; relaxing this constraint and allowing for relations such as Contains or Overlaps leads to additional SAC versions that we introduce later.) However, note that considering only successive symbolic time intervals within the TIRP definition when using the lexicographic ordering ignores the existence of all of Allen’s temporal relations [
23] and stays within the stricter limits of the sequential data mining approach. We shall demonstrate this observation with an example.
Figure 4 considers the TIRP “<Medication-dose = High>
before <HGB = High>
before <HGB = Normal>”; the relationship between the medication administration time interval and the third time interval is also
before: “<Medication-dose>
before <HGB = Normal>”, but the two symbolic time intervals participating in this relation are not
successive in the SSAC sense, since they do not follow each other in the lexicographic ordering.
Thus, in the Sequential SAC version, only the first two temporal relations of this particular TIRP’s definition will be checked for the existence of any semantically contradictory (in the SAC sense) symbolic time interval instances between them. However, other semantically contradictory symbolic time interval instances might exist between the first and the third symbolic time intervals, which will
not be checked using the SSAC. (For example, a Low medication dose within the dashed interval of
Figure 4 would not contradict the relation “<Medication-dose = High>
before <HGB = Normal>”, since the components of that relation do not follow each other in the canonical lexicographic-ordering representation of this particular TIRP.)
Note two interesting insights regarding the SAC principle: First, this version of SAC does allow for the discovery of what we refer to as “
Symbolic Gradient” temporal patterns, e.g., Decreasing or Increasing values of the State abstractions of the same raw-data concept that hold over successive symbolic time intervals, such as a gradual decrease in the value of the Hemoglobin-State concept (see
Figure 5). The reason is that the constraint will only be checked between every two successive symbolic time intervals, and these can be of the same type. Second, SSAC also allows for the discovery of what we refer to as “Counting” temporal patterns. “
Counting” temporal patterns are patterns that include some finite repetition of instances of a symbolic interval as part of the pattern. For example, the repeating of two or more successive Low Hemoglobin State abstraction values (see
Figure 5). Such TIRPs might serve as useful features in certain domains.
We have also defined two additional versions of SAC that capture slightly different semantics and enable different levels of expressivity of TIRPs that can be discovered.
3.5. The Conservative Semantic Adjacency Criterion (CSAC)
The first additional SAC version that we examine considers every pair of symbolic time intervals within the TIRP definition. While the previous version considered only successive symbolic time intervals, when the TIRP is represented in a canonical fashion using a lexicographic ordering, we might want to consider all of the temporal relations within the TIRP’s definition (thus probably reducing even more the potential vertical support for such a TIRP and, correspondingly, the number of different TIRPs discovered in the output).
Allowing the SAC constraint to hold over all of the TIRP’s pairwise relations enables us to discover, in the input interval-based data, only true SAC-obeying TIRPs, in the most restrictive interpretation. Thus, unlike the SSAC, this version of SAC will not allow for the discovery of any “Symbolic Gradient” temporal patterns or of any “Counting” temporal patterns at all. We refer to this version of SAC as the Conservative Semantic Adjacency Criterion.
Definition 3. The Conservative Semantic Adjacency Criterion (Conservative SAC or CSAC) holds over a TIRP P = {Ḯ,Ṙ}, where
Ḯ =
iff: 3.6. The Liberal Semantic Adjacency Criterion
The second additional version is a special variation of the conservative version, which considers the SAC for all of the temporal relations within the TIRP definition, but enforces it only between symbolic time intervals that represent different semantic types.
This new SAC version will allow for the discovery of “Symbolic Gradient” temporal patterns and of “Counting” temporal patterns. This is achieved since it allows for the discovery of patterns that involve successive symbolic time intervals whose symbol includes the same concept, such as A3A2A1A1A1, where each Ai represents some value of the concept of type A. In some cases, this expressivity might be useful. This version of SAC is called the Liberal Semantic Adjacency Criterion.
Definition 4. The Liberal Semantic Adjacency Criterion (Liberal SAC or LSAC) holds over a TIRP P = {Ḯ,Ṙ}, where
Ḯ = iff: However, unlike the Sequential SAC version, and certainly the Conservative SAC, the Liberal SAC constraint
does allow for some potential semantics that might not be intuitive to domain experts. For instance, between the two LOW values of the Hemoglobin State abstractions in the TIRP displayed in
Figure 5, the LSAC allows for the potential existence of a “hidden” HIGH Hemoglobin State value, because the constraint is not enforced between symbolic time intervals that represent the
same semantic types.
3.7. The Computational Implications of Enforcing the SAC
Using the SAC, in addition to enabling the discovery of only patterns whose semantics enforce a stricter interpretation of the data relationships, is potentially also more functional and efficient, i.e., classification and prediction algorithms might benefit from features that represent more “reliable” temporal patterns and might also enable us to compute them within a briefer time span. Recently, there has been an increasing use of TIRPs as features for classification and prediction tasks [
19,
24,
26,
27,
48], in which we would like to examine the potential contribution of the SAC.
The SAC is a stricter selection criterion for TIRPs discovery algorithms and many TIRP candidates might not be generated. Thus, our first hypothesis is that using the SAC (including the several versions we proposed) to discover TIRPs will generate fewer TIRPs than when not using the SAC for any given minimal vertical support, which is expected to lead to a shorter run time for the discovery phase of the TIRPs.
However, due to the semantic coherence of the discovered TIRPs, i.e., their more uniform meaning, our second hypothesis is that we expect that, nevertheless, the resultant [smaller] set of TIRPs, discovered using the SAC, will still induce a classifier that has the same or better classification and prediction performance, given the same minimal vertical support threshold for the TIRPs as features.
3.8. Adding the SAC Constraint to the KarmaLego Algorithm
The SAC is a highly general criterion, equally applicable to any time intervals mining algorithm, as well as sequential mining algorithms. However, we needed to assess it within a concrete framework. We decided to evaluate the SAC within the KarmaLego framework (see
Section 3.1). Finally, the algorithm structure’s natural modularity, composed of the Karma and Lego steps, greatly facilitated our task of integrating any SAC version.
To implement the SAC version within the KarmaLego algorithm, we first added the basic SAC pruning constraint to the Karma phase. That is, only two-sized TIRPs that obeyed the basic semi-formal SAC constraint (i.e., of not having any symbolic time interval between them, over which holds a symbol of the same semantic type as either of the two members of the potential two-sized TIRP) were added to the two-sized TIRP enumeration tree.
Then, during the Lego phase, given each SAC version to be applied, we decided which pairs of symbolic time intervals needed to be checked against the data when extending the TIRP from size k to size k + 1:
In the case of the SSAC version, we checked the constraint only between the (lexicographically ordered) kth and the new kth + 1 symbolic time intervals.
In the case of the CSAC version, we checked the constraint between the 1st, 2nd…, and the (lexicographically ordered) kth and the new kth + 1 symbolic time intervals.
In the case of the LSAC version, we performed a similar procedure to that used to enforce the CSAC constraint, but only for pairs of symbolic time intervals over which hold symbols of different semantic types.
Appendix A contains a short pseudo-code of the SAC implementation with the particular frequent TIRP-discovery algorithm that we had selected, the KarmaLego algorithm.
3.9. Evaluation
To demonstrate our methods, we decided to use a highly efficient interval-mining algorithm that was recently introduced by the authors, called KarmaLego [Moskovitch and Shahar, 2015a], as our means for discovering TIRPs. However, the semantic enhancements that we introduced into KarmaLego are quite general. We measured the number of discovered TIRPs, the runtime, and the performance of the TIRPs when used as features for several classification and prediction tasks.
We evaluated the runtime of the KarmaLego algorithm and the number of discovered TIRPs with the different SAC versions. Given our informal hypotheses (see
Section 3.3), which are based on reasonable arguments, but which need empirical verification, our specific research questions were:
Does using SAC indeed reduce the discovery runtime, compared to not using it? Which of the three SAC versions requires the shortest runtime?
Does using SAC indeed reduce the number of discovered TIRPs, compared to not using it? Which of the three SAC versions results in the smallest number of TIRPs?
Does using SAC maintain the classification and prediction performance, compared to not using it? Which of the SAC versions is best for classification and prediction?
This evaluation was performed across different state abstraction or discretization methods (KB, EWD, SAX, and TD4C-KL, which uses the Kullback–Leibler distance measure as explained in
Section 2.1), each with three bins, different temporal relation sets (the three abstract temporal relations mentioned in
Section 2.2, and the full set of Allen’s seven temporal relations), and various minimal vertical support thresholds to measure the number of discovered TIRPs.
In addition, for the purpose of the classification and prediction tasks, we evaluated different TIRP feature-representation methods (Binary, Horizontal Support, and Mean Duration) (as discussed in
Section 2.3 and
Section 3.1) and four different classification algorithms (Random Forest, Naïve Bayes, Support Vector Machines [SVM], and Logistic Regression). We expand on our classification-performance evaluation methods in
Section 3.9.2.
3.9.1. The Data Sets
To evaluate the effect of using the SAC versions on the results of the TIRP discovery process and on the eventual performance of the models induced for classification and prediction, we used three clinical data sets.
The data sets included: (1) an oncology data set from the Rush Medical Center, Chicago, USA, including patients who had undergone either allogeneic or autologous bone-marrow transplantation; (2) a hepatitis data set describing patients who had either Hepatitis B or C, which is from a KDD conference challenge [
49] and which is
publicly available [
50]; and (3) a diabetes data set from our local academic medical center [
51], including Type II diabetes patients who had been followed (albeit sporadically) for at least five years, focusing on the future outcome of the level of albuminuria (protein in the urine, a measure of renal dysfunction) in the fifth year. (Only the Hepatitis data set is publicly available.)
Table 1 describes the characteristics of the three data sets used throughout all of the evaluations: the total number of data points, the number of patients, the number of concepts, or the number of all semantic types (e.g., Hemoglobin State in a particular context, as explained in
Section 3.3), and the average number of data points per patient.
Note that the concepts were used as classification and prediction features. However, as shown in
Appendix B, each concept (e.g., Hemoglobin State) might have from two to five different values (e.g., in the case of Hemoglobin State values, five values: Very-Low, Low, Moderately Low, Normal, High). Therefore, in the Oncology domain, the 12 concepts included a total of up to 41 potential values; in the Hepatitis domain, the 10 concepts included a total of up to 29 potential values; and in the Diabetes domain, the four concepts included a total of up to 12 potential different values.
The task in the case of the oncology data set was to classify patients who underwent bone-marrow transplantation into autologous bone-marrow transplantation versus an allogeneic bone-marrow transplantation; the task in the case of the hepatitis data set was to classify the patients into Hepatitis B patients versus Hepatitis C patients; and the task in the case of the diabetes data set was the prediction, within a variable period of up to 5 years, of the state abstraction of the albuminuria-value concept (a measure of the amount of protein secreted in the urine), and specifically, whether the patient will have a normal albuminuria level (denoting normal renal function) versus a micro-albuminuria or macro-albuminuria albuminuria level (indicating renal deterioration).
The full description of the three data sets, the definitions used within each domain in the case of the knowledge-based temporal state abstraction method, and additional details about the tasks within each domain, appear in
Appendix B.
3.9.2. The Experimental Design and the Evaluation Measures
We based our evaluation on the KarmaLegoSification framework [
26]. The input time-stamped raw data were interpolated and abstracted into a set of symbolic time intervals, using either knowledge-based or automatic temporal abstraction methods. All of the frequent TIRPs that can be discovered were discovered from the symbolic time intervals output, with or without using any SAC version. In either case, we examined the effect of using either the abstract three temporal relations or the full seven temporal relations. The TIRPs were then used as features for the induction of classification and prediction models, by representing each TIRP using either a simple binary representation of the TIRPs, the mean horizontal support, or the mean duration of the TIRP within the entities.
For the evaluation of research questions 1 and 2, we performed a series of experiments recording the runtime in seconds and the number of discovered TIRPs using the KarmaLego algorithm with the three SAC versions, as well as without using any criterion at all. Note that we could use any other temporal data mining algorithm; we used KarmaLego because it is faster than several other approaches and it is complete [
3]. The runtime and number of TIRPs were evaluated on the different temporal abstraction methods, different sets of relations, and various minimal vertical support thresholds.
Because these experiments measure runtime, each combination was executed separately and thus was isolated from other processes that might have influenced the CPU behavior. We used an AMD Opteron™ Processor 6128 2.00 GHz Machine with 32.00 GB RAM and Windows Server 2008 R2 Datacenter.
To answer research question 3, we evaluated the classification and prediction performance of the SAC using the Area Under the Curve (AUC). We compared the mean AUC with two statistical analysis methods: a one-way ANOVA and the post hoc Scheffé method, using IBM SPSS Statistics 20. The one-way ANOVA was applied to the general parameters, such as determining whether the different SAC versions performed differently and the Scheffé method was applied as post hoc examination to test differences within the SAC versions. Comparisons that were found to be significantly different () are reported.
Since mining TIRPs may result with different sets for each group of patients [
26], we used a rigorous evaluation setup, including three-fold mining and ten-fold cross-validation classification. Thus, the data were split into three folds wherein TIRPs were discovered from one fold and then detected in the other two folds which were used for the classification experiment. This was repeated three times for the three-folds mining. We used four highly different types of induction algorithms:
Random Forest, the best known application of the decision trees family (randomizing both the features and the data) [
52], the classic pure probabilistic reasoning algorithm—
Naïve Bayes [
53], SVM—a very different family that uses a special type of linear optimization [
54], and of course from the Linear classifiers,
Logistic Regression [
55], which is often used as the baseline statistical approach against which other methods are compared.
We did not perform any feature selection methods on the temporal patterns discovered, since previous studies [
26] did not demonstrate their value, and also because we wanted to directly assess the value of using all features found with and without using any SAC variation. Once TIRPs were discovered, however, we exploited them for creating three different features from each TIRP:
Binary (existence of the TIRP in the record),
Horizontal Support (number of TIRP instances in the record), and
Mean Duration of the TIRP in the record.
The KarmaLego method was implemented based on the original Moskovitch and Shahar study mentioned above [
3]. We used the SAX algorithm, which we implemented based on Lin et al.’s description [
14], and the TD4C-KL method, which was implemented based on Moskovitch and Shahar’s original description [
20], using the Kullback–Leibler divergence as the measure for deciding which value cut-off leads to the best separation between the outcome classes. We used the classification algorithm implementations available in WEKA 3.7.1 [
56].
5. Discussion
We defined, formalized, and assessed in detail the Semantic Adjacency Criterion, a filtering principle that in the past we had only briefly introduced. Note that no new temporal patterns were discovered, of course, during the filtering-out process using the SAC principle, but the number of temporal patterns found, and the time needed to discover them, was greatly reduced by the various SAC pruning versions that we had used, without hurting classification and prediction performance. Three versions of the SAC criterion were suggested: The Sequential Semantic Adjacency Criterion (SSAC), which enforces the constraint only over pairs of temporally successive symbolic time intervals within the TIRP’s canonical lexicographical order definition; the Conservative Semantic Adjacency Criterion (CSAC), which enforces the constraint over every pair of symbolic time intervals within the TIRP’s definition (recall that in a TIRP, a temporal relation is defined in an unambiguous fashion between each pair of symbolic time intervals); and the Liberal Semantic Adjacency Criterion (LSAC), a variation of CSAC, which enforces the constraint only over pairs of symbolic time intervals that have different semantic types.
It is important to note that each type of SAC can serve a different purpose and has a different expressivity. For instance, the LSAC version implicitly enables a counting of symbolic time intervals of the same type (e.g., patterns such as A1A1A1 denoting, for example, three overall administrations of the same range of the dose of the medication, although different ranges of doses of the medication might have been administered between them), while constraining intervals of different semantic types. Another example is the SSAC version, which enables a more restricted version of counting but, like the LSAC version, enables the discovery of a TIRP that implicitly contains a “Symbolic Gradient” temporal pattern (e.g., patterns such as A1A2A3B denoting, for example, a Low level of the medication-dose, followed by a Medium level, and then a High level, followed by some side effect). The CSAC version seems the most useful to maintain compactness in the number of discovered TIRPs, while preserving the strictest semantics of the TIRPs, although it prevents the discovery of repetitions of symbolic time intervals within the same TIRP or the discovery of Symbolic Gradients.
Note also that as the minimal vertical support increases, and thus the number of potentially discoverable patterns decreases, the returns for using the SAC principle are diminishing; however, this is exactly the phenomenon that might enable researchers and physicians to extract useful patterns, using the SAC principle, from bigger, and even very big, data sets.
We evaluated the classification and prediction performance of the features discovered using the three SAC versions on three different medical domains: oncology, infectious hepatitis, and type II diabetes. Note that we did not choose any simulated or artificial data set, since the main point of the evaluation was to test the SAC principle within real clinical domains that incorporate real semantics, and in particular, potential causality. We believe that the true value of the SAC principle can only be apparent within real-world data, since it is precisely the lack of coherence of most real temporal patterns that is being filtered out by that principle.
To further bolster the assessment process and its conclusions, we performed the evaluation using classification algorithms from four different classifier-induction families: Random Forest, Naïve Bayes, SVM, and Logistic Regression.
It is important to emphasize that the main objective of this study was to demonstrate the possibility of reducing the number of pattern features that need to be discovered in a large time-oriented data set by at least an order of magnitude (and enhance their intuitive meaning to a domain expert, due to their increased transparency), without losing any classification performance. Our goal was not to enhance any classification performance.
The different SAC versions behaved slightly differently in each domain and for each classifier version. However, overall, using all of them required much less time, up to 98% less than when not using any SAC version, depending on the minimal vertical support specified, to discover all of their respectively relevant TIRPs, compared to not using any SAC at all. Using the various SAC versions also resulted in a significantly reduced number of TIRPs, up to 97% less, depending on the minimal vertical support threshold. This reduced set of TIRPs, however, did not lead to any reduced performance in any of the three medical domains, i.e., the resulting classifiers performed when using this reduced set of features as well as when using the full original set of TIRPs discovered in the standard KarmaLego methodology.
We infer that, at least in the medical domains in which we assessed our methodology, SAC-obeying TIRPs seem to contain most of the information important for classification and prediction.
Most of the data sets we used were relatively small (at least compared to current big data sets, although the data sets we experimented with contained about 70,000, 160,000, and 360,000 data points). However, the clear trend noted above towards a higher performance when using the reduced set of TIRPs filtered using various SAC versions, in spite of the much smaller feature set, suggests that repeating our studies with much larger data sets might, in fact, not only show that the much smaller set of TIRP-based features is sufficient, but might even demonstrate a significant improvement in the classification performance, and future studies might elucidate that aspect.
However, in any case, a reduction of the number of temporal pattern features to be discovered in a big data set has significant computational implications. On a similar note, it is interesting to consider that Fradkin and Mörchen’s conclusions in their study [
25] were that the main advantage of their new proposed sequential mining algorithm, BIDE-DC, lies in generating a
smaller number of patterns, while preserving the same classification performance.
As we noted in the Introduction, we have recently shown that frequent TIRPs can be consistently discovered and in similar proportions in different subsets of the same data set within three different medical domains, thus increasing their value for potential patient trajectory clustering, classification, and prediction tasks [
28]. (In fact, we used the same three domains mentioned in the current study.) This study has also shown that consistent discovery can be increased by increasing the minimal support threshold for frequency and, interestingly, by using the SAC principle to prune in each subset the patterns that are candidates for discovery.
Note also that the SAC principle is quite general and is not specific to the KarmaLego algorithm on which we demonstrated it or even to the family of multivariate interval-based TDM algorithms that KarmaLego is a part of. For example, sequential mining algorithms such as SPADE [
57] start with a set of time-stamped events, each containing several items, and discover qualitative associations that involve the Before temporal relation. Using SPADE to generate a set of temporal sequences that will be used as classification features might well benefit similarly through the addition of semantic considerations similar to the SAC variations, while enhancing its semantic transparency to domain experts.
Several limitations of our study can be noted. We did not measure the runtime of the classification and prediction phases, but obviously, representing a larger number of features, especially when using various functional methods (e.g., computing the mean duration of each TIRP) requires more time. That might be an additional advantage of using the SAC principle which we did not assess. Other factors that might also require more time are selecting and using various feature selection algorithms and inducing a classifier from a larger set of features.
Note that a trivial case for semantic equivalence is the one in which all concepts are different (e.g., different events, each with its own symbol); semantic type equivalence between two symbolic intervals will then consist of having the same symbol hold over both intervals.
Our main intent in the current study, however, was to explore the non-trivial case in which most concepts might have more than one value or, at least, in which there is some domain knowledge that assigns types to the various concepts. However, exploring the potential implications of the SAC principle for the simple case in which all symbols are different and no domain knowledge exists can certainly be explored in a future study.
Another potential limitation might be noted. We did not assess the actual transparency of the SAC-obeying TIRPs, as opposed to the non-SAC-obeying TIRPs, in the eyes of medical domain experts in the three domains. That was not an objective of the current study, which focused on the pure objective computational aspects of using the SAC principle, but it might be interesting to assess this subjective cognitive aspect explicitly in future studies.
The use of the four different abstraction and discretization methods led
qualitatively to the same results, with respect to the number of TIRPs discovered, the time needed to discover them, and the performance of the TIRPs as features for classification and prediction purposes, in all three domains using four different classification algorithms. Nevertheless, when using the EWD abstraction method, we noted in the specific case of the hepatitis data set that the SSAC’s runtime (and the number of discovered TIRPs) was close to the runtime achieved without the use of any SAC (see
Figure 10 and
Figure 11). SSAC is a sequential version of SAC; thus, the most reasonable explanation for this phenomenon is probably that the hepatitis data were not sequential and most of the concepts appeared at the same time. Still, the use of SSAC did not significantly reduce the performance compared to not using any semantic criterion.
Not all of the SAC versions performed equally well in the case of the diabetes data set (see
Figure 14 and
Figure 15). Using LSAC, which is the liberal version of SAC, meaning that it restricts the criterion to hold only over pairs of semantically
different concepts, led to a worse performance when compared to the other SAC versions. The reason might be that the diabetes data set includes a small number of concepts measured repeatedly over a long time and is pretty sparse, but there are several laboratory tests that are very common and, in the case of the liberal version of the SAC, relations among pairs of intervals of the
same concept were considered, just as in the case of not using any semantic criterion; the result is a runtime that is pretty close to that of not using
any semantic considerations, at least in some of the configurations.
The last two examples, i.e., the exceptions in our results of the empirical evaluation, demonstrate that one must learn the data and select the most appropriate SAC version, as well as the other parameters, e.g., discretization and representation methods. However, overall, the CSAC version performed best, no matter which configuration was chosen.