1. Introduction
Various resources, data, and information are increasingly being collected from large-scale intelligent devices in more diverse forms. However, some issues still need to be addressed in the processes of storing and utilizing these data, such as data security and processing efficiency. Some studies have redefined the concepts of the resources, as well as their hierarchical relationships and resource modeling, and attempted to examine privacy and security protection, model interpretability in data applications, and uncertainty or loss of control issues in artificial intelligence from this perspective [
1]. There are some complex frameworks beyond the traditional DIKW pyramid that have been proposed to address the complex interrelationships between data in the real world. A novel knowledge graph system architecture, namely DIKW graphs, has been proposed, which is further subdivided into data graphs (D), information graphs (I), knowledge graphs (K), and wisdom graphs (W) [
2]. The knowledge graphs describe the complex relationships between entities and concepts in a structured form, and the conceptual approaches for defining
,
and
are described in Ref. [
3].
Based on the theory of relation definition semantics of everything (RDXS) [
4], an extended metamodel was proposed in Ref. [
5]. A conceptual framework is proposed under the axioms and theorems system in the existence computation (EC) [
4,
6], defining concepts such as “class/type” and “association”. From the perspective of RDXS, a semantic transformation method has been proposed to convert relational semantics containing discrete independent entities to entity semantics within RDFS [
3]. The conceptualization of “entity” and “relationship” have been proposed as forming the theoretical basis for assessing semantic similarity or “sameness” between the compared elements [
5,
7].
The established DIKW graphs model has been expanded into the more comprehensive DIKWP framework by adding proposed graphs (P) [
8]. A proposed graph represents the process of extracting meaning and value from raw data, containing the objectives and intentions of artificial consciousness and driving the transformation between data, information, knowledge, and wisdom. In the DIKWP model, data are raw facts and numbers without processing. Information comprises data that have been given meaning and insight by being interpreted and understood in a particular context. Knowledge is a deeper understanding and familiarity with information, including the recognition of patterns, associations, and relationships within information. Wisdom involves applying knowledge, which involves synthesis, evaluation, and judgment to make wise decisions and take prudent actions. The collected data needs to be transformed into information through interpretation and understanding. The information is transformed into knowledge through pattern recognition, trend analysis, or contextual understanding. Through deeper analysis, classification, organization, or reasoning of information, a deep understanding of knowledge is gained, which could be transformed into wisdom [
9].
In typed DIKWP resources, information consists of associated data or data combinations that contain specific semantic content that can be used for analysis and interpretation. Data can be transformed into information through specific processing, such as analyzing the frequency of data across structural, temporal, or spatial dimensions to capture their associations. The associations between data, also represented as specific relationships in a directed graph, indicate that an order exists between the data. If patterns of associated or combined data are considered, sequence pattern mining (SPM) methods applicable to sequence databases can satisfy the processing of transforming data into information. From the information perspective, due to the existence of context, each data point may have specific semantics. The frequency could reflect an internal feature of data, while context-defined data requires another external evaluation criterion to distinguish each other. The concept of utility can be used to distinguish data with different contexts. In HUSPM, different utility functions can be defined to represent the external features of data across different dimensions. Additionally, the transformation of data into information does not necessarily always possess sufficient a priori knowledge, and therefore seeking to select appropriate thresholds to optimize transformation efficiency proves challenging. Top-k methods can achieve a goal-driven exploration by mapping user objectives to the parameter k.
SPM has found applications in web clickstream mining [
10] and consumer behavior analysis [
11]. However, in the frequent-oriented framework, patterns are extracted based solely on their support. Each of the items in a transaction has intrinsic value in the real world, and the value is unique. This frequent-oriented framework fails to consider the intrinsic value of individual items and does not capture the uniqueness of patterns. As is well known, the main task of retail business is to make more profits than sales. To overcome this problem, the relative importance of items was considered [
12,
13]. By incorporating the concept of utility, high-utility pattern mining (HUPM) [
14] was introduced within a utility-oriented framework [
15,
16]. HUPM focuses on discovering patterns with high utility values in non-binary databases [
17,
18]. The utility value of an item is determined by multiplying its internal utility (which represents its quantity) by its external utility (which represents its unique value).
Sequential patterns based on pattern frequency do not always correspond to the purpose of analysis, such as selling quantity and profit. Therefore, the SPM methods do not always provide sufficient information to support the decisions of stakeholders. The obvious contrast is between luxury goods and everyday supplies. Since each item in the sequential database is accompanied by a timestamp, traditional high-utility itemset mining (HUIM) methods are unable to effectively address this aspect. To tackle this issue, HUSPM [
19,
20] has been developed in the field of knowledge discovery in databases. It could be applied to various scenarios [
21].
In HUSPMA, a quantitative sequence database has been defined to represent the relative importance of patterns. In the context of retail business, the number of intelligent appliances or batteries purchased by a customer in a single visit is considered to be internal utility, reflecting the number of item occurrences. The unit profit of intelligent appliances or batteries is generally defined as external utility, indicating the relative importance of the item. The main task of HUSPM is to discover high-utility sequential patterns (HUSPs), which are subsequences with high utility values. Researchers have proposed many efficient algorithms for mining HUSPs [
22,
23], such as designing efficient pruning strategies and incorporating novel data structures. Notably, compared to SPM and HUIM, HUSPM takes into account the chronological ordering of items and their associated utility values [
24]. Although HUSPM can overcome some disadvantages of SPM and HUIM, there are still some typical challenging problems in HUSPM.
The utility of a combined pattern is determined by the sum of the utilities of subpatterns constituting this combined pattern. Therefore, as the number of combined subpatterns increases, the length of the pattern also increases, leading to a higher utility value. This implies that a longer pattern has an advantage over a shorter one in terms of being a high-utility pattern. As the pattern length increases and the number of subpatterns grows, the association rules between these patterns become more difficult to comprehend [
25]. Therefore, Hong et al. proposed the average-utility measurement, which offers a more comprehensive assessment of itemset utility [
26]. The high average-utility itemset (HAUI) mining approach [
27,
28] is designed to extract patterns with high average-utility value by considering both the utility of the pattern and its length [
28,
29]. In this way, it is more objective to assess the utility of the pattern. A high average-utility pattern indicates that each item comprising the pattern has a high utility value. Consequently, patterns with long lengths and high utilities may not always qualify as high average-utility patterns.
There is another limitation in HAUI and HUSP mining algorithms. In most cases, it is challenging to specify an appropriate value for the minimum utility threshold that ensures sufficient but not excessive itemset mining, particularly when they are unfamiliar with the features of the database. For instance, when designing a general model or analyzing a new database, users are often unaware of the distribution of utilities and the total number of sequences, so they must try different threshold parameters and execute the algorithm multiple times until they obtain a desirable result. If the threshold is specified too low, many HUSPs with redundant and unimportant information may be discovered. Conversely, if the threshold value is set too high, the number of captured HUSPs may not provide sufficient information. Additionally, fine-tuning the threshold extraction process is time-consuming. However, in many practical applications, stakeholders are more interested in identifying the top profitable patterns rather than hundreds of thousands of results. The top-
k HUSPM was proposed to handle this problem [
30], which was desired to specify the number of patterns instead of the minimum utility threshold. It draws inspiration from the top-
k HUIM [
31] and top-
k SPM [
32]. The objective of the top-
k HUSPM is to select patterns with the top-
k highest utilities from a sequential database.
Some preliminary studies [
33] were conducted to capture top-
k HAUI in the transaction database. However, research in this area is still in its early stages, and these strategies are not applicable to sequential databases. The approach proposed by Thilagu et al. was limited to specific cases where each item can occur only once [
34]. Tin et al. [
35] proposed the algorithm EHAUSM to identify HAUSP. However, there is a gap between the overestimation of utility, which is calculated by the total sequence utility or the maximum item utility, and the actual utility, with the former being consistently higher. Furthermore, the proposed upper bounds, which are similar to those in HAUPM, require a set number of items with the highest utility in the rest of the sequence through multiple sorting. To address these challenges and to improve performance and scalability, we formulate the problem of top-
k HAUSPM and propose a novel algorithm named TKAUS. The main contributions could be summarized as follows:
By considering the HUSPM and the HAUIM, the concept of TKAUS is addressed, and the problem of top-k HAUSPM is formulated. To quickly increase the minimum average-utility threshold, we investigate the sequence average-utility-raising (SAUR) strategy.
Based on the widely accepted definition of high average utility, a novel algorithm with utility upper bounds and corresponding pruning strategies is designed for mining HAUSP. The proposed approach not only takes into account the actual utility of prefixes but also avoids extra sorting processing.
Comprehensive experiments demonstrate that the proposed algorithm TKAUS performs excellently for top-k HAUSPM, particularly in terms of execution time, unpromising candidate filtering, memory usage, and scalability.
The remainder of this paper is organized as follows.
Section 2 briefly reviews various related work. In
Section 3, we provide definitions and formulate the top-
k HAUSPM problem.
Section 4 presents our proposed TKAUS algorithm, including data structures and several strategies.
Section 5 presents experimental results and evaluates the performance of our proposed algorithm. Finally, in
Section 6, we provide a summary and discuss future work.
3. Mining Top-k High Average-Utility Pattern
In this section, several definitions are introduced to explain the framework of TKAUS.
Table 1 shows an example.
Consider a set of distinct items , X is a nonempty subset of I, denoted as , and represents the size of X. A sequence is an ordered list of itemsets, where . The length of is calculated as , and is called l-sequence. The size of is m. Assume that there exists integers such that , the sequence has a subsequence , notated as or .
The definitions of the quantitative sequence and the quantitative sequence database have been proposed in [
48]. A quantitative sequence database consists of quantitative sequences (
q-sequences). The
q-sequence is an ordered list of the quantitative itemset (
q-itemset). The quantitative item (
q-item) is expressed in the form of (item, quantity) and denoted as (
i,
q). In the tuple, the quantity of the item is represented by the internal utility value of the item. In
Table 1,
is an identifier of
q-sequence, and it is unique.
Definition 1 (Q-item Utility). Consider a q-item in the jth q-itemset within a q-sequence Q, the q-item utility is denoted as and is calculated as follows:where is the external utility of the item i, and is the internal utility of the item i and is a quantitative measure for the q-item. Definition 2 (Q-itemset Utility). Let is a q-itemset and is the jth q-itemset in a q-sequence Q. The q-itemset utility is the sum of the q-item utilities. It is denoted as and is calculated as follows: The profit of the item as the external utility values is provided in
Table 2. For example, as Definition 1, consider the 3rd
q-itemset of
in
Table 1, the utility of
q-item
d is
. Then, we have the utility of the 3rd
q-itemset in
is calculated as
.
D represents the
q-sequence database in
Table 1, and the overall utility of
D is
.
Definition 3 (Q-itemset Average Utility). Let is the jth q-itemset in a q-sequence Q. Its average utility is denoted as and is calculated as follows: For instance, in
Table 1,
is the 1st
q-itemset in
, the average utility of
is calculated as
.
Definition 4 (Match). Consider the q-itemset and the itemset , if and only if such that , we say that X matches Y or Y matches X, denoted as or . Similarly, the q-sequence the sequence , denoted as or , if and only if such that .
Definition 5 (Contain). Consider the q-itemset and the itemset , we say that Y X or X is in Y, denoted as or , if and only if X is a subset of itemset and the itemset the q-itemset Y.
For instance, from
Table 1 and
Table 2, the itemset
the
q-itemset
in the sequence
. In
, the
q-itemset
the itemset
.
Definition 6 (Instance of Q-sequence). Consider the q-sequence and the sequence , where , we say that there is an of S at position in Q, if and only if there exists positive integers , so that . Also, we can say Q S, denoted as .
For instance, there are two of sequence , and , at positions and in . Therefore, we could also say the q-sequence or contians the sequence .
Definition 7 (Sequence Utility). Assume that the q-sequence is an of sequence at position in . The sequence utility of S in Q is the sum of the utilities of the q-items within S, denoted as and calculated as follows: Generally, a sequence appears in one
q-sequence more than once. Given
denote the set of all the positions of
S in
Q. The sequence utility of
S in
Q is the maximum value of instance utility at different positions and is defined as follows:
Assume that the sequence
S has
within different
Q in
q-sequence database
D. The sequence utility of
S in
D can be defined as follows:
For example, in
Table 1,
has an
of
at position:
. Then, we can calculate
. For another case, in
Table 1,
has five
of
,
,
,
,
,
, and
. The utility of the sequence
in
is
, and
.
Definition 8 (Sequence Average Utility). Consider the q-sequence is an of sequence at position in . The sequence average utility of S in Q is denoted as and is calculated as follow:
In this paper, we define the sequence average utility based on the concept of average utility, which was defined as the pattern utility divided by its length. For instance, the average utility of at position in is .
Assume that the set of all the positions of
of
S in
Q is
and is denoted as
, the sequence average utility of
S in
Q is the maximum value at different position and is defined as follows:
Assume that the sequence
S has
within different
Q in
q-sequence database
D. The average utility of
S in
D can be defined as follows:
For instance, in
Table 1 and
Table 2, the sequence average utility of
in
is
, and in the entire database
D, is
.
Definition 9 (Top-k High Average-Utility Sequential Pattern). We define a sequence S as the top-k high average-utility sequence in a q-sequence database if there exist fewer than k sequences whose average-utility value is higher than .
Problem Statement. The goal of conventional HUSPM is to find HUSPs in a sequence database with a user-specified utility threshold [
46]. Zhang et al. [
48] proposed that the threshold of top-
k HUSPM can be represented as the minimum utility value of the complete set of top-
k HUSPs, which is easily accessible.
The utility of pattern is affected by its length [
24]. In addition, mining representative
k patterns is a lossless compression of patterns [
65]. Therefore, the drawback of HUPM is more prominent in top-
k HUSPM.
4. Proposed TKAUS Algorithm
The fundamental technologies require multiple original database scans for the candidates. Multiple scanning also exists in utility calculation. High cost is a main problem. The projection and local search mechanism have been proposed to address the problem by constructing a projected database to keep the necessary information. Elimination of the unpromising items earlier also can greatly reduce the size of search space.
4.1. Data Structure and Projection Strategies
In the framework of the proposed algorithm, the mining process first scans the given
q-sequence database once. The lexicographic sequence tree as the complete search space has been used in SPM and HUSPM [
38]. Then, USpan [
19] presented the lexicographic quantitative sequence tree (LQS-tree) structure. Each node of this LQS-tree structure is associated with a candidate, which consists of the
q-sequence and its utility. The root of this tree is null. All child nodes are arranged in ascending order, for instance, alphabetical order. The child node is a prefix-based extension of its parent node. There are two kinds of
for the appending item in a sequence. One is the end of the last itemset. The other position is after the last itemset as a new itemset. Both
I-Extension and
S-Extension are operations of “
” to generate new sequences. For example, consider a
l-sequence
s and an appending item
i, the size of
s is
k,
I-Extension of sequence is to append item
i to the end of last itemset in sequence
s, notated as
, the new generate sequence is
-sequence
, the length of
is
and the size of
still is
k.
S-Extension of sequence is to let itemset
i as a new one
and append to the end of sequence
s, notated as
. In this way, the newly generated sequence also is
-sequence
which is
in length, but the size of
is
.
In some cases, the
of the sequence
S may appear more than one position, and the set of positions of
is denoted as
. Zhang et al. [
48] presented the position of the last itemset in
is
. In
q-sequence
Q, the sequence utility of
S is the maximum utility of all
. The last item within
is the
. The
is a subsequence from the item after the
to the end of
Q, which is denoted as
. Yin et al. defined the
of
S at the
p in
Q as
[
20].
where
indicates that the position of item
is after the
p. For instance, from
Table 1 and
Table 2, the sequence
has
at position
,
and
in
.The
of these
are
and
, and the item
g is the
. Thus, the
of
at
in
is
, and at
is
.
To calculate all the multiple
in a
q-sequence is time-consuming. In [
46,
48], the projected database was utilized to keep the necessary information in sequence. By adopting the utility-chain structure, these studies reduce calculation costs. The projected database of a sequence is composed of an information table and a utility chain. The utility chain in the structure consists of two parts: a head table and a utility list. The utility-chain structure gives the auxiliary information to facilitate pruning search space in HUS-Span [
46].
In this work, the prefix extension average utility (abbreviated as PEAU) of the sequence is stored in the information table as a key piece. In addition to the maximum utility of all and the utility of , the utility list also contains the length of and . More details about PEAU are described in the subsequent section.
In the structure, the size of the utility list corresponds with the number of of the in the q-sequence. In the 1st field, the unique identity of itemset, which corresponds to the is abbreviated as . In the 2nd and 3rd fields, and , are the average-utility value of the and the at this . In the 4th and 5th fields are length of the and , which are notated as and , respectively.
The above projection mechanism constructs a precise and complete projected database to avoid multiple scanning for the original database. In the subsequent section, the algorithm adopts various pruning strategies to further limit scanning space.
For instance,
Figure 1 shows the projected database of
. There are four rows on the head table. It shows
is contained in
,
,
and
. The utility chain corresponds to the head table. The first line in the utility chain contains complete information about two
in
for the following process. Therefore, the TKAUS algorithm avoids multiple scanning for the whole original database by the projection mechanism.
4.2. Sequential Average-Utility-Raising Strategies
The performance of the algorithm is limited due to more candidates being generated in the process of threshold increasing from a lower value. The sequence average-utility-raising (abbreviated as SAUR) strategy, which increases the rise rate of the minimum utility threshold, was proposed in TKAUS.
Definition 10 (SAUR Strategy). Assume that is the set of sequences contained in the q-sequence database D. Let denote the highest average-utility value in . Before the mining task, we can safely increase the minimum average-utility threshold, which is a variable denoted as , to , where k is the desired input number of patterns.
Proof. Let is the minimum utility threshold of the mining process. When the variable increases to from 0, the set of HAUSPs, denoted as , is reduced from to (), and all the sequences whose average utility is not lower than must be in the set of . According to the definition of top-k HAUSP, there are more than k sequences in whose average-utility values are not lower than . Thus, the SAUR strategy effectively increases the to a rational level. In addition, we prove that the method with the SAUR strategy will not miss any top-k HAUSPs. □
4.3. Upper Bounds and Pruning Strategies
There is an inherent time order between items in the sequence database. The process of HUSPM must deal with the critical combinatorial explosion of the search space [
48]. If the maximum sequence utility is much lower than the minimum utility threshold, even the sum of sequence utility and utilities of items in the
still cannot reach the threshold. The sequence is “
”. Zhang et al. [
48] proposed the upper bound and corresponding strategy to reduce unpromising patterns and unnecessary operations. Several existing studies have proposed some sequence pattern mining upper bounds. Three typical upper bounds: sequence extension utility (SEU) [
22], sequence projected utilization (SPU) [
29], and sequence-weighted utilization (SWU) [
19] had been introduced compared in ref. [
23]. SWU is a loose upper bound and is an overestimation, SPU loses some real HUSPs, and SEU, which is a modified real upper bound, could further limit the search scope. Wang et al. [
46] proposed PEU upper bound and RSU upper bound to speed up HUSPM. The PEU upper bound was defined as
.
The existing research work is instructive to improve the performance of HAUSPM by reducing the
pattern. The transaction-maximum average-utility upper bounds designed by Kim et al. [
59] or proposed by Wu et al. [
55] need to sort the set of remaining items by utility descending order. However, in a quantitative sequence database, each
q-itemset has a special order, so sorting among
q-items that contain different
q-itemsets is not an effective method.
In fact, each item in is helpful to add up to higher sequence utility by the “” operation for HUSPM and algorithm TKUS. The item is if the value of PEAU is not lower than the minimum average-utility threshold. However, in HAUSPM and our proposed algorithm, the average utility of the sequence, which is generated by the “” operation, is not higher than one of its prefixes if the utility of the appending item is lower than the average utility of the prefix.
Definition 11 (Increment of Sequence Utility). Let be the sequence utility of s. If is in excess of the minimum average-utility threshold , let be the rising proportion of the excess sequence-utility value in , and is calculated as follows: Assume that there is an
of sequence
S at the
in
q-sequence
Q, the corresponding
is
which is the rest after
p to the end. Let the
be a subsequence consisting of items that have a higher utility value than
in
. The rising proportion of the excess sequence-utility value in
is denoted as
, and is calculated as follows:
where the
is the maximum length of
of the prefix
S in database
D. Please note that
is determined by the
of sequence
S.
Therefore, we define the . It is a measure for the promotion of the utility on the utility of the generated candidates. The concept of “”, which has been mentioned at the beginning of this section, can be redefined. The appending item is if the average utilities of all the generated sequences containing this item still are lower than , i.e., all the promotion of its prefix and its will not increase the sequence average utility to the value of . We will, therefore, give the following upper bounds and pruning strategies.
Definition 12 ( Upper Bound). Assume that there is an of sequence S at the in q-sequence Q, the corresponding is which is the rest after p to the end. Let be a subsequence of and consist of items that have a higher utility value than . The prefix extension average utility of S in q-sequence database D is defined as follows: The is also calculated aswhere is the maximum length of of the prefix S in D. The is a revised prefix extension utility of the sequence S in D. The revised prefix extension utility of S at the in Q is defined as follows: Then, let is one position of S in Q, we define the revised prefix extension utility of S in Q is . And we define the revised prefix extension utility of S in D is .
For example, consider the sequence , there is an instance in , and equal to 0 because . In addition, has two in . When , for all q-sequence Q with , the utilities of items a, b and d in of are 3, 12 and 8, which are lower than , respectively. Therefore, , and . We have and , so that . Finally, the prefix extension average utility of any sequence extended by is . The maximum length of is , so that we have = .
Proof. Let the sequence S is a prefix of sequence , and the . Set a subsequence is , where , and , .
(1) if
is not null, then
and
. Let
is an item in
, then
. We have
(2) if
is null, but
is not null, then
and
. Consider an item
in
and
, we obtain
Therefore, we obtain . □
Based on the above derivation, the average utility of any sequence extended from S is less than . Then, the is the maximum average-utility upper bound of S and its descendant. Thus, if the value of is less than , any descendants of S will not be the high average-utility sequential pattern, and these descendants could be safely pruned.
Definition 13 ( Upper Bound). Let be the prefix extension average utility of S in database D. Assume that a sequence is extended from S through one I-Extension or S-Extension operation, i.e., consider the node representing S in LQS-tree, the node representing is its child node. The reduced sequence average utility of in q-sequence database D, denoted as , is defined as Proof. Assume the sequence is a prefix of or , and the sequence could be extended from sequence S through one “” operation, such that the sequence S is also a prefix of .
Based on the Definition 11, we obtain
According to the Definition 12, we have
such that the average of the sequence is
□
Thus, if the value of
is less than
, any descendants of
S are not the high average-utility sequential pattern, and these descendants could also be safely pruned. But RSAU still overestimates the upper bound. This upper bound is similar to RSU in HUS-Span, and more details can be found in Ref. [
46].
4.4. Proposed TKAUS Algorithm
By adopting the aforementioned mechanism and strategy, the proposed TKAUS is described as follows. The main algorithm is represented as Algorithm 1. Algorithm 2 represents a recursive procedure, and Part 3 of the whole algorithm procedure represents the updating of the set of HAUSPs and the threshold.
In the beginning, a given quantitative database is read, and the number of desired HAUSPs is specified. The algorithm utilizes the variable
to keep the current minimum average-utility threshold (Line 1) and engages the
, which is a sorted list of size
k to maintain the top-
k HAUSPs (Line 2). From lines 3 to 7, scan the original database once, calculate the utility value of each item (1-sequences), and construct the projected databases of all items. In line 8 and line 9, following the SAUR strategy, the algorithm increases
to the
kth highest average-utility value. From lines 10 to 14, for each item whose utility is greater than
, update to the
. Then, adopting the upper bound, for each
item, the
-TKAUS algorithm is called recursively (Lines 15 to 19). When the TKAUS algorithm is over, the
is returned, which stores all the HAUSPs.
Algorithm 1 The TKAUS algorithm |
Input: A quantitative database, D.
The number of desired HAUSPs, k. Output: Top-k HAUSPs in D.
- 1:
initializing - 2:
initializing - 3:
//First scanning (for the original database) - 4:
for each sequence Q in D - 5:
(1) calculating and storing the average utility of all 1-sequences - 6:
(2) constructing projected databases of all 1-sequences - 7:
end for - 8:
sorting the average-utility values of all 1-sequences in descending order, then getting the kth highest average-utility value - 9:
- 10:
for each sequence -sequences - 11:
if - 12:
update - 13:
end if - 14:
end for - 15:
for each sequence -sequences - 16:
if - 17:
calling -TKAUS - 18:
end if - 19:
end for - 20:
return
|
Algorithm 2 -TKAUS algorithm |
Input: A sequence as prefix, S.
The projected database of S, . Output: Top-k HAUSPs in D.
- 1:
scanning once to: - 2:
(1) adding I-Extension items of S to - 3:
(2) adding S-Extension items of S to - 4:
for each item - 5:
- 6:
if - 7:
removing i from (Pruning the unpromising item in ) - 8:
end if - 9:
constructing projection database of , - 10:
if - 11:
calling - 12:
putting to - 13:
end if - 14:
end for - 15:
for each item - 16:
- 17:
if - 18:
removing i from (Pruning the item in ) - 19:
end if - 20:
constructing projection database of , - 21:
if - 22:
calling - 23:
putting to - 24:
end if - 25:
end for - 26:
for each sequence - 27:
if - 28:
Calling -TKAUS - 29:
end if - 30:
end for - 31:
return
|
As shown in Algorithm 2, in line 1, the
-TKAUS algorithm identifies items that can be extended by scanning the projected database. Each input sequence is regarded as a prefix of a pattern. From lines 4 to 25, for each extended sequence, if the RSAU value is not less than the threshold, the extended sequence could be a top-
k HAUSP or its prefix. After that “
” operation, we compare all generated sequences in seqlist to the new minimum average-utility threshold according to their PEAU values. Then, the
-TKAUS algorithm is called recursively. Algorithm 3 is
algorithm. The lowest average-utility sequence is removed from
when the size exceeds the specified parameter. Meanwhile, the
is updated to the sequence average utility of
kth HAUSP in
.
Algorithm 3 algorithm |
Input: A generated sequence from S, .
The number of desired HAUSPs, k. Output: A minimum average-utility threshold, .
A set of top-k high average-utility sequential patterns, .
- 1:
scanning once to: - 2:
(1) sorting all sequences by the sequence average utility in descending order, then getting the sequence with the lowest average-utility value - 3:
(2) getting the number of sequential patterns in , - 4:
if - 5:
Inserting into - 6:
= the average utility of HAUSP in - 7:
Removing from - 8:
else - 9:
Inserting into - 10:
end if - 11:
return ,
|
4.5. Complexity Analysis of the Proposed TKAUS Algorithm
Suppose the number of quantitative sequences in a quantitative sequential database is , is the average number of items of the quantitative sequence in the database, and is the number of distinct items in the database. k represents the desired number of HAUSPs. It is a user-specified parameter. Therefore, in the first scan of the original database, q-items are processed to calculate the information of each one. This information is stored in a projected database, and is also the memory complexity of the first scan. Then, in TKAUS, the highest average-utility value of all 1-sequences is set as the initial threshold, so it takes in order to sort all 1-sequences in an average utility descending order using the quick sorting algorithm. In the sorting step, the memory complexity is . TKAUS next updates , which is used to store HAUSPs. At this time, in the worst case, it takes to calculate the actual sequence average utility for each sequence. Meanwhile, TKAUS calls the recursive function -TKAUS. The required processing for comparing the threshold and the PEAU of each sequence is . In -TKAUS, the -sequence is generated from the k-sequence by appending each item. In addition, two conditional lists are constructed to store the appending items. It takes to scan the projected database and to mark the unpromising items with low RSAU. Then, the unpromising items will be deleted from the lists. The largest size of a conditional list is . The algorithm next traverses the reduced lists and appends each item to the prefix. Meanwhile, TKAUS finds the HAUSPs and calls the function. The required amount of processing to calculate the actual sequence average utility is . The generated sequence with high PEAU is set as a new prefix and is inserted in the projected database. Repeat the above process for all the prefixes. The corresponding time complexity is , which equals . Its memory complexity is , which equals . As for the function, the generated sequence is inserted into , and the sequence with the lowest average utility is deleted. Its time complexity of the process is . In addition, the memory complexity of the update function is .
Suppose L is the sequence length of the longest generated one in the above process. Then, . Therefore, the maximum number of recursive calls is in the function -TKAUS. According to the above, the time complexity and memory complexity of TKAUS are and , respectively. In the worst case where , the time complexity and memory complexity are and .