1. Introduction
The IoT is a comprehensive and extensively researched field nowadays, and it can be categorized into two main components: embedded systems using sensors, and processing technologies over wireless sensor networks. IoT is evolving day by day, yet it suffers from three main challenges: architecture [1], interoperability [2], and security [3,4,5]. Various IoT-enabled heterogeneous devices around us communicate through sensors via the internet to provide smart services, for example, smart health monitoring systems [6], smart parking systems [7], and smart shopping carts [8]. In this paper, we address one of the pertinent issues with smart shopping carts: the transactional data collected on webservers. After preprocessing this coarse-grained data, artificial intelligence methods can be utilized for effective analysis, such as studying customer buying patterns, effective catalog design, efficient stock management, etc. [9,10,11]. Frequent itemsets mining is one such method in association rule mining that can serve the aforementioned purpose perfectly. Frequent Itemsets Mining (FIM) is the discovery of those itemsets which coexist in the given dataset more often than a user-specified limit [12]. The importance of the FIM process can be seen from the fact that it is used for finding customer purchase trends [13], fraud detection in the banking sector [14], discovering the most frequent web page traversals [15], health monitoring [16], data profiling systems [17,18,19], and document clustering [20].
Rakesh Agrawal first highlighted FIM in his seminal work as the "Apriori algorithm". Subsequently, the research community has put forward many proposals for the FIM problem. The proposed solutions for FIM encounter two main issues, one before and one after the discovery of Frequent Itemsets (FIs) [21]: first, the user's selection of the ideal Support Threshold Parameter (STP) before starting the FIM process; second, the exponential number of FIs generated as a result of the FIM process. The second issue is addressed well by the various compact FIs-mining procedures that have been proposed. The selection of an optimum STP for the FIM process, however, remains a harder task, as it depends on the characteristics of the dataset. The frequentness of an itemset depends on the STP value provided by the user; this value is used by the FIM process to separate frequent patterns from infrequent patterns. If the user chooses a high threshold, fewer itemsets are generated, while a low threshold generates more patterns. Therefore, the user may tune the FIM process with different STP values to discover the desired FIs. Mistuning the STP can in turn cause two main problems for the FIM process, i.e., effects on the size of the result space and the size of the search space [21].
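To make the effect of STP mis-tuning concrete, the following minimal Python sketch (over a hypothetical five-transaction dataset, not any dataset from this paper) brute-forces the size of the result space at different thresholds; a low threshold inflates the result space exactly as described above.

```python
from itertools import combinations

# Hypothetical toy dataset: five transactions over four items.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"},
    {"B", "C"}, {"A", "B", "C", "D"},
]

def count_frequent_itemsets(transactions, min_support):
    """Brute-force count of itemsets whose support meets min_support."""
    items = sorted(set().union(*transactions))
    count = 0
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            support = sum(set(itemset) <= t for t in transactions)
            if support >= min_support:
                count += 1
    return count

for stp in (1, 2, 3, 4, 5):
    print(f"STP={stp}: {count_frequent_itemsets(transactions, stp)} frequent itemsets")
# STP=1 yields 15 itemsets; STP=5 yields none.
```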
Many algorithms have been proposed in the area of mining the topmost frequent itemsets [22,23,24,25,26,27,28,29,30,31]. These algorithms are commonly named K-topmost or topmost FIM approaches. The initial topmost FIM algorithms were derived from the classical user-STP-based FIM techniques, i.e., the Apriori or FP-tree algorithms, and inherit all of their demerits. There are two main challenges faced by the topmost FIM algorithms. First, the Apriori-based topmost FIM techniques proposed in [23,24,31] require multiple database scans, generate excessive candidates, and explore a large search space. Second, the FP-tree-based topmost FIM techniques proposed in [22,25,27,28,32] perform at least two scans of the database, suffer from the large-search-space problem in the case of dense datasets, and require the memory-resident construction of an FP-tree. In general, the topmost FIs-mining techniques designed on the aforementioned principles are computationally inefficient [30]. Recent techniques for topmost patterns mining include mining the top-K frequent patterns without a support threshold, proposed by Salam et al. [30], the top-K identical frequent itemsets without a support threshold by Rehman et al. [29], and the top-K frequent itemsets mining technique based on equivalence classes by Iqbal et al. [26]. All the Apriori-and-FP-Growth-independent top-K methods mentioned earlier suffer from the two main limitations described below.
a: No top-K-specific pruning method: None of the top-K-specific methods so far uses any top-K candidate itemsets pruning technique apart from an automatic support threshold raising strategy. That strategy can only determine whether an itemset is frequent after computing its support value. What is needed instead is a pruning mechanism which guarantees that a particular itemset is infrequent without computing its support value.
b: Computationally intensive methods: The Apriori-based top-K methods, the FP-Growth-based top-K methods, and the Apriori-and-FP-Growth-free methods are all computationally intensive on dense datasets with large values of K. The Apriori-and-FP-Growth-free methods produce a large number of candidate itemsets, requiring a support computation for every itemset, while the FP-Growth-based top-K methods are computationally intensive due to the underlying FP-Growth method.
1.1. Research Objectives
In this article, we propose a novel topmost frequent itemsets mining algorithm named TKIFIs Miner. The proposed algorithm is not a derivative of the Apriori or FP-Growth techniques; instead, it uses a recursive best-first-based depth-first traversal of the search space to discover the topmost frequent itemsets. During this traversal, two novel pruning strategies are used to prune early the portions of the search space not required for topmost patterns mining. Subsequently, the support of the remaining candidate patterns is calculated using an intersection-of-TIDs approach. Finally, the support of each discovered pattern is checked against the least support among the patterns currently in the top-K list: if it is greater, the pattern is added to the top-K list; otherwise, it is discarded.
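The following short Python sketch illustrates the top-K list update rule just described. It is a simplified illustration under our own naming (TopKList and offer are hypothetical names, not from the paper), and it keeps K single patterns rather than K identical-support groups as the actual TKIFIs Miner does.

```python
import heapq

class TopKList:
    """Minimal sketch of a top-K list: a min-heap of (support, itemset)."""
    def __init__(self, k):
        self.k = k
        self.heap = []  # smallest support sits at heap[0]

    def min_support(self):
        """Least support among patterns currently held (0 while not full)."""
        return self.heap[0][0] if len(self.heap) >= self.k else 0

    def offer(self, support, itemset):
        """Admit the pattern only if it beats the least support held."""
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (support, itemset))
        elif support > self.min_support():
            heapq.heapreplace(self.heap, (support, itemset))

topk = TopKList(k=3)
for support, itemset in [(7, ("F", "B", "A")), (5, ("C",)), (9, ("B",)), (4, ("D",))]:
    topk.offer(support, itemset)
print(sorted(topk.heap, reverse=True))  # the three highest-support patterns
```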
1.2. Organization of the Paper
The article is organized as follows: Section 2 presents previous work related to this area of research. Section 3 discusses the proposed algorithm and presents a few preliminaries about the topmost frequent itemsets mining problem, along with a set of examples highlighting topmost frequent itemsets mining using K as a parameter in place of a support threshold. Section 4 presents the experimental evaluation of the proposed algorithm and the comparison of the TKIFIs Miner with the current benchmark methods. Finally, Section 5 discusses the achievements of this work and the future directions that can be unveiled through topmost frequent patterns mining studies.
2. Related Work
Top-K FIM was introduced to avoid tuning the user-given threshold parameter. On the other hand, top-K FIM is considered a computationally harder task to perform than support-based FIM. Owing to its self-explanatory results, it has been applied in different real-world application domains, such as monitoring users' activity from their movement data [33], COVID-19 virus strain classification and identification [34], and the extraction of metadata from dynamic scenarios [35,36].
Salam et al. first suggested considering the natural requirement of STP-free K-most FIs to design a novel topmost-FIM technique that was not derived from the aforementioned classical approaches [30]. This technique mines topmost maximal frequent itemsets and top-K maximal frequent itemsets from the given transactional dataset. In this approach, a topmost maximal frequent itemset in a dataset is a frequent itemset of the highest support with an item length greater than or equal to 3, while a top-K maximal frequent itemset set is a set of K distinct highest-support maximal frequent itemsets. First, the association ratio of all 2-length itemsets is calculated, followed by the construction of an Association Ratio graph (AR-graph). Subsequently, the AR-graph is used to construct an all-path-source-destination tree (ASD-tree). Finally, the ASD-tree is used to find the top-K maximal frequent itemset set. The Top-K Miner introduced by Rehman et al. [29] likewise does not extend any classic STP-based mining technique. It performs only one scan of the dataset to construct frequent itemsets of length 2, and during this scan the top-K list is updated with frequent itemsets of lengths 1 and 2. The topmost frequent itemsets of length 2 from the top-K list whose support is greater than or equal to the Kth support value in the top-K list are selected for topmost-K frequent itemsets mining of lengths greater than or equal to 3. Subsequently, every 2-itemset is iteratively combined with the selected nodes of a Candidate Itemsets Search tree (CIS-tree) to form itemsets of any length greater than 2; their support is calculated and the top-K list adjusted accordingly. The Top-K Miner finds all topmost FIs as per the tuned parameter, but is memory-expensive due to the CIS-tree. Moreover, Iqbal et al. presented the TKFIM algorithm, which inherits the Apriori mining technique for the discovery of K-topmost FIs [26]. This algorithm performs excessive candidate generation because it uses the common itemset prefixes of the already produced topmost frequent itemsets. The frequencies of the itemsets are generated using the diffset support-finding method [37]. Apart from the automatic support threshold raising strategy, the algorithm does not adopt any other pruning strategy to prune the search space and avoid excessive candidate generation.
Table 1 presents the discussed topmost FIM techniques and their details.
3. Frequent Itemsets Mining
In this section, we discuss the proposed frequent itemsets mining algorithms, which comprise three methods: the TKIFIs mining method, which mines a user-required number of IFIs from the user-given dataset; the Gen-IFIs method, which explores the search space to discover the TK-IFIs; and the Candidate-IFIs method, which generates the set of candidate itemsets for newly created IFIs. Some examples are presented while explaining these methods to give more clarity to the reader. We begin with some preliminaries and definitions.
3.1. Preliminaries and Problem Definitions
The support threshold value provided by the user serves as the border between frequent and infrequent patterns. Our proposed FIM method is not based on a support threshold value supplied by the user. How, then, can our proposed method discriminate between frequent and infrequent patterns to produce the required number of patterns?
There is an important observation related to the selection of a support threshold as far as our proposed method is concerned. The parameter required by the proposed method is K, which determines the number of topmost frequent patterns to be discovered from the search space. The value of K guides the proposed method in automatically adjusting the border between the required topmost frequent patterns and the infrequent patterns. The support count value of the patterns decides this border, and it is determined or adjusted automatically by the machine based on the K value rather than given by the user. Tuning the value of K to find the required number of patterns is simple and, unlike a support threshold value, does not depend on the characteristics of the given dataset. Therefore, based on the above discussion, the following definitions are presented as further groundwork:
Identical Itemsets (IIs): One or more itemsets are called IIs if and only if all of them have the same support count, that is, $\sigma(X_1) = \sigma(X_2) = \dots = \sigma(X_n)$, where $\sigma(X_i) \in S$ is the support count of every pattern found in the set of support counts $S$.
Top-1 Identical Frequent Itemsets ($IFI_1$): One or more itemsets are called $IFI_1$ if and only if all of them have the same support count and their support is the first highest support in the set of support counts $S$, that is, $\sigma(X) = max_1(S)$, where $max_1(S)$ is the first highest support count in $S$.
Top-K Identical Frequent Itemsets ($IFI_K$): One or more itemsets are called $IFI_K$ if and only if all of them have the same support count and their support is the Kth highest support in the set of support counts $S$, that is, $\sigma(X) = max_K(S)$, where $max_K(S)$ is the Kth highest support count in $S$.
Suppose $I$ is the set of all items in a database, i.e., $I = \{i_1, i_2, \dots, i_n\}$, where each $i_j$ is an item. The database $D$ is a set of transactions $D = \{T_1, T_2, \dots, T_m\}$, where every $T_j \subseteq I$ is a transaction. An itemset $X \subseteq I$ contained in one or more transactions has support $\sigma(X)$, i.e., $\sigma(X) = |\{T \in D : X \subseteq T\}|$. Let $F$ be the set of all itemsets found in the transactions of database $D$; that is, $F$ contains all the itemsets that can be generated from $D$. Let $S$ be the set of support counts of all the itemsets contained in $F$. The problem of topmost frequent itemsets mining is to find all IFIs from the highest support to the Kth support, where K is a user-specified number of topmost IFIs, i.e., $\{X \in F : \sigma(X) \geq max_K(S)\}$.
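As a small illustration of the definitions above, the following Python sketch (with a hypothetical support table, not drawn from this paper's datasets) groups itemsets by identical support and returns the groups for the K highest distinct support values:

```python
# Hypothetical itemset -> support table.
supports = {
    ("B",): 9, ("A",): 8, ("F",): 8,
    ("B", "A"): 7, ("F", "B", "A"): 7, ("C",): 5,
}

def top_k_ifis(supports, k):
    """Return {support: [itemsets]} for the K highest distinct supports."""
    distinct = sorted(set(supports.values()), reverse=True)[:k]
    return {s: [x for x, sup in supports.items() if sup == s] for s in distinct}

print(top_k_ifis(supports, k=2))
# {9: [('B',)], 8: [('A',), ('F',)]} -- the IFI_1 and IFI_2 groups
```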
3.2. TKIFIs Mining Method
In this section, we introduce the TKIFIs Miner method, which mines the user-required number of IFIs from the user-given dataset based on one simple parameter K, as shown in step 1 of Algorithm 1 below. The database is scanned once to copy each tid into the corresponding position of the 1-itemset indexed with item i; this is done by performing a horizontal scan of the given dataset, as shown in steps 3 to 5 of Algorithm 1. The next step copies all 1-itemsets of the highest support to the Kth support into the set of candidate itemsets C and into the TK-IFIs list. To perform this task, the topmost highest-support itemset is first discovered and copied into itemset i, followed by copying this itemset into C, as shown in steps 6a and 6b of Algorithm 1. The topmost selected itemset i from step 6a is copied into the TK-IFIs list in step 6c, and is then removed from the list of 1-itemsets in step 6d. In step 7 of Algorithm 1, the 1-itemsets of the highest frequency to the Kth frequency are used to produce the frequencies of the 2-itemsets. In step 8, the Gen-IFIs algorithm is applied for top-K IFIs mining on the already trimmed search space. For further explanation, Example 1 illustrates the algorithm by applying it to the transactional dataset given in Table 2.
Algorithm 1. TKIFIs Miner (Dataset: D, User Input: K).
- (1) Input K // the number of TK-IFIs to be mined from dataset D
- (2) Initialize the TK-IFIs list and the candidate set C as empty
- (3) for each transaction T in D do
- (4) for each item i in T do
- (5) append tid(T) to the TID-list of the 1-itemset {i}
- (6) repeat K times:
  - a. select the 1-itemset i with the highest support
  - b. copy itemset i into the candidate set C
  - c. copy itemset i into the TK-IFIs list
  - d. remove itemset i from the list of 1-itemsets
- (7) for each pair of the selected 1-itemsets do
  - a. compute the support of the corresponding 2-itemset
  - b. update the TK-IFIs list accordingly
- (8) call Gen-IFIs(∅, C, TK-IFIs, K)
- (9) return the TK-IFIs list
Example 1: Consider the transactional dataset given in Table 2. The dataset consists of 10 transactions and 6 distinct items. For the given values of K = 4 and 5, the dataset is mined for the top-4 and top-5 IFIs, as shown in Table 3. Table 3 presents the topmost patterns mined for K = 4 and 5 from the transactional dataset given in Table 2. It is clear from Table 3 that the only input required from the user is the K value to produce the desired number of topmost patterns. In the top-K result sets in Table 3, it can be observed that all the subsets of an FI in the top-K IFIs are placed at the same or higher-ranked positions due to the Apriori property [1]. For example, all the subsets of the frequent itemset {FBA = 7} in the top-5 IFIs in Table 3 are ranked at top-4 or higher positions.
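The scan-and-select phase of Algorithm 1 (steps 3 to 6) can be sketched in a few lines of Python; the five-transaction dataset below is hypothetical and stands in for Table 2 only for illustration:

```python
from collections import defaultdict

# Hypothetical dataset: tid -> items (not Table 2).
transactions = {1: "ABF", 2: "BF", 3: "ABCF", 4: "BC", 5: "ABF"}

def scan_and_select(transactions, k):
    tid_lists = defaultdict(set)
    for tid, items in transactions.items():   # steps 3-4: one horizontal scan
        for item in items:
            tid_lists[item].add(tid)          # step 5: record the TID
    # step 6: iteratively pick the K 1-itemsets of highest support
    ranked = sorted(tid_lists.items(), key=lambda kv: len(kv[1]), reverse=True)
    return dict(ranked[:k])

print(scan_and_select(transactions, k=3))
# {'B': {1, 2, 3, 4, 5}, 'F': {1, 2, 3, 5}, 'A': {1, 3, 5}}
```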
3.3. Gen-IFIs Production Method
This section presents the Gen-IFIs method, a search space exploration algorithm that discovers the IFIs based on the number of topmost-K IFIs required by the user. It is a recursive best-first-based depth-first search discovery approach. Every time Gen-IFIs is called recursively, it discovers one or more IFIs. The generated IFIs are then either accommodated in the currently maintained TK-IFIs list or dropped due to a lesser support count. The procedure uses four parameters during the discovery process. The first parameter is the current IFI of the highest support; at the start, when Gen-IFIs is called, it is passed as empty. The second parameter is the candidate itemsets list C, which is used for the extension of the current IFI; before calling Gen-IFIs for the first time, the top-K 1-itemsets are copied into C, and C is then passed as the second parameter. The third parameter maintains the current top-K IFIs produced before and within the calls. The last parameter is K, which is used in pruning the candidate patterns in the Candidate-IFIs procedure called in step 5 of Algorithm 2.
The Gen-IFIs procedure extends the current IFI with an item i from the candidate itemsets list iteratively, as shown in step 1 of Algorithm 2. The minimum support threshold is raised automatically with the addition of IFIs to the TK-IFIs list. Therefore, before appending item i from the candidate list to the current IFI, its support count is checked against the minimum support of the TK-IFIs list, as shown in step 2 of Algorithm 2. If the support of item i is less than the minimum support of the IFIs in the TK-IFIs list, then item i is pruned rather than combined with the current IFI. In step 3, the selected item i is combined with the current IFI to form a new IFI. In step 4, all the 1-itemsets in the candidate list whose support value is greater than or equal to that of the current item i are copied into the possible candidate itemsets list.
The Candidate-IFIs procedure is used here to return a new candidate list for the new IFI, as shown in step 5 of Algorithm 2. It uses the new IFI formed in step 3 and the possible candidate list to generate a new candidate itemsets list. In step 6, the Gen-IFIs algorithm is called recursively for further processing. The search space strategy used is explained below. For FIM, the search space is arranged in the form of a Set Enumeration tree (SE-tree), as presented in Figure 1. Since FIM methods discover patterns or itemsets in a search space, the SE-tree is one of the search space presentation methods used here to enumerate all of the possible subsets. The intuitions behind adopting the SE-tree for search space presentation are given below.
The FIM search problem is a search over a subset of the power set of items, and the SE-tree represents this complete search space without redundancy.
The SE-tree is a structure that can be traversed recursively.
For FIM techniques adopting a depth-first traversal of the search space, the SE-tree can be used to model the search space efficiently.
Algorithm 2. Gen-IFIs (current IFI: p, Candidate List: C, TK-IFIs List, K).
- (1) for each item i in C do
- (2) if support(i) < min-support(TK-IFIs) then skip item i
- (3) p′ := p ∪ {i} // extend the current IFI with item i
- (4) PC := {j ∈ C : support(j) ≥ support(i)} // possible candidate list
- (5) C′ := Candidate-IFIs(p′, PC, TK-IFIs)
- (6) Gen-IFIs(p′, C′, TK-IFIs, K) // recursive depth-first call
- (7) return
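As a rough, runnable reading of Algorithm 2 (not the authors' implementation), the sketch below folds the candidate generation of Algorithm 3 into a plain TID-list intersection and keeps the top-K list as a sorted Python list of (support, itemset) pairs; it also simplifies the identical-support grouping of the actual method to K single entries:

```python
def gen_ifis(p, p_tids, cands, topk, k):
    """Recursive best-first, depth-first sketch of Algorithm 2.
    cands maps item -> TID set; topk is a list of (support, itemset)."""
    def min_sup():  # least support currently held in the top-K list
        return min(s for s, _ in topk) if len(topk) >= k else 0
    order = sorted(cands, key=lambda j: len(cands[j]), reverse=True)
    for idx, i in enumerate(order):                 # step 1: try each item
        if len(cands[i]) < min_sup():               # step 2: prune item i
            continue
        new_p = p + (i,)                            # step 3: extend p with i
        new_tids = p_tids & cands[i]                # support via TID intersection
        if len(new_tids) >= min_sup():              # keep a qualifying IFI
            topk.append((len(new_tids), new_p))
            topk.sort(reverse=True)
            del topk[k:]                            # retain only the K best
        new_cands = {j: cands[j] for j in order[:idx]}  # step 4: sup(j) >= sup(i)
        gen_ifis(new_p, new_tids, new_cands, topk, k)   # step 6: recurse

tid_lists = {"B": {1, 2, 3, 4, 5}, "F": {1, 2, 3, 5}, "A": {1, 3, 5}}
topk = []
gen_ifis((), {1, 2, 3, 4, 5}, tid_lists, topk, k=3)
print(topk)  # [(5, ('B',)), (4, ('F', 'B')), (4, ('F',))]
```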
We now give Example 2 to illustrate the Gen-IFIs method by applying it to the transactional dataset given in Table 2.
Example 2: Consider the transactional dataset in Table 2. Generate all 1-itemsets for K = 5 and the corresponding 2-itemsets from them. The given transactions dataset is scanned transaction by transaction using the iterative step 3 of Algorithm 1. Every transaction is scanned from left to right for every item, and the transaction ID for every item is recorded. Finally, the K highest-support 1-itemsets are selected iteratively, as shown in steps 6a and 6c of Algorithm 1. The 2-itemsets are then computed from the K highest-support 1-itemsets, as shown in the table below.
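The 2-itemset computation described in Example 2 amounts to intersecting 1-itemset TID-lists; a minimal sketch follows (the TID-lists are hypothetical rather than taken from Table 2):

```python
from itertools import combinations

# Hypothetical 1-itemset TID-lists.
tid_lists = {"B": {1, 2, 3, 4, 5}, "F": {1, 2, 3, 5}, "A": {1, 3, 5}}

for x, y in combinations(sorted(tid_lists), 2):
    common = tid_lists[x] & tid_lists[y]   # transactions containing both items
    print(f"{{{x},{y}}}: support = {len(common)}")
# {A,B}: support = 3, {A,F}: support = 3, {B,F}: support = 4
```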
3.4. Candidate Generation Method
The Candidate-IFIs method is used to generate the set of candidate itemsets for the newly created IFI. Every item from this set is subsequently used to extend the current IFI. For the creation of the candidate itemsets, the individual items from the possible candidate list PC are utilized. An item is iteratively selected from PC in step 1 of Algorithm 3 for the purpose of addition into the candidate list. Before appending an item to the candidate list, two types of pruning strategies are applied to it. These strategies are explained in the following.
Itemset-Based Candidate Pruning (IBCP): An itemset is not a candidate IFI if an item in the itemset has a support count less than the minimum support of the IFIs in the TK-IFIs list.
Two-Itemset-Based Candidate Pruning (TIBCP): An itemset is not a candidate IFI if any 2-itemset within the itemset has a support count less than the minimum support of the IFIs in the TK-IFIs list.
Let $X = Y \cup \{z\}$, where $Y$ is a two-itemset and $z$ is a one-item extension of $Y$. Suppose the support count of $Y$ or $\{z\}$ is less than the minimum support of the IFIs in the TK-IFIs list, i.e., $\sigma(Y) < \sigma_{min}$ or $\sigma(\{z\}) < \sigma_{min}$. This implies that $\sigma(X) \leq \min(\sigma(Y), \sigma(\{z\})) < \sigma_{min}$ according to the Apriori property [12]. Hence, the itemset $X$ can never be termed an IFI.
The IBCP and TIBCP strategies are applied on lines 2 and 4, respectively, of Algorithm 3. If the support of an item, or of a two-itemset containing that item, is less than the minimum support of the IFIs in the TK-IFIs list, the item is skipped from the frequency test. A small sketch of these two tests is given below.
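In the following Python sketch, the support tables and the min_sup argument (standing for the minimum support of the IFIs in the TK-IFIs list) are hypothetical stand-ins used only to illustrate the two tests:

```python
from itertools import combinations

def ibcp_prunable(item_support, min_sup):
    """IBCP: prune if a single item's support is below the top-K minimum."""
    return item_support < min_sup

def tibcp_prunable(itemset, pair_support, min_sup):
    """TIBCP: prune if any 2-itemset inside `itemset` falls below the minimum.
    `pair_support` maps frozenset pairs to their support counts."""
    return any(pair_support.get(frozenset(pair), 0) < min_sup
               for pair in combinations(itemset, 2))

pair_support = {frozenset("BF"): 4, frozenset("BA"): 3, frozenset("FA"): 3}
print(ibcp_prunable(3, min_sup=4))                               # True
print(tibcp_prunable(("B", "F", "A"), pair_support, min_sup=4))  # True: {B,A} = 3 < 4
```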
Algorithm 3. Candidate-IFIs (new IFI: p′, Possible Candidate List: PC, TK-IFIs List).
- (1) for each item z in PC do
- (2) if support(z) < min-support(TK-IFIs) then // IBCP
- (3) skip item z
- (4) if any 2-itemset {y, z}, y ∈ p′, has support < min-support(TK-IFIs) then // TIBCP
- (5) skip item z
- (6) X := p′ ∪ {z} // form the new candidate IFI
- (7) TID(X) := TID(p′) ∩ TID({z}) // intersection of TID-lists
- (8) if TID(X) = ∅ then
- (9) skip item z
- (10) σ(X) := |TID(X)| // compute the support count of X
- (11) if σ(X) ≥ min-support(TK-IFIs) then add X to the TK-IFIs list
- (12) add z to the new candidate list C′
- (13) return C′
The support count of the new IFI X is computed in step 10 of Algorithm 3. If the support count of X is equal to or greater than the minimum support of the IFIs in the TK-IFIs list, i.e., σ(X) ≥ min-support(TK-IFIs), the resultant IFI is added to the TK-IFIs list in step 11 of Algorithm 3. The currently selected item z is added to the set of candidates C′ in step 12 of Algorithm 3 for further depth-first-order processing of itemsets. We now give Example 3 to illustrate the Candidate-IFIs method by applying it to the transactional dataset given in Table 2.
Example 3: For the given transactional dataset in Table 2, assuming the descending support order < on the top-5 IFIs as shown in Table 4, the complete set enumeration tree is given in Figure 1.
If we have n items in a set of items $I$ in a database $D$, the search space for the FIM problem consists of all possible subsets of $I$, that is, $2^n$ itemsets. Assuming a specific order on the items of the set $I$, the set enumeration tree of $I$ will present all $2^n$ itemsets. All nodes of the tree except the root node shown in Figure 1 are formed according to the following two observations.
Let $X$ be a node label in the SE-tree, and let $C$ be the candidate set of items used to extend any node in the tree apart from the root node, that is, $C \subseteq I \setminus X$.
A node label $Y$ is an extension of a node label $X$ if $Y = X \cup Z$, where $Z \neq \emptyset$ and $Z$ is any possible subset of $C$. An itemset $Y$ is a single-item extension, or child, of $X$ if $|Z| = 1$.
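A compact recursive sketch of this extension rule is given below; with an assumed item order, each node is extended only by the candidate items that follow its last item, so every subset is enumerated exactly once:

```python
def se_tree(node, candidates):
    """Enumerate the SE-tree rooted at `node` in depth-first order."""
    print(node if node else "{}")             # visit the current node label
    for idx, item in enumerate(candidates):   # single-item extensions (children)
        se_tree(node + (item,), candidates[idx + 1:])

se_tree((), ("A", "B", "C"))
# {}, (A,), (A,B), (A,B,C), (A,C), (B,), (B,C), (C,) -- all 2^3 subsets
```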
5. Conclusions and Future Work
In this article, we presented the efficient TKIFIs Miner algorithm for mining the top-K IFIs from transactional data collected through smart shopping carts without using the support threshold parameter. Adjusting that parameter to mine the required number of FIs is a hard choice for users; the TKIFIs Miner algorithm instead allows users to control the number of IFIs produced using a single parameter K. In TKIFIs Miner, alongside an automatic adjustment of the support threshold, we used the new IBCP and TIBCP pruning techniques, which had not been used earlier in any topmost frequent patterns mining problem. These pruning methods proved effective in pruning entire itemset sub-trees from the search space, thereby enabling the depth-first-search-based method to quickly find the itemsets having high support values.
TKIFIs Miner outperformed all the topmost patterns mining approaches and the FP-Growth technique in the experimental evaluation. TKIFIs Miner's computational time is comparable to that of its counterparts on dense and sparse datasets for small values of K, while on dense datasets TKIFIs Miner excels by a larger margin in terms of computational time for high values of K. As future work, to present the produced result set in a more compact manner, we will integrate maximal or closed patterns mining with the TKIFIs Miner approach.