1. Introduction
Location-detection devices such as GPS and RFID provide the ability to log an object’s travel pattern. Trajectories denote the paths traced by bodies moving in space over time [1]. They are captured by these devices, installed on moving bodies, as periodic sequences of geographical coordinates and timestamps. Enormous amounts of such data are created and collected every day. Additionally, the movement of objects generally follows frequently repeated patterns. Trajectory pattern discovery has applications in movement prediction [2,3,4], region-of-interest discovery [5], and the study of traffic flow or congestion [6]. Therefore, extracting implicit and valuable patterns from vast databases of location data has attracted great interest in recent years.
This work focuses on extracting knowledge from vehicle trajectories through Sequential Pattern Mining (SPM). SPM finds sequences whose ratio of occurrences exceeds a user-defined minimum threshold [7,8,9]. The frequently occurring sequences retrieved can be used to find connections between different objects or events. The same concept extends to trajectory data, where sequential patterns are collections of geographical locations that several moving objects visit in a particular order. Applying SPM to trajectory data therefore yields frequently appearing sequences of locations, which provide insights into the subjects’ travel patterns. A thorough understanding of residents’ travel patterns facilitates the analysis and prediction of travel behavior and offers valuable insights into socioeconomic dynamics. These mining objectives are best achieved with taxi trajectory data, since it is of high quality, consistent, and widely distributed [10]. Hence, this work uses Microsoft T-drive taxi trajectory data [11,12].
Trajectories are essentially spatio-temporal records of vehicles. Performing SPM on vehicle trajectories is problematic for two reasons: (1) Spatial uncertainty and noisy recordings plague GPS technology, so vehicles never visit the exact same recorded locations. For example, two vehicles may travel on the same stretch of road, but their recorded coordinates rarely match precisely. Finding frequent travel patterns therefore necessitates some level of fuzziness. (2) SPM requires sequences of discrete items, yet each trajectory commonly contains hundreds or thousands of GPS points without any pre-defined segmentation, presenting a redundancy issue. Worse still, the recorded trajectories are far from discrete. Due to these characteristics of trajectory data, many SPM methods now in use often produce enormous, repetitive, and indecipherable outputs [13].
Addressing these issues requires transforming trajectory data into formats compatible with current sequential pattern mining systems. An abstract method of segmenting the trajectories is essential: it must introduce fuzziness into the data while reducing its size and, at the same time, discretizing the continuous recordings. Trajectory simplification methods such as grid-based or clustering-based segmentation can be applied to obtain discretized trajectories and remove spatial uncertainty, thereby preparing trajectory data for SPM algorithms [1,14,15]. A smaller, less redundant set of sequential patterns can be obtained by enforcing constraints such as closed, maximal, or contiguous [7,16,17]. For example, under the contiguous constraint, items in resultant patterns must also be adjacent in the original sequence. Such an approach can be observed in [18].
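As an illustration of grid-based segmentation, the following minimal sketch maps raw GPS points to grid-cell ids and collapses consecutive repeats, producing the kind of discrete sequence SPM expects. The cell size, field layout, and duplicate-collapsing rule are illustrative assumptions, not the exact preprocessing used in this work.

```python
def to_grid_sequence(points, cell_deg=0.01):
    """Map (lat, lon) GPS points to grid-cell ids and collapse
    consecutive repeats, yielding a discrete sequence for SPM.
    cell_deg is an assumed cell size in degrees."""
    seq = []
    for lat, lon in points:
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        if not seq or seq[-1] != cell:  # drop consecutive duplicates
            seq.append(cell)
    return seq

# Two nearby fixes fall into the same cell; the third moves to a new cell.
trajectory = [(39.9841, 116.3187), (39.9843, 116.3189), (39.9901, 116.3255)]
print(to_grid_sequence(trajectory))
```

The collapsing step both discretizes the trajectory and removes the redundancy of densely sampled points along the same road segment.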
Until now, there has not been much comparative work that extensively examines and summarizes the performance of the various SPM types in the context of trajectory data. Furthermore, most recent works either introduce trajectory data mining while providing few clues on applying SPM to trajectory data, such as [1,19,20], or focus on an overview of SPM, such as [7,21]. Therefore, through extensive experiments, this work presents the performance and behavior of several SPM algorithms, namely GSP, CM-SPADE, PrefixSpan, CM-ClaSP, CloFast, MaxSP, VMSP, and CM-SPAM. It serves as a reference for those interested in implementing SPM algorithms in trajectory-related applications. Searching for and choosing a suitable algorithm is time-consuming; with such a reference guide, users can promptly pick the most appropriate algorithm for their application scenario.
The remainder of the paper is structured as follows:
Section 2 introduces the definitions and problems of sequential pattern mining.
Section 3 elaborates on representative sequential mining algorithms and constraint-based SPM based on their approaches, some of which were used in this comparative study.
Section 4 describes the data processing procedure and experiment details.
Section 5 presents and discusses the results of mining T-drive data using different SPM algorithms. Finally, Section 6 concludes the paper.
3. Sequential Pattern Mining Algorithms
Pattern mining algorithms differ in several ways: (1) the candidate search method, such as breadth-first or depth-first search, and (2) the database representation. For example, the database described previously is an implementation of a horizontal database. Sequential pattern mining is computationally more intensive than itemset mining, as many intermediate candidates or subsequences must be generated and verified during the process. A typical sequential pattern mining algorithm aims to find all the patterns given a database and the minsup threshold. When mining long sequences, massive databases, or small minsup, the computational resources required may become the limiting factor. According to Equation (2), a frequent 100-sequence gives rise to 2^100 − 1 ≈ 10^30 frequent subsequences with a minsup of 1. Also, the algorithms sometimes demand too much computing power, preventing them from completing the search. Many efforts have been made to increase the efficiency of algorithms. Directions of optimization include: (1) reducing intermediate candidate size, (2) performing fewer database scans, (3) limiting the search space, and (4) optimizing the candidate generation and support counting phases.
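To make the notions of subsequence, support, and the subsequence blow-up concrete, here is a minimal Python sketch; the function names and toy database are illustrative, not part of the original work.

```python
def is_subsequence(sub, seq):
    """True if the items of `sub` appear in `seq` in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in sub)  # `in` consumes the iterator

def support(sub, database):
    """Fraction of sequences in the database that contain `sub`."""
    return sum(is_subsequence(sub, s) for s in database) / len(database)

db = [['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c', 'd']]
print(support(('a', 'c'), db))  # contained in 3 of 4 sequences -> 0.75

# A frequent n-sequence contains 2**n - 1 non-empty subsequences, which is
# why a frequent 100-sequence with minsup of 1 is computationally hopeless:
print(2**100 - 1)
```

Every candidate an algorithm generates must pass a test like `support(...) >= minsup`, so the subsequence count directly bounds the work to be done.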
Besides computational inefficiency, the algorithms often output a massively large number of frequent patterns. Redundant and meaningless patterns in the output cause users to spend additional time and effort searching for patterns of interest. Mining trajectory sequential patterns, which often involve long sequences, poses the challenge of obtaining a smaller, less redundant set of sequential patterns. These challenges can be overcome with constraint-based SPM. This variant of SPM summarizes the results into a concise representation of the complete set of frequent patterns without extracting the entire set.
This section first introduces three main types of search techniques and their representative algorithms, followed by three different constraints and the algorithms that enforce them.
3.1. Breadth-First Search
The Apriori algorithm was proposed to deal with the problem of frequent itemset mining [23]. It mines frequently occurring itemsets by applying the downward closure property, which states that if an itemset is infrequent in the database, all its extensions are also infrequent [22]. This property is particularly useful for pruning, as it greatly reduces the search space. However, Apriori does not account for order, causing it to fail in situations where order matters, such as time-series data and text. Sequential pattern mining was subsequently introduced [9]. AprioriAll first determines which items are frequently observed in the database before extending them into longer sequences. It proceeds in a two-step manner: candidate generation followed by support counting. In candidate generation, the algorithm first searches for the frequent 1-sequences (i.e., sequences containing a single item) and generates 2-sequences by extending the 1-sequences. Similarly, 3-sequences are generated from 2-sequences, and the process continues until no further extensions can be made. This approach is also known as the level-wise or breadth-first approach [7,22]. Each n-sequence extension scans the whole database. The search space with such approaches can become monumental, as they always consider the worst case and explore all possible sequences [7]. In the support counting phase, AprioriAll employs a hash tree to count the generated sequential patterns and remove unwanted patterns. Various improvements have since been proposed to increase the efficiency of Apriori-based algorithms, but techniques such as the two-step approach and hash-tree-based support counting in AprioriAll remain inspirational for newer algorithms.
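The level-wise loop can be sketched in a few lines of Python for the simple case of single-item sequences; this is a didactic simplification (no hash tree, naive full-database scan per level), not AprioriAll itself.

```python
def bfs_frequent_sequences(database, minsup):
    """Level-wise (breadth-first) mining sketch for sequences of single
    items: extend frequent k-sequences by one item, then count support."""
    def is_subseq(sub, seq):
        it = iter(seq)
        return all(x in it for x in sub)
    def sup(cand):
        return sum(is_subseq(cand, s) for s in database) / len(database)

    items = sorted({x for s in database for x in s})
    level = [(i,) for i in items if sup((i,)) >= minsup]  # frequent 1-sequences
    frequent = list(level)
    while level:
        # candidate generation: extend each frequent k-sequence by one item
        candidates = [p + (i,) for p in level for i in items]
        # support counting: one pass over the whole database per level
        level = [c for c in candidates if sup(c) >= minsup]
        frequent.extend(level)
    return frequent

db = [['a', 'b', 'c'], ['a', 'c'], ['a', 'b', 'c']]
print(bfs_frequent_sequences(db, minsup=2/3))
```

Note how every level regenerates and recounts candidates against the full database, which is exactly the cost that later vertical-database and pattern-growth methods avoid.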
The authors of the Apriori algorithm, Agrawal and Srikant, proposed an improved version named Generalized Sequential Patterns (GSP) in 1996 [29]. GSP adopts a horizontal database similar to Apriori’s and also uses breadth-first search for frequent sequential pattern discovery. GSP attempts to generalize sequential pattern mining by employing time restrictions, sliding time windows, and taxonomies in sequential patterns [22]. Like AprioriAll, GSP follows a two-step approach where all candidates are generated prior to support counting. However, GSP keeps all k-sequences in memory to extend into (k + 1)-sequences [7]. By merging smaller patterns, this tactic allows GSP to generate candidates without repeatedly visiting the database. Although this reduces the number of database scans, much time and space are wasted on non-existent candidates. In the second step, GSP performs multiple database scans to calculate support, which is very costly for large databases.
3.2. Depth-First Search
Unlike breadth-first search, a depth-first search algorithm starts with sequences containing single items, i.e., 1-sequences, and recursively extends one of them until exhausted. The algorithm then returns and extends another 1-sequence to generate further sequential patterns. This search strategy was employed in SPADE (Sequential Pattern Discovery using Equivalence classes) by Zaki in 2001 [30]. SPADE creates an IDList for each item, indicating that item’s positions in the respective sequences. An IDList is a vertical database representation; all IDLists are generated together by scanning the horizontal database once, and the horizontal database can be reconstructed from the corresponding vertical one. The support counting step is greatly simplified, as all frequent patterns can be enumerated by joining IDLists, hence calculating the support of any pattern directly. Subsequently, SPADE checks the cardinality of the resulting IDList against minsup to ensure that all subsequences of the resulting sequence are frequent. This approach improved performance significantly compared to breadth-first search algorithms, without the need to perform multiple database scans or maintain candidates in memory.
However, the issue of non-existent candidates persisted, as the candidate generation procedure made no reference to the original database. Furthermore, when a pattern appears in many sequences, the IDList join operation remains costly in a large database, especially a dense one or one with long sequences. Co-occurrence pruning was introduced by Fournier-Viger et al. in CM-SPADE to minimize join operations [31]. A co-occurrence map storing all frequent 2-sequences is created during the initial database scan. If the last two items of a candidate sequence are not in the co-occurrence map, the sequence is eliminated directly without constructing IDLists, thus reducing the number of join operations.
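A minimal sketch of the vertical representation and the temporal join it enables (the data layout and function names are simplified assumptions; SPADE’s actual IDLists and equivalence-class machinery are richer):

```python
from collections import defaultdict

def build_idlists(database):
    """Vertical representation: item -> list of (sequence_id, position),
    built with a single scan of the horizontal database."""
    idlists = defaultdict(list)
    for sid, seq in enumerate(database):
        for pos, item in enumerate(seq):
            idlists[item].append((sid, pos))
    return idlists

def temporal_join(idlist_a, idlist_b):
    """IDList of the pattern 'a then b': keep b-occurrences that follow
    some a-occurrence within the same sequence."""
    return [(sid_b, pos_b)
            for sid_a, pos_a in idlist_a
            for sid_b, pos_b in idlist_b
            if sid_a == sid_b and pos_a < pos_b]

db = [['a', 'b', 'c'], ['b', 'a', 'c']]
idl = build_idlists(db)
ab = temporal_join(idl['a'], idl['b'])
# support of <a, b> = number of distinct sequence ids in the joined IDList
print(len({sid for sid, _ in ab}))
```

Support is read directly off the joined list, which is why no further database scan is needed once the IDLists exist.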
Another strategy, employed in SPAM, uses a bitmap representation of the vertical database [32]. For each item, a bitmap is generated with bits set to 0 or 1 depending on the item’s appearance in each sequence. Such a data layout lets SPAM perform efficient support counting and output resultant patterns online. However, SPAM is relatively space-inefficient: it needs to fit the entire database and data structure into memory, which can make large trajectory databases intractable when memory resources are limited. Similarly, co-occurrence pruning was added in CM-SPAM to enhance the performance of SPAM.
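The bitmap idea can be illustrated with one bit per sequence per item; SPAM actually uses finer-grained per-position bitmaps, so this is a deliberately simplified, hypothetical sketch.

```python
def item_bitmaps(database):
    """One bit per sequence per item: bit i is set if the item occurs
    anywhere in sequence i (simplified from SPAM's per-position bitmaps)."""
    items = {x for s in database for x in s}
    return {item: sum(1 << sid
                      for sid, seq in enumerate(database) if item in seq)
            for item in items}

db = [['a', 'b'], ['a'], ['b', 'a']]
bm = item_bitmaps(db)

# Support of sequences containing both items = popcount of the bitwise AND.
both = bm['a'] & bm['b']
print(bin(both).count('1'))  # 'a' and 'b' co-occur in 2 sequences
```

Bitwise AND plus a popcount replaces a scan over the database, which is the source of SPAM’s counting speed, and also of its memory appetite, since all bitmaps live in RAM at once.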
3.3. Pattern-Growth
Besides utilizing a vertical database, another way of optimizing frequent pattern mining is to build a projected database over a frequent pattern tree (FP-tree) structure, a more efficient variation of the horizontal database that extends the prefix tree during depth-first search [33,34]. Such an approach eliminates the costly task of candidate generation, as it gradually grows trees of frequent itemsets during projected database generation. The original database is divided into a series of projected databases, with each itemset projected to no more than one of them; the size of each resulting database is always less than that of the original. Next, the method starts from a frequent pattern of length 1, or suffix pattern, and recursively extends the pattern from sub-databases consisting of the set of frequent items co-occurring with the suffix pattern [33]. Larger patterns form as the algorithm recursively concatenates items to suffix patterns, eventually yielding a fully constructed FP-tree. During the process, sets of frequent patterns under the same suffix pattern are ordered by descending support. The FP-tree-based mining approach achieves efficiency in three ways: (1) using a highly condensed database that reduces the number of database scans to two, hence avoiding multiple scans; (2) using a pattern-fragment-growth method to avoid massive candidate sets; and (3) using a partitioning-based, divide-and-conquer method to reduce the colossal search space.
However, the FP-tree is unsuitable for sequential mining, as sub-sequences with different orderings cannot be re-ordered and treated as the same pattern, leading to a vast, non-collapsible database. In the worst-case scenario, database projection requires copying almost the entire database. Based on the idea of the FP-tree, Pei, Han et al. developed PrefixSpan (Prefix-projected Sequential pattern mining) and extended it to sequential pattern mining [35]. In PrefixSpan, the postfix subsequences corresponding to prefix subsequences in the sequence database are recursively projected into smaller databases. Upon completion, each projected database considers only frequent local patterns to grow the sequential pattern trees. When the database is huge, the main-memory-based pseudo-projection technique in PrefixSpan, which treats the projected database as a set of pointers into the original database, can reduce the computational cost of projection.
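The core projection step can be sketched as follows; this is a toy, copy-based projection on single-item prefixes (PrefixSpan’s pseudo-projection would keep pointers into the original database instead of copying suffixes).

```python
def project(database, prefix_item):
    """Project the database on a single-item prefix: for each sequence
    containing the item, keep the suffix after its first occurrence.
    Mining then recurses on the (smaller) projected database."""
    projected = []
    for seq in database:
        if prefix_item in seq:
            projected.append(seq[seq.index(prefix_item) + 1:])
    return projected

db = [['a', 'b', 'c'], ['b', 'a', 'c'], ['b', 'c']]
print(project(db, 'a'))  # suffixes after 'a'; the third sequence drops out
```

Each recursion level mines only the frequent items local to the projected database, so no candidate that is absent from the data is ever generated.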
3.4. Closed Constraints
Closed frequent patterns represent the largest sub-sequences common to sets of sequences [7]. Such patterns are lossless, meaning the whole set of sequential patterns can be reconstructed from the resulting patterns. Given a dataset D, a pattern X is a closed frequent pattern if X is frequent and there exists no proper super-pattern Y with the same support as X in D, i.e., sup(Y) < sup(X) for all Y ⊃ X. The idea of a closed pattern was first introduced by Pasquier et al. in 1999, together with the frequent closed itemset discovery algorithm AprioriClose (or A-Close) [36]. Subsequently, Yan et al. proposed CloSpan (Closed Sequential pattern mining), which targets closed sequential patterns [37]. CloSpan extends PrefixSpan and adopts a two-step approach: the first step generates a candidate set based on the concept of equivalence of projected databases, and the second step prunes all non-closed sequences. A hash-based technique is used for pruning, where the hash function is the support of a sequence. CloSpan outperforms PrefixSpan and can mine long sequences in a large database even at low minsup where PrefixSpan fails. ClaSP (Closed Sequential Patterns algorithm) was then proposed by Gomariz, Campos et al. [38]. It marries the idea of vertical databases from SPADE with the heuristic pruning approach from CloSpan. However, ClaSP needs to keep previous candidates in memory for sequence pruning, which is rather memory-intensive. Similar to CM-SPADE, a co-occurrence map was also implemented in CM-ClaSP by Fournier-Viger et al. [31]. Unlike the aforementioned approaches, CloFAST combines sparse id-lists and vertical id-lists to enable rapid counting of sequential patterns [39]. The properties of this combination alleviate the memory-intensive situation in ClaSP by allowing a one-step approach for both sequence closure checking and search space pruning. Specifically, the sparse id-list is used for closed frequent itemset mining, and the vertical id-list is used for closed sequence pattern generation.
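The closure condition itself is easy to state in code. The sketch below naively filters an already-mined pattern-to-support map down to its closed patterns; real closed-pattern miners (CloSpan, ClaSP, CloFAST) prune during the search instead of post-filtering, so this is illustration only.

```python
def closed_patterns(freq):
    """Keep patterns with no proper super-pattern of equal support.
    `freq` maps pattern (a tuple of items) -> support."""
    def is_subseq(sub, seq):
        it = iter(seq)
        return all(x in it for x in sub)
    return {p: s for p, s in freq.items()
            if not any(is_subseq(p, q) and len(q) > len(p) and freq[q] == s
                       for q in freq)}

# Toy pattern set: <a> and <c> are absorbed by <a,c> (same support),
# <b> by <a,b,c>; only the two closed patterns survive.
freq = {('a',): 3, ('c',): 3, ('a', 'c'): 3, ('b',): 2, ('a', 'b', 'c'): 2}
print(closed_patterns(freq))
```

Because each absorbed pattern shares its super-pattern’s support, the full set can be reconstructed from the closed set, which is what makes the representation lossless.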
3.5. Maximal Constraints
For dense databases or databases with long sequences, closed sequences may still produce a large set of patterns. Maximal patterns were therefore introduced to address this issue further. Maximal patterns are sequential patterns that do not appear within any other sequential pattern. By definition, a pattern X is a maximal frequent pattern in dataset D if X is frequent in D and there exists no frequent sequence Y ⊃ X, i.e., no frequent super-sequence of X. The set of maximal sequential patterns is always smaller than the set of closed sequential patterns, which in turn is smaller than the set of all sequential patterns, i.e., maximal ⊆ closed ⊆ all. Unlike closed patterns, maximal patterns are not lossless. Different strategies are likewise employed for finding maximal sequential patterns or itemsets: for example, AprioriAdjust and FMMSP [40] use breadth-first search, MaxSP [41] and DIMASP [42] use pattern-growth algorithms, and VMSP [43] uses a depth-first algorithm with a vertical database. MaxSP adopted PrefixSpan’s projected database mechanism. To tackle the memory inefficiency of previous algorithms, MaxSP determines whether a frequent pattern is maximal without retaining previously found patterns in memory; it incorporates a checking mechanism that verifies maximal backward extensions and maximal forward extensions [41]. With these, MaxSP can output results directly. A more efficient method, VMSP, uses vertical databases instead of the costly database projection in MaxSP. VMSP relies on three techniques to achieve its efficiency: exclusion of non-maximal patterns, forward-maximal validation, and co-occurrence-map candidate pruning [43].
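To contrast the maximal condition with the closed one, this sketch post-filters the same toy pattern set; note that supports play no role in the test, which is precisely why the maximal representation is lossy. Real maximal miners (MaxSP, VMSP) check maximality during the search rather than filtering afterwards.

```python
def maximal_patterns(freq):
    """Keep frequent patterns that are not a subsequence of any other
    frequent pattern. Supports are ignored, so the result is lossy."""
    def is_subseq(sub, seq):
        it = iter(seq)
        return all(x in it for x in sub)
    pats = list(freq)
    return [p for p in pats
            if not any(p != q and is_subseq(p, q) for q in pats)]

# Same toy set as before: everything is a subsequence of <a,b,c>,
# so only <a,b,c> is maximal (versus two closed patterns).
freq = {('a',): 3, ('c',): 3, ('a', 'c'): 3, ('b',): 2, ('a', 'b', 'c'): 2}
print(maximal_patterns(freq))
```

On this example the maximal set (one pattern) is strictly smaller than the closed set (two patterns), illustrating maximal ⊆ closed ⊆ all.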
3.6. Contiguous Constraint
The contiguous constraint requires that items in resultant patterns also be adjacent to each other in the original sequence. Given two sequences Y = <y1, y2, …, yn> and X = <x1, x2, …, xm>, X is a contiguous subsequence of Y, denoted X ⊑ Y, if and only if there exists an integer j, 1 ≤ j ≤ n − m + 1, such that x1 = yj, x2 = yj+1, …, xm = yj+m−1. This also implies that Y is a super-sequence of X. GSP was one of the first algorithms to incorporate the idea of gap constraints and contiguity in sequential patterns [29]. The constraint is well-suited to a particular trajectory data mining goal, as it ensures that the resulting trajectory patterns always follow the actual trajectory. Without this constraint, elements in resultant sequential patterns, which represent locations in a trajectory, can jump from one position to another. Some algorithms, such as VMSP and CM-SPAM, inherently allow specifying the number of gaps between itemsets; setting no gaps allowed yields a contiguous constraint.
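The definition translates directly into a containment check; the grid-cell ids in the example are hypothetical placeholders for trajectory locations.

```python
def is_contiguous_subsequence(x, y):
    """True if x occurs in y as a block of consecutive items,
    i.e., x == y[j : j + len(x)] for some j."""
    n = len(x)
    return any(list(x) == list(y[j:j + n]) for j in range(len(y) - n + 1))

route = ['g1', 'g2', 'g3', 'g4']        # a trajectory as grid-cell ids
print(is_contiguous_subsequence(['g2', 'g3'], route))  # True: adjacent cells
print(is_contiguous_subsequence(['g1', 'g3'], route))  # False: skips g2
```

The second call shows exactly the "jumping" pattern the constraint rules out: g1 and g3 occur in order, but not on consecutive positions of the route.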
5. Experimental Results and Analysis
This section presents the results and analysis of comprehensive experiments. This paper evaluates each algorithm on the following properties: runtime, memory usage, and the number of patterns generated. Each algorithm’s runtime was measured from the time the algorithm started to the time it returned a result. Memory usage was obtained with the Java class Runtime. Each algorithm was repeated with the same parameters five times, and the results were averaged to obtain the final runtime and memory consumption. Since the number of patterns generated depends on the parameters, each algorithm was run with minsup values of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7. VMSP and CM-SPAM can be run with and without gaps allowed between candidate sequences; when no gap is allowed, the resultant patterns are forced to be consecutive, i.e., a contiguous constraint. Experiments were conducted with bash scripts in a Linux environment, with breaks between runs and sufficient JVM warm-up beforehand.
The data for runtime (s), RAM consumption (MB), and the number of output patterns are shown in Table 3, Table 4 and Table 5, respectively. All algorithms give no pattern output at a minsup of 0.7, suggesting that no grid cell has 70% of all trajectories passing through it. Among all algorithms, MaxSP failed when minsup was 0.2 or less. Only CM-SPADE and the algorithms with contiguous constraints enforced could mine at a minsup of 0.1. The considerable number of patterns mined by CM-SPADE indicated that non-constrained algorithms might struggle with time, memory, or storage space, given the massive number of outputs.
Algorithms that failed experienced Java heap space problems, extremely long runtimes, or exceptionally high RAM usage when minsup was too low, i.e., below 0.2. In these scenarios, algorithms either threw an execution exception, were killed when runtime exceeded 3 h, or were killed when output occupied more than 300 GB of storage space. These algorithms are marked “Failed” in the result tables.
The number of outputs presented a challenge for processing and interpretation. Theoretically, maximal and closed constraints should return a concise representation of the output, yet these algorithms returned a similar number of outputs on the T-drive dataset. This observation revealed that the characteristics of the dataset determine the effect of these constraints; trajectory data might be too distinct to be compressed compared to other forms of data.
Among the three algorithms without constraints (GSP, PrefixSpan, and CM-SPADE), GSP showed a sharp increase in runtime as minsup decreased to 0.1, while PrefixSpan and CM-SPADE remained relatively stable. CM-SPADE consumed more RAM but could mine sequences at a minsup of 0.1, where PrefixSpan failed. PrefixSpan utilizes main-memory-based pseudo-projection, which reduced RAM consumption significantly in the experiments; it trades execution time for lower RAM consumption and might have finished execution given more time. CM-SPADE, on the other hand, had a shorter, more stable execution time at the expense of higher RAM consumption. The choice between PrefixSpan and CM-SPADE therefore depends on the computational resources available.
As closed-constraint algorithms, CM-ClaSP and CloFast were similar in terms of RAM usage, but CloFast was significantly faster. The number of closed sequential patterns discovered differed slightly between the two algorithms, which is attributed to the different strategies employed. Strangely, the number of closed patterns found by CloFast was greater than the number found by algorithms without constraints. The experimental figures further suggest that closed patterns did not work well with trajectory data.
Conversely, the maximal SPM algorithms MaxSP and VMSP adopt very different candidate generation and pruning approaches. MaxSP can generate output directly without in-memory candidate maintenance, whereas VMSP opts for the common candidate-maintain-and-test approach. The two approaches generated notably different numbers of patterns. VMSP mined more maximal sequential patterns with less RAM consumption and a shorter execution time, while MaxSP had the highest RAM consumption and the longest execution time among all algorithms. VMSP is therefore more efficient considering the size and sequence length of typical trajectory databases.
Contiguous constraints were tested with VMSP and CM-SPAM, with their respective gap constraints set to no gaps allowed. Their runtimes differed by a few seconds, while VMSP consumed less RAM than CM-SPAM.
However, judging an algorithm’s efficiency simply by runtime or the number of pattern outputs would be unfair. Figure 3 depicts algorithm efficiency as runtime per pattern. Compared to algorithms that found a similar number of patterns (i.e., VMSP (no gap) and CM-SPAM (no gap)), GSP performed better at most minsup values, particularly at 0.6, 0.3, and 0.2. As a closed-constraint algorithm, CM-ClaSP was the second most efficient algorithm after GSP at a minsup of 0.6, while MaxSP took second place at a lower minsup of 0.3. CM-ClaSP was the best-performing among algorithms without constraints. However, MaxSP and GSP failed to complete the task as minsup was further reduced. Therefore, at low minsup, CM-SPAM (no gap) became the most efficient algorithm.
Figure 4 and Figure 5 visually compare the SPM results at minsup of 0.2 and 0.5, with and without contiguous constraints, plotted against the actual roads of Beijing. Results from CM-SPADE and CM-SPAM were used, respectively. At a minsup of 0.2, the outputs of the three algorithm groups (non-constraint-based, closed-constraint, and maximal-constraint) fall in the range of 7628 to 8068, except for GSP, which produces a very concise result similar to that of contiguous-constraint algorithms. Contiguous constraints give an exceptionally low output count, which makes the comparison meaningful. Higher support denotes that more vehicle trajectories traveled on the road segment; a red-colored road segment suggests a large volume of vehicles. CM-SPADE was selected for visualization, representing the results of the algorithms producing outputs within the 7628 to 8068 range. Results were distributed within the 3rd Ring Road, a city ring road that encircles the center of Beijing. The solid red box outlined in blue in Figure 5b is one of the sequential patterns mined; the road segment in the blue box is the junction between the West Second Ring, notorious for traffic congestion in Beijing, and Chegongzhuang Road.
Figure 6 exemplifies the SPM result by plotting some trajectories that support the boxed-out pattern. The same road segments are found in all these trajectories. Trajectory pattern mining seeks repeated patterns, such as frequently traveled road segments; more overlapping trajectories give higher support.
The high-support regions colored red and captured by SPM with and without constraints were highly similar at a minsup of 0.5. At a minsup of 0.2, SPM with constraints captured more high-support road sections, as contiguous constraints gave a more concise representation of the results. The resultant patterns discovered by SPM without contiguous constraints tend to be duplicated; when the contiguous constraint was enforced, a single pattern could represent multiple patterns.
6. Conclusions
The continuous and uncertain nature of trajectory data makes it distinctively different from typical transactional data, requiring additional trajectory segmentation procedures before SPM can be applied for various purposes. This work outlined a method of dealing with trajectory data of such a nature. Subsequently, several representative algorithms, including GSP, PrefixSpan, CM-SPADE, CM-SPAM, CM-ClaSP, CloFast, VMSP, and MaxSP, were chosen for a sequential pattern mining comparative experiment using a large trajectory dataset, the Microsoft T-drive trajectory data. Algorithms were selected based on their search heuristics, database implementations, and constraints. Comparative experiments were performed to understand how different implementations and parameters affect SPM results; the runtime, RAM consumption, and number of pattern outputs were collected. The results were further visualized on the actual map, using traffic congestion as an analytic example. The experimental results demonstrate that contiguous constraints are more relevant in the context of trajectory data than closed and maximal constraints, as contiguous constraints provide a concise representation similar to constraint-less SPM and allow algorithms to perform well at extremely low minsup where other algorithms fail. Among the non-constraint-based algorithms, CM-SPADE shows decent performance at low minsup and stable control of runtime and memory consumption across varied parameters. PrefixSpan is a suitable choice when RAM is limited but storage space is plentiful. This work performed sequential pattern mining on taxi trajectory data and visualized the SPM results on the actual roads of Beijing, China. The results highlight road segments frequently traveled by taxis, which implies congested regions. Results from SPM can be further extended to other forms of analysis, such as next-location prediction.
Employing trajectory map-matching techniques can further improve the accuracy of pattern mining. This study can serve as a guide for academics and professionals when determining the most suitable SPM algorithm for applications involving trajectory data. Future experiments can investigate pattern quality and dive deeper into the reasons for inconsistent pattern output between algorithms using similar constraints or parameter settings. A comparative study of other specialized forms of pattern mining, such as time-constrained and top-k mining, can also be considered.