1. Introduction
Two strings of the same length are
order-isomorphic when they have the same relative orders. For example, strings
and
are order-isomorphic because the relative order of individual characters in both strings is the same as
. Given a text
T of length
n and a pattern
P of length
m, the order-preserving pattern matching (OPPM) problem is finding all substrings of
T which are order-isomorphic to
P. For example, given a text
and
,
P occurs at the fourth position of
T because the underlined substring
of
T is order-isomorphic to
P (see
Figure 1). The OPPM problem can be solved in
time [
1,
2,
3] based on the KMP algorithm [
4], where
is time for sorting
P. Practical algorithms for the OPPM problem using the Horspool approach [
5] and a filtration method have been studied [
6,
7,
8,
9]. Additionally, various related studies have been conducted, including research on order-preserving string regularities [
10,
11], OPPM with multiple patterns [
1,
12], parallelism in OPPM [
12,
13], order-preserving suffix trees [
14], and OPPM in a tree and directed acyclic graphs [
15].
Since, in OPPM, the relative order of characters is more important than their absolute values, it can be effectively applied to analyze various types of time-series data, such as stock prices, energy consumption, annual temperature variance, and music melodies [
7,
16,
17,
18]. For example, when there are two stocks in the same industry but with different face values, the actual price fluctuations of the two stocks over the same period may differ, but the pattern of change can be similar. In such a case, using order-isomorphism allows for detecting the same or similar patterns. However, it is not always easy to apply OPPM algorithms directly due to various factors such as differences in the measurement intervals of time-series data, noise occurring in the measurement process, errors in the data transformation process, etc. Additionally, even if this error occurs at just one position, it can affect the overall relative order, which makes it difficult to detect in the reference data. For example, if the measurements of a certain time-series data are
, the relative order of each measurement is
. If the second value (character), 30, is changed to 24 due to a measurement or recording error, the relative order of each measurement changes significantly to
, making it difficult to detect in the reference data or possibly leading to the detection of an incorrect position. Therefore, to perform a more meaningful pattern search, prediction, and analysis of time-series data, it is necessary to consider diverse types of errors.
Recently, approximate OPPM problems that allow errors in matching have been studied. Gawrychowski and Uznański [
17] studied the OPPM problem with
k-mismatches. Their algorithm finds occurrences of a pattern within a text where the relative order of elements is preserved, allowing up to
k mismatches, with a time complexity of
. Russo et al. [
18] studied OPPM problems on indeterminate strings, which considers the relative order of elements, even when some elements are not precisely determined. They presented an
-time algorithm, where r is a bound on the number of uncertain characters per position. Kim et al. [
19] defined a scaling approximation and presented algorithms for the OPPM problem with scaling, which considers the relative orders of cusps and the scale of lengths of strings between them. They presented an
-time algorithm to solve the OPPM problem with scaling, which is the same time bound as the best known exact OPPM algorithm.
In some cases in OPPM, it might be more effective to compare two strings part by part rather than as whole strings. Consider two strings
and
for example. The relative orders of these two strings are the same with
. Assume that the second character of
y is changed to 24 due to some error, as in the previous example. That is, let us assume
y is changed to
. Then the relative orders of
x and
become quite different, with
and
, respectively. Despite their overall differences in order, however, when
x and
are divided at the third position, the orders of the two divided parts of
x and
are the same with
and
, respectively, i.e., the two respective parts match each other. See
Figure 2.
The above approach can be applied to plagiarism detection in music melodies. OPPM can be used to detect parts of a full melody that exactly match a specific short melody to determine plagiarism. If one of the notes in the detected section is modified to a similar but different note, it may sound similar when played but might not be detected by OPPM. However, if the section is divided into two parts and OPPM is performed separately, the likelihood of detection increases. It can also be applied to stock price prediction. If the price movement of a particular stock over a certain period resembles the movement patterns of other stocks in the past, future changes can be predicted to some extent. However, as these patterns become longer, it becomes more challenging to find exact matches using OPPM. By dividing the pattern into two parts and performing OPPM separately, the chances of finding matching sections increase.
Our contributions are as follows:
We introduce a novel approximation criterion for the order-isomorphism, called the partitioned order-isomorphism for the first time. In OPPM, unlike classical pattern matching, a change in just one character can significantly alter the overall ranking of the entire string. As a result, parts that could not be found when the two strings were compared directly may be detected when the strings are divided and compared in two parts.
We present an efficient algorithm for the problem using the
Z-function [
10,
20]. Even with the consideration of the approximation criterion, our algorithm achieves a time complexity of
, which is equivalent to the best-known exact OPPM algorithm.
We demonstrate the effectiveness of our algorithm through comparative experiments using three types of real time-series data. The experiments are conducted by varying the lengths of patterns for each data set, which demonstrates that our algorithm is more effective than the OPPM algorithm.
This paper is organized as follows. In
Section 2, we give some preliminary works. In
Section 3, we first define the partitioned order-isomorphism and the OPPM problem with partition formally. And then we give an efficient algorithm for the OPPM problem with partition. In
Section 4, we present experimental comparisons between the algorithm introduced in [
1] and the algorithm proposed in this study. In
Section 5, we conclude this research.
2. Related Works
Let
denote the set of characters where two characters can be compared in constant time. A string
S is a sequence of characters drawn from
. We denote by
the length of
S and by
the
ith character of
S. The substring
is denoted by
, and
is an empty string when
. Substrings
and
are called a prefix and a suffix of
S, respectively. The concatenation of two strings
X and
Y is denoted by
. For convenience, we assume that the characters in a string are all distinct as in [
1].
The relative order (or rank) of a character
in a string
S is defined as the number of characters in
S that are smaller than or equal to
, including itself. A natural way to represent the order relations of the characters in
S is to enumerate their relative orders in
S, which is called the
natural representation of the order relations. For example, the natural representation of the string
is
. Although the natural representation is intuitive and easy to understand, it is not suitable for pattern matching [
1]. The
prefix representation is an alternative way to represent order relations, where the relative order of a character
is considered within
rather than in the entire string
S. For example, the prefix representation of the string
is
. It has been shown that there is a one-to-one mapping between the natural representation and the prefix representation [
1].
We formally define the order-isomorphism of two strings. Two strings X and Y of the same length are order-isomorphic, denoted by , when for all . Obviously, the following lemma and corollary hold by the definition of the order-isomorphism.
Lemma 1. For two strings X and Y of the same length, if X and Y are order-isomorphic, then substrings and (for all ) are also order-isomorphic.
Corollary 1. For two strings X and Y of the same length, if there exist substrings and () that are not order-isomorphic, then X and Y are not order-isomorphic.
The order-isomorphism can be efficiently determined using the nearest neighbor representation [
1,
2,
3,
6], which is represented by the location tables
and
.
That is,
and
are the nearest neighbors of
in the sorted sequence of characters in
.
Table 1 shows
and
for
. The location tables can be computed in
time using a sorting algorithm or an order-statistic tree. Using the location tables for
X, we can determine the order-isomorphism of
X and
Y in
time by checking the following inequality for every position
:
(if or is equal to , we assume the respective inequality is true). Moreover, we can find the longest prefix of Y that is order-isomorphic to a prefix of X in time by checking the above inequality in increasing order of i until we reach a position where the inequality is false (Algorithm 1). We call the longest order-isomorphic prefix (loip) of X and Y, which is called the longest common prefix (lcp) in classical matching.
For a string
S, the
Z-function
represents the length of the longest prefix of
which is order-isomorphic to a prefix of
S, i.e., the loip of
and
[
10]. It is a modified version for order-isomorphism of the
Z-function introduced in [
20] to solve classical string matching problems. With the nearest neighbor representation of
S,
can be computed in
time.
Algorithm 1 Computing loip |
Input: two strings and , and an integer such that .
Output: length of the loip of X and Y.
1: procedure CompLoip(X, Y, k)
2: for to m do ▹ Assume and are given.
3: if and then 4: return 5: end if 6: if and then 4: return 8: end if 9: end for 10: return m 11: end procedure |
3. Order-Preserving Pattern Matching (OPPM) with Partition
In this section, we define the OPPM problem with partition and present an efficient algorithm for solving the problem.
3.1. Problem Definition
We define the partitioned order-isomorphism. Let us consider two strings and , which are not order-isomorphic. When we partition X (resp. Y) into two substrings and (resp. and ), the partitioned substrings of X are order-isomorphic to those of Y, i.e., and . Then, we say that X is partitioned order-isomorphic to Y. That is, the partitioned order-isomorphism is formally defined as follows.
Definition 1. For two strings X and Y of length m and a position t (), if and , then X and Y are partitioned order-isomorphic at t, denoted by . If there exists any position t such that , then X and Y are partitioned order-isomorphic.
In the example above, since and . However, X is not partitioned order-isomorphic to because there exists no t such that . We call position t a partition position. (Although we consider the cases when a string is partitioned into two substrings, Definition 1 can be extended to the cases of partitioning into more than two substrings.)
The following lemma shows that partition positions are consecutive.
Lemma 2. For two strings X and Y of length m, let be the largest l such that . Similarly, let be the smallest r such that . Then, if and only if .
Proof. Note that and always exist since two strings of length 1 are order-isomorphic. (⇒) We show by contradiction that if , then . Suppose that for some t such that or . Consider the case when . Since , , which contradicts the definition of . Similarly, a contradiction occurs when .
(⇐) We show that if . Since and , by Lemma 1. Similarly, . Therefore, . □
Corollary 2. If , then X and Y are not partitioned order-isomorphic.
The OPPM problem with partition is to find all substrings of a text T that are partitioned order-isomorphic to a pattern P, formally defined as follows.
Problem 1. The order-preserving pattern matching (OPPM) problem with partition.
Input: A text T of length n and a pattern P of length m .
Output: Every pair where i is a position in T and is a position range in P such that if and only if .
For example, given
and
, the output is
and
. See
Table 2, which shows
and
for
P and
.
3.2. An Algorithm for the OPPM Problem with Partition
Let denote the substring of T of length m starting at position i, i.e., (). Let and be the lengths of the longest prefix and the longest suffix of that are order-isomorphic to a prefix and a suffix of P, respectively. When we are given ’s and ’s for all , we can solve Problem 1 in time using Lemma 2 and Corollary 2 (note that and for and P). See Algorithm 2.
Now we focus on the problem of computing all
’s. (We omit the details of computing
’s since it can be computed using the same way by considering reversed strings of
T and
P.) Note that
is the length of the loip of
and
P. So we can compute each
in
time using the nearest neighbor representation of
P (Algorithm 1), resulting in a total time complexity of
, where
represents the time required for computing the nearest neighbor representation
and
of
P. More efficiently, we can compute all
’s in
time by concatenating
P and
T (say
) and computing
.
Algorithm 2 OPPM with partition |
Input: a text and a pattern Output: all pairs representing occurrences of Problem 1.
1: procedure OPPMwPartition(T, P)
2: compute and for using Algorithm 3
3: for to do 4: if then 5: report as an occurrence
6: end if 7: end for 8: end procedure |
However, we can improve the time complexity to
time by computing
and all
’s separately. We first preprocess
P for computing
,
, and
. Then, we compute
in increasing order of
i using
and
, which is similar to the
Z-algorithm in [
20]. For any position
k, the
L-box starting at
k is defined as the range
of length
. (Note that
because the first characters of two strings are trivially order-isomorphic.) For each
i (
), let
be the
L-box whose end position is the rightmost among the
L-boxes whose start position is less than or equal to
i. (If more than one
L-box has the rightmost end position, we may choose any
L-box.)
Algorithm 3 Computing ’s |
Input: a text and a pattern Output: all pairs representing occurrences of Problem 1.
1: procedure ComputeL(T, P)
2: Compute , , and for P ▹ Preprocessing P 3: ▹ Initialize L-box 4: for to +1 do 5: if then ▹ Case 1
6: 7: ▹ Set as the new L-box
8: else ▹ Case 2
9: 10: if then ▹ Case 2(a)
11: 12: ▹ Set as the new L-box
13: else ▹ Case 2(b) and 2(c)
14: 15: ▹ remains unchanged, i.e., 16: end if 17: end if 18: end for 19: end procedure |
Algorithm 3 shows the pseudo-code of our algorithm for computing ’s. At each iteration i (), we compute and using . Initially, we set . The details of each iteration i are as follows:
If , then compute by explicitly checking the inequality (1) for () and P character by character, starting from the first position and continuing until either the inequality becomes false or the check reaches the last position. Since , we set to , i.e., the new L-box starting at position i.
If , then is contained in the L-box . Let and . Then, where , i.e., it is guaranteed that is at least , which is proved in Lemma 3.
- (a)
If , then can be larger than . Thus, we compute by explicitly checking the inequality (1) for and P character by character, starting from the position . Furthermore, we set as , i.e., the new L-box starting at position i.
- (b)
If , then and .
- (c)
If , then and .
We illustrate how our algorithm computes
’s for
and
. We first preprocess
P (see
Table 1 for
,
, and
). Initially,
.
Iteration (Case 1). We obtain by explicitly checking the inequality (1) for and P. Note that the inequality is false at position 2 since and , i.e., . Thus, the L-box starting at position 1 is , and it is set as since .
Iteration (Case 1). We compute explicitly and set . Note that but .
Iteration (Case 2(b)). We have , , and . Then, , but , i.e., , which we can know without checking the inequality (1) as proven in Lemma 3. The new L-box starting at position 3 is and it is included in . Hence .
Iteration (Case 2(a)). We have and . Since , we obtain by explicitly checking the inequality (1) starting from position 2. Note that , but . Moreover, we set .
Iteration (Case 2(a)). We have and . Similarly, we obtain and .
Iteration (Case 2(a)). We have and . Similarly, we obtain and .
Iteration (Case 2(b)). We have and . Similarly to iteration 3, we obtain and .
Iteration (Case 2(b)). We have and . Similarly to iteration 7, we obtain and .
Lemma 3. Algorithm 3 correctly computes and for all .
Proof. It can be proved by induction. Initially (when ), since (Case 1), we compute explicitly and set as , which is the only L-box whose start position is less than or equal to 1.
Now we consider an iteration where . In Case 1, we obtain correct since we compute it by explicitly checking inequality (1). Since (note that the first characters of two strings are trivially order-isomorphic), the end position of the new L-box starting at i is greater than or equal to i. Additionally, since (by the condition of Case 1), . Hence, is the L-box whose end position is the rightmost among the L-boxes whose start position is less than or equal to i. Therefore, we compute correctly.
In Case 2,
is contained in substring
such that
, which means
by Lemma 1. Let
,
, and
(see
Figure 3). Then,
and
(by the definition of
z). Thus,
where
, i.e.,
.
In Case 2(a), we obtain correct since we compute it by explicitly checking inequality (1) starting from position . Note that we do not need to check the inequality for the position k since . Additionally, since , it follows that . Therefore, should be .
In Case 2(b),
is contained in
and
(see
Figure 3(i)). Since
by the definition of
, it follows that
and thus
. Moreover, since the end position
of the new
L-box starting at position
i is to the left of
,
is the same as
.
In Case 2(c),
is contained in
and
(see
Figure 3(ii)). Since
by the definition of
, it follows that
and thus
. Additionally, the end position
of the new
L-box starting at position
i is equal to
. Hence,
can be either
or
(where
is chosen in our algorithm).
Therefore, Algorithm 3 computes and correctly in all cases. □
Lemma 4. Given , , and , we can compute all ’s () in time.
Proof. Each iteration takes constant time except for checking the inequality (1) in procedure CompLoip. The check ends when the inequality becomes false or the check reaches the last position. Let be the number of times when the inequality is true during iteration i. If , increases by (and never decreases) from (Cases 1 and 2(a)). Since , the total sum of ’s for all iterations is at most n. Therefore, given , , and , all ’s can be computed in time. □
Theorem 1. Given a text T of length n and a pattern P of length m, the OPPM problem with partition can be solved in time. If the characters of P can be sorted in linear time, the problem can be solved in time.
Proof. We can compute
and
in
time. (If we can sort the characters of
P in linear time, for example, such as when the characters are drawn from an integer alphabet, we can compute
and
in
time [
2]). Then,
and
’s can be computed in
time and
time, respectively. Moreover,
’s can be computed similarly using the reversed strings of
P and
T. For each position
i in Algorithm 2, given
and
, we can determine whether
is partitioned isomorphic to
P and compute the range of partition positions in constant time. Therefore, we obtain the theorem. □
4. Experimental Results
The experimental environment was as follows: the operating system was Microsoft Windows 10 Education (64-bit), the CPU was an AMD Ryzen 9 5950X 16-Core, with 64 GB of RAM. The development tool was gcc 13.1.0, and the development language was C++. Three types of time-series data were used in the experiment: a power consumption index, particulate matter (PM2.5) levels, and the Dow Jones Index. The power consumption index consisted of data on the average household voltage per minute in Sceaux, France, from 00:00 on 16 December 2006, to 22:00 on 2 December 2008 [
21]. The PM2.5 levels were recorded in Beijing at one-hour intervals from 00:00 on 2 January 2010, to 22:00 on 9 October 2014 [
22]. The Dow Jones Index data included the daily closing prices of the Dow Jones Industrial Average from 2 May 1885, to 12 April 2019 [
23]. For brevity, the power consumption index data are referred to as PCI, the particulate matter level data as PML, and the Dow Jones Index as DJI. We generated the experimental data as follows. First, the text was created by multiplying the real-valued time-series data by appropriate values to convert all values into integers. Since using a random string as the pattern could result in significantly fewer occurrences, a fixed-length substring starting from a random position in the converted text was designated as the pattern.
We compare the number of occurrences detected by our OPPM with partition algorithm (referred to as OPPM_PART) with those detected by the exact OPPM algorithm proposed by Kim et al. [
1] (referred to as OPPM_EXACT). The lengths
n of text
T for PCI, PML, and DJI were 2,049,280, 41,757, and 38,104, respectively. Pattern
P was generated by randomly extracting strings of lengths
, and 14 from
T. The reason for selecting the pattern lengths in this way is that, due to the nature of OPPM, the longer the pattern length, the fewer matching parts occur. In fact, when the pattern length exceeded 15, there were almost no occurrences.
Table 3 compares the number of occurrences for each algorithm, which are the averages obtained from executing the algorithm on 100 different patterns. As shown in the table, our algorithm, OPPM_PART, detects approximately
(for
) to
(for
) times more occurrences than OPPM_EXACT with PCI. With PML, OPPM_PART detects approximately
(for
) to
(for
) times more occurrences than OPPM_EXACT. With DJI, OPPM_PART detects approximately
(for
) to
(for
) times more occurrences than OPPM_EXACT. In all cases we experimented with, OPPM_PART detected significantly more meaningful parts than OPPM_EXACT.
Table 4 compares the execution times for each algorithm, which are the averages obtained from running the algorithm on 100 different patterns. As shown in
Table 4, OPPM_PART runs approximately
(for
) to
(for
) times slower than OPPM_EXACT with PCI. With PML, OPPM_PART runs approximately
(for
) to
(for
) times slower than OPPM_EXACT. With DJI, OPPM_PART runs approximately
(for
) to
(for
) times slower than OPPM_EXACT. In all the cases we experimented with, OPPM_PART was approximately twice as slow as OPPM_EXACT. Although both algorithms, OPPM_EXACT and OPPM_PART, have the same time complexity of
, OPPM_PART was slower due to the additional
time required for calculating
and
, as confirmed by additional experiments (not presented in a table). However, considering that the actual execution time is measured in microseconds, this increase in runtime is not significant (see
Figure 4).
5. Conclusions
We have introduced the concept of partitioned order-isomorphism as a novel criterion for approximate order-isomorphism. This approximation criterion is significant not only because it can be utilized in analyzing time-series data where errors may occur, but also because it provides a criterion for approximation that is challenging to address in classical pattern matching based solely on character value comparison. Using this criterion, we have defined the OPPM problem with partition and presented an efficient algorithm for solving it. One advantage of our criterion is that it allows for approximate matching without increasing the time complexity compared to the exact OPPM.
Our experimental results show that we can find more occurrences using our approximation criterion. As shown in the experimental results, our algorithm detected significantly more meaningful parts in all cases we tested. Therefore, we expect that our criterion will also enable more meaningful pattern search, prediction, and analysis in real-world applications. Moreover, as the pattern length increases, the difference of the number of occurrences between OPPM_EXACT and OPPM_PART decreases. This suggests that partitioning into more than two parts may be necessary for longer patterns.
Although our current focus is on partitioning a string into two parts, our definition can be extended to cases where the string is partitioned into three or more parts. However, developing an efficient algorithm for partitioning into even three parts is not straightforward. In the two-part case, we solved the problem efficiently by preprocessing T (i.e., computing ’s and ’s), since the start position of the left part and the end position of the right part are fixed. In contrast, when partitioning into three parts, neither the start nor the end position of the middle part is fixed, which can increase the complexity. We plan to explore this in future research.