Next Article in Journal
Adaptive Self-Scaling Brain-Storm Optimization via a Chaotic Search Mechanism
Previous Article in Journal
Detect Overlapping Community Based on the Combination of Local Expansion and Label Propagation
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Efficient Construction of the Equation Automaton

1
Department of Computer Science, Faculty of Sciences, Mohammed V University in Rabat, Rabat 10000, Morocco
2
EA 4491-LISIC-Lab., Laboratoire d’Informatique Signal et Image de la Côte d’Opale Université Littoral Côte d’Opale, 62100 Calais, France
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Algorithms 2021, 14(8), 238; https://doi.org/10.3390/a14080238
Submission received: 27 May 2021 / Revised: 27 July 2021 / Accepted: 6 August 2021 / Published: 11 August 2021

Abstract

:
This paper describes a fast algorithm for constructing directly the equation automaton from the well-known Thompson automaton associated with a regular expression. Allauzen and Mohri have presented a unified construction of small automata and gave a construction of the equation automaton with time and space complexity in O ( m log m + m 2 ) , where m denotes the number of Thompson automaton transitions. It is based on two classical automata operations, namely epsilon-removal and Hopcroft’s algorithm for deterministic Finite Automata (DFA) minimization. Using the notion of c-continuation, Ziadi et al. presented a fast computation of the equation automaton in O ( m 2 ) time complexity. In this paper, we design an output-sensitive algorithm combining advantages of the previous algorithms and show that its computational complexity can be reduced to O ( m × | Q e | ) , where | Q e | denotes the number of states of the equation automaton, by an epsilon-removal and Bubenzer minimization algorithm of an Acyclic Deterministic Finite Automata (ADFA).

1. Introduction

The equation automaton (also known as derived terms automaton or Antimirov automaton) was first introduced in Mirkin’s paper [1]. In [2], Antimirov introduced the notion of partial derivative of a regular expression, that lead to another definition and construction of the equation automaton. It is an ϵ -free NFA which has in general smaller number of states and transitions than the well-known position automaton [3,4,5]. The complexity of the original construction algorithm of [2], which is based on the computation of the set of partial derivatives of the expression, is in O ( n 5 ) , where n denotes the size of the regular expression. In 2001, Champarnaud and Ziadi [6] introduced the notion of canonical derivatives and constructed a new automaton called the c-continuation automaton. They also proved that this automaton is isomorphic to the position automaton and that the equation automaton is its quotient for some equivalence relation.
The notion of c-derivative has been introduced in [6] to derive the equation automaton from the position automaton via the c-continuation automaton. A unique regular expression over indexed and ordered letters, called c-continuation, is assigned to each state of the position automaton. The resulting automaton is called the c-continuation automaton [6]. After that, one can define the equivalence relation between two c-continuations i.e., two states of the c-continuation automaton as follows: if deleting the indices of letters from two c-continuations results in the same regular expression, they correspond to the same partial derivative. Hence, the equation automaton would be a quotient of the c-continuation automaton w.r.t. the previously defined equivalence relation. From the algorithmic point of view, this result allows the construction of the equation automaton in O ( n 2 ) time and space [6,7]. Therefore, this improves the Antimirov’s algorithm by a factor of O ( n 3 ) .
In [8], Allauzen and Mohri present simple and unified constructions of the position automata [3,4,5], follow automata [9,10], and the equation automata [2,6,7] from regular expressions. Their algorithms are based on two standard automata operations applied to the Thompson automata [11,12] called: epsilon-remove and Hopcroft’s algorithm for DFA minimization [13]. The complexity of their construction for the equation automaton is in O ( m log m + m 2 ) , where m is the number of Thompson automaton transitions. Notice that, by construction the number of transitions of the Thompson automaton m and the size of a regular expression n are proportional. Thus, we have m = O ( n ) .
To improve the time complexity of computing the equation automaton from a regular expression E, we design an algorithm, combining advantages of previous methods [6,7,8], with a worst-case time complexity in O ( m × | Q e | ) , where | Q e | denotes the number of its states. Our approach is based on Bubenzer minimization of an acyclic DFA instead Hopcroft’s algorithm for DFA minimization step used in Allauzen and Mohri’s method. The main idea is to associate implicitly each c-continuation to a corresponding state, called position state in the Thompson automaton by a special marking of ε -transitions. As a consequence of this marking, the right language of each position state in the Thompson automaton represents implicitly its c-continuation, called pseudo-continuation. After that, we disable temporarily the cyclic ε -transition in the Thompson automaton and perform Bubenzer minimization of an acyclic DFA to compute efficiently partial derivatives equivalence relation over the set of position states. Finally, we remove indexed ε -transitions, enable the cyclic ε -transition and then compute the ε -closure of states in the produced automaton from the previous step to get the equation automaton. The implementation of the proposed algorithm is available under the repository https://github.com/FaissalOuardi/Equation-automaton, (accessed on 27 May 2021).
The paper is organized as follows. Section 2 contains some basic definitions and necessary preliminaries. Section 3 summarizes theoretical results that lead to c-continuations of a regular expression, and their relations with the partial derivatives. The definition of the c-continuation automaton is recalled, as well as the way it is connected to the equation automaton. Section 4 is a recall to the algorithm due to Allauzen and Mohri. We detail then in Section 5 the algorithmic refinements leading to an O ( m × | Q e | ) time complexity of the efficient construction of the equation automaton where | Q e | is the number of its states.

2. Preliminaries

In this section, we introduce briefly the notion of finite automata. For further details on formal aspects of finite automata theory, we particularly recommend reading classical books [14,15].

2.1. Regular Expressions and Finite Automata

Let A be a non-empty finite set of letters, called an alphabet. The set of all words over A is denoted by A . ε is the empty word. A language over A is a subset of A .

2.1.1. Regular Expressions and Languages

A regular expression over the alphabet A is a term of the algebra T r e g ( A ) defined over the set A { 0 , 1 } with the symbols of functions , + , · , where ∗ is unary and + and · are binary. Properties of the constants 0 , 1 and the operators , + , and · lead to identities on this algebra. Each regular expression denotes a language. L is the function that assigns to each regular expression the regular language it denotes. L : T r e g ( A ) r e g ( A ) is defined as follows:
L ( 0 ) = L ( 1 ) = { ε } L ( a ) = a , for   each a in A L ( F + G ) = L ( F ) L ( G ) L ( F · G ) = L ( F ) L ( G ) L ( F ) = L ( F )
The following identities are classically used:
0 + E = E = E + 0 , 1 · E = E = E · 1 , 0 · E = 0 = E · 0 .
Let E be a regular expression. The set of letters occurring in E is denoted by A E . To specify their position in the expression, letters are subscripted following the order of reading. The resulted expression is the linearized form of E, denoted by E ¯ . For example, starting from E = ( a + b ) a b a + 1 , one obtains the linearized version E ¯ = ( a 1 + b 2 ) a 3 b 4 a 5 + 1 of E. The subscripted letters are called positions; the set of all position in the expression E is denoted by pos ( E ) . For the previous example, we have pos ( E ) = { a 1 , b 2 , a 3 , b 4 , a 5 } . If F is a subexpression of E, we denote by pos E ( F ) the subset of positions of E that are letters of F. We say that a regular expression is in linear form if each letter of the expression occurs only once. We denote by h the function that maps each position in pos ( E ) to the letter of A E that appears at this position in E. For E = ( a + b ) a b a + 1 , we have h ( a 1 ) = h ( a 3 ) = h ( a 5 ) = a and h ( b 2 ) = h ( b 4 ) = b . The size of the regular expression E, denoted by | E | , is the number of nodes in its syntax tree. We call alphabetic width of E, denoted by | | E | | , the number of occurrences of letters in the expression i.e., the cardinality of pos ( E ) . The alphabetic width of the expression ( a + b ) a b a + 1 is equal to 5; its size is equal to 12.
Notice that the alphabetic width and the size of a regular expression are independent parameters. Therefore complexities are expressed w.r.t. both of these two parameters. However, it is usual to preprocess the input expression in order to reduce its size and to make its size proportional to its alphabetic width. So, if we consider a reduced regular expression E w.r.t. the following rules:
  • 1 + 1 = 1 ,
  • 1 + E + 1 = 1 + E ,
  • E is in Star-Normal Form (SNF) [16].
Thus, we have in this case | E | = O ( | | E | | ) . It is known that regular expressions can be transformed to SNF in linear time [16]. λ ( E ) denote the null term of E, that is
λ ( E ) = 1 if   ε L ( E ) , 0 otherwise .
By T ( E ) we denote the syntax tree associated with the regular expression E. A node in T ( E ) will be denoted by ν . We write Nodes ( E ) for the set of nodes of T ( E ) . If ν Nodes ( E ) is a node in T ( E ) , sym ( ν ) , father ( ν ) and right ( ν ) denote respectively the symbol, the father and the right son of the node ν . If sym ( ν ) is an operator, E ν will denote the subexpression that corresponds to the subtree with the root ν .

2.1.2. Finite Automata and Recognizable Languages

A nondeterministic finite automaton (NFA) is a quintuple A = Q , A , q 0 , δ , F where Q is a finite set of states, A is the alphabet, q 0 Q is the initial state, F Q is the set of final states, and δ : Q × ( A { ε } ) 2 Q is the transition function. The size of an automaton A , denoted by | A | , is the number of its states. The automaton A is called deterministic (DFA) if there is only one initial state, | δ ( q , ε ) | = 0 and | δ ( q , a ) | = 1 , for any q Q , for any a A . A path in A is a sequence ( q i , a i , q i + 1 ) , i = 1 , , n , of consecutive transitions. Its label is the word w = a 1 a 2 a n . A word w = a 1 a 2 a n is recognized by the automaton A if there exists a path labeled w such that q 1 = q 0 and q n + 1 F .
The language recognized by the automaton A , denoted by L ( A ) , is the set of words it recognizes. The right language of a state q in the automaton A , denoted by L q ( A ) , is obtained by setting q to be the initial state, i.e., L q ( A ) = { w A | δ ( q , w ) F } .
We say that A is acyclic if the underlying graph is acyclic. The language associated with an acyclic automaton is finite.
Let ∼ be an equivalence relation over Q. For q Q , [ q ] denotes the equivalence class of q w.r.t. ∼ and, for C Q , C / denotes the quotient set C / = { [ q ] | q C } . We say that ∼ is right invariant w.r.t. A if and only if the following conditions hold:
  • ( Q F ) 2 F 2 (final and non-final states are not ∼-equivalent),
  • for any p , q Q a A , if p q , then δ ( p , a ) / = δ ( q , a ) / .

2.2. Thompson Automaton

In [12], Thompson gave a linear time and space algorithm to convert a regular expression E to an NFA with ε -transitions, denoted by T E . The recursive steps of the construction of Thompson NFA are pictured in Figure 1.
Example 1.
Let us consider the regular expression E = ( a + b a + b ) . The Thompson automaton T E associated with E is shown in the Figure 2.
There are some disadvantages of Thompson’s NFA when it is used in practice: it has many redundant states and ε -transitions, its number of states is in O ( | E | ) while other constructions offer NFAs with O ( | | E | | ) states.
In the next section, we will present the construction of a reduced ε -free automaton, named equation automaton, sometimes called Antimirov automaton or derived terms automaton.

3. Equation Automaton

The equation automaton has been introduced for the first time by Mirkin in [1]. In 1996, Antimirov introduced the notion of partial derivatives and used it to define the equation automaton [2]. Champarnaud and Ziadi [6] defined the notion of canonical derivatives of a linear expression and constructed a new automaton called the c-continuation automaton. They also proved that this automaton is isomorphic to the position automaton in the sense that the two automata have identical sets of states, identical initial and final states, and transitions Theorem 6 in [6]. Using an equivalence relation over the set of states of the c-continuation automaton, they derive the equation automaton in quadratic time.
The definition of the equation automaton of a regular expression is based on that of the partial derivatives of regular expressions, which are multisets of regular expressions over A. The partial derivative of E with respect to a A is defined recursively on the structure of E as follows:
a ( 0 ) = a ( 1 ) = , a ( x ) = if a = x then { 1 } else , a ( F + G ) = a ( F ) a ( G ) , a ( F · G ) = a ( F ) · G λ ( F ) · a ( G ) , a ( F ) = a ( F ) · F .
The partial derivative of E with respect to the string u A is denoted by u ( E ) and recursively defined by ε ( E ) = E and u a ( E ) = a ( u ( E ) ) .
Let D ( E ) = { E } { E | E u ( E )   with   u A } .
Theorem 1.
(Antimirov [2]). The cardinality of the set D ( E ) of all partial derivatives of a regular expression E is less than or equal to | | E | | + 1 .
The equation automaton E E = D ( E ) , A E , E , δ , F of E is defined by:
  • δ ( E , a ) = { E D ( E ) | E a ( E ) } ,
  • F = { E D ( E ) | λ ( E ) = 1 } .
Example 2.
Let us consider the regular expression E = ( a + b a + b ) . The partial derivatives of E are as follows:
a ( E ) = { a ( a + b a + b ) } b ( E ) = { a ( a + b a + b ) , b ( a + b a + b ) }
The computation of the transitions of the equation automaton E E are as follows:
a ( a ( a + b a + b ) ) = a ( a ) ( a + b a + b ) a ( ( a + b a + b ) ) = a ( a ) a ( a + b a + b ) ( a ( a ) a ( b a ) a ( b ) ) ( a + b a + b ) = { a ( a + b a + b ) } b ( a ( a + b a + b ) ) = b ( a ) ( a + b a + b ) b ( ( a + b a + b ) ) = b ( a ) a ( a + b a + b ) ( b ( a ) b ( b a ) b ( b ) ) ( a + b a + b ) = ( b ( b ) a b ( b ) b ) ( a + b a + b ) = { a ( a + b a + b ) , b ( a + b a + b ) } a ( b ( a + b a + b ) ) = a ( b ) ( a + b a + b ) a ( ( a + b a + b ) ) = ( a ( a ) a ( b a ) a ( b ) ) ( a + b a + b ) = { a ( a + b a + b ) } b ( b ( a + b a + b ) ) = b ( b ) ( a + b a + b ) b ( ( a + b a + b ) ) = b ( b ) b ( a + b a + b ) ( b ( a ) b ( b a ) b ( b ) ) ( a + b a + b ) = { b ( a + b a + b ) } ( b ( b ) a b ( b ) b ) ( a + b a + b ) = { b ( a + b a + b ) , a ( a + b a + b ) }
The equation automaton E E associated with E is shown in Figure 3.
In the following, we recall the definition and properties of the c-continuation automaton. Next, we show how it can be bound to the equation automaton.

3.1. C-Continuation Automaton

This automaton has been introduced by Champarnaud and Ziadi [6] to efficiently compute the equation automaton. Let us recall the notion of c-derivative, c-continuation and c-continuation automaton.
Definition 1.
(c-derivative with respect to a letter). The c-derivative of a regular expression E with respect to a letter a is the regular expression d a ( E ) defined by:
d a ( 0 ) = d a ( 1 ) = 0 , d a ( x ) = if a = x then 1 else 0 , d a ( F + G ) = if d a ( F ) 0 then d a ( F ) else d a ( G ) , d a ( F · G ) = if d a ( F ) 0 then d a ( F ) · G else λ ( F ) · d a ( G ) , d a ( F ) = d a ( F ) · F .
The c-derivative with respect to a word u = u 1 u n is defined recursively by the rules: d ε ( E ) = E and d u 1 u n ( E ) = d u 2 u n ( d u 1 ( E ) ) .
Theorem 2.
(Theorem 4 in [6]). Let E be a linear regular expression and a be a letter from E. Then all non-zero c-derivatives of the form d u a ( E ) , where u is an arbitrary word, are equal.
Theorem 2 allows us to define the c-continuation c a ( E ) of a in a linear expression E as the unique value of the non-zero c-derivatives d u a ( E ) .
Proposition 1.
(Proposition 6 in [6]). For every letter a of a linear expression E, the c-continuation c a ( E ) is such that:
c a ( a ) = 1 , c a ( F + G ) = if c a ( F ) exists then c a ( F ) else c a ( G ) , c a ( F · G ) = if c a ( F ) exists then c a ( F ) · G else c a ( G ) , c a ( F ) = c a ( F ) · F .
Corollary 1.
(Corollary 5 in [6]). For every letter a of a linear expression E, the c-continuation c a ( E ) is either 1 or a subexpression of E or a product of subexpressions.
More precisely, for a linear regular expression E, we have c a ( E ) = H 0 H k , where H i is a subexpression of E, for all 0 i k .
We now consider a regular expression E over A. Let E ¯ be the linearized form of E over pos ( E ) and h be the mapping from pos ( E ) onto A E .
In order to simplify the writing for a regular expression E, we consider by convention that c 0 ( E ) = d ε ( E ¯ ) = E ¯ and c x ( E ) will denote c x ( E ¯ ) .
Definition 2.
(c-continuation automaton) The c-continuation automaton of E, C E = Q , A E , i , δ , F , is defined by:
  • Q = { ( x , c x ( E ) ) | x pos ( E ) { 0 } } ,
  • i = ( 0 , c 0 ( E ) ) ,
  • F = { ( x , c x ( E ) ) | λ ( c x ( E ) ) = 1 } ,
  • δ ( ( x , c x ( E ) ) , a ) = { ( y , c y ( E ) | h ( y ) = a   a n d   d y ( c x ( E ) ) c y ( E ) } , x pos ( E ) { 0 }   a n d   a A E .
We note that the number of states of C E is exactly | | E | | + 1 .
Corollary 2.
(Corollary 7 in [6]). Let E be a regular expression. One has: L ( C E ) = L ( E ) .
Example 3.
Let us consider the regular expression E = ( a + b a + b ) from Example 1. The linearized form of E is E ¯ = ( a 1 + b 2 a 3 + b 4 ) and the c-continuations of E are as follows:
c 0 ( E ) = ( a 1 + b 2 a 3 + b 4 ) c a 1 ( E ) = a 1 ( a 1 + b 2 a 3 + b 4 ) c b 2 ( E ) = a 3 ( a 1 + b 2 a 3 + b 4 ) c b 4 ( E ) = b 4 ( a 1 + b 2 a 3 + b 4 ) c a 3 ( E ) = c a 3 ( a 1 + b 2 a 3 + b 4 ) · ( a 1 + b 2 a 3 + b 4 ) = c a 3 ( b 2 a 3 ) · ( a 1 + b 2 a 3 + b 4 ) = c a 3 ( a 3 ) · ( a 1 + b 2 a 3 + b 4 ) = a 3 ( a 1 + b 2 a 3 + b 4 )
The outgoing transitions from the state ( 0 , c 0 ( E ) ) are computed using the c-derivatives of c 0 ( E ) as follows:
d a 1 ( c 0 ( E ) ) = c a 1 ( E ) , d b 2 ( c 0 ( E ) ) = c b 2 ( E ) , d a 3 ( c 0 ( E ) ) = 0 , d b 4 ( c 0 ( E ) ) = c b 4 ( E ) d a 1 ( c a 1 ( E ) ) = c a 1 ( E ) , d b 2 ( c a 1 ( E ) ) = c b 2 ( E ) , d a 3 ( c a 1 ( E ) ) = 0 , d b 4 ( c a 1 ( E ) ) = c b 4 ( E )
d a 1 ( c b 2 ( E ) ) = c a 1 ( E ) , d b 2 ( c b 2 ( E ) ) = c b 2 ( E ) , d a 3 ( c b 2 ( E ) ) = c a 3 ( E ) , d b 4 ( c b 2 ( E ) ) = c b 4 ( E )
d a 1 ( c a 3 ( E ) ) = c a 1 ( E ) , d b 2 ( c a 3 ( E ) ) = c b 2 ( E ) , d a 3 ( c a 3 ( E ) ) = c a 3 ( E ) , d b 4 ( c a 3 ( E ) ) = c b 4 ( E )
d a 1 ( c b 4 ( E ) ) = c a 1 ( E ) , d b 2 ( c b 4 ( E ) ) = c b 2 ( E ) , d a 3 ( c b 4 ( E ) ) = 0 , d b 4 ( c b 4 ( E ) ) = c b 4 ( E )
Then we get the c-continuation automaton C E in Figure 4.

3.2. Equation Automaton as a Quotient of C-Continuation Automaton

Champarnaud and Ziadi [6] have proved that the equation automaton is a quotient of the c-continuation automaton. Let us consider the equivalence relation e defined by
( x , c x ( E ) ) e ( y , c y ( E ) ) h ( c x ( E ) ) h ( c y ( E ) )
Sometimes we write x e y h ( c x ( E ) ) h ( c y ( E ) ) .
Proposition 2.
The relation e is right-invariant, i.e., for all letters a in A, for all pairs of states ( x , c x ( E ) ) , ( y , c y ( E ) ) in Q such that ( x , c x ( E ) ) e ( y , c y ( E ) ) , we have: δ ( ( x , c x ( E ) ) , a ) / e = δ ( ( y , c y ( E ) ) , a ) / e .
Moreover, if two states are equivalent w.r.t. e , then they are either both final or both non-final, since ( x , c x ( E ) ) F λ ( c x ( E ) ) = 1 λ ( h ( c x ( E ) ) ) = 1 .
The equivalence class of the state ( x , c x ( E ) ) is represented by C x = h ( c x ( E ) ) . Since the relation e is right-invariant, we can define the quotient automaton C E / e = Q e , A E , q 0 , δ , F as follows:
  • Q e = { C x | x pos ( E ) { 0 } } ,
  • q 0 = C 0 ,
  • F = { C x | λ ( c x ( E ) ) = 1 } ,
  • δ ( C x , a ) = { C y | h ( y ) = a and d y ( c x ( E ) ) c y ( E ) } , C x Q e and   a A E .
Theorem 3.
(Theorem 10 in [6]). Let E be a regular expression. The automaton C E / e deduced from the c-continuation automaton is isomorphic to the equation automaton E E .
We note that the number of states of E E is majorized by | | E | | + 1 .
Example 4.
Let us consider the regular expression E = ( a + b a + b ) from Example 1. There are three e -equivalence classes when applying the function h that remove indices from letters for different c-continuations of E:
h ( c 0 ( E ) ) = ( a + b a + b ) h ( c a 1 ( E ) ) = h ( c b 2 ( E ) ) = h ( c a 3 ( E ) ) = a ( a + b a + b ) h ( c b 4 ( E ) ) = b ( a + b a + b )
The c-continuation automaton C E and the quotient automaton C E / e which is isomorphic to the equation automaton E E are schematized in Figure 5:

4. Allauzen and Mohri’s Algorithm

In [8], Allauzen and Mohri compute the equation automaton from the Thompson automaton of a regular expression E in O ( m log m + m 2 ) time. Their algorithm is based on some combinations of ε -transitions removal and Hopcroft’s algorithm for DFA minimization to the classical Thompson automata [13]. In the next, we briefly describe their method.
Let A ^ = A E { ε + 1 , ε + 2 , ε 1 , ε 2 } . We denote by T E ^ the automaton over A ^ obtained by recursively marking some of the ε -transitions of the Thompson automaton T E as follows:
Algorithms 14 00238 i001
Allauzen and Mohri have shown that the equation automaton can be obtained using some ε -transitions marking of the Thompson automaton and then apply two classical automata operations, namely epsilon removal, denoted by the function r m e p s (resp. the function r m e p s ^ for marked epsilon removal) and the Hopcroft’s algorithm for DFA minimization [13], denoted by m i n B .
Proposition 3.
(Proposition 3 in [8]). We have E E = r m e p s ^ ( m i n B ( r m e p s ( T E ^ ) ) ) .
Note that after removing ε -transitions from the automaton T E ^ , we obtain a deterministic finite automaton r m e p s ( T E ^ ) . After that, the Hopcroft’s algorithm for DFA minimization is applied to derive the automaton m i n B ( r m e p s ( T E ^ ) ) such that the set of its states is in bijection with the set of partial derivatives of E. Finally, to compute transitions of the equation automaton from m i n B ( r m e p s ( T E ^ ) ) , marked ε -transitions are removed using r m e p s ^ operation.
Theorem 4.
(Theorem 3 in [8]). Let E be a regular expression over A. The equation automaton of E can be computed in O ( m log m + m 2 ) time.

5. Efficient Conversion Algorithm

In this section, we will show that the equation automaton E E of a regular expression E can be deduced from the associated Thompson automaton in O ( | E | · | Q e | ) time, where | Q e | denotes the number of states of E E . Algorithm 1 summarizes the different steps of our approach.
    Algorithm 1 Computation of the equation automaton.
input: The Thompson automaton T E = Q , A E , I , δ , F associated with a regular expression E.
output:  The equation automaton E E associated with E.
/*           Computation of states                   */
Compute I d ( T E ):
Algorithms 14 00238 i004
Compute C e ( I d ( T E ) ) :
• Compute pseudo-continuations for all position states of I d ( T E ) .
• Merge equivalent states having the same pseudo-continuation.
/*   Computation of transitions and final states              /*
Compute r m e p s ( C e ( I d ( T E ) ) ) :
• Perform epsilon removal operation using r m e p s ( ) function over C e ( I d ( T E ) ) .
For convenience, we assume that the k states of a given finite automaton are identified by the integers 1 , , k .
From Corollary 1, the c-continuation c x ( E ) = H 0 H l associated with a position x is a concatenation of distinct subexpressions H i of E, possibly reduced to a single subexpression or to 1. In the Thompson automaton T E , we can associate a position x to a particular state q, called position state and define the associated pseudo-continuation C ( q ) = N ( H 0 ) N ( H l ) , where N ( H i ) denotes the integer that identify the initial state of the Thompson automaton T H i . So, the first step, compute I d ( T E ) , of our algorithm consists on computing the function N ( . ) such that for two subexpressions H i and H j of E, we have: H i H j N ( I T H i ) = N ( I T H j ) . This step can be done using a special marking of the ε -transitions of T E that makes it acyclic and deterministic and such that the right languages of its states represent the structure of the corresponding subexpressions. In the next step, Compute C e ( I d ( T E ) ) , we re-mark the ε -transitions such that the resulted automaton is acyclic and deterministic and the right language of a position state in I d ( T E ) represents a pseudo-continuation. After that, one can merge equivalent position states having the same right language. The final step is the computation of final states and transitions of the equation automaton using an epsilon removal operation, denoted by r m e p s ( . ) , from the resulted automaton in the previous step.
In the next, we will show that the equation automaton E E can be computed efficiently from the Thompson automaton T E using the following operations r m e p s ( C e ( I d ( T E ) ) ) .

5.1. Computation of States

In the following, We will show that the computation of the relation e over the states of the Thompson automaton can be performed in linear time w.r.t. the size of the expression using the minimization of an acyclic deterministic finite automaton. This minimization can be performed efficiently in O ( | E | ) time using Bubenzer’s algorithm [17,18].
Before computing the equivalence classes C e over states of the Thompson automaton, we will perform a preprocessing step to identify all identical sub-expressions of E. In the next, we will show that this identification can be done in O ( | E | ) time.

5.1.1. Sub-Expressions Identification

Let Exp the set of all subexpressions of E. In this preprocessing step, we will mark each state in the Thompson automaton by a unique letter in the set { 1 , 2 , , | Exp | } .
Let us define a bijection N between the set Exp and a finite set of letters { 1 , 2 , , | Exp | } . Consequently, if E 1 and E 2 are two sub-expressions of E, then we have:
E 1 E 2 N ( E 1 ) = N ( E 2 )
Based on the parsing method, introduced in Section 6 in [19], that derive an equivalent regular expression from Thompson automaton, each subexpression H i of E is associated with an integer identifying the initial state I T H i in the Thompson automaton T E .
Let q be a state in T E , we denote by E q the subexpression associated with q, if it exists. For abbreviation, N ( q ) represents N ( E q ) .
In the following, we will show that the computation of the function N over the states of T E turns into a minimization of the acyclic deterministic sub-automaton of the Thompson automaton, I d ( T E ) = Q , A E , I , δ , F , defined by:
  • A = { ε l + , ε · , ε r + , ε 1 , ε 2 } A E where l (resp. r), denote left (resp. right),
  • Q = { ( q , N ( q ) ) | q Q } , i.e., a state in T E is augmented by the letter N ( q ) .
  • The transition function δ is defined over the Thompson automaton as follows:
Algorithms 14 00238 i002
Notice that this automaton is an acyclic deterministic sub-automaton of the Thompson automaton where ε -transitions are indexed and the cyclic transitions in the case when E = F are temporarily disabled.
To compute identical subexpressions, we define the equivalence relation ∼ over the states of I d ( T E ) as follows:
q q N ( q ) = N ( q )
Thus we have:
[ q ] = [ q ] N ( E q ) = N ( E q )
Lemma 1.
Let q and q be two states in I d ( T E ) . We have:
L q ( I d ( T E ) ) = L q ( I d ( T E ) ) E q E q
Proof. 
Obvious, by construction. □
Proposition 4.
The function N ( . ) can be computed over I d ( T E ) in O ( | E | ) time.
Proof. 
Let q and q two states in I d ( T E ) . One has:
q q ( 3 ) N ( q ) = N ( q ) ( 2 ) E q E q L e m m a 1 L q ( I d ( T E ) ) = L q ( I d ( T E ) )
Thus, the equivalence relation ∼ coincides with Myhill-Nerode equivalence relation [20,21] over the states of I d ( T E ) . Since the automaton I d ( T E ) is deterministic and acyclic, its minimization using Bubenzer’s algorithm [17] requires O ( | E | ) time and space complexity. □
Example 5.
The automaton I d ( T E ) obtained after performing the subexpression identification step for the regular expression E = ( a + b a + b ) through T E .
As shown in Figure 6, for the states 3 and 9 we have:
L 3 ( I d ( T E ) ) = { ε 1 a ε 1 ε 1 , ε 2 ε 1 } L 9 ( I d ( T E ) ) = { ε 1 a ε 1 ε 1 , ε 2 ε 1 } 3 9 N ( 3 ) = N ( 9 )
As a consequence, we have E 3 E 9 .

5.1.2. e -Equivalent States Merging

Let us now turn to the computation of the set of states of the equation automaton. From Corollary 1, the c-continuation c x ( E ) is a concatenation of distinct subexpressions of E, possibly reduced to a single subexpression or to 1. The following proposition shows that the c-continuation c x ( E ) can be computed over the syntactic tree T ( E ¯ ) associated with the linearized version E ¯ .
Proposition 5.
(Ref. [6]). Let E be a regular expression and x a position in pos ( E ) . Let ν x be a node in T ( E ¯ ) such that sym ( ν x ) = x . The c-continuation c x ( E ) is as follows:
c x ( E ) = ν x ν ν E ¯ f ( ν ) E f ( ν )
where is the concatenation operator.
The function f : Nodes ( E ) { } Nodes ( E ) { } is defined as follows:
f ( ν ) = father ( ν ) i f sym ( father ( ν ) ) =   a n d   ν ν E ¯ right ( father ( ν ) ) i f sym ( father ( ν ) ) = · o t h e r w i s e .
with is an artificial node such that f ( ) = .
Using Proposition 5, the computation of the set of states requires O ( | E | 3 ) time and space complexity. This is due to the fact that the size of a c-continuation is in O ( | E | 2 ) . In order to reduce this complexity, we introduce a modified definition of pseudo-continuation introduced in [6] over an acyclic deterministic sub-automaton of the Thompson automaton, denoted by C e ( I d ( T E ) ) . When merging e -equivalent states over I d ( T E ) , we get the automaton C e ( I d ( T E ) ) . This step requires a linear time w.r.t the size of E using Bubenzer’s algorithm [17], since the automaton I d ( T E ) is acyclic and deterministic.
In the following, a state ( q , N ( q ) ) in the automaton I d ( T E ) is called a position state, if there exists a A E and ( q , N ( q ) ) Q such that δ ( ( q , N ( q ) ) , a ) = ( q , N ( q ) ) . The state ( 0 , N ( 0 ) ) is also considered as a position state.
It is obvious to see that each position state is associated with a unique position in pos ( E ) { 0 } and then it’s can be associated with a c-continuation.
Let ( q , N ( q ) ) and ( q , N ( q ) ) two position states associated with the positions x and x in pos ( E ) . One can extend the e relation over position states in I d ( T E ) as follows:
( q , N ( q ) ) e ( q , N ( q ) ) h ( c x ( E ) ) h ( c x ( E ) )
In the next, we will prove that the computation of the equivalence classes C e can be performed in a linear time w.r.t. the size of the regular expression over I d ( T E ) using the notion of pseudo-continuations.
For abbreviation, a state ( q , N ( q ) ) in I d ( T E ) will be denoted by q. We denote by C ( q )  the pseudo-continuation associated with the position state q which is an implicit representation of its c-continuation c x ( E ) , where x is the position letter of q. We will show that the computation of the equivalence classes C e turns on the computation of pseudo-continuation C ( q ) over a particular ε -transitions marking of the automaton I d ( T E ) .
Definition 3.
The pseudo-continuation C ( q ) associated with a position state ( q , N ( q ) ) Q in I d ( T E ) is recursively defined by:
C ( q ) = N ( q ) i f   δ ( q , ε l + ) = q   a n d   δ ( q , ε r + ) = q , N ( q ) · C ( q ) i f   δ ( q , ε 1 ) = q   a n d   δ ( q , a ) = q   f o r   s o m e   a A E , N ( q ) · C ( q ) i f   δ ( q , ε 1 ) = q   a n d   δ ( q , ε 2 ) = q , a · C ( q ) i f   δ ( q , a ) = q ,   f o r   a l l   a A E , C ( q ) i f   δ ( q , ε ) = q .
In order to compute efficiently the set of pseudo-continuations associated with the position states in I d ( T E ) , we define an acyclic deterministic sub-automaton C e ( I d ( T E ) ) = Q e , A E , I , δ , F of the Thompson automaton as follows:
  • A = { ε 0 , , ε | Exp | } A E .
  • Q e = { ( q , C ( q ) ) | ( q , N ( q ) ) Q } , i.e., a state in I d ( T E ) is replaced by ( q , C ( q ) ) .
  • The transition function δ is defined as follows:
Algorithms 14 00238 i003
Let us define the equivalence relation ≈ over the position states of C e ( I d ( T E ) ) as follows:
( q , C ( q ) ) ( q , C ( q ) ) C ( q ) C ( q )
Thus we have:
[ q ] = [ q ] C ( q ) C ( q )
Let h ¯ be the application that maps a letter ε i to the letter i. By construction, the following proposition holds.
Proposition 6.
Let ( q , C ( q ) ) be a position state in C e ( I d ( T E ) ) . We have:
h ¯ ( L q ( C e ( I d ( T E ) ) ) ) = C ( q )
Example 6.
Let us consider the Thompson automaton T E defined in previous examples. Figure 7 schematizes the derived automaton from I d ( T E ) after pseudo continuations computation for position states.
Notice that dotted ε-transitions are temporarily disabled and dashed ones are temporarily added.
The position states in the automaton I d ( T E ) are { 0 , 5 , 8 , 11 , 16 } . According to the definition of a pseudo continuation (see Formula (7)), the pseudo-continuations associated with position states are computed over the automaton I d ( T E ) as follows:
C ( 5 ) = C ( 8 ) = C ( 11 ) = 10 · 1 C ( 16 ) = 15 · 1
On the other hand, we have:
L 5 ( C e ( I d ( T E ) ) ) = L 8 ( C e ( I d ( T E ) ) ) = L 11 ( C e ( I d ( T E ) ) ) = ε 10 · ε 1 L 16 ( C e ( I d ( T E ) ) ) = ε 15 · ε 1
The following proposition is fundamental to prove that the equivalence relation e using the notion of c-continuation is the same when using pseudo-continuations C ( q ) .
Proposition 7.
Let q (resp. q ) be a position state associated with a position x pos ( E ) (resp. x pos ( E ) ) in C e ( I d ( T E ) ) . One has:
C ( q ) C ( q ) h ( c x ( E ) ) h ( c x ( E ) )
As a consequence, the following proposition holds.
Proposition 8.
Let q and q be two position states in C e ( I d ( T E ) ) . One has:
q e q C ( q ) C ( q )
Theorem 5.
Let E be a regular expression and T E the associated Thompson automaton. The relation e can be computed over C e ( I d ( T E ) ) in O ( | E | ) time.
Proof. 
From Proposition 8, one can deduce that the computation of the equivalence relation e , turn to apply the Myhill-Nerode relation on the states of the automaton C e ( I d ( T E ) ) . By definition, this last is acyclic and deterministic. Then, its minimization using Bubenzer’s algorithm [17] requires O ( | E | ) time and space complexity. □
Example 7
(Continues). The automaton C e ( I d ( T E ) ) , schematized in the Figure 8, is obtained from I d ( T E ) after merging e -equivalent position states 5 , 8 and 11.
The next step of our approach consists of the transitions and final states computation of the equation automaton E E using epsilon removal operation, denoted by r m e p s ( ) , over the automaton C e ( I d ( T E ) ) .

5.2. Computation of Transitions and Final States

After merging e -equivalent states in the previous step, we obtain a reduced automaton C e ( I d ( T E ) ) having the same set of states as the equation automaton. To compute the transition function, we first enable the cyclic transitions previously disabled in the case when E = F on C e ( I d ( T E ) ) .
Let Q e be the set of states of C e ( I d ( T E ) ) . Recall that epsilon removal operation is denoted by, r m e p s ( p ) for a state p Q and r m e p s ( A ) denotes the resulted automaton after removing marked and non-marked ε -transitions from the automaton A .
As a consequence of Lemma 5 from [8], the following Lemma yeilds.
Lemma 2.
(Lemma 5 in [8]). Let q and q be two position states in Q e associated respectively with the positions x and x in pos ( E ) { 0 } , we have:
q r m e p s ( q )   i f f   h ( c x ( E ) ) a ( h ( c x ( E ) ) ) ,   f o r   s o m e   a A E .
The set of destination states of the outgoing transitions from a state q Q e is then equal to
{ p Q e | p r m e p s ( q )   and   p   is   a   position   state }
Lemma 3.
Let q be a position state in Q e associated with a position x in pos ( E ) { 0 } , we have:
F r m e p s ( q )   i f f   λ ( c x ( E ) ) = 1 .
Proposition 9.
We have E E = r m e p s ( C e ( I d ( T E ) ) )
Example 8.
(Continues). Let us consider the automaton C e ( I d ( T E ) ) of the Example 7. The final states and the transitions of the equation automaton are computed over C e ( I d ( T E ) ) using epsilon removal operation r m e p s as follows:
  • The set of states of the equation automaton are { 0 , { 5 , 8 , 11 } , 16 } .
  • Since the final state of C e ( I d ( T E ) ) is the state 19 and 19 r m e p s ( 0 ) , 19 r m e p s ( { 5 , 8 , 11 } ) , and 19 r m e p s ( 16 ) , then the set of final states are { 0 , { 5 , 8 , 11 } , 16 } .
  • There are two paths in C e ( I d ( T E ) ) from the state 0 to the state { 5 , 8 , 11 } labeled respectively by ε · ε · ε · ε · a and ε · ε · ε · b , then { 5 , 8 , 11 } r m e p s ( 0 ) . Consequently, two transitions ( 0 , a , { 5 , 8 , 11 } ) and ( 0 , b , { 5 , 8 , 11 } ) are added to the equation automaton. The same process will be applied for other transitions.
Since there are O ( | E | ) states in C e ( I d ( T E ) ) and the operation r m e p s ( C e ( I d ( T E ) ) ) is performed on exactly | Q e | states, the following theorem holds.
Theorem 6.
Let E be a regular expression. The equation automaton of E can be computed in O ( | E | · | Q e | ) .

6. Conclusions

In this paper, we presented a fast and sophisticated construction of the equation automaton from a regular expression over its associated Thompson automaton. The time complexity of our algorithm is at least as favorable as that of the best previously known algorithm. It is based on the minimization of acyclic deterministic finite automata and epsilon removal operations. This allowed us a construction of the equation automaton in O ( | E | · | Q e | ) time and space complexity where | Q e | denotes the number of transitions of the produced automaton. The implementation of the proposed algorithm is available under the following repository: https://github.com/FaissalOuardi/Equation-automaton (accessed on 27 May 2021).

Author Contributions

Conceptualization, F.O., Z.L. and B.E.; methodology, F.O., Z.L. and B.E.; validation, F.O., Z.L. and B.E.; formal analysis, F.O., Z.L. and B.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Source code can be found under the following link: https://github.com/FaissalOuardi/Equation-automaton (accessed on 27 May 2021).

Acknowledgments

We wish to thank the referees for the care they put into reading the previous versions of this manuscript. Their comments were invaluable in depth and detail, and the current version owes much to their efforts.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mirkin, B.G. Novyj algoritm postroéniá bazisa v ázyké régularnyh vyražénij. Izvéstiá Akadémii Nauk SSSR. Engineering cybernetics, no. 5 (1966). English translation of the preceding: Brzozowski, J. An algorithm for constructing a base in a language of regular expressions. pp. 110–116. J. Symb. Log. 1971, 36, 694. [Google Scholar]
  2. Antimirov, V. Partial derivatives of regular expressions and finite automaton constructions. Theor. Comput. Sci. 1996, 155, 291–319. [Google Scholar] [CrossRef] [Green Version]
  3. Glushkov, V.M. The abstract theory of automata. Russ. Math. Surv. 1961, 16, 1–53. Available online: https://iopscience.iop.org/article/10.1070/RM1961v016n05ABEH004112 (accessed on 27 May 2021). [CrossRef]
  4. McNaughton, R.F.; Yamada, H. Regular expressions and state graphs for automata. IEEE Trans. Electron. Comput. 1960, 9, 39–57. [Google Scholar] [CrossRef]
  5. Ziadi, D.; Ponty, J.-L.; Champarnaud, J.-M. A New Quadratic Algorithm to Convert a Regular Expression into an Automaton. In Proceedings of the Workshop on Implementing Automata, London, ON, Canada, 29–31 August 1996; pp. 109–119. [Google Scholar]
  6. Champarnaud, J.-M.; Ziadi, D. Canonical derivatives, partial derivatives and finite automaton constructions. Theor. Comput. Sci. 2002, 289, 137–163. [Google Scholar] [CrossRef] [Green Version]
  7. Khorsi, A.; Ouardi, F.; Ziadi, D. Fast equation automaton computation. J. Discret. Algorithms 2008, 6, 433–448. [Google Scholar] [CrossRef] [Green Version]
  8. Allauzen, C.; Mohri, M. A Unified Construction of the Glushkov, Follow, and Antimirov Automata. In Proceedings of the International Conference of Mathematical Foundations of Computer Science, Stará Lesná, Slovakia, 28 August–1 September 2006; pp. 110–121. [Google Scholar]
  9. Ilie, L.; Yu, S. Follow automata. Inf. Comput. 2003, 186, 140–162. [Google Scholar] [CrossRef] [Green Version]
  10. Champarnaud, J.-M.; Nicart, F.; Ziadi, D. From the ZPC Structure of a Regular Expression to its Follow Automaton. IJAC 2006, 16, 17–34. [Google Scholar] [CrossRef]
  11. Kleene, S. Representation of Events in Nerve Nets and Finite Automata; Automata Studies, Ann. Math. Studies 34; Princeton University Press: Princeton, NJ, USA, 1956; pp. 3–41. [Google Scholar]
  12. Thompson, K. Regular Expression Search Algorithm. Commun. ACM 1968, 11, 410–422. [Google Scholar] [CrossRef]
  13. Hopcroft, J. An n log n Algorithm for Minimizing States in a Finite Automaton; Technical Report; Stanford University, CS Dept.: Stanford, CA, USA, 1971. [Google Scholar]
  14. Hopcroft, J.E.; Ullman, J.D. Introduction to Automata Theory, Languages and Computation; Addison-Wesley: Reading, MA, USA, 1979. [Google Scholar]
  15. Sakarovitch, J.; Thomas, R. Elements of Automata Theory; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  16. Brüggemann-Klein, A. Regular expressions into finite automata. Theor. Comp. Sci. 1993, 120, 117–126. [Google Scholar] [CrossRef] [Green Version]
  17. Bubenzer, J. Cycle-aware minimization of acyclic deterministic finite-state automata. J. Discret. Appl. Math. 2014, 163, 238–246. [Google Scholar] [CrossRef]
  18. Revuz, D. Minimization of acyclic deterministic automata in linear time. Theor. Comput. Sci. 1992, 92, 181–189. [Google Scholar] [CrossRef] [Green Version]
  19. Giammarresi, D.; Ponty, J.-L.; Wood, D.; Ziadi, D. A characterization of Thompson digraphs. Discret. Appl. Math. 2004, 134, 317–337. [Google Scholar] [CrossRef]
  20. Myhill, J. Finite automata and the representation of events. In WADD TR-57-624; Wright Patterson AFB: Dayton, OH, USA, 1957; pp. 112–137. [Google Scholar]
  21. Nerode, A. Linear automata transformation. Proc. AMS 1958, 9, 541–544. [Google Scholar] [CrossRef]
Figure 1. Thompson construction of an NFA.
Figure 1. Thompson construction of an NFA.
Algorithms 14 00238 g001
Figure 2. Thompson automaton T E .
Figure 2. Thompson automaton T E .
Algorithms 14 00238 g002
Figure 3. The equation automaton E E .
Figure 3. The equation automaton E E .
Algorithms 14 00238 g003
Figure 4. The c-continuation automaton C E .
Figure 4. The c-continuation automaton C E .
Algorithms 14 00238 g004
Figure 5. (a) The c-continuation automaton C E versus (b) The quotient automaton C E / e .
Figure 5. (a) The c-continuation automaton C E versus (b) The quotient automaton C E / e .
Algorithms 14 00238 g005
Figure 6. The automaton I d ( T E ) associated with E = ( a + b a + b ) .
Figure 6. The automaton I d ( T E ) associated with E = ( a + b a + b ) .
Algorithms 14 00238 g006
Figure 7. Pseudo-continuations computation for position states in I d ( T E ) .
Figure 7. Pseudo-continuations computation for position states in I d ( T E ) .
Algorithms 14 00238 g007
Figure 8. The automaton C e ( I d ( T E ) ) associated with E = ( a + b a + b ) .
Figure 8. The automaton C e ( I d ( T E ) ) associated with E = ( a + b a + b ) .
Algorithms 14 00238 g008
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Ouardi, F.; Lotfi, Z.; Elghadyry, B. Efficient Construction of the Equation Automaton. Algorithms 2021, 14, 238. https://doi.org/10.3390/a14080238

AMA Style

Ouardi F, Lotfi Z, Elghadyry B. Efficient Construction of the Equation Automaton. Algorithms. 2021; 14(8):238. https://doi.org/10.3390/a14080238

Chicago/Turabian Style

Ouardi, Faissal, Zineb Lotfi, and Bilal Elghadyry. 2021. "Efficient Construction of the Equation Automaton" Algorithms 14, no. 8: 238. https://doi.org/10.3390/a14080238

APA Style

Ouardi, F., Lotfi, Z., & Elghadyry, B. (2021). Efficient Construction of the Equation Automaton. Algorithms, 14(8), 238. https://doi.org/10.3390/a14080238

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop