Article

Complete Subhedge Projection for Stepwise Hedge Automata †

by
Antonio Al Serhali
* and
Joachim Niehren
*
Inria Center, University of Lille, 59000 Lille, France
*
Authors to whom correspondence should be addressed.
This paper is an extended version of our paper published in International Symposium on Fundamentals of Computation Theory, Trier, Germany, 18–21 September 2023.
Algorithms 2024, 17(8), 339; https://doi.org/10.3390/a17080339
Submission received: 29 January 2024 / Revised: 23 July 2024 / Accepted: 23 July 2024 / Published: 2 August 2024
(This article belongs to the Special Issue Selected Algorithmic Papers From FCT 2023)

Abstract

We demonstrate how to evaluate stepwise hedge automata (Shas) with subhedge projection while completely projecting irrelevant subhedges. Since this requires passing finite state information top-down, we introduce the notion of downward stepwise hedge automata. We use them to define in-memory and streaming evaluators with complete subhedge projection for Shas. We then tune the evaluators so that they can decide on membership at the earliest time point. We apply our algorithms to the problem of answering regular XPath queries on Xml streams. Our experiments show that complete subhedge projection of Shas can indeed speed up earliest query answering on Xml streams so that it becomes competitive with the best existing streaming tools for XPath queries.

1. Introduction

Hedges are sequences of letters from an alphabet and of trees ⟨h⟩, where h is again a hedge. Hedges abstract from the syntactical details of Xml or JSON documents while still being able to represent them. The linearization of a hedge is a nested word that can be written in a file. In this article, we study the problem of hedge pattern matching, i.e., whether a given hedge h belongs to a given regular hedge language L. Regular hedge languages can be defined in XPath or JSONPath in particular.
Projection is necessary for the efficiency of many algorithms on words, trees, hedges, or nested words. Intuitively, an algorithm is projecting if it visits only a fragment of the input structure: in the best case, the part that is relevant to the problem under consideration. The relevance of projection for Xml processing was already noticed by [1,2,3,4]. Saxon’s in-memory evaluator, for instance, projects input Xml documents relative to an Xslt program, which contains a collection of XPath queries to be answered simultaneously [5]. The QuiXPath tool [4] evaluates XPath queries in streaming mode with subtree and descendant projection. Projection during the evaluation of JSONPath queries on JSON documents in streaming mode is called fast-forwarding [6].
Projecting in-memory evaluation assumes that the full graph of the input hedge is constructed beforehand. Nevertheless, projection may still save time if one has to match several patterns on the same input hedge or if the graph was constructed for different reasons anyway. In streaming mode, the situation is similar: the whole input hedge on the stream needs to be parsed, but the evaluators need to inspect only nodes that are not projected away. Given that pure parsing is two or three orders of magnitude faster than pattern evaluation, the time gain of projection may be considerable.
The starting point of this article is the notion of subhedge irrelevance of positions of a hedge h with respect to a hedge language L, which we introduce. These are positions of h at which, for all possible continuations, inserting any subhedge does not affect membership in L. We contribute an algorithm for hedge pattern matching with complete subhedge projection. Our algorithm decides hedge pattern matching while completely projecting away the subhedges located at irrelevant positions. In other words, it decides whether a hedge h matches a pattern L without visiting any subhedge of h that is located at a position that is irrelevant with respect to L.
Regular hedge languages can be represented by nested regular expressions, which can be derived from regular XPath queries [7], or by some kind of hedge automata, which can be compiled from nested regular expressions. We use stepwise hedge automata (Shas) [8], a recent variant of standard hedge automata, which, in turn, date back to the 1960s [9,10]. Shas mix the bottom-up processing of standard tree automata with the left-to-right processing of finite word automata (Dfas). They neither support top-down processing nor have an explicit stack, unlike nested word automata (Nwas) [11,12,13,14,15]. Shas have a good notion of bottom-up and left-to-right determinism, avoiding the problematic notion of determinism for standard hedge automata and the too-costly notion of determinism for Nwas. Furthermore, compilers from nested regular expressions to Shas are available [8], and determinization algorithms [16] produce small Shas for all regular XPath queries in the existing practical benchmarks [7,17].
The motivation for the present work is to add subhedge projection to a recent streaming algorithm for deterministic Shas that can evaluate regular XPath queries in the earliest manner. Most alternative streaming approaches avoid earliest query answering by considering sublanguages of regular queries without delays, so that node selection depends only on the past of the node but not on its future [18,19], or by admitting late selection [20]. In particular, it was recently demonstrated that earliest query answering for regular monadic queries defined by deterministic Shas [21] has a lower worst-case complexity than for deterministic Nwas [22]. Thereby, earliest query answering for regular queries became feasible in practice for the first time. On the other hand, it is still slower experimentally than the best existing non-earliest approaches for general regular monadic queries on Xml streams (see [23] for a comparison), since the latter support projection. How projection could be done for Shas has not been studied before the conference version of the present article [24].
Consider the example of the XPath filter [self::list/child::item] that is satisfied by an Xml document if its root is a list element that has an item child. With the encoding of Xml documents as hedges chosen in the present article, satisfying hedges have the form ⟨list · h₁ · ⟨item · h₂⟩ · h₃⟩ · h₄ for some hedges h₁, h₂, h₃, h₄. When evaluating this filter on a hedge, it is sufficient to inspect the first subtree for having the root label list and then all its children until one with root label item is found. The subhedges of these children, i.e., the hedge h₂ and the subhedges of the top-level trees in h₁ and h₃, can be projected away, as can the hedge h₄: they do not have to be inspected for evaluating this filter. However, one must memorize whether the level of the current node is 0, 1, or greater. This level information can be naturally updated in a top-down manner. The evaluators of Shas, however, operate bottom-up and left-to-right exclusively. Therefore, projecting evaluators for Shas need to be based on more general machines. We propose downward stepwise hedge automata (Sha↓s), a variant of Shas that supports top-down processing. They are basically Neumann and Seidl's pushdown forest automata [25], except that they are applied to unlabeled hedges instead of labeled forests. Furthermore, Nwas are known to operate in basically the same manner on nested words [26] while allowing for slightly more general visible pushdowns. We then distinguish subhedge projection states for Sha↓s and show how to use them to evaluate Shas with subhedge projection both in memory and in streaming mode.
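The level-based projection described above can be sketched as a small streaming evaluator. The following Python sketch is not the paper's automaton construction: it represents the nested-word stream as a sequence of events '<', '>', and label strings, assumes that every tree starts with its label (as in the document encoding of this article), and uses the hypothetical helper names `skip_subhedge` and `matches_filter`.

```python
def skip_subhedge(events):
    """Consume events up to and including the '>' that closes the
    currently open subtree, without analyzing them (projection)."""
    depth = 1
    for ev in events:
        if ev == '<':
            depth += 1
        elif ev == '>':
            depth -= 1
            if depth == 0:
                return

def matches_filter(events):
    """Evaluate the XPath filter [self::list/child::item] on a stream of
    nested-word events, projecting away irrelevant subhedges."""
    it = iter(events)
    # level 0 -> 1: the first tree must carry the label 'list'
    if next(it, None) != '<' or next(it, None) != 'list':
        return False
    # level 1: scan the children of the list element
    for ev in it:
        if ev == '>':
            return False  # list element closed without an item child
        if ev == '<':
            if next(it, None) == 'item':
                return True  # answer known; the rest of the stream is irrelevant
            skip_subhedge(it)  # level >= 2: subhedge projection
    return False
```

Note that the evaluator never inspects the contents of non-item children, and it stops as soon as the answer is determined, mirroring the earliest query answering discussed below.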
As a first contribution, we present the safe-no-change compiler from Shas to Sha↓s that can distinguish appropriate subhedge projection states. The idea is that the safe-no-change Sha↓ can distinguish contexts in which the states of the Sha will safely not change. For instance, the XPath filter [self::list/child::item] can be defined by a deterministic Sha operating bottom-up, which our compiler maps to a Sha↓ operating top-down. The context information made explicit by top-down processing concerns the levels of the states. This permits us to distinguish projection states, starting from level 2, in which subhedges can be ignored. We prove the soundness of our compiler based on a nontrivial invariant that we establish (we note that the proof required an adaptation of the original compiler from FCT [24]). We also present a counter-example showing that the safe-no-change projection is not complete for subhedge projection: a subhedge may be irrelevant even though its state may still be changing.
As a second contribution, we propose the congruence projection algorithm. It again compiles Shas to Sha↓s but relies on congruence relations of automata states. We then prove that congruence projection yields not only a sound algorithm but that this algorithm is also complete for subhedge projection, i.e., all strongly irrelevant subhedges (see Definition 9) are evaluated in a looping state. Congruence projection starts at the top level of hedges with the Myhill–Nerode congruence, which is well known from automata minimization.
For word languages L, this congruence identifies prefixes v that have the same residual language v⁻¹(L) = { w | v · w ∈ L }. A prefix v is then irrelevant if its residual language v⁻¹(L) is either universal or empty. In the case of hedges, which extend words by adding hierarchical nesting via a tree constructor, the congruence must be adapted when moving down into subtrees.
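For finite word languages, the residual can be computed directly. A minimal sketch with words as Python strings (`residual` is a hypothetical helper, not from the paper):

```python
def residual(v, L):
    """Residual language v^{-1}(L) = { w | v·w in L } of a finite word
    language L, where words are represented as Python strings."""
    return {w[len(v):] for w in L if w.startswith(v)}
```

A prefix with an empty residual can never be extended into the language, which is one of the two ways a prefix becomes irrelevant.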
We show that both compilers may increase the size of the automata exponentially since they must be able to convert bottom-up deterministic automata on monadic trees into top-down deterministic automata. This is the same as reversing deterministic automata on words, which is well known to cause an exponential blow-up in the worst case (e.g., for the family of languages (a + b)* · a · (a + b)ⁿ, where n ∈ ℕ, the minimal deterministic left-to-right automaton has 2ⁿ⁺¹ states, while the minimal deterministic right-to-left automaton has n + 2 states).
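The blow-up can be observed concretely by determinizing the natural NFA for (a + b)* · a · (a + b)ⁿ with the subset construction. The sketch below (with the hypothetical helper `det_size`) counts the reachable subset states when reading left to right; the count is 2ⁿ⁺¹.

```python
def det_size(n):
    """Number of reachable subset states when determinizing the natural
    NFA for (a+b)*·a·(a+b)^n, read left to right. NFA states: 0 is the
    initial loop; state i in 1..n+1 means 'a' was guessed i letters ago."""
    accept = n + 1

    def delta(state, letter):
        succ = set()
        for q in state:
            if q == 0:
                succ.add(0)          # stay in the (a+b)* loop
                if letter == 'a':
                    succ.add(1)      # guess that this 'a' is the marked one
            elif q < accept:
                succ.add(q + 1)      # count down the (a+b)^n suffix
        return frozenset(succ)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for letter in 'ab':
            t = delta(s, letter)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)
```

Each reachable subset is {0} together with the set of positions of 'a' among the last n + 1 letters, so all 2ⁿ⁺¹ subsets occur.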
This exponential blow-up can be avoided when one is only interested in evaluating hedges by automata with subhedge projection: instead of constructing the complete Sha↓ statically, only its required part may be constructed on the fly while evaluating a hedge.
Our third contribution is the refinement of the two previous compilers so that they can also distinguish safe states for selection at the earliest position. For this, we combine these compilers with a variant of the earliest membership tester of [21] that operates in memory by compiling Shas to Sha↓s instead of to Nwas. Furthermore, membership failure is detected at the earliest position, too. In this way, we obtain an earliest in-memory membership tester for deterministic Shas.
Our fourth contribution is to show how to run the previous in-memory evaluators in streaming mode, reading the linearization of the input hedge into a nested word from a stream. Thereby, we can improve on the recent earliest streaming membership tester behind the AStream tool [21] by adding subhedge projection to it.
Our fifth and last contribution is an implementation and experimental evaluation of the streaming earliest query answering algorithm for dShas, introducing subhedge projection into the AStream tool [21]. We implemented both safe-no-change and congruence projection. For this, we lifted our earliest membership tester with subhedge projection to an earliest query answering algorithm for monadic queries. For the experimentation, we started with deterministic Shas constructed with the compiler from [8] for the forward regular XPath queries of the XPathMark benchmark [27] and real-world XPath queries [7]. It turns out that congruence projection projects much more strongly than safe-no-change projection for at least half of the benchmark queries. It reduces the running time for all regular XPath queries considerably since large parts of the input hedges can be projected away. In our benchmark, the projected percentage ranges from 75.7% to 100% of the input stream. For XPath queries that contain only child axes, the earliest query answering algorithm of AStream with congruence projection becomes competitive in efficiency with the best existing streaming tool, QuiXPath [23], which is not earliest for some queries. Our current implementation of earliest congruence projection in AStream is slower than QuiXPath by a factor of 1.3 to 2.9 on the benchmark queries. The improvement is smaller for XPath queries with the descendant axis, where less subhedge projection is possible; instead, some kind of descendant projection would be needed. Still, even in the worst case of our benchmark, earliest congruence projection with AStream is slower than QuiXPath only by a factor of 13.8.
  • Outline. We start with preliminaries on hedges and nested words in Section 2. In Section 3, we recall the problem of hedge pattern matching and formally define irrelevant subhedges. In Section 4, we recall the definition of Shas, introduce Sha↓s, and discuss their in-memory evaluators. In Section 5, we introduce the notion of subhedge projection states, define when a Sha↓ is complete for subhedge projection, and present an in-memory evaluator for Sha↓s with subhedge projection. In Section 6, we introduce the safe-no-change projection, state its soundness, and show its incompleteness. In Section 7, we introduce congruence projection and prove it to be sound and complete for subhedge projection. The earliest in-memory evaluators for Sha↓s with subhedge projection follow in Section 8. Streaming variants of our in-memory evaluators are derived systematically in Section 9. Section 10 discusses how to lift our algorithms from membership testing to monadic queries. Section 11 discusses our practical experiments. Section 12 discusses further related work on automata notions related to Shas and Sha↓s. Appendix A presents the reformulation of the earliest query-answering algorithm from [21] based on Sha↓s and with schema restrictions. Appendix B provides the soundness proof for the safe-no-change projection.
  • Publication Comments. The original version of this journal article was published at the FCT conference [24]. Compared to the previous publication, the following contributions are new:
    • The definition of subhedge irrelevant prefixes of nested words and the definition of completeness for subhedge projection (Section 3).
    • The addition of schema restrictions throughout the paper and the notion of schema completeness (Section 4.3).
    • The notion of completeness for subhedge projection (Section 5.2).
    • The soundness proof of the safe-no-change projection algorithm in Section 6.2, moved to Appendix B.
    • The congruence projection algorithm and the proof of its soundness and completeness for subhedge projection (Section 7).
    • A systematic method to add subhedge projection to an earliest Sha (Section 8).
    • A systematic method to reduce the correctness of streaming automata evaluators to the corresponding in-memory evaluators (Section 9).
    • A discussion of how to deal with monadic queries while exploiting the availability of schema restrictions (Section 10).
    • An implementation of congruence projection with experimental results for XPath evaluation on Xml streams (Section 11).
    • A longer discussion on related work (Section 12).
    • In Appendix A, we also show how to obtain earliest dSha↓s from dShas by adapting the compiler from [21] that maps dShas to earliest dNwas.

2. Preliminaries

Let A and B be sets. If A is a subset of B, then the complement of A in B is denoted by Ā = B \ A, keeping the set relative to which the complement is taken implicit. The domain of a binary relation r ⊆ A × B is dom(r) = { a ∈ A | ∃b ∈ B. (a, b) ∈ r }. A partial function f : A ⇀ B is a binary relation f ⊆ A × B that is functional. A total function f : A → B is a partial function f : A ⇀ B with dom(f) = A.
  • Words. Let ℕ be the set of natural numbers, including 0. Let the alphabet Σ be a set. The set of words over Σ is Σ* = ⋃_{n∈ℕ} Σⁿ. A word (a₁, …, aₙ) ∈ Σⁿ, where n ∈ ℕ, is written as a₁⋯aₙ. We denote the empty word of length 0 by ε ∈ Σ⁰, and by v₁ · v₂ ∈ Σ* the concatenation of two words v₁, v₂ ∈ Σ*.
If v = u · v′ · w is a word, then we call u a prefix of v, v′ a factor of v, and w a suffix of v. Given any subset L ⊆ Σ*, we denote the set of prefixes of words in L by prefs(L) and the set of suffixes of words in L by suffs(L).
For any subalphabet Σ′ ⊆ Σ, the projection of a word v ∈ Σ* to Σ′ is the word proj_{Σ′}(v) ∈ (Σ′)* obtained from v by removing all letters from Σ \ Σ′.
  • Hedges. Hedges are sequences of letters and trees ⟨h⟩ with a hedge h. More formally, a hedge h ∈ H_Σ has the following abstract syntax:
    h, h′ ∈ H_Σ ::= ε | a | ⟨h⟩ | h · h′  where a ∈ Σ
    We assume ε · h = h · ε = h and (h · h₁) · h₂ = h · (h₁ · h₂). Therefore, we consider any word in Σ* as a hedge in H_Σ, i.e., Σ* ∋ aab = a · a · b ∈ H_Σ. Any hedge can be represented by or stored as a graph. For the signature Σ = {list, item, A, …, Z}, the graph of an example hedge in H_Σ is shown in Figure 1. It encodes the sequence of Xml documents in Figure 2:
  • Nested Words. A nested word over Σ is a word over the alphabet Σ̂ = Σ ∪ {⟨, ⟩} in which all parentheses are well-nested. Intuitively, this means that for any opening parenthesis, there is a matching closing parenthesis and vice versa.
The set of nested words over Σ can be defined as the subset of words over Σ̂ that are linearizations of hedges, where the linearization function nw : H_Σ → (Σ ∪ {⟨, ⟩})* is defined as follows:
nw(ε) = ε,  nw(⟨h⟩) = ⟨ · nw(h) · ⟩,  nw(a) = a,  nw(h · h′) = nw(h) · nw(h′).
So, the set of all nested words over Σ is nw(H_Σ). Let hdg be the inverse of the injective function nw restricted to its image, i.e., the mapping from nested words to hedges with hdg(nw(h)) = h for all h ∈ H_Σ.
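Both nw and hdg are straightforward to implement once a concrete hedge representation is fixed. The sketch below assumes hedges are Python lists whose items are letters (strings) or subtrees (nested lists), and writes the parentheses as the tokens '<' and '>'; these representation choices are ours, not the paper's.

```python
def nw(h):
    """Linearize a hedge into a nested word (a list of tokens).
    A hedge is a list of letters (strings) and subtrees (nested lists)."""
    word = []
    for item in h:
        if isinstance(item, list):
            word.append('<')
            word.extend(nw(item))   # recurse into the subtree <h'>
            word.append('>')
        else:
            word.append(item)
    return word

def hdg(word):
    """Inverse of nw on its image: parse a nested word back into a hedge."""
    stack = [[]]
    for sym in word:
        if sym == '<':
            stack.append([])            # open a new subtree
        elif sym == '>':
            tree = stack.pop()          # close it and attach to the parent
            stack[-1].append(tree)
        else:
            stack[-1].append(sym)
    assert len(stack) == 1, "input is not well-nested"
    return stack[0]
```

On this representation, hdg(nw(h)) = h holds for every hedge h, as in the definition above.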
  • Nested Word Prefixes. Prefixes, suffixes, and factors of nested words may not be nested words themselves. For instance, the hedge h = a · ⟨b⟩ · c has the linearization nw(h) = a⟨b⟩c. Its prefix a⟨b is not well-nested since it has a dangling opening parenthesis. Its suffix ⟩c is also not well-nested since it has a dangling closing parenthesis.
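Whether a factor of a nested word is well-nested can be checked by counting dangling parentheses. A minimal sketch, assuming single-character letters and factors given as strings (`dangling` is a hypothetical helper):

```python
def dangling(factor):
    """Count dangling parentheses in a factor of a nested word:
    returns (dangling_opens, dangling_closes). The factor is well-nested
    if and only if both counts are 0."""
    opens, closes = 0, 0
    for sym in factor:
        if sym == '<':
            opens += 1
        elif sym == '>':
            if opens > 0:
                opens -= 1      # matches an earlier open parenthesis
            else:
                closes += 1     # dangling close
    return opens, closes
```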
Any algorithm that traverses a hedge h in a top-down and left-to-right manner inspects all the prefixes of nw(h). Any prefix v of nw(h) not only distinguishes a position of the hedge h but also specifies the part of the hedge that is located before this position in the linearization nw(h). An example is illustrated graphically in Figure 3. This holds in particular for streaming algorithms that receive the nested word nw(h) as an input on a stream, which may be inspected only once from left to right. But it holds equally for in-memory algorithms that receive the input hedge h as a hierarchical structure whose graph is stored in memory and then traverse the graph top-down and left-to-right. Any node will then be visited twice: once when reading an opening parenthesis ⟨ and going down into a subhedge, and a second time when going up to the closing parenthesis ⟩ after having processed the subhedge.

3. Problem

We formally define the pattern-matching problem for hedges and the notion of subhedge irrelevance, which allows us to characterize the parts of the input hedge that do not need to be visited during pattern matching.

3.1. Hedge Pattern Matching

We are interested in algorithms that match a hedge pattern against a hedge. We start with algorithms that test whether such matching exists. This is sometimes called filtering or, alternatively, Boolean query answering. In Section 10, we consider the more general problem of returning the set of matchings. This problem is sometimes called monadic query answering. Our experiments in Section 11 apply monadic query answering for regular XPath queries.
We start from hedge patterns that are nested regular expressions with letters in the signature Σ. For these, we fix a set V of recursion variables and assume that it is disjoint from Σ.
e, e′ ∈ nRegExp_Σ ::= ε | a | z | e · e′ | e + e′ | e* | ⟨e⟩ | μz.e  where a ∈ Σ, z ∈ V
Each nested regular expression e ∈ nRegExp_Σ is satisfied by a subset of hedges ⟦e⟧ ⊆ H_{Σ∪V}. We recall the definition of this set only informally. The expression ε is satisfied only by the empty word, an expression a ∈ Σ only by the hedge a, and an expression z ∈ V only by the one-letter hedge z. An expression e · e′ with the concatenation operator, where e, e′ ∈ nRegExp_Σ, is satisfied by the concatenations of hedges satisfying e and e′, respectively. An expression e* with Kleene's star operator, where e ∈ nRegExp_Σ, is satisfied by repetitions of hedges satisfying e. The Kleene star provides horizontal recursion on hedges. The μ-expression μz.e, where z ∈ V and e ∈ nRegExp_Σ, is satisfied by all hedges without the recursion variable z in which z is repeatedly replaced by a hedge satisfying e until it disappears. We adopt the usual restriction [28] that in any nested regular expression μz.e, all occurrences of z in e are guarded by a tree constructor ⟨·⟩. In this way, the μ-operator provides proper vertical recursion. Also note that, for defining sublanguages of H_Σ, we only need expressions in nRegExp_Σ in which all occurrences of recursion variables z are bound in the scope of a binder μz.
It is well known that nested regular expressions in nRegExp Σ can define all regular sublanguages of hedges in H Σ , i.e., they have the same expressiveness as hedge automata with alphabet Σ . Analogous results for regular tree expressions and tree automata are folklore [29].
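The informal semantics above can be made concrete by a brute-force matcher that enumerates splits of a hedge; this is only a sketch for small examples, not the paper's automata construction. Expressions are encoded as Python tuples, hedges as lists (letters as strings, subtrees as nested lists), and termination relies on the guardedness restriction on μ. The encodings TREE and DOC below anticipate Example 1, assuming the reconstruction doc = tree* with tree = μz.⟨(list + item) · z*⟩.

```python
def matches(h, e, env=None):
    """Brute-force test of h ∈ ⟦e⟧ for tuple-encoded nested regular
    expressions: ('eps',), ('let', a), ('var', z), ('cat', e1, e2),
    ('plus', e1, e2), ('star', e1), ('tree', e1), ('mu', z, e1)."""
    env = env if env is not None else {}
    tag = e[0]
    if tag == 'eps':
        return h == []
    if tag == 'let':
        return h == [e[1]]
    if tag == 'var':
        return matches(h, env[e[1]], env)       # unfold the bound mu
    if tag == 'cat':
        return any(matches(h[:i], e[1], env) and matches(h[i:], e[2], env)
                   for i in range(len(h) + 1))
    if tag == 'plus':
        return matches(h, e[1], env) or matches(h, e[2], env)
    if tag == 'star':
        return h == [] or any(matches(h[:i], e[1], env) and matches(h[i:], e, env)
                              for i in range(1, len(h) + 1))
    if tag == 'tree':
        return len(h) == 1 and isinstance(h[0], list) and matches(h[0], e[1], env)
    if tag == 'mu':
        return matches(h, e[2], {**env, e[1]: e})  # bind z to the whole mu
    raise ValueError(tag)

# Hypothetical encodings of tree and doc from Example 1:
TREE = ('mu', 'z', ('tree', ('cat', ('plus', ('let', 'list'), ('let', 'item')),
                             ('star', ('var', 'z')))))
DOC = ('star', TREE)
```

For instance, the hedge ⟨list · ⟨item⟩⟩ is a document, while the bare letter list is not.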
Example 1 (Nested regular expressions of regular XPath filters).
For instance, we can define subsets of hedge encodings of Xml documents with two kinds of elements in Σ = {list, item} by expressions in nRegExp_Σ. For this article, we will use an encoding of Xml documents that produces hedges satisfying the following regular expression doc ∈ nRegExp_Σ, where z ∈ V:
doc =def tree*  where  tree =def μz.⟨(list + item) · z*⟩
In our application, the encodings are a little richer in order to distinguish different types of Xml nodes such as elements, attributes, comments, and texts. An XPath filter such as [self::list/child::item] can then be expressed by the nested regular expression:
filter =def ⟨list · doc · ⟨item · doc⟩ · doc⟩ · doc
General compilers from regular downward XPath filter queries to nested regular expressions were proposed in [8]. An implementation was developed in the context of the AStream tool [21].
We will use schemas S ⊆ H_Σ to restrict the inputs of our pattern-matching problem. While we are mostly interested in regular schemas, we do not assume regularity globally, since safe-no-change projection does not rely on it. The pattern-matching problem for nested regular expressions with respect to a schema S ⊆ H_Σ receives the following inputs:
  • a nested regular expression e ∈ nRegExp_Σ and
  • a hedge h ∈ S.
It then returns the truth value of the judgment h ∈ ⟦e⟧. The input hedge h ∈ S may either be given by a graph that resides in memory or by a stream containing the linearization nw(h) of the input hedge, which can be read only once from left to right.

3.2. Irrelevant Subhedges

We define the concept of irrelevant occurrences of subhedges with respect to a given hedge pattern. What this means depends on the kind of algorithm that we will use for pattern matching. We will use algorithms that operate on the input hedge top-down, left-to-right, and bottom-up. This holds for streaming algorithms in particular.
Intuitively, when the pattern-matching algorithm reaches a node top-down or left-to-right whose subsequent subhedge is irrelevant, it can jump over it without inspecting its structure. What jumping means should be clear if the hedge is given by a graph that is stored in memory. Notice that the full graph is to be constructed beforehand, even though some parts of it may turn out to be irrelevant. Still, one may save a lot of time by jumping over irrelevant parts if either the graph was already constructed for other reasons or many pattern-matching problems are to be solved on the same hedge.
In the case of streams, the irrelevant subhedge still needs to be parsed, but it will not be analyzed otherwise. Most typically, the possible analysis is carried out by automata, which may take two orders of magnitude more time than needed for parsing. Therefore, not having to do any analysis may considerably speed up the streaming algorithm.
Definition 1. 
Let S ⊆ H_Σ be a schema and L ⊆ S a language satisfying this schema. We define diff_S^L as the least symmetric relation on prefixes of nested words u, u′ ∈ prefs(N_Σ) such that:
diff_S^L(u, u′) ⇐ ∃w ∈ suffs(N_Σ). u · w ∈ nw(L) ∧ u′ · w ∈ nw(S \ L).
A nested word prefix u is called subhedge relevant for L with schema S if there exist nested words v, v′ ∈ N_Σ such that diff_S^L(u · v, u · v′). Otherwise, the prefix u is called subhedge irrelevant for L with schema S.
So, diff_S^L(u, u′) states that u and u′ have continuations that behave differently with respect to L and S. Furthermore, a prefix u is subhedge irrelevant if language membership does not depend on hedge insertion at u, under the condition that schema membership is guaranteed. The case without schema restrictions is subsumed by the above definition by choosing S = H_Σ. In this case, the complement of diff_{H_Σ}^L is the Myhill–Nerode congruence of L, which is well known in formal language theory from the minimization of Dfas (see any standard textbook on formal language theory or Wikipedia (Myhill–Nerode theorem: https://en.wikipedia.org/wiki/Myhill-Nerode_theorem, accessed on 23 July 2024)).
This congruence serves as the basis for Gold-style learning of regular languages from positive and negative examples (see [30] or Wikipedia (Gold-style learning in the limit https://en.wikipedia.org/wiki/Language_identification_in_the_limit, accessed on 23 July 2024)).
Schema restrictions are needed for our application to regular XPath patterns.
Example 2 (Regular XPath filters).
Consider the schema S = ⟦doc⟧ for hedges representing Xml documents with signature Σ = {list, item}, and the regular hedge language L = ⟦filter⟧, where
filter = ⟨list · tree* · ⟨item · tree*⟩ · tree*⟩ · tree*
is the nested regular expression from Example 1 for the XPath filter [self::list/child::item]. Recall that this nested regular expression can be applied to hedges representing Xml documents only. The nested word prefix u = ⟨list · ⟨item is subhedge irrelevant for the language L with schema S. Note that its continuation ⟨list · ⟨item⟩⟩ with the suffix w = ⟩⟩ belongs to L. Hence, for any h ∈ H_Σ, if the continuation ⟨list · ⟨item · h⟩⟩ belongs to S = ⟦doc⟧, then it also belongs to L. Nevertheless, for the hedge h₁ = ⟨ε⟩, the continuation ⟨list · ⟨item · h₁⟩⟩ does not belong to L, since it does not satisfy the schema S = ⟦doc⟧.
The prefix ⟨item is also irrelevant for the language L, even independently of the schema. This is because L does not contain any continuation of this prefix to some hedge. The prefix ⟨list is not irrelevant for the language L wrt S. This can be seen with the suffix w = ⟩: the continuation ⟨list⟩ does not belong to L, while the continuation ⟨list · h⟩ with h = ⟨item⟩ does belong to L, and both continuations satisfy the schema S = ⟦doc⟧.
However, schema restrictions may also have surprising consequences that make the problem more tedious.
Example 3 (Surprising consequences of schema restrictions).
Consider the signature Σ = {a}, the pattern L = {⟨a⟩}, and the schema S = L ∪ {⟨⟩ · a}. The prefix ⟨ is indeed subhedge irrelevant for L with schema S. In order to see this, we consider all possible closing suffixes one by one. The smallest closing suffix is ⟩. Note that a hedge ⟨h⟩ belongs to L if and only if h = a. So, membership in L seems to depend on the subhedge h. However, also note that ⟨h⟩ belongs to S if and only if h = a. Therefore, when assuming that ⟨h⟩ belongs to the schema S, it must also belong to L. So, the pattern must match when assuming schema membership. The next closing suffix is ⟩ · a. Note that a hedge ⟨h⟩ · a belongs to S if and only if h = ε. However, ⟨h⟩ · a ∉ L, so the pattern will not match for any subhedge h with this suffix when assuming that the input hedge satisfies the schema. The situation is similar for larger suffixes, so in no case does language membership depend on the subhedge at prefix ⟨ when assuming that the full hedge satisfies the schema S.
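For finite languages of nested words, subhedge irrelevance can be decided by brute force, which lets one verify such examples mechanically. The sketch below uses hypothetical helpers on strings over single-character letters and '<', '>': it buckets schema continuations u · v · w by their closing suffix w, where v ranges over inserted nested words, and requires that membership in L never depends on v.

```python
def well_nested(v):
    """Check that the string v is a nested word over letters and '<', '>'."""
    depth = 0
    for sym in v:
        if sym == '<':
            depth += 1
        elif sym == '>':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def subhedge_irrelevant(u, L, S):
    """For finite languages of nested words L ⊆ S (as sets of strings):
    is the prefix u subhedge irrelevant for L with schema S?"""
    buckets = {}
    for s in S:
        if s.startswith(u):
            rest = s[len(u):]
            # all decompositions rest = v · w with v a nested word
            for i in range(len(rest) + 1):
                v, w = rest[:i], rest[i:]
                if well_nested(v):
                    buckets.setdefault(w, set()).add(s in L)
    # irrelevant iff, for each suffix w, membership never depends on v
    return all(len(deps) == 1 for deps in buckets.values())
```

On Example 3 (with a, ⟨, ⟩ written as 'a', '<', '>'), the prefix ⟨ comes out irrelevant while the empty prefix does not.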

3.3. Basic Properties

We first show that subhedge irrelevant prefixes remain irrelevant when extended by the nested word of any hedge. This property should be expected for any reasonable notion of subhedge irrelevance.
Lemma 1. 
For any nested word prefix u, languages L, S ⊆ H_Σ, and nested word v ∈ N_Σ: if the prefix u is subhedge irrelevant for L with schema S, then so is the prefix u · v.
Proof. 
Let u be subhedge irrelevant for L with schema S and v ∈ N_Σ a nested word. We fix arbitrary nested words v′, v″ ∈ N_Σ and a nested word suffix w ∈ suffs(N_Σ) such that u · v · v′ · w ∈ nw(S) and u · v · v″ · w ∈ nw(S). Since v · v′ and v · v″ are also nested words, the subhedge irrelevance of u yields the following:
u · v · v′ · w ∈ nw(L) ⇔ u · v · v″ · w ∈ nw(L).
Hence, the nested word prefix u · v is subhedge irrelevant for L with schema S.   □
Definition 2. 
We call a binary relation D on prefixes of nested words a difference relation on prefixes if for all prefixes of nested words u, u′ ∈ prefs(N_Σ) and nested words v ∈ N_Σ:
(u · v, u′ · v) ∈ D ⇒ (u, u′) ∈ D.
Lemma 2. 
diff_S^L is a difference relation.
Proof. 
We must show for all prefixes u, u′ ∈ prefs(N_Σ) and nested words v ∈ N_Σ that (u · v, u′ · v) ∈ diff_S^L implies (u, u′) ∈ diff_S^L. So, let u, u′ be prefixes of nested words and v a nested word such that (u · v, u′ · v) ∈ diff_S^L. Then, up to symmetry, there exists a nested word suffix w such that
u · v · w ∈ nw(L) ∧ u′ · v · w ∈ nw(S \ L).
The suffix w̃ = v · w then satisfies u · w̃ ∈ nw(L) ∧ u′ · w̃ ∈ nw(S \ L). Hence, (u, u′) ∈ diff_S^L as required.   □
In the schema-free case of words, the complement of diff_{H_Σ}^L is an equivalence relation known as the Myhill–Nerode congruence. In the case of schemas, however, the complement of diff_S^L may not be transitive; thus, it may not even be an equivalence relation. This might be surprising at first sight.
Example 4. 
Let Σ = {a, b}, L = {ε, b, aa}, and S = L ∪ {aab}. Then, the pair (ε, a) belongs to the complement of diff_S^L since there is no continuation that extends both ε and a into the schema. For the same reason, (a, aa) belongs to the complement of diff_S^L. However, diff_S^L(ε, aa) holds since ε · b ∈ L and aa · b ∈ S \ L. Thus, (ε, aa) does not belong to the complement of diff_S^L, so the complement of diff_S^L is not transitive.
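The claims of this example can be checked by brute force for finite word languages; `diff` below is a hypothetical helper that searches for a separating suffix w among the suffixes of schema words (which suffices, since both u · w and u′ · w must lie in S).

```python
def diff(u, v, L, S):
    """diff_S^L(u, v) for finite word languages given as sets of strings:
    is there a suffix w with u·w in L and v·w in S\\L, or symmetrically?"""
    suffixes = {s[i:] for s in S for i in range(len(s) + 1)}
    SL = S - L  # the words in the schema but not in the language
    return any((u + w in L and v + w in SL) or
               (v + w in L and u + w in SL)
               for w in suffixes)
```

With L = {ε, b, aa} and S = L ∪ {aab}, the pairs (ε, a) and (a, aa) are not separated, but (ε, aa) is, exhibiting the non-transitivity of the complement.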
It should be noted that, for all languages L, the complement of diff_{H_Σ}^L is reflexive and transitive, so it is an equivalence relation and even a congruence. For general schemas S, the complement of diff_S^L may not be transitive, as shown by Example 4, and may thus not be a congruence.
This indicates that schemas may make it more difficult to decide subhedge irrelevance. Indeed, the natural manner in which one might expect to eliminate schemas based on complements does not work, as confirmed by the following lemma (in particular, by the counter-example in the proof that property 1 does not imply property 2):
Lemma 3. 
For any nested word prefix u ∈ prefs(N_Σ) and languages L, S ⊆ H_Σ, consider the following two properties:
1. 
u is irrelevant for L with schema S .
2. 
u is irrelevant for L S ¯ with schema H Σ .
Then, property 2. implies property 1., but not always vice versa.
Proof. 
1. ⇏ 2.:
Here, a counter-example is presented. Let L = {ε} and S = {ε, a}. Then, prefix a is irrelevant for L with schema S. However, prefix a is relevant for L ∪ S̄ = H_Σ∖{a} with schema H_Σ, since for the nested words v = a and v′ = ε we have a·v ∈ L ∪ S̄ and a·v′ ∉ L ∪ S̄.
1. ⇐ 2.:
Next, we assume property 2. Consider arbitrary nested words v, v′ ∈ N_Σ and a nested word suffix w ∈ suffs(N_Σ). It is sufficient to show that u·v·w ∈ S and u·v′·w ∈ S implies u·v·w ∈ L ⟺ u·v′·w ∈ L. This implication holds trivially if u·v·w ∈ S̄ or u·v′·w ∈ S̄. Otherwise, we have u·v·w ∈ S and u·v′·w ∈ S. Property 2., as assumed, gives u·v·w ∈ L ∪ S̄ ⟺ u·v′·w ∈ L ∪ S̄, which in this case means u·v·w ∈ L ⟺ u·v′·w ∈ L. So, property 1. holds.
   □

4. Hedge Automata

We recall the notion of stepwise hedge automata (Shas), introduce their downward extension (Sha↓s), discuss schema-completeness for both kinds of automata, define subhedge projection states for Sha↓s, and show how to use them for evaluating Sha↓s with subhedge projection. We limit ourselves to in-memory evaluation in this section but will discuss streaming evaluation in Section 9. The relation to several closely related notions of automata on hedges, forests, or nested words is discussed in more detail in Section 12.

4.1. Stepwise Hedge Automata

Shas are a recent notion of automata for hedges [8] that combine bottom-up tree automata and left-to-right word automata in a natural manner. They extend stepwise tree automata [31] in that they operate on hedges rather than unranked trees, i.e., on sequences of letters and trees containing hedges. They differ from nested word automata (Nwas) [11,15] in that they process hedges directly rather than their linearizations into nested words.
Definition 3. 
A stepwise hedge automaton (Sha) is a tuple A = (Σ, Q, Δ, I, F) where Σ and Q are finite sets, I, F ⊆ Q, and Δ = ((aΔ)_{a∈Σ}, ⟨⟩Δ, @Δ), where aΔ ⊆ Q × Q, ⟨⟩Δ ⊆ Q, and @Δ ⊆ (Q × Q) × Q. A Sha is deterministic, or equivalently a dSha, if I and ⟨⟩Δ contain at most one element, and all relations (aΔ)_{a∈Σ} and @Δ are partial functions.
The set of states Q subsumes a subset I of initial states, a subset F of final states, and a subset ⟨⟩Δ of tree initial states. The transition rules in Δ have three forms: if (q, q′) ∈ aΔ, then we have a letter rule that we write as q –a→ q′ in Δ. If (q, p, q′) ∈ @Δ, then we have an apply rule that we write as q @ p → q′ in Δ. Lastly, if q ∈ ⟨⟩Δ ⊆ Q, then we have a tree initial rule that we denote as ⟨⟩ → q in Δ.
In Figure 4, we illustrate the graph of a dSha for the XPath filter [ self : : list / child : : item ] . The graph of a Sha is obtained and drawn as usual for finite state automata on words. The states are the nodes of the graph; in the example, Q = {0, 1, 2, 3, 4, 5}. The transition rules define the edges. The edge of a letter rule q –a→ q′ in Δ is annotated by the letter a ∈ Σ. In the example, Σ = {list, item}. Any apply rule q @ p → q′ in Δ is drawn as an edge from q to q′ annotated by the state p ∈ Q. The initial state 0 ∈ I is indicated by an ingoing gray arrow, and the tree initial state 0 ∈ ⟨⟩Δ by an ingoing gray arrow that is annotated by the symbol ⟨⟩. Final states, such as 4 ∈ F, are indicated by a double circle.

4.1.1. Semantics

We next define the semantics of Shas, i.e., the transitions on hedges that they define and the hedge language that they accept. For any hedge h ∈ H_Σ, we define the transition steps q –h→ q′ wrt Δ such that for all q, q′, p, p′ ∈ Q, a ∈ Σ, and h, h′ ∈ H_Σ:
q –a→ q′ wrt Δ if q –a→ q′ in Δ,        q –h·h′→ q″ wrt Δ if q –h→ q′ wrt Δ and q′ –h′→ q″ wrt Δ,
q –ε→ q wrt Δ for all q ∈ Q,            q –⟨h⟩→ q′ wrt Δ if ⟨⟩ → p in Δ, p –h→ p′ wrt Δ, and q @ p′ → q′ in Δ.
The transitions can be used to evaluate hedges nondeterministically bottom-up and left-to-right using Shas. The first three rules state how to evaluate sequences of trees and letters, as with a usual finite state automaton, while assuming that the trees were already evaluated to states. The last rule states how to evaluate a tree ⟨h⟩ from some state q to some state q′. This can be visualized as follows:
Algorithms 17 00339 i001
For any tree initial state p ∈ ⟨⟩Δ, one has to evaluate the subhedge h nondeterministically to some state p′. For each p′ obtained, one then returns to a state q′ such that q @ p′ → q′ in Δ, nondeterministically.
A hedge is accepted by A if it permits a transition from some initial state to some final state. The language L ( A ) is the set of all accepted hedges:
L(A) = { h ∈ H_Σ | q –h→ q′ wrt Δ, q ∈ I, q′ ∈ F }.
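These transition rules can be turned directly into a small evaluator. The following is a minimal sketch, not the authors' implementation; the hedge encoding (strings for letters, nested lists for trees) and the concrete rule set, read off the run in Figure 5, are our assumptions.

```python
# Sketch of the Sha transition relation: hedges are Python lists whose
# items are letters (strings) or trees (sublists containing a subhedge).

def steps(delta, q, hedge):
    """All states q2 such that q -hedge-> q2 wrt delta (nondeterministic)."""
    states = {q}
    for item in hedge:
        nxt = set()
        for s in states:
            if isinstance(item, str):                 # letter rule s -a-> s2
                nxt |= delta["letter"].get((s, item), set())
            else:                                     # tree <h2>: evaluate h2
                for p in delta["tree_init"]:          # from a tree initial state
                    for p2 in steps(delta, p, item):
                        nxt |= delta["apply"].get((s, p2), set())  # s @ p2 -> q2
        states = nxt
    return states

def accepts(A, hedge):
    return any(q in A["F"]
               for q0 in A["I"]
               for q in steps(A["delta"], q0, hedge))

# dSha for the XPath filter [self::list/child::item]; rules read off Figure 5.
A = {"I": {0}, "F": {4},
     "delta": {"letter": {(0, "list"): {1}, (0, "item"): {2}},
               "apply": {(1, 2): {3}, (0, 3): {4}},
               "tree_init": {0}}}

print(accepts(A, [["list", ["item"]]]))   # True: run 0.<0.list.1.<0.item.2>.3>.4
```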

4.1.2. Runs

Runs can represent the whole history of a single choice of the nondeterministic evaluator on a hedge. For instance, the unique successful run of the dSha in Figure 4 on the hedge h = ⟨h′⟩ with subhedge h′ = list·⟨item⟩ is the hedge 0·⟨0·list·1·⟨0·item·2⟩·3⟩·4 whose computation is illustrated in Figure 5. On the top level, the run justifies the transition 0 –h→ 4 wrt Δ from the initial state 0 to the final state 4. When started in state 0, the subhedge h′ needs to be evaluated first, so the run on the upper level needs to be suspended until this is done. The subrun on h′ eventually justifies the transition 0 –h′→ 3 wrt Δ. At this time point, the run of the upper level can be resumed and continue in state 4 since 0 @ 3 → 4 in Δ.
When drawing runs, we illustrate any suspension/resumption mechanism with a box, for instance, the box on the edge 0 Algorithms 17 00339 i029 4. The box stands for a future value computed by the lower level, causing the upper level to suspend until its arrival. Eventually, the box is filled by state 3, as illustrated by 3 Algorithms 17 00339 i030, so that the computation can be resumed on the upper level.
We next define runs of Shas on hedges formally. Whether a hedge with letters and states R ∈ H_{Σ∪Q} is a Δ-run, or simply a run, on a hedge h ∈ H_Σ, written R ∈ run_Δ(h), is defined by the following rules:
q·a·q′ ∈ run_Δ(a) if q –a→ q′ in Δ,     q·R·q′·R′·q″ ∈ run_Δ(h·h′) if q·R·q′ ∈ run_Δ(h) and q′·R′·q″ ∈ run_Δ(h′),
q ∈ run_Δ(ε) for all q ∈ Q,             q·⟨p·R·p′⟩·q′ ∈ run_Δ(⟨h⟩) if ⟨⟩ → p in Δ, p·R·p′ ∈ run_Δ(h), and q @ p′ → q′ in Δ.
Note that if R ∈ run_Δ(h), then h can be obtained by removing all states from R, i.e., proj_Σ(R) = h. Any run R ∈ run_Δ(h) can be identified with a mapping of the prefixes of the nested word nw(h) to states in Q. Note that nw(h) = proj_Σ(nw(R)). For any prefix v of nw(h), there exists a unique prefix r of the nested word nw(R) that ends with some state q ∈ Q and satisfies proj_Σ(r) = v. We then call q the state of R at prefix v. The run 0·⟨0·list·1·⟨0·item·2⟩·3⟩·4 on hedge ⟨list·⟨item⟩⟩ from Figure 5, for instance, assigns state 0 to the nested word prefixes ε and ⟨, and state 1 to the nested word prefix ⟨·list, etc.
A Δ -run is called a run of A = ( Σ , Q , Δ , I , F ) if it starts with some state in I. A run of A is successful if it ends with some state in F.
Lemma 4. 
h L ( A ) iff there exists a successful run R run Δ ( h ) of A.
Proof. 
Straightforward with induction on h.   □

4.1.3. Determinization

Shas can always be determinized using the subset construction known from the determinization of finite state automata on words or of tree automata.
Given a Sha  A = ( Σ , Q , Δ , I , F ) , the state set of its determinization is Q det = 2 Q . The transition rules Δ det of the determinization are presented in Figure 6.

4.1.4. In-Memory Membership Testing

Testing membership h ∈ L(A) for a hedge h, whose graph is stored in memory, can be carried out by computing the unique run of the determinization of A in a bottom-up and left-to-right manner and testing whether it is successful. Alternatively, one can compute the unique set P ⊆ Q such that I –h→ P with respect to the determinization of A and test whether P ∩ F ≠ ∅. The computations are essentially the same, but the full history of the computation is lost when returning only the set P of states reached rather than the run.
For determinizing Shas we can rely on a variant of the usual subset construction that is well-known from finite state automata on words or standard tree automata (see e.g., [8,16]). When it comes to membership testing, on-the-fly determinization is sufficient, which creates only the part of the deterministic automaton that is used by the run on h. This part is of linear size O ( | h | ) while the full determinization of A may be of exponential size O ( 2 | A | ) . In this way, membership h L ( A ) can be tested in time O ( | h | | A | ) .
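On-the-fly determinization for membership testing can be sketched as follows: only the subsets of states actually reached on h are ever computed, so the exponential determinization is never materialized. The hedge encoding and the example rule set are our assumptions.

```python
# Membership testing with on-the-fly determinization (subset construction):
# only the subsets of states actually visited on h are computed.

def det_steps(delta, P, hedge):
    """The unique subset P2 with P -hedge-> P2 in the determinization."""
    for item in hedge:
        if isinstance(item, str):                     # determinized letter step
            P = {q2 for q in P for q2 in delta["letter"].get((q, item), ())}
        else:                                         # determinized apply step
            P2 = det_steps(delta, set(delta["tree_init"]), item)
            P = {q2 for q in P for p in P2
                 for q2 in delta["apply"].get((q, p), ())}
    return P

def member(A, hedge):
    P = det_steps(A["delta"], set(A["I"]), hedge)
    return bool(P & A["F"])

# dSha for the XPath filter [self::list/child::item]; rules read off Figure 5.
A = {"I": {0}, "F": {4},
     "delta": {"letter": {(0, "list"): {1}, (0, "item"): {2}},
               "apply": {(1, 2): {3}, (0, 3): {4}},
               "tree_init": {0}}}

print(member(A, [["list", ["item"]]]))    # True
```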

4.1.5. Hedge Accessibility

For any set P Q we define the set of states accessible from P via a hedge as follows:
acc_Δ(P) = { q′ ∈ Q | ∃q ∈ P. ∃h ∈ H_Σ. q –h→ q′ wrt Δ }.
We note that acc_Δ(P) can be computed from Δ and P in linear time in the size of Δ. A state is hedge accessible from P if and only if it is accessible from P in the graph of Δ. Alternatively, it can be computed as the least fixed point of the following ground Datalog program, which depends on P and Δ and is of linear size in A:
q ∈ P ⊢ acc(q).        q –a→ q′ in Δ ⊢ acc(q′) :- acc(q).        q @ p → q′ in Δ ⊢ acc(q′) :- acc(q), acc(p).
Clearly, L(A) = ∅ if and only if F ∩ acc_Δ(I) = ∅. Therefore, emptiness can be decided in linear time for Shas. This is in contrast to the more general notion of Sha↓s that we introduce next.
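One way to compute acc_Δ(P) is a fixpoint computation in the spirit of the Datalog program above; since the child states p of apply rules arise from evaluating subhedges, the sketch below additionally computes the states reachable from the tree initial states. This is a sketch under our own encoding assumptions, with the rules read off the dSha of Figure 4.

```python
def closure(delta, seed, sub):
    """Least superset of seed closed under letter rules and under apply
    rules q @ p -> q2 whose child state p lies in sub."""
    reached, changed = set(seed), True
    while changed:
        changed = False
        for (q, a), qs in delta["letter"].items():
            if q in reached and not qs <= reached:
                reached |= qs; changed = True
        for (q, p), qs in delta["apply"].items():
            if q in reached and p in sub and not qs <= reached:
                reached |= qs; changed = True
    return reached

def acc(delta, P):
    """Hedge accessibility acc_delta(P): states reachable from P over a hedge."""
    sub = set()                       # states reachable from a tree initial
    while True:                       # state, i.e., possible subhedge results
        new = closure(delta, delta["tree_init"], sub)
        if new == sub:
            break
        sub = new
    return closure(delta, P, sub)

delta = {"letter": {(0, "list"): {1}, (0, "item"): {2}},
         "apply": {(1, 2): {3}, (0, 3): {4}},
         "tree_init": {0}}

print(sorted(acc(delta, {0})))   # [0, 1, 2, 3, 4]: the final state 4 is accessible
```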

4.2. Downward Stepwise Hedge Automata

The evaluation of Shas operates in a bottom-up and left-to-right manner but cannot pass any information top-down. We next propose an extension of Shas to Sha↓s, adding the ability to pass finite state information top-down.
Sha↓s are basically equal to Neumann and Seidl's pushdown forest automata [25], but lifted from (labeled) forests to (unlabeled) hedges. See Section 12 for a more detailed discussion of the relationship of Sha↓s to other notions of automata, and to Nwas in particular.
Definition 4. 
A downward stepwise hedge automaton (Sha↓) is a tuple A = (Σ, Q, Δ, I, F), where Σ and Q are finite sets, I, F ⊆ Q, and Δ = ((aΔ)_{a∈Σ}, ⟨⟩Δ, @Δ). Furthermore, aΔ ⊆ Q × Q, ⟨⟩Δ ⊆ Q × Q, and @Δ ⊆ (Q × Q) × Q. A Sha↓ is deterministic, or equivalently a dSha↓, if I contains at most one element, and all relations ⟨⟩Δ, aΔ, and @Δ are partial functions.
The only difference compared to Shas is the form of the tree-opening rules. If (q, q′) ∈ ⟨⟩Δ, then we have a tree initial rule that we denote as q ⟨⟩→ q′ in Δ.
So, the state q′, in which the evaluation of a subhedge starts, depends on the state q of the parent.

4.2.1. Semantics

The definition of transition steps for Sha↓s differs from that for Shas only in the usage of the opening rules in the following inference rule:
q –⟨h⟩→ q′ wrt Δ if q ⟨⟩→ p in Δ, p –h→ p′ wrt Δ, and q @ p′ → q′ in Δ.
This means that the evaluation of the subhedge h starts in a state p such that q ⟨⟩→ p in Δ.
Algorithms 17 00339 i002
So, the restart state p that is chosen may now depend on the state q above. This is how finite state information is passed top-down by Sha↓s. Shas, in contrast, operate purely bottom-up and left-to-right.
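The downward evaluator can be sketched with one change to the Sha case: the state in which a subhedge is evaluated is looked up from the parent state. The encoding and the example automaton (a downward variant of the dSha of Figure 4, in which every state opens into state 0) are our assumptions.

```python
# Sketch of Sha(down) transitions: the state p in which a subhedge is
# evaluated is chosen by an opening rule q <> -> p depending on the parent q.

def steps_down(delta, q, hedge):
    """All states q2 with q -hedge-> q2 wrt a downward delta."""
    states = {q}
    for item in hedge:
        nxt = set()
        for s in states:
            if isinstance(item, str):
                nxt |= delta["letter"].get((s, item), set())
            else:
                for p in delta["open"].get(s, set()):     # s <> -> p in delta
                    for p2 in steps_down(delta, p, item):
                        nxt |= delta["apply"].get((s, p2), set())
        states = nxt
    return states

# Downward variant of the example dSha: every state opens into state 0.
delta = {"letter": {(0, "list"): {1}, (0, "item"): {2}},
         "apply": {(1, 2): {3}, (0, 3): {4}},
         "open": {q: {0} for q in range(5)}}

print(steps_down(delta, 0, [["list", ["item"]]]))   # {4}
```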

4.2.2. Runs

The notion of runs can be adapted straightforwardly from Shas to Sha↓s. When in state q, the computation in a subhedge is restarted in the state p such that q ⟨⟩→ p in Δ. In this way, finite state information is passed down (while for Shas, some tree initial state is chosen independently of q). The only rule of runs to be changed is the following:
q·⟨p·R·p′⟩·q′ ∈ run_Δ(⟨h⟩) if q ⟨⟩→ p in Δ, p·R·p′ ∈ run_Δ(h), and q @ p′ → q′ in Δ.
An example of a run of the dSha↓ in Figure 7 is shown in Figure 8. Here, the information on the level of the nodes is passed down. It is represented by the number of primes. For instance, when opening the first subtree labeled with item, we use the transition rule 1 ⟨⟩→ 0′, stating that one moves from state 1 on the first level to state 0′ on the second level.
One can see that all nodes of the subtrees h_1 and h_2 are evaluated to the projection state Π. So, one can jump over these subtrees without reading them. This is justified by the fact that they are on level 2, which is exactly the information that the evaluator of this Sha↓ passes down.

4.2.3. In-Memory Membership Testing

Testing membership h ∈ L(A) for a Sha↓ in memory can again be carried out by computing the unique run of the determinization of A. As before, the membership tester can be based on on-the-fly determinization. However, the determinization procedure for Sha↓s is more complicated than for Shas, since Sha↓s can pass information top-down and not only bottom-up and left-to-right. Determinization not only relies on the pure subset construction but also uses pairs of states in the subsets, basically in the same way as for nested word automata (Nwas) [14,15]. This is needed to deal with the stack of states that were seen above. Therefore, each determinization step may cost time O(|A|^5), as stated, for instance, in [23]. The upper time bound for membership testing for Sha↓s is thus O(|h| |A|^5) and no longer combined linear as it is for Shas.

4.2.4. Relationship to Shas

Any Sha A = (Σ, Q, Δ, I, F) can be mapped to a Sha↓ A_down = (Σ, Q, Δ_down, I, F) while preserving runs and determinism. The only change of Δ_down compared to Δ is described by the following rule:
q ⟨⟩→ p in Δ_down if ⟨⟩ → p in Δ and q ∈ Q.
Independently of the current state q ∈ Q, the Sha↓ A_down can start the evaluation of the subhedge of a subtree in any tree initial state p ∈ ⟨⟩Δ. For instance, the dSha↓ A_down for the dSha A from Figure 4 is provided in Figure 9. Conversely, one can convert any dSha↓ A into an equivalent Sha by introducing nondeterminism to eliminate top-down processing. See Section 12.2 for the construction.

4.2.5. Hedge Accessibility

Hedge accessibility can be defined for Sha↓s identically as for Shas. However, the set acc_Δ(P) can no longer be computed in linear time in the size of Δ, in contrast to the case of Shas. It can still be done in cubic time, similarly as for Nwas [22]. The problem is that hedge accessibility for Sha↓s no longer coincides with the accessibility notion of the automaton's graph.
For instance, in the Sha↓ from Figure 7, we have acc_Δ({0}) = {0, 4}. Note that 0′ does not belong to this set, even though there is a transition rule 0 ⟨⟩→ 0′ in Δ, making node 0′ accessible from node 0 in the graph of the automaton. This is because 0′ is not accessible over any hedge from 0; that is, there is no hedge h such that 0 –h→ 0′ wrt Δ. Note that 0·⟨·0′ is a factor of the nested word of some run of Δ. This shows that 0′ is accessible from 0 over a nested word prefix nevertheless. For any Sha↓ A, hedge accessibility can be computed by the following ground Datalog program of cubic size in the number of states of A:
acc(p, q′) :- acc(p′, q) if p @ q → q′ in Δ and p ⟨⟩→ p′ in Δ.        acc(p, p) for all p ∈ Q.        acc(p, q) if p –a→ q in Δ.
acc(p, q) :- acc(p, r), acc(r, q) for all p, q, r ∈ Q.
Due to the cubic time of hedge accessibility for Sha↓s, we will use hedge accessibility only for Shas, where it requires only linear time.
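The ground Datalog program can be emulated by a naive fixpoint over pairs of states, as in the following sketch; the encoding is our own, and the example is again a downward variant of the dSha of Figure 4 in which every state opens into state 0.

```python
def acc_pairs(delta, states):
    """acc(p, q): q is hedge-accessible from p in a downward automaton,
    computed as a least fixpoint mirroring the ground Datalog program."""
    acc = {(p, p) for p in states}                     # empty hedge
    for (p, a), qs in delta["letter"].items():         # single letter
        acc |= {(p, q) for q in qs}
    changed = True
    while changed:
        changed = False
        new = set(acc)
        for (p, q), qs in delta["apply"].items():      # p @ q -> q2 ...
            for p2 in delta["open"].get(p, ()):        # ... with p <> -> p2
                if (p2, q) in acc:                     # acc(p,q2) :- acc(p2,q)
                    new |= {(p, q2) for q2 in qs}
        new |= {(p, q) for (p, r) in acc               # transitivity
                for (r2, q) in acc if r2 == r}
        if new != acc:
            acc = new; changed = True
    return acc

delta = {"letter": {(0, "list"): {1}, (0, "item"): {2}},
         "apply": {(1, 2): {3}, (0, 3): {4}},
         "open": {q: {0} for q in range(5)}}

pairs = acc_pairs(delta, range(5))
print((0, 4) in pairs)   # True: state 4 is hedge-accessible from state 0
```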

4.3. Schema Completeness

Schemas are inputs of the membership problem that we consider, while completeness is a fundamental property of automata. To combine both aspects appropriately, we propose the notion of schema-completeness, which we define uniformly for both Shas and Sha↓s.
We call a set Δ of transition rules of a Sha↓ complete if all its relations (aΔ)_{a∈Σ}, ⟨⟩Δ, and @Δ are total. For Shas, we require instead that ⟨⟩Δ is a nonempty set (rather than a total relation). A Sha or Sha↓ A is called complete if Δ is complete and I ≠ ∅. We can complete any Sha or Sha↓ by adding a fresh sink state together with transition rules into the sink state for all left-hand sides for which no other transition rule exists. The sink state is made initial if and only if there was no initial state before. It is not final, though. We denote the completion of A by compl(A).
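The completion compl(A) can be sketched as follows for a Sha; the dictionary encoding is our assumption. Missing letter and apply transitions are directed into a fresh sink state, which becomes initial (or tree initial) only if the respective set was empty and is never final.

```python
def compl(A, sigma):
    """compl(A) for a Sha: total transition relations via a fresh sink state."""
    sink = "sink"
    states = set(A["states"]) | {sink}
    letter = {k: set(v) for k, v in A["delta"]["letter"].items()}
    apply_ = {k: set(v) for k, v in A["delta"]["apply"].items()}
    for q in states:
        for a in sigma:
            letter.setdefault((q, a), {sink})     # missing letter rules -> sink
        for p in states:
            apply_.setdefault((q, p), {sink})     # missing apply rules -> sink
    return {"states": states,
            "I": set(A["I"]) or {sink},           # sink initial iff I was empty
            "F": set(A["F"]),                     # the sink is never final
            "delta": {"letter": letter, "apply": apply_,
                      "tree_init": set(A["delta"]["tree_init"]) or {sink}}}

# Partial dSha in the spirit of Figure 4 (rules are our assumption).
A = {"states": {0, 1, 2, 3, 4}, "I": {0}, "F": {4},
     "delta": {"letter": {(0, "list"): {1}, (0, "item"): {2}},
               "apply": {(1, 2): {3}, (0, 3): {4}},
               "tree_init": {0}}}

C = compl(A, {"list", "item"})
print(C["delta"]["letter"][(3, "list")])   # {'sink'}: added by completion
print(C["delta"]["letter"][(0, "list")])   # {1}: existing rules are kept
```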
Definition 5. 
A partial run of A on a hedge h is a prefix r of a run of c o m p l ( A ) on h, such that
  • r ends with some state of Q , and
  • r does not contain the sink state of c o m p l ( A ) .
A partial run r of A on h is called blocking if there does not exist any partial run r of A on h, such that r is a strict prefix of r .
For instance, consider the hedge h = ⟨list·⟨item⟩⟩. The dSha in Figure 4 has the unique run R = 0·⟨0·list·1·⟨0·item·2⟩·3⟩·4 on h that is illustrated in Figure 5. The partial runs on h are thus all prefixes of nw(R) that end with some state, i.e., 0, then 0·⟨·0, then 0·⟨·0·list·1, etc. None of these partial runs is blocking.
Definition 6. 
A schema for an automaton A is a hedge language S H Σ , such that L ( A ) S . We call the automaton A schema-complete for schema S if no hedge h S has a blocking partial run of A.
Schemas S are usually used to restrict the type of hedges that a problem may accept as input. The dSha pattern-matching problem for a schema S takes two inputs: a hedge h ∈ S and an automaton A, rather than a nested regular expression e. The automaton A then selects those input hedges h ∈ S that match the pattern, i.e., such that h ∈ L(A). For this reason, we assume that S is a schema for A, i.e., L(A) ⊆ S. We will often assume that A is schema-complete for S, so that no partial run of A on any input hedge h ∈ S may ever block. This can always be achieved, based on Lemma 5 below.
Example 5. 
The dSha in Figure 4 for the XPath filter [ self : : list / child : : item ] has the signature Σ = {item, list}. It is schema-complete for the schema doc. To make this happen, we added state 5 to this automaton, which is not used by any successful run. Still, this Sha is not complete. For completing it, we would have to turn 5 into a sink state by adding many transitions into it, such as 0 @ 4 → 5 and 0 @ 5 → 5, and also add, for all other states q ∈ {1, 2, 3, 4, 5}, the transition rules q –list→ 5, q –item→ 5, q @ 4 → 5, and q @ 5 → 5.
As we will see, automaton completion would make the safe-no-change algorithm incomplete for subhedge projection for Example 5. Without schema-completeness, however, safe-no-change projection may be unsound: it might overlook the rejection of hedges inside the schema. This is why it is much better to assume only schema-completeness rather than full completeness for the input automata of safe-no-change projection. For congruence projection, schema-completeness will be assumed implicitly.
We note that schema-completeness is well-defined even for nonregular schemas. For safe-no-change projection, we indeed do not have to restrict schemas to be regular, while for congruence projection, only regular schemas are considered.
Furthermore, if A is schema-complete for the universal schema S = H Σ and does not have inaccessible states, then A is complete.
Schema-completeness of deterministic automata for regular schemas can always be made to hold, as we show next. For this, we define that A is compatible with schema S if for any hedge h ∈ S there exists a run of A in run_Δ(h). Schema-completeness implies compatibility, as shown by the next lemma. The converse is true only when assuming determinism.
Lemma 5. 
Let A be a deterministic  Sha  with schema S . Then, A is compatible with S iff A is schema-complete for S .
Proof. 
Suppose that A is schema-complete for S. Wlog., we can assume S ≠ ∅. Note that if I = ∅, then ε would be a blocking partial run for any hedge in S ∖ {ε}. Since S ≠ ∅, it follows that there exists q_0 ∈ I. So, q_0 is a partial run for any hedge h ∈ S ∖ {ε}. Since there exist no blocking partial runs by schema-completeness, this run can be extended step by step to a run on any h ∈ S. Therefore, for any hedge h ∈ S, there exists some run of A, showing compatibility.
For the converse, let A be compatible with S and deterministic. We have to show that A is schema-complete for S . Let v be a partial run of A on h S . By compatibility, there exists a run R run Δ ( h ) that starts in some initial state of A. By determinism, v must be a prefix of nw ( R ) . Thus, v is not blocking.   □
Proposition 1. 
Any  Sha  A with regular schema S can be made schema-complete for S .
Proof. 
By Lemma 5, it is sufficient to make A compatible with S. Let B be a dSha with S = L(B). Automaton B is compatible with S. Since B is deterministic, Lemma 5 implies that B is schema-complete for S. Note that S is a schema for A since L(A) ⊆ S = L(B). We then compute the completion compl(A). The product C = B × compl(A) with the set of final states F_C = F_B × F_A is schema-complete for schema S. Furthermore, since L(A) ⊆ S = L(B), it follows that L(C) = L(B) ∩ L(A) = L(A).   □
Finally, note that the schema S can be recognized by the same product C = B × c o m p l ( A ) except for replacing the set of final states F C by the set of schema final states F S = F B × ( Q A { sink } ) . We have S = L ( C [ F C / F S ] ) , where C [ F C / F S ] is the dSha obtained from C by replacing the set of final states F C with F S .

5. Subhedge Projection

We next introduce the concept of subhedge projection states for Sha↓s in order to distinguish prefixes of hedges that are subhedge irrelevant, and to evaluate Sha↓s with subhedge projection.

5.1. Subhedge Projection States

Intuitively, a subhedge projection state can only loop in itself until the current hedge closes (and is then applied). When going down into a subtree, the subhedge projection state may still be changed into some other looping state. When leaving the subtree, however, one must come back to the original subhedge projection state. Clearly, any sink is a subhedge projection state. The more interesting cases, however, are the subhedge projection states that are not sinks.
We start with Sha↓s without any restrictions but will impose schema-completeness and determinism later on.
Definition 7. 
We call a state q ∈ Q a subhedge projection state of Δ if there exists q′ ∈ Q, called the witness of q, such that the set of transition rules of Δ containing q or q′ in the leftmost position is included in the following set:
{ q ⟨⟩→ q′, q @ q′ → q, q′ ⟨⟩→ q′, q′ @ q′ → q′ } ∪ { q –a→ q, q′ –a→ q′ | a ∈ Σ }
The set of all subhedge projection states of Δ is denoted by Q shp Δ .
For any complete Sha↓, the above set of transition rules must be equal to the set of all transition rules of Δ with q or q′ in the leftmost position. But if the Sha↓ is only schema-complete for some schema S, then not all of these transitions must be present. Note also that a subhedge projection state q may be equal to its witness q′. Furthermore, any witness q′ of some subhedge projection state is itself a subhedge projection state with witness q′.
In the example dSha↓ A = (Σ, Q, Δ, I, F) in Figure 7, we have Q_shp^Δ = {Π, 4, 2′, 3′, 1″, 2″}. The witness of all these subhedge projection states is Π. Note that not all possible transitions are present for the states in Q_shp^Δ ∖ {Π}, given that this automaton is not complete (but still schema-complete for the schema doc).
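Definition 7 can be checked mechanically by collecting all rules with q or its witness in the leftmost position and testing inclusion in the allowed set. The sketch below uses our own encoding of a Sha↓, and the small automaton with projection state Pi is invented for illustration.

```python
def is_shp(delta, q, w, sigma):
    """Is q a subhedge projection state with witness w (Definition 7)?"""
    ok_open  = {(q, w), (w, w)}                           # q<>->w, w<>->w
    ok_apply = {(q, w, q), (w, w, w)}                     # q@w->q, w@w->w
    ok_let   = {(q, a, q) for a in sigma} | {(w, a, w) for a in sigma}
    for s, ts in delta["open"].items():
        if s in (q, w) and any((s, t) not in ok_open for t in ts):
            return False
    for (s, p), ts in delta["apply"].items():
        if s in (q, w) and any((s, p, t) not in ok_apply for t in ts):
            return False
    for (s, a), ts in delta["letter"].items():
        if s in (q, w) and any((s, a, t) not in ok_let for t in ts):
            return False
    return True

# Invented example: "B" only loops until its subhedge closes, witness "Pi".
sigma = {"a", "b"}
delta = {"open":  {0: {"A"}, "B": {"Pi"}, "Pi": {"Pi"}},
         "letter": {("A", "a"): {"B"},
                    ("B", "a"): {"B"}, ("B", "b"): {"B"},
                    ("Pi", "a"): {"Pi"}, ("Pi", "b"): {"Pi"}},
         "apply": {(0, "B"): {2}, ("B", "Pi"): {"B"}, ("Pi", "Pi"): {"Pi"}}}

print(is_shp(delta, "B", "Pi", sigma))    # True
print(is_shp(delta, "A", "A", sigma))     # False: A has the letter rule A-a->B
```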
Lemma 6. 
If q Q shp Δ is a subhedge projection state and q h p wrt Δ, then q = p .
Proof. 
Suppose that q h p wrt Δ and that q is a subhedge projection state of Δ with witness q . So, the only possible transitions with q or q on the left-hand side are the following:
Algorithms 17 00339 i003
Suppose that h = t_1 ⋯ t_n is a sequence of trees or letters t_i ∈ H_Σ ∪ Σ, where 1 ≤ i ≤ n. Then, there exist runs R_i ∈ run_Δ(t_i) and a run R ∈ run_Δ(h) that has the form R = q_0·R_1·q_1 ⋯ R_n·q_n, where q_0 = q and q_n = p. We prove for all 0 ≤ i ≤ n that q_i = q by induction on i. For i = 0, this is trivial. For i > 0, the induction hypothesis shows that q_{i−1} = q.
Case 
t_i = a ∈ Σ. Then, q_{i−1} –a→ q_i in Δ. Since q_{i−1} = q is a subhedge projection state of Δ, it follows that q_i = q, too.
Case 
t_i = ⟨h′⟩ for some h′ ∈ H_Σ. Since q_{i−1} = q is a subhedge projection state of Δ with witness q′, the subrun of R recognizing t_i must be justified by the following diagram:
Algorithms 17 00339 i004
So, the subrun of R on h′ may only contain the witness q′, and, furthermore, q_i = q.
   □

5.2. Completeness for Subhedge Projection

Before defining when a Sha is complete for subhedge projection, we show that subhedge projection states in runs of deterministic Shas may only occur at irrelevant prefixes.
Proposition 2. 
Let A = (Σ, Q, Δ, I, F) be a dSha↓ with schema S, let R ∈ run_Δ(h) be the run of compl(A) on some hedge h ∈ S, and let q be a subhedge projection state of Δ. Then, for any prefix r·q of nw(R), the prefix proj_Σ(r) of nw(h) is subhedge irrelevant for L(A) with schema S.
Proof. 
Let R ∈ run_Δ(h) be the unique run of compl(A) on a hedge h ∈ S. Suppose that r·q·s = nw(R) for some subhedge projection state q of Δ, and let v = proj_Σ(r) and w = proj_Σ(s). Since v·w = nw(h) and h ∈ S, it follows that v·w ∈ nw(S). We fix some hedge h′ ∈ H_Σ such that v·nw(h′)·w ∈ nw(S) arbitrarily. Let h̃ ∈ H_Σ be such that nw(h̃) = v·nw(h′)·w. For the subhedge irrelevance of v, we must show that
h ∈ L(A) ⟺ h̃ ∈ L(A).
“⇒”
We suppose that h ∈ L(A) and have to show that h̃ ∈ L(A). For the partial run r·q of A on h̃, there must exist R′ and p such that q·R′·p ∈ run_Δ(h′). Since q is a subhedge projection state of Δ, it follows by Lemma 6 that q = p. Hence, r·q·nw(R′)·q·s is a run of A on h̃. Since R was successful for A, s must end in F; thus, the above run on h̃ is successful for A, too. Hence, h̃ ∈ L(A), as was to be shown.
“⇐”
We suppose that h̃ ∈ L(A) and have to show that h ∈ L(A). Let R̃ be the unique run of compl(A) on h̃. By determinism, R̃ must have the prefix r·q. So, there exist R′, p, and s′ such that q·R′·p ∈ run_Δ(h′) and r·q·R′·p·s′ = R̃. Lemma 6 shows that p = q. Hence, r·q·s′ is the unique run of compl(A) on h, and this run is accepting. So, h ∈ L(A).
   □
We note that Proposition 2 would not hold without assuming determinism. To see this, we can add some sink to any dSha↓ as a second initial state. One can then always go to this sink, which is a subhedge projection state. Nevertheless, no prefix going to the sink needs to be subhedge irrelevant.
Proposition 2 shows that subhedge projection states permit distinguishing prefixes that are subhedge irrelevant for deterministic Sha↓s. An interesting question is whether all subhedge irrelevant prefixes can be found in this way. A closely related question is whether, for all regular patterns, there exist dSha↓s that project all irrelevant subhedges.
Example 6. 
Reconsider Example 3 with Σ = { a } , a regular pattern L = { a } , and a regular schema S = L { · a } . The prefix is irrelevant for L and S , since all its continuations ending in the schema are rejected: there is only one such continuation, which is · a . However, a dSha  for L and S with maximal subhedge projection is given in Figure 17 but will not go into a subhedge projection state at this prefix. The reason is that it will go into the same state { 3 , 4 } D 0 for any prefix in N Σ , ignoring the nested word of the irrelevant subhedge. Therefore, this state must accept a . But, when reading an a in this state, the dSha  goes into the rejection state { 5 } D 0 , showing that the hedge from the schema · a S is rejected.
The example shows that we eventually reason with sets of prefixes. We first lift the notion of subhedge irrelevance to prefix sets.
Definition 8. 
Let S ⊆ H_Σ be a schema and L ⊆ S a language satisfying this schema. A set of nested word prefixes U ⊆ prefs(N_Σ) is called subhedge relevant for L with schema S if there exist nested words v, v′ ∈ N_Σ and prefixes u, u′ ∈ U such that (u·v, u′·v′) ∈ diff_S^L. Otherwise, the set U is called subhedge irrelevant for L wrt S.
In order to explain the problem of Example 6, we define for each nested word prefix w ∈ prefs(N_Σ) a class class_S^L(w) of similar prefixes, up to replacing the nested word of any irrelevant subhedge by N_Σ from the left to the right. For all prefixes w ∈ prefs(N_Σ), nested words v ∈ N_Σ, letters a ∈ Σ, subsets of prefixes U ⊆ prefs(N_Σ), and letters or parentheses l ∈ Σ̂, the following applies:
c l a s s S L ( w ) = c l a s s _ p r { ε } ( w ) c l a s s _ p r U ( ε ) = U · N Σ i f U irrelevant for L wrt   S U else c l a s s _ p r U ( w · · v ) = c l a s s _ i n s U ( , v ) where U = c l a s s _ p r U ( w ) c l a s s _ p r U ( a · v ) = c l a s s _ i n s U ( a , v ) where U = U · a c l a s s _ p r U ( v · v ) = c l a s s _ i n s U ( , v ) where U = c l a s s _ p r U ( · v ) c l a s s _ i n s U ( l , v ) = U · N Σ if U irrelevant for L wrt   S c l a s s _ p r U · l ( v ) else
Definition 9. 
A prefix u prefs ( N Σ ) is strongly subhedge irrelevant for L wrt S if the set c l a s s S L ( u ) is subhedge irrelevant for L wrt S .
Example 7. 
In our running Example 6, where L = { a } , S = L { · a } , and Σ = { a } , we have c l a s s S L ( ) = · N Σ , so the prefix is strongly subhedge irrelevant. The set c l a s s S L ( a ) = N Σ is subhedge-relevant, so the prefix a is not strongly subhedge irrelevant. The set c l a s s S L ( · a ) = N Σ · a · N Σ is subhedge irrelevant, so the prefix · a is strongly subhedge irrelevant.
Definition 10. 
A dSha↓ A = (Σ, Q, Δ, I, F) is complete for subhedge projection for schema S if A is schema-complete for S and, for all hedges h ∈ S and all prefixes v of nw(h) that are strongly subhedge irrelevant, the unique run of A on h goes to some subhedge projection state of Δ at v.
Example 8. 
The dSha  in Figure 17 is complete for subhedge projection for our running Example 6 where L = { a } and S = L { · a } . It was obtained by congruence projection. On all prefixes of the subhedge irrelevant class c l a s s S L ( ) , it goes to the subhedge projection state { 0 } D 1 . On all prefixes of the class c l a s s S L ( a ) , which is not subhedge irrelevant, it goes to the state { 3 , 4 } D 0 , which is not a subhedge projection state. On all prefixes of the subhedge irrelevant class c l a s s S L ( · a ) , it goes to the subhedge projection state { 5 } D 0 .
Example 9. 
The dSha↓ from Figure 7 is complete for subhedge projection with schema doc. It can be obtained by safe-no-change projection. In general, however, dSha↓s obtained by safe-no-change projection may not be complete for subhedge projection, as we will discuss in Section 6.3.

5.3. In-Memory Evaluator with Subhedge Projection

We next show how to refine the transitions of Sha↓s by subhedge projection. We define the transition relation with subhedge projection –h→shp ⊆ Q × Q with respect to Δ such that, for all hedges h, h′ ∈ H_Σ and letters a ∈ Σ:
q –h→shp q wrt Δ if q ∈ Q_shp^Δ and h ∈ H_Σ,
q –a→shp q′ wrt Δ if q ∉ Q_shp^Δ and q –a→ q′ in Δ,        q –ε→shp q wrt Δ if q ∉ Q_shp^Δ,
q –h·h′→shp q″ wrt Δ if q ∉ Q_shp^Δ, q –h→shp q′ wrt Δ, and q′ –h′→shp q″ wrt Δ,
q –⟨h⟩→shp q′ wrt Δ if q ∉ Q_shp^Δ, q ⟨⟩→ p in Δ, p –h→shp p′ wrt Δ, and q @ p′ → q′ in Δ.
The first rule says that subhedge projecting transitions stay in subhedge projection states until the end of the current subhedge is reached. This is correct by Lemma 6, under the condition that there are no blocking runs. The other rules state that the evaluator behaves as without subhedge projection in all other states.
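An evaluator applying these rules jumps over the remainder of a subhedge as soon as a projection state is reached. The following deterministic sketch uses our own encoding; the toy dSha↓ and its set of projection states (checked by hand against Definition 7) are invented for illustration.

```python
def eval_shp(delta, shp, q, hedge):
    """Deterministic transitions with subhedge projection: in a projection
    state, the rest of the current subhedge is skipped (the 'stay' rule)."""
    for item in hedge:
        if q in shp:
            break                                  # q -h-> q for the remainder
        if isinstance(item, str):
            q = delta["letter"][(q, item)]
        else:
            p = delta["open"][q]                   # q <> -> p
            p2 = eval_shp(delta, shp, p, item)     # evaluate (or skip) subhedge
            q = delta["apply"][(q, p2)]            # q @ p2 -> q2
    return q

# Toy dSha(down): state "B" (witness "Pi") projects; it is reached once the
# first letter of the first top-level subtree has been read.
delta = {"open":  {0: "A", "B": "Pi", "Pi": "Pi"},
         "letter": {("A", "a"): "B", ("B", "a"): "B", ("B", "b"): "B",
                    ("Pi", "a"): "Pi", ("Pi", "b"): "Pi"},
         "apply": {(0, "B"): 2, ("B", "Pi"): "B", ("Pi", "Pi"): "Pi"}}
shp = {"B", "Pi"}

# The subtree <b> and the letter b after it are never visited.
print(eval_shp(delta, shp, 0, [["a", ["b"], "b"]]))   # 2
```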
Proposition 3. 
Let A = (Σ, Q, Δ, I, F) be a Sha↓ that is schema-complete for S. Then, for all hedges h ∈ S and states q, p ∈ Q:
q –h→ p wrt Δ iff q –h→shp p wrt Δ.
Proof. 
We distinguish two cases as follows:
Case 
q ∉ Q_shp^Δ. Then, for any p ∈ Q, q –h→ p wrt Δ iff q –h→shp p wrt Δ by the definition of the projecting transitions.
Case 
q ∈ Q_shp^Δ. Then, q –h→shp q wrt Δ. Since h ∈ S and A is schema-complete for S, there exists some run of A on h; thus, there is a state q′ such that q –h→ q′ wrt Δ. Lemma 6 shows that q′ = q; thus, q –h→ q wrt Δ.
   □
For evaluating a nondeterministic Sha A = ( Σ , Q , Δ , I , F ) with subhedge projection on some hedge h whose graph is stored in memory, we can compute the unique subset P Q , such that I h shp P with respect to the transition relation Δ d e t of determinization of A and test whether P F . When reaching any subhedge projection state P while doing so, the evaluator can directly jump to the end of the current subhedge by applying the “stay” rule of the determinization of A:  
If P′ ∈ Q_shp^{Δ_det} and h′ ∈ H_Σ, then P′ –h′→_shp P′ wrt Δ_det.
Thereby, it will avoid visiting any subhedge in a subhedge projection state of the determinization of A. The running time will thus depend linearly on the size of the non-projected part of h under the assumption that the determinization of A is available.
As before, only the part of the determinization of A that is needed for evaluating h is produced on-the-fly. Overall, this costs polynomial time in the size of the needed part of the determinization (but not necessarily linear time, since the determinization of Shas may be of exponential size). Also note that, whenever discovering a new state P of the determinization of A, we have to test whether it is a subhedge projection state. For this, we must compute the state P′ such that P –⟨⟩→ P′ wrt Δ_det and test whether only the transition rules allowed for subhedge projection states with witness P′ are available for P wrt Δ_det.
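To make the evaluation strategy concrete, here is a minimal sketch, not the authors' implementation, of in-memory evaluation of a deterministic Sha with subhedge projection. Hedges are represented as Python lists whose items are letters (strings) or subhedges (nested lists); the transition maps `letter`, `open_`, and `apply` as well as the set `proj` of subhedge projection states are assumptions of this sketch.

```python
# A minimal sketch of in-memory dSha evaluation with subhedge projection.
# Hedges are Python lists whose items are letters (strings) or subhedges
# (nested lists). The maps `letter` ((q, a) -> q'), `open_` (q -> p0 for
# the <>-rule), `apply` ((q, p) -> q'), and the projection-state set
# `proj` are assumptions of this sketch, not the paper's concrete syntax.

def eval_hedge(h, q, letter, open_, apply, proj):
    """Run the dSha from state q over hedge h, skipping subhedges
    whenever the current state is a subhedge projection state."""
    for item in h:
        if q in proj:
            # "stay" rule: a projection state loops over any subhedge,
            # so the remaining items need not be visited at all.
            continue
        if isinstance(item, list):          # item is a tree <h'>
            p0 = open_[q]                   # q -<>-> p0 in Delta
            p = eval_hedge(item, p0, letter, open_, apply, proj)
            q = apply[(q, p)]               # q @ p -> q' in Delta
        else:                               # item is a letter
            q = letter[(q, item)]
    return q
```

The running time is linear in the non-projected part of the hedge, since projected items are skipped without recursion.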

6. Safe-No-Change Projection

We want to solve the regular pattern-matching problem for hedges with subhedge projection. We assume that the regular pattern is given by a dSha while the schema may be arbitrary, except that the dSha must be schema-complete (see Section 4.3). We then present the safe-no-change projection algorithm, which compiles the dSha into some dSha while introducing subhedge projection states.
The idea of safe-no-change projection is to push information top-down by which to distinguish looping states that will safely not change anymore. Thereby, subhedge projection states are produced, as we will illustrate by examples. The soundness proof of safe-no-change projection is nontrivial and instructive for the soundness proof of congruence projection in Section 7.3. Completeness for subhedge projection cannot be expected, since safe-no-change projection only takes simple loops into account. More general loops cannot be treated, and it is not clear how that could be done. The congruence projection algorithm in Section 7.2 will eventually give an answer to this.
We keep the safe-no-change compiler independent of the schema and, therefore, do not have to assume its regularity. But, we have to assume the schema-completeness of the input dSha in order to show that the output dSha recognizes the same language within the schema. The regular pattern-matching problem can then be solved in memory and with subhedge projection by evaluating the dSha on the input hedge in memory and with subhedge projection, as discussed in Section 5.3.

6.1. Algorithm

We now describe the safe-no-change projection algorithm. Let A = (Σ, Q, Δ, I, F) be a dSha with schema S. The algorithm takes the dSha A as input and compiles it to some dSha A^snc. We will state why A^snc is sound for subhedge projection under the condition that A is schema-complete for schema S. We do not even have to assume that the schema S is regular. The proof is eight pages long and is therefore delegated to Appendix B.
For any set P Q , we define the set of states that safely lead to P:
safe_Δ(P) = { q ∈ Q | acc_Δ({q}) ⊆ P }.
So, state q is safe for P if any hedge read from q on which the run with Δ does not block must reach some state in P. We note that safe Δ ( P ) can be computed in linear time in the size of Δ using inverse hedge accessibility from P. We define the set of states that will no longer change:
no-change_Δ = { q ∈ Q | q ∈ safe_Δ({q}) }.
So, q ∈ no-change_Δ if and only if acc_Δ({q}) ⊆ {q}. This means that all transition rules starting in q must loop in q. In the example automaton from Figure 4, the self-looping states are those in no-change_Δ = {2, 3, 4, 5}.
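The sets safe_Δ(P) and no-change_Δ can be computed by a straightforward reachability analysis. The following Python sketch is an illustration, not the authors' implementation: it represents the transition rules by the maps `letter` and `apply` (assumptions of this sketch) and, for simplicity, assumes that every apply rule can fire; a full implementation would also check the accessibility of the tree argument.

```python
# A sketch of hedge accessibility acc, safe states, and no-change states.
# `letter` maps (q, a) to q'; `apply` maps (q, p) to q'. Both maps are
# assumptions of this sketch. Every apply rule is assumed able to fire.

def acc(letter, apply, start):
    """States hedge-accessible from the set `start` (including start)."""
    reach, todo = set(start), list(start)
    while todo:
        q = todo.pop()
        succs = [q2 for (q1, _), q2 in letter.items() if q1 == q]
        succs += [q2 for (q1, _), q2 in apply.items() if q1 == q]
        for q2 in succs:
            if q2 not in reach:
                reach.add(q2)
                todo.append(q2)
    return reach

def safe(states, letter, apply, P):
    """safe(P): states from which every non-blocking run stays in P."""
    return {q for q in states if acc(letter, apply, {q}) <= set(P)}

def no_change(states, letter, apply):
    """States that safely never change: acc({q}) is contained in {q}."""
    return {q for q in states if q in safe(states, letter, apply, {q})}
```

Recomputing acc per state makes this quadratic; the linear-time bound mentioned in the text requires a single inverse-accessibility pass instead.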
Lemma 7. 
Let A = (Σ, Q, Δ, I, F) be a Sha that is schema-complete for schema S. For any hedge h ∈ S and state q ∈ I ∩ no-change_Δ, it then holds that q –h→ q wrt Δ.
Proof. 
The schema-completeness of A for S applied to h ∈ S and q ∈ I yields the existence of some state q′ ∈ Q such that q –h→ q′ wrt Δ. Let q′ be any such state. Note that q′ ∈ acc_Δ({q}). Since q ∈ no-change_Δ, we have q ∈ safe_Δ({q}), so that q′ ∈ acc_Δ({q}) implies q′ = q. This proves q –h→ q wrt Δ.   □
For any state q Q and subset of states Q Q , we define the following:
s-down_Δ(q, Q′) = safe_Δ({ p ∈ Q | q @_Δ p ∈ Q′ })   and   s-no-change_Δ(q) = s-down_Δ(q, {q}).
A state p′ belongs to s-down_Δ(q, Q′) if all states p ∈ acc_Δ({p′}) satisfy q @_Δ p ∈ Q′. So, p′ is a state on the lower level that will safely go up to some state in Q′. A state p belongs to s-no-change_Δ(q) if it safely does not change q, i.e., if {q} @_Δ acc_Δ({p}) ⊆ {q}.
We next compile the Sha A to a Sha A snc = ( Σ , Q snc , Δ snc , I snc , F snc ) . For this, let Π be a fresh symbol and consider the state set as follows:
Q^snc = {Π} ∪ (Q × 2^Q).
In practice, we restrict the state space to those states that are accessible, or we clean the dShas by keeping only those states that are used in some successful run. But, in the worst case, the construction may indeed be exponential. Example 15 can be adapted to show this.
A pair (q, P) means that the state q will safely no longer change in the current subhedge if q ∈ P ∪ no-change_Δ. In this case, the subhedge can be projected. The sets of initial and final states are defined as follows:
I^snc = I × {∅}   and   F^snc = F × {∅}.
How to generate the transition rules of A^snc from those of A is described in Figure 10. On states (q, P) assigned on the top level, the set P is empty, so that only the states in no-change_Δ are safe for no change. This is why the definitions of I^snc and F^snc use P = ∅. When going down from a state (q, P) for which q is safe to not change, i.e., q ∈ P ∪ no-change_Δ, the following rule is applied:
If q ∈ P ∪ no-change_Δ, then (q, P) –⟨⟩→ Π in Δ^snc.
The evaluation on the lower level goes to the extra state Π, where it then loops until going back to q on the upper level. When going down from some state (q, P) such that q ∉ P ∪ no-change_Δ, the following rule is applied:
If q –⟨⟩→ q′ in Δ and q ∉ P ∪ no-change_Δ, then (q, P) –⟨⟩→ (q′, s-no-change_Δ(q)) in Δ^snc.
The states in the set s-no-change_Δ(q) on the lower level will safely not make q change on the upper level for any subhedge to come. (We could detect more irrelevant subhedges by allowing q to change, but only in a unique manner. This can be obtained with the alternative definition s-no-change_Δ(q) = ∪_{r ∈ Q} s-down_Δ(q, {r}). Computing this set would require quadratic time O(n·|A|), while computing s-no-change_Δ(q) as defined above can be carried out in linear time O(|A|).)
When applied to the Sha in Figure 4 for [ self : : list / child : : item ] , the construction yields the Sha in Figure 11, which is indeed equal to the Sha from Figure 7 up to state renaming. When run on the hedge list · list · h 1 · item · h 2 , as shown in Figure 8, it does not have to visit subhedges h 1 or h 2 since all of them will be reached starting from the projection state Π .

6.2. Soundness

We next state and prove a soundness result for safe-no-change projection.
Theorem 1 (Soundness of safe-no-change projection).
If a Sha A is schema-complete for some schema S, then safe-no-change projection for A preserves the language within this schema: L(A^snc) ∩ S = L(A).
The projecting in-memory evaluator of A^snc will be more efficient than the non-projecting evaluator of A. Note that the size of A^snc may be exponentially bigger than that of A. Therefore, for evaluating a dSha A with subhedge projection on a given hedge h, we may prefer to compute only the needed part of A^snc on-the-fly. This part has a size of O(|h|) and can be computed in time O(|A| |h|), so the exponential preprocessing time is avoided.

6.3. Incompleteness

Safe-no-change projection may be incomplete for subhedge projection, so that not all prefixes that are strongly subhedge irrelevant are mapped to subhedge projection states. This is shown by the counterexample dSha A for the XPath filter [child::item] in Figure 12. Its safe-no-change projection A^snc is given in Figure 13. Note that the prefix u = item.item has the class class_{S∩L(A)}(u) = item.item.N_Σ, which is subhedge irrelevant, so u is strongly subhedge irrelevant for the XPath filter [child::item]. Nevertheless, it leads to the state (2, {1, 3, 5, 6}), which is not a subhedge projection state since 2 ∉ {1, 3, 5, 6}. The problem is that this state can still be changed to (4, {1, 3, 5, 6}). This state is somehow equivalent with respect to the filter, but not equal to (2, {1, 3, 5, 6}).
Another incompleteness problem should be mentioned: safe-no-change projection is sensitive to automata completion, as noticed earlier in Example 5. This is because a state may belong to no-change_Δ before completion but no longer after adding a sink. Nevertheless, such states never change on any tree satisfying the schema. This problem applies, for instance, to the example dSha in Figure 4 for the XPath query [self::list/child::item]. Therefore, it was important to assume only schema-completeness for safe-no-change projection and not to impose full completeness.

7. Congruence Projection

We present the congruence projection algorithm for detecting irrelevant subhedges for regular hedge patterns with regular schema restrictions. We prove that congruence projection is complete for subhedge projection, resolving the incompleteness of safe-no-change projection. For this, congruence projection may no longer ignore the schema, so we have to assume that the input schema is regular too.

7.1. Ideas

The starting idea of congruence projection is to resolve the counterexample for the completeness of safe-no-change projection in Figure 13. There, a state is changed when reading an irrelevant subhedge. But the state change moves to a somehow equivalent state, so it is not really relevant. Therefore, the idea is to detect when a state always remains equivalent, rather than unchanged, when reading an arbitrary subhedge. This is eventually achieved by the congruence projection in Figure 14 that we are going to construct.
The obvious question is as follows: which equivalence relation do we choose? Suppose that we want to test whether a hedge satisfying a regular schema S matches a regular pattern L. In the restricted case of words without schema restrictions, the idea could be to use the complement of Myhill–Nerode's congruence, i.e., the difference relation diff_Σ^L, but mapped to states. In the general case, however, the situation becomes more complex, given that diff_S^L may fail to be an equivalence relation. So, it may not be a congruence, as already illustrated in Example 4. Furthermore, the treatment of the nesting of hedges will require updating the considered relation when moving down a hedge.

7.1.1. Difference Relations on States

The congruence projection algorithm will maintain a difference relation on the states of the dSha, which at the beginning is induced by the difference relation diff_S^L on hedges and is updated whenever moving down a hedge.
Definition 11. 
Let (Σ, Q, Δ, _, _) be a dSha. A difference relation for Δ is a symmetric relation on states D ⊆ Q × Q such that, for all h ∈ H_Σ:
( q –h→ q′ wrt Δ ∧ p –h→ p′ wrt Δ ∧ (q′, p′) ∈ D ) ⟹ (q, p) ∈ D.
The set of all difference relations for Δ is denoted by D Δ .
We call a subset Q′ ⊆ Q compatible with a difference relation D ∈ D_Δ if Q′² ∩ D = ∅. This means that no two states of Q′ may be different with respect to D.
Definition 12. 
Let (Σ, Q, Δ, _, _) be a Sha and D a difference relation for Δ. A subset of states Q′ ⊆ Q is subhedge irrelevant for D wrt Δ if acc_Δ(Q′) is compatible with D. The set of all subhedge irrelevant subsets of states for D wrt Δ thus is as follows:
dPrj_Δ(D) = { Q′ ⊆ Q | acc_Δ(Q′)² ∩ D = ∅ }.
A state q Q is subhedge irrelevant for D if the singleton { q } is subhedge irrelevant for D. The set of all subhedge irrelevant states of Q is denoted by Prj Δ ( D ) .
We consider subhedge-irrelevance for subsets of states since the congruence projection algorithm has to eliminate the nondeterminism that it introduces by a kind of Sha determinization. Determinization is necessary in order to recognize subhedge irrelevant prefixes properly: all single states in a subset may be subhedge irrelevant while the whole subset is not.
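The irrelevance test of Definition 12 can be sketched in Python. This is an illustration only: the transition maps `letter` and `apply` are assumptions of this sketch, and, as before, every apply rule is assumed able to fire when computing accessibility.

```python
# A sketch of the test of Definition 12: a subset Qp of states is
# subhedge irrelevant for a difference relation D if no two states
# hedge-accessible from Qp are different in D. `letter` and `apply`
# are transition maps assumed by this sketch.

def subhedge_irrelevant(Qp, D, letter, apply):
    reach, todo = set(Qp), list(Qp)
    while todo:                              # compute acc(Qp)
        q = todo.pop()
        for (q1, _), q2 in list(letter.items()) + list(apply.items()):
            if q1 == q and q2 not in reach:
                reach.add(q2)
                todo.append(q2)
    # compatible with D: acc(Qp)^2 contains no pair of D
    return all((p, q) not in D for p in reach for q in reach)
```

Note that the test is on the whole subset at once: each singleton may pass the test while their union fails, which is exactly why determinization is needed.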

7.1.2. Least Difference Relations

For any binary relation R Q × Q , let ldr Δ ( R ) be the least difference relation on states that contains R.
Lemma 8. 
(p₁, p₂) ∈ ldr_Δ(R) ⟺ ∃(q₁, q₂) ∈ R. ∃h ∈ H_Σ. p₁ –h→ q₁ wrt Δ ∧ p₂ –h→ q₂ wrt Δ.
Proof. 
The set { (p₁, p₂) ∈ Q² | ∃(q₁, q₂) ∈ R. ∃h ∈ H_Σ. p₁ –h→ q₁ wrt Δ ∧ p₂ –h→ q₂ wrt Δ } clearly is a difference relation that contains R, and thus contains the least such difference relation ldr_Δ(R). Conversely, each pair in the above set must be contained in any difference relation containing R, and thus in ldr_Δ(R).   □
Lemma 9. 
For any R Q × Q , the difference relation ldr Δ ( R ) is the value of predicate D in the least fixed point of the ground datalog program generated by the following inference rules:
- For all p, q ∈ Q: D(p, q) :− R(p, q).
- If p₁ –a→ p₂ wrt Δ and q₁ –a→ q₂ wrt Δ: D(p₁, q₁) :− D(p₂, q₂).
- If p₁ @ p → p₂ wrt Δ and q₁ @ q → q₂ wrt Δ: D(p₁, q₁) :− D(p₂, q₂).
- For all p, q ∈ Q: D(q, p) :− D(p, q).
Proof. 
The first inference rule guarantees that R ⊆ D. The three later inference rules characterize difference relations D ∈ D_Δ. The second and third rules state that differences in D are propagated backwards over hedges h. This is done recursively by treating letter hedges h = a with the second rule and tree hedges h = ⟨h′⟩ with the third rule. The fourth rule expresses the symmetry of difference relations D. So, the least fixed point of the datalog program generated by the above rules contains R and is the least relation that satisfies all axioms of difference relations; hence, it is ldr_Δ(R).   □
Proposition 4. 
For any R Q × Q , the least difference relation ldr Δ ( R ) can be computed in time O ( | A | 2 ) .
Proof. 
The size of the ground datalog program computing ldr Δ ( R ) from Lemma 9 is at most quadratic in the size of A, so its least fixed point can be computed in time O ( | A | 2 ) .   □
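The fixpoint of Lemma 9 can be sketched directly in Python without an explicit datalog engine. The transition maps `letter` and `apply` are assumptions of this sketch; differences are propagated backwards over letter and apply rules until nothing new is added.

```python
# A sketch of the ground-datalog fixpoint from Lemma 9, computing the
# least difference relation ldr(R) containing a relation R on states.
# `letter` maps (q, a) to q'; `apply` maps (q, p) to q'. Both are
# assumptions of this sketch.

def ldr(R, letter, apply):
    D = set()
    for (p, q) in R:                 # D(p, q) :- R(p, q), plus symmetry
        D.add((p, q))
        D.add((q, p))
    changed = True
    while changed:
        changed = False
        new = set()
        for (p1, a1), p2 in letter.items():
            for (q1, a2), q2 in letter.items():
                if a1 == a2 and (p2, q2) in D:   # backwards over letters
                    new.add((p1, q1))
                    new.add((q1, p1))
        for (p1, _), p2 in apply.items():
            for (q1, _), q2 in apply.items():
                if (p2, q2) in D:                # backwards over apply
                    new.add((p1, q1))
                    new.add((q1, p1))
        if not new <= D:
            D |= new
            changed = True
    return D
```

The naive nested loops match the quadratic size of the ground datalog program, in line with the O(|A|²) bound of Proposition 4.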

7.2. Algorithm

We now define the congruence projection algorithm by a compiler from dShas and a set of schema final states to Shas, which, when run on a hedge, can detect all subhedge irrelevant prefixes.

7.2.1. Inputs and Outputs

As inputs, the congruence projection algorithm is given a dSha A = (Σ, Q, Δ, I, F) for the regular pattern and a set F_S with F ⊆ F_S ⊆ Q. The dSha defines the regular language L = L(A), while the regular schema S is recognized by the dSha A[F/F_S] = (Σ, Q, Δ, I, F_S). So, the same base automaton defines the regular language and the regular schema by choosing different sets of final states.
Example 10. 
In the example dSha in Figure 4 for the XPath filter [self::list/child::item] with schema doc, we have F = {4} and F_S = {4, 5}. In the automaton for the counterexample [child::item] for safe-no-change projection in Figure 12, we have F = {6} and F_S = {5, 6}.
We note that, if L and S were given by independent Shas, then we can obtain a common dSha A and a set F S as above by computing the product of the dShas for L and S and completing it. As noticed in [16], it may be more efficient to determinize the Sha for S in a first step, build the product of the Sha for L with the dSha for S in a second, and determinize this product in the third step.
The congruence projection of A wrt F_S will maintain in its current configuration a subset of states Q′ ⊆ Q and a difference relation on states D ∈ D_Δ. Thereby, the congruence projection dSha can always detect whether the current prefix is subhedge irrelevant for L wrt schema S by testing whether the current set of states Q′ is subhedge irrelevant for the current difference relation D.

7.2.2. Initial Difference Relation

The initial difference relation on states D_init^{A,F_S} ∈ D_Δ is induced by the difference relation diff_S^L on prefixes as follows:
D_init^{A,F_S} = { (q, q′) | ∃(v, v′) ∈ diff_S^L. ∃q₀ ∈ I. q₀ –hdg(v)→ q wrt Δ ∧ q₀ –hdg(v′)→ q′ wrt Δ }.
The next lemma indicates how D i n i t A , F S can be defined from A and F S without reference to the languages L = L ( A ) and S = L ( A [ F / F S ] ) .
Lemma 10. 
D_init^{A,F_S} = ldr_Δ(F × (F_S \ F)) ∩ acc_Δ(I)².
Proof. 
For one direction, let (p, p′) ∈ D_init^{A,F_S}. Then, there exist nested words (v, v′) ∈ diff_S^L and an initial state q₀ ∈ I such that q₀ –hdg(v)→ p wrt Δ and q₀ –hdg(v′)→ p′ wrt Δ. Since v, v′ are nested words, hedge accessibility (p, p′) ∈ acc_Δ(I)² follows. Furthermore, (v, v′) ∈ diff_S^L requires the existence of a hedge h ∈ H_Σ such that hdg(v)·h ∈ L and hdg(v′)·h ∈ S \ L. Hence, there are states q ∈ F and q′ ∈ F_S \ F such that p –h→ q and p′ –h→ q′. Using Lemma 8, this implies that (p, p′) ∈ ldr_Δ(F × (F_S \ F)).
For the other direction, let (p, p′) ∈ ldr_Δ(F × (F_S \ F)) ∩ acc_Δ(I)². Using Lemma 8, property (p, p′) ∈ ldr_Δ(F × (F_S \ F)) shows that there exist states q ∈ F, q′ ∈ F_S \ F, and h ∈ H_Σ such that p –h→ q wrt Δ and p′ –h→ q′ wrt Δ. From (p, p′) ∈ acc_Δ(I)², it follows that there are nested words v, v′ ∈ N_Σ and an initial state q₀ ∈ I such that q₀ –hdg(v)→ p wrt Δ and q₀ –hdg(v′)→ p′ wrt Δ. Hence, (v, v′) ∈ diff_S^L, so that (p, p′) ∈ D_init^{A,F_S}.   □
As a consequence of Lemma 10 and Proposition 4, the initial difference relation D i n i t A , F S can be computed in time O ( | A | 2 ) from A and F S .
Proposition 5 (Soundness of the initial difference relation).
Let A = (Σ, Q, Δ, I, F) be a complete dSha, F ⊆ F_S ⊆ Q, L = L(A), and S = L(A[F/F_S]). For any hedge h ∈ H_Σ and state q ∈ Q such that q₀ –h→ q wrt Δ for some q₀ ∈ I, the nested word nw(h) is subhedge irrelevant for L and S if and only if q is subhedge irrelevant for D_init^{A,F_S} wrt Δ.
Proof. 
We show that nw ( h ) is subhedge relevant for L and S if and only if q is subhedge relevant for D i n i t A , F S wrt. Δ .
“⇒”
Let nw(h) be subhedge relevant for L and S. Then, there exist hedges h′, h″ ∈ H_Σ and a nested word w ∈ N_Σ such that
nw(h)·nw(h′)·w ∈ L   ∧   nw(h)·nw(h″)·w ∈ S \ L.
Since A is deterministic and q₀ –h→ q wrt Δ for the unique q₀ ∈ I, it follows that there exist states p, p′ ∈ Q, q′ ∈ F, and q″ ∈ F_S \ F such that
q –h′→ p –hdg(w)→ q′ wrt Δ   ∧   q –h″→ p′ –hdg(w)→ q″ wrt Δ.
From q′ ∈ F and q″ ∈ F_S \ F, it follows that (q′, q″) ∈ D_init^{A,F_S}. Since D_init^{A,F_S} is a difference relation, p –hdg(w)→ q′ wrt Δ, and p′ –hdg(w)→ q″ wrt Δ, this implies that (p, p′) ∈ D_init^{A,F_S}, too. Hence, (p, p′) ∈ acc_Δ({q})² ∩ D_init^{A,F_S}, i.e., q is relevant for D_init^{A,F_S} wrt Δ.
“⇐”
Let q be subhedge relevant for D_init^{A,F_S} wrt Δ. Then, there exists a pair (p, p′) ∈ acc_Δ({q})² ∩ D_init^{A,F_S}. So, there are hedges h′ and h″ such that
q –h′→ p wrt Δ   ∧   q –h″→ p′ wrt Δ.
Since D_init^{A,F_S} = ldr_Δ(F × (F_S \ F)) ∩ acc_Δ(I)² by Lemma 10, there exist a nested word w and a pair (q′, q″) ∈ Q² such that either (q′, q″) or (q″, q′) belongs to F × (F_S \ F) and
p –hdg(w)→ q′ wrt Δ   ∧   p′ –hdg(w)→ q″ wrt Δ.
Hence, nw(h)·nw(h′)·w ∈ L and nw(h)·nw(h″)·w ∈ S \ L, or vice versa. This shows that nw(h) is relevant for L and S.
   □

7.2.3. Updating the Difference Relation

For any difference relation D ∈ D_Δ and subset of states Q′ ⊆ Q, we define a binary relation down_{Q′}^Δ(D) ⊆ Q × Q such that, for all states p₁, p₂ ∈ Q:
(p₁, p₂) ∈ down_{Q′}^Δ(D)   iff   ∃q₁, q₂ ∈ Q′. (q₁ @_Δ p₁, q₂ @_Δ p₂) ∈ D.
For any subset of states Q′ ⊆ Q and difference relation D ∈ D_Δ, let D_{Q′} ∈ D_Δ be the least difference relation that contains down_{Q′}^Δ(D):
D_{Q′} = ldr_Δ(down_{Q′}^Δ(D)).
Lemma 11. 
The difference relation D_{Q′} can be computed in time O(|A|²) from Q′, Δ, and D.
Proof. 
The binary relation down_{Q′}^Δ(D) can be computed in time O(|Δ|²) from Q′, D, and Δ. The least difference relation ldr_Δ(down_{Q′}^Δ(D)) can then be computed by ground datalog in time O(|A|²) using Proposition 4.   □
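The update step can be sketched as follows: the function computes the relation down_{Q′}^Δ(D), from which D_{Q′} is then obtained by closing under the rules of Lemma 9. The map `apply` from pairs (q, p) to the state q @ p is an assumption of this sketch; `Qp` plays the role of the subset Q′ of upper-level states.

```python
# A sketch (illustration only) of the relation down_{Q'}(D) from
# Section 7.2.3. `apply` maps (q, p) to the state q @ p and is an
# assumption of this sketch.

def down(D, Qp, apply):
    """Pairs (p1, p2) of lower-level states such that some q1, q2 in Qp
    satisfy (q1 @ p1, q2 @ p2) in D."""
    tree_states = {p for (_, p) in apply}   # possible tree arguments
    pairs = set()
    for p1 in tree_states:
        for p2 in tree_states:
            for q1 in Qp:
                for q2 in Qp:
                    r1 = apply.get((q1, p1))
                    r2 = apply.get((q2, p2))
                    if r1 is not None and r2 is not None and (r1, r2) in D:
                        pairs.add((p1, p2))
    return pairs
```

Iterating over all pairs of apply rules matches the O(|Δ|²) bound stated in the proof of Lemma 11.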
Example 11. 
Reconsider the dSha for the XPath filter [self::list/child::item] in Figure 4 with the set of schema-final states F_S = {4, 5}. Since F = {4}, we have F_S \ F = {5}. The initial difference relation is D₀ = D_init^{A,F_S}, which is the symmetric closure of ({0} × {1, 4, 5}) ∪ {(1, 4), (4, 5)}. The subhedge irrelevant states are Prj_Δ(D₀) = {1, 2, 3, 4, 5}, since only state 0 can access two states that are different in D₀, namely the final state in F = {4} and a nonfinal state that is schema-final in F_S \ F = {5}. The difference relation down_{{0}}^Δ(D₀) is the symmetric closure of {1, 2} × {3}, which is equal to the difference relation D₁ = (D₀)_{{0}} = ldr_Δ(down_{{0}}^Δ(D₀)). Hence, the projection states are Prj_Δ(D₁) = {2, 3, 4, 5}, since from states 0 and 1 one can still reach both 1 and 3 while (1, 3) ∈ D₁. The difference relation D₂ = down_{{1}}^Δ(D₁) is the symmetric closure of {(1, 2), (2, 3)}. Hence, the projection states are Prj_Δ(D₂) = {1, 2, 3, 4, 5}. From state 1, one can only reach states 1 and 3, which are not different for D₂.
Projection states for the initial difference relation contain all states that are safe for selection or safe for rejection with respect to the schema:
Lemma 12. 
All states in safe Δ ( F ) and safe Δ ( F S \ F ) are subhedge irrelevant for D i n i t A , F S .
We omit the proof since the result is not used later on.
Given a dSha A = (Σ, Q, Δ, I, F) and a set F ⊆ F_S ⊆ Q, we now construct the congruence projection A^cgr(S) as the following dSha:
A^cgr(S) = (Σ, Q^cgr, Δ^cgr, I^cgr(S), F^cgr(S)),
where the set of states, initial states, and final states are as follows:
Q^cgr ⊆ 2^Q × D_Δ,
I^cgr(S) = { (I, D_init^{A,F_S}) },
F^cgr(S) = { (Q′, D_init^{A,F_S}) | Q′ ∉ dPrj_Δ(D_init^{A,F_S}), Q′ ∩ F ≠ ∅ } ∪ { (Q′, D_init^{A,F_S}) | Q′ ∈ dPrj_Δ(D_init^{A,F_S}), acc_Δ(Q′) ∩ F ≠ ∅ }.
The transition rules in Δ^cgr are provided by the inference rules in Figure 15. To illustrate the construction and why determinization is needed, we reconsider Example 3, where schema restrictions have consequences that may at first sight be counter-intuitive.
Example 12 (Determinization during congruence projection is needed.).
Reconsider Example 3 with signature Σ = { a } , pattern L = { a } , and schema S = L { · a } . This language can be defined by the dSha A in Figure 16. The schema S can be defined by the same automaton but with the schema-final states F S = { 3 , 5 } and F = { 3 } . The dSha  A cgr ( S ) produced by congruence projection is shown in Figure 17. The unique hedge a of the language L = L ( A ) is accepted in state ( { 3 , 4 } , D 0 ) , where D0 = { ( 3 , 5 ) , ( 5 , 3 ) } . The unique hedge · a in S \ L is rejected: after reading the tree , the automaton goes to state ( { 3 , 4 } , D 0 ) where it goes to a projecting sink ( { 5 } , D 0 ) when reading the subsequent letter a. We note that any hedge in Σ is accepted in state ( { 3 , 4 } , D 0 ) by A cgr ( S ) , even though only the single hedge a belongs to L. This is sound given that the hedges in ( ε + a · a · a ) do not belong to schema S anyway. We notice that ( { 3 , 4 } , D 0 ) cannot be a projection state since the run on hedge · a must continue to the sink ( { 5 } , D 0 ) ; therefore, state ( { 3 , 4 } , D 0 ) is subhedge relevant.
Note that both individual states 3 and 4 are subhedge irrelevant for D0, while the subset {3, 4} containing both states is subhedge relevant for D0, since (3, 5) ∈ acc_Δ({3, 4})² ∩ D0. This shows that the deterministic construction is indeed needed to decide subhedge irrelevance in cases such as this example. Also, notice that state 0 is subhedge irrelevant for D1 = ∅. This is fine, since the acceptance of hedges of the form ⟨h⟩·h′ by A^cgr(S) does indeed not depend on h. In contrast, it depends on h′, so that the subset of states {3, 4} cannot be subhedge irrelevant for D0.
Finally, let us discuss how the transition rules of the dSha A^cgr(S) are inferred. The most interesting transition rule is ({0}, D₀) @ ({0}, D₁) → ({3, 4}, D₀) in Δ^cgr. It is produced by the following inference rule, where Q′ = P = {0}, acc_Δ(P) = {0, 1, 3, 4, 5}, Q″ = {3, 4}, D = D₀, and D_{Q′} = D₁:
If Q′ @ acc_Δ(P) → Q″ in Δ_det, Q′ ∉ dPrj_Δ(D), and P ∈ dPrj_Δ(D_{Q′}), then (Q′, D) @ (P, D_{Q′}) → (Q″, D) in Δ^cgr.
This rule states that, if P is subhedge irrelevant for D_{Q′}, then all states accessible from P must be tried out, since all of them could be reached by some different subhedge that got projected away. As a result, due to subhedge projection, one cannot know into which state of Q″ one should go. In order to stay deterministic, we thus go into all possible states, i.e., into the subset Q″. In the example, these are all the states that can be reached when reading some hedge in the pattern { ⟨h⟩ | h ∈ H_Σ }, given that the subhedges h ∈ H_Σ are not inspected with subhedge projection.
Example 13 (XPath filter [ self : : list / child : : item ] with schema doc .).
The congruence projection of the example dSha in Figure 4 is given in Figure 18. This dSha is similar to the safe-no-change projection in Figure 7, except that the useless state 2 is removed and the four projection states are now looping in themselves rather than going to a shared looping state Π. It should also be noticed that only singleton state sets are used there, so no determinization is needed. As we will see, this is typical for experiments on regular XPath queries where larger subsets rarely occur.
Example 14 (Counter-example for completeness of the safe-no-change projection.).
Reconsider the counterexample for the completeness of the safe-no-change projection, i.e., the dSha in Figure 12 for the XPath query [child::item] with schema-final states F_S = {0, 5, 6}. Its congruence projection is shown in Figure 14. We note that the prefix item.item leads to the state ({2}, D3), which is a subhedge projection state since 2 is subhedge irrelevant for D3. In particular, note that acc_Δ({2}) = {2, 4}, so no two states accessible from 2 are different in D3. This means that state 2 may still be changed to 4, but then it does not become different with respect to D3. This resolves the incompleteness issue of the safe-no-change projection on this example.
The first property for states (Q′, D) assigned by the congruence projection is that Q′ is compatible with D. This means that no two states in Q′ are different with respect to D. Intuitively, each state of Q′ is as good as any other, except for leading out of the schema. So, if (Q′, D) is assigned to a prefix, and a suffix leads from some state in Q′ to F, then it cannot lead to F_S \ F from some other state in Q′.
Lemma 13. 
If the partial run of A^cgr(S) assigns a state (Q′, D) to a prefix, then Q′ is compatible with D.
We omit the proof since this instructive result will not be used directly later on. Still, compatibility will play an important role in the soundness proof.

7.3. Soundness

We next adapt the soundness result and proof from the safe-no-change projection to congruence projection.
Theorem 2 (Soundness of congruence projection.).
For any dSha A = (Σ, Q, Δ, I, F) and set F ⊆ F_S ⊆ Q, the congruence projection of A with respect to F_S preserves the language of A within the schema S = L(A[F/F_S]), i.e.,
L(A^cgr(S)) ∩ S = L(A).
We note that S is a schema for A since F ⊆ F_S. Furthermore, A is schema-complete for S since A is deterministic and S = L(A[F/F_S]).
Proof. 
The proof of the inclusion L(A) ⊆ L(A^cgr(S)) ∩ S will be based on the following two claims:
Claim 2.1a. 
For all h ∈ H_Σ, D ∈ D_Δ, and Q′ ∈ dPrj_Δ(D): (Q′, D) –h→ (Q′, D) wrt Δ^cgr.
Proof. 
We prove Claim 2.1a by induction on the structure of h. Suppose that Q′ ∈ dPrj_Δ(D).
Case h = ⟨h′⟩. The induction hypothesis applied to h′ yields (Q′, D) –h′→ (Q′, D) wrt Δ^cgr. We can thus apply the following two inference rules:
If Q′ ∈ dPrj_Δ(D), then (Q′, D) –⟨⟩→ (Q′, D) in Δ^cgr.   If Q′ ∈ dPrj_Δ(D), then (Q′, D) @ (Q′, D) → (Q′, D) in Δ^cgr.
in order to close the following diagram with respect to Δ cgr :
Algorithms 17 00339 i005
This proves (Q′, D) –h→ (Q′, D) wrt Δ^cgr, as required by the claim.
Case 
h = a. Since Q′ ∈ dPrj_Δ(D), we can apply the inference rule:
If a ∈ Σ and Q′ ∈ dPrj_Δ(D), then (Q′, D) –a→ (Q′, D) in Δ^cgr.
This proves this case of the claim.
Case 
h = ε. We trivially have (Q′, D) –ε→ (Q′, D) wrt Δ^cgr.
Case 
h = h′·h″. By the induction hypothesis applied to h′ and h″, we obtain (Q′, D) –h′→ (Q′, D) and (Q′, D) –h″→ (Q′, D) wrt Δ^cgr. Hence, (Q′, D) –h′·h″→ (Q′, D) wrt Δ^cgr.
This ends the proof of Claim 2.1a. The next claim is the key of the soundness proof.   □
For any difference relation D, we define a binary relation dep_a(D) ⊆ 2^Q × 2^Q such that, for any two subsets of states Q′, Q″ ⊆ Q, (Q′, Q″) ∈ dep_a(D) if and only if both of the following hold:
(1a) Q″ ∉ dPrj_Δ(D) ⟹ Q′ ⊆ Q″;
(2a) Q″ ∈ dPrj_Δ(D) ⟹ ( Q′ ⊆ acc_Δ(Q″) ∧ Q′ ∈ dPrj_Δ(D) ).
Furthermore, note that Q″ ∈ dPrj_Δ(D) and Q′ ⊆ acc_Δ(Q″) imply Q′ ∈ dPrj_Δ(D), and thus, (Q′, Q″) ∈ dep_a(D).
Claim 2.2a. 
Let h ∈ H_Σ be a hedge, Q′, Q″ ⊆ Q subsets of states, and D ∈ D_Δ a difference relation. If Q′ –h→ Q″ wrt Δ_det, then there exists Q‴ ⊆ Q such that (Q′, D) –h→ (Q‴, D) wrt Δ^cgr and (Q″, Q‴) ∈ dep_a(D).
Proof. 
If Q′ ∈ dPrj_Δ(D), then (Q′, D) –h→ (Q′, D) wrt Δ^cgr by Claim 2.1a. Let Q‴ = Q′. We then have Q″ ⊆ acc_Δ(Q‴) and Q‴ ∈ dPrj_Δ(D), and thus Q″ ∈ dPrj_Δ(D), so that (1a) and (2a) of (Q″, Q‴) ∈ dep_a(D) hold. So, it is sufficient to prove the claim under the assumption that Q′ ∉ dPrj_Δ(D). The proof is by induction on the structure of h.
Case 
h = h . The assumption Q h Q wrt Δ det shows that there exists a subset of states P Q closing the following diagram:
Algorithms 17 00339 i006
In particular, we have Q @ P Q in Δ det . Let D = D Q . Since Q dPrj Δ ( D ) , we can infer the following:
Q dPrj Δ ( D ) ( Q , D ) ( Δ , D ) in Δ cgr
We have to distinguish two cases depending on whether Δ belongs to dPrj Δ ( D ) or not.
Subcase 
Δ dPrj Δ ( D ) . The induction hypothesis applied to h yields the existence of P Q , such that ( P , P ) dep a ( D ) and
( Δ , D ) h ( P , D ) wrt Δ cgr .
Subsubcase 
P dPrj Δ ( D ) . Since ( P , P ) dep a ( D ) , it follows that P P . Hence, P dPrj Δ ( D ) . Let Q Q be such that Q @ P Q in Δ det . We can then apply the following inference rule:
Q @ P Q in Δ det Q dPrj Δ ( D ) P dPrj Δ ( D ) ( Q , D ) @ ( P , D ) ( Q , D ) in Δ cgr
Hence, we can close the diagram as follows:
Algorithms 17 00339 i007
This shows that ( Q , D ) h ( Q , D ) wrt Δ cgr . Since P P , Q @ P Q , and Q @ P Q in Δ det , it follows that Q Q , and thus, ( Q , Q ) dep a ( D ) . This shows the claim in this case.
Subsubcase 
P dPrj Δ ( D ) . Since ( P , P ) dep a ( D ) , it follows that P acc Δ ( P ) and P dPrj Δ ( D ) . Hence, we can apply the following inference rule for some Q H Σ :
Q @ acc Δ ( P ) Q in Δ det Q dPrj Δ ( D ) P dPrj Δ ( D ) ( Q , D ) @ ( P , D ) ( Q , D ) in Δ cgr
Hence, we can close the diagram as follows:
Algorithms 17 00339 i008
This shows that ( Q , D ) h ( Q , D ) wrt Δ cgr . Since P acc Δ ( P ) , it follows from Q @ P Q and Q @ acc Δ ( P ) Q in Δ det that Q Q . Thus, ( Q , Q ) dep a ( D ) , so the claim holds in this case too.
Subcase 
Δ Prj Δ ( D ) . Claim 2.1a shows that ( Δ , D ) h ( Δ , D ) wrt Δ cgr . Let Q be such that Q @ acc Δ ( Δ ) Q in Δ det . We can apply the following inference rule:
Q @ acc Δ ( Δ ) Q in Δ det Q dPrj Δ ( D ) Δ Prj Δ ( D ) ( Q , D ) @ ( Δ , D ) ( Q , D ) in Δ cgr
We can thus close the diagram below as follows with respect to Δ cgr :
Algorithms 17 00339 i009
This shows that ( Q , D ) h ( Q , D ) wrt Δ cgr . Since P acc Δ ( Δ ) , it follows from Q @ P Q and Q @ acc Δ ( Δ ) Q in Δ det that Q Q , and thus, ( Q , Q ) dep a ( D ) .
Case 
h = a . Since Q dPrj Δ ( D ) , we can apply the inference rule:
Q a Q in Δ det Q dPrj Δ ( D ) ( Q , D ) a ( Q , D ) in Δ cgr
Hence, ( Q , D ) h ( Q , D ) wrt Δ cgr . With Q = Q , the claim follows since ( Q , Q ) dep a ( D ) .
Case 
h = ε . In this case, we have Q = Q and ( Q , D ) ε ( Q , D ) wrt Δ cgr , so the claim holds.
Case 
h = h 1 · h 2 . Since Q h Q wrt Δ det , there exists Q 1 Q , such that Q h 1 Q 1 and Q 1 h 2 Q wrt Δ det . By the induction hypothesis applied to h 1 , there exists Q 1 Q , such that ( Q , D ) h 1 ( Q 1 , D ) wrt Δ cgr and ( Q 1 , Q 1 ) dep a ( D ) .
Subcase 
Q 1 dPrj Δ ( D ) . Since ( Q 1 , Q 1 ) dep a ( D ) , it follows that Q 1 acc Δ ( Q 1 ) and Q 1 dPrj Δ ( D ) . Furthermore, Claim 2.1a shows that ( Q 1 , D ) h 2 ( Q 1 , D ) , and hence, ( Q , D ) h ( Q 1 , D ) wrt Δ cgr . Since Q acc Δ ( Q 1 ) and Q 1 acc Δ ( Q 1 ) , it follows that Q acc Δ ( Q 1 ) . Hence, ( Q 1 , Q ) dep a ( D ) and ( Q , D ) h ( Q 1 , D ) .
Subcase 
Q 1 dPrj Δ ( D ) . In this case, we can apply the induction hypothesis to Q 1 h 2 Q wrt Δ det , showing that there exists Q Q , such that ( Q 1 , D ) h 2 ( Q , D ) wrt Δ cgr and ( Q , Q ) dep a ( D ) . Hence, ( Q , D ) h ( Q , D ) wrt Δ cgr and ( Q , Q ) dep a ( D ) .
This ends the proof of Claim 2.2a.   □
Proof of Inclusion L ( A ) L ( A cgr ( S ) ) S . 
Since S is a schema for A, we have L ( A ) S , so that it is sufficient to show L ( A ) L ( A cgr ( S ) ) . Let h L ( A ) . Then, there exist q 0 I and q F , such that q 0 h q wrt Δ . Let Q = { q } . Since A is deterministic, it follows that I h Q wrt Δ det . Using Claim 2.2a, this implies the existence of a subset of states Q Q , such that ( Q , Q ) dep a ( D ) and ( I , D i n i t A , F S ) h ( Q , D i n i t A , F S ) wrt Δ cgr . Furthermore, ( I , D i n i t A , F S ) I cgr ( S ) . In order to prove h L ( A cgr ( S ) ) , it is thus sufficient to show that ( Q , D i n i t A , F S ) F cgr ( S ) . We distinguish two cases:
Case 
Q dPrj Δ ( D i n i t A , F S ) . Condition ( 1 a ) of ( Q , Q ) dep a ( D ) shows that Q Q so that q Q . Since q F , it follows that Q F . Thus, ( Q , D i n i t A , F S ) F cgr ( S ) , so that h L ( A cgr ( S ) ) .
Case 
Q dPrj Δ ( D i n i t A , F S ) . Condition ( 2 a ) of ( Q , Q ) dep a ( D ) yields Q acc Δ ( Q ) . Hence, q acc Δ ( Q ) F so that ( Q , D i n i t A , F S ) F cgr ( S ) , and thus, h L ( A cgr ( S ) ) .
This ends the proof of the first inclusion.   □
We next want to show the inverse inclusion L ( A cgr ( S ) ) S L ( A ) . It will eventually follow from the next two claims.
Claim 2.1b. 
For any hedge h H Σ , difference relation D D Δ , projection state Q dPrj Δ ( D ) and state μ Q cgr : if ( Q , D ) h μ wrt Δ cgr , then μ = ( Q , D ) .
Proof. 
By induction on the structure of h H Σ , suppose that ( Q , D ) h μ wrt Δ cgr :
Case 
h = h . There must exist states μ 1 , μ 1 Q cgr closing the following diagram:
Algorithms 17 00339 i010
Since Q dPrj Δ ( D ) , the following rule must have been applied to infer ( Q , D ) μ 1 wrt Δ cgr :
Q dPrj Δ ( D ) ( Q , D ) ( Q , D ) in Δ cgr
Therefore, μ 1 = ( Q , D ) . The induction hypothesis applied to ( Q , D ) h μ 1 wrt Δ cgr shows that μ 1 = ( Q , D ) , too. So, μ must have been inferred by applying the rule:
Q dPrj Δ ( D ) ( Q , D ) @ ( Q , D ) ( Q , D ) in Δ cgr
Hence, μ = ( Q , D ) , as required.
Case 
h = a . The following rule must be applied:
a Σ Q dPrj Δ ( D ) ( Q , D ) a ( Q , D ) in Δ cgr
Hence, μ = ( Q , D ) .
Case 
h = ε . Obvious.
Case 
h = h 1 · h 2 . There must exist some μ 1 , such that ( Q , D ) h 1 μ 1 h 2 μ wrt Δ cgr . Using the induction hypothesis applied to h 1 , we have μ 1 = ( Q , D ) . We can thus apply the induction hypothesis to h 2 to obtain μ = ( Q , D ) .
This ends the proof of Claim 2.1b.   □
We next need an inverse of Claim 2.2a. For relating runs A cgr back to runs of A, we define for any difference relation D another binary relation dep b ( D ) 2 Q × 2 Q , such that for any two subsets of states Q , Q Q ,
( Q , Q ) dep b ( D ) ( Q × Q ) D = ( 0 b ) Q dPrj Δ ( D ) Q Q ( 1 b ) Q dPrj Δ ( D ) Q acc Δ ( Q ) ( 2 b )
Claim 2.2b. 
Let Q Q be a subset of states that is compatible with a difference relation D D Δ . For any hedge h H Σ and state μ Q cgr with ( Q , D ) h μ wrt Δ cgr , there exist a pair of subsets of states ( Q , Q ) dep b ( D ) , such that μ = ( Q , D ) and Q h Q wrt. Δ det .
Proof. 
If Q dPrj Δ ( D ) , then Claim 2.1b shows that μ = ( Q , D ) . Let Q = Q and Q be the unique subset of states, such that Q h Q wrt. Δ det . We have to show that ( Q , Q ) dep b ( D ) . Since Q h Q , we have Q acc Δ ( Q ) , so condition ( 2 b ) holds. Condition ( 1 b ) holds trivially since Q dPrj Δ ( D ) . For condition ( 0 b ), note that Q dPrj Δ ( D ) implies acc Δ ( Q ) 2 D = . Furthermore, Q × Q acc Δ ( Q ) 2 , so that ( Q × Q ) D = as required.
We can thus assume that Q dPrj Δ ( D ) . The proof is by induction on the structure of h H Σ . We distinguish all the possible forms of hedges h H Σ :
Case 
h = h . By definition of ( Q , P ) h μ wrt Δ cgr , there must exist μ 1 , μ 1 Q cgr , such that the following diagram can be closed:
Algorithms 17 00339 i011
Since Q dPrj Δ ( D ) , the following inference rule was applied to infer ( Q , D ) μ 1 wrt Δ cgr , where D = D Q :
Q dPrj Δ ( D ) ( Q , D ) ( Δ , D ) in Δ cgr
Hence, μ 1 = ( Δ , D ) . If Δ dPrj Δ ( D ) , then the induction hypothesis applied to ( Δ , D ) h μ 1 wrt Δ cgr shows that there exists ( P , P ) dep b ( D ) , such that μ 1 = ( P , D ) and Δ h P wrt Δ det . Otherwise, the same can be concluded as we argued at the beginning.
Subcase 
P dPrj Δ ( D ) . The transition rule ( Q , D ) @ μ 1 μ must be inferred as follows for Q Q :
Q @ acc Δ ( P ) Q in Δ det Q dPrj Δ ( D ) P dPrj Δ ( D ) ( Q , D ) @ ( P , D ) ( Q , D ) in Δ cgr
This shows that μ = ( Q , D ) . So, we have the following diagram:
Algorithms 17 00339 i012
Let Q be the unique subset of states, such that Q @ P Q wrt Δ det . We can then close the following diagram:
Algorithms 17 00339 i013
From ( P , P ) dep b ( D ) , it follows that ( P × P ) D = , and thus, ( Q × Q ) D = , i.e., condition ( 0 b ) of ( Q , Q ) dep b ( D ) . Since P dPrj Δ ( D ) , condition ( 2 b ) of ( P , P ) dep b ( D ) yields P acc Δ ( P ) . Furthermore, Q @ P Q and Q @ acc Δ ( P ) Q in Δ det , which yields Q Q so that conditions ( 1 b ) and ( 2 b ) of ( Q , Q ) dep b ( D ) follow. Hence, ( Q , Q ) dep b ( D ) . In summary, we have shown that μ = ( Q , D ) , Q h Q wrt Δ det , and ( Q , Q ) dep b ( D ) , as required by the claim.
Subcase 
P dPrj Δ ( D ) . The transition rule ( Q , D ) @ μ 1 μ must thus be inferred as follows for Q Q :
Q @ P Q in Δ det Q dPrj Δ ( D ) P dPrj Δ ( D ) ( Q , D ) @ ( P , D ) ( Q , D ) in Δ cgr
This shows that μ = ( Q , D ) , and we can close the following diagram:
Algorithms 17 00339 i014
Condition ( 0 b ) of ( P , P ) dep b ( D ) yields ( P × P ) D = . Let Q be the unique subset of states, such that Q @ P Q wrt. Δ det . Since D down Q Δ ( D ) and ( P × P ) D = , it follows from Q @ P Q and Q @ P Q wrt. Δ det that ( Q × Q ) D = . That is, condition ( 0 b ) of ( Q , Q ) dep b ( D ) . Since P dPrj Δ ( D ) , condition ( 1 b ) of ( P , P ) dep b ( D ) is P P . From Q @ P Q and Q @ P Q , it thus follows that Q Q . Hence, conditions ( 1 b ) and ( 2 b ) of ( Q , Q ) dep b ( D ) are valid, so that ( Q , Q ) dep b ( D ) holds. Furthermore,
Algorithms 17 00339 i015
This shows that Q h Q wrt Δ . The other two requirements of the claim, μ = ( Q , D ) and ( Q , Q ) dep b ( D ) , were shown earlier.
Case 
h = a . Since Q dPrj Δ ( D ) , the following inference rule must be used:
Q a Q in Δ Q dPrj Δ ( D ) ( Q , D ) a ( Q , D ) in Δ cgr
So, μ = ( Q , D ) and Q a Q wrt Δ det . Furthermore, since Q is compatible with D, and D is a difference relation, Q is compatible with D, too. Hence, ( Q × Q ) D = , showing condition ( 0 b ) of ( Q , Q ) dep b ( D ) . Trivially, Q Q acc Δ ( Q ) so conditions ( 1 b ) and ( 2 b ) of ( Q , Q ) dep b ( D ) hold, too. Hence, ( Q , Q ) dep b ( D ) .
Case 
h = ε . Obvious.
Case 
h = h 1 · h 2 . Since ( Q , D ) h μ wrt. Δ cgr , there exists some μ 1 Q cgr , such that ( Q , D ) h 1 μ 1 and μ 1 h 2 μ wrt Δ cgr . We can apply the induction hypothesis to h 1 . Hence, there exist subsets of states Q 1 , Q 1 Q , such that μ 1 = ( Q 1 , D ) , Q h 1 Q 1 wrt Δ , and ( Q 1 , Q 1 ) dep b ( D ) . We distinguish two cases:
Subcase 
Q 1 dPrj Δ ( D ) . Since μ 1 = ( Q 1 , D ) h 2 μ wrt Δ cgr , Claim 2.1b shows that μ = ( Q 1 , D ) so that ( Q 1 , D ) h 2 ( Q 1 , D ) wrt. Δ cgr . Condition ( 2 b ) of ( Q 1 , Q 1 ) dep b ( D ) and Q 1 dPrj Δ ( D ) imply that Q 1 acc Δ ( Q 1 ) . Let Q 2 be the unique subset of states, such that Q 1 h 2 Q 2 wrt. Δ det . Then, Q 2 acc Δ ( Q 1 ) so that Q 2 acc Δ ( Q 1 ) , and thus, condition ( 2 b ) of ( Q 1 , Q 2 ) dep b ( D ) holds. Condition ( 1 b ) of ( Q 1 , Q 2 ) dep b ( D ) is trivial since Q 1 dPrj Δ ( D ) . Since Q 1 dPrj Δ ( D ) and Q 1 × Q 2 acc Δ ( Q ) 2 , it follows that Q 1 × Q 2 D = so condition ( 0 b ) of ( Q 1 , Q 2 ) dep b ( D ) holds, too. Hence, Q h Q 2 wrt Δ det and ( Q 1 , Q 2 ) dep b ( D ) .
Subcase 
Q 1 dPrj Δ ( D ) . Since Q 1 h 1 Q 1 and Q 1 is compatible with difference relation D, it follows that Q 1 is compatible with D, too. We can thus apply the induction hypothesis to ( Q 1 , D ) h 2 μ , showing the existence of subsets of states ( Q 2 , Q 2 ) dep b ( D ) , such that μ = ( Q 2 , D ) and Q 1 h 2 Q 2 wrt Δ . So, we have Q h Q 2 wrt. Δ and ( Q 2 , Q 2 ) dep b ( D ) , as required.
This ends the proof of Claim 2.2b.   □
Proof of Inclusion L ( A cgr ( S ) ) S L ( A ) . 
Let h L ( A cgr ( S ) ) S . Then, there exists a final subset of states Q F cgr ( S ) , such that ( I , D i n i t A , F S ) h ( Q , D i n i t A , F S ) wrt Δ cgr . Since I is a singleton or empty, it is compatible with D i n i t A , F S . Claim 2.2b shows that there exists a subset of states Q Q , such that I h Q wrt. Δ det and ( Q , Q ) dep b ( D i n i t A , F S ) . Condition ( 0 b ) of ( Q , Q ) dep b ( D i n i t A , F S ) shows that ( Q × Q ) D i n i t A , F S = . Since h S , it follows that Q F S . The determinism of A shows that Q is a singleton. So, there exists a state q F S , such that Q = { q } .
Case 
Q dPrj Δ ( D i n i t A , F S ) . Condition ( 1 b ) shows that Q Q so that q Q . Since Q F cgr , we have Q F . Since Q is compatible with D i n i t A , F S , q Q , and Q F , it follows that q F S \ F . Since q F S , this implies q F , and thus, h L ( A ) .
Case 
Q dPrj Δ ( D i n i t A , F S ) . Condition ( 2 b ) shows that Q acc Δ ( Q ) so that q acc Δ ( Q ) . Since Q F cgr , we have acc Δ ( Q ) F . Let q acc Δ ( Q ) F be arbitrary. Since Q dPrj Δ ( D i n i t A , F S ) and ( q , q ) acc Δ ( Q ) 2 , it follows that ( q , q ) D i n i t A , F S . Furthermore, q F so that q F S \ F . In combination with q F S , this implies q F S so h L ( A ) .
This ends the proof of the inverse inclusion, and thus, of L ( A ) = L ( A cgr ( S ) ) S .   □

7.4. Completeness

We next show the completeness of congruence projection for subhedge projection according to Definition 10. Let A = ( Σ , Q , Δ , I , F ) be a complete dSha and F F S Q . Automaton A defines the regular pattern L = L ( A ) and, via the schema-final states in F S , the regular schema S = L ( A [ F / F S ] ) . We first show that all subhedge irrelevant states of difference relations are subhedge projection states of A cgr ( S ) according to Definition 7.
Lemma 14. 
If Q dPrj Δ ( D ) , then ( Q , D ) is a subhedge projection state of Δ cgr with witness ( Q , D ) .
Proof. 
We assume that Q dPrj Δ ( D ) and have to show that ( Q , D ) is a subhedge projection state of Δ cgr . For this, we show that all transition rules starting in ( Q , D ) are permitted for a subhedge projection state with witness ( Q , D ) . They are generated by the following rules, all of which are looping:
a Σ Q dPrj Δ ( D ) ( Q , D ) a ( Q , D ) in Δ cgr Q dPrj Δ ( D ) ( Q , D ) @ ( Q , D ) ( Q , D ) in Δ cgr Q dPrj Δ ( D ) ( Q , D ) ( Q , D ) in Δ cgr
   □
So, if a partial run of A cgr ( S ) on a prefix u assigns some state ( P , D ) with P dPrj Δ ( D ) , then ( P , D ) is a subhedge projection state by Lemma 14, and thus, subhedge irrelevant by Proposition 2.
Lemma 15. 
Let A = A [ I / Δ ] , L = L ( A ) , and S = L ( A [ F / F S ] ) . If ( Δ , D ) hdg ( u ) ( Q , D ) wrt A cgr ( S ) for some nested word u N Σ and either
  • Q dPrj Δ ( D ) and q Q , or
  • Q dPrj Δ ( D ) and q acc Δ ( Q ) ,
then there exists a nested word u c l a s s S L ( u ) and q 0 Δ , such that q 0 hdg ( u ) q wrt Δ.
Proof. 
By induction on the length of u, suppose that ( Δ , D ) hdg ( u ) ( Q , D ) wrt A cgr ( S ) . In the base case, we have u = ε . Hence, Q = Δ .
Case 
Q dPrj Δ ( D ) and q Q . Then, Δ = { q } . The unique run of A on u = ε starts and ends in the tree initial state q Δ .
Case 
Q dPrj Δ ( D ) and q acc Δ ( Q ) . Since Q dPrj Δ ( D ) , Lemma 14 shows that ( Q , D ) is a subhedge projection state, so that ε is subhedge irrelevant for L and S by Proposition 2. Hence, c l a s s S L ( ε ) = N Σ . Furthermore, since q acc Δ ( Q ) , there exists a hedge h and q 0 I , such that q 0 h q wrt Δ . Let u = nw ( h ) . The run of A on u ends in q and u c l a s s S L ( u ) .
For the induction step, we distinguish the possible forms of the nested word u. So, there exist u ˜ , v N Σ and a Σ , such that either of the following cases holds:
Case 
u = u ˜ · · v · . Let ( Q ˜ , D ) be the state in which the run of ( A ) cgr ( S ) on u ˜ ends.
Subcase 
Q ˜ dPrj Δ ( D ) . Let D = D Q ˜ . Then, there exists P Q , such that ( Δ , D ) v ( P , D ) wrt A cgr ( S ) . So, the run of ( A ) cgr ( S ) on v goes to state ( P , D ) .
Subsubcase 
P dPrj Δ ( D ) . By construction of A cgr ( S ) , we then have Q = Q ˜ @ Δ acc Δ ( P ) . Since q Q , there exist q ˜ Q ˜ and p acc Δ ( P ) , such that q ˜ @ p q in Δ . By induction hypothesis, there exists v c l a s s S L ( v ) , such that the run of A on v ends in p. Hence, the run of A on u = u ˜ · · v · ends in q, and furthermore, u c l a s s S L ( u ) .
Subsubcase 
P dPrj Δ ( D ) . By construction of A cgr ( S ) , we then have Q = Q ˜ @ Δ P . Since q Q , there exist q ˜ Q ˜ and p P , such that q ˜ @ p q in Δ . By induction hypothesis, there exists v c l a s s S L ( v ) , such that the run of A on v ends in p. Hence, the run of A on u = u ˜ · · v · ends in q, and furthermore, u c l a s s S L ( u ) .
Subcase 
Q ˜ dPrj Δ ( D ) . Then, Q = Q ˜ and q acc Δ ( Q ˜ ) . Using the induction hypothesis applied to u ˜ , there exists u ˜ c l a s s S L ( u ˜ ) , such that Δ hdg ( u ˜ ) q . Since u ˜ is irrelevant for L and S , we have u ˜ c l a s s S L ( u ) .
Case 
u = u ˜ · a . This is easier to achieve than the previous cases and is thus omitted.
   □
Lemma 16. 
A partial run of A cgr ( S ) assigns a state ( Q , D ) to a prefix u prefs ( N Σ ) . If Q dPrj Δ ( D ) and q Q , then there exists a prefix u c l a s s S L ( u ) , such that a partial run of A assigns state q to u .
Proof. 
By induction on the length of u. In the base case, we have u = ε . The partial run of A cgr ( S ) on u assigns state ( Q , D ) , where Q = I and D = D i n i t A , F S . Suppose that Q dPrj Δ ( D ) and q Q . Then, I = { q } . The unique partial run of A on u = ε ends in the initial state q I . For the induction step, let A = A [ I / Δ ] , L = L ( A ) , and S = L ( A [ F / F S ] ) . We distinguish the possible forms of u. So, u ˜ prefs ( N Σ ) , v N Σ , and a Σ exist, such that either of the following holds:
Case 
u = u ˜ · · v . Let ( Q ˜ , D ˜ ) be the state in which the partial run of A cgr ( S ) on u ˜ ends. From Q dPrj Δ ( D ) , it follows that Q ˜ dPrj Δ ( D ˜ ) . Furthermore, ( Δ , D ) v ( Q , D ) wrt A cgr ( S ) . So, the partial run of ( A ) cgr ( S ) on v goes to state ( Q , D ) . Using the induction hypothesis, there exists v c l a s s S L ( v ) , such that the partial run of A on v ends in q. Hence, the partial run of A on u = u ˜ · · v ends in q, and furthermore, u c l a s s S L ( u ) .
Case 
u = u ˜ · · v · . Let ( Q ˜ , D ˜ ) be the state in which the partial run of A cgr ( S ) on u ˜ ends. From Q dPrj Δ ( D ) , it follows that Q ˜ dPrj Δ ( D ˜ ) . Furthermore, ( Δ , D ) v ( P , D ) wrt A cgr ( S ) for some P Q .
Subcase 
P dPrj Δ ( D ) . Then, Q = Q ˜ @ Δ acc Δ ( P ) . So, there exists q ˜ Q and p acc Δ ( P ) , such that q ˜ @ p q in Δ . Using Lemma 15, there exists v c l a s s S L ( v ) , such that Δ hdg ( v ) p wrt Δ . Using the induction hypothesis applied to u ˜ , there exists an initial state q 0 I and a prefix u ˜ c l a s s S L ( u ˜ ) , such that the partial run of A on u ˜ goes on to state q ˜ . Hence, the partial run of A on u = u ˜ · · v · goes on to q and u c l a s s S L ( u ) .
Subcase 
P dPrj Δ ( D ) . Then, Q = Q ˜ @ Δ P . So, there exists q ˜ Q and p P , such that q ˜ @ p q in Δ . Using Lemma 15, there exists v c l a s s S L ( v ) , such that Δ hdg ( v ) p wrt Δ . Using the induction hypothesis applied to u ˜ , there exists an initial state q 0 I and a prefix u ˜ c l a s s S L ( u ˜ ) , such that the partial run of A on u ˜ goes on to state q ˜ . Hence, the partial run of A on u = u ˜ · · v · goes on to q and u c l a s s S L ( u ) .
Case 
u = u ˜ · a . This is easier to achieve than the previous case and is thus omitted.
   □
The next lemma states the key invariant of congruence projection, which eventually proves its completeness for subhedge projection. For this, we define for any F F S Q the binary relation F S F as the symmetric closure of F × ( F S \ F ) , i.e., for all ( q , q ) Q 2 :
q F S F q ( q F q F S \ F ) ( q F S \ F q F )
Lemma 17 (Key Invariant).
Let A = ( Σ , Q , Δ , I , F ) be a dSha, F F S Q , L = L ( A ) , and S = L ( A [ F / F S ] ) . If A cgr ( S ) has a partial run on prefix u prefs ( N Σ ) to state ( P , D ) 2 Q × D Δ , then for all ( p , p ) P 2 , ( r , r ) D , and ( v , v ) N Σ 2 , such that
p hdg ( v ) r wrt Δ p hdg ( v ) r wrt Δ
there exist ( u , u ) c l a s s S L ( u ) , w suffs ( N Σ ) , q F S F q , and q 0 I such that
q 0 hdg ( u · v · w ) q wrt Δ q 0 hdg ( u · v · w ) q wrt Δ
Proof. 
Induction on the number of dangling opening parentheses of u.
In the base case, u does not have any dangling opening parenthesis so u N Σ is a nested word. In this case, D = D i n i t A , F S . So, A cgr ( S ) has a partial run on nested word u N Σ to ( P , D i n i t A , F S ) , where P Q . Let ( p , p ) P 2 , ( r , r ) D i n i t A , F S , and ( v , v ) N Σ 2 , such that
p hdg ( v ) r wrt Δ p hdg ( v ) r wrt Δ
It then follows that P dPrj Δ ( D i n i t A , F S ) . Therefore, we can apply Lemma 16. It shows that there exist nested words ( u , u ) ( c l a s s S L ( u ) ) 2 and q 0 I , such that
q 0 hdg ( u ) p wrt Δ q 0 hdg ( u ) p wrt Δ
Since D i n i t A , F S ldr Δ ( F × ( F S \ F ) ) by Lemma 10, there exist a nested word w N Σ and states q F S F q , such that
r hdg ( w ) q wrt Δ r hdg ( w ) q wrt Δ
Then, we have the following:
q 0 hdg ( u · v · w ) q wrt Δ q 0 hdg ( u · v · w ) q wrt Δ
For the induction step, let u = u 1 · · u 2 for some prefix u 1 prefs ( N Σ ) and nested word u 2 N Σ so that u 1 has one open dangling bracket less than u. Let A cgr ( S ) have a partial run on nested word u N Σ to ( P , D ) . Let ( P 1 , D 1 ) be the state that this run assigns to prefix u 1 . Then, D = ( D 1 ) P 1 . Let ( p , p ) P 2 , ( r , r ) D , and ( v , v ) N Σ 2 , such that
p hdg ( v ) r wrt Δ p hdg ( v ) r wrt Δ
Since D = ( D 1 ) P 1 , there exist ( p 1 , p 1 ) ( P 1 ) 2 and states ( r 1 , r 1 ) D 1 , such that p 1 @ q r 1 and p 1 @ q r 1 . In particular, P 1 dPrj Δ ( D 1 ) so that we can apply Lemma 16. It shows that there exist ( u 2 , u 2 ) c l a s s S L ( u 2 ) , such that Δ u 2 p and Δ u 2 p . Let v 1 = · u 2 · v · and v 1 = · u 2 · v · . Then, p 1 hdg ( v 1 ) r 1 , p 1 hdg ( v 1 ) r 1 wrt Δ . Using the induction hypothesis applied to u 1 , on which A cgr ( S ) has a partial run to state ( P 1 , D 1 ) , such that ( p 1 , p 1 ) ( P 1 ) 2 , p 1 hdg ( v 1 ) r 1 , p 1 hdg ( v 1 ) r 1 , and ( r 1 , r 1 ) D , there exist ( u 1 , u 1 ) c l a s s S L ( u 1 ) , w 1 suffs ( N Σ ) , q F S F q , and q 0 I , such that
q 0 hdg ( u 1 · v 1 · w 1 ) q wrt Δ q 0 hdg ( u 1 · v 1 · w 1 ) q wrt Δ
Let u = u 1 · · u 2 , u = u 1 · · u 2 , and w = · w 1 . The above then yields the following:
q 0 hdg ( u · v · w ) q wrt Δ q 0 hdg ( u · v · w ) q wrt Δ
Furthermore, ( u , u ) ( c l a s s S L ( u ) ) 2 , so this was to be shown.   □
Proposition 6. 
Let A = ( Σ , Q , Δ , I , F ) be a dSha, F F S Q , L = L ( A ) , and S = L ( A [ F / F S ] ) . If A cgr ( S ) has a partial run on prefix u prefs ( N Σ ) to some state ( P , D ) 2 Q × D Δ so that P dPrj Δ ( D ) , then c l a s s S L ( u ) is subhedge relevant for L wrt S .
Proof. 
Suppose that A cgr ( S ) has a partial run on prefix u prefs ( N Σ ) to some state ( P , D ) 2 Q × D Δ so that P dPrj Δ ( D ) . Since P dPrj Δ ( D ) , the intersection acc Δ ( P ) 2 D is not empty. So, there exist ( p , p ) P 2 , ( r , r ) D , and ( v , v ) N Σ 2 , such that
p hdg ( v ) r wrt Δ p hdg ( v ) r wrt Δ
Using Lemma 17, there exist ( u , u ) c l a s s S L ( u ) , w suffs ( N Σ ) , q F S F q , and q 0 I , such that
q 0 hdg ( u · v · w ) q wrt Δ q 0 hdg ( u · v · w ) q wrt Δ
Hence, c l a s s S L ( u ) is relevant for L wrt S .   □
Theorem 3 (Completeness of congruence projection).
For any complete dSha  ( Σ , Q , Δ , I , F ) and set F F S Q , the congruence projection A cgr ( S ) is sound and complete for subhedge projection for the regular pattern L ( A ) wrt the regular schema S = L ( A [ F / F S ] ) .
Proof. 
The soundness of congruence projection is shown in Theorem 2. For proving completeness, let u prefs ( N Σ ) be a nested word prefix that is strongly subhedge irrelevant for L wrt S . By definition of strong irrelevance, the class c l a s s S L ( u ) is irrelevant for L and S . Let ( P , D ) be the unique state assigned by the partial run of A cgr ( S ) to prefix u. Since c l a s s S L ( u ) is irrelevant for L and S , Proposition 6 shows that P dPrj Δ ( D ) . Using Lemma 14, ( P , D ) is then a subhedge projection state of A cgr ( S ) .   □

7.5. In-Memory Complexity

We next discuss the complexity of membership testing with complete subhedge projection based on the in-memory evaluation of the input hedge by the congruence projection of the input automaton.
Lemma 18. 
The number of states | Q cgr | is in O ( 2 n 2 + n ) , where n is the number of states of A.
Proof. 
With the deterministic construction, the states of A cgr ( S ) are pairs in 2 Q × D Δ . So, the maximal number of states of the congruence projection is | Q cgr ( S ) | = 2 | Q | · | D Δ | ≤ 2 n · 2 n 2 = 2 n 2 + n .   □
Let diff-rel ( Q cgr ( S ) ) be the set of difference relations used in the states of A cgr ( S ) .
Corollary 1. 
Let the signature Σ be fixed. For any complete dSha  A with schema S H Σ , the membership of any hedge h S can be tested in memory in time O ( 1 ) per nonprojected node of h after a preprocessing time of O ( | A cgr ( S ) | + n 3 d ) , where d is the cardinality of diff-rel ( Q cgr ( S ) ) and n is the number of states of A.
Since | Q cgr ( S ) | is in O ( 2 n 2 + n ) by Lemma 18, the preprocessing time O ( | A cgr ( S ) | + n 3 d ) is in O ( 2 2 n 2 + 2 n ) , too.
Proof. 
Since A is complete, Theorem 3 shows that A cgr ( S ) is complete for subhedge projection for schema S . The in-memory evaluator of A cgr ( S ) can be run on any input hedge h S in time O ( 1 ) per nonprojected node of h after a preprocessing time of O ( | A cgr ( S ) | + n 3 d ) , where d is the cardinality of diff-rel ( A cgr ( S ) ) . Since the signature Σ is fixed and A cgr ( S ) is deterministic, the dominating part | A cgr ( S ) | is in O ( | Q cgr ( S ) | 2 ) .   □
When applied to regular XPath query answering as in our experiments in Section 11, we have n ≤ 80 and d ≤ 25 , so the cubic part O ( n 3 d ) is not limiting. The sizes of the Shas are below 2600, counting both states and rules. The upper bound O ( | A cgr ( S ) | + n 3 d ) thus suggests that the preprocessing time may be feasible in practice, even though exponential in the worst case.
We must be careful here since the sizes of the Shas reported in Section 11 do not refer to the pure congruence projection A cgr ( S ) but to the earliest congruence projection A e ( S ) cgr ( S ) that we will introduce only in Section 8. The noncreation of transition rules to states that are safe for rejection may lead to a radical size reduction. Alternatively, an important size reduction in A cgr ( S ) could be obtained by removing all useless states and transition rules. But this would not reduce the precomputation time; on the contrary, it would increase the preprocessing time, since cleaning automata a posteriori also takes time, which for Shas may be nonlinear.
Instead of precomputing A cgr ( S ) from the inputs A and F S statically, we can compute on-the-fly only the part of A cgr ( S ) needed for evaluating the input hedge h. We note that the number of transition rules of the needed part is bounded by O ( | h | ) but may be much smaller due to projection. The preprocessing time then goes down to O ( | A | ) . Since the full automaton A cgr ( S ) may be exponential in n in the worst case, this may reduce the overall processing time.
On the positive side, on-the-fly evaluation reduces the overall computation time to O ( | h | + n 3 d ( h ) ) , where d ( h ) is the number of difference relations in the part of A cgr ( S ) needed for the evaluation of h, so d ( h ) ≤ min ( | h | , d ) . On the downside, d ( h ) of the steps will no longer take constant time but cubic time in O ( n 3 ) . It should also be noted that the overall time of the safe-no-change projection with on-the-fly evaluation is in O ( | A | + | h | + n s ( h ) ) , where s ( h ) is the number of subsets of safe states of A in states of Q snc , i.e., s ( h ) ≤ | h | . This is because each subset of safe states can be computed in linear time O ( n ) . As argued above, the overall time of the congruence projection with on-the-fly evaluation is in O ( | A | + | h | + n 3 d ( h ) ) , since the computation of a difference relation is in O ( n 3 ) by using ground Datalog programs of cubic size. The additional cubic factor in n 3 d ( h ) instead of a linear factor in n s ( h ) is the price to be paid for achieving complete subhedge projection with on-the-fly evaluation. Note that this price is largely acceptable in cases where n ≤ 80 , d ( h ) ≤ 15 , and  s ( h ) ≤ 15 .
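The on-the-fly evaluation scheme can be sketched generically. The following Python fragment is our own minimal illustration, not the paper's implementation: it works on ordinary word automata instead of Shas, and all names are ours. The point it demonstrates is that transitions of the (possibly exponential) deterministic automaton are computed on demand and memoized, so only the part reachable on the actual input is ever constructed.

```python
from functools import lru_cache

def make_lazy_evaluator(initial, step):
    """initial: frozenset of source states; step(states, letter) -> frozenset.
    Transitions are computed on demand; the cache is the constructed part."""
    @lru_cache(maxsize=None)
    def delta(states, letter):
        return step(states, letter)

    def run(word):
        states = initial
        for a in word:
            states = delta(states, a)
        return states

    return run, delta

# Nondeterministic transitions of a small example automaton.
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'a'): {2}, (1, 'b'): {2}}

def step(states, letter):
    return frozenset(q for p in states for q in nfa.get((p, letter), ()))

run, delta = make_lazy_evaluator(frozenset({0}), step)
result = run("aba")                        # subset reached on this input
constructed = delta.cache_info().currsize  # transitions built on demand
```

Only three transition rules are constructed for this input, while the full determinization of even this tiny example would build more; the same idea applies to building only the needed part of A cgr ( S ) during evaluation.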
Example 15 (Exponential blow-up.).
In order to see how the exponential worst case may happen, we consider a family of regular languages for which the minimal left-to-right Dfa is exponentially bigger than the minimal right-to-left Dfa. The classical example languages with this property are L n = Σ * . a . Σ n , where n N and Σ = { a , b } . Intuitively, a word in Σ * belongs to L n if and only if its ( n + 1 ) -th letter from the end is equal to a . The minimal left-to-right Dfa for L n has 2 n + 1 states since it needs to memorize a window of n + 1 letters. In contrast, the minimal right-to-left Dfa has only n + 1 states; in this direction, it is sufficient to memorize the distance from the end up to n + 1 .
We next consider the schema S defined by μ z . ( a + b ) · z . It contains all hedges that have in each of their subtrees a label from Σ = { a , b } as the first letter, and no further occurrences of letters of Σ elsewhere. We then consider the family of hedge languages H n H Σ with schema S , such that the sequence of labels of some root-to-leaf path of the hedge belongs to L n . Note that H n can be recognized in a bottom-up manner by the dSha  A n with O ( n + 1 ) states, which simulates the minimal deterministic  Dfa  of L n on all paths of the input hedge. For an evaluator with subhedge projection, the situation is different. When moving top-down, it needs to memorize the sequence of labels of the last n + 1 ancestors, possibly filled with b s , and there are 2 n + 1 such sequences. If, for some leaf, its sequence starts with an “a”, then the subsequent subhedges and leaves can be projected away. As a consequence, there cannot be any dSha with fewer than 2 n + 1 states that recognizes H n while projecting away all irrelevant subhedges. In particular, the size of A n cgr ( S ) must be exponential in the size of A n .
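The blow-up in Example 15 can be checked concretely. The following Python sketch is our own illustration and works on plain words over { a , b } , ignoring the hedge structure: it tests membership in L n and counts the subsets reachable when determinizing the natural left-to-right NFA for Σ * . a . Σ n , which is exactly 2 n + 1 .

```python
def in_Ln(word, n):
    # w is in L_n iff its (n+1)-th letter from the end is 'a'
    return len(word) > n and word[-(n + 1)] == 'a'

def reachable_subsets(n):
    """Count the states of the left-to-right determinization of the
    NFA for Sigma* . a . Sigma^n over Sigma = {a, b}."""
    # NFA states: 0 waits for the distinguished 'a'; state i in 1..n+1
    # means the distinguished 'a' was read i letters ago; n+1 is final.
    def step(subset, letter):
        out = {s + 1 for s in subset if 0 < s <= n}  # advance the counters
        out.add(0)                                   # keep waiting
        if letter == 'a':
            out.add(1)                               # guess this is the 'a'
        return frozenset(out)

    seen, todo = set(), [frozenset({0})]
    while todo:
        subset = todo.pop()
        if subset not in seen:
            seen.add(subset)
            todo.extend(step(subset, letter) for letter in 'ab')
    return len(seen)
```

For instance, reachable_subsets(3) yields 16 = 2 4 reachable subsets, matching the window of n + 1 letters that the left-to-right direction must memorize.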

8. Earliest Membership with Subhedge Projection

We next enhance our compilers for introducing subhedge projection states, such that they can also detect certain membership or nonmembership in an earliest manner. For this, we show how to combine the previous earliest membership tester for dShas from [21] with any subtree projection algorithm, so that it applies to both safe-no-change projection and congruence projection. By combining both aspects orthogonally, we obtain an earliest dSha with subhedge projection, which can be evaluated either in memory or in streaming mode.

8.1. Earliest Membership

We start with a semantic characterization of earliest membership, i.e., of prefixes of nested words of hedges whose suffixes are irrelevant for membership, when assuming top-down and left-to-right processing. Note that the same characterization up to various presentation choices was given earlier in the context of stream processing [21,22].
Definition 13. 
Let S H Σ be a hedge schema and L S a hedge language satisfying this schema. A nested word prefix v is certain for membership in L with schema S if for all suffixes w of nested words such that v · w nw ( S ) , it holds that v · w ∈ nw ( L ) . It is called certain for nonmembership in L with schema S if for all suffixes w of nested words such that v · w nw ( S ) , it holds that v · w ∉ nw ( L ) .
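Definition 13 can be made concrete with a small state-based check. The following Python sketch is our own illustration, transposed to ordinary word DFAs (the paper's construction for Shas is given in Appendix A); with final states F ⊆ F_S as in our setting, a prefix reaching state q is certain for membership iff every F_S-state reachable from q lies in F, and certain for nonmembership iff none does. Prefixes without any schema-compatible continuation are left undecided here.

```python
def reachable(delta, q, alphabet):
    """All states reachable from q via transitions in delta."""
    seen, todo = set(), [q]
    while todo:
        p = todo.pop()
        if p not in seen:
            seen.add(p)
            todo.extend(delta[p, a] for a in alphabet if (p, a) in delta)
    return seen

def certainty(delta, q, F, F_S, alphabet='ab'):
    hits = reachable(delta, q, alphabet) & F_S  # schema-compatible endpoints
    if hits and hits <= F:
        return "membership"
    if hits and not (hits & F):
        return "nonmembership"
    return "undecided"

# Example: L = words containing an 'a', schema S = all words over {a, b}.
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 1}
```

Here certainty(delta, 1, {1}, {0, 1}) returns "membership": as soon as the first a has been read, every schema-compatible continuation is accepted, so an earliest tester can select right there, while state 0 remains undecided.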
We are now interested in Shas that are able to detect certain membership and nonmembership syntactically at the earliest possible prefix or node of the run.
Definition 14. 
A dSha  A = ( Σ , Q , Δ , I , F ) with schema S H Σ is called earliest for schema S  if there exists a state sel Q , such that for all hedges h S , the unique run R of c o m p l ( A ) on h satisfies the following:
  • it goes to the state sel for exactly all those prefixes of nw ( h ) that are certain for membership in L ( A ) wrt S ;
  • it goes to the sink for all prefixes of nw ( h ) that are certain for nonmembership in L ( A ) with S .
For a dSha A that is schema-complete for schema S = L ( A [ F / F S ] ) , we can construct a dSha A e ( S ) that is earliest for S , as shown in Appendix A. The idea is to adapt the earliest membership tester for deterministic Shas from [21] so that it takes schema restrictions into account. Furthermore, we need to compile the input dShas to downward dShas rather than to dNwas, as used there. This change is straightforward once the relationship between both automata models becomes clear.
Given that the construction of A e ( S ) is not part of the core of the present paper, we only illustrate its outcome on our running example: for the dSha A with schema S = doc , in Figure 4, we obtain the Sha  A e ( S ) in Figure 19. Note that, on the prefix list item , it goes to the state sel , showing that it is certain for membership. On the prefix item , it blocks, showing that it is certain for nonmembership. However, on the prefix list list , it does not go into a subhedge projection state even though this prefix is subhedge irrelevant.

8.2. Adding Subhedge Projection

We now enhance earliest membership testers for dShas with subhedge projection. So, suppose that we are given a schema S and two automata A π and A e with schema S that recognize the same language up to the schema, i.e., L ( A π ) L ( S ) = L ( A e ) L ( S ) . Furthermore, assume that A e has a selection state sel , indicating certain membership at the earliest possible prefix.
In order to obtain earliest membership with subhedge projection, we combine the two Shas into a single Sha, A e π = ( Σ , Q e π ,   Δ e π , I e π , F e π ) , basically running both automata in parallel but under shared control. The state sets of A e π are as follows:
Q e π = ( Q π × ( Q e ∖ { sel } ) ) ∪ { sel } I e π = ( I π × ( I e ∖ { sel } ) ) ∪ { sel ∣ sel ∈ I e } F e π = ( F π × ( F e ∖ { sel } ) ) ∪ { sel }
The transition rules in Δ e π are given in Figure 20. Any run of A e π synchronizes parallel runs of A π and A e as follows: whenever A e goes to sel , then A e π does so too, and whenever A π goes into a subhedge projection state, then it makes A e jump over the subsequent subhedge.
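The combined control can be sketched as a single step function over pairs of states. Below is a minimal Python sketch, assuming both automata are given as letter-transition maps (dicts from state–letter pairs to states) and that the selection state is represented by the illustrative constant `SEL`; the jump over subhedges triggered by projection states of A π is omitted for brevity.

```python
# Hypothetical sketch of one letter transition of the combined automaton.
# `delta_pi` and `delta_e` are illustrative transition maps; `SEL` stands
# for the selection state sel of the earliest automaton.
SEL = "sel"

def combined_step(state, letter, delta_pi, delta_e):
    """One letter transition of the combined automaton A^{e,pi}."""
    if state == SEL:                  # certain membership already detected:
        return SEL                    # sel is absorbing
    q, r = state                      # pair of a projection and an earliest state
    r2 = delta_e[(r, letter)]
    if r2 == SEL:                     # A^e became certain about membership:
        return SEL                    # collapse the pair to sel
    return (delta_pi[(q, letter)], r2)
```

For instance, a pair whose earliest component reaches `sel` collapses to `sel`, while all other pairs advance componentwise.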
For illustration, we reconsider the dSha from Figure 4. The earliest dSha with subhedge projection A e ( S ) snc is given in Figure 21. Here, we combine the safe-no-change projection A snc from Figure 11 with the earliest membership tester A e ( S ) from Figure 19. Their combination A e ( S ) snc represents the pattern [ self : : list / child : : item ] with schema doc in a concise manner. After prefix list item , it detects certain membership; for the prefix list list , it detects that the subhedge is irrelevant; and after prefix item , it detects certain nonmembership.

8.3. Soundness and Completeness

We next show that the soundness and completeness of a projection algorithm are preserved when it is combined with an earliest membership tester, assuming determinism.
Proposition 7. 
Let A π and A e be dShas with the same schema S and the same language up to the schema, i.e.,  L ( A π ) L ( S ) = L ( A e ) L ( S ) . The combination dSha A e π then has the same language up to schema S , too. Furthermore, if  A e is earliest for S, then A e π is earliest for S and goes into some subhedge projection state whenever A π does.
Proof. 
If q is a subhedge projection state of A π and r a state of A e , then ( q , r ) is a subhedge projection state of A e π . The evaluator for automaton A e π runs A π and A e in parallel on the input hedge, while skipping subhedges starting in subhedge projection states of A π . These include the subhedges starting in a projection state when evaluating A e π on h. Applying Proposition 2 to A π , such subhedges are irrelevant for L ( A π ) with respect to S , since A π is assumed to be deterministic. Since L ( A π ) S = L ( A e ) S , skipping such a subhedge does not affect acceptance by A e for hedges inside schema S . Therefore, the evaluator of A e π is earliest with respect to S , too.   □
The above proposition shows that if A π is complete for subhedge projection, then A e π is also complete for subhedge projection while also being earliest.
Theorem 4. 
For any complete dSha A with schema S , the dSha  A cgr ( S ) e ( S ) is earliest and complete for subhedge projection for schema S .
Proof. 
The congruence projection A cgr ( S ) is sound by Proposition 2 and complete for subhedge projection wrt S by Theorem 3. The automaton A e ( S ) discussed in Section 8.1 is earliest for S. So, by Proposition 7, their combination A cgr ( S ) e ( S ) yields an earliest automaton with complete subhedge projection wrt S .   □

8.4. In-Memory Complexity

We next discuss the complexity of earliest membership testing with complete subhedge projection by an in-memory evaluator for A cgr ( S ) e ( S ) .
Lemma 19. 
The number of states in Q cgr ( S ) e ( S ) is in O ( 2^( n² + 2n + log ( n ) ) ) , where n = | Q | .
Proof. 
Using Lemma 18, the number of states of A cgr ( S ) is in O ( 2^( n² + n ) ) , where n = | Q | . The number of states of A e ( S ) is in O ( n · 2^n ) , and thus in O ( 2^( n + log ( n ) ) ) . The number of states of Q cgr ( S ) e ( S ) is thus in O ( 2^( n² + n ) · 2^( n + log ( n ) ) ) , which is in O ( 2^( n² + 2n + log ( n ) ) ) .   □
Corollary 2. 
Let the signature Σ be fixed. For any complete dSha  A with schema S ⊆ H Σ , earliest membership of an input hedge h ∈ S can be tested in memory in time O ( 1 ) per nonprojected node of h after a preprocessing time of O ( | A cgr ( S ) e ( S ) | + n³ d + n s ) , where d is the number of difference relations and s is the number of subsets of safe states in Q cgr ( S ) e ( S ) . The preprocessing time is also in O ( 2^( 2n² + 4n + 2 log ( n ) ) ) , where n is the number of states of A.
Proof. 
Since A is complete, Theorem 4 shows that A cgr ( S ) e ( S ) is earliest, sound, and complete for subhedge projection for schema S . The in-memory evaluator of A cgr ( S ) e ( S ) can be run on any input hedge h ∈ S in time O ( 1 ) per nonprojected node of h after a preprocessing time of O ( | A cgr ( S ) e ( S ) | + n³ d + n s ) . Since Σ is fixed, the size | A cgr ( S ) e ( S ) | is bounded by O ( | Q cgr ( S ) e ( S ) |² ) . Lemma 19 shows that | Q cgr ( S ) e ( S ) | is in O ( 2^( n² + 2n + log ( n ) ) ) , so | A cgr ( S ) e ( S ) | is in O ( 2^( 2n² + 4n + 2 log ( n ) ) ) .   □
As before, instead of precomputing A cgr ( S ) e ( S ) from the input dSha A and F S , we can compute only the part needed for evaluating the input hedge h on-the-fly. If the needed part of A cgr ( S ) e ( S ) for evaluating h is small, the overall time and space go down considerably.
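Such on-the-fly construction can be sketched by memoizing a transition constructor: only the transitions actually reached by the input hedge are ever built. The helper `base_step` below is a hypothetical stand-in for the (possibly expensive) construction of a single transition of A cgr ( S ) e ( S ) ; all names are illustrative.

```python
def make_lazy_step(base_step):
    """Memoize transitions: only the part of the automaton that is reached
    by the input hedge is ever constructed."""
    cache = {}
    def step(state, letter):
        key = (state, letter)
        if key not in cache:          # construct this transition on demand
            cache[key] = base_step(state, letter)
        return cache[key]
    return step
```

Repeated requests for the same transition hit the cache, so the construction cost is proportional to the part of the automaton that the input actually visits.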

9. Streaming Algorithms

We show that dShas can be evaluated on the nested word of a hedge in a streaming manner, so that they produce the same results and projection behavior as evaluating them in memory on hedges in a top-down and left-to-right manner.
All aspects related to subhedge projection remain unchanged. Earliest query answering in streaming mode means that irrelevant nested word suffixes are ignored, and not only irrelevant nested subwords as with subhedge projection. What changes between the two evaluation modes is memory consumption. For in-memory evaluation, the whole graph of the input hedge must be stored, while in streaming mode, only a state and a stack are maintained at any event. For testing regular language membership, the state space is finite, while the stack space is unbounded, since the depth of the stack depends on the depth of the input hedge.
The property of having equivalent streaming and in-memory evaluators was already noticed in [26] for Neumann and Seidl’s pushdown forest automata [25] and for nested word automata [11,15]. We here show how to transpose this result to dShas. This permits us to show the correctness of our evaluation algorithms based on properties of the constructed automata. It also permits obtaining correctness properties of streaming algorithms by reasoning about the in-memory evaluator of Shas.

9.1. Streaming Evaluators for Downward Hedge Automata

We define the streaming evaluator of a Sha A = ( Σ , Q , Δ , I , F ) by a visibly pushdown machine. The set of configurations of this machine is K = Q × Q* , containing pairs of states and stacks of states. The visibly pushdown machine provides for any word v ∈ Σ̂* a streaming transition relation →v str ⊆ K × K , such that for all states q , q′ ∈ Q , stacks σ ∈ Q* , and words v , v′ ∈ Σ̂* :
  • true implies ( q , σ ) →ε str ( q , σ )  wrt Δ ;
  • if ( q , σ ) →v str ( q′ , σ′ ) and ( q′ , σ′ ) →v′ str ( q″ , σ″ ) wrt Δ , then ( q , σ ) →v·v′ str ( q″ , σ″ )  wrt Δ ;
  • if q →a q′ in Δ , then ( q , σ ) →a str ( q′ , σ )  wrt Δ ;
  • if q →⟨⟩ q′ in Δ , then ( q , σ ) →⟨ str ( q′ , σ · q )  wrt Δ ;
  • if q @ p → q′ in Δ , then ( p , σ · q ) →⟩ str ( q′ , σ )  wrt Δ .
We note that the same visibly pushdown machine can be obtained by compiling the Sha to an Nwa, whose streaming evaluator is defined by a visibly pushdown machine too (see Section 12.3 on related work).
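The machine above can be sketched directly in Python. The toy evaluator below assumes the automaton is given by illustrative maps: `letter_rules` for letter transitions, `open_rules` for the opening rule, and `apply_rules` for apply transitions q @ p → q′; events are encoded as the strings `"<"` and `">"` and internal letters.

```python
def stream_run(events, q0, letter_rules, open_rules, apply_rules):
    """Run a dSha on a stream of events: '<', '>', or an internal letter."""
    state, stack = q0, []
    for e in events:
        if e == "<":                         # push the current state and descend
            stack.append(state)
            state = open_rules[state]
        elif e == ">":                       # pop and apply: q @ p -> q'
            q = stack.pop()
            state = apply_rules[(q, state)]
        else:                                # internal letter transition
            state = letter_rules[(state, e)]
    return state                             # stack is empty again on a nested word
```

Each event is processed in constant time, matching the complexity discussion below.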
The notion of partial runs of the in-memory evaluator is closely related to streaming runs. To see this, we first note that each partial run v wrt Δ naturally defines a stack as follows. Since v is a nested word prefix, there exist unique nested words v 0 , … , v n ∈ nw ( H Σ∪Q ) , such that
v = v 0 · ⟨ · v 1 · ⟨ · ⋯ · ⟨ · v n .
These nested words must be partial runs of Δ too, so there exist states p 0 , … , p n , q 0 , … , q n ∈ Q , such that v i goes from p i to q i for all 0 ≤ i ≤ n . We define the stack σ of v by σ = q 0 · … · q n−1 and say that v goes from p 0 to q n and has stack σ . Note that the stack is empty if v is a nested word, since in this case n = 0 , so that v = v 0 and σ = ε . For instance, the partial run v 0 = 0 · ⟨ · 0 · l i s t · 1 · ⟨ · 0 · i t e m · 2 has the stack σ 0 = 0 · 1 , while the partial run v 1 = v 0 · ⟩ · 3 · ⟩ · 4 has stack σ 1 = ε .
Lemma 20. 
Consider a Sha  A = ( Σ , Q , Δ , I , F ) , states q , q′ ∈ Q , and a stack σ ∈ Q* . Let v be a prefix of some nested word in nw ( H Σ ) . Then, there exists a partial run r wrt Δ from q to q′ with stack σ such that proj Σ ( r ) = v iff ( q , ε ) →v str ( q′ , σ ) wrt Δ.
Proof. 
This is carried out by induction on the length of prefix v.   □
This lemma implies that any in-memory transition of A on a hedge h from q to q can be identified with some streaming transition of A on nw ( h ) from ( q , ε ) to ( q , ε ) .
Proposition 8. 
q →h q′ wrt Δ ⟺ ( q , ε ) →nw ( h ) str ( q′ , ε ) wrt Δ .
Proof. 
If q →h q′ wrt Δ , there exists a run R of h with Δ that starts with q and ends with q′ . Consider the partial run v = nw ( R ) of h. Since v is a nested word, its stack is empty. By Lemma 20, we have ( q , ε ) →proj Σ ( v ) str ( q′ , ε ) wrt Δ and proj Σ ( v ) = nw ( h ) . For the inverse implication, assume that ( q , ε ) →nw ( h ) str ( q′ , ε ) wrt Δ . Let u = nw ( h ) . By Lemma 20, there exists a partial run v wrt Δ from q to q′ with stack ε such that proj Σ ( v ) = u . Then, there exists a run R of Δ on h such that v = nw ( R ) . Hence, q →h q′ wrt Δ .   □
For any dSha A, hedge h, and state q, the streaming transition on nw ( h ) with respect to Δ starting with ( q , ε ) can be computed in time O ( 1 ) per letter after a precomputation in time O ( | A | ) . So, the overall computation time is in O ( | A | + | h | ) . The streaming memory needed to store a configuration is of size O ( d e p t h ( h ) + | A | ) , since the size of the visible stack is bounded by the depth of the input hedge.

9.2. Adding Subhedge Projection

We next extend the streaming transition relation with subhedge projection, in analogy to what we did for the in-memory transition relation. We define a transition relation with subhedge projection →v shp str ⊆ K × K with respect to Δ , such that for all words v , v′ ∈ Σ̂* , letters a ∈ Σ , states p , q , q′ , q″ ∈ Q , and stacks σ , σ′ , σ″ ∈ Q* :
  • if q ∈ Q shp Δ and v ∈ nw ( H Σ ) , then ( q , σ ) →v shp str ( q , σ )  wrt Δ ;
  • if q ∉ Q shp Δ and q →a q′ in Δ , then ( q , σ ) →a shp str ( q′ , σ )  wrt Δ ;
  • if q ∉ Q shp Δ , then ( q , σ ) →ε shp str ( q , σ )  wrt Δ ;
  • if q ∉ Q shp Δ , ( q , σ ) →v shp str ( q′ , σ′ ) , and ( q′ , σ′ ) →v′ shp str ( q″ , σ″ ) wrt Δ , then ( q , σ ) →v·v′ shp str ( q″ , σ″ )  wrt Δ ;
  • if q ∉ Q shp Δ and q →⟨⟩ q′ in Δ , then ( q , σ ) →⟨ shp str ( q′ , σ · q )  wrt Δ ;
  • if p ∉ Q shp Δ and q @ p → q′ in Δ , then ( p , σ · q ) →⟩ shp str ( q′ , σ )  wrt Δ .
The projecting transition relation stays in a configuration with a subhedge projection state until the end of the current subhedge is reached. This is correct, since the state cannot change within a projected subhedge anyway.
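This skipping behavior can be sketched by extending a toy streaming evaluator with a bracket-depth counter: in a subhedge projection state, all events up to the end of the current subhedge are ignored. All names are illustrative; `proj_states` stands for the set Q shp Δ .

```python
def stream_run_shp(events, q0, letter_rules, open_rules, apply_rules,
                   proj_states):
    """Streaming run with subhedge projection: events inside a projected
    subhedge are skipped by merely counting bracket depth."""
    state, stack, skip_depth = q0, [], 0
    for e in events:
        if skip_depth > 0 or state in proj_states:
            # projection: only track the bracket depth of the skipped part
            if e == "<":
                skip_depth += 1
            elif e == ">":
                if skip_depth == 0:          # closing bracket of the tree
                    q = stack.pop()          # that encloses the projection
                    state = apply_rules[(q, state)]
                else:
                    skip_depth -= 1
            continue
        if e == "<":
            stack.append(state)
            state = open_rules[state]
        elif e == ">":
            q = stack.pop()
            state = apply_rules[(q, state)]
        else:
            state = letter_rules[(state, e)]
    return state
```

Note that transitions for the skipped letters need not even exist, which is exactly what makes projection save work.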
Proposition 9. 
For any Sha A = ( Σ , Q , Δ , I , F ) , states q , q′ ∈ Q , and nested word v on which Δ has no blocking partial run starting from q:
( q , ε ) →v str ( q′ , ε )  wrt Δ ⟺ ( q , ε ) →v shp str ( q′ , ε )  wrt Δ .
Proof. 
Let h ∈ H Σ be the hedge such that v = nw ( h ) . The proof is by induction on the size of v. For the implication from the left to the right, we assume that ( q , ε ) →v str ( q′ , ε )  wrt Δ . Since v = nw ( h ) , Proposition 8 proves q →h q′ wrt Δ .
Case 
q ∈ Q shp Δ . By Lemma 6, it follows from q →h q′ wrt Δ that q = q′ . We can then apply the following rule, since v is a nested word:
if q ∈ Q shp Δ and v ∈ nw ( H Σ ) , then ( q , ε ) →v shp str ( q , ε )  wrt Δ .
Hence, ( q , ε ) →v shp str ( q′ , ε )  wrt Δ .
Case 
q ∉ Q shp Δ . We distinguish all possible forms of the hedge h.
Subcase 
h = h′ · h″ . Let v′ = nw ( h′ ) and v″ = nw ( h″ ) , so v = v′ · v″ = nw ( h ) . The following rule must have been applied:
if ( q , ε ) →v′ str ( p , σ ) and ( p , σ ) →v″ str ( q′ , ε ) wrt Δ , then ( q , ε ) →v′·v″ str ( q′ , ε )  wrt Δ .
Since v′ is a nested word, it follows that σ = ε . By the induction hypothesis applied to the nested words v′ and v″ , we get ( q , ε ) →v′ shp str ( p , ε )  wrt Δ and ( p , ε ) →v″ shp str ( q′ , ε )  wrt Δ . Hence, we can apply the following rule:
if q ∉ Q shp Δ , ( q , ε ) →v′ shp str ( p , ε ) , and ( p , ε ) →v″ shp str ( q′ , ε ) wrt Δ , then ( q , ε ) →v′·v″ shp str ( q′ , ε )  wrt Δ .
This shows ( q , ε ) →v shp str ( q′ , ε ) , as required.
Subcases
h = ⟨ h′ ⟩  or  h = a  or  h = ε . Straightforward.
For the implication from the right to the left, we assume that ( q , ε ) →v shp str ( q′ , ε )  wrt Δ .
Case 
q ∈ Q shp Δ . Then, the following rule must have been applied with q = q′ :
if q ∈ Q shp Δ and v ∈ nw ( H Σ ) , then ( q , ε ) →v shp str ( q , ε )  wrt Δ .
Since we assume that v does not have any blocking partial run with Δ , there exists a partial run on v with Δ that starts with q. Since v is a nested word, any partial run on v is the nested word of some run, so there exists a run on the hedge h that starts with q. This shows the existence of some state p such that q →h p wrt Δ . Since q ∈ Q shp Δ , Lemma 6 shows q = p , and thus, q →h q′ wrt Δ . Proposition 8 then proves that ( q , ε ) →nw ( h ) str ( q′ , ε ) wrt Δ , which is equal to ( q , ε ) →v str ( q′ , ε ) wrt Δ , as required.
Case 
q ∉ Q shp Δ . We distinguish all possible forms of the hedge h.
Subcase 
h = h′ · h″ . Let v′ = nw ( h′ ) and v″ = nw ( h″ ) , so v = v′ · v″ = nw ( h ) . The following rule must have been applied:
if q ∉ Q shp Δ , ( q , ε ) →v′ shp str ( p , σ ) , and ( p , σ ) →v″ shp str ( q′ , ε ) wrt Δ , then ( q , ε ) →v′·v″ shp str ( q′ , ε )  wrt Δ .
Since v′ is a nested word, it follows that σ = ε , too. By the induction hypothesis applied to the nested words v′ and v″ , we get ( q , ε ) →v′ str ( p , ε )  wrt Δ and ( p , ε ) →v″ str ( q′ , ε )  wrt Δ . Hence, we can apply the following rule:
if ( q , ε ) →v′ str ( p , ε ) and ( p , ε ) →v″ str ( q′ , ε ) wrt Δ , then ( q , ε ) →v′·v″ str ( q′ , ε )  wrt Δ .
This shows ( q , ε ) →v str ( q′ , ε ) , as required.
Subcases
h = ⟨ h′ ⟩  or  h = a  or  h = ε . Straightforward.
This ends the proofs of both directions.   □
A streaming evaluator with subhedge projection for deterministic Shas A on hedges h can thus be obtained by computing the streaming transition relation with subhedge projection for A of nw ( h ) starting with the initial configuration. This costs time at most O ( 1 ) per letter of nw ( h ) , i.e., constant time per event of the stream.

9.3. Streaming Complexity of Earliest Membership with Subhedge Projection

Using streaming evaluators for Shas, we can test earliest membership with subhedge projection in streaming mode by running the Sha A cgr ( S ) e ( S ) .
Corollary 3. 
For any complete dSha A = ( Σ , Q , Δ , I , F ) with F ∪ F S ⊆ Q , language L = L ( A ) , and schema S = L ( A [ F / F S ] ) , earliest membership with complete subhedge projection for L wrt S can be tested in streaming mode for any hedge h ∈ S in time O ( 1 ) per nonprojected letter of nw ( h ) after a preprocessing time of O ( | A cgr ( S ) e ( S ) | + n³ d + n s ) , where d is the number of difference relations and s the number of safe subsets in states of Q cgr ( S ) e ( S ) . The space required is in O ( d e p t h ( h ) + | A cgr ( S ) e ( S ) | ) .
Proof. 
By Theorem 4, the dSha A cgr ( S ) e ( S ) is earliest and sound and complete for subhedge projection. By Propositions 8 and 9, we can thus obtain a streaming membership tester with subhedge projection for a hedge h ∈ S by running the streaming evaluator with subhedge projection of A cgr ( S ) e ( S ) on the nested word nw ( h ) . Since this Sha is deterministic, the streaming evaluation can be done in time O ( 1 ) per letter of nw ( h ) after a preprocessing time of O ( | A cgr ( S ) e ( S ) | + n³ d + n s ) .   □
As for safe-no-change projection, the exponential preprocessing time can again be avoided by creating the part of the Sha A cgr ( S ) e ( S ) needed for the evaluation of the hedge on the fly. The size of the needed part is in O ( | h | ) . Hence, the space requirement is also in O ( | h | ) , which may be too much for large streams nw ( h ) . The needed part of A cgr ( S ) e ( S ) , however, may be much smaller than | h | . Furthermore, a stack of size d e p t h ( h ) has to be maintained, but this dependence on h may be acceptable even for large streams nw ( h ) .

10. Monadic Regular Queries and XPath

We next consider the problem of how to answer monadic regular queries on hedges in an earliest manner and with subhedge projection. Each monadic query defines a set of positions for each input hedge. The XPath query self::list[child::item], for instance, can be applied to the hedge list . list . item . list . item , where it selects the position of the first top-most tree labeled by list , since it contains a child tree labeled by item . We identify positions of hedges by integers. For instance, the selected list element in the hedge above is identified by the integer 2. Therefore, the answer set of the above query on the above hedge is { 2 } . Note that monadic queries do not return any other information about the selected positions, in contrast to the subtree semantics of XPath in the official W3C standard.
Answering this query is more complicated than verifying whether the XPath filter [self::list/child::item] matches at the root of the first subtree of the hedge, since the answer set needs to be constructed, too. Concerning memory consumption, alive answer candidates need to be buffered for monadic query answering. Therefore, the memory consumption of streaming evaluators of monadic queries may grow linearly in the size of the stream, as the number of alive candidates may do.

10.1. Monadic Regular Queries

Monadic queries on hedges in H Σ can be identified with languages of annotated hedges in H Σ∪{x} , where x ∉ Σ is an arbitrarily fixed selection variable. A monadic query on hedges in H Σ can then be identified with the language of all annotated hedges in H Σ∪{x} in which a single position is annotated by x and this position is selected by the query.
Therefore, regular monadic queries on hedges in H Σ can be defined by nested regular expressions in nRegExp Σ∪{x} or by a dSha with the same signature. The regular XPath query self::list[child::item], for instance, can be defined by the following nested regular expression:
⟨ list · x · ⊤ · ⟨ item · ⊤ ⟩ · ⊤ ⟩ · ⊤ ,
where ⊤ is like a wildcard accepting any hedge in H Σ without x, i.e., for some recursion variable z V :
⊤ = μ z . ( ⟨ z ⟩ + ∑ a ∈ Σ a + z )* .
Note that the nested regular expression for the XPath filter [self::list/child::item] in nRegExp Σ can be obtained from that of the XPath query self::list[child::item] by removing the x. In addition, for any monadic query on hedges, we have to restrict the annotated hedges to be recognized by the language of the query such that exactly one position in each hedge is annotated by x. This can be done by intersection with the following nested regular expression o n e x :
μ z . ( ⊤ · x · ⊤ + ⊤ · ⟨ z ⟩ · ⊤ )
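To illustrate the semantics of o n e x on a toy encoding, we may represent an annotated hedge as a nested Python list of letters; membership then amounts to counting exactly one occurrence of the letter x. This is a minimal sketch under that assumed encoding, not part of our implementation.

```python
def count_x(hedge):
    """Count occurrences of the letter "x" in a hedge given as a nested list."""
    return sum(count_x(node) if isinstance(node, list) else (node == "x")
               for node in hedge)

def satisfies_one_x(hedge):
    """An annotated hedge is in one_x iff it contains exactly one x."""
    return count_x(hedge) == 1
```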
The previous schema doc for pattern matching now has to be changed from doc to S = doc x ∧ o n e x , where doc x ∈ nRegExp Σ∪{x} is like doc but over the signature Σ ∪ { x } . For the regular XPath query self::list[child::item], we obtain the dSha in Figure 22. The transition rules with the special letter x indicate where positions may be bound to the selection variable: x can be bound only in state 2, that is, in subtrees starting with a list element. The same automaton can be used to define the schema S = doc x ∧ o n e x by using the schema-final states F S = { 14 , 20 } instead of the final states F = { 20 } . Indeed, we constructed the automaton by intersecting the automaton obtained from the XPath expression with the automata computed from the regular expressions doc x and o n e x for the schema. Note that 20 is a final state for the query (and the schema), while 14 is a sink state for the query (but not for the schema).

10.2. Earliest Query Answering with Subhedge Projection

A naive query evaluation algorithm for a hedge automaton with signature Σ ∪ { x } receives an input hedge h ∈ H Σ , guesses all possible positions of where to insert x into h, and tests for all of them whether the annotated hedge obtained is accepted by A. In a less naive manner, one inserts x only at those positions where the current state has an outgoing x-transition. All this can be improved if the hedge automaton under consideration supports selection, rejection, or projection states. In all cases, the algorithm has to buffer all binding candidates that are not yet in a selection or rejection state. In the worst case, it has to buffer the bindings of all positions. Therefore, the state space of the evaluator of monadic queries is no longer finite.
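Candidate buffering can be sketched on flat words, ignoring brackets for brevity: every position whose current state has an outgoing x-transition spawns a candidate, candidates reaching a rejection sink are discarded, and the candidates that survive in a final state form the answer set. All names and the flat encoding are illustrative.

```python
def answer_positions(word, q0, delta, final, sink):
    """Toy candidate buffering for a monadic query on a flat word."""
    state = q0
    candidates = {}                      # position -> state of its x-branch
    for i, a in enumerate(word, start=1):
        if (state, "x") in delta:        # spawn a binding candidate here
            candidates[i] = delta[(state, "x")]
        state = delta[(state, a)]
        # advance all candidates, discarding those that reach the sink
        candidates = {p: delta[(c, a)] for p, c in candidates.items()
                      if delta[(c, a)] != sink}
    return {p for p, c in candidates.items() if c in final}
```

The dictionary of candidates plays the role of the buffer whose size may grow with the number of alive candidates, as discussed above.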
Selection and rejection states can be identified by adapting the earliest query answering algorithm from [21]. This can be done such that one obtains a Sha  A e ( S ) with signature Σ ∪ { x } depending on the schema. It should be noticed that the automata used in [21] are dNwas instead of Shas and that schemas are ignored there. The needed adaptations are therefore discussed in Appendix A. How to obtain an efficient query answering algorithm in streaming mode from A e ( S ) is the main topic of [21]. This is the base algorithm that we extend with subhedge projection for our experiments.
In order to add congruence projection to the earliest query answering algorithm, we have to run the automaton A e ( S ) cgr ( S ) . The earliest congruence projection for the XPath query self::list[child::item] is shown in Figure 23. It binds x in state 2D1 after having read a top-level list -element and then checks for the existence of a child element item , going into selection state 1D3 once it is found. All other children of the top-level list element are projected in state 2D3. For adding earliest query answering to safe-no-change projection, we have to run the automaton A e ( S ) snc instead of A e ( S ) cgr . We do not give the details but claim for the present query that the projection power of both automata is the same.
Fortunately, the soundness and completeness theorems for earliest membership testing with subhedge projection only affect the constructions of these Shas, but not how they are evaluated. However, there is one additional issue: a nested word prefix cannot be considered irrelevant for the subsequent subhedge if the selection variable x can be bound there, even if acceptance does not depend on where it will be bound. This is because the position of the binding has to be returned by any query answering algorithm.
Definition 15. 
We call a prefix v binding irrelevant for L and S if there does not exist any hedge h containing x and any suffix w such that v · w ∈ nw ( S ) and v · nw ( h ) · w ∈ nw ( L ) .
Our subhedge projecting evaluator can project only at subhedge irrelevant prefixes that are also binding irrelevant. Given a dSha A and a set of schema-final states F S , a prefix is binding relevant if its current state q does not itself permit x-bindings but can access some other state that is not safe for rejection and permits x-bindings. In ground Datalog, we can distinguish binding irrelevant states q by imposing the following rules for all q , q′ ∈ Q :
  • binding_irrelev(q) :- state(q), not binding_relev(q).
  • binding_relev(q) :- not bind_x(q),  acc(q,q’), bind_x(q’), not rej(q’).
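Since the two rules above are not recursive, they can be evaluated in a single pass over the accessibility relation, without a general Datalog fixpoint. A Python sketch with illustrative encodings of the predicates bind_x, rej, and acc follows.

```python
def binding_irrelevant(states, bind_x, rej, acc):
    """States q with binding_irrelev(q): those not satisfying binding_relev.
    bind_x: states with an outgoing x-transition; rej: rejection states;
    acc: accessibility relation as a set of state pairs."""
    relev = {q for q in states
             if q not in bind_x
             and any(q2 in bind_x and q2 not in rej
                     for (q1, q2) in acc if q1 == q)}
    return set(states) - relev
```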
Note in our example of the XPath query self::list[child::item] that state 2D3 of the dSha A cgr ( doc o n e x ) e ( doc o n e x ) in Figure 23 is binding irrelevant, since x must be bound on the top-level list element in state 2D1, so it cannot be bound on any list element below.

11. Experiments with Streaming XPath Evaluation

We experimentally compare three streaming evaluators for dShas implementing earliest monadic query answering: without subhedge projection, with safe-no-change projection, and with congruence projection.

11.1. Benchmark dSha for XPath Queries

We start from dShas defining monadic queries for the regular XPath queries A1–A8 from the XPathMark benchmark [7], which are given in Table 1. These XPath queries cover most features of the regular fragment of XPath. In Table 2, we add 14 further XPath queries that we found useful for testing, too.
Deterministic Shas for A1–A8 were provided earlier in [21]. We compiled the other XPath queries to dShas via nested regular expressions.
In order to produce the input dShas for our evaluators, we intersect these automata with a dSha for the schema doc x o n e x where doc x is the schema we use for hedge encodings of real-world Xml documents with x annotations. Thereby, we could identify the schema final states F S . We minimized and completed the result. Table 3 reports the size of the input dShas for our evaluators obtained by the above procedure. For each dSha A, the size is given by two numbers size(n), the first for the overall size and the second for the number of states n.
For the input dSha A of each query, we statically computed the whole dSha A cgr ( S ) e ( S ) while using the necessary parts of the determinization algorithm from [16]. The sizes of the dShas obtained and the number d of difference relations are reported in Table 3 too. The biggest size, 2091(504), is obtained for the input dSha of A8, as is the largest number of difference relations, d = 24 . So, the size of these automata is indeed much smaller than one might expect from a construction that is highly exponential in the worst case. However, A5 is the only query for which we could not obtain the corresponding deterministic earliest congruence projection A cgr ( S ) e ( S ) .
Computing the earliest congruence projection took between 2.3 and 26 s.
We note that we have not yet statically computed the complete dShas for safe-no-change projection A snc , pure congruence projection A cgr ( S ) , and earliest query answering A e ( S ) . We believe that these automata may become bigger than the dSha A cgr ( S ) e ( S ) that we constructed in a single shot.

11.2. Streaming Evaluation Tool: AStream

We are using and developing the AStream tool for answering monadic dSha queries on Xml streams with schema restrictions. Version 1.01 of AStream, presented in [21], supports earliest query answering without projection. Given a dSha A defining a monadic query, a set of schema final states F S , and an Xml stream w, it constructs on-the-fly the needed part of the Sha A e ( S ) while parsing w and evaluating A on the hedge encoding w.
For the FCT conference version of the present paper [24], we enhanced AStream with safe-no-change projection. This led to version AStream 2.01. It constructs the earliest safe-no-change projection dSha A e ( S ) snc on the fly while evaluating the monadic query defined by A on the hedge encoding the Xml stream w. Subhedge projection can be switched on or off in order to compare both versions without further implementation differences.
For this present journal version, we added earliest congruence projection and integrated it into AStream leading to AStream 3.0.
AStream 3.0 is different from the two previous versions in that the dSha A e ( S ) cgr is constructed statically and entirely, independently of the input hedge. We then use a generic earliest streaming evaluator for Shas that rejects in rejection states, selects in selection states, and projects in all subhedge projection states. This evaluator could also be run with A e ( S ) snc or A e ( S ) , as long as these do not grow too big.
It should be noticed that A e ( S ) cgr turned out to be nicely small for our whole benchmark, while the other dShas A e ( S ) snc and A e ( S ) risk becoming bigger. As stated earlier, we have not constructed these dShas for our benchmark queries so far. This is why we continued to run earliest streaming with safe-no-change projection with AStream 2.01, while we used AStream 3.0 for earliest streaming with congruence projection.
All AStream versions rely on the AbcDatalog library for Java for computing the least fixed points of Datalog programs in a bottom-up manner. Datalog programs are needed for all logical reasoning: for computing subsets of states that are safe for rejection, selection, or subhedge projection on all levels. The difference relations are also computed based on Datalog. Note that earliest on-the-fly query evaluation requires running Datalog during query evaluation, while with a static approach for constructing dShas entirely, Datalog is only needed at preprocessing time during the automaton construction.

11.3. Evaluation Measures

We want to measure the time for running the three streaming query evaluators for all dShas obtained from the XPath expressions in the collection in Table 1 and Table 2.
One advantage of the XPathMark benchmark is that it comes with a generator of Xml documents to which the queries can be applied and that can be scaled in size. We created an Xml document of size 1.1 GB, which is sufficiently large to make streaming mandatory. Our experiments show that the query evaluation time grows linearly with the size of the non-projected part of the Xml document. This holds for each of our three evaluators. Therefore, measuring the time of the evaluator on Xml documents of other sizes would not show any new insights, except that memory consumption remains moderate too.
The time gain of projection for a query Q is the percentage of evaluation time saved by projection, ignoring the pure document parsing time t parse :
time-gain Q = 100 · ( 1 − ( t project Q − t parse ) / ( t noproject Q − t parse ) ) .
We can measure the time gain for earliest safe-no-change projection by AStream 2.01 and for earliest congruence projection by AStream 3.0, since projection can be switched off in both of them. The disadvantage of the time gain is that it depends on the details of the implementation.
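The time gain can be computed directly from the three measured times; the numbers in the usage example below are illustrative rather than measurements from our experiments.

```python
def time_gain(t_project, t_noproject, t_parse):
    """Percentage of query evaluation time saved by projection, with the
    pure parsing time excluded from both measurements."""
    return 100.0 * (1 - (t_project - t_parse) / (t_noproject - t_parse))

# Illustrative numbers (seconds): projection needs 25 s, no projection 115 s,
# parsing alone 15 s, so projection saves 90% of the evaluation time.
gain = time_gain(25.0, 115.0, 15.0)
```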
The event gain is a better measure of the projection power. The parser sends a stream of needed events to the Sha, while ignoring parts of the input document that can be projected away. For this, it must always be informed by the automaton about what kind of events can be projected in the current state. The event gain of projection, event-gain Q , is then the percentage of events that are recognized as irrelevant and thus no longer created by a projecting evaluator for query Q. One might expect that the time gain is always smaller than the event gain:
time-gain Q ≤ event-gain Q .
Indeed, this will be confirmed by our experiments. We believe that the discrepancy between these two measures indicates how much room remains for optimizing the implementation of an evaluator. In this way, we can observe that some room for further optimization of AStream 3.0 still remains.

11.4. Experiments

We used Java’s XMLStreamReader interface v1.0 from the javax.xml.stream package to parse and stream Xml files in Scala. Parsing the 1.1 GB Xml document required t parse = 15 s, while querying it without projection took t noproject avg = 752 s on average, varying from 600 to 900 s. The average processing time for 1 MB was thus 0.72 s. Once we knew this, we predicted the expected evaluation time from the size of the non-projected part of the Xml document. It is as follows:
t predicted Q ≈ ( 1 − event-gain Q / 100 ) · 1100 · 0.72 s + t parse .
Without projection, the parser generates 1,012,018,728 events for the 1.1 GB Xml document. This is the baseline for computing event-gain Q for all queries Q.
We ran AStream with Scala v2.13.3 on a machine with the operating system Ubuntu 20.04.06 LTS (64-bit). The machine is a Dell Precision-7750 (Round Rock, TX, USA), equipped with an Intel® Core™ i7-10875H CPU @ 2.30 GHz × 16 (Santa Clara, CA, USA) and 32 GB of RAM.

11.4.1. Earliest Congruence Projection

We present the measurements of our evaluator with earliest congruence projection in Table 4. The event gain is high, varying between 75.7% for A1_0c and 100% for A0, A1_0a, A1_0b, A1_4, A1_5, and A4_1. The event gain for queries without descendant axis is above 98.2%. For these queries, all subtrees that are not on the main path of the query can be projected away. In particular, only subtrees up to a fixed depth have to be inspected. These are the queries A1, A4, A6–A8, and A4_0, and the queries with event gain 100% listed above. The time gain for all these queries is a little smaller than the event gain but still above 98.2%.
We next consider the second type of queries, having the descendant axis: specifically A2, A1_0c, A1_1a, A1_1d, A1_2, A1_6, and A2_1. Intuitively, a lower event gain is expected in these cases because subhedge projection with the descendant axis cannot directly exclude an entire subtree under a given element; all descendant elements must be inspected. For instance, the query A1_0c, equal to /site//@, must select all attributes of all elements under the root element site, requiring the inspection of every Xml element in the document for attribute presence, except for the root element site.
The lower event gains reported for the aforementioned queries align with the above expectation, showing percentages ranging from 75.7% to 81.1%. These queries also exhibit a larger gap between time gain and event gain, ranging from 2.2% to 5.7%, which will be subject to future optimization. It is worth noting that some of these queries combine multiple features at once, like queries A1_1a and A1_1d, which have a descendant axis, filters with attributes, and string comparison.
The only exceptions to this trend for the queries having a descendant axis are A3 and A1_3, with respective event gains of 97.8% and 99.8%. The reason behind this high gain is that these queries start from the root, with a child axis path of depth 3 before querying the descendant axis. This initial path allows for the exclusion and projection of most elements under site that do not fit the specified path, together with their entire subtrees.

11.4.2. Earliest Safe-No-Change Projection

We compare earliest safe-no-change projection with the earliest congruence projection in Table 5. For this, we restrict ourselves to the queries A1–A8 from the XPathMark in Table 1.
The XPath queries A1, A4, A6, A7, and A8 do not have descendant axes. The time gain for safe-no-change projection on these queries is between 92.3% and 98.9%. For earliest congruence projection, the time gain was better for A1, A4, and A6 with at least 96.6%. For A7 and A8, the time gain of safe-no-change projection is close to the event gain of congruence projection, but the time gain of congruence projection is 1.5% lower. So, we hope that an optimized version of congruence projection could outperform safe-no-change projection in all cases.
The XPath queries with a descendant axis are A2, A3, and A5. For A2 and A3 in particular, the time gain of safe-no-change projection is very low, not even reaching 10.9%, while congruence projection yields at least 77.9%. We believe that this is a bug in our current implementation of safe-no-change projection, which may be prohibiting the projection of attributes and text nodes for these queries (but not for the others). Congruence projection on A2 gains 77.9% of time and 81.1% of events. This is much better, but still far from the projection power for queries without a descendant axis. For A3, congruence projection gains 97.4% of time. This is much better than for A2 since the descendant axis in A3 is at the end of the path, while for A2 it is at the beginning. But even for A2, congruence projection is very helpful.

11.4.3. Comparison to External Tools

QuiXPath was shown to be the most efficient large-coverage external tool for regular XPath evaluation on Xml streams [4]. Therefore, we compare the efficiency of congruence projection to QuiXPath in Table 6, and thereby, by transitivity, to the many alternative tools. We note that QuiXPath performs early query answering by compilation to possibly nondeterministic Nwas with selection states, without always being earliest. Apart from this, it bases its efficiency on both subtree projection and descendant projection.
The experiments with QuiXPath in [4] use the parse-free time. Therefore, we do the same for congruence projection by subtracting the 15 s of parsing time from the measured overall time.
Compared on the parsing-free times for the XPath queries without a descendant axis, earliest congruence projection demonstrates significant improvements. On average, our projection is slower than QuiXPath only by a factor of 1.3–2.9. This shows that our implementation is already close to being competitive with the existing Xml streaming tools, while being the only one guaranteeing earliest query answering.
For query A2 with a descendant axis, congruence projection is slower than QuiXPath by a factor of 13.8. The reason is that QuiXPath supports descendant projection, while congruence projection concerns subhedge projection only. For query A3, with a descendant axis at the end, congruence projection is behind only by a factor of 1.6, so descendant projection seems less relevant here.

12. Related Automata Models

A large number of alternative automata notions for trees, forests, and hedges have been proposed in the literature. We compare them to Shas and downward Shas in the following subsections and discuss the implications.

12.1. Shas versus Tree, Forest, and Hedge Automata

  • Standard Hedge Automata. Standard hedge automata [9,10,32,33] operate on labeled hedges in a bottom-up manner, similarly to Shas, and have the same expressiveness as Shas when restricted to labeled hedges. However, their horizontal languages are specified differently, leading to a problematic notion of determinism [34]. For this reason, unique minimization fails for deterministic standard hedge automata.
  • Standard Tree Automata and SHA Minimization. Syntactically, any Sha A is a standard tree automaton over the following ranked signature:
    • a unary symbol for any letter a Σ ;
    • a binary symbol @;
    • a constant .
    Semantically, there is no perfect general correspondence. Only for subclasses can a perfect correspondence be established via binary encoding. This was shown in [8] for the subclass of dShas with I = Δ . In [35], it was also shown for the subclass of multi-module dShas. This leads to unique minimization results for both subclasses of deterministic Shas.
  • Bojanczyk’s Forest Automata. Standard hedge automata also have the same expressiveness as Bojanczyk’s forest automata (see Section 3.3 of [36]). These are based on the idea of transition monoids, rather than on the idea of transition rules such as Shas, finite state automata, or Neumann’s and Seidl’s forest automata. As a consequence, there is an exponential difference in succinctness between Shas and Bojanczyk’s forest automata [35].

12.2. Downward SHAs versus SHAs

We start by explaining why downward Shas have the same expressiveness as Shas. Basically, it is the same reason why Nwas have the same expressiveness as Shas (see, e.g., [8]). For the easier direction, we argued already that any Sha  A = ( Σ , Q , Δ , I , F ) can be converted to a downward Sha  down ( A ) .
  • Downward Elimination. Conversely, we can convert any downward dSha A into an equivalent Sha  elim ( A ) , while possibly introducing nondeterminism, as follows:
    I elim = I × Q F elim = F × Q q q in Δ ( q , q )   in Δ elim
    q a q in Δ r Q ( r , q ) a ( r , q )   in Δ elim q @ p q in Δ r Q ( r , q ) @ ( q , p ) ( r , q )   in Δ elim
Proposition 10. 
L ( A ) = L ( elim ( A ) ) .
The construction is analogous to the conversion of Nwas to Shas [8] or to hedge automata [26]. The correctness proofs for these compilers are standard.
  • Unique Minimization. Unique minimization fails for downward dShas, as usual for deterministic multiway automata. This happens even for deterministic two-way finite state automata on words. In contrast, the subclass of dShas with the restriction that the initial state and the tree initial state are the same enjoys unique minimization [8].
  • Neumann and Seidl’s Pushdown Forest Automata. Standard hedge automata are also known to have the same expressiveness as Neumann and Seidl’s pushdown forest automata [25]. The latter can be identified with downward Shas for labeled hedges. What is needed to map pushdown forest automata to standard hedge automata is downward elimination, as for mapping downward Shas to Shas.

12.3. Downward SHAs versus NWAs

The streaming evaluator for Shas via a visibly pushdown machine can also be obtained by compiling a Sha to a nested word automaton (Nwa) [11] whose streaming evaluator is given by a visibly pushdown machine [12], previously known as input driven automata [13,14,15].
Definition 16. 
A nested word automaton (Nwa) is a tuple ( Σ , Q , Γ , Δ , I , F ) , where Σ, Γ, and Q are sets, I , F ⊆ Q , and Δ = ( ( a Δ ) a Σ , Δ , Δ ) contains the following relations: a Δ ⊆ Q × Q , Δ ⊆ Q × ( Γ × Q ) , and Δ ⊆ Q × Γ × Q . A  Nwa  is deterministic, or equivalently a dNwa , if I contains at most one element and all of the above relations are partial functions.
The elements of Γ are called stack symbols. The transition rules in Δ again have three forms: internal rules q a q in Δ as for Shas, opening rules q γ q in Δ if Δ ( q ) = ( q , γ ) , and closing rules q γ q in Δ if Δ ( q , γ ) = q .
  • Streaming evaluators for Nwas. The streaming evaluator for Nwas can be seen as a pushdown machine for evaluating the nested words of hedges in a streaming manner. A configuration of the pushdown machine is a pair in K = Q × Γ ∗ containing a state and a stack. For any word v ∈ Σ ^ ∗ , we define a streaming transition relation v str ⊆ K × K , such that, for all q , q′ ∈ Q and σ ∈ Γ ∗ , the following is applicable:
    t r u e ( q , σ ) ε str ( q , σ )   wrt Δ ( q , σ ) v str ( q , σ ) ( q , σ ) v str ( q , σ )   wrt Δ ( q , σ ) v · v str ( q , σ )   wrt Δ
    q a q   in Δ q a str q   wrt Δ q γ q   in Δ ( q , σ ) str ( q , σ )   wrt Δ q γ q   in Δ ( q , σ · γ ) str ( q , σ )   wrt Δ
  • From Downward SHAs to NWAs. For any downward Sha A = ( Σ , Q , Δ , I , F ) , we can define the Nwa  A nwa = ( Σ , Q , Γ , Δ nwa , I nwa , F nwa ) , while preserving determinism, such that Γ = Q and Δ nwa contains for all a Σ and q , p Q the transition rules:
    q a q in Δ q a q in Δ nwa q q in Δ q q q in Δ nwa q @ p q in Δ p q q in Δ nwa
For instance, the dNwa  A nwa of the dSha A in Figure 9, obtained from the dSha for the XPath filter [ self : : list / child : : item ] from the introduction, is given in Figure 24. The dNwa  A nwa of the dSha A in Figure 21 is given in Figure 25.
Lemma 21. 
L ( A nwa ) = L ( A ) .
The runs of Shas A and Nwas A nwa can be identified.
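Under our reading of the rules above, a down rule becomes an opening rule that pushes the current state, and an apply rule becomes a closing rule that pops it. The compilation can then be sketched as follows; the dictionary encoding is our own illustrative choice.

```python
# Illustrative sketch of the Sha-to-Nwa compilation above, with Gamma = Q.

def sha_to_nwa(letter_rules, down_rules, apply_rules):
    """letter_rules: (q, a) -> q'   (letter rules, kept as internal rules);
       down_rules:   q -> q'        (entering a subtree from state q);
       apply_rules:  (q, p) -> q'   (q @ p -> q')."""
    internal = dict(letter_rules)                    # q -a-> q' unchanged
    # opening: from q, enter the subtree in down(q) and push q on the stack
    opening = {q: (q2, q) for q, q2 in down_rules.items()}
    # closing: from subtree result p, pop q and apply q @ p -> q'
    closing = {(p, q): q2 for (q, p), q2 in apply_rules.items()}
    return internal, opening, closing

# Hypothetical rules: 1 -a-> 2, down(0) = 1, 0 @ 2 -> 3.
internal, opening, closing = sha_to_nwa({(1, "a"): 2}, {0: 1}, {(0, 2): 3})
print(opening)  # {0: (1, 0)}
print(closing)  # {(2, 0): 3}
```

Determinism is preserved: if the input rule sets are partial functions, so are the three output dictionaries.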
  • Projection for NWAs. Projecting evaluators for Nwas were proposed in [4]. They are based on the following notion of states of irrelevant subtrees for Nwas.
Definition 17. 
(Variant of Definition 3 of [4]). We call a state q of an Nwa a state of irrelevant subtrees if there exist two different stack symbols γ , γ and a state q such that the transitions shown on the right exist but no further opening transitions with γ, no further transitions with γ , and no further closing transition in q popping γ. In this case, we write q i-tree e Σ \ .
Algorithms 17 00339 i016
Lemma 22. 
Any subhedge projection state of a complete Sha A is a state of irrelevant subtrees of A nwa .
It was then shown in [4] how to map an Nwa with a subset Q shp Δ of states of irrelevant subtrees to a projecting Nwa, whose streaming semantics yield a streaming evaluator with subhedge projection. The presentation of subhedge projection for Shas without passing via Nwas yields the same result in a more direct manner.
It should also be noted that projecting Nwas support descendant projection besides subhedge projection. For Shas, this is left to future research.

13. Conclusions and Future Work

We introduced the notion of irrelevant subhedges for membership to a hedge language under schema restrictions. We developed algorithms with subhedge projection that jump over irrelevant subhedges when testing membership to the regular hedge languages defined by dShas. All our algorithms can be run in in-memory mode and in streaming mode. The difficulty was how to pass the needed finite state information for subhedge projection top-down, given that Shas operate bottom-up. We solved this based on compilers from Shas to downward Shas.
The first compiler propagates safety information regarding non-changing states. The second compiler, which we proved to be complete for subhedge projection, propagates difference relations. We then combined subhedge projection with earliest membership testing and with earliest monadic query answering. We integrated both safe-no-change and complete congruence projection into our AStream tool and confirmed their usefulness experimentally: we showed that they can indeed speed up the AStream tool for answering regular XPath queries on Xml streams. Moreover, congruence projection showed much higher event projection gain rates than safe-no-change projection, ranging between 75.7% and 100%, becoming competitive in time with the best existing streaming tools for regular XPath queries without a recursive axis.
A remaining open theoretical and practical question is how to obtain descendant projection for Shas, as needed to speed up the evaluation of XPath queries with a descendant axis. On the implementation and experimental side, it would be nice to also investigate the in-memory versions of our subhedge projection algorithms. Finally, it would be nice to optimize streaming earliest query answering tools further until they outperform the best streaming tools in all cases, while being better for queries where the other tools are not earliest.

Author Contributions

Writing—original draft, A.A.S. and J.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Earliest Membership

We adapt the earliest membership tester for dShas from [21] so that it compiles to Shas instead of Nwas and such that it accounts for schemas.

Appendix A.1. Schema Safety Is an Inclusion Problem

The idea for testing certain membership and certain nonmembership is to consider which states are safe for reaching final states, except when going out of the schema. Let A = ( Σ , Q , Δ , I , F ) be a Sha with schema S , such that there exists F ⊆ F S ⊆ Q with S = L ( A [ F / F S ] ) . Note that L ( A ) ⊆ S ⊆ H Σ in particular. The set of states that are safe for reaching a subset P ⊆ Q for all hedges that do not go out of the schema S is as follows:
safe S Δ ( P ) = safe Δ ( P ∪ ( Q \ F S ) ) .
If A is deterministic and schema-complete for S then the safety of a state of A with respect to schema S is an inclusion problem.
Lemma A1. 
Let S = L ( A [ F / F S ] ) . If A is deterministic and schema-complete for S , then for all states q Q :
L ( A [ I / { q } , F / F S ] ) ⊆ L ( A [ I / { q } ] ) ⟺ q ∈ safe S Δ ( F ) .
Proof. 
“⇐”.
Suppose that q ∉ safe S Δ ( F ) . The definition of schema-based safety yields acc Δ ( { q } ) ∩ ( F S \ F ) ≠ ∅ . So, there exists some hedge h H Σ , such that q h F S \ F wrt Δ . In particular, h ∈ L ( A [ I / { q } , F / F S ] ) . By determinism, there exists no other transition on h from q, and thus, h ∉ L ( A [ I / { q } ] ) . Hence, L ( A [ I / { q } , F / F S ] ) ⊈ L ( A [ I / { q } ] ) .
“⇒”.
Suppose that q ∈ safe S Δ ( F ) . Let h ∈ L ( A [ I / { q } , F / F S ] ) . Then, there exists q′ ∈ F S , such that q h q′ wrt Δ . Hence, q′ ∈ acc Δ ( { q } ) , so that acc Δ ( { q } ) ⊆ F ∪ ( Q \ F S ) implies q′ ∈ F . Thus, h ∈ L ( A [ I / { q } ] ) .
Lemma A1 with F S = Q shows Lemma 5 of [21], which states for any complete dSha A that safety is a universality problem:
L ( A [ I / { q } ] ) = H Σ ⟺ q ∈ safe Δ ( F ) .
This is because the schema-completeness of A for S = H Σ implies the completeness of A, and thus, L ( A [ I / { q } , F / Q ] ) = H Σ .
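The schema-based safe states can be computed by least fixpoints over the accessibility relation. The following sketch uses our own encoding of a Sha (letter rules, tree-initial states, apply rules) and is an illustration, not the AStream implementation: safe_S(P) collects the states q whose accessible states all lie in P ∪ (Q \ F_S).

```python
# Illustrative sketch (our own encoding): accessible states of a Sha and
# the schema-based safe states safe_S(P) = { q | acc({q}) ⊆ P ∪ (Q \ F_S) }.

def closure(start, letter_rules, apply_rules, subtree_vals):
    """States reachable from `start` via letter and apply rules, where
       subtrees may evaluate to any state in `subtree_vals`."""
    reached, changed = set(start), True
    while changed:
        changed = False
        for (q, a), q2 in letter_rules.items():
            if q in reached and q2 not in reached:
                reached.add(q2); changed = True
        for (q, p), q2 in apply_rules.items():
            if q in reached and p in subtree_vals and q2 not in reached:
                reached.add(q2); changed = True
    return reached

def safe_schema(Q, F, F_S, letter_rules, tree_init, apply_rules):
    vals = set()          # states that subtrees can evaluate to (fixpoint)
    while True:
        new_vals = closure(tree_init, letter_rules, apply_rules, vals)
        if new_vals == vals:
            break
        vals = new_vals
    allowed = F | (Q - F_S)   # final states, or states outside the schema
    return {q for q in Q
            if closure({q}, letter_rules, apply_rules, vals) <= allowed}

# Hypothetical dSha: letter rule 0 -a-> 1, tree-initial state 0, apply rule
# 0 @ 1 -> 2, with F = {1} and F_S = {1, 2}: only state 1 is safe for F.
print(safe_schema({0, 1, 2}, {1}, {1, 2},
                  {(0, "a"): 1}, {0}, {(0, 1): 2}))  # {1}
```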

Appendix A.2. Algorithm

We now present our earliest membership tester for dShas with schema restrictions. Given a Sha A, we compute a Sha A e ( S ) = ( Σ , Q e ,   Δ e , I e ( S ) , F e ( S ) ) as follows. Let sel be a fresh symbol. We start with the set of states that are safe for selection, S 0 = safe S Δ ( F ) , and, respectively, safe for rejection, R 0 = safe S Δ ( Q \ F ) . The state sets of A e ( S ) are then as follows:
Q e = ( Q × 2 Q × 2 Q ) ∪ { sel } I e ( S ) = { ( q , R 0 , S 0 ) ∣ q ∈ I , q ∉ R 0 ∪ S 0 } ∪ { sel ∣ I ∩ S 0 ≠ ∅ } F e ( S ) = { ( q , R 0 , S 0 ) ∣ q ∈ F } ∪ { sel }
For illustration, reconsider the dSha from Figure 4. This dSha is not complete but schema-complete for the schema doc . We get s-down Δ ( 0 , { 4 } ) = { 3 , 4 } . It may seem counterintuitive that not only state 3 but also state 4 remains safe for selection when moving down. This reflects the fact that no proper subhedge of any hedge satisfying the intended schema may ever get into state 4. Applying the earliest construction with safe-no-change projection to the dSha in Figure 4 yields the dSha in Figure A1. The transition rules in Δ e are given in Figure A2.
Figure A1. The earliest dSha A e for the dSha A in Figure 19 with schema doc for the XPath filter [ self : : list / child : : item ] . The states of A e correspond to the following tuples: 0 = ( 0 , { 1 , 2 , 3 , 5 } , { 4 } ) , 0 = ( 0 , { 2 , 4 , 5 } , { 3 , 4 } ) , 1 = ( 1 , { 2 , 4 , 5 } , { 3 , 4 } ) , 0 = ( 0 , , { 2 , 4 , 5 } ) , 1 = ( 1 , , { 2 , 4 , 5 } ) , 0 = ( 0 , , ) , 1 = ( 1 , , ) , 2 = ( 1 , , ) , and 3 = ( 1 , , ) .
Algorithms 17 00339 g0a1
Figure A2. The earliest transition rules Δ e inferred from transition rules Δ of some Sha.
Algorithms 17 00339 g0a2

Appendix A.3. Soundness and Completeness

We next provide soundness and completeness results for our earliest in-memory membership tester with projection.
Proposition A1 (Soundness).
For any Sha A with schema S = L ( A [ F / F S ] ) , hedge h ∈ S , and partial run v of A e ( S ) on h that ends in state sel , it holds that h ∈ L ( A ) .
So, if the partial run reaches state sel , then proj Σ ( v ) is certain for membership to L ( A ) with respect to S . Evaluating the Sha A e ( S ) yields an earliest in-memory membership tester for the dSha A. A single adaptation is needed: whenever sel is reached, the evaluation can stop altogether and accept the input hedge.
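The adaptation above can be sketched as a short-circuiting evaluation loop; the encoding below is our own illustration, not the AStream implementation.

```python
# Illustrative sketch: a deterministic evaluator that accepts as soon as
# the selection state `sel` is reached, instead of consuming the rest of
# the event stream.

SEL = "sel"

def earliest_member(delta, init, final, events):
    """delta: (state, event) -> state. Returns True as soon as SEL is
       reached; otherwise decides membership after the last event."""
    state = init
    for event in events:
        state = delta[(state, event)]
        if state == SEL:
            return True        # certain membership: stop evaluating here
    return state in final

# delta has no rule for (SEL, "c"): the early stop makes it unnecessary,
# since membership is already certain after the event "b".
delta = {(0, "a"): 1, (1, "b"): SEL}
print(earliest_member(delta, 0, {SEL}, ["a", "b", "c"]))  # True
```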
Theorem A1 (Completeness).
Let A = ( Σ , Q , Δ , I , F ) be a dSha that is schema-complete for the schema S = L ( A [ F / F S ] ) . For any hedge h ∈ S and prefix v of nw ( h ) :
  • If v is certain for membership in L ( A ) with S , then there exists a partial run w for A e ( S ) on h, such that v = proj Σ ( w ) and w ends with state sel .
  • If v is certain for nonmembership in L ( A ) with S , then there exists a blocking partial run w for A e ( S ) on h, such that proj Σ ( w ) is a prefix of v.
This means that certainty for membership and nonmembership is detected at the earliest prefix when running the evaluator for A e ( S ) .
Proof. 
We notice that this algorithm is basically the same as the earliest membership tester from Proposition 6 of [21], except that it also checks for safe rejection and that it accounts for schemas. The fact that Nwas are used there instead of Shas here is not essential. So, we can lift the soundness and completeness proofs from there without problems. □

Appendix B. Proofs for Section 6 (Safe-No-Change Projection)

Proof. 
We have to prove that it is sound to treat the states q ∈ P ∪ no-change Δ as no-more-changing.
If q ∈ no-change Δ , this follows from the schema-completeness of Δ , so that one can neither block on any hedge from the schema nor change the state. In the case q ∈ P , the intuition is that the state one level above, say r, can neither change, since P = s-no-change Δ ( r ) , nor can the automaton block on any hedge from the schema due to schema-completeness.
We first prove the inclusion L ( A ) ⊆ L ( A snc ) . Since L ( A ) ⊆ S by definition of schemas, this implies L ( A ) ⊆ L ( A snc ) ∩ S . The proof will be based on the following three Claims A1.1a, A1.2a, and A1.3a. Note that schema-completeness is not needed for this direction. □
Claim A1.1a. 
Π h Π   wrt Δ snc for all hedges h H Σ .
  • The proof is straightforward by induction on the structure of h. It uses the last three transition rules of Δ snc in Figure 10, which permit always staying in Π for whatever hedge follows.
Claim A1.2a. 
For all h H Σ , q Q , and P Q , such that q P no-change Δ ,
( q , P ) h ( q , P )   wrt Δ snc .
We prove Claim A1.2a by induction on the structure of h.
Case 
h = h . In this case, we can use Claim A1.1a to show Π h Π wrt Δ snc and the inference rules
q P no-change Δ ( q , P ) Π   in Δ snc q P no-change Δ ( q , P ) @ Π ( q , P )   in Δ snc
in order to close the following diagram with respect to Δ snc :
Algorithms 17 00339 i017
This proves ( q , P ) h ( q , P ) wrt Δ , as required by the claim.
Case 
h = a . Since q P no-change Δ we can apply the inference rule:
a Σ q P no-change Δ ( q , P ) a ( q , P )   in Δ snc
This proves this case of the claim.
Case 
h = ϵ . We trivially have ( q , P ) ϵ ( q , P ) wrt Δ snc .
Case 
h = h h . By induction hypothesis applied to h and h we have: ( q , P ) h ( q , P ) and ( q , P ) h ( q , P ) wrt Δ snc . Hence, ( q , P ) h h ( q , P ) wrt Δ snc .
This ends the proof of Claim A1.2a. The next claim, whose induction step is a little more tedious to prove, is the key to the soundness proof. We define the following predicate for all q , q′ ∈ Q and P ⊆ Q :
q P q iff ( q = q q , q P )
Claim A1.3a
Let h H Σ be a hedge, q , q Q states and P a subset of states, such that acc Δ   ( P ) P and q P no-change Δ . If q h q wrt Δ , then there exists q , such that ( q , P ) h ( q , P )   wrt Δ snc and q P q .
Proof. 
Induction on the structure of h.
Case 
h = h . The assumption q h q wrt Δ shows that there exists states p 0 Δ and p Q closing the following diagram:
Algorithms 17 00339 i018
Let P = s-no-change Δ ( q ) and note that acc Δ ( P ) P . Since q P no-change Δ we can infer the following:
p 0 in Δ q P no-change Δ ( q , P ) ( p 0 , P )   in Δ snc q @ p q in Δ q P no-change Δ ( q , P ) @ ( p , P ) ( q , P )   in Δ snc
Subcase 
P 0 P no-change Δ . The induction hypothesis applied to h shows that there exists p′, such that ( P 0 , P ) h ( P , P )   wrt Δ snc and p P p . We distinguish the two cases justifying the latter predicate:
Subsubcase 
p = p . Hence, ( P 0 , P ) h ( p , P )   wrt Δ snc , so we can close the diagram as follows:
Algorithms 17 00339 i019
This shows that ( q , P ) h ( q , P )   wrt Δ snc , and thus, the claim since q P q .
Subsubcase 
p , p P . Since p P and P = s-no-change Δ ( q ) , we have q @ p q in Δ . Hence, we can close the diagram as follows:
Algorithms 17 00339 i020
Since p P and q @ p q in Δ , we have q = q by definition of P = s-no-change Δ ( q ) . This shows that ( q , P ) h ( q , P )   wrt Δ snc . Since q P q , the claim follows.
Subcase 
p 0 P no-change Δ . Claim A1.2a then shows that ( p 0 , P ) h ( p 0 , P ) wrt Δ snc .
Subsubcase 
p 0 P . Since p acc Δ ( p 0 ) and acc Δ ( P ) P , it follows that p P , too. By definition P = s-no-change Δ ( q ) and the completeness of Δ , the memberships p 0 P and p P imply that q @ Δ = { q } = q @ Δ p . We can now close the diagram below as follows:
Algorithms 17 00339 i021
Subsubcase 
p 0 ∈ no-change Δ . In this case, p 0 = p so that q q @ Δ p 0 . Hence,
Algorithms 17 00339 i022
Case 
h = a . Since q P no-change Δ , we can apply the inference rule:
q a q in Δ q P no-change Δ ( q , P ) a ( q , P )   in Δ snc
This shows that ( q , P ) h ( q , P ) validates the claim since q p q .
Case 
h = ϵ . In this case, we have q = q and ( q , P ) ϵ ( q , P ) so the claim holds.
Case 
h = h 1 · h 2 . Since q h q wrt Δ , there exists q 1 Q , such that q h 1 q 1 wrt Δ and q 1 h 2 q wrt Δ . Since q P no-change Δ , we apply the induction hypothesis on h 1 . This implies that there exists q 1 such that,
( q , P ) h 1 ( q 1 , P )   wrt Δ snc and q 1 p q 1 .
We distinguish the two cases of q 1 p q 1 :
Subcase 
q 1 = q 1 . We also distinguish two subcases here:
Subsubcase q 1 P .
The induction hypothesis applied to h 2 yields the following:
q . ( q 1 , P ) h 2 ( q , P )   wrt Δ snc q p q ,
Hence
q . ( q , P ) h ( q , P )   wrt Δ snc Δ q p q .
Subsubcase 
q 1 P . By Claim A1.2a, we have ( q 1 , P ) h 2 ( q 1 , P ) . We also have q a c c ( { q 1 } ) , and since we assume a c c ( P ) P , this implies q P . Hence, ( q , P ) h ( q 1 , P ) and q , q 1 P , implying the claim is q p q 1 .
Subcase 
q 1 , q 1 P . Since q 1 P , Claim A1.2a implies ( q 1 , P ) h 2 ( q 1 , P ) wrt Δ snc . Thus, ( q , P ) h ( q 1 , P ) wrt Δ snc . Since q acc Δ ( { q 1 } ) and q 1 P , it follows that q acc Δ ( P ) P . Here, we used as in the previous subsubcase that a c c ( P ) P is assumed by the claim. Let q = q 1 . Then, we have ( q , P ) h ( q , P ) wrt Δ snc and q , q P showing the claim.
This ends the proof of Claim A1.3a. □
Proof of Inclusion ( A ) L ( A snc ) . 
Let h L ( A ) . Then, there exists q 0 I and q F , such that q 0 h q . We distinguish two cases:
Case 
q 0 no-change Δ . By definition of no-changeΔ and since q acc Δ ( q 0 ) we have q 0 = q . Claim A1.2a shows that ( q 0 , ) h ( q 0 , ) wrt Δ snc and thus ( q 0 , ) h ( q , ) so that h L ( A snc ) .
Case 
q 0 no-change Δ . Claim A1.3a with P = shows that ( q 0 , ) h ( q , ) wrt Δ snc and hence h L ( A snc ) .
This ends the proof of the first inclusion. □
We next want to show the inverse inclusion L ( A snc ) ∩ S ⊆ L ( A ) . It will eventually follow from the following three Claims A1.1b, A1.2b, and A1.3b.
Claim A1.1b. 
For any hedge h and state μ Q snc , if  Π h μ wrt Δ snc then μ = Π .
  • The proof is straightforward by induction on the structure of h: the only transitions rules of Δ snc with Π on the left hand side are inferred by the last three rules in Figure 10. These require to stay in Π whatever hedge follows.
Claim A1.2b. 
For any hedge h, set P Q , state q P no-change Δ , and state μ Q snc : if ( q , P ) h μ wrt Δ snc , then μ = ( q , P ) .
Proof. 
By induction on the structure of h. Suppose that ( q , P ) h μ wrt Δ snc :
Case 
h = h . There must exist states μ 1 , μ 1 Q snc closing the following diagram:
Algorithms 17 00339 i023
Since q P no-change Δ , the following rule must have been applied to infer ( q , P ) μ 1 wrt Δ snc :
q P no-change Δ ( q , P ) Π in Δ snc
Therefore, μ 1 = Π . Claim A1.1b shows that μ 1 = Π , too. So, μ must have been inferred by applying the rule:
q P no-change Δ ( q , P ) @ Π ( q , P ) in Δ snc .
So, μ = ( q , P ) , as required.
Case 
h = a . The following rule must have been applied:
q P no-change Δ ( q , P ) a ( q , P ) in Δ snc
Hence, μ = ( q , P ) .
Case 
h = ε . Obvious.
Case 
h = h 1 · h 2 . There must exist μ 1 , such that ( q , P ) h 1 μ 1 h 2 μ wrt Δ snc . By induction hypothesis applied to h 1 , we have μ 1 = ( q , P ) . We can thus apply the induction hypothesis to h 2 to obtain μ 2 = ( q , P ) .
This ends the proof of Claim A1.2b. We next need an inverse of Claim A1.3a. □
Claim A1.3b. 
Let q Q , P Q , such that q P no-change Δ and acc Δ ( P ) P . For any h H Σ , such that ( q , P ) h μ wrt Δ snc for some μ Q snc and such that Δ does not have any blocking partial run on h starting from q, there exists q , q , such that
μ = ( q , P ) , q h q wrt Δ , and q P q .
Proof. 
By induction on the structure of h H Σ , we distinguish cases for all possible forms h.
Case 
h = h . By definition of ( q , P ) h μ wrt Δ snc , there must exist μ 1 , μ 1 Q snc , such that the following diagram can be closed:
Algorithms 17 00339 i024
Since q P no-change Δ , the following rule got applied to infer ( q , P ) μ 1 where P = s-no-change Δ ( q ) :
p in Δ q P no-change Δ ( q , P ) ( p , P ) in Δ snc
Hence, there exists p Δ , such that μ 1 = ( p , P ) . We fix such a state p arbitrarily.
Since Δ does not have any blocking runs on h from q there exists p , q Q , such that p h p and q @ p q wrt Δ . Furthermore, Δ does not have any blocking partial run on h starting from p.
Subcase 
p P . In this case, we can apply Claim A1.2b to ( p , P ) h μ 1 wrt Δ in order to show that μ 1 = ( p , P ) . Since p acc Δ ( { p } ) and p P = s-no-change Δ ( q ) it follows that q @ Δ p = { q } . Hence, the following rule got applied to infer ( q , P ) h μ :
q @ p q in Δ q P no-change Δ ( q , P ) @ ( p , P ) ( q , P ) in Δ snc ,
This shows that μ = ( q , P ) . Let q = q so that μ = ( q , P ) . Since p h p wrt Δ , we have p acc Δ ( { p } ) so that q @ p q wrt Δ by definition of s-no-change Δ ( { q } ) . Hence, we can close the following diagram:
Algorithms 17 00339 i025
Let q = q , so that q = q . It then holds that q P q and q h q wrt Δ , as required by the claim.
Subcase 
p no-change Δ . In this case p h p wrt Δ implies that p = p . Hence, we can close the following diagram:
Algorithms 17 00339 i026
This shows that q h q wrt Δ .
Subcase 
p P no-change Δ . By induction hypothesis applied to ( p , P ) h μ 1 wrt Δ , there exists p , such that
μ 1 = ( p , P ) , p h p wrt Δ , and p P p .
Since Δ does not have blocking runs on h starting with q, there exist q such that q @ p q wrt Δ . There are two ways to satisfy p P p :
Subsubcase 
p = p . We then have the following:
Algorithms 17 00339 i027
This shows q h q wrt Δ .
Subsubcase 
p , p P . By definition of P , it follows that q = q = q . Hence,
Algorithms 17 00339 i028
This shows q h q wrt Δ .
Case 
h = a . Since q P no-change Δ , the following inference rule must be used:
q a q in Δ q P no-change Δ ( q , P ) a ( q , P ) in Δ snc
So μ = ( q , P ) and q a q wrt Δ .
Case 
h = ε . Obvious.
Case 
h = h 1 · h 2 . The judgement ( q , P ) h μ wrt Δ snc shows that there exists μ 1 , such that ( q , P ) h 1 μ 1 h 2 μ wrt Δ snc . Since q P no-change Δ , we can apply the induction hypothesis to h 1 . It shows that there exists q 1 and q 1 , such that μ 1 = ( q 1 , P ) , q h 1 q 1 wrt Δ , and q 1 P q 1 .
Subcase 
q 1 = q 1 . Hence, ( q 1 , P ) h 2 μ wrt Δ snc .
Subsubcase 
q 1 P no-change Δ . In this case, we can apply the induction hypothesis to ( q 1 , P ) h 2 μ wrt Δ snc , showing the existence of q , such that μ = ( q , P ) and state q , such that q 1 h 2 q and q P q . Hence, q . q h q wrt Δ and q P q so the claim holds.
Subsubcase 
q 1 P no-change Δ . Claim A1.2b applied to ( q 1 , P ) h 2 μ wrt Δ snc shows that μ = ( q 1 , P ) .
Subcase 
q 1 , q 1 P . Recall that ( q 1 , P ) h 2 μ wrt Δ snc and q 1 P . Claim A1.2b shows that μ = ( q 1 , P ) wrt Δ snc . We also have q h 1 q 1 wrt Δ . Since there are no blocking partial runs on h starting from q, there exists a state q , such that q 1 h 2 q wrt Δ . Since q 1 P and P is closed under accessibility, we have q ∈ acc ( { q 1 } ) ⊆ acc ( P ) ⊆ P . From q h 1 q 1 wrt Δ , we get q h q wrt Δ . Since q 1 , q P , it follows that q 1 P q , and thus, the claim holds.
This ends the proof of Claim A1.3b.
Proof of Inclusion L ( A snc ) S L ( A ) . 
Let h L ( A snc ) S . Since h L ( A snc ) , then there exists q 0 I and q F , such that ( q 0 , ) h ( q , ) wrt Δ snc . Using Lemma 7, we have q 0 h q wrt Δ .
We distinguish two cases:
Case 
q 0 no-change Δ . Claim A1.2b shows that q = q 0 . Since A is schema-complete for S , h S , and q 0 I no-change Δ , Lemma 7 shows that q 0 h q 0 wrt Δ . Since q = q 0 , this yields h L ( A ) .
Case 
q 0 no-change Δ . Since A is schema-complete for S and h S there exist no blocking runs on h that start in q 0 . Therefore, we can apply Claim A1.3b with P = to ( q 0 , ) h ( q , ) wrt Δ snc . This shows that q 0 h q wrt Δ , and hence, h L ( A ) .
This ends the proof of the inverse inclusion, and thus, of L ( A ) = L ( A snc ) .   □

References

  1. Marian, A.; Siméon, J. Projecting XML Documents. In Proceedings of the VLDB; Morgan Kaufmann: San Francisco, CA, USA, 2003; pp. 213–224. [Google Scholar]
  2. Frisch, A. Regular Tree Language Recognition with Static Information. In Proceedings of the Exploring New Frontiers of Theoretical Informatics, IFIP 18th World Computer Congress, TCS 3rd International Conference on Theoretical Computer Science, Toulouse, France, 22–27 August 2004; pp. 661–674. [Google Scholar]
  3. Maneth, S.; Nguyen, K. XPath Whole Query Optimization. Proc. VLDB Endow. 2010, 3, 882–893. [Google Scholar] [CrossRef]
  4. Sebastian, T.; Niehren, J. Projection for Nested Word Automata Speeds up XPath Evaluation on XML Streams. In Proceedings of the International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), Harrachov, Czech Republic, 23–28 January 2016. [Google Scholar]
  5. Kay, M. The Saxon XSLT and XQuery Processor. 2004. Available online: https://www.saxonica.com (accessed on 22 July 2024).
  6. Gienieczko, M.; Murlak, F.; Paperman, C. Supporting Descendants in SIMD-Accelerated JSONPath. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, La Jolla, CA, USA, 27 April–1 May 2024; Volume 1, pp. 15–29. [Google Scholar]
  7. Al Serhali, A.; Niehren, J. A Benchmark Collection of Deterministic Automata for XPath Queries. In Proceedings of the XML Prague 2022, Prague, Czech Republic, 9–11 June 2022. [Google Scholar]
  8. Niehren, J.; Sakho, M. Determinization and Minimization of Automata for Nested Words Revisited. Algorithms 2021, 14, 68. [Google Scholar] [CrossRef]
  9. Comon, H.; Dauchet, M.; Gilleron, R.; Löding, C.; Jacquemard, F.; Lugiez, D.; Tison, S.; Tommasi, M. Tree Automata Techniques and Applications. Available online: http://tata.gforge.inria.fr (accessed on 22 July 2024).
  10. Thatcher, J.W. Characterizing derivation trees of context-free grammars through a generalization of automata theory. J. Comput. Syst. Sci. 1967, 1, 317–322. [Google Scholar] [CrossRef]
  11. Alur, R. Marrying Words and Trees. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Beijing, China, 11–14 June 2007; ACM-Press: New York, NY, USA, 2007; pp. 233–242. [Google Scholar]
  12. Alur, R.; Madhusudan, P. Visibly pushdown languages. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004; Babai, L., Ed.; ACM: New York, NY, USA, 2004; pp. 202–211. [Google Scholar] [CrossRef]
  13. Mehlhorn, K. Pebbling Mountain Ranges and its Application to DCFL-Recognition. In Proceedings of the Automata, Languages and Programming, 7th Colloquium, Noordwijkerhout, The Netherlands, 14–18 July 1980; Proceedings. de Bakker, J.W., van Leeuwen, J., Eds.; Springer: Berlin/Heidelberg, Germany, 1980; Volume 85, pp. 422–435. [Google Scholar] [CrossRef]
  14. von Braunmühl, B.; Verbeek, R. Input Driven Languages are Recognized in log n Space. In Topics in the Theory of Computation; North-Holland Mathematics Studies; Karpinski, M., van Leeuwen, J., Eds.; North-Holland: Amsterdam, The Netherlands, 1985; Volume 102, pp. 1–19. [Google Scholar] [CrossRef]
  15. Okhotin, A.; Salomaa, K. Complexity of input-driven pushdown automata. SIGACT News 2014, 45, 47–67. [Google Scholar] [CrossRef]
  16. Niehren, J.; Sakho, M.; Al Serhali, A. Schema-Based Automata Determinization. In Proceedings of the 13th International Symposium on Games, Automata, Logics and Formal Verification, GandALF 2022, Madrid, Spain, 21–23 September 2022; Ganty, P., Monica, D.D., Eds.; EPTCS: The Hague, The Netherlands, 2022; Volume 370, pp. 49–65. [Google Scholar] [CrossRef]
  17. Lick, A.; Schmitz, S. XPath Benchmark. Available online: https://archive.softwareheritage.org/browse/directory/1ea68cf5bb3f9f3f2fe8c7995f1802ebadf17fb5 (accessed on 22 July 2024).
  18. Mozafari, B.; Zeng, K.; Zaniolo, C. High-performance complex event processing over XML streams. In Proceedings of the SIGMOD Conference; Candan, K.S., Chen, Y., Snodgrass, R.T., Gravano, L., Fuxman, A., Eds.; ACM: New York, NY, USA, 2012; pp. 253–264. [Google Scholar] [CrossRef]
  19. Grez, A.; Riveros, C.; Ugarte, M. A Formal Framework for Complex Event Processing. In Proceedings of the 22nd International Conference on Database Theory (ICDT 2019), Lisbon, Portugal, 26–28 March 2019; pp. 5:1–5:18. [Google Scholar] [CrossRef]
  20. Muñoz, M.; Riveros, C. Streaming Enumeration on Nested Documents. In Proceedings of the 25th International Conference on Database Theory, ICDT 2022, Edinburgh, UK, 29 March–1 April 2022; Olteanu, D., Vortmeier, N., Eds.; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Wadern, Germany, 2022; Volume 220, pp. 19:1–19:18. [Google Scholar] [CrossRef]
  21. Al Serhali, A.; Niehren, J. Earliest Query Answering for Deterministic Stepwise Hedge Automata. In Proceedings of the Implementation and Application of Automata—27th International Conference, CIAA 2023, Famagusta, North Cyprus, 19–22 September 2023; Proceedings. Nagy, B., Ed.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14151, pp. 53–65. [Google Scholar] [CrossRef]
  22. Gauwin, O.; Niehren, J.; Tison, S. Earliest Query Answering for Deterministic Nested Word Automata. In Proceedings of the 17th International Symposium on Fundamentals of Computer Theory, Wroclaw, Poland, 2–4 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5699, pp. 121–132. [Google Scholar]
  23. Debarbieux, D.; Gauwin, O.; Niehren, J.; Sebastian, T.; Zergaoui, M. Early nested word automata for XPath query answering on XML streams. Theor. Comput. Sci. 2015, 578, 100–125. [Google Scholar] [CrossRef]
  24. Al Serhali, A.; Niehren, J. Subhedge Projection for Stepwise Hedge Automata. In Proceedings of the Fundamentals of Computation Theory—24th International Symposium, FCT 2023, Trier, Germany, 18–21 September 2023; Proceedings. Fernau, H., Jansen, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14292, pp. 16–31. [Google Scholar] [CrossRef]
  25. Neumann, A.; Seidl, H. Locating Matches of Tree Patterns in Forests. In Proceedings of the Foundations of Software Technology and Theoretical Computer Science, Chennai, India, 17–19 December 1998; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1530, pp. 134–145. [Google Scholar]
  26. Gauwin, O.; Niehren, J.; Roos, Y. Streaming Tree Automata. Inf. Process. Lett. 2008, 109, 13–17. [Google Scholar] [CrossRef]
  27. Franceschet, M. XPathMark Performance Test. Available online: https://users.dimi.uniud.it/~massimo.franceschet/xpathmark/PTbench.html (accessed on 25 October 2020).
  28. Hosoya, H.; Pierce, B.C. XDuce: A Statically Typed XML Processing Language. ACM Trans. Internet Technol. 2003, 3, 117–148. [Google Scholar] [CrossRef]
  29. Gécseg, F.; Steinby, M. Tree Automata; Akadémiai Kiadó: Budapest, Hungary, 1984. [Google Scholar]
  30. Gold, E.M. Language Identification in the Limit. Inf. Control. 1967, 10, 447–474. [Google Scholar] [CrossRef]
  31. Carme, J.; Niehren, J.; Tommasi, M. Querying Unranked Trees with Stepwise Tree Automata. In Proceedings of the Rewriting Techniques and Applications, 15th International Conference, RTA 2004, Aachen, Germany, 3–5 June 2004; Proceedings. van Oostrom, V., Ed.; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3091, pp. 105–118. [Google Scholar] [CrossRef]
  32. Pair, C.; Quéré, A. Définition et etude des bilangages réguliers. Inf. Control 1968, 13, 565–593. [Google Scholar] [CrossRef]
  33. Murata, M. Hedge Automata: A Formal Model for XML Schemata. 2000. Available online: http://www.xml.gr.jp/relax/hedge_nice.html (accessed on 22 July 2024).
  34. Martens, W.; Niehren, J. On the Minimization of XML Schemas and Tree Automata for Unranked Trees. J. Comput. Syst. Sci. 2007, 73, 550–583. [Google Scholar] [CrossRef]
  35. Niehren, J. Stepwise Hedge Automata are exponentially more succinct than Forest Automata. 2024; in preparation. [Google Scholar]
  36. Bojanczyk, M.; Walukiewicz, I. Forest algebras. In Proceedings of the Logic and Automata: History and Perspectives [in Honor of Wolfgang Thomas]; Flum, J., Grädel, E., Wilke, T., Eds.; Amsterdam University Press: Amsterdam, The Netherlands, 2008; Volume 2, pp. 107–132. [Google Scholar]
Figure 1. The graph of the hedge list · item · F · C · T · item · C · I · A · A · list · item · C · M · S · B is a sequence of trees.
Figure 2. The sequence of Xml documents encoded by the hedge in Figure 1.
Figure 3. The part of the hedge distinguished by the nested word prefix list · item · F · C · T · item · C · I · A · A · list ·.
Figure 4. A dSha for the XPath filter [self::list/child::item] that is schema-complete for the schema doc.
Figure 5. The unique successful run of the dSha in Figure 4 on the hedge list · item is the hedge 0 · 0 · list · 1 · 0 · item · 2 · 3 · 4, computed as illustrated above.
Figure 6. The determinization Δ det of the transition rules Δ of a Sha.
Figure 7. The deterministic Sha with subhedge projection state Π obtained by safe-no-change projection.
Figure 8. A successful run of the Sha in Figure 7 on the hedge list · list · h₁ · item · h₂.
Figure 9. The Sha A^{down} for the dSha A for the filter [self::list/child::item] in Figure 4.
Figure 10. The transition rules of the Sha  A snc inferred from those of the Sha A.
Figure 11. The safe-no-change projection dSha A^{snc} constructed from the dSha A in Figure 4 for the query [self::list/child::item]. Subhedge projection states are colored in orange. Useless states and transitions leading out of schema doc are omitted. State Π has a transition labeled by the wildcard letter “−”, which stands for either letter list or item. This dSha is equal to the dSha in Figure 7 up to the state renaming 0 = (0, {}), 0′ = (0, {2, 4}), 0″ = (0, {1, 3, 4}), 1 = (1, {2, 4}), 1′ = (1, {1, 3, 4}), 2 = (2, {2, 4}), 2′ = (2, {1, 3, 4}), 3 = (3, {2, 4}), and 4 = (4, {}). Recall that no-changeΔ = {2, 3, 4, 5}.
Figure 12. A schema-complete dSha for the XPath filter [child::item]. It is a counterexample for the completeness of safe-no-change projection; see Figure 13.
Figure 13. The safe-no-change projection A^{snc} of the dSha A in Figure 12. It is incomplete for subhedge projection with schema doc at the state (2, {1, 3, 5, 6}): this state is not a subhedge projection state, even though the prefixes list · item and item · item leading to it are subhedge irrelevant.
Figure 14. The congruence projection dSha A^{cgr(doc)} for the counterexample to the completeness of safe-no-change projection, i.e., the dSha A for the filter [child::item] in Figure 12 with the schema-final states F_S = {0, 5, 6}. Note that the state {1}D3 reached over the prefix list · item and the state {2}D3 reached over the prefix item · item are subhedge irrelevant, and thus subhedge projection states.
Figure 15. The transition rules Δ^{cgr} of the congruence projection A^{cgr(S)}.
Figure 16. A dSha A for L ( A ) = { a } and schema-final states F S = { 3 , 5 } for the schema S = L ( A ) { · a } .
Figure 17. The congruence projection A cgr ( S ) for the dSha A and the schema-final states F S in Figure 16.
Figure 18. The congruence projection A^{cgr(doc)} of the dSha A in Figure 4 for the XPath filter [self::list/child::item] with schema doc. Subhedge irrelevant states are colored in orange. The underscore stands for any label, either list or item.
Figure 19. The earliest dSha A^{e(S)} for the dSha A in Figure 4 with schema S = doc for the XPath filter [self::list/child::item].
Figure 20. The transition rules Δ e π inferred from those of the Shas A π with projection states P and A e with selection state sel .
Figure 21. The earliest dSha A^{e(S)snc} with safe-no-change subhedge projection for the dSha A in Figure 4 with schema S = doc for the XPath filter [self::list/child::item].
Figure 22. The dSha for the XPath query self::list[child::item]. The selection position is indicated by x.
Figure 23. The earliest congruence projection A^{e(doc_one_x)}_{cgr(doc_one_x)} for the dSha A in Figure 22 with F_S = {14, 20}. It defines the query of the XPath expression self::list[child::item].
Figure 24. The dNwa A^{nwa} for the dSha A in Figure 9, obtained by applying the down operator to the dSha in Figure 4 for the filter [self::list/child::item].
Figure 25. The dNwa  ( A e ( doc ) snc ) nwa for the earliest dSha with safe-no-change projection in Figure 21.
Table 1. Regular XPath queries A1–A8 from the XPathMark collection.
Query ID  XPath Expression
A1        /site/closed_auctions/closed_auction/annotation/description/text/keyword
A2        //closed_auction//keyword
A3        /site/closed_auctions/closed_auction//keyword
A4        /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
A5        /site/closed_auctions/closed_auction[descendant::keyword]/date
A6        /site/people/person[profile/gender and profile/age]/name
A7        /site/people/person[phone or homepage]/name
A8        /site/people/person[address and (phone or homepage) and (creditcard or profile)]/name
Table 2. Additional regular XPath queries for the XPathMark documents.
A0      /site
A1_0a   /site/*
A1_0b   /site/@*
A1_0c   /site//@*
A1_1a   //bidder/personref[starts-with(@person,'person0')]
A1_1d   //bidder/personref[@person='person0']
A1_2    //person
A1_3    /site/regions/africa//@*
A1_4    /site/regions/africa/*
A1_5    /site/regions/*
A1_6    //closed_auction/annotation//keyword
A2_1    //closed_auction[descendant::keyword]
A4_0    /site/closed_auctions/closed_auction[annotation]/date
A4_1    /site[open_auctions]/closed_auctions
Table 3. Overall size and number n of states of the input dSha A for the query with schema-final states F_S, the overall size and number of states of the dSha A^{e(S)}_{cgr(S)} obtained by earliest congruence projection, and the number d of its difference relations.
Query ID  dSha A Size(n)  dSha Size(#States)  #Diff-Rel. d  |  Query ID  dSha A Size(n)  dSha Size(#States)  #Diff-Rel. d
A1        482(68)         1296(324)           16            |  A1_0c     230(43)         238(62)             6
A2        224(42)         316(82)             7             |  A1_1a     305(54)         384(101)            8
A3        320(53)         662(156)            10            |  A1_1d     305(54)         382(101)            8
A4        629(74)         1651(404)           18            |  A1_2      194(39)         152(42)             5
A5        438(63)         1226(269)           13            |  A1_3      318(53)         672(159)            10
A6        675(76)         2090(500)           22            |  A1_4      312(52)         479(132)            10
A7        394(59)         728(184)            12            |  A1_5      266(47)         293(84)             8
A8        648(79)         2091(504)           24            |  A1_6      265(47)         588(142)            9
A0        203(40)         158(44)             5             |  A2_1      232(42)         295(78)             7
A1_0a     224(42)         145(44)             6             |  A4_0      392(59)         719(184)            12
A1_0b     203(39)         68(23)              5             |  A4_1      292(49)         287(78)             8
Table 4. Time gain and event gain by earliest congruence projection with AStream3.01 on a 1.1 GB XPathMark stream.
Query ID  #Answers  Time in s  Time Gain  Event Gain  |  Query ID  #Answers   Time in s  Time Gain  Event Gain
A1        38,267    29.3       98.1%      98.9%       |  A1_0c     3,600,816  226.6      72.9%      75.7%
A2        117,389   171.8      77.9%      81.1%       |  A1_1a     2          187.8      78.1%      80.3%
A3        117,389   33.4       97.4%      97.8%       |  A1_1d     2          217.3      74.8%      80.3%
A4        24,959    33.2       97.8%      98.9%       |  A1_2      1,062,965  238.2      68.2%      76.0%
A5        50,186    –          –          –           |  A1_3      25,074     16.2       99.1%      99.8%
A6        30,307    45.9       96.9%      98.2%       |  A1_4      5170       17.5       99.7%      100.0%
A7        179,889   30.5       98.0%      98.7%       |  A1_5      6          14.3       100.0%     100.0%
A8        67,688    37.8       97.3%      98.7%       |  A1_6      117,389    188.7      75.4%      81.1%
A0        1         21.3       99.1%      100.0%      |  A2_1      50,186     199.7      76.5%      81.1%
A1_0a     6         15.2       99.9%      100.0%      |  A4_0      91,650     21.8       99.0%      99.3%
A1_0b     0         0.0        100.0%     100.0%      |  A4_1      1          12.0       100.0%     100.0%
Table 5. Time gain for earliest safe-no-change and earliest congruence projection, and the event gain of earliest congruence projection.
Query ID  Time (s) Snc  Time (s) Congr  Time Gain Snc  Time Gain Congr  Event Gain Congr
A1        72.8          29.3            92.3%          98.1%            98.9%
A2        664.6         171.8           8.5%           77.9%            81.1%
A3        666.1         33.4            10.9%          97.4%            97.8%
A4        78.5          33.2            92.4%          97.8%            98.9%
A5        77.7          –               92.3%          –                –
A6        65.2          45.9            95.1%          96.9%            98.2%
A7        24.6          30.5            98.8%          98.0%            98.7%
A8        24.8          37.5            98.9%          97.3%            98.7%
Table 6. Comparison of parse-free timings in seconds for XPathMark queries with QuiXPath and earliest congruence projection.
Query ID  QuiXPath Parse-Free Time (s)  Congr. Prj. Parse-Free Time (s)  Fract.  |  Query ID  QuiXPath Parse-Free Time (s)  Congr. Prj. Parse-Free Time (s)  Fract.
A1        11.0                          14.3                             1.3     |  A2        11.4                          156.8                            13.8
A4        11.6                          18.2                             1.6     |  A3        11.5                          18.4                             1.6
A6        10.7                          30.9                             2.9     |  A5        12.0                          –                                –
A7        8.6                           15.5                             1.8     |
A8        8.8                           22.5                             2.6     |