1. Introduction
Hedges are sequences of letters from an alphabet and of trees ⟨h⟩, where h is again a hedge. Hedges abstract from the syntactical details of Xml or JSON documents while still being able to represent them. The linearization of a hedge is a nested word that can be written to a file. In this article, we study the problem of hedge pattern matching, i.e., whether a given hedge h belongs to a given regular hedge language L. Regular hedge languages can be defined in XPath or JSONPath in particular.
Projection is necessary for the efficiency of many algorithms on words, trees, hedges, or nested words. Intuitively, an algorithm is projecting if it visits only a fragment of the input structure: in the best case, the part that is relevant to the problem under consideration. The relevance of projection for Xml processing was already noticed by [1,2,3,4]. Saxon's in-memory evaluator, for instance, projects input Xml documents relative to an Xslt program, which contains a collection of XPath queries to be answered simultaneously [5]. The QuiXPath tool [4] evaluates XPath queries in streaming mode with subtree and descendant projection. Projection during the evaluation of JSONPath queries on JSON documents in streaming mode is called fast-forwarding [6].
Projecting in-memory evaluation assumes that the full graph of the input hedge is constructed beforehand. Nevertheless, projection may still save time if one has to match several patterns on the same input hedge or if the graph was constructed for other reasons anyway. In streaming mode, the situation is similar: the whole input hedge on the stream needs to be parsed, but the evaluators need to inspect only nodes that are not projected away. Given that pure parsing is two to three orders of magnitude faster than pattern evaluation, the time gain of projection may be considerable.
The starting point of this article is the notion of subhedge irrelevance of positions of a hedge h with respect to a hedge language L, which we introduce. These are positions of h where, for all possible continuations, inserting any subhedge does not affect membership in L. We contribute an algorithm for hedge pattern matching with complete subhedge projection. Our algorithm decides hedge pattern matching while completely projecting away the subhedges located at irrelevant subhedge positions. In other words, it decides whether a hedge h matches a pattern L without visiting any subhedge of h that is located at a position that is irrelevant with respect to L.
Regular hedge languages can be represented by nested regular expressions, which can be derived from regular XPath queries [7], or by some kind of hedge automata, which can be compiled from nested regular expressions. We use stepwise hedge automata (Shas) [8], a recent variant of standard hedge automata, which, in turn, date back to the 1960s [9,10]. Shas mix the bottom-up processing of standard tree automata with the left-to-right processing of finite word automata (Dfas). They neither support top-down processing nor have an explicit stack, unlike nested word automata (Nwas) [11,12,13,14,15]. Shas have a good notion of bottom-up and left-to-right determinism, avoiding the problematic notion of determinism for standard hedge automata and the too-costly notion of determinism for Nwas. Furthermore, compilers from nested regular expressions to Shas are available [8], and determinization algorithms [16] produce small Shas for all regular XPath queries in the existing practical benchmarks [7,17].
The motivation for the present work is to add subhedge projection to a recent streaming algorithm for deterministic Shas that can evaluate regular XPath queries in the earliest manner. Most alternative streaming approaches avoid earliest query answering by considering sublanguages of regular queries without delays, so that node selection depends only on the past of the node but not on its future [18,19], or by admitting late selection [20]. In particular, it was recently demonstrated that earliest query answering for regular monadic queries defined by deterministic Shas [21] has a lower worst-case complexity than for deterministic Nwas [22]. Thereby, earliest query answering for regular queries became feasible in practice for the first time. On the other hand, it is still experimentally slower than the best existing non-earliest approaches for general regular monadic queries on Xml streams (see [23] for a comparison), since the latter support projection. How projection could be achieved for Shas had not been studied before the conference version of the present article [24].
Consider the example of the XPath filter
that is satisfied by an
Xml document if its root is a
list element that has an
item child. With the encoding of Xml documents as hedges chosen in the present article, satisfying hedges have the form
, and for some hedges,
. When evaluating this filter on a hedge, it is sufficient to inspect the first subtree for having the root label
and then all its children until one with a root label
is found. The subhedges of these children, i.e., the hedge
and the subhedges of the top-level trees in
and
, can be projected away, as well as the hedge
: they do not have to be inspected for evaluating this filter. However, one must memorize whether the level of the current node is 0, 1, or greater. This level information can be naturally updated in a top-down manner. The evaluators of
Shas, however, operate bottom-up and left-to-right exclusively. Therefore, projecting evaluators for
Shas need to be based on more general machines. We propose
downward stepwise hedge automata (
Sha↓s), a variant of
Shas that supports top-down processing. They are basically Neumann and Seidl’s pushdown forest automata [
25], except that they are applied to unlabeled hedges instead of labeled forests. Furthermore,
Nwas are known to operate in basically the same manner on nested words [
26] while allowing for slightly more general visible pushdowns. We then distinguish subhedge projection states for
Sha↓s, and show how to use them to evaluate
Shas with subhedge projection both in memory and in streaming mode.
As a first contribution, we present the safe-no-change compiler from Shas to Sha↓s that can distinguish appropriate subhedge projection states. The idea is that the safe-no-change Sha↓ can distinguish contexts in which the states of the Sha will safely not change. For instance, the XPath filter
can be defined by a deterministic Sha operating bottom-up, which our compiler maps to a Sha↓ operating top-down. The context information made explicit by top-down processing concerns the levels of the states. This permits us to distinguish projection states, starting from level 2, in which subhedges can be ignored. We prove the soundness of our compiler based on a nontrivial invariant that we establish (we note that the proof required an adaptation of the original compiler from FCT [24]). We also present a counter-example showing that safe-no-change projection is not complete for subhedge projection: a subhedge may be irrelevant even though its state may still be changing.
As a second contribution, we propose the congruence projection algorithm. It again compiles Shas to Sha↓s but relies on congruence relations on automata states. We then prove that congruence projection yields not only a sound algorithm but that this algorithm is also complete for subhedge projection, i.e., all strongly irrelevant subhedges (see Definition 9) are evaluated into a looping state. Congruence projection starts at the top level of hedges with the Myhill–Nerode congruence that is well-known from automata minimization.
For word languages L, this congruence identifies prefixes v that have the same residual language
. A prefix v is then irrelevant if its residual language is either universal or empty. In the case of hedges, which extend words by adding hierarchical nesting via a tree constructor, the congruence must be adapted when moving down into subtrees.
We show that both compilers may increase the size of the automata exponentially since they must be able to convert bottom-up deterministic automata on monadic trees into top-down deterministic automata. That is the same as inverting deterministic automata on words, which is well known to cause an exponential blow-up in the worst case (e.g., for the family of languages , where , the minimal deterministic left-to-right automaton has states while the minimal deterministic right-to-left automaton has states).
The exponential explosion can be avoided when one is interested in the evaluation of hedges by automata with subhedge projection: rather than constructing the complete Sha↓ statically, only the part required for evaluating a given hedge may be constructed on the fly.
Our third contribution is the refinement of the two previous compilers so that they can also distinguish safe states for selection at the earliest position. For this, we combine these compilers with a variant of the earliest membership tester of [
21] that operates in memory by compiling to
Sha↓s instead of
Nwas. Furthermore, membership failure is detected at the earliest position, too. In this way, we obtain the earliest in-memory membership tester for deterministic
Shas.
Our fourth contribution is to show how to run the previous in-memory evaluators in streaming mode while reading the linearization of the hedge into a nested word as an input. Thereby, we can improve on the recent earliest streaming membership tester behind the AStream tool [
21] by adding subhedge projection to it.
Our fifth and last contribution is an implementation and experimental evaluation of the streaming earliest query answering algorithm for d
Shas by introducing subhedge projection into the AStream tool [
21]. We implemented both safe-no-change and congruence projection. For this, we lifted our earliest membership tester with subhedge projection to an earliest query answering algorithm for monadic queries. For the experimentation, we started with deterministic
Shas constructed with the compiler from [
8] for the forward regular XPath queries of the XPathMark benchmark [
27] and real-world XPath queries [
7]. It turns out that congruence projection projects much more strongly than the safe-no-change projection for at least half of the benchmark queries. It reduces the running time for all regular XPath queries considerably since large parts of the input hedges can be projected away. In our benchmark, the projected percentage ranges from 75.7% to 100% of the input stream. For XPath queries that contain only child axes, the earliest query answering algorithm of AStream with congruence projection becomes competitive in efficiency with the best existing streaming tool called QuiXPath [
23], which is not earliest for some queries. Our current implementation of the earliest congruence projection in AStream is slower than QuiXPath by a factor of 1.3 to 2.9 on the benchmark queries. The improvement is smaller for XPath queries with the descendant axis, where less subhedge projection is possible; instead, some kind of descendant projection would be needed. Still, even in the worst case in our benchmark, the earliest congruence projection with AStream is only a factor of 13.8 slower than QuiXPath.
Outline. We start with preliminaries on hedges and nested words in
Section 2. In
Section 3, we recall the problem of hedge pattern matching and formally define irrelevant subhedges. In
Section 4, we recall the definition of
Shas, introduce
Sha↓s, and discuss their in-memory evaluators. In
Section 5, we introduce the notion of subhedge projection states, define when an
Sha↓ is complete for subhedge projection, and present an in-memory evaluator for
Sha↓s with subhedge projection. In
Section 6, we introduce the safe-no-change projection, state its soundness, and show its incompleteness. In
Section 7, we introduce congruence projection and prove it to be sound and complete for subhedge projection. The earliest in-memory evaluators for
Sha↓s with subhedge projection follow in
Section 8. Streaming variants of our in-memory evaluators are derived systematically in
Section 9.
Section 10 discusses how to lift our algorithms from membership testing to monadic queries.
Section 11 discusses our practical experiments.
Section 12 discusses further related work on automata notions related to
Shas and
Sha↓s.
Appendix A presents the reformulation of the earliest query-answering algorithm from [
21] based on
Sha↓s and with schema restrictions.
Appendix B provides the soundness proof for the safe-no-change projection.
Publication Comments. The original version of this journal article was published at the FCT conference [
24]. Compared to the previous publication, the following contributions are new:
The definition of subhedge irrelevant prefixes of nested words and the definition of completeness for subhedge projection (
Section 3).
The addition of schema restrictions throughout the paper and the notion of schema completeness (
Section 4.3).
The notion of completeness for subhedge projection (
Section 5.2).
The congruence projection algorithm and the proof of its soundness and completeness for subhedge projection (
Section 7).
A systematic method to add subhedge projection to an earliest
Sha↓ (
Section 8).
A systematic method to reduce the correctness of streaming automata evaluators to the corresponding in-memory evaluators (
Section 9).
A discussion of how to deal with monadic queries while exploiting the availability of schema restrictions (
Section 10).
An implementation of congruence projection with experimental results for XPath evaluation on
Xml streams (
Section 11).
In
Appendix A, we also show how to obtain the earliest d
Sha↓s from d
Shas by adapting the compiler from [
21] mapping d
Shas to the earliest d
Nwas.
2. Preliminaries
Let A and B be sets. If A is a subset of B, then the complement of A in B is denoted by while keeping the set relative to which the complement is taken implicit. The domain of the binary relation is . A partial function is a binary relation that is functional. A total function is a partial function with .
Words. Let be the set of natural numbers, including 0. Let the alphabet be a set. The set of words over is . A word is , where is written as . We denote the empty word of length 0 by , and by the concatenation of two words .
If is a word, then we call u a prefix of v and a factor of v and w a suffix of v. Given any subset , we denote the set of prefixes of words in L by and the set of suffixes of words in L by .
For any subalphabet , the projection of a word to is the word in that is obtained from v by removing all letters from .
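As a quick illustration, the projection of a word to a subalphabet can be sketched in a few lines of Python (the function name `project` is ours, not from the article):

```python
def project(word, subalphabet):
    """Keep the letters of the subalphabet and drop all others."""
    return [a for a in word if a in subalphabet]

# Projecting "abcabc" to the subalphabet {a, b} removes all c's.
assert project(list("abcabc"), {"a", "b"}) == list("abab")
```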
Hedges. Hedges are sequences of letters and trees
with a hedge
h. More formally, a hedge
has the following abstract syntax:
We assume
and
. Therefore, we consider any word in
as a hedge in
, i.e.,
. Any hedge can be converted or stored as a graph. For the signature
, the graph of an example hedge in
is shown in
Figure 1. It encodes the sequence of
Xml documents in
Figure 2:
Nested Words. A nested word over is a word over the alphabet in which all parentheses are well-nested. Intuitively, this means that for any opening parenthesis, there is a matching closing parenthesis and vice versa.
The set of nested words over
can be defined as the subsets of words over
that are a linearization of the hedge, where the linearization function
is defined as follows:
So, the set of all nested words over
is
. Let
be the inverse of the injective function
restricted to its image, so the mapping from nested words to hedges with
for all
.
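To make the linearization concrete, here is a small Python sketch: trees are wrapped in a `Tree` class (our own hypothetical representation, not from the article), `linearize` follows the recursive definition above, and `parse` realizes its inverse on well-nested words:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    """A tree constructor around a hedge, represented as a tuple."""
    child: tuple


def linearize(hedge):
    """Map a hedge to its nested word: letters stay as they are,
    and a tree becomes '<' + linearize(child) + '>'."""
    out = []
    for item in hedge:
        if isinstance(item, Tree):
            out.append("<")
            out.extend(linearize(item.child))
            out.append(">")
        else:
            out.append(item)  # a letter of the alphabet
    return out


def parse(nested_word):
    """Inverse of linearize, defined on well-nested words only."""
    stack = [[]]
    for symbol in nested_word:
        if symbol == "<":
            stack.append([])
        elif symbol == ">":
            h = tuple(stack.pop())
            stack[-1].append(Tree(h))
        else:
            stack[-1].append(symbol)
    assert len(stack) == 1, "dangling opening parenthesis"
    return tuple(stack[0])


h = (Tree(("list", Tree(("item",)))),)
w = linearize(h)
assert w == ["<", "list", "<", "item", ">", ">"]
assert parse(w) == h
```

Note that `parse` fails on prefixes with dangling parentheses, which is exactly the point of the next paragraphs: prefixes of nested words need not be nested words themselves.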
Nested Word Prefixes. Prefixes, suffixes, and factors of nested words may not be nested words themselves. For instance, the hedge has the linearization . Its prefix is not well-nested since it has a dangling opening parenthesis. Its suffix is also not well-nested since it has a dangling closing parenthesis.
Any algorithm that traverses a hedge
h in a top-down and left-to-right manner inspects all the prefixes of
. Any prefix
v of
not only distinguishes the position of the hedge
h but also specifies the part of the hedge that is located before this position in the linearization
. An example is illustrated graphically in
Figure 3. This particularly holds for streaming algorithms that receive the nested word
as an input on a stream and may be inspected only once from the left to the right. But it holds equally for in-memory algorithms that receive the input hedge
h as a hierarchical structure whose graph is stored in memory and then traverses the graph top-down and left-to-right. Any
node will then be visited twice: once when reading an opening parenthesis 〈 and going down into a subhedge, and another time when going up to the closing parenthesis 〉 after having processed the subhedge.
3. Problem
We formally define the pattern-matching problem for hedges and the notion of subhedge irrelevance, which allows us to characterize the parts of the input hedge that do not need to be visited during pattern matching.
3.1. Hedge Pattern Matching
We are interested in algorithms that match a hedge pattern against a hedge. We start with algorithms that test whether such matching exists. This is sometimes called filtering or, alternatively, Boolean query answering. In
Section 10, we consider the more general problem of returning the set of matchings. This problem is sometimes called monadic query answering. Our experiments in
Section 11 apply monadic query answering for regular XPath queries.
We start from hedge patterns that are nested regular expressions with letters in the signature
. For these, we fix a set
V of recursion variables and assume that it is disjoint from
.
Each nested regular expression
is satisfied by a subset of hedges in
. We recall the definition of this set only informally. The expression
is satisfied only by the empty word, expression
only by the hedge
a, and the expression
only by the one-letter hedge
z. An expression
with the concatenation operator and
is satisfied by the concatenations of hedges satisfying
e and
, respectively. An expression
with Kleene’s star operator and
is satisfied by repetitions of words satisfying
e. The Kleene star provides for horizontal recursion on hedges. The
expression
, where
and
, is satisfied by all hedges without recursion variable
x in which
z is repeatedly replaced by a hedge satisfying
e until it disappears. We adopt the usual restriction [
28] that in any nested regular expression
, all occurrences of
z in
e are guarded by a tree constructor
. In this way, the
-operator provides proper vertical recursion. Also note that, for defining sublanguages of
, we only need such expressions in
in which all occurrences of recursion variables
z are bound in the scope of binder
.
It is well known that nested regular expressions in
can define all regular sublanguages of hedges in
, i.e., they have the same expressiveness as hedge automata with alphabet
. Analogous results for regular tree expressions and tree automata are folklore [
29].
Example 1 (Nested regular expressions of regular XPath filters).
For instance, we can define subsets of hedge encodings of Xml documents with two kinds of elements in by expressions in . For this article, we will use an encoding of the Xml document, which produces hedges that satisfy the following regular expression , where :In our application, the encodings are a little richer in order to distinguish different types of Xml nodes such as elements, attributes, comments, and texts. An XPath filter such as [self::list/child::item] can then be expressed by the nested regular expression: General compilers from regular downward XPath filter queries to nested regular expressions were proposed in [
8]. An implementation was developed in the context of the AStream tool [
21].
We will use schemas to restrict the inputs of our pattern-matching problem. As we are mostly interested in regular schemas, we do not assume regularity globally since safe-no-change projection does not rely on it. The pattern-matching problem for nested regular expressions with respect to a schema receives the following inputs:
It then returns the truth value of the judgment . The input hedge may either be given by a graph that resides in memory or by a stream containing the linearization of the input hedge, which can be read only once from left to right.
3.2. Irrelevant Subhedges
We define the concept of irrelevant occurrences of subhedges with respect to a given hedge pattern. What this means depends on the kind of algorithm that we will use for pattern matching. We will use algorithms that operate on the input hedge top-down, left-to-right, and bottom-up. This holds for streaming algorithms in particular.
Intuitively, when the pattern-matching algorithm reaches a node top-down or left-to-right whose subsequent subhedge is irrelevant, then it can jump over it without inspecting its structure. What jumping means should be clear if the hedge is given by a graph that is stored in memory. Notice that the full graph is to be constructed beforehand, even though some parts of it may turn out irrelevant. Still, one may win lots of time by jumping over irrelevant parts if either the graph was already constructed for other reasons or if many pattern-matching problems are to be solved on the same hedge.
In the case of streams, the irrelevant subhedge still needs to be parsed, but it will not be analyzed otherwise. Most typically, the possible analysis is carried out by automata, which may take two orders of magnitude more time than needed for parsing. Therefore, not having to do any analysis may considerably speed up the streaming algorithm.
Definition 1. Let be a schema and a language satisfying this schema. We define as the least symmetric relation on prefixes of nested words :A nested word prefix u is called subhedge relevant
for L with schema if there exists nested words such that . Otherwise, the prefix u is called subhedge irrelevant
for L with schema . So,
states that
u and
have continuations that behave differently with respect to
L and
. Furthermore, a prefix
u is subhedge irrelevant if language membership does not depend on hedge insertion at
u under the condition that schema membership is guaranteed. The case without schema restrictions is subsumed by the above definition by choosing
. In this case, the complement
is the Myhill–Nerode congruence of
L, which is well-known in formal language theory from the minimization of DFAs (see any standard textbook on formal language theory or Wikipedia (Myhill–Nerode theorem:
https://en.wikipedia.org/wiki/Myhill-Nerode_theorem, accessed on 23 July 2024)).
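For the schema-free word case, this characterization is directly executable: with a DFA for L, a prefix is irrelevant exactly when its residual language is universal or empty, i.e., when every state reachable after reading the prefix agrees on finality. A minimal sketch (all names are ours; we assume a total transition function given as a dict):

```python
def reachable(delta, start):
    """All states reachable from 'start' via the transition dict delta."""
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for (q, _a), r in delta.items():
            if q == p and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def prefix_irrelevant(delta, init, finals, prefix):
    """Word analogue of irrelevance without schema restrictions: the
    residual language after 'prefix' is universal or empty, i.e., all
    states reachable after the prefix agree on finality."""
    q = init
    for a in prefix:
        q = delta[(q, a)]
    states = reachable(delta, q)
    return states <= finals or not (states & finals)


# DFA for L = a(a|b)* over {a, b}: state 1 accepts, state 2 is a sink.
delta = {(0, "a"): 1, (0, "b"): 2,
         (1, "a"): 1, (1, "b"): 1,
         (2, "a"): 2, (2, "b"): 2}
assert prefix_irrelevant(delta, 0, {1}, "a")      # residual is universal
assert prefix_irrelevant(delta, 0, {1}, "b")      # residual is empty
assert not prefix_irrelevant(delta, 0, {1}, "")   # still input-dependent
```

With schema restrictions, as the examples above show, this simple state-local check no longer suffices, which is what motivates the congruence projection developed later in the article.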
Schema restrictions are needed for our application to regular XPath patterns.
Example 2 (Regular XPath filters).
Consider the schema for hedges representing Xml documents with signature , and the regular hedge language , whereis the nested regular expression from Example 1 for the XPath filter . Recall that this nested regular expression can be applied to hedges representing Xml documents only. The nested word prefix is subhedge irrelevant for the language L with schema . Note that its continuation with the suffix belongs to L. Hence, for any , if the continuation belongs to , then it also belongs to L. Nevertheless, for the hedge , the continuation does not belong to L since it does not satisfy the schema . The prefix is also irrelevant for the language L, even independently of the schema. This is because L does not contain any continuation of this prefix to some hedge. The prefix is not irrelevant for the language L wrt . This can be seen with the suffix , since does not belong to L while the continuation with does belong to L and both continuations satisfy the schema .
However, schema restrictions may also have surprising consequences that make the problem more tedious.
Example 3 (Surprising consequences of schema restrictions).
Consider the signature , the pattern , and the schema . The prefix 〈 is indeed subhedge irrelevant to L and . In order to see this, we consider all possible closing suffixes one by one. The smallest closing suffix is 〉. Note that a hedge if and only if . So, membership in L seems to depend on the subhedge h. However, also note that if and only if . Therefore, when assuming that belongs to the schema , it must also belong to L. So, the pattern must match when assuming schema membership. The next closing suffix is . Note that a hedge if and only if . However, , so the pattern will not match for any subhedge h with this prefix when assuming the input hedge satisfies the schema. The situation is similar for larger suffixes, so in no case does language membership depend on the subhedge at prefix 〈 when assuming that the full hedge satisfies the schema .
3.3. Basic Properties
We first show that subhedge irrelevant prefixes remain irrelevant when extended by the nested word of any hedge. This property should be expected for any reasonable notion of subhedge irrelevance.
Lemma 1. For any nested word prefix u, language , and nested word , if the prefix u is subhedge irrelevant for L with schema , then the prefix is so too.
Proof. Let
u be subhedge irrelevant for
L with schema
and
a nested word. We fix arbitrary nested words
and a nested word suffix
such that
and
. Since
and
are also nested words, the subhedge irrelevance of
u yields the following:
Hence, the nested word prefix
is subhedge irrelevant for
L with schema
. □
Definition 2. We call a binary relation D on prefixes of nested words a difference relation on prefixes
if for all prefixes of nested words and nested words : Lemma 2. is a difference relation.
Proof. We must show for all prefixes
and nested words
that
implies
. So, let
be prefixes of nested words and
v be a nested word such that
. Then, there exists a nested suffix
w such that
The suffix
then satisfies
. Hence,
as required. □
In the schema-free case of words, the complement is an equivalence relation that is known as the Myhill–Nerode congruence. In the case of schemas, however, the complement may not be transitive; thus, it may not even be an equivalence relation. This might be surprising at first sight.
Example 4. Let , and . Then, since there is no continuation that extends both ε and a into the schema. For the same reason, we have . However, since and . Thus, , so the complement is not transitive.
It should be noted for all languages L that is reflexive and transitive, so it is an equivalence relation and thus a congruence. For general schemas , may not be transitive, as shown by Example 4, and may thus not be a congruence.
This indicates that schemas may make it more difficult to decide subhedge irrelevance. Indeed, the natural manner in which one might expect to eliminate schemas based on complements does not work, as confirmed by the following lemma (in particular, by the counter-example in its proof):
Lemma 3. For any nested word prefix and , consider the following two properties:
- 1.
u is irrelevant for L with schema .
- 2.
u is irrelevant for with schema .
Then, property 2. implies property 1., but not always vice versa.
Proof. - .:
Here, a counter-example is presented. Let and . Then, prefix a is irrelevant for L with schema . However, prefix a is relevant for with schema , since for the nested words and we have and .
- .:
Next, we assume property 2. Consider arbitrary nested words and a nested word suffix such that . It is sufficient to show that and implies . This implication holds trivially if or . Otherwise, we have and . Property 2., as assumed, then implies . So, property 1. holds.
□
4. Hedge Automata
We recall the notion of stepwise hedge automata (
Shas), introduce their downward extension (
Sha↓), discuss schema-completeness for both kinds of automata, define subhedge projection states for
Sha↓s, and show how to use them for evaluating
Sha↓s with subhedge projection. We limit ourselves to in-memory evaluation in this section but will discuss streaming evaluation in
Section 9. The relation to several closely related notions of automata on hedges, forests, or nested words is discussed in some more detail in
Section 12.
4.1. Stepwise Hedge Automata
Shas are a recent notion of automata for hedges [
8] that mix bottom-up tree and left-to-right word automata in a natural manner. They extend stepwise tree automata [
31] in that they operate on hedges rather than unranked trees, i.e., on sequences of letters and trees containing hedges. They differ from nested word automata (
Nwas) [
11,
15] in that they process hedges directly rather than their linearizations to nested words.
Definition 3. A stepwise hedge automaton (Sha) is a tuple where Σ and are finite sets, , and , where , , and . A Sha is deterministic or equivalently a dSha
if I and contain at most one element, and all relations and are partial functions.
The set of states subsumes a subset I of initial states, a subset F of final states, and a subset of tree initial states. The transition rules in have three forms: if , then we have a letter rule that we write as . If , then we have an apply rule that we write as . Lastly, if , then we have a tree initial rule that we denote as .
In
Figure 4, we illustrate the graph of a d
Sha for the XPath filter
. The graph of
Sha is drawn as usual for finite state automata on words. The states are the nodes of the graph; in the example,
. The transition rules define the edges. The edge of a letter rule
in
is annotated by the letter
. In the example,
. Any apply rule
in
is drawn as an edge
annotated by state
. The initial state
is indicated by an ingoing gray arrow, and the tree initial state
by an ingoing gray arrow that is annotated by symbol
. Final states, such as
, are indicated by a double circle.
4.1.1. Semantics
We next define the semantics of
Shas, i.e., the transitions on hedges that it defines and the hedge language that it accepts. For any hedge
we define the transition steps
such that for all
,
, and
The transitions can be used to evaluate hedges nondeterministically bottom-up and left-to-right using
Shas. The first three rules state how to evaluate sequences of trees and letters, such as a usual finite state automaton, while assuming that the trees were already evaluated to states. The last rule states how to evaluate a tree
from some state
q to some state
q′. This can be visualized as follows:
For any tree initial state
, one has to evaluate the subhedge
h nondeterministically to some state
. For each
obtained, one then returns to a state
such that
nondeterministically.
A hedge is accepted by
A if it permits a transition from some initial state to some final state. The language
is the set of all accepted hedges:
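The nondeterministic bottom-up and left-to-right evaluation can be sketched in Python as follows (the dictionary encoding of Shas and the pair encoding `("tree", h)` of trees are our own, not from the article):

```python
def eval_hedge(sha, states, hedge):
    """Evaluate a hedge with a Sha, bottom-up and left-to-right.
    sha = (letter, apply, tree_init, init, final), where letter maps
    (q, a) to sets of successor states, apply maps (q, p) to sets of
    successor states for a subtree evaluated to p, and tree_init is
    the set of tree initial states."""
    letter, apply_, tree_init, _, _ = sha
    for item in hedge:
        if isinstance(item, tuple) and item[0] == "tree":
            # evaluate the subhedge from the tree initial states ...
            ps = eval_hedge(sha, set(tree_init), item[1])
            # ... then resume with the apply rules q --p--> q'
            states = {q2 for q in states for p in ps
                      for q2 in apply_.get((q, p), set())}
        else:
            states = {q2 for q in states
                      for q2 in letter.get((q, item), set())}
    return states


def accepts(sha, hedge):
    """A hedge is accepted if some final state is reachable from an
    initial state."""
    _, _, _, init, final = sha
    return bool(eval_hedge(sha, set(init), hedge) & set(final))


# Toy Sha accepting exactly the one-tree hedge containing the letter a
# (our own example, not the automaton of Figure 4):
sha = ({("t0", "a"): {"t1"}},   # letter rules
       {("q0", "t1"): {"q1"}},  # apply rules
       {"t0"}, {"q0"}, {"q1"})  # tree initial, initial, final states
assert accepts(sha, (("tree", ("a",)),))
assert not accepts(sha, ("a",))
```

Evaluating with sets of states as above amounts to running the determinization of the automaton on the fly, which is how membership is tested in Section 4.1.4.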
4.1.2. Runs
Runs can represent the whole history of a single choice of the nondeterministic evaluator on a hedge. For instance, the unique successful run of the d
Sha in
Figure 4 on the hedge
with subhedge
is the hedge
whose computation is illustrated in
Figure 5. On the top level, the run justifies the transition
from the initial state 0 to a final state 4. When started in state 0, the subhedge
needs to be evaluated first, so the run on the upper level needs to be suspended until this is done. The subrun on
eventually justifies the transition
. At this time point, the run of the upper level can be resumed and continue in state 4 since
.
When drawing runs, we illustrate any suspension/resumption mechanism with a box, for instance, the box on the edge 0
4. The box stands for a future value computed by the lower level, causing the upper level to suspend until its arrival. Eventually, the box is filled by state 3, as illustrated by 3
, so that the computation can be resumed on the upper level.
We next define runs of
Shas on hedges formally. Whether a hedge with letters and states
is a
-run or simply a run on a hedge
, written
, is defined by the following rules:
Note that if
, then
h can be obtained by removing all states from
R, i.e.,
. Any run
can be identified with the mapping of prefixes of the nested word
to states in
. Note that
. For any prefix
v of
, there exists a unique prefix
r of the nested word
that ends with some state
and satisfies
. We then call
q the state of
R at prefix
v. The run
on hedge
from
Figure 5, for instance, assigns state 0 to the nested word prefixes
and 〈 and state 1 to the nested word prefix
, etc.
A -run is called a run of if it starts with some state in I. A run of A is successful if it ends with some state in F.
Lemma 4. iff there exists a successful run of A.
Proof. Straightforward with induction on h. □
4.1.3. Determinization
Shas can always be determinized using the subset construction known from the determinization of finite state automata on words or of tree automata.
Given a
Sha , the state set of its determinization is
. The transition rules
of the determinization are presented in
Figure 6.
4.1.4. In-Memory Membership Testing
Testing membership for a hedge h, whose graph is stored in memory, can be carried out by computing the unique run of the determinization of A in a bottom-up and left-to-right manner, and testing whether it is successful. Alternatively, one can compute the unique set such that with respect to the determinization of A and test whether . The computations are essentially the same, but the full history of the computation is lost when returning only the set P of states reached and not the run.
For determinizing
Shas we can rely on a variant of the usual subset construction that is well-known from finite state automata on words or standard tree automata (see e.g., [
8,
16]). When it comes to membership testing, on-the-fly determinization is sufficient, which creates only the part of the deterministic automaton that is used by the run on
h. This part is of linear size
while the full determinization of
A may be of exponential size
. In this way, membership
can be tested in time
.
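Membership testing with on-the-fly determinization can be sketched as follows: subset transitions are computed lazily and cached, so only the part of the determinized automaton visited by the run on h is ever built. The dict-based interface and the hedge encoding (a tree as a pair `("tree", h)`) are assumptions of this sketch.

```python
def det_membership(hedge, letter_rules, apply_rules, tree_init, init, final):
    """Deterministic membership test via on-the-fly subset construction:
    only the part of the determinization used by the run on this hedge
    is ever created, keeping the work linear in the size of the hedge."""
    cache = {}  # lazily built transitions of the determinization

    def run(P, h):
        for item in h:
            if isinstance(item, str):        # letter: subset step
                key = (P, item)
                if key not in cache:
                    cache[key] = frozenset(
                        q2 for q in P for q2 in letter_rules.get((q, item), ()))
            else:                            # tree: evaluate the subhedge first
                S = run(frozenset(tree_init), item[1])
                key = (P, S)
                if key not in cache:
                    cache[key] = frozenset(
                        q2 for q in P for p in S
                        for q2 in apply_rules.get((q, p), ()))
            P = cache[key]
        return P

    return bool(run(frozenset(init), hedge) & set(final))
```

The cache plays the role of the lazily constructed part of the determinization; the full subset automaton is never materialized.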
4.1.5. Hedge Accessibility
For any set
we define the set of states accessible from
P via a hedge as follows:
We note that
can be computed from
and
P in linear time in the size of
. A state is hedge accessible from
P if and only if it is accessible from
P in the graph of
. Alternatively, it can be computed as the least fixed point of the following ground Datalog program, depending on
P and
that is of linear size in
A:
Clearly,
if and only if
. Therefore, emptiness can be decided in linear time for
Shas. This is in contrast to the more general notion of
Sha↓s that we introduce next.
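The least-fixpoint computation behind such a ground Datalog program can be sketched in Python. The naive loop below is quadratic rather than linear (a worklist algorithm would make it linear in the automaton size), and the rule-dictionary encoding of the SHA is an assumption of the sketch.

```python
def hedge_acc(P, letter_rules, apply_rules, tree_init):
    """Least-fixpoint sketch of the set of states hedge-accessible from P.
    S collects states accessible from P over some hedge; T collects states
    accessible from a tree-initial state, needed to know which apply
    rules (q, p) -> q' can actually fire."""
    S = set(P)
    T = set(tree_init)
    changed = True
    while changed:
        changed = False
        for X in (S, T):
            for (q, a), qs in letter_rules.items():
                if q in X and not qs <= X:
                    X |= qs; changed = True
            for (q, p), qs in apply_rules.items():
                if q in X and p in T and not qs <= X:
                    X |= qs; changed = True
    return S
```

Emptiness then amounts to checking whether `hedge_acc(I, ...)` intersects the set of final states.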
4.2. Downward Stepwise Hedge Automata
The evaluation of Shas operates in a bottom-up and left-to-right manner, but cannot pass any information top-down. We next propose an extension of Shas to Sha↓s, adding the ability to pass finite state information top-down.
Sha↓s are basically equal to Neumann and Seidl’s pushdown forest automata [
25] but lifted from (labeled) forests to (unlabeled) hedges. See
Section 12 for a more detailed discussion on the relationship of
Sha↓s with other automata notions and
Nwas in particular.
Definition 4. A downward stepwise hedge automaton (Sha↓) is a tuple , where Σ and are finite sets, , and . Furthermore, , , and . A Sha↓ is deterministic or equivalently a dSha↓ if I contains at most one element, and all relations, , , and , are partial functions.
The only difference compared to Shas is the form of the tree-opening rules. If , then we have a tree initial rule that we denote as follows: .
So, state , where the evaluation of a subhedge starts, depends on the state q of the parent.
4.2.1. Semantics
The definition of transition steps for
Sha↓s differs from that of
Shas only in the usage of opening rules in the following inference rule:
This means that the evaluation of the subhedge
h starts in a state
p such that
.
So, the restart state
p that is chosen may now depend on the state
q above. This is how finite state information is passed top-down by
Sha↓s.
Shas, in contrast, operate purely bottom-up and left-to-right.
4.2.2. Runs
The notion of runs can be adapted straightforwardly from
Shas to
Sha↓s. When in state
q, it is sufficient to restart the computation in subhedges on the state
p, such that
. In this way, finite state information is passed down (while for
Shas, some tree initial state is to be chosen that is independent of
q). The only rule of runs to be changed is the following:
An example of a run of the d
Sha↓ in
Figure 7 is shown in
Figure 8. Here, the information on the level of nodes is passed down. It is represented by the number of primes. For instance, when opening the first subtree labeled with
item we use the transition rule
stating that one moves from state 1 on the first level to state 0 on the second level.
One can see that all nodes of the subtrees and are evaluated to the projection state . So, one can jump over these subtrees without reading them. This is justified by the fact that they are on level 2, the information that the evaluator of this Sha↓ was passing down.
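This jumping evaluation can be sketched as follows: the opening rule passes the current state downwards, and any subtree opened into a projection state is skipped without being read. The rule dicts, the explicit set `proj` of projection states, and the level-tracking toy automaton (with hypothetical projection state `"Π"`) are assumptions of this sketch.

```python
def eval_dsha_down(hedge, state, letter_rules, open_rules, apply_rules, proj):
    """Left-to-right evaluation of a deterministic SHA-down with subhedge
    projection: opening into a projection state skips the subtree unread."""
    for item in hedge:
        if isinstance(item, str):               # letter step
            state = letter_rules[(state, item)]
        else:                                   # tree <h>
            child = open_rules[state]           # state passed top-down
            if child in proj:
                p = child                       # jump over the whole subtree
            else:
                p = eval_dsha_down(item[1], child, letter_rules,
                                   open_rules, apply_rules, proj)
            state = apply_rules[(state, p)]     # resume with the returned state
    return state

# hypothetical dSHA-down tracking the nesting level: subtrees below the
# second level are irrelevant, so opening from level-2 state 1 goes to
# the projection state "Π" and the subtree is jumped over
OPEN = {0: 1, 1: "Π"}
LETTER = {(0, "a"): 0, (1, "a"): 1}
APPLY = {(0, 1): 0, (1, "Π"): 1}
```

Note that `LETTER` has no rules at all for state `"Π"`: the letters inside projected subtrees are never looked up.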
4.2.3. In-Memory Membership Testing
Testing membership
for a
Sha↓ in memory can again be carried out by computing the unique run of the determinization of
A. As before, the membership tester can be based on on-the-fly determinization. However, the determinization procedure for
Sha↓s is more complicated than for
Shas since
Sha↓s can pass information top-down and not only bottom-up and left-to-right. Determinization not only relies on the pure subset construction but also uses pairs of states in the subsets, basically in the same way as for nested word automata (
Nwas) [
14,
15]. This is needed to deal with the stack of states that were seen above. Therefore, each determinization step may cost time
as stated for instance in [
23]. The upper time bound for membership testing for
Sha↓s is thus in time
and no longer linear in the combined size of automaton and hedge, as it is for
Shas.
4.2.4. Relationship to Shas
Any
Sha can be mapped to a
Sha↓ while preserving runs and determinism. The only change of
compared to
is described by the following rule:
Independently of the current state
, the
Sha↓ can start the evaluation of the subhedge of a subtree in any open tree state
. For instance, the d
Sha↓ for d
Sha A from
Figure 4 from the introduction is provided in
Figure 9. Conversely, one can convert any d
Sha↓ A into an equivalent
Sha by introducing nondeterminism to eliminate top-down processing. See
Section 12.2 for the construction.
4.2.5. Hedge Accessibility
Hedge accessibility can be defined for
Sha↓s identically as for
Shas. However, the set
can no longer be computed in linear time in the size of
, in contrast to
Shas. It can still be done in cubic time, similarly as for
Nwas [
22]. The problem is that hedge accessibility for
Sha↓s no longer coincides with accessibility in the automaton’s graph.
For instance, in the
Sha↓ from
Figure 7, we have
. Note that
does not belong to this set, even though there is a transition rule
in
, making node
accessible from node 0 in the graph of the automaton. This is because
is not accessible over any hedge from 0; that is, there is no hedge
h such that
wrt
. Note that
is a factor of the nested word of some run of
. This shows that
is accessible from 0 over a nested word prefix nevertheless. For any
Sha↓ A, hedge accessibility can be computed by the following ground Datalog program of cubic size depending on the number of states of
A:
Due to the cubic time of hedge accessibility for
Sha↓s, we will use hedge accessibility only for
Shas where it is in linear time.
4.3. Schema Completeness
Schemas are inputs of the membership problem that we consider, while completeness is a fundamental property of automata. To combine both aspects appropriately, we propose the notion of schema completeness, which we define uniformly for both Shas and Sha↓s.
We call a set of transition rules Sha↓ complete if all its relations , , and are total. For Shas, we require that is a nonempty set (instead of a total relation). A Sha or Sha↓ A is called complete if is complete and . We can complete any Sha or Sha↓ by adding a fresh sink state together with transition rules into the sink for all left-hand sides for which no other transition rule exists. The sink state is made initial if and only if there was no initial state before. It is not final, though. We denote the completion of A by .
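The completion construction can be sketched directly: every missing left-hand side gets a transition into a fresh sink, the sink becomes initial only if there was no initial state, and it is never final. The dict-based encoding and the sink name are assumptions of this sketch.

```python
def complete_sha(states, alphabet, letter_rules, apply_rules, tree_init, init,
                 sink="sink"):
    """Completion of a SHA by a fresh sink state: missing left-hand sides
    of letter and apply rules are sent to the sink; the sink is made
    initial only when there was no initial state, and is never final."""
    Q = set(states) | {sink}
    lr = {k: set(v) for k, v in letter_rules.items()}
    ar = {k: set(v) for k, v in apply_rules.items()}
    for q in Q:
        for a in alphabet:
            lr.setdefault((q, a), {sink})   # complete the letter rules
        for p in Q:
            ar.setdefault((q, p), {sink})   # complete the apply rules
    return Q, lr, ar, set(tree_init) or {sink}, set(init) or {sink}
```

Existing transitions are kept unchanged; only the missing ones are redirected to the sink.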
Definition 5. A partial run of A on a hedge h is a prefix r of a run of on h, such that
A partial run r of A on h is called blocking if there does not exist any partial run of A on h, such that r is a strict prefix of .
For instance, consider the hedge
. The d
Sha in
Figure 4 has the unique run
on
h that is illustrated in
Figure 5. The partial runs of
h are thus all prefixes of
that end with some state, i.e., 0,
,
, etc. None of these partial runs are blocking.
Definition 6. A schema for an automaton A is a hedge language , such that . We call the automaton A schema-complete for schema if no hedge has a blocking partial run of A.
Schemas are usually used to restrict the type of hedges that a problem may accept as input. The dSha pattern-matching problem for a schema takes two inputs: a hedge and an automaton A, rather than a nested regular expression e. The automaton A then selects those input hedges that match the pattern, i.e., such that . For this reason, we assume that is a schema for A, i.e., . We will often assume that A is schema-complete for , so that no partial run of A on any input hedge may ever block. This can always be achieved, based on Lemma 5 below.
Example 5. The dSha in Figure 4 for the XPath filter has signature . It is schema-complete for schema . To make this happen, we added state 5
to this automaton, which is not used by any successful run. Still, this
Sha
is not complete. To complete it, we would have to turn 5 into a sink state by adding many transitions into it, such as and , and also add for all other states the transition rules , , , and . As we will see, automaton completion would make the safe-no-change algorithm incomplete for subhedge projection for Example 5. Without schema-completeness, however, safe-no-change projection may be unsound: it might overlook the rejection of hedges inside the schema. This is why it is much better to assume only schema-completeness rather than full completeness for the input automata of safe-no-change projection. For congruence projection, schema-completeness will be assumed implicitly.
We note that schema-completeness is well-defined even for nonregular schemas. For safe-no-change projection, we indeed do not have to restrict schemas to be regular, while for congruence projection, only regular schemas are considered.
Furthermore, if A is schema-complete for the universal schema and does not have inaccessible states, then A is complete.
Schema-completeness of deterministic automata for regular schemas can always be made to hold, as we will show next. For this, we define that A is compatible with schema if for any hedge there exists a run of A in . Schema-completeness implies compatibility, as shown by the next lemma. The converse is true only when assuming determinism.
Lemma 5. Let A be a deterministic
Sha
with schema . Then, A is compatible with iff A is schema-complete for .
Proof. Suppose that A is schema-complete for . W.l.o.g., we can assume . Note that if , then would be a blocking partial run for any hedge in . Since , it follows that there exists . So, is a partial run for any hedge . Since there exist no blocking partial runs by schema-completeness, this run can be extended step by step to a run on any . Therefore, for any hedge , there exists some run by A, showing compatibility.
For the converse, let A be compatible with and deterministic. We have to show that A is schema-complete for . Let v be a partial run of A on . By compatibility, there exists a run that starts in some initial state of A. By determinism, v must be a prefix of . Thus, v is not blocking. □
Proposition 1. Any
Sha
A with regular schema can be made schema-complete for .
Proof. By Lemma 5, it is sufficient to make A compatible with . Let B be a dSha with . Automaton B is compatible with . Since B is deterministic, Lemma 5 implies that B is schema-complete for . Since is a schema for A with . We then compute the completion of . The product with final states is schema-complete for schema . Furthermore, since , it follows that . □
Finally, note that the schema can be recognized by the same product except for replacing the set of final states by the set of schema final states . We have , where is the dSha obtained from C by replacing the set of final states with .
7. Congruence Projection
We present the congruence projection algorithm for detecting irrelevant subhedges for regular hedge patterns with regular schema restrictions. We prove that congruence projection is complete for subhedge projection, resolving the incompleteness of safe-no-change projection. For this, congruence projection can no longer ignore the schema, so we have to assume that the input schema is regular, too.
7.1. Ideas
The starting idea of congruence projection is to resolve the counter example for the completeness of safe-no-change projection in
Figure 13. There, a state is changed when reading an irrelevant subhedge. However, the state change moves to an equivalent state, so it is not really relevant. Therefore, the idea is to detect when a state always remains equivalent, rather than unchanged, when reading an arbitrary subhedge. This is eventually done by the congruence projection in
Figure 14 that we are going to construct.
The obvious question is as follows: which equivalence relation do we choose? Suppose that we want to test whether a hedge satisfying a regular schema matches a regular pattern L. In the restricted case of words without schema restrictions, the idea could be to use Myhill–Nerode’s congruence, but mapped to states. In the general case, however, the situation becomes more complex, given that the relation may fail to be an equivalence relation. So, it may not be a congruence, as already illustrated in Example 4. Furthermore, the treatment of the nesting of hedges will require updating the considered relation when moving down a hedge.
7.1.1. Difference Relations on States
The congruence projection algorithm will maintain a difference relation on states of the dSha, which is initially induced by the difference relation on hedges and updated whenever moving down a hedge.
Definition 11. Let be a dSha. A difference relation for
is a symmetric relation on states , such that for all ,The set of all difference relations for Δ
is denoted by . We call a subset compatible with a difference relation if . This means that no two states of may be different with respect to D.
Definition 12. Let be a Sha and D a difference relation for Δ
. A subset of states is subhedge irrelevant
for D wrt Δ if is compatible with D. The set of all subhedge irrelevant subsets of states for D wrt. Δ
thus is as follows:A state is subhedge irrelevant for D if the singleton is subhedge irrelevant for D. The set of all subhedge irrelevant states of is denoted by . We consider subhedge-irrelevance for subsets of states since the congruence projection algorithm has to eliminate the nondeterminism that it introduces by a kind of Sha determinization. Determinization is necessary in order to recognize subhedge irrelevant prefixes properly: all single states in a subset may be subhedge irrelevant while the whole subset is not.
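Definition 12 can be read as a direct check, sketched below. The accessibility function `acc` and the toy instance are assumptions of this sketch; the toy also illustrates the remark above, with both singletons being subhedge irrelevant while their union is not.

```python
def is_subhedge_irrelevant(P, D, acc):
    """A subset P is subhedge irrelevant for the difference relation D iff
    acc(P), the set of states hedge-accessible from P, is compatible
    with D, i.e., contains no pair of states that are different wrt D.
    `acc` is assumed given, e.g., a precomputed accessibility function."""
    A = acc(P)
    return not any((p, q) in D for p in A for q in A)

# hypothetical toy instance: every state only accesses itself,
# and states 1 and 2 are different
toy_acc = lambda P: set(P)
toy_D = {(1, 2), (2, 1)}
```

Here `{1}` and `{2}` are each subhedge irrelevant for `toy_D`, but `{1, 2}` is not, which is why the subsets produced by determinization must be tested as wholes.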
7.1.2. Least Difference Relations
For any binary relation , let be the least difference relation on states that contains R.
Lemma 8. .
Proof. The set clearly is a difference relation that contains R, and thus, contains the least such difference relation . Conversely, each pair in the above set must be contained in any difference relation containing R, and thus, in . □
Lemma 9. For any , the difference relation is the value of predicate D in the least fixed point of the ground Datalog program generated by the following inference rules:
Proof. The first inference rule guarantees that . The three latter inference rules characterize difference relations . The second and third rules state that differences in D are propagated backwards over hedges h. This is done recursively, treating letter hedges by the second rule and tree hedges by the third rule. The fourth rule expresses the symmetry of difference relations D. So, the least fixed point of the Datalog program generated by the above rules contains R and is the least relation that satisfies all axioms of difference relations, so it is . □
Proposition 4. For any , the least difference relation can be computed in time .
Proof. The size of the ground datalog program computing from Lemma 9 is at most quadratic in the size of A, so its least fixed point can be computed in time . □
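The ground Datalog computation of the least difference relation from Lemma 9 can be sketched as a closure loop. This sketch assumes deterministic rule dicts mapping (q, a) and (q, p) to single states, and that every state p occurring in an apply rule is hedge-accessible from the tree-initial states (as in a trimmed automaton).

```python
def least_difference_relation(R, letter_rules, apply_rules):
    """Least difference relation containing R: close R under symmetry and
    under backward propagation over letter rules (q, a) -> q' and apply
    rules (q, p) -> q', mirroring the inference rules of Lemma 9."""
    D = set()
    agenda = [tuple(x) for pair in R for x in (pair, pair[::-1])]  # symmetry
    while agenda:
        pair = agenda.pop()
        if pair in D:
            continue
        D.add(pair)
        q1p, q2p = pair
        # propagate the difference backwards over one letter or one tree
        for rules in (letter_rules, apply_rules):
            for (q1, x1), r1 in rules.items():
                if r1 != q1p:
                    continue
                for (q2, x2), r2 in rules.items():
                    if x1 == x2 and r2 == q2p:
                        agenda.append((q1, q2))
                        agenda.append((q2, q1))
    return D
```

For instance, with letter rules 0 -a-> 1 -a-> 2 and the seed difference (1, 2), the states 0 and 1 also become different, since reading the same letter a from them leads to the different states 1 and 2.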
7.2. Algorithm
We now define the congruence projection algorithm by a compiler from dShas and a set of schema final states to Sha↓s, which, when run on a hedge, can detect all subhedge irrelevant prefixes.
7.2.1. Inputs and Outputs
As inputs, the congruence projection algorithm is given a dSha for the regular pattern and a set with . The dSha defines the regular language while the regular schema is recognized by the dSha . So, the same base automaton defines the regular language and regular schema by choosing different sets of final states.
Example 10. In the example dSha
in Figure 4 for the XPath filter with schema , we have and . In the automaton for the counter example for safe-no-change projection in Figure 12, we have and . We note that, if
L and
were given by independent
Shas, then we can obtain a common d
Sha A and a set
as above by computing the product of the d
Shas for
L and
and completing it. As noticed in [
16], it may be more efficient to determinize the
Sha for
in a first step, build the product of the
Sha for
L with the d
Sha for
in a second, and determinize this product in the third step.
The congruence projection of A wrt. will maintain in its current configuration a subset of states and a difference relation on states in . Thereby, the congruence projection dSha↓ can always detect whether the current prefix is subhedge irrelevant for L wrt. schema by testing whether the current set of states is subhedge irrelevant for the current difference relation D.
7.2.2. Initial Difference Relation
The initial difference relation on states
is induced from the difference relation on prefixes
as follows:
The next lemma indicates how
can be defined from
A and
without reference to the languages
and
.
Lemma 10. .
Proof. For one direction let . Then, there exist nested words and an initial state , such that and . Since are nested words, hedge accessibility follows. Furthermore, requires the existence of a hedge , such that and . Hence, there are states and , such that and . Using Lemma 8, this implies that .
For the other direction, let . Using Lemma 8, property shows that there exist states , , and , such that and . From , it follows that there are nested words and an initial state , such that and . Hence, , so that . □
As a consequence of Lemma 10 and Proposition 4, the initial difference relation can be computed in time from A and .
Proposition 5 (Soundness of the initial difference relation).
Let be a complete dSha, , , and . For any hedge and state , such that wrt Δ for some , the nested word is subhedge irrelevant for L and if and only if q is subhedge irrelevant for wrt. Δ.
Proof. We show that is subhedge relevant for L and if and only if q is subhedge relevant for wrt. .
- “⇒”
Let
be subhedge relevant for
L and
. Then, there exist hedges
and a nested word
, such that
Since
A is deterministic and
wrt
for the unique
, it follows that there exist states
,
, and
, such that
From and , it follows that . Since is a difference relation, and , this implies that , too. Hence, , i.e., q, is relevant for wrt. .
- “⇐”
Let
q be subhedge relevant for
wrt.
. Then, there exists a pair
. So, there are hedges
and
, such that
Since
by Lemma 10, there exist a nested word
w and a pair
so that either
or
belongs to
and
Hence, in L and or vice versa. This shows that is relevant for L and .
□
7.2.3. Updating the Difference Relation
For any difference relation
and subset of states
, we define a binary relation
, such that for all states
:
For any subset of states
and difference relation
, let
be the least difference relation that contains
:
Lemma 11. The difference relation can be computed in time from , Δ, and D.
Proof. The binary relation can be computed in time from , D, and . The least difference relation can be computed by ground datalog in time from using Proposition 4. □
Example 11. Reconsider the dSha
for the XPath filter in Figure 4 with the set of schema-final states . Since , we have . The initial difference relation is , which is the symmetric closure of . The subhedge irrelevant states are since only state 0 can access two states in the difference relation , the final state in , and a nonfinal state that is schema-final in of some hedges. The difference relation is the symmetric closure of , which is equal to the difference relation . Hence, the projection states are since from states 0
and 1
one can still reach both 1
and 3
while . The difference relation is the symmetric closure of . Hence, the projection states are . From state 1
, one can only reach states 1
and 3
, which are not different for . Projection states for the initial difference relation contain all states that are safe for selection or safe for rejection with respect to the schema:
Lemma 12. All states in and are subhedge irrelevant for .
Given a d
Sha and a set
, we now construct the congruence projection
as the following d
Sha↓:
where the set of states, initial states, and final states are as follows:
The transition rules in
are provided by the inference rules in
Figure 15. To illustrate the construction and why determinization is needed, we reconsider Example 3, where schema restrictions have consequences that at first sight may be counter-intuitive.
Example 12 (Determinization during congruence projection is needed.).
Reconsider Example 3 with signature , pattern , and schema . This language can be defined by the dSha A in Figure 16. The schema can be defined by the same automaton but with the schema-final states and . The dSha produced by congruence projection is shown in Figure 17. The unique hedge of the language is accepted in state , where D0. The unique hedge in is rejected: after reading the tree , the automaton goes to state where it goes to a projecting sink when reading the subsequent letter a. We note that any hedge in is accepted in state by , even though only the single hedge belongs to L. This is sound given that the hedges in do not belong to schema anyway. We notice that cannot be a projection state since the run on hedge must continue to the sink ; therefore, state is subhedge relevant. Note that both individual states 3 and 4 are subhedge irrelevant for D0 while the subset with both states is subhedge relevant for D0 since D0. This shows that the deterministic construction is indeed needed to decide subhedge irrelevance for cases such as in this example. Also, notice that state 0 is subhedge irrelevant for D1 = ∅. This is fine since the acceptance of hedges of the form by does indeed not depend on h. In contrast, it depends on , so that the subset of states cannot be subhedge irrelevant for D0.
Finally, let us discuss how the transition rules of dSha↓ are inferred. The most interesting transition rule is in . It is produced by the following inference rule, where , , , , and :
This rule states that if P is subhedge irrelevant for , then all states accessible from P must be tried out, since all of them could be reached by some different subhedge that got projected away. As a result, due to subhedge projection, one cannot know into which state of one should go. In order to stay deterministic, we thus go into all possible states, i.e., into the subset . In the example, these are all the states that can be reached when reading some hedge in the pattern , given that the subhedges are not inspected with subhedge projection.
Example 13 (XPath filter
with schema
.).
The congruence projection of the example dSha in Figure 4 is given in Figure 18. This dSha is similar to the safe-no-change projection in Figure 7, except that the useless state is removed and the four projection states are now looping in themselves rather than going to a shared looping state Π
. It should also be noticed that only singleton state sets are used there, so no determinization is needed. As we will see, this is typical for experiments on regular XPath queries, where larger subsets rarely occur.
Example 14 (Counter-example for completeness of the safe-no-change projection.).
Reconsider the counter example for the completeness of the safe-no-change projection, i.e., the dSha in Figure 12 for the XPath query with schema final states . Its congruence projection is shown in Figure 14. We note that the prefix leads to the state {2}D3, which is a subhedge projection state since 2
is subhedge irrelevant for D3. In particular, note that so no two states accessible from 2
are different in D3. This means that state 2
may still be changed to 4
but then it does not become different with respect to D3. This resolves the incompleteness issue with the safe-no-change projection on this example. The first property for states assigned by the congruence projection is that is compatible with D. This means that no two states in are different with respect to D. Intuitively, each state of is as good as each other except for leading out of the schema. So, if is assigned to a prefix, and a suffix leads from some state in to F, then it cannot lead to from some other state in .
Lemma 13. If the partial run of assigns a state to a prefix, then is compatible with D.
We omit the proof since this instructive result will not be used directly later on. Still, compatibility will play an important role in the soundness proof.
7.3. Soundness
We next adapt the soundness result and proof from the safe-no-change projection to congruence projection.
Theorem 2 (Soundness of congruence projection.).
For any dSha and set , the congruence projection of A with respect to preserves the language of A within schema , i.e.,
We note that is a schema for A since . Furthermore, A is schema-complete for , since A is deterministic and .
Proof. The proof of the inclusion will be based on the following two claims:
Claim 2.1a. For all , and
Proof. We prove Claim 2.1a by induction on the structure of h. Suppose that .
- Case
. The induction hypothesis applied to
yields
. We can thus apply the following two inference rules:
in order to close the following diagram with respect to
:
This proves
wrt
, as required by the claim.
- Case
. Since
, we can apply the inference rule:
This proves this case of the claim.
- Case
. We trivially have wrt .
- Case
. By the induction hypothesis applied to and , we obtain the following: and wrt . Hence, wrt .
This ends the proof of Claim 2.1a. The next claim is the key of the soundness proof. □
For any difference relation
D, we define a binary relation
, such that, for any two subsets of states
,
Furthermore, note that
implies
, and thus,
.
Claim 2.2a. Let a hedge, subsets of states, and a difference relation. If , then there exists , such that and .
Proof. If , then by Claim 2.1a. Let . We then have and , and thus, so that () and () of . So, it is sufficient to prove the claim under the assumption that . The proof is by induction on the structure of h.
- Case
. The assumption shows that there exists a subset of states closing the following diagram:
In particular, we have
in
. Let
. Since
, we can infer the following:
We have to distinguish two cases depending on whether
belongs to
or not.
- Subcase
. The induction hypothesis applied to
yields the existence of
, such that
and
- Subsubcase
. Since
, it follows that
. Hence,
. Let
be such that
. We can then apply the following inference rule:
Hence, we can close the diagram as follows:
This shows that
. Since
,
, and
, it follows that
, and thus,
. This shows the claim in this case.
- Subsubcase
. Since
, it follows that
and
. Hence, we can apply the following inference rule for some
:
Hence, we can close the diagram as follows:
This shows that
. Since
, it follows from
and
in
that
. Thus,
, so the claim holds in this case too.
- Subcase
. Claim 2.1a shows that
wrt
. Let
be such that
. We can apply the following inference rule:
We can thus close the diagram below as follows with respect to
:
This shows that
. Since
, it follows from
and
in
that
, and thus,
.
- Case
. Since
, we can apply the inference rule:
Hence,
. With
, the claim follows since
.
- Case
. In this case, we have and , so the claim holds.
- Case
. Since , there exists , such that and . By the induction hypothesis applied to there exists , such that and .
- Subcase
. Since , it follows that and . Furthermore, Claim 2.1a shows that , and hence, wrt . Since and , it follows that . Hence, and .
- Subcase
. In this case, we can apply the induction hypothesis to , showing that there exists , such that and . Hence, and .
This ends the proof of Claim 2.2a. □
Proof of Inclusion . Since is a schema for A, we have , so that it is sufficient to show . Let . Then, there exist and , such that wrt . Let . Since A is deterministic, it follows that wrt . Using Claim 2.2a, this implies the existence of a subset of states , such that and wrt . Furthermore, . In order to prove , it is thus sufficient to show that . We distinguish two cases:
- Case
. Condition () of , shows that so that . Since , it follows that . Thus, , so that .
- Case
. Condition () of yields . Hence, so that , and thus, .
This ends the proof of the first inclusion. □
We next want to show the inverse inclusion . It will eventually follow from the next two claims.
Claim 2.1b. For any hedge , difference relation , projection state and state : if , then .
Proof. By induction on the structure of , suppose that :
- Case
. There must exist states
closing the following diagram:
Since
, the following rule must have been applied to infer
:
Therefore,
. The induction hypothesis applied to
shows that
, too. So,
must have been inferred by applying the rule:
Hence,
, as required.
- Case
. The following rule must be applied:
Hence,
.
- Case
. Obvious.
- Case
. There must exist some , such that wrt . Using the induction hypothesis applied to , we have . We can thus apply the induction hypothesis to to obtain .
This ends the proof of Claim 2.1b. □
We next need an inverse of Claim 2.2a. For relating runs
back to runs of
A, we define for any difference relation
D another binary relation
, such that for any two subsets of states
,
Claim 2.2b.
Let be a subset of states that is compatible with a difference relation . For any hedge and state with there exists a pair of subsets of states , such that and wrt. .
Proof. If , then Claim 2.1b shows that . Let and be the unique subset of states, such that wrt. . We have to show that . Since , we have , so condition () holds. Condition () holds trivially since . For condition (), note that implies . Furthermore, , so that as required.
We can thus assume that . The proof is by induction on the structure of . We distinguish all the possible forms of hedges :
- Case
. By definition of
, there must exist
, such that the following diagram can be closed:
Since
, the following inference rule was applied to infer
wrt
, where
:
Hence, . If , then the induction hypothesis applied to wrt shows that there exists , such that and . Otherwise, the same can be concluded as we argued at the beginning.
- Subcase
. The transition rule
must be inferred as follows for
:
This shows that
. So, we have the following diagram:
Let
be the unique subset of states, such that
wrt
. We can then close the following diagram:
From
, it follows
, and thus,
, i.e., condition (
) of
. Since
, condition (
) of
yields
. Furthermore,
and
in
, which yields
so that conditions (
) and (
) of
follow. Hence,
. In summary, we show that
,
, and
, as required by the claim.
- Subcase
. The transition rule
must thus be inferred as follows for
:
This shows that
, and we can close the following diagram:
Condition (
) of
yields
. Let
be the unique subset of states, such that
wrt.
. Since
and
, it follows from
and
wrt.
that
. That is, condition (
) of
. Since
, condition (
) of
is
. From
and
, it thus follows that
. Hence, conditions (
) and (
) of
are valid, so that
holds. Furthermore,
This shows that
wrt
. The other two requirements of the claim,
and
, were shown earlier.
- Case
. Since
, the following inference rule must be used:
So
,
. Furthermore, since
is compatible with
D, and
D is a difference relation,
is compatible with
D too. Hence,
, showing condition (
) of
. Trivially,
so conditions (
) and (
) of
hold, too. Hence,
.
- Case
. Obvious.
- Case
. Since wrt. , there exists some , such that and wrt . We can apply the induction hypothesis to . Hence, there exist subsets of states , such that , wrt , and . We distinguish two cases:
- Subcase
. Since wrt , Claim 2.1b shows that so that wrt. . Condition () of and imply that . Let be the unique subset of states, such that wrt. . Then, so that , and thus, condition () of holds. Condition () of is trivial since . Since and , it follows that so condition () of holds, too. Hence, wrt and .
- Subcase
. Since and is compatible with difference relation D, it follows that is compatible with D, too. We can thus apply the induction hypothesis to , showing the existence of subsets of states , such that and wrt . So, we have wrt. and , as required.
This ends the proof of Claim 2.2b. □
Proof of Inclusion . Let . Then, there exists a final subset of states , such that wrt . Since I is a singleton or empty, it is compatible with . Claim 2.2b shows that there exists a subset of states , such that wrt. and . Condition () of shows that . Since , it follows that . The determinism of A shows that is a singleton. So, there exists a state , such that .
- Case
. Condition () shows that so that . Since , we have . Since is compatible with , , and , it follows that . Since , this implies , and thus, .
- Case
. Condition () shows that so that . Since , we have . Let be arbitrary. Since and , it follows that . Furthermore, so that . In combination with , this implies so .
This ends the proof of the inverse inclusion, and thus, of . □
7.4. Completeness
We next show the completeness of congruence projection for subhedge projection according to Definition 10. Let be a complete dSha and . Automaton A defines the regular pattern with the schema-final states in the regular schema . We first show that all subhedge irrelevant states of difference relations are subhedge projection states according to Definition 7.
Lemma 14. If , then is a subhedge projection state of with witness .
Proof. We assume that
and have to show that
is a subhedge projection state of
. We have to show that all transition rules starting with
are permitted for a subhedge projection state
with
. They are generated by the following rules, all of which are looping:
□
So, if a partial run of on a prefix u assigns some state with , then is a subhedge projection state by Lemma 14, and thus, subhedge irrelevant by Proposition 2.
Lemma 15. Let , , and . If wrt for some nested word and either
and , or
and ,
then there exists a nested word and , such that wrt Δ.
Proof. By induction on the length of u, suppose that wrt . In the base case, we have . Hence, .
- Case
and . Then, . The unique run of on starts and ends in the tree initial state .
- Case
and . Since , Lemma 14 shows that is a subhedge projection state, so that is subhedge irrelevant for and by Proposition 2. Hence, . Furthermore, since , there exists a hedge h and , such that wrt . Let . The run of A on ends in q and .
For the induction step, we distinguish the possible forms of the nested word u. So, there exist and , such that either of the following cases holds:
- Case
. Let be the state in which the run of on ends.
- Subcase
. Let . Then, there exist , such that wrt . So, the run of on v goes to state .
- Subsubcase
. By construction of , we then have . Since , there exist and , such that in . By induction hypothesis, there exists , such that the run of on ends in p. Hence, the run of A on ends in q, and furthermore, .
- Subsubcase
. By construction of , we then have . Since , there exist and , such that in . By induction hypothesis, there exists , such that the run of on ends in p. Hence, the run of A on ends in q, and furthermore, .
- Subcase
. Then, and . Using the induction hypothesis applied to , there exists , such that . Since is irrelevant for L and , we have .
- Case
. This case is simpler than the previous ones and is thus omitted.
□
Lemma 16. Let a partial run of assign a state to a prefix . If and , then there exists a prefix , such that a partial run of A assigns state q to .
Proof. By induction on the length of u. In the base case, we have . The partial run of on u assigns state , where and . Suppose that and . Then, . The unique partial run of A on ends in the initial state . For the induction step, let , , and . We distinguish the possible forms of u. So, , , and exist, such that either of the following hold:
- Case
. Let be the state in which the partial run of on ends. From , it follows that . Furthermore, wrt . So, the partial run of on v goes to state . Using the induction hypothesis, there exists , such that the partial run of on ends in q. Hence, the partial run of A on ends in q, and furthermore, .
- Case
. Let be the state in which the partial run of on ends. From , it follows that . Furthermore, wrt for some .
- Subcase
. Then, . So, there exists and , such that in . Using Lemma 15, there exists , such that wrt . Using the induction hypothesis applied to , there exists an initial state and a prefix , such that the partial run of A on goes on to state . Hence, the partial run of A on goes on to q and .
- Subcase
. Then, . So, there exists and , such that in . Using Lemma 15, there exists , such that wrt . Using the induction hypothesis applied to , there exists an initial state and a prefix , such that the partial run of A on goes on to state . Hence, the partial run of A on goes on to q and .
- Case
. This case is simpler than the previous one and is thus omitted.
□
The next lemma states the key invariant of congruence projection, which eventually proves its completeness for subhedge projection. For this, we define for any
the binary relation
as the symmetric closure of
, i.e., for all
:
Lemma 17 (Key Invariant).
Let be a dSha, , , and . If has a partial run on prefix to state , then for all , , and , such that there exist , , , and such that Proof. Induction on the number of dangling opening parentheses of u.
In the base case,
u does not have any dangling opening parenthesis so
is a nested word. In this case,
. So,
has a partial run on nested word
to
, where
. Let
,
, and
, such that
It then follows that
. Therefore, we can apply Lemma 16. It shows that there exist nested words
and
, such that
Since
by Lemma 10, there exist a nested word
and states
, such that
Then, we have the following:
For the induction step, let
for some prefix
and nested word
so that
has one open dangling bracket less than
u. Let
have a partial run on nested word
to
. Let
be the state that this run assigns to prefix
. Then,
. Let
,
, and
, such that
Since
, there exist
and states
, such that
and
. In particular,
so that we can apply Lemma 16. It shows that there exist
, such that
and
. Let
and
. Then,
,
wrt
. Using the induction hypothesis applied to
, on which
has a partial run to state
, such that
,
,
, and
, there exist
,
,
, and
, such that
Let
,
, and
. The above then yields the following:
Furthermore,
, so this was to be shown. □
Proposition 6. Let a dSha, , , and . If has a partial run on prefix to some state so that , then is subhedge relevant for L wrt .
Proof. Suppose that
has a partial run on prefix
to some state
so that
. Since
, there exist
. So, there exist
,
, and
, such that
Using Lemma 17, there exist
,
,
, and
, such that
Hence,
is relevant for
L wrt
. □
Theorem 3 (Completeness of congruence projection).
For any complete dSha and set , the congruence projection is sound and complete for subhedge projection for the regular pattern wrt the regular schema .
Proof. The soundness of congruence projection is shown in Theorem 2. For proving completeness, let be a nested word prefix that is strongly subhedge irrelevant for L wrt . By definition of strong irrelevance, the class is irrelevant for L and . Let be the unique state assigned by the partial run of to prefix u. Since is irrelevant for L and , Proposition 6 shows that . Using Lemma 14, is then a subhedge projection state of . □
7.5. In-Memory Complexity
We next discuss the complexity of membership testing with complete subhedge projection based on the in-memory evaluation of the input hedge by the congruence projection of the input automaton.
Lemma 18. The number of states is in , where n is the number of states of A.
Proof. With the deterministic construction, the states of are pairs in . So, the maximal number of states of the congruence projection is . □
Let be the set of difference relations used in the states of .
Corollary 1. Let the signature Σ be fixed. For any complete dSha
A with schema , the membership of any hedge can be tested in memory in time per nonprojected node of h after a preprocessing time of , where d is the cardinality of and n is the number of states of A.
Since is in by Lemma 18, the preprocessing time is in , too.
Proof. Since A is complete, Theorem 3 shows that is complete for subhedge projection for schema . The in-memory evaluator of can be run on any input hedge in time per nonprojected node of h after a preprocessing time of , where d is the cardinality of . Since the signature is fixed and deterministic, is in . The dominating part is in . □
When applied to regular XPath query answering as in our experiments in
Section 11, we have
and
so the cubic part
is not limiting. The sizes of the
Sha↓s are below 2600 counting both states and rules. The upper bound
thus suggests that the preprocessing time may be feasible in practice, even though exponential in the worst case.
We must be careful here since the sizes of the
Sha↓s reported in
Section 11 do not refer to the pure congruence projection
but to the earliest congruence projection
that we will introduce only in
Section 8. The noncreation of transition rules to states that are safe for rejection may lead to a radical size reduction. Alternatively, an important size reduction in
could be obtained by removing all useless states and transition rules. But this would not reduce the precomputation time, just the opposite: it would increase the preprocessing time since cleaning automata a posteriori also needs time, which for
Sha↓s may be nonlinear.
Instead of precomputing from the inputs A and statically, we can compute on the fly only the needed part of for evaluating the input hedge h. We note that the number of transition rules of the needed part is bounded by but may be much smaller due to projection. The preprocessing time then goes down to . Since the full automaton may be exponential in n in the worst case, this may reduce the overall processing time.
On the positive side, on-the-fly evaluation reduces the overall computation time to , where is the number of difference relations in the part of needed for the evaluation of h, so . On the downside, steps will no longer take constant time but cubic time in . It should also be noted that the overall time of the safe-no-change projection on-the-fly evaluation is in , where is the number of subsets of safe states of A in states of , i.e., . This is because each subset of safe states can be computed in linear time in . As argued above, the overall time of the congruence projection obtained by on-the-fly evaluation is in . This is because the computation of a difference relation is in by using ground Datalog programs of cubic size. The additional cubic factor in instead of a linear factor in is the price to be paid for achieving complete subhedge projection with on-the-fly evaluation. Note that this price is largely acceptable in cases where , , and .
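The on-the-fly construction discussed above can be mimicked generically by memoizing transitions as they are first needed, so that only the part of the automaton reachable on the actual input is ever materialized. The following Python sketch is our illustration of this idea; the class name and the `delta` interface are assumptions, not the paper's implementation:

```python
class LazyAutomaton:
    """Computes transitions on demand via a user-supplied delta function
    and caches them, so only the part of the automaton that is actually
    reached on the input is ever constructed."""

    def __init__(self, delta, q0):
        self.delta = delta      # delta(q, a) -> q', possibly expensive
        self.q0 = q0
        self.cache = {}         # memo table: (q, a) -> q'

    def step(self, q, a):
        key = (q, a)
        if key not in self.cache:
            self.cache[key] = self.delta(q, a)   # computed once, reused
        return self.cache[key]

    def run(self, word):
        q = self.q0
        for a in word:
            q = self.step(q, a)
        return q
```

Repeated runs over inputs that reuse the same transitions never recompute them; the cache size is bounded by the number of distinct (state, letter) pairs actually encountered.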
Example 15 (Exponential blow-up).
In order to see how the exponential worst case may happen, we consider a family of regular languages, for which the minimal left-to-right Dfa is exponentially bigger than the minimal right-to-left Dfa. The classical example languages with this property are , where and . Intuitively, a word in belongs to if and only if its -th letter from the end is equal to . The minimal left-to-right Dfa for has many states since it needs to memorize a window of letters. In contrast, its minimal right-to-left Dfa has only states; in this direction, it is sufficient to memorize the distance from the end modulo .
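The blow-up can be observed concretely by determinizing the natural NFA for "the k-th letter from the end is an a" with the subset construction. The following Python sketch is our own illustration (not part of the construction in the text); it counts the reachable deterministic states, which come out to 2^k:

```python
def det_states(k):
    # NFA for "k-th letter from the end is 'a'" over {a, b}:
    # q0 loops on a and b; q0 --a--> q1; q_i --a,b--> q_{i+1}; q_k accepting.
    def step(S, c):
        T = set()
        for q in S:
            if q == 0:
                T.add(0)             # q0 loops on every letter
                if c == 'a':
                    T.add(1)         # guess: this 'a' is k-th from the end
            elif q < k:
                T.add(q + 1)         # advance the counting chain
        return frozenset(T)

    # subset construction: explore all reachable deterministic states
    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:
        S = stack.pop()
        for c in 'ab':
            T = step(S, c)
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return len(seen)
```

Each reachable subset encodes which of the last k letters were an a, so the count doubles with every increment of k, matching the exponential lower bound on the left-to-right Dfa.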
We next consider the schema defined by . It contains all hedges having in all its subtrees a label from as the first letter, and no further occurrences of letters of Σ elsewhere. We then consider the family of hedge languages with schema , such that the sequence of labels of some root-to-leaf path of belongs to . Note that can be recognized in a bottom-up manner by the dSha with states, which simulates the minimal deterministic
Dfa
of on all paths of the input hedge. For an evaluator with subhedge projection, the situation is different. When moving top-down, it needs to memorize the sequence of labels of the -last ancestors, possibly filled with , and there are of such sequences. If, for some leaf, its sequence starts with an “a”, then the subsequent subhedges with the subsequent leaves can be projected away. As a consequence, there cannot be any dSha↓ recognizing that projects away all irrelevant subhedges with fewer than states. In particular, the size of must be exponential in the size of .
9. Streaming Algorithms
We show that dSha↓s can be evaluated on the nested word of a hedge in a streaming manner, so that they produce the same results and projection behavior as by evaluating them on hedges in-memory in a top-down and left-to-right manner.
All aspects related to subhedge projection remain unchanged. Earliest query answering in streaming mode means that nested word suffixes are ignored if irrelevant, and not only nested words as with subhedge projection. What changes between both evaluation modes is memory consumption. For in-memory evaluation, the whole graph of the input hedge must be stored, while in streaming mode, only a stack and a state are maintained at any event. For testing regular language membership, the state space is finite, while the stack space becomes infinite, since the depth of the stack depends on the depth of the input hedge.
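The configuration maintained in streaming mode, a state paired with a stack of parent states, can be sketched as follows in Python. The encoding of the Sha↓ transitions as three maps is our assumption for illustration, not the paper's concrete data structures:

```python
def run_stream(letter, open_, close, q0, events):
    """Streaming evaluation over a nested-word stream.
    letter[(q, a)] -> q'   internal letter a read in state q
    open_[q]       -> q'   state entered below an opening bracket '<'
    close[(p, q)]  -> q'   combine parent state p with child result q at '>'
    The configuration is the pair (q, stack)."""
    q, stack = q0, []
    for e in events:
        if e == '<':              # opening bracket: push parent state, descend
            stack.append(q)
            q = open_[q]
        elif e == '>':            # closing bracket: pop parent, apply close rule
            p = stack.pop()
            q = close[(p, q)]
        else:                     # internal letter
            q = letter[(q, e)]
    assert not stack, "input was not a well-nested word"
    return q

# Tiny example automaton: state 1 records that an 'a' occurred at the top
# level; subtrees are entered in a fresh state 0 and their result discarded.
letter = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 1}
open_  = {0: 0, 1: 0}
close  = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}
```

As stated in the text, the state space is finite while the stack grows with the depth of the input, so memory is bounded by the nesting depth rather than the stream length.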
The property of having equivalent streaming and in-memory evaluators was already noticed in [
26] for Neumann and Seidl’s pushdown forest automata [
25] and for nested word automata [
11,
15]. We here show how to transpose this result to d
Sha↓s. This permits us to show the correctness of our evaluation algorithms based on properties of the constructed automata. It also permits obtaining correctness properties of streaming algorithms by reasoning about the in-memory evaluator of
Sha↓s.
9.1. Streaming Evaluators for Downward
Hedge Automata
We define the streaming evaluator of a
Sha↓ by a visibly pushdown machine. The set of configurations of this machine is
, containing pairs of states and stacks of states. The visibly pushdown machine provides for any word
a streaming transition relation
, such that for all states
and stacks
and words
:
We note that the same visibly pushdown machine can be obtained by compiling the
Sha↓ to an
Nwa, whose streaming evaluator is defined by a visibly pushdown machine too (see
Section 12.3 on related work).
The notion of partial runs of the in-memory evaluator is closely related to streaming runs. To see this, we first note that each partial run
v wrt
naturally defines a stack as follows. Since
v is a nested word prefix, there exist unique nested words
, such that
These nested words must be partial runs of
too, so there exist states
, such that
goes from
to
for all
. We define the stack
of
v by
and say that
v goes from
to
and has stack
. Note that the stack is empty if
v is a nested word, since in this case
so that
and
. For instance, the partial run
has the stack
while the partial run
has stack
.
Lemma 20. Consider a Sha
, states , and a stack . Let v be a prefix of some nested word in . Then, there exists a partial run r wrt Δ from q to with stack σ such that iff wrt Δ.
Proof. This is carried out by induction on the length of prefix v. □
This lemma implies that any in-memory transition of A on a hedge h from q to can be identified with some streaming transition of A on from to .
Proposition 8. .
Proof. If , there exists a run R of with that starts with q and ends with . Consider the partial run on h. Since v is a nested word, its stack is empty. By Lemma 20, we have wrt and . For the inverse implication, assume that . Let . By Lemma 20, there exists a partial run v of from q to with stack . Then, there exists a run R of on h such that . Hence, . □
For any deterministic dSha↓ A, hedge h, and state q, the streaming transition on with respect to starting with can be computed in time per letter after a precomputation in time . So, the overall computation time is in . The streaming memory needed to store a configuration is of size since the size of the visible stack is bounded by the depth of the input hedge.
9.2. Adding Subhedge Projection
We next extend the streaming transition relation with subhedge projection, in analogy to what we did for the in-memory transition relation. We define a transition relation with subhedge projection
with respect to
, such that for all words
, letters
, states
, and stacks
:
The projecting transition relation stays in a configuration with a subhedge projection state until the end of the current subhedge is reached. This is correct since a subhedge projection state cannot be changed anyway.
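This skipping behavior can be sketched as follows, again with a hypothetical encoding of the transitions as maps (our names). While the current state is a subhedge projection state, letters are ignored (all transitions loop there) and opened subhedges are skipped by mere bracket-depth counting:

```python
def run_stream_projecting(letter, open_, close, proj, q0, events):
    """Streaming evaluation with subhedge projection.
    proj is the set of subhedge projection states."""
    q, stack, skip = q0, [], 0
    for e in events:
        if skip:                      # inside a projected subhedge:
            if e == '<':              # only track the bracket depth
                skip += 1
            elif e == '>':
                skip -= 1
            continue
        if q in proj and e != '>':    # projection state: transitions loop,
            if e == '<':              # so skip letters and whole subhedges
                skip = 1
            continue
        if e == '<':                  # otherwise evaluate as usual
            stack.append(q)
            q = open_[q]
        elif e == '>':
            p = stack.pop()
            q = close[(p, q)]
        else:
            q = letter[(q, e)]
    return q
```

In the example below, state 9 is a projection state entered below the opening bracket; everything until the matching close is skipped without any transition lookup, and only the close rule is applied.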
Proposition 9. For any Sha↓ , states , nested word v on which Δ
has no blocking partial run starting from q: Proof. Let be the hedge such that . The proof is by induction on the size of v. For the implication from left to right, we assume that . Since , Proposition 8 proves wrt .
- Case
. By Lemma 6, it follows from
wrt
that
. We can then apply the following rule since
v is a nested word:
Hence,
.
- Case
. We distinguish all possible forms of hedge h.
- Subcase
. Let
and
so
. The following rule must have been applied:
Since
is a nested word, it follows that
. By the induction hypothesis applied to the nested words
and
, we get
and
. Hence, we can apply the following rule:
This shows
as required.
- Subcases
or or . Straightforward.
For the implication from right to left, we assume that .
- Case
. Then, the following rule must have been applied with
:
Since we assume that
v does not have any blocking partial runs with
, there exists a partial run on
v with
that starts with
q. Since
v is a nested word, any partial run on
v is the nested word of some run, so there exists a run on the hedge
h that starts with
q. This shows the existence of some state
p such that
wrt
. Since
, Lemma 6 shows
, and thus,
wrt
. Proposition 8 then proves that
wrt
, which is equal to
wrt
, as required.
- Case
. We distinguish all possible forms of the hedge h.
- Subcase
. Let
and
so
. The following rule must have been applied:
Since
is a nested word, it follows that
too. By the induction hypothesis applied to the nested words
and
, we get
and
. Hence, we can apply the following rule:
This shows
, as required.
- Subcases
or or . Straightforward.
This ends the proofs of both directions. □
A streaming evaluator with subhedge projection for deterministic Sha↓s A on hedges h can thus be obtained by computing the streaming transition relation with subhedge projection for A of starting with the initial configuration. This costs time at most per letter of , i.e., constant time per event of the stream.
9.3. Streaming Complexity of Earliest Membership
with Subhedge Projection
Using streaming evaluators for Sha↓s, we can test earliest membership with subhedge projection in streaming mode by running the Sha↓ .
Corollary 3. For any complete dShas , , language , and schema , earliest membership with complete subhedge projection for L wrt can be tested in streaming mode for any hedge in time per nonprojected letter of after a preprocessing time of , where d is the number of difference relations and s the number of safe subsets in states of . The space required is in .
Proof. By Theorem 4, the dSha↓ is earliest and sound and complete for subhedge projection. By Propositions 8 and 9, we can thus obtain a streaming membership tester with subhedge projection for a hedge by running the streaming evaluator with subhedge projection of on the nested word . Since this Sha↓ is deterministic, the streaming evaluation can be done in time per letter of after a preprocessing time of . □
As for safe-no-change projection, the exponential preprocessing time can again be avoided by creating the part of the Sha↓ needed for the evaluation of the hedge on the fly. The size of the needed part is in . Hence, the space requirements can also be reduced to , which may be too much for large streams . The needed part of , however, may be much smaller than . Furthermore, a stack of size has to be maintained but this dependence on h may be acceptable even for large streams .
10. Monadic Regular Queries and XPath
We next consider the problem of how to answer monadic regular queries on hedges in an earliest manner and with subhedge projection. Each monadic query defines a set of positions for each input hedge. The XPath query self::list[child::item], for instance, can be applied to the hedge , where it selects the position of the first top-most tree labeled by since it contains a child tree labeled by . We identify positions of hedges by integers. For instance, the selected element in the hedge above is identified by the integer 2. Therefore, the answer set of the above query on the above hedge is . Note that monadic queries do not return any other information about the selected positions, in contrast to the subtree semantics of XPath in the official W3C standard.
Answering this query is more complicated than verifying whether the XPath filter [self::list/child::item] matches at the root of the first subtree of the hedge, since the answer set needs to be constructed, too. Concerning memory consumption, alive answer candidates need to be buffered for monadic query answering. Therefore, the memory consumption of streaming evaluators of monadic queries may grow linearly in the size of the stream if the number of alive candidates does.
10.1. Monadic Regular Queries
Monadic queries on hedges in can be identified with languages of annotated hedges in , where x is an arbitrarily fixed selection variable. A monadic query on hedges in can then be identified with the language of all annotated hedges in , in which a single position is x-annotated and this position is selected by the query.
Therefore, regular monadic queries on hedges in
can be defined by nested regular expressions in
or by a d
Sha with the same signature. The regular XPath query
self::list[child::item], for instance, can be defined by the following nested regular expression:
where ⊤ is like a wildcard accepting any hedge in
without
x, i.e., for some recursion variable
:
Note that the nested regular expression for the XPath filter
[self::list/child::item] in
can be obtained from that of the XPath query
self::list[child::item] by removing the
x. In addition, for any monadic query on hedges, we have to restrict the annotated hedges to be recognized by the language of the query such that exactly one position in each hedge is annotated by
x. This can be done by intersection with the following nested regular expression
:
The previous schema
for pattern matching has now to be changed from to
, where
is like
but over the signature
. For the regular XPath query
self::list[child::item] we obtain the d
Sha in
Figure 22. The transition rules with the special letter
x indicate, where positions may be bound to the selection variable:
x can be bound only in state 2, that is in subtrees starting with a
element. The same automaton can be used to define the schema
by using the schema-final states
instead of the final states
. Indeed, we constructed the automaton by intersecting the automaton obtained from the XPath expression with the automata computed from the regular expressions
and
for the schema. Note that 20 is a final state for the query (and the schema), while 14 is a sink state for the query (but not for the schema).
10.2. Earliest Query Answering with Subhedge Projection
A naive query evaluation algorithm for a hedge automaton with signature receives an input hedge , guesses all possible positions of where to insert x into h, and tests for all of them whether the annotated hedge obtained is accepted by A. In a less naive manner, one inserts x only at those positions where the current state has an outgoing x-transition. All this can be improved if the hedge automaton under consideration supports selection, rejection, or projection states. In all cases, the algorithm has to buffer all binding candidates that are not yet in a selection or rejection state. In the worst case, it has to buffer the bindings of all positions. Therefore, the state space of the evaluator of monadic queries is no longer finite.
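The naive algorithm can be sketched over words, simplifying hedges to flat sequences; all names and the dictionary encoding of the automaton are ours. For each position, a candidate run on the x-annotated input is attempted, and the position is selected iff that run accepts:

```python
def answer_monadic(delta, q0, final, word):
    """Naive monadic query answering over a word (a flat simplification
    of hedges): at each position where the current state can bind x, fork
    a candidate run on the x-annotated word; a position is selected iff
    its candidate run ends in a final state."""
    answers = []
    for i in range(len(word)):
        q = q0
        for a in word[:i]:            # run on the un-annotated prefix
            q = delta[(q, a)]
        if (q, 'x') not in delta:
            continue                  # x cannot be bound at this position
        q = delta[(q, 'x')]           # bind the selection variable here
        for a in word[i:]:            # continue on the annotated suffix
            q = delta[(q, a)]
        if q in final:
            answers.append(i + 1)     # positions are 1-based integers
    return answers
```

The quadratic re-running of prefixes is exactly what the less naive candidate-buffering evaluator avoids; the sketch only illustrates the semantics of selecting positions via x-annotations.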
Selection and rejection states can be identified by adapting the earliest query answering algorithm from [
21]. This can be made such that one obtains an
Sha↓ with signature
depending on the schema. It should be noticed that automata used in [
21] are d
Nwas instead of
Sha↓s, and that schemas are ignored there. The needed adaptations are therefore discussed in
Appendix A. How to obtain an efficient query answering algorithm in streaming mode from
is the main topic of [
21]. This is the base algorithm that we will extend with subhedge projection for our experiments.
In order to add congruence projection to the earliest query answering algorithm, we have to run the automaton
. The earliest congruence projection for the XPath query
self::list[child::item] is shown in
Figure 23. It binds
x in state
2D1 after having read a top-level
-element and then checks for the existence of a child element
, going into selection state
1D3 once it is found. All other children of the top-level
element are projected in state
2D3. For adding earliest query answering to self-no-change projection, we have to run the automaton
instead of
. We do not give the details, but claim for the present query that the projection power of both automata is the same.
Fortunately, the soundness and completeness theorems for earliest membership testing with subhedge projection only affect the constructions of these Sha↓s, but not how they are to be evaluated. However, there is one additional issue: nested word prefixes cannot be considered as irrelevant for the subsequent subhedge if the selection variable x can be bound there, even if acceptance does not depend on where it will be bound. This is because the position of the binding is to be returned by any query answering algorithm.
Definition 15. We call a prefix v binding irrelevant for L and if there does not exist any hedge h containing x and a suffix w such that and .
Our subhedge projecting evaluator can project only at subhedge irrelevant prefixes that are also binding irrelevant. Given a dSha A and a set of schema-final states , one can decide whether a prefix is binding relevant: this is the case if the current state q does not itself bind x but can access some other state that is not safe for rejection and permits x-bindings. In ground Datalog, we can distinguish binding irrelevant states q by imposing the following rules for all :
binding_irrelev(q) :- state(q), not binding_relev(q).
binding_relev(q) :- not bind_x(q), acc(q,q’), bind_x(q’), not rej(q’).
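These two rules admit a direct bottom-up reading. The following Python rendering is ours, with the accessibility relation acc computed as a reflexive-transitive closure of the transition edges:

```python
def accessible(states, edges):
    """Reflexive-transitive closure of the transition edges, computed as a
    least fixed point (mirroring bottom-up Datalog evaluation)."""
    acc = {(q, q) for q in states} | set(edges)
    changed = True
    while changed:
        changed = False
        for (p, q) in list(acc):
            for (q2, r) in list(acc):
                if q == q2 and (p, r) not in acc:
                    acc.add((p, r))
                    changed = True
    return acc

def binding_irrelevant(states, acc, bind_x, rej):
    """Direct reading of the two Datalog rules above: a state is binding
    relevant iff it does not itself bind x but can access a state that
    binds x and is not safe for rejection; all other states are binding
    irrelevant."""
    relev = set()
    for q in states:
        if q in bind_x:
            continue                              # not bind_x(q) fails
        if any((q, q2) in acc and q2 in bind_x and q2 not in rej
               for q2 in states):
            relev.add(q)
    return states - relev
```

In practice, the fixed points are computed by the Datalog engine itself; the sketch only makes the intended semantics of the rules explicit.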
Note in our example of the XPath query
self::list[child::item] that state
2D3 of the d
Sha↓ in
Figure 23 is binding irrelevant, since
x must be bound on the top-level
element in state
2D1, so it cannot be bound on any
element below.
11. Experiments with Streaming XPath Evaluation
We experimentally compare three streaming evaluators for dShas performing earliest monadic query answering: without subhedge projection, with safe-no-change projection, and with congruence projection.
11.1. Benchmark dSha for XPath Queries
We start from d
Sha defining monadic queries for the regular XPath queries A1-A8 from the XPathMark benchmark [
7] that are given in
Table 1. These XPath queries show most of the features of the regular fragment of XPath. In
Table 2, we added 14 further XPath queries that we found useful for testing.
Deterministic
Shas for A1–A8 were provided earlier in [
21]. We compiled the other XPath queries to dShas via nested regular expressions.
In order to produce the input d
Shas for our evaluators, we intersect these automata with a d
Sha for the schema
where
is the schema we use for hedge encodings of real-world
Xml documents with
x annotations. Thereby, we could identify the schema final states
. We minimized and completed the result.
Table 3 reports the size of the input d
Shas for our evaluators obtained by the above procedure. For each d
Sha A, the size is given by two numbers size(n), the first for the overall size and the second for the number of states n.
For the input d
Sha A of each query, we statically computed the whole d
Sha↓s
while using the necessary parts of the determinization algorithm from [
16]. The size of the d
Sha↓s obtained and the number
d of difference relations are reported in
Table 3 too. The biggest size is 2091(504) for the input d
Sha for A8. The largest number of difference relations d = 24 is also obtained for this query. So, indeed, the size of these automata is much smaller than one might expect from a construction that is highly exponential in the worst case. However, the only query for which we could not obtain the corresponding deterministic earliest congruence projection
is A5.
Computing the earliest congruence projection took between 2.3 and 26 s.
We note that we have not yet computed the complete dSha↓s statically for safe-no-change projection , pure congruence projection , and earliest query answering . We believe that these automata may become bigger than , which we constructed in a single shot.
11.2. Streaming Evaluation Tool: AStream
We are using and developing the AStream tool for answering monadic d
Sha queries on
Xml streams with schema restrictions. Version 1.01 of AStream, presented in [
21], supports earliest query answering without projection. Given a d
Sha A defining a monadic query, a set of schema final states
and an
Xml stream
w, it constructs on-the-fly the needed part of the
Sha↓ while parsing
w and evaluating
A on the hedge encoding
w.
For the FCT conference version of the present paper [
24], we enhanced AStream with safe-no-change projection. This leads to version AStream 2.01. It constructs the earliest safe-no-change projection d
Sha↓ on the fly while evaluating the monadic query defined by
A on the hedge encoding the
Xml stream
w. Subhedge projection can be switched on or off in order to compare both versions without further implementation differences.
For this present journal version, we added earliest congruence projection and integrated it into AStream leading to AStream 3.0.
AStream 3.0 is different from the two previous versions in that the dSha↓ is constructed statically and entirely, independently of the input hedge. We then use a generic earliest streaming evaluator for Sha↓s that rejects in rejection states, selects in selection states, and projects in all subhedge projection states. This evaluator could also be run with or , as long as these do not grow too big.
It should be noticed that the dSha↓ for earliest congruence projection turned out to be nicely small for our whole benchmark, while the other dSha↓s risk becoming bigger. As stated earlier, we did not construct these dSha↓s so far for our benchmark queries. This is why we continued to run earliest streaming with safe-no-change projection with AStream 2.01, while we used AStream 3.0 for earliest streaming with congruence projection.
All AStream versions rely on the Java library AbcDatalog for computing the least fixed points of Datalog programs in a bottom-up manner. Datalog programs are needed for all logical reasoning: for computing the subsets of states that are safe for rejection, selection, or subhedge projection on all levels. The difference relations are also computed with Datalog. Note that earliest on-the-fly query evaluation requires running Datalog during query evaluation, while with a static approach that constructs the dSha↓s entirely, Datalog is only needed at preprocessing time during the automaton construction.
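The bottom-up least-fixed-point computation delegated to Datalog can be illustrated with a minimal sketch. The sketch below is not the actual AbcDatalog interface; the rules and relations (a state is "safe" if it is final or has a transition to a safe state) are simplified stand-ins for the safety analyses mentioned above.

```python
# Naive bottom-up Datalog evaluation: apply all rules repeatedly
# until no new facts are derivable (the least fixed point).
final_states = {3}                               # final(q) facts
transitions = {(0, 1), (1, 2), (2, 3), (4, 4)}   # trans(q, p) facts

def least_fixed_point(facts, step):
    """Iterate the rule-application operator until saturation."""
    while True:
        new = step(facts) - facts
        if not new:
            return facts
        facts |= new

def step(safe):
    # safe(q) :- final(q).
    # safe(q) :- trans(q, p), safe(p).
    derived = set(final_states)
    derived |= {q for (q, p) in transitions if p in safe}
    return derived

safe = least_fixed_point(set(), step)
print(sorted(safe))  # [0, 1, 2, 3]: states reaching a final state
```

The same saturation scheme applies to the richer rules used for rejection, selection, and subhedge-projection safety; only the `step` operator changes.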
11.3. Evaluation Measures
We want to measure the time for running the three streaming query evaluators for all dShas obtained from the XPath expressions in the collection in Table 1 and Table 2.
One advantage of the XPathMark benchmark is that it comes with a generator of Xml documents to which the queries can be applied and which can be scaled in size. We created an Xml document of size 1.1 GB, which is sufficiently large to make streaming mandatory. Our experiments show that the time of query evaluation grows linearly with the size of the non-projected part of the Xml document. This holds for each of our three evaluators. Therefore, measuring the time of the evaluators on Xml documents of other sizes would not yield any new insights, except that memory consumption remains moderate too.
The time gain of projection for a query Q is the percentage of time saved by projection during query evaluation when ignoring the pure document parsing time of 15 s:

time-gain(Q) = 100 · (1 − (time_with-proj(Q) − 15) / (time_without-proj(Q) − 15))
We can measure the time gain for earliest safe-no-change projection by AStream 2.01 and for earliest congruence projection by AStream 3.0, since projection can be switched off in both of them. The disadvantage of the time gain is that it depends on the details of the implementation.
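As a concrete illustration, the time gain can be computed as follows; the timings in the example are hypothetical, and the 15 s parsing time is the figure assumed throughout this section.

```python
def time_gain(time_no_proj: float, time_proj: float,
              parse_time: float = 15.0) -> float:
    """Percentage of evaluation time saved by projection,
    ignoring the pure document parsing time (assumed 15 s)."""
    eval_no_proj = time_no_proj - parse_time
    eval_proj = time_proj - parse_time
    return 100.0 * (1.0 - eval_proj / eval_no_proj)

# Hypothetical timings: 815 s without projection, 31 s with projection.
print(round(time_gain(815.0, 31.0), 1))  # 98.0
```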
The event gain is a better measure of the projection power. The parser sends a stream of needed events to the Sha↓, while ignoring parts of the input document that can be projected away. For this, it must always be informed by the automaton about which kinds of events can be projected in the current state. The event gain of projection is then the percentage of events that are recognized to be irrelevant and thus no longer created by a projecting evaluator for query Q. One might expect that the time gain is always smaller than the event gain.
Indeed, this will be confirmed by our experiments. We believe that the discrepancy between these two measures indicates how much room remains for optimizing the implementation of an evaluator. In this way, we can observe that some room for further optimization of AStream 3.0 still remains.
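The event gain is computed analogously from event counts rather than times; the counts in the example below are hypothetical.

```python
def event_gain(events_no_proj: int, events_proj: int) -> float:
    """Percentage of parser events recognized as irrelevant and thus
    no longer created by a projecting evaluator."""
    return 100.0 * (1.0 - events_proj / events_no_proj)

# Hypothetical counts: 50 million events without projection,
# 900 thousand with projection.
print(round(event_gain(50_000_000, 900_000), 1))  # 98.2
```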
11.4. Experiments
We used Java's streaming parser interface to parse and stream Xml files in Scala. Parsing the 1.1 GB Xml document required 15 s, while querying it without projection took between 600 and 900 s, depending on the query. The average processing time for 1 MB was thus 0.72 s. Once we knew this, we predicted the expected evaluation time from the size of the non-projected part of the Xml document as follows:

predicted-time(Q) = 15 s + 0.72 s/MB · size of the non-projected part of the document in MB
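This prediction can be sketched as a one-line function, using the 0.72 s/MB average processing rate and a 15 s parsing time; the 20 MB figure in the example is hypothetical.

```python
def predicted_time(non_projected_mb: float,
                   rate_s_per_mb: float = 0.72,
                   parse_s: float = 15.0) -> float:
    """Predict the evaluation time in seconds from the size (in MB)
    of the non-projected part of the document."""
    return parse_s + rate_s_per_mb * non_projected_mb

# E.g., if only 20 MB of the 1.1 GB document must be inspected:
print(round(predicted_time(20.0), 1))  # 29.4
```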
Without projection, the parser generates the full stream of events for the 1.1 GB Xml document. This is the baseline for computing the event gain for all queries Q.
We ran AStream with Scala v2.13.3 on a machine with the operating system Ubuntu 20.04.06 LTS (64-bit). The machine is a Dell Precision-7750 machine (Round Rock, TX, USA), equipped with an Intel® Core™ i7-10875H CPU @ 2.30GHz × 16 and 32 GB of RAM (Santa Clara, CA, USA).
11.4.1. Earliest Congruence Projection
We present the measures of our evaluator with earliest congruence projection in Table 4. The event gain is high, varying between 75.7% for A1_0c and 100% for A0, A1_0a, A1_0b, A1_4, A1_5, and A4_1. The event gain for queries without descendant axis is above 98.2%. For these queries, all subtrees that are not on the main path of the query can be projected away. In particular, only subtrees up to a fixed depth have to be inspected. These are the queries A1, A4, A6–A8, and A4_0, as well as the queries with event gain 100% listed above. The time gain for all these queries is a little smaller than the event gain but still above 98.2%.
We next consider the second type of queries, having the descendant axis, specifically A2, A1_0c, A1_1a, A1_1d, A1_2, A1_6, and A2_1. Intuitively, a lower event gain is expected in these cases because subhedge projection with the descendant axis cannot directly exclude an entire subtree under a given element; all descendant elements must be inspected. For instance, the query A1_0c must select all attributes of all elements below the root element, requiring the inspection of every Xml element in the document for attribute presence, except for the root element itself.
The lower event gains reported for the aforementioned queries align with the above expectation, showing percentages ranging from 75.7% to 81.1%. These queries also exhibit a larger gap between time gain and event gain, which ranges from 2.2% to 5.7% and will be subject to future optimization. It is worth noting that some of these queries combine multiple features at once, like queries A1_1a and A1_1d, which have a descendant axis, filter with attributes, and compare strings.
The only exceptions to this trend among the queries having a descendant axis are A3 and A1_3, with respective event gains of 97.8% and 99.8%. The reason behind this high gain is that these queries start from the root with a child axis path of depth 3 before using the descendant axis. This initial path allows most elements that do not fit the specified path to be excluded and projected away together with their entire subtrees.
11.4.2. Earliest Safe-No-Change Projection
We compare earliest safe-no-change projection with earliest congruence projection in Table 5. For this, we restrict ourselves to the queries A1–A8 from the XPathMark in Table 1.
The XPath queries A1, A4, A6, A7, and A8 do not have descendant axes. The time gain for safe-no-change projection on these queries is consistently high. For earliest congruence projection, the time gain was better for A1, A4, and A6, with at least 96.6%. For A7 and A8, the time gain of safe-no-change projection is close to the event gain of congruence projection, but the time gain of congruence projection is 1.5% lower. So, we hope that an optimized version of congruence projection could outperform safe-no-change projection in all cases.
The XPath queries with descendant axis are A2, A3, and A5. For A2 and A3 in particular, the time gain of safe-no-change projection is very low, not even reaching 10.9%, while congruence projection yields at least 77.9%. We believe that this is a bug in our current implementation of safe-no-change projection, which may be prohibiting the projection of attributes and text nodes for these queries (but not for the others). Congruence projection on A2 gains 77.9% of time and 81.1% of events. This is much better, but still far from the projection power for queries without descendant axis. For A3, congruence projection gains 97.4% of time. This is much better than for A2, since the descendant axis in A3 is at the end of the path, while for A2 it is at the beginning. But even for A2, congruence projection is very helpful.
11.4.3. Comparison to External Tools
QuiXPath was shown to be the most efficient large-coverage external tool for regular XPath evaluation on Xml streams [4]. Therefore, we compare the efficiency of congruence projection to QuiXPath in Table 6, and thereby, by transitivity, to the many alternative tools. We note that QuiXPath performs early query answering by compilation to possibly nondeterministic Nwas with selection states, without always being earliest. Apart from this, it bases its efficiency on both subtree projection and descendant projection.
The experiments with QuiXPath in [4] use the parse-free time. Therefore, we do the same for congruence projection by reducing the measured overall time by the 15 s of parsing time.
Compared on the parse-free times for the XPath queries without descendant axis, earliest congruence projection is on average only slower than QuiXPath by a factor of 1.3–2.9. This shows that our implementation is already close to being competitive with existing Xml streaming tools, while being the only one guaranteeing earliest query answering.
For query A2 with a descendant axis, congruence projection is slower than QuiXPath by a factor of 13.8. The reason is that QuiXPath supports descendant projection, while congruence projection concerns only subhedges. For query A3, with a descendant axis at the end, congruence projection is only behind by a factor of 1.6, so descendant projection seems less relevant here.
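The slowdown factors reported above are computed on parse-free times, i.e., our measured total time minus the 15 s of parsing, divided by QuiXPath's parse-free time. A small sketch with hypothetical timings:

```python
def slowdown(our_total_s: float, quixpath_parse_free_s: float,
             parse_s: float = 15.0) -> float:
    """Factor by which congruence projection is slower than QuiXPath,
    comparing parse-free times (our total minus 15 s of parsing)."""
    return (our_total_s - parse_s) / quixpath_parse_free_s

# Hypothetical times: 42 s total for us vs. 18 s parse-free for QuiXPath.
print(round(slowdown(42.0, 18.0), 1))  # 1.5
```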