1. Introduction
Hedges are sequences of letters from an alphabet and of trees ⟨h⟩, where h is again a hedge. Hedges abstract from the syntactical details of Xml or JSON documents while still being able to represent them. The linearization of a hedge is a nested word that can be written to a file. In this article, we study the problem of hedge pattern matching, i.e., whether a given hedge h belongs to a given regular hedge language L. Regular hedge languages can be defined in XPath or JSONPath in particular.
Projection is necessary for the efficiency of many algorithms on words, trees, hedges, or nested words. Intuitively, an algorithm is projecting if it visits only a fragment of the input structure: in the best case, the part that is relevant to the problem under consideration. The relevance of projection for Xml processing was already noticed by [1,2,3,4]. Saxon's in-memory evaluator, for instance, projects input Xml documents relative to an Xslt program, which contains a collection of XPath queries to be answered simultaneously [5]. The QuiXPath tool [4] evaluates XPath queries in streaming mode with subtree and descendant projection. Projection during the evaluation of JSONPath queries on JSON documents in streaming mode is called fast-forwarding [6].
Projecting in-memory evaluation assumes that the full graph of the input hedge is constructed beforehand. Nevertheless, projection may still save time if one has to match several patterns on the same input hedge or if the graph was constructed for other reasons anyway. In streaming mode, the situation is similar: the whole input hedge on the stream needs to be parsed, but the evaluators need to inspect only nodes that are not projected away. Given that pure parsing is two to three orders of magnitude faster than pattern evaluation, the time gain of projection may be considerable.
The starting point of this article is the notion of subhedge irrelevance of positions of a hedge h with respect to a hedge language L, which we introduce. These are positions of h where, for all possible continuations, inserting any subhedge does not affect membership in L. We contribute an algorithm for hedge pattern matching with complete subhedge projection. Our algorithm decides hedge pattern matching while completely projecting away the subhedges located at irrelevant subhedge positions. In other words, it decides whether a hedge h matches a pattern L without visiting any subhedge of h that is located at a position that is irrelevant with respect to L.
Regular hedge languages can be represented by nested regular expressions, which can be derived from regular XPath queries [7], or by some kind of hedge automata, which can be compiled from nested regular expressions. We use stepwise hedge automata (Shas) [8], a recent variant of standard hedge automata, which, in turn, date back to the 1960s [9,10]. Shas mix the bottom-up processing of standard tree automata with the left-to-right processing of finite word automata (Dfas). They neither support top-down processing nor have an explicit stack, unlike nested word automata (Nwas) [11,12,13,14,15]. Shas have a good notion of bottom-up and left-to-right determinism, avoiding the problematic notion of determinism for standard hedge automata and the too-costly notion of determinism for Nwas. Furthermore, compilers from nested regular expressions to Shas are available [8], and determinization algorithms [16] produce small Shas for all regular XPath queries in the existing practical benchmarks [7,17].
The motivation for the present work is to add subhedge projection to a recent streaming algorithm for deterministic Shas that can evaluate regular XPath queries in the earliest manner. Most alternative streaming approaches avoid earliest query answering by considering sublanguages of regular queries without delays, so that node selection depends only on the past of the node but not on its future [18,19], or by admitting late selection [20]. In particular, it was recently demonstrated that earliest query answering for regular monadic queries defined by deterministic Shas [21] has a lower worst-case complexity than for deterministic Nwas [22]. Thereby, earliest query answering for regular queries became feasible in practice for the first time. On the other hand, it is still experimentally slower than the best existing non-earliest approaches for general regular monadic queries on Xml streams (see [23] for a comparison), since the latter support projection. How projection could be achieved for Shas had not been studied before the conference version of the present article [24].
Consider the example of the XPath filter
that is satisfied by an
Xml document if its root is a
list element that has an
item child. With the encoding of Xml documents as hedges chosen in the present article, satisfying hedges have the form
, and for some hedges,
. When evaluating this filter on a hedge, it is sufficient to inspect the first subtree for having the root label
and then all its children until one with a root label
is found. The subhedges of these children, i.e., the hedge
and the subhedges of the top-level trees in
and
, can be projected away, as well as the hedge
: they do not have to be inspected for evaluating this filter. However, one must memorize whether the level of the current node is 0, 1, or greater. This level information can be naturally updated in a top-down manner. The evaluators of
Shas, however, operate bottom-up and left-to-right exclusively. Therefore, projecting evaluators for
Shas need to be based on more general machines. We propose
downward stepwise hedge automata (
Sha↓s), a variant of
Shas that supports top-down processing. They are basically Neumann and Seidl’s pushdown forest automata [
25], except that they are applied to unlabeled hedges instead of labeled forests. Furthermore,
Nwas are known to operate in basically the same manner on nested words [
26] while allowing for slightly more general visible pushdowns. We then distinguish subhedge projection states for
Sha↓s, and show how to use them to evaluate
Shas with subhedge projection both in memory and in streaming mode.
As a first contribution, we present the safe-no-change compiler from Shas to Sha↓s that can distinguish appropriate subhedge projection states. The idea is that the safe-no-change Sha↓ can distinguish contexts in which the states of the Sha will safely not change. For instance, the XPath filter
can be defined by a deterministic Sha operating bottom-up, which our compiler maps to a Sha↓ operating top-down. The context information made explicit by top-down processing concerns the levels of the states. This permits us to distinguish projection states, starting from level 2, in which subhedges can be ignored. We prove the soundness of our compiler based on a nontrivial invariant that we establish (we note that the proof required an adaptation of the original compiler from FCT [24]). We also present a counter-example showing that safe-no-change projection is not complete for subhedge projection: a subhedge may be irrelevant even though its state may still be changing.
As a second contribution, we propose the congruence projection algorithm. It again compiles Shas to Sha↓s but relies on congruence relations on automata states. We then prove that congruence projection yields not only a sound algorithm but that this algorithm is also complete for subhedge projection, i.e., all strongly irrelevant subhedges (see Definition 9) are evaluated into a looping state. Congruence projection starts at the top level of hedges with the Myhill–Nerode congruence that is well-known from automata minimization.
For word languages L, this congruence identifies prefixes v that have the same residual language
. A prefix v is then irrelevant if its residual language is either universal or empty. In the case of hedges, which extend words by adding hierarchical nesting via a tree constructor, the congruence must be adapted when moving down into subtrees.
We show that both compilers may increase the size of the automata exponentially since they must be able to convert bottom-up deterministic automata on monadic trees into top-down deterministic automata. That is the same as inverting deterministic automata on words, which is well known to cause an exponential blow-up in the worst case (e.g., for the family of languages , where , the minimal deterministic left-to-right automaton has states while the minimal deterministic right-to-left automaton has states).
The exponential explosion can be avoided when one is interested in the evaluation of hedges by automata with subhedge projection: rather than constructing the complete Sha↓ statically, only the part required for evaluating a given hedge may be constructed on the fly.
Our third contribution is the refinement of the two previous compilers so that they can also distinguish safe states for selection at the earliest position. For this, we combine these compilers with a variant of the earliest membership tester of [
21] that operates in memory by compiling to
Sha↓s instead of
Nwas. Furthermore, membership failure is detected at the earliest position, too. In this way, we obtain the earliest in-memory membership tester for deterministic
Shas.
Our fourth contribution is to show how to run the previous in-memory evaluators in streaming mode while reading the linearization of the hedge into a nested word as an input. Thereby, we can improve on the recent earliest streaming membership tester behind the AStream tool [
21] by adding subhedge projection to it.
Our fifth and last contribution is an implementation and experimental evaluation of the streaming earliest query answering algorithm for d
Shas by introducing subhedge projection into the AStream tool [
21]. We implemented both safe-no-change and congruence projection. For this, we lifted our earliest membership tester with subhedge projection to an earliest query answering algorithm for monadic queries. For the experimentation, we started with deterministic
Shas constructed with the compiler from [
8] for the forward regular XPath queries of the XPathMark benchmark [
27] and real-world XPath queries [
7]. It turns out that congruence projection projects much more strongly than the safe-no-change projection for at least half of the benchmark queries. It reduces the running time for all regular XPath queries considerably since large parts of the input hedges can be projected away. In our benchmark, the projected percentage ranges from 75.7% to 100% of the input stream. For XPath queries that contain only child axes, the earliest query answering algorithm of AStream with congruence projection becomes competitive in efficiency with the best existing streaming tool called QuiXPath [
23], which is not earliest for some queries. Our current implementation of the earliest congruence projection in AStream is slower than QuiXPath by a factor of 1.3 to 2.9 on the benchmark queries. The improvement is smaller for XPath queries with the descendant axis, where less subhedge projection is possible; instead, some kind of descendant projection would be needed. Still, even in the worst case in our benchmark, the earliest congruence projection with AStream is only a factor of 13.8 slower than QuiXPath.
Outline. We start with preliminaries on hedges and nested words in
Section 2. In
Section 3, we recall the problem of hedge pattern matching and formally define irrelevant subhedges. In
Section 4, we recall the definition of
Shas, introduce
Sha↓s, and discuss their in-memory evaluators. In
Section 5, we introduce the notion of subhedge projection states, define when an
Sha↓ is complete for subhedge projection, and present an in-memory evaluator for
Sha↓s with subhedge projection. In
Section 6, we introduce the safe-no-change projection, state its soundness, and show its incompleteness. In
Section 7, we introduce congruence projection and prove it to be sound and complete for subhedge projection. The earliest in-memory evaluators for
Sha↓s with subhedge projection follow in
Section 8. Streaming variants of our in-memory evaluators are derived systematically in
Section 9.
Section 10 discusses how to lift our algorithms from membership testing to monadic queries.
Section 11 discusses our practical experiments.
Section 12 discusses further related work on automata notions related to
Shas and
Sha↓s.
Appendix A presents the reformulation of the earliest query-answering algorithm from [
21] based on
Sha↓s and with schema restrictions.
Appendix B provides the soundness proof for the safe-no-change projection.
Publication Comments. The original version of this journal article was published at the FCT conference [
24]. Compared to the previous publication, the following contributions are new:
The definition of subhedge irrelevant prefixes of nested words and the definition of completeness for subhedge projection (
Section 3).
The addition of schema restrictions throughout the paper and the notion of schema completeness (
Section 4.3).
The notion of completeness for subhedge projection (
Section 5.2).
The congruence projection algorithm and the proof of its soundness and completeness for subhedge projection (
Section 7).
A systematic method to add subhedge projection to an earliest
Sha↓ (
Section 8).
A systematic method to reduce the correctness of streaming automata evaluators to the corresponding in-memory evaluators (
Section 9).
A discussion of how to deal with monadic queries while exploiting the availability of schema restrictions (
Section 10).
An implementation of congruence projection with experimental results for XPath evaluation on
Xml streams (
Section 11).
In
Appendix A, we also show how to obtain the earliest d
Sha↓s from d
Shas by adapting the compiler from [
21] mapping d
Shas to the earliest d
Nwas.
2. Preliminaries
Let A and B be sets. If A is a subset of B, then the complement of A in B is denoted by while keeping the set relative to which the complement is taken implicit. The domain of the binary relation is . A partial function is a binary relation that is functional. A total function is a partial function with .
Words. Let be the set of natural numbers, including 0. Let the alphabet be a set. The set of words over is . A word is , where is written as . We denote the empty word of length 0 by , and by the concatenation of two words .
If is a word, then we call u a prefix of v and a factor of v and w a suffix of v. Given any subset , we denote the set of prefixes of words in L by and the set of suffixes of words in L by .
For any subalphabet , the projection of a word to is the word in that is obtained from v by removing all letters from .
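As a quick illustration, the projection of a word to a subalphabet can be sketched in a few lines of Python (the function name `project` is ours, not from the article):

```python
def project(word, subalphabet):
    """Keep the letters of the subalphabet and drop all others."""
    return [a for a in word if a in subalphabet]

# Projecting "abcabc" to the subalphabet {a, b} removes all c's.
assert project(list("abcabc"), {"a", "b"}) == list("abab")
```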
Hedges. Hedges are sequences of letters and trees
with a hedge
h. More formally, a hedge
has the following abstract syntax:
We assume
and
. Therefore, we consider any word in
as a hedge in
, i.e.,
. Any hedge can be converted or stored as a graph. For the signature
, the graph of an example hedge in
is shown in
Figure 1. It encodes the sequence of
Xml documents in
Figure 2:
Nested Words. A nested word over is a word over the alphabet in which all parentheses are well-nested. Intuitively, this means that for any opening parenthesis, there is a matching closing parenthesis and vice versa.
The set of nested words over
can be defined as the subsets of words over
that are a linearization of the hedge, where the linearization function
is defined as follows:
So, the set of all nested words over
is
. Let
be the inverse of the injective function
restricted to its image, so the mapping from nested words to hedges with
for all
.
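To make the linearization concrete, here is a small Python sketch: trees are wrapped in a `Tree` class (our own hypothetical representation, not from the article), `linearize` follows the recursive definition above, and `parse` realizes its inverse on well-nested words:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    """A tree constructor around a hedge, represented as a tuple."""
    child: tuple


def linearize(hedge):
    """Map a hedge to its nested word: letters stay as they are,
    and a tree becomes '<' + linearize(child) + '>'."""
    out = []
    for item in hedge:
        if isinstance(item, Tree):
            out.append("<")
            out.extend(linearize(item.child))
            out.append(">")
        else:
            out.append(item)  # a letter of the alphabet
    return out


def parse(nested_word):
    """Inverse of linearize, defined on well-nested words only."""
    stack = [[]]
    for symbol in nested_word:
        if symbol == "<":
            stack.append([])
        elif symbol == ">":
            h = tuple(stack.pop())
            stack[-1].append(Tree(h))
        else:
            stack[-1].append(symbol)
    assert len(stack) == 1, "dangling opening parenthesis"
    return tuple(stack[0])


h = (Tree(("list", Tree(("item",)))),)
w = linearize(h)
assert w == ["<", "list", "<", "item", ">", ">"]
assert parse(w) == h
```

Note that `parse` fails on prefixes with dangling parentheses, which is exactly the point of the next paragraphs: prefixes of nested words need not be nested words themselves.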
Nested Word Prefixes. Prefixes, suffixes, and factors of nested words may not be nested words themselves. For instance, the hedge has the linearization . Its prefix is not well-nested since it has a dangling opening parenthesis. Its suffix is also not well-nested since it has a dangling closing parenthesis.
Any algorithm that traverses a hedge
h in a top-down and left-to-right manner inspects all the prefixes of
. Any prefix
v of
not only distinguishes the position of the hedge
h but also specifies the part of the hedge that is located before this position in the linearization
. An example is illustrated graphically in
Figure 3. This particularly holds for streaming algorithms that receive the nested word
as an input on a stream and may be inspected only once from the left to the right. But it holds equally for in-memory algorithms that receive the input hedge
h as a hierarchical structure whose graph is stored in memory and then traverses the graph top-down and left-to-right. Any
node will then be visited twice: once when reading an opening parenthesis 〈 and going down into a subhedge, and another time when going up to the closing parenthesis 〉 after having processed the subhedge.
3. Problem
We formally define the pattern-matching problem for hedges and the notion of subhedge irrelevance, which allows us to characterize the parts of the input hedge that do not need to be visited during pattern matching.
3.1. Hedge Pattern Matching
We are interested in algorithms that match a hedge pattern against a hedge. We start with algorithms that test whether such matching exists. This is sometimes called filtering or, alternatively, Boolean query answering. In
Section 10, we consider the more general problem of returning the set of matchings. This problem is sometimes called monadic query answering. Our experiments in
Section 11 apply monadic query answering for regular XPath queries.
We start from hedge patterns that are nested regular expressions with letters in the signature
. For these, we fix a set
V of recursion variables and assume that it is disjoint from
.
Each nested regular expression
is satisfied by a subset of hedges in
. We recall the definition of this set only informally. The expression
is satisfied only by the empty word, expression
only by the hedge
a, and the expression
only by the one-letter hedge
z. An expression
with the concatenation operator and
is satisfied by the concatenations of hedges satisfying
e and
, respectively. An expression
with Kleene’s star operator and
is satisfied by repetitions of words satisfying
e. The Kleene star provides for horizontal recursion on hedges. The
expression
, where
and
, is satisfied by all hedges without recursion variable
x in which
z is repeatedly replaced by a hedge satisfying
e until it disappears. We adopt the usual restriction [
28] that in any nested regular expression
, all occurrences of
z in
e are guarded by a tree constructor
. In this way, the
-operator provides proper vertical recursion. Also note that, for defining sublanguages of
, we only need such expressions in
in which all occurrences of recursion variables
z are bound in the scope of binder
.
It is well known that nested regular expressions in
can define all regular sublanguages of hedges in
, i.e., they have the same expressiveness as hedge automata with alphabet
. Analogous results for regular tree expressions and tree automata are folklore [
29].
Example 1 (Nested regular expressions of regular XPath filters).
For instance, we can define subsets of hedge encodings of Xml documents with two kinds of elements in by expressions in . For this article, we will use an encoding of the Xml document, which produces hedges that satisfy the following regular expression , where :In our application, the encodings are a little richer in order to distinguish different types of Xml nodes such as elements, attributes, comments, and texts. An XPath filter such as [self::list/child::item] can then be expressed by the nested regular expression: General compilers from regular downward XPath filter queries to nested regular expressions were proposed in [
8]. An implementation was developed in the context of the AStream tool [
21].
We will use schemas to restrict the inputs of our pattern-matching problem. As we are mostly interested in regular schemas, we do not assume regularity globally since safe-no-change projection does not rely on it. The pattern-matching problem for nested regular expressions with respect to a schema receives the following inputs:
It then returns the truth value of the judgment . The input hedge may either be given by a graph that resides in memory or by a stream containing the linearization of the input hedge, which can be read only once from left to right.
3.2. Irrelevant Subhedges
We define the concept of irrelevant occurrences of subhedges with respect to a given hedge pattern. What this means depends on the kind of algorithm that we will use for pattern matching. We will use algorithms that operate on the input hedge top-down, left-to-right, and bottom-up. This holds for streaming algorithms in particular.
Intuitively, when the pattern-matching algorithm reaches a node top-down or left-to-right whose subsequent subhedge is irrelevant, then it can jump over it without inspecting its structure. What jumping means should be clear if the hedge is given by a graph that is stored in memory. Notice that the full graph is to be constructed beforehand, even though some parts of it may turn out irrelevant. Still, one may win lots of time by jumping over irrelevant parts if either the graph was already constructed for other reasons or if many pattern-matching problems are to be solved on the same hedge.
In the case of streams, the irrelevant subhedge still needs to be parsed, but it will not be analyzed otherwise. Most typically, the possible analysis is carried out by automata, which may take two orders of magnitude more time than needed for parsing. Therefore, not having to do any analysis may considerably speed up the streaming algorithm.
Definition 1. Let be a schema and a language satisfying this schema. We define as the least symmetric relation on prefixes of nested words :A nested word prefix u is called subhedge relevant
for L with schema if there exists nested words such that . Otherwise, the prefix u is called subhedge irrelevant
for L with schema . So,
states that
u and
have continuations that behave differently with respect to
L and
. Furthermore, a prefix
u is subhedge irrelevant if language membership does not depend on hedge insertion at
u under the condition that schema membership is guaranteed. The case without schema restrictions is subsumed by the above definition by choosing
. In this case, the complement
is the Myhill–Nerode congruence of
L, which is well-known in formal language theory from the minimization of DFAs (see any standard textbook on formal language theory or Wikipedia (Myhill–Nerode theorem:
https://en.wikipedia.org/wiki/Myhill-Nerode_theorem, accessed on 23 July 2024)).
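For the schema-free word case, this characterization is directly executable: with a DFA for L, a prefix is irrelevant exactly when its residual language is universal or empty, i.e., when every state reachable after reading the prefix agrees on finality. A minimal sketch (all names are ours; we assume a total transition function given as a dict):

```python
def reachable(delta, start):
    """All states reachable from 'start' via the transition dict delta."""
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for (q, _a), r in delta.items():
            if q == p and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def prefix_irrelevant(delta, init, finals, prefix):
    """Word analogue of irrelevance without schema restrictions: the
    residual language after 'prefix' is universal or empty, i.e., all
    states reachable after the prefix agree on finality."""
    q = init
    for a in prefix:
        q = delta[(q, a)]
    states = reachable(delta, q)
    return states <= finals or not (states & finals)


# DFA for L = a(a|b)* over {a, b}: state 1 accepts, state 2 is a sink.
delta = {(0, "a"): 1, (0, "b"): 2,
         (1, "a"): 1, (1, "b"): 1,
         (2, "a"): 2, (2, "b"): 2}
assert prefix_irrelevant(delta, 0, {1}, "a")      # residual is universal
assert prefix_irrelevant(delta, 0, {1}, "b")      # residual is empty
assert not prefix_irrelevant(delta, 0, {1}, "")   # still input-dependent
```

With schema restrictions, as the examples above show, this simple state-local check no longer suffices, which is what motivates the congruence projection developed later in the article.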
Schema restrictions are needed for our application to regular XPath patterns.
Example 2 (Regular XPath filters).
Consider the schema for hedges representing Xml documents with signature , and the regular hedge language , whereis the nested regular expression from Example 1 for the XPath filter . Recall that this nested regular expression can be applied to hedges representing Xml documents only. The nested word prefix is subhedge irrelevant for the language L with schema . Note that its continuation with the suffix belongs to L. Hence, for any , if the continuation belongs to , then it also belongs to L. Nevertheless, for the hedge , the continuation does not belong to L since it does not satisfy the schema . The prefix is also irrelevant for the language L, even independently of the schema. This is because L does not contain any continuation of this prefix to some hedge. The prefix is not irrelevant for the language L wrt . This can be seen with the suffix , since does not belong to L while the continuation with does belong to L and both continuations satisfy the schema .
However, schema restrictions may also have surprising consequences that make the problem more tedious.
Example 3 (Surprising consequences of schema restrictions).
Consider the signature , the pattern , and the schema . The prefix 〈 is indeed subhedge irrelevant to L and . In order to see this, we consider all possible closing suffixes one by one. The smallest closing suffix is 〉. Note that a hedge if and only if . So, membership in L seems to depend on the subhedge h. However, also note that if and only if . Therefore, when assuming that belongs to the schema , it must also belong to L. So, the pattern must match when assuming schema membership. The next closing suffix is . Note that a hedge if and only if . However, , so the pattern will not match for any subhedge h with this prefix when assuming the input hedge satisfies the schema. The situation is similar for larger suffixes, so in no case does language membership depend on the subhedge at prefix 〈 when assuming that the full hedge satisfies the schema .
3.3. Basic Properties
We first show that subhedge irrelevant prefixes remain irrelevant when extended by the nested word of any hedge. This property should be expected for any reasonable notion of subhedge irrelevance.
Lemma 1. For any nested word prefix u, language , and nested word , if the prefix u is subhedge irrelevant for L with schema , then the prefix is so too.
Proof. Let
u be subhedge irrelevant for
L with schema
and
a nested word. We fix arbitrary nested words
and a nested word suffix
such that
and
. Since
and
are also nested words, the subhedge irrelevance of
u yields the following:
Hence, the nested word prefix
is subhedge irrelevant for
L with schema
. □
Definition 2. We call a binary relation D on prefixes of nested words a difference relation on prefixes
if for all prefixes of nested words and nested words : Lemma 2. is a difference relation.
Proof. We must show for all prefixes
and nested words
that
implies
. So, let
be prefixes of nested words and
v be a nested word such that
. Then, there exists a nested suffix
w such that
The suffix
then satisfies
. Hence,
as required. □
In the schema-free case of words, the complement is an equivalence relation that is known as the Myhill–Nerode congruence. In the case of schemas, however, the complement may not be transitive; thus, it may not even be an equivalence relation. This might be surprising at first sight.
Example 4. Let , and . Then, since there is no continuation that extends both ε and a into the schema. For the same reason, we have . However, since and . Thus, , so the complement is not transitive.
It should be noted for all languages L that is reflexive and transitive, so it is an equivalence relation and thus a congruence. For general schemas , may not be transitive, as shown by Example 4, and may thus not be a congruence.
This indicates that schemas may make it more difficult to decide subhedge irrelevance. Indeed, the natural manner in which one might expect to eliminate schemas based on complements does not work, as confirmed by the following lemma (in particular, by the counter-example in its proof):
Lemma 3. For any nested word prefix and , consider the following two properties:
- 1.
u is irrelevant for L with schema .
- 2.
u is irrelevant for with schema .
Then, property 2. implies property 1., but not always vice versa.
Proof. - .:
Here, a counter-example is presented. Let and . Then, prefix a is irrelevant for L with schema . However, prefix a is relevant for with schema , since for the nested words and we have and .
- .:
Next, we assume property 2. Consider arbitrary nested words and a nested word suffix such that . It is sufficient to show that and implies . This implication holds trivially if or . Otherwise, we have and . Property 2., as assumed, then implies . So, property 1. holds.
□
4. Hedge Automata
We recall the notion of stepwise hedge automata (
Shas), introduce their downward extension (
Sha↓), discuss schema-completeness for both kinds of automata, define subhedge projection states for
Sha↓s, and show how to use them for evaluating
Sha↓s with subhedge projection. We limit ourselves to in-memory evaluation in this section but will discuss streaming evaluation in
Section 9. The relation to several closely related notions of automata on hedges, forests, or nested words is discussed in some more detail in
Section 12.
4.1. Stepwise Hedge Automata
Shas are a recent notion of automata for hedges [
8] that mix bottom-up tree and left-to-right word automata in a natural manner. They extend stepwise tree automata [
31] in that they operate on hedges rather than unranked trees, i.e., on sequences of letters and trees containing hedges. They differ from nested word automata (
Nwas) [
11,
15] in that they process hedges directly rather than their linearizations to nested words.
Definition 3. A stepwise hedge automaton (Sha) is a tuple where Σ and are finite sets, , and , where , , and . A Sha is deterministic or equivalently a dSha
if I and contain at most one element, and all relations and are partial functions.
The set of states subsumes a subset I of initial states, a subset F of final states, and a subset of tree initial states. The transition rules in have three forms: if , then we have a letter rule that we write as . If , then we have an apply rule that we write as . Lastly, if , then we have a tree initial rule that we denote as .
In
Figure 4, we illustrate the graph of a d
Sha for the XPath filter
. The graph of
Sha is drawn as usual for finite state automata on words. The states are the nodes of the graph; in the example,
. The transition rules define the edges. The edge of a letter rule
in
is annotated by the letter
. In the example,
. Any apply rule
in
is drawn as an edge
annotated by state
. The initial state
is indicated by an ingoing gray arrow, and the tree initial state
by an ingoing gray arrow that is annotated by symbol
. Final states, such as
, are indicated by a double circle.
4.1.1. Semantics
We next define the semantics of
Shas, i.e., the transitions on hedges that it defines and the hedge language that it accepts. For any hedge
we define the transition steps
such that for all
,
, and
The transitions can be used to evaluate hedges nondeterministically bottom-up and left-to-right using
Shas. The first three rules state how to evaluate sequences of trees and letters, such as a usual finite state automaton, while assuming that the trees were already evaluated to states. The last rule states how to evaluate a tree
from some state
q to some state
q′. This can be visualized as follows:
For any tree initial state
, one has to evaluate the subhedge
h nondeterministically to some state
. For each
obtained, one then returns to a state
such that
nondeterministically.
A hedge is accepted by
A if it permits a transition from some initial state to some final state. The language
is the set of all accepted hedges:
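The nondeterministic bottom-up and left-to-right evaluation can be sketched in Python as follows (the dictionary encoding of Shas and the pair encoding `("tree", h)` of trees are our own, not from the article):

```python
def eval_hedge(sha, states, hedge):
    """Evaluate a hedge with a Sha, bottom-up and left-to-right.
    sha = (letter, apply, tree_init, init, final), where letter maps
    (q, a) to sets of successor states, apply maps (q, p) to sets of
    successor states for a subtree evaluated to p, and tree_init is
    the set of tree initial states."""
    letter, apply_, tree_init, _, _ = sha
    for item in hedge:
        if isinstance(item, tuple) and item[0] == "tree":
            # evaluate the subhedge from the tree initial states ...
            ps = eval_hedge(sha, set(tree_init), item[1])
            # ... then resume with the apply rules q --p--> q'
            states = {q2 for q in states for p in ps
                      for q2 in apply_.get((q, p), set())}
        else:
            states = {q2 for q in states
                      for q2 in letter.get((q, item), set())}
    return states


def accepts(sha, hedge):
    """A hedge is accepted if some final state is reachable from an
    initial state."""
    _, _, _, init, final = sha
    return bool(eval_hedge(sha, set(init), hedge) & set(final))


# Toy Sha accepting exactly the one-tree hedge containing the letter a
# (our own example, not the automaton of Figure 4):
sha = ({("t0", "a"): {"t1"}},   # letter rules
       {("q0", "t1"): {"q1"}},  # apply rules
       {"t0"}, {"q0"}, {"q1"})  # tree initial, initial, final states
assert accepts(sha, (("tree", ("a",)),))
assert not accepts(sha, ("a",))
```

Evaluating with sets of states as above amounts to running the determinization of the automaton on the fly, which is how membership is tested in Section 4.1.4.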
4.1.2. Runs
Runs can represent the whole history of a single choice of the nondeterministic evaluator on a hedge. For instance, the unique successful run of the d
Sha in
Figure 4 on the hedge
with subhedge
is the hedge
whose computation is illustrated in
Figure 5. On the top level, the run justifies the transition
from the initial state 0 to a final state 4. When started in state 0, the subhedge
needs to be evaluated first, so the run on the upper level needs to be suspended until this is done. The subrun on
eventually justifies the transition
. At this time point, the run of the upper level can be resumed and continue in state 4 since
.
When drawing runs, we illustrate any suspension/resumption mechanism with a box, for instance, the box on the edge 0
4. The box stands for a future value computed by the lower level, causing the upper level to suspend until its arrival. Eventually, the box is filled by state 3, as illustrated by 3
, so that the computation can be resumed on the upper level.
We next define runs of
Shas on hedges formally. Whether a hedge with letters and states
is a
-run or simply a run on a hedge
, written
, is defined by the following rules:
Note that if
, then
h can be obtained by removing all states from
R, i.e.,
. Any run
can be identified with the mapping of prefixes of the nested word
to states in
. Note that
. For any prefix
v of
, there exists a unique prefix
r of the nested word
that ends with some state
and satisfies
. We then call
q the state of
R at prefix
v. The run
on hedge
from
Figure 5, for instance, assigns state 0 to the nested word prefixes
and 〈 and state 1 to the nested word prefix
, etc.
A -run is called a run of if it starts with some state in I. A run of A is successful if it ends with some state in F.
Lemma 4. iff there exists a successful run of A.
Proof. Straightforward with induction on h. □
4.1.3. Determinization
Shas can always be determinized using the subset construction known from the determinization of finite state automata on words or of tree automata.
Given a
Sha , the state set of its determinization is
. The transition rules
of the determinization are presented in
Figure 6.
4.1.4. In-Memory Membership Testing
Testing membership for a hedge h, whose graph is stored in memory, can be carried out by computing the unique run of the determinization of A in a bottom-up and left-to-right manner, and testing whether it is successful. Alternatively, one can compute the unique set such that with respect to the determinization of A and test whether . The computations are essentially the same, but the full history of the computation is lost when returning only the set P of states reached and not the run.
For determinizing
Shas we can rely on a variant of the usual subset construction that is well-known from finite state automata on words or standard tree automata (see e.g., [
8,
16]). When it comes to membership testing, on-the-fly determinization is sufficient, which creates only the part of the deterministic automaton that is used by the run on
h. This part is of linear size
while the full determinization of
A may be of exponential size
. In this way, membership
can be tested in time
.
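Membership testing with on-the-fly determinization can be sketched as follows: subset transitions are computed lazily and cached, so only the part of the determinized automaton visited by the run on h is ever built. The dict-based interface and the hedge encoding (a tree as a pair `("tree", h)`) are assumptions of this sketch.

```python
def det_membership(hedge, letter_rules, apply_rules, tree_init, init, final):
    """Deterministic membership test via on-the-fly subset construction:
    only the part of the determinization used by the run on this hedge
    is ever created, keeping the work linear in the size of the hedge."""
    cache = {}  # lazily built transitions of the determinization

    def run(P, h):
        for item in h:
            if isinstance(item, str):        # letter: subset step
                key = (P, item)
                if key not in cache:
                    cache[key] = frozenset(
                        q2 for q in P for q2 in letter_rules.get((q, item), ()))
            else:                            # tree: evaluate the subhedge first
                S = run(frozenset(tree_init), item[1])
                key = (P, S)
                if key not in cache:
                    cache[key] = frozenset(
                        q2 for q in P for p in S
                        for q2 in apply_rules.get((q, p), ()))
            P = cache[key]
        return P

    return bool(run(frozenset(init), hedge) & set(final))
```

The cache plays the role of the lazily constructed part of the determinization; the full subset automaton is never materialized.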
4.1.5. Hedge Accessibility
For any set
we define the set of states accessible from
P via a hedge as follows:
We note that
can be computed from
and
P in linear time in the size of
. A state is hedge accessible from
P if and only if it is accessible from
P in the graph of
. Alternatively, it can be computed as the least fixed point of the following ground Datalog program, depending on
P and
that is of linear size in
A:
Clearly,
if and only if
. Therefore, emptiness can be decided in linear time for
Shas. This is in contrast to the more general notion of
Sha↓s that we introduce next.
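The least-fixpoint computation behind such a ground Datalog program can be sketched in Python. The naive loop below is quadratic rather than linear (a worklist algorithm would make it linear in the automaton size), and the rule-dictionary encoding of the SHA is an assumption of the sketch.

```python
def hedge_acc(P, letter_rules, apply_rules, tree_init):
    """Least-fixpoint sketch of the set of states hedge-accessible from P.
    S collects states accessible from P over some hedge; T collects states
    accessible from a tree-initial state, needed to know which apply
    rules (q, p) -> q' can actually fire."""
    S = set(P)
    T = set(tree_init)
    changed = True
    while changed:
        changed = False
        for X in (S, T):
            for (q, a), qs in letter_rules.items():
                if q in X and not qs <= X:
                    X |= qs; changed = True
            for (q, p), qs in apply_rules.items():
                if q in X and p in T and not qs <= X:
                    X |= qs; changed = True
    return S
```

Emptiness then amounts to checking whether `hedge_acc(I, ...)` intersects the set of final states.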
4.2. Downward Stepwise Hedge Automata
The evaluation of Shas operates in a bottom-up and left-to-right manner, but cannot pass any information top-down. We next propose an extension of Shas to Sha↓s, adding the ability to pass finite state information top-down.
Sha↓s are basically equal to Neumann and Seidl’s pushdown forest automata [
25] but lifted from (labeled) forests to (unlabeled) hedges. See
Section 12 for a more detailed discussion on the relationship of
Sha↓s with other automata notions and
Nwas in particular.
Definition 4. A downward stepwise hedge automaton (Sha↓) is a tuple , where Σ and are finite sets, , and . Furthermore, , , and . A Sha↓ is deterministic or equivalently a dSha↓ if I contains at most one element, and all relations, , , and , are partial functions.
The only difference compared to Shas is the form of the tree-opening rules. If , then we have a tree initial rule that we denote as follows: .
So, state , where the evaluation of a subhedge starts, depends on the state q of the parent.
4.2.1. Semantics
The definition of transition steps for
Sha↓s differs from that of
Shas only in the usage of opening rules in the following inference rule:
This means that the evaluation of the subhedge
h starts in a state
p such that
.
So, the restart state
p that is chosen may now depend on the state
q above. This is how finite state information is passed top-down by
Sha↓s.
Shas, in contrast, operate purely bottom-up and left-to-right.
4.2.2. Runs
The notion of runs can be adapted straightforwardly from
Shas to
Sha↓s. When in state
q, it is sufficient to restart the computation in subhedges on the state
p, such that
. In this way, finite state information is passed down (while for
Shas, some tree initial state is to be chosen that is independent of
q). The only rule of runs to be changed is the following:
An example of a run of the d
Sha↓ in
Figure 7 is shown in
Figure 8. Here, the information on the level of nodes is passed down. It is represented by the number of primes. For instance, when opening the first subtree labeled with
item we use the transition rule
stating that one moves from state 1 on the first level to state 0 on the second level.
One can see that all nodes of the subtrees and are evaluated to the projection state . So, one can jump over these subtrees without reading them. This is justified by the fact that they are on level 2, the information that the evaluator of this Sha↓ was passing down.
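This jumping evaluation can be sketched as follows: the opening rule passes the current state downwards, and any subtree opened into a projection state is skipped without being read. The rule dicts, the explicit set `proj` of projection states, and the level-tracking toy automaton (with hypothetical projection state `"Π"`) are assumptions of this sketch.

```python
def eval_dsha_down(hedge, state, letter_rules, open_rules, apply_rules, proj):
    """Left-to-right evaluation of a deterministic SHA-down with subhedge
    projection: opening into a projection state skips the subtree unread."""
    for item in hedge:
        if isinstance(item, str):               # letter step
            state = letter_rules[(state, item)]
        else:                                   # tree <h>
            child = open_rules[state]           # state passed top-down
            if child in proj:
                p = child                       # jump over the whole subtree
            else:
                p = eval_dsha_down(item[1], child, letter_rules,
                                   open_rules, apply_rules, proj)
            state = apply_rules[(state, p)]     # resume with the returned state
    return state

# hypothetical dSHA-down tracking the nesting level: subtrees below the
# second level are irrelevant, so opening from level-2 state 1 goes to
# the projection state "Π" and the subtree is jumped over
OPEN = {0: 1, 1: "Π"}
LETTER = {(0, "a"): 0, (1, "a"): 1}
APPLY = {(0, 1): 0, (1, "Π"): 1}
```

Note that `LETTER` has no rules at all for state `"Π"`: the letters inside projected subtrees are never looked up.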
4.2.3. In-Memory Membership Testing
Testing membership
for a
Sha↓ in memory can again be carried out by computing the unique run of the determinization of
A. As before, the membership tester can be based on on-the-fly determinization. However, the determinization procedure for
Sha↓s is more complicated than for
Shas since
Sha↓s can pass information top-down and not only bottom-up and left-to-right. Determinization not only relies on the pure subset construction but also uses pairs of states in the subsets, basically in the same way as for nested word automata (
Nwas) [
14,
15]. This is needed to deal with the stack of states that were seen above. Therefore, each determinization step may cost time
as stated for instance in [
23]. The upper time bound for membership testing for
Sha↓s is thus in time
and no longer linear in the combined size of automaton and hedge, as it is for
Shas.
4.2.4. Relationship to Shas
Any
Sha can be mapped to a
Sha↓ while preserving runs and determinism. The only change of
compared to
is described by the following rule:
Independently of the current state
, the
Sha↓ can start the evaluation of the subhedge of a subtree in any open tree state
. For instance, the d
Sha↓ for d
Sha A from
Figure 4 from the introduction is provided in
Figure 9. Conversely, one can convert any d
Sha↓ A into an equivalent
Sha by introducing nondeterminism to eliminate top-down processing. See
Section 12.2 for the construction.
4.2.5. Hedge Accessibility
Hedge accessibility can be defined for
Sha↓s identically as for
Shas. However, the set
can no longer be computed in linear time in the size of
, in contrast to
Shas. It can still be done in cubic time, similarly as for
Nwas [
22]. The problem is that hedge accessibility for
Sha↓s no longer coincides with accessibility in the automaton’s graph.
For instance, in the
Sha↓ from
Figure 7, we have
. Note that
does not belong to this set, even though there is a transition rule
in
, making node
accessible from node 0 in the graph of the automaton. This is because
is not accessible over any hedge from 0; that is, there is no hedge
h such that
wrt
. Note that
is a factor of the nested word of some run of
. This shows that
is accessible from 0 over a nested word prefix nevertheless. For any
Sha↓ A, hedge accessibility can be computed by the following ground Datalog program of cubic size depending on the number of states of
A:
Due to the cubic time of hedge accessibility for
Sha↓s, we will use hedge accessibility only for
Shas where it is in linear time.
4.3. Schema Completeness
Schemas are inputs of the membership problem that we consider, while completeness is a fundamental property of automata. To combine both aspects appropriately, we propose the notion of schema completeness, which we define uniformly for both Shas and Sha↓s.
We call a set of transition rules Sha↓ complete if all its relations , , and are total. For Shas, we require that is a nonempty set (instead of a total relation). A Sha or Sha↓ A is called complete if is complete and . We can complete any Sha or Sha↓ by adding a fresh sink state together with transition rules into the sink for all left-hand sides for which no other transition rule exists. The sink state is made initial if and only if there was no initial state before. It is not final, though. We denote the completion of A by .
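The completion construction can be sketched directly: every missing left-hand side gets a transition into a fresh sink, the sink becomes initial only if there was no initial state, and it is never final. The dict-based encoding and the sink name are assumptions of this sketch.

```python
def complete_sha(states, alphabet, letter_rules, apply_rules, tree_init, init,
                 sink="sink"):
    """Completion of a SHA by a fresh sink state: missing left-hand sides
    of letter and apply rules are sent to the sink; the sink is made
    initial only when there was no initial state, and is never final."""
    Q = set(states) | {sink}
    lr = {k: set(v) for k, v in letter_rules.items()}
    ar = {k: set(v) for k, v in apply_rules.items()}
    for q in Q:
        for a in alphabet:
            lr.setdefault((q, a), {sink})   # complete the letter rules
        for p in Q:
            ar.setdefault((q, p), {sink})   # complete the apply rules
    return Q, lr, ar, set(tree_init) or {sink}, set(init) or {sink}
```

Existing transitions are kept unchanged; only the missing ones are redirected to the sink.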
Definition 5. A partial run of A on a hedge h is a prefix r of a run of on h, such that
A partial run r of A on h is called blocking if there does not exist any partial run of A on h, such that r is a strict prefix of .
For instance, consider the hedge
. The d
Sha in
Figure 4 has the unique run
on
h that is illustrated in
Figure 5. The partial runs of
h are thus all prefixes of
that end with some state, i.e., 0,
,
, etc. None of these partial runs are blocking.
Definition 6. A schema for an automaton A is a hedge language , such that . We call the automaton A schema-complete for schema if no hedge has a blocking partial run of A.
Schemas are usually used to restrict the type of hedges that a problem may accept as input. The dSha pattern-matching problem for a schema takes two inputs: a hedge and an automaton A, rather than a nested regular expression e. The automaton A then selects those input hedges that match the pattern, i.e., such that . For this reason, we assume that is a schema for A, i.e., . We will often assume that A is schema-complete for , so that no partial run of A on any input hedge may ever block. This can always be achieved, based on Lemma 5 below.
Example 5. The dSha in Figure 4 for the XPath filter has signature . It is schema-complete for schema . To make this happen, we added state 5
to this automaton, which is not used by any successful run. Still, this
Sha
is not complete. To complete it, we would have to turn 5 into a sink state by adding many transitions into it, such as and , and also add for all other states the transition rules , , , and . As we will see, automaton completion would make the safe-no-change algorithm incomplete for subhedge projection for Example 5. Without schema-completeness, however, safe-no-change projection may be unsound: it might overlook the rejection of hedges inside the schema. This is why it is much better to assume only schema-completeness rather than full completeness for the input automata of safe-no-change projection. For congruence projection, schema-completeness will be assumed implicitly.
We note that schema-completeness is well-defined even for nonregular schemas. For safe-no-change projection, we indeed do not have to restrict schemas to be regular, while for congruence projection, only regular schemas are considered.
Furthermore, if A is schema-complete for the universal schema and does not have inaccessible states, then A is complete.
Schema-completeness of deterministic automata for regular schemas can always be made to hold, as we will show next. For this, we define that A is compatible with schema if for any hedge there exists a run of A in . Schema-completeness implies compatibility, as shown by the next lemma. The converse is true only when assuming determinism.
Lemma 5. Let A be a deterministic
Sha
with schema . Then, A is compatible with iff A is schema-complete for .
Proof. Suppose that A is schema-complete for . W.l.o.g., we can assume . Note that if , then would be a blocking partial run for any hedge in . Since , it follows that there exists . So, is a partial run for any hedge . Since there exist no blocking partial runs by schema-completeness, this run can be extended step by step to a run on any . Therefore, for any hedge , there exists some run by A, showing compatibility.
For the converse, let A be compatible with and deterministic. We have to show that A is schema-complete for . Let v be a partial run of A on . By compatibility, there exists a run that starts in some initial state of A. By determinism, v must be a prefix of . Thus, v is not blocking. □
Proposition 1. Any
Sha
A with regular schema can be made schema-complete for .
Proof. By Lemma 5, it is sufficient to make A compatible with . Let B be a dSha with . Automaton B is compatible with . Since B is deterministic, Lemma 5 implies that B is schema-complete for . Since is a schema for A with . We then compute the completion of . The product with final states is schema-complete for schema . Furthermore, since , it follows that . □
Finally, note that the schema can be recognized by the same product except for replacing the set of final states by the set of schema final states . We have , where is the dSha obtained from C by replacing the set of final states with .
7. Congruence Projection
We present the congruence projection algorithm for detecting irrelevant subhedges for regular hedge patterns with regular schema restrictions. We prove that congruence projection is complete for subhedge projection, resolving the incompleteness of safe-no-change projection. For this, congruence projection can no longer ignore the schema, so we have to assume that the input schema is regular, too.
7.1. Ideas
The starting idea of congruence projection is to resolve the counter example for the completeness of safe-no-change projection in
Figure 13. There, a state is changed when reading an irrelevant subhedge. However, the state change moves to an equivalent state, so it is not really relevant. Therefore, the idea is to detect when a state always remains equivalent, rather than unchanged, when reading an arbitrary subhedge. This is eventually done by the congruence projection in
Figure 14 that we are going to construct.
The obvious question is as follows: which equivalence relation do we choose? Suppose that we want to test whether a hedge satisfying a regular schema matches a regular pattern L. In the restricted case of words without schema restrictions, the idea could be to use Myhill–Nerode’s congruence, but mapped to states. In the general case, however, the situation becomes more complex, given that the relation may fail to be an equivalence relation. So, it may not be a congruence, as already illustrated in Example 4. Furthermore, the treatment of the nesting of hedges will require updating the considered relation when moving down a hedge.
7.1.1. Difference Relations on States
The congruence projection algorithm will maintain a difference relation on states of the dSha, which is initially induced by the difference relation on hedges and updated whenever moving down a hedge.
Definition 11. Let be a dSha. A difference relation for
is a symmetric relation on states , such that for all ,The set of all difference relations for Δ
is denoted by . We call a subset compatible with a difference relation if . This means that no two states of may be different with respect to D.
Definition 12. Let be a Sha and D a difference relation for Δ
. A subset of states is subhedge irrelevant
for D wrt Δ if is compatible with D. The set of all subhedge irrelevant subsets of states for D wrt. Δ
thus is as follows:A state is subhedge irrelevant for D if the singleton is subhedge irrelevant for D. The set of all subhedge irrelevant states of is denoted by . We consider subhedge-irrelevance for subsets of states since the congruence projection algorithm has to eliminate the nondeterminism that it introduces by a kind of Sha determinization. Determinization is necessary in order to recognize subhedge irrelevant prefixes properly: all single states in a subset may be subhedge irrelevant while the whole subset is not.
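Definition 12 can be read as a direct check, sketched below. The accessibility function `acc` and the toy instance are assumptions of this sketch; the toy also illustrates the remark above, with both singletons being subhedge irrelevant while their union is not.

```python
def is_subhedge_irrelevant(P, D, acc):
    """A subset P is subhedge irrelevant for the difference relation D iff
    acc(P), the set of states hedge-accessible from P, is compatible
    with D, i.e., contains no pair of states that are different wrt D.
    `acc` is assumed given, e.g., a precomputed accessibility function."""
    A = acc(P)
    return not any((p, q) in D for p in A for q in A)

# hypothetical toy instance: every state only accesses itself,
# and states 1 and 2 are different
toy_acc = lambda P: set(P)
toy_D = {(1, 2), (2, 1)}
```

Here `{1}` and `{2}` are each subhedge irrelevant for `toy_D`, but `{1, 2}` is not, which is why the subsets produced by determinization must be tested as wholes.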
7.1.2. Least Difference Relations
For any binary relation , let be the least difference relation on states that contains R.
Lemma 8. .
Proof. The set clearly is a difference relation that contains R, and thus, contains the least such difference relation . Conversely, each pair in the above set must be contained in any difference relation containing R, and thus, in . □
Lemma 9. For any , the difference relation is the value of predicate D in the least fixed point of the ground Datalog program generated by the following inference rules:
Proof. The first inference rule guarantees that . The three latter inference rules characterize difference relations . The second and third rules state that differences in D are propagated backwards over hedges h. This is done recursively, treating letter hedges by the second rule and tree hedges by the third rule. The fourth rule expresses the symmetry of difference relations D. So, the least fixed point of the Datalog program generated by the above rules contains R and is the least relation that satisfies all axioms of difference relations, so it is . □
Proposition 4. For any , the least difference relation can be computed in time .
Proof. The size of the ground datalog program computing from Lemma 9 is at most quadratic in the size of A, so its least fixed point can be computed in time . □
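The ground Datalog computation of the least difference relation from Lemma 9 can be sketched as a closure loop. This sketch assumes deterministic rule dicts mapping (q, a) and (q, p) to single states, and that every state p occurring in an apply rule is hedge-accessible from the tree-initial states (as in a trimmed automaton).

```python
def least_difference_relation(R, letter_rules, apply_rules):
    """Least difference relation containing R: close R under symmetry and
    under backward propagation over letter rules (q, a) -> q' and apply
    rules (q, p) -> q', mirroring the inference rules of Lemma 9."""
    D = set()
    agenda = [tuple(x) for pair in R for x in (pair, pair[::-1])]  # symmetry
    while agenda:
        pair = agenda.pop()
        if pair in D:
            continue
        D.add(pair)
        q1p, q2p = pair
        # propagate the difference backwards over one letter or one tree
        for rules in (letter_rules, apply_rules):
            for (q1, x1), r1 in rules.items():
                if r1 != q1p:
                    continue
                for (q2, x2), r2 in rules.items():
                    if x1 == x2 and r2 == q2p:
                        agenda.append((q1, q2))
                        agenda.append((q2, q1))
    return D
```

For instance, with letter rules 0 -a-> 1 -a-> 2 and the seed difference (1, 2), the states 0 and 1 also become different, since reading the same letter a from them leads to the different states 1 and 2.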
7.2. Algorithm
We now define the congruence projection algorithm by a compiler from dShas and a set of schema final states to Sha↓s, which, when run on a hedge, can detect all subhedge irrelevant prefixes.
7.2.1. Inputs and Outputs
As inputs, the congruence projection algorithm is given a dSha for the regular pattern and a set with . The dSha defines the regular language while the regular schema is recognized by the dSha . So, the same base automaton defines the regular language and regular schema by choosing different sets of final states.
Example 10. In the example dSha
in Figure 4 for the XPath filter with schema , we have and . In the automaton for the counter example for safe-no-change projection in Figure 12, we have and . We note that, if
L and
were given by independent
Shas, then we can obtain a common d
Sha A and a set
as above by computing the product of the d
Shas for
L and
and completing it. As noticed in [
16], it may be more efficient to determinize the
Sha for
in a first step, build the product of the
Sha for
L with the d
Sha for
in a second, and determinize this product in the third step.
The congruence projection of A wrt. will maintain in its current configuration a subset of states and a difference relation on states in . Thereby, the congruence projection dSha↓ can always detect whether the current prefix is subhedge irrelevant for L wrt. schema by testing whether the current set of states is subhedge irrelevant for the current difference relation D.
7.2.2. Initial Difference Relation
The initial difference relation on states
is induced from the difference relation on prefixes
as follows:
The next lemma indicates how
can be defined from
A and
without reference to the languages
and
.
Lemma 10. .
Proof. For one direction let . Then, there exist nested words and an initial state , such that and . Since are nested words, hedge accessibility follows. Furthermore, requires the existence of a hedge , such that and . Hence, there are states and , such that and . Using Lemma 8, this implies that .
For the other direction, let . Using Lemma 8, property shows that there exist states , , and , such that and . From , it follows that there are nested words and an initial state , such that and . Hence, , so that . □
As a consequence of Lemma 10 and Proposition 4, the initial difference relation can be computed in time from A and .
Proposition 5 (Soundness of the initial difference relation).
Let be a complete dSha, , , and . For any hedge and state , such that wrt Δ for some , the nested word is subhedge irrelevant for L and if and only if q is subhedge irrelevant for wrt. Δ.
Proof. We show that is subhedge relevant for L and if and only if q is subhedge relevant for wrt. .
- “⇒”
Let
be subhedge relevant for
L and
. Then, there exist hedges
and a nested word
, such that
Since
A is deterministic and
wrt
for the unique
, it follows that there exist states
,
, and
, such that
From and , it follows that . Since is a difference relation, and , this implies that , too. Hence, , i.e., q, is relevant for wrt. .
- “⇐”
Let
q be subhedge relevant for
wrt.
. Then, there exists a pair
. So, there are hedges
and
, such that
Since
by Lemma 10, there exist a nested word
w and a pair
so that either
or
belongs to
and
Hence, in L and or vice versa. This shows that is relevant for L and .
□
7.2.3. Updating the Difference Relation
For any difference relation
and subset of states
, we define a binary relation
, such that for all states
:
For any subset of states
and difference relation
, let
be the least difference relation that contains
:
Lemma 11. The difference relation can be computed in time from , Δ, and D.
Proof. The binary relation can be computed in time from , D, and . The least difference relation can be computed by ground datalog in time from using Proposition 4. □
Example 11. Reconsider the dSha
for the XPath filter in Figure 4 with the set of schema-final states . Since , we have . The initial difference relation is , which is the symmetric closure of . The subhedge irrelevant states are since only state 0 can access two states in the difference relation , the final state in , and a nonfinal state that is schema-final in of some hedges. The difference relation is the symmetric closure of , which is equal to the difference relation . Hence, the projection states are since from states 0
and 1
one can still reach both 1
and 3
while . The difference relation is the symmetric closure of . Hence, the projection states are . From state 1
, one can only reach states 1
and 3
, which are not different for . Projection states for the initial difference relation contain all states that are safe for selection or safe for rejection with respect to the schema:
Lemma 12. All states in and are subhedge irrelevant for .
Given a d
Sha and a set
, we now construct the congruence projection
as the following d
Sha↓:
where the set of states, initial states, and final states are as follows:
The transition rules in
are provided by the inference rules in
Figure 15. To illustrate the construction and why determinization is needed, we reconsider Example 3, where schema restrictions have consequences that at first sight may be counter-intuitive.
Example 12 (Determinization during congruence projection is needed.).
Reconsider Example 3 with signature , pattern , and schema . This language can be defined by the dSha A in Figure 16. The schema can be defined by the same automaton but with the schema-final states and . The dSha produced by congruence projection is shown in Figure 17. The unique hedge of the language is accepted in state , where D0. The unique hedge in is rejected: after reading the tree , the automaton goes to state where it goes to a projecting sink when reading the subsequent letter a. We note that any hedge in is accepted in state by , even though only the single hedge belongs to L. This is sound given that the hedges in do not belong to schema anyway. We notice that cannot be a projection state since the run on hedge must continue to the sink ; therefore, state is subhedge relevant. Note that both individual states 3 and 4 are subhedge irrelevant for D0 while the subset with both states is subhedge relevant for D0 since D0. This shows that the deterministic construction is indeed needed to decide subhedge irrelevance for cases such as in this example. Also, notice that state 0 is subhedge irrelevant for D1 = ∅. This is fine since the acceptance of hedges of the form by does indeed not depend on h. In contrast, it depends on , so that the subset of states cannot be subhedge irrelevant for D0.
Finally, let us discuss how the transition rules of dSha↓ are inferred. The most interesting transition rule is in . It is produced by the following inference rule, where , , , , and :
This rule states that if P is subhedge irrelevant for , then all states accessible from P must be tried out, since all of them could be reached by some different subhedge that got projected away. As a result, due to subhedge projection, one cannot know into which state of one should go. In order to stay deterministic, we thus go into all possible states, i.e., into the subset . In the example, these are all the states that can be reached when reading some hedge in the pattern , given that the subhedges are not inspected with subhedge projection.
Example 13 (XPath filter
with schema
.).
The congruence projection of the example dSha in Figure 4 is given in Figure 18. This dSha is similar to the safe-no-change projection in Figure 7, except that the useless state is removed and the four projection states are now looping in themselves rather than going to a shared looping state Π
. It should also be noticed that only singleton state sets are used there, so no determinization is needed. As we will see, this is typical for experiments on regular XPath queries, where larger subsets rarely occur.
Example 14 (Counter-example for completeness of the safe-no-change projection.).
Reconsider the counter example for the completeness of the safe-no-change projection, i.e., the dSha in Figure 12 for the XPath query with schema final states . Its congruence projection is shown in Figure 14. We note that the prefix leads to the state {2}D3, which is a subhedge projection state since 2
is subhedge irrelevant for D3. In particular, note that so no two states accessible from 2
are different in D3. This means that state 2
may still be changed to 4
but then it does not become different with respect to D3. This resolves the incompleteness issue with the safe-no-change projection on this example. The first property for states assigned by the congruence projection is that is compatible with D. This means that no two states in are different with respect to D. Intuitively, each state of is as good as each other except for leading out of the schema. So, if is assigned to a prefix, and a suffix leads from some state in to F, then it cannot lead to from some other state in .
Lemma 13. If the partial run of assigns a state to a prefix, then is compatible with D.
We omit the proof since this instructive result will not be used directly later on. Still, compatibility will play an important role in the soundness proof.
7.3. Soundness
We next adapt the soundness result and proof from the safe-no-change projection to congruence projection.
Theorem 2 (Soundness of congruence projection.).
For any dSha and set , the congruence projection of A with respect to preserves the language of A within schema , i.e.,
We note that is a schema for A since . Furthermore, A is schema-complete for , since A is deterministic and .
Proof. The proof of the inclusion will be based on the following two claims:
Claim 2.1a. For all , and
Proof. We prove Claim 2.1a by induction on the structure of h. Suppose that .
- Case
. The induction hypothesis applied to
yields
. We can thus apply the following two inference rules:
in order to close the following diagram with respect to
:
This proves
wrt
, as required by the claim.
- Case
. Since
, we can apply the inference rule:
This proves this case of the claim.
- Case
. We trivially have wrt .
- Case
. By the induction hypothesis applied to and , we obtain the following: and wrt . Hence, wrt .
This ends the proof of Claim 2.1a. The next claim is the key of the soundness proof. □
For any difference relation
D, we define a binary relation
, such that, for any two subsets of states
,
Furthermore, note that
implies
, and thus,
.
Claim 2.2a. Let a hedge, subsets of states, and a difference relation. If , then there exists , such that and .
Proof. If , then by Claim 2.1a. Let . We then have and , and thus, so that () and () of . So, it is sufficient to prove the claim under the assumption that . The proof is by induction on the structure of h.
- Case
. The assumption shows that there exists a subset of states closing the following diagram:
In particular, we have
in
. Let
. Since
, we can infer the following:
We have to distinguish two cases depending on whether
belongs to
or not.
- Subcase
. The induction hypothesis applied to
yields the existence of
, such that
and
- Subsubcase
. Since
, it follows that
. Hence,
. Let
be such that
. We can then apply the following inference rule:
Hence, we can close the diagram as follows:
This shows that
. Since
,
, and
, it follows that
, and thus,
. This shows the claim in this case.
- Subsubcase
. Since
, it follows that
and
. Hence, we can apply the following inference rule for some
:
Hence, we can close the diagram as follows:
This shows that
. Since
, it follows from
and
in
that
. Thus,
, so the claim holds in this case too.
- Subcase
. Claim 2.1a shows that
wrt
. Let
be such that
. We can apply the following inference rule:
We can thus close the diagram below as follows with respect to
:
This shows that
. Since
, it follows from
and
in
that
, and thus,
.
- Case
. Since
, we can apply the inference rule:
Hence,
. With
, the claim follows since
.
- Case
. In this case, we have and , so the claim holds.
- Case
. Since , there exists , such that and . By the induction hypothesis applied to there exists , such that and .
- Subcase
. Since , it follows that and . Furthermore, Claim 2.1a shows that , and hence, wrt . Since and , it follows that . Hence, and .
- Subcase
. In this case, we can apply the induction hypothesis to , showing that there exists , such that and . Hence, and .
This ends the proof of Claim 2.2a. □
Proof of Inclusion . Since is a schema for A, we have , so that it is sufficient to show . Let . Then, there exist and , such that wrt . Let . Since A is deterministic, it follows that wrt . Using Claim 2.2a, this implies the existence of a subset of states , such that and wrt . Furthermore, . In order to prove , it is thus sufficient to show that . We distinguish two cases:
- Case
. Condition () of , shows that so that . Since , it follows that . Thus, , so that .
- Case
. Condition () of yields . Hence, so that , and thus, .
This ends the proof of the first inclusion. □
We next want to show the inverse inclusion . It will eventually follow from the next two claims.
Claim 2.1b. For any hedge , difference relation , projection state and state : if , then .
Proof. By induction on the structure of , suppose that :
- Case
. There must exist states
closing the following diagram:
Since
, the following rule must have been applied to infer
:
Therefore,
. The induction hypothesis applied to
shows that
, too. So,
must have been inferred by applying the rule:
Hence,
, as required.
- Case
. The following rule must be applied:
Hence,
.
- Case
. Obvious.
- Case
. There must exist some , such that wrt . Using the induction hypothesis applied to , we have . We can thus apply the induction hypothesis to to obtain .
This ends the proof of Claim 2.1b. □
We next need an inverse of Claim 2.2a. For relating runs
back to runs of
A, we define for any difference relation
D another binary relation
, such that for any two subsets of states
,
Claim 2.2b.
Let be a subset of states that is compatible with a difference relation . For any hedge and state with there exists a pair of subsets of states , such that and wrt. .
Proof. If , then Claim 2.1b shows that . Let and be the unique subset of states, such that wrt. . We have to show that . Since , we have , so condition () holds. Condition () holds trivially since . For condition (), note that implies . Furthermore, , so that as required.
We can thus assume that . The proof is by induction on the structure of . We distinguish all the possible forms of hedges :
- Case
. By definition of
, there must exist
, such that the following diagram can be closed:
Since
, the following inference rule was applied to infer
wrt
, where
:
Hence, . If , then the induction hypothesis applied to wrt shows that there exists , such that and . Otherwise, the same can be concluded as we argued at the beginning.
- Subcase
. The transition rule
must be inferred as follows for
:
This shows that
. So, we have the following diagram:
Let
be the unique subset of states, such that
wrt
. We can then close the following diagram:
From
, it follows
, and thus,
, i.e., condition (
) of
. Since
, condition (
) of
yields
. Furthermore,
and
in
, which yields
so that conditions (
) and (
) of
follow. Hence,
. In summary, we show that
,
, and
, as required by the claim.
- Subcase
. The transition rule
must thus be inferred as follows for
:
This shows that
, and we can close the following diagram:
Condition (
) of
yields
. Let
be the unique subset of states, such that
wrt.
. Since
and
, it follows from
and
wrt.
that
. That is, condition (
) of
. Since
, condition (
) of
is
. From
and
, it thus follows that
. Hence, conditions (
) and (
) of
are valid, so that
holds. Furthermore,
This shows that
wrt
. The other two requirements of the claim,
and
, were shown earlier.
- Case
. Since
, the following inference rule must be used:
So
,
. Furthermore, since
is compatible with
D, and
D is a difference relation,
is compatible with
D too. Hence,
, showing condition (
) of
. Trivially,
so conditions (
) and (
) of
hold, too. Hence,
.
- Case
. Obvious.
- Case
. Since wrt. , there exists some , such that and wrt . We can apply the induction hypothesis to . Hence, there exist subsets of states , such that , wrt , and . We distinguish two cases:
- Subcase
. Since wrt , Claim 2.1b shows that so that wrt. . Condition () of and imply that . Let be the unique subset of states, such that wrt. . Then, so that , and thus, condition () of holds. Condition () of is trivial since . Since and , it follows that so condition () of holds, too. Hence, wrt and .
- Subcase
. Since and is compatible with difference relation D, it follows that is compatible with D, too. We can thus apply the induction hypothesis to , showing the existence of subsets of states , such that and wrt . So, we have wrt. and , as required.
This ends the proof of Claim 2.2b. □
Proof of Inclusion . Let . Then, there exists a final subset of states , such that wrt . Since I is a singleton or empty, it is compatible with . Claim 2.2b shows that there exists a subset of states , such that wrt. and . Condition () of shows that . Since , it follows that . The determinism of A shows that is a singleton. So, there exists a state , such that .
- Case
. Condition () shows that so that . Since , we have . Since is compatible with , , and , it follows that . Since , this implies , and thus, .
- Case
. Condition () shows that so that . Since , we have . Let be arbitrary. Since and , it follows that . Furthermore, so that . In combination with , this implies so .
This ends the proof of the inverse inclusion, and thus, of . □
7.4. Completeness
We next show the completeness of congruence projection for subhedge projection according to Definition 10. Let be a complete dSha and . Automaton A defines the regular pattern with the schema-final states in the regular schema . We first show that all subhedge irrelevant states of difference relations are subhedge projection states according to Definition 7.
Lemma 14. If , then is a subhedge projection state of with witness .
Proof. We assume that
and have to show that
is a subhedge projection state of
. We have to show that all transition rules starting with
are permitted for a subhedge projection state
with
. They are generated by the following rules, all of which are looping:
□
So, if a partial run of on a prefix u assigns some state with , then is a subhedge projection state by Lemma 14, and thus, subhedge irrelevant by Proposition 2.
Lemma 15. Let , , and . If wrt for some nested word and either
and , or
and ,
then there exists a nested word and , such that wrt Δ.
Proof. By induction on the length of u, suppose that wrt . In the base case, we have . Hence, .
- Case
and . Then, . The unique run of on starts and ends in the tree initial state .
- Case
and . Since , Lemma 14 shows that is a subhedge projection state, so that is subhedge irrelevant for and by Proposition 2. Hence, . Furthermore, since , there exists a hedge h and , such that wrt . Let . The run of A on ends in q and .
For the induction step, we distinguish the possible forms of the nested word u. So, there exist and , such that either of the following cases holds:
- Case
. Let be the state in which the run of on ends.
- Subcase
. Let . Then, there exist , such that wrt . So, the run of on v goes to state .
- Subsubcase
. By construction of , we then have . Since , there exist and , such that in . By induction hypothesis, there exists , such that the run of on ends in p. Hence, the run of A on ends in q, and furthermore, .
- Subsubcase
. By construction of , we then have . Since , there exist and , such that in . By induction hypothesis, there exists , such that the run of on ends in p. Hence, the run of A on ends in q, and furthermore, .
- Subcase
. Then, and . Using the induction hypothesis applied to , there exists , such that . Since is irrelevant for L and , we have .
- Case
. This case is simpler than the previous ones and is thus omitted.
□
Lemma 16. Let a partial run of assign a state to a prefix . If and , then there exists a prefix , such that a partial run of A assigns state q to .
Proof. By induction on the length of u. In the base case, we have . The partial run of on u assigns state , where and . Suppose that and . Then, . The unique partial run of A on ends in the initial state . For the induction step, let , , and . We distinguish the possible forms of u. So, , , and exist, such that either of the following hold:
- Case
. Let be the state in which the partial run of on ends. From , it follows that . Furthermore, wrt . So, the partial run of on v goes to state . Using the induction hypothesis, there exists , such that the partial run of on ends in q. Hence, the partial run of A on ends in q, and furthermore, .
- Case
. Let be the state in which the partial run of on ends. From , it follows that . Furthermore, wrt for some .
- Subcase
. Then, . So, there exists and , such that in . Using Lemma 15, there exists , such that wrt . Using the induction hypothesis applied to , there exists an initial state and a prefix , such that the partial run of A on goes on to state . Hence, the partial run of A on goes on to q and .
- Subcase
. Then, . So, there exists and , such that in . Using Lemma 15, there exists , such that wrt . Using the induction hypothesis applied to , there exists an initial state and a prefix , such that the partial run of A on goes on to state . Hence, the partial run of A on goes on to q and .
- Case
. This case is simpler than the previous one and is thus omitted.
□
The next lemma states the key invariant of congruence projection, which eventually proves its completeness for subhedge projection. For this, we define for any
the binary relation
as the symmetric closure of
, i.e., for all
:
Lemma 17 (Key Invariant).
Let be a dSha, , , and . If has a partial run on prefix to state , then for all , , and , such that there exist , , , and such that Proof. Induction on the number of dangling opening parentheses of u.
In the base case,
u does not have any dangling opening parenthesis so
is a nested word. In this case,
. So,
has a partial run on nested word
to
, where
. Let
,
, and
, such that
It then follows that
. Therefore, we can apply Lemma 16. It shows that there exist nested words
and
, such that
Since
by Lemma 10, there exist a nested word
and states
, such that
Then, we have the following:
For the induction step, let
for some prefix
and nested word
so that
has one open dangling bracket less than
u. Let
have a partial run on nested word
to
. Let
be the state that this run assigns to prefix
. Then,
. Let
,
, and
, such that
Since
, there exist
and states
, such that
and
. In particular,
so that we can apply Lemma 16. It shows that there exist
, such that
and
. Let
and
. Then,
,
wrt
. Using the induction hypothesis applied to
, on which
has a partial run to state
, such that
,
,
, and
, there exist
,
,
, and
, such that
Let
,
, and
. The above then yields the following:
Furthermore,
, so this was to be shown. □
Proposition 6. Let a dSha, , , and . If has a partial run on prefix to some state so that , then is subhedge relevant for L wrt .
Proof. Suppose that
has a partial run on prefix
to some state
so that
. Since
, there exist
. So, there exist
,
, and
, such that
Using Lemma 17, there exist
,
,
, and
, such that
Hence,
is relevant for
L wrt
. □
Theorem 3 (Completeness of congruence projection).
For any complete dSha and set , the congruence projection is sound and complete for subhedge projection for the regular pattern wrt the regular schema .
Proof. The soundness of congruence projection is shown in Theorem 2. For proving completeness, let be a nested word prefix that is strongly subhedge irrelevant for L wrt . By definition of strong irrelevance, the class is irrelevant for L and . Let be the unique state assigned by the partial run of to prefix u. Since is irrelevant for L and , Proposition 6 shows that . Using Lemma 14, is then a subhedge projection state of . □
7.5. In-Memory Complexity
We next discuss the complexity of membership testing with complete subhedge projection based on the in-memory evaluation of the input hedge by the congruence projection of the input automaton.
Lemma 18. The number of states is in , where n is the number of states of A.
Proof. With the deterministic construction, the states of are pairs in . So, the maximal number of states of the congruence projection is . □
Let be the set of difference relations used in the states of .
Corollary 1. Let the signature Σ be fixed. For any complete dSha
A with schema , the membership of any hedge can be tested in memory in time per nonprojected node of h after a preprocessing time of , where d is the cardinality of and n is the number of states of A.
Since is in by Lemma 18, the preprocessing time is in , too.
Proof. Since A is complete, Theorem 3 shows that is complete for subhedge projection for schema . The in-memory evaluator of can be run on any input hedge in time per nonprojected node of h after a preprocessing time of , where d is the cardinality of . Since the signature is fixed and deterministic, is in . The dominating part is in . □
When applied to regular XPath query answering as in our experiments in
Section 11, we have
and
so the cubic part
is not limiting. The sizes of the
Sha↓s are below 2600 counting both states and rules. The upper bound
thus suggests that the preprocessing time may be feasible in practice, even though exponential in the worst case.
We must be careful here since the sizes of the
Sha↓s reported in
Section 11 do not refer to the pure congruence projection
but to the earliest congruence projection
that we will introduce only in
Section 8. The noncreation of transition rules to states that are safe for rejection may lead to a radical size reduction. Alternatively, an important size reduction in
could be obtained by removing all useless states and transition rules. But this would not reduce the precomputation time, just the opposite: it would increase the preprocessing time since cleaning automata a posteriori also needs time, which for
Sha↓s may be nonlinear.
Instead of precomputing from the inputs A and statically, we can compute on the fly only the needed part of for evaluating the input hedge h. We note that the number of transition rules of the needed part is bounded by but may be much smaller due to projection. The preprocessing time then goes down to . Since the full automaton may be exponential in n in the worst case, this may reduce the overall processing time.
On the positive side, on-the-fly evaluation reduces the overall computation time to , where is the number of difference relations in the part of needed for the evaluation of h, so . On the downside, steps will no longer take constant time but cubic time in . It should also be noted that the overall time of the safe-no-change projection on-the-fly evaluation is in , where is the number of subsets of safe states of A in states of , i.e., . This is because each subset of safe states can be computed in linear time in . As argued above, the overall time of the congruence projection obtained by on-the-fly evaluation is in . This is because the computation of a difference relation is in by using ground Datalog programs of cubic size. The additional cubic factor in instead of a linear factor in is the price to be paid for achieving complete subhedge projection with on-the-fly evaluation. Note that this price is largely acceptable in cases where , , and .
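The on-the-fly construction discussed above can be mimicked generically by memoizing transitions as they are first needed, so that only the part of the automaton reachable on the actual input is ever materialized. The following Python sketch is our illustration of this idea; the class name and the `delta` interface are assumptions, not the paper's implementation:

```python
class LazyAutomaton:
    """Computes transitions on demand via a user-supplied delta function
    and caches them, so only the part of the automaton that is actually
    reached on the input is ever constructed."""

    def __init__(self, delta, q0):
        self.delta = delta      # delta(q, a) -> q', possibly expensive
        self.q0 = q0
        self.cache = {}         # memo table: (q, a) -> q'

    def step(self, q, a):
        key = (q, a)
        if key not in self.cache:
            self.cache[key] = self.delta(q, a)   # computed once, reused
        return self.cache[key]

    def run(self, word):
        q = self.q0
        for a in word:
            q = self.step(q, a)
        return q
```

Repeated runs over inputs that reuse the same transitions never recompute them; the cache size is bounded by the number of distinct (state, letter) pairs actually encountered.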
Example 15 (Exponential blow-up).
In order to see how the exponential worst case may happen, we consider a family of regular languages, for which the minimal left-to-right Dfa is exponentially bigger than the minimal right-to-left Dfa. The classical example languages with this property are , where and . Intuitively, a word in belongs to if and only if its -th letter from the end is equal to . The minimal left-to-right Dfa for has many states since it needs to memorize a window of letters. In contrast, its minimal right-to-left Dfa has only states; in this direction, it is sufficient to memorize the distance from the end modulo .
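The blow-up can be observed concretely by determinizing the natural NFA for "the k-th letter from the end is an a" with the subset construction. The following Python sketch is our own illustration (not part of the construction in the text); it counts the reachable deterministic states, which come out to 2^k:

```python
def det_states(k):
    # NFA for "k-th letter from the end is 'a'" over {a, b}:
    # q0 loops on a and b; q0 --a--> q1; q_i --a,b--> q_{i+1}; q_k accepting.
    def step(S, c):
        T = set()
        for q in S:
            if q == 0:
                T.add(0)             # q0 loops on every letter
                if c == 'a':
                    T.add(1)         # guess: this 'a' is k-th from the end
            elif q < k:
                T.add(q + 1)         # advance the counting chain
        return frozenset(T)

    # subset construction: explore all reachable deterministic states
    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:
        S = stack.pop()
        for c in 'ab':
            T = step(S, c)
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return len(seen)
```

Each reachable subset encodes which of the last k letters were an a, so the count doubles with every increment of k, matching the exponential lower bound on the left-to-right Dfa.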
We next consider the schema defined by . It contains all hedges having in all its subtrees a label from as the first letter, and no further occurrences of letters of Σ elsewhere. We then consider the family of hedge languages with schema , such that the sequence of labels of some root-to-leaf path of belongs to . Note that can be recognized in a bottom-up manner by the dSha with states, which simulates the minimal deterministic
Dfa
of on all paths of the input hedge. For an evaluator with subhedge projection, the situation is different. When moving top-down, it needs to memorize the sequence of labels of the -last ancestors, possibly filled with , and there are of such sequences. If, for some leaf, its sequence starts with an “a”, then the subsequent subhedges with the subsequent leaves can be projected away. As a consequence, there cannot be any dSha↓ recognizing that projects away all irrelevant subhedges with fewer than states. In particular, the size of must be exponential in the size of .
9. Streaming Algorithms
We show that dSha↓s can be evaluated on the nested word of a hedge in a streaming manner, so that they produce the same results and projection behavior as by evaluating them on hedges in-memory in a top-down and left-to-right manner.
All aspects related to subhedge projection remain unchanged. Earliest query answering in streaming mode means that nested word suffixes are ignored if irrelevant, and not only nested words as with subhedge projection. What changes between both evaluation modes is memory consumption. For in-memory evaluation, the whole graph of the input hedge must be stored, while in streaming mode, only a stack and a state are maintained at any event. For testing regular language membership, the state space is finite, while the stack space becomes infinite, since the depth of the stack depends on the depth of the input hedge.
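The configuration maintained in streaming mode, a state paired with a stack of parent states, can be sketched as follows in Python. The encoding of the Sha↓ transitions as three maps is our assumption for illustration, not the paper's concrete data structures:

```python
def run_stream(letter, open_, close, q0, events):
    """Streaming evaluation over a nested-word stream.
    letter[(q, a)] -> q'   internal letter a read in state q
    open_[q]       -> q'   state entered below an opening bracket '<'
    close[(p, q)]  -> q'   combine parent state p with child result q at '>'
    The configuration is the pair (q, stack)."""
    q, stack = q0, []
    for e in events:
        if e == '<':              # opening bracket: push parent state, descend
            stack.append(q)
            q = open_[q]
        elif e == '>':            # closing bracket: pop parent, apply close rule
            p = stack.pop()
            q = close[(p, q)]
        else:                     # internal letter
            q = letter[(q, e)]
    assert not stack, "input was not a well-nested word"
    return q

# Tiny example automaton: state 1 records that an 'a' occurred at the top
# level; subtrees are entered in a fresh state 0 and their result discarded.
letter = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 1}
open_  = {0: 0, 1: 0}
close  = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}
```

As stated in the text, the state space is finite while the stack grows with the depth of the input, so memory is bounded by the nesting depth rather than the stream length.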
The property of having equivalent streaming and in-memory evaluators was already noticed in [
26] for Neumann and Seidl’s pushdown forest automata [
25] and for nested word automata [
11,
15]. We here show how to transpose this result to d
Sha↓s. This permits us to show the correctness of our evaluation algorithms based on properties of the constructed automata. It also permits obtaining correctness properties of streaming algorithms by reasoning about the in-memory evaluator of
Sha↓s.
9.1. Streaming Evaluators for Downward
Hedge Automata
We define the streaming evaluator of a
Sha↓ by a visibly pushdown machine. The set of configurations of this machine is
, containing pairs of states and stacks of states. The visibly pushdown machine provides for any word
a streaming transition relation
, such that for all states
and stacks
and words
:
We note that the same visibly pushdown machine can be obtained by compiling the
Sha↓ to an
Nwa, whose streaming evaluator is defined by a visibly pushdown machine too (see
Section 12.3 on related work).
The notion of partial runs of the in-memory evaluator is closely related to streaming runs. To see this, we first note that each partial run
v wrt
naturally defines a stack as follows. Since
v is a nested word prefix, there exist unique nested words
, such that
These nested words must be partial runs of
too, so there exist states
, such that
goes from
to
for all
. We define the stack
of
v by
and say that
v goes from
to
and has stack
. Note that the stack is empty if
v is a nested word, since in this case
so that
and
. For instance, the partial run
has the stack
while the partial run
has stack
.
Lemma 20. Consider a Sha
, states , and a stack . Let v be a prefix of some nested word in . Then, there exists a partial run r wrt Δ from q to with stack σ such that iff wrt Δ.
Proof. This is carried out by induction on the length of prefix v. □
This lemma implies that any in-memory transition of A on a hedge h from q to can be identified with some streaming transition of A on from to .
Proposition 8. .
Proof. If , there exists a run R of with that starts with q and ends with . Consider the partial run on h. Since v is a nested word, its stack is empty. By Lemma 20, we have wrt and . For the inverse implication, assume that . Let . By Lemma 20, there exists a partial run v of from q to with stack . Then, there exists a run R of on h such that . Hence, . □
For any deterministic dSha↓ A, hedge h, and state q, the streaming transition on with respect to starting with can be computed in time per letter after a precomputation in time . So, the overall computation time is in . The streaming memory needed to store a configuration is of size since the size of the visible stack is bounded by the depth of the input hedge.
9.2. Adding Subhedge Projection
We next extend the streaming transition relation with subhedge projection, in analogy to what we did for the in-memory transition relation. We define a transition relation with subhedge projection
with respect to
, such that for all words
, letters
, states
, and stacks
:
The projecting transition relation stays in a configuration with a subhedge projection state until the end of the current subhedge is reached. This is correct since a subhedge projection state cannot be changed anyway.
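This skipping behavior can be sketched as follows, again with a hypothetical encoding of the transitions as maps (our names). While the current state is a subhedge projection state, letters are ignored (all transitions loop there) and opened subhedges are skipped by mere bracket-depth counting:

```python
def run_stream_projecting(letter, open_, close, proj, q0, events):
    """Streaming evaluation with subhedge projection.
    proj is the set of subhedge projection states."""
    q, stack, skip = q0, [], 0
    for e in events:
        if skip:                      # inside a projected subhedge:
            if e == '<':              # only track the bracket depth
                skip += 1
            elif e == '>':
                skip -= 1
            continue
        if q in proj and e != '>':    # projection state: transitions loop,
            if e == '<':              # so skip letters and whole subhedges
                skip = 1
            continue
        if e == '<':                  # otherwise evaluate as usual
            stack.append(q)
            q = open_[q]
        elif e == '>':
            p = stack.pop()
            q = close[(p, q)]
        else:
            q = letter[(q, e)]
    return q
```

In the example below, state 9 is a projection state entered below the opening bracket; everything until the matching close is skipped without any transition lookup, and only the close rule is applied.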
Proposition 9. For any Sha↓ , states , nested word v on which Δ
has no blocking partial run starting from q: Proof. Let be the hedge such that . The proof is by induction on the size of v. For the implication from left to right, we assume that . Since , Proposition 8 proves wrt .
- Case
. By Lemma 6, it follows from
wrt
that
. We can then apply the following rule since
v is a nested word:
Hence,
.
- Case
. We distinguish all possible forms of hedge h.
- Subcase
. Let
and
so
. The following rule must have been applied:
Since
is a nested word, it follows that
. By the induction hypothesis applied to the nested words
and
, we get
and
. Hence, we can apply the following rule:
This shows
as required.
- Subcases
or or . Straightforward.
For the implication from right to left, we assume that .
- Case
. Then, the following rule must have been applied with
:
Since we assume that
v does not have any blocking partial runs with
, there exists a partial run on
v with
that starts with
q. Since
v is a nested word, any partial run on
v is the nested word of some run, so there exists a run on the hedge
h that starts with
q. This shows the existence of some state
p such that
wrt
. Since
, Lemma 6 shows
, and thus,
wrt
. Proposition 8 then proves that
wrt
, which is equal to
wrt
, as required.
- Case
. We distinguish all possible forms of the hedge h.
- Subcase
. Let
and
so
. The following rule must have been applied:
Since
is a nested word, it follows that
too. By the induction hypothesis applied to the nested words
and
, we get
and
. Hence, we can apply the following rule:
This shows
, as required.
- Subcases
or or . Straightforward.
This ends the proofs of both directions. □
A streaming evaluator with subhedge projection for deterministic Sha↓s A on hedges h can thus be obtained by computing the streaming transition relation with subhedge projection for A of starting with the initial configuration. This costs time at most per letter of , i.e., constant time per event of the stream.
9.3. Streaming Complexity of Earliest Membership
with Subhedge Projection
Using streaming evaluators for Sha↓s, we can test earliest membership with subhedge projection in streaming mode by running the Sha↓ .
Corollary 3. For any complete dShas , , language , and schema , earliest membership with complete subhedge projection for L wrt can be tested in streaming mode for any hedge in time per nonprojected letter of after a preprocessing time of , where d is the number of difference relations and s the number of safe subsets in states of . The space required is in .
Proof. By Theorem 4, the dSha↓ is earliest and sound and complete for subhedge projection. By Propositions 8 and 9, we can thus obtain a streaming membership tester with subhedge projection for a hedge by running the streaming evaluator with subhedge projection of on the nested word . Since this Sha↓ is deterministic, the streaming evaluation can be done in time per letter of after a preprocessing time of . □
As for safe-no-change projection, the exponential preprocessing time can again be avoided by creating the part of the Sha↓ needed for the evaluation of the hedge on the fly. The size of the needed part is in . Hence, the space requirements can also be reduced to , which may be too much for large streams . The needed part of , however, may be much smaller than . Furthermore, a stack of size has to be maintained but this dependence on h may be acceptable even for large streams .
10. Monadic Regular Queries and XPath
We next consider the problem of how to answer monadic regular queries on hedges in an earliest manner and with subhedge projection. Each monadic query defines a set of positions for each input hedge. The XPath query self::list[child::item], for instance, can be applied to the hedge , where it selects the position of the first top-most tree labeled by since it contains a child tree labeled by . We identify positions of hedges by integers. For instance, the selected element in the hedge above is identified by the integer 2. Therefore, the answer set of the above query on the above hedge is . Note that monadic queries do not return any other information about the selected positions, in contrast to the subtree semantics of XPath in the official W3C standard.
Answering this query is more complicated than verifying whether the XPath filter [self::list/child::item] matches at the root of the first subtree of the hedge, since the answer set needs to be constructed, too. Concerning memory consumption, alive answer candidates need to be buffered for monadic query answering. Therefore, the memory consumption of streaming evaluators of monadic queries may grow linearly in the size of the stream if the number of alive candidates does.
10.1. Monadic Regular Queries
Monadic queries on hedges in can be identified with languages of annotated hedges in , where x is an arbitrarily fixed selection variable. A monadic query on hedges in can then be identified with the language of all annotated hedges in , in which a single position is x-annotated and this position is selected by the query.
Therefore, regular monadic queries on hedges in
can be defined by nested regular expressions in
or by a d
Sha with the same signature. The regular XPath query
self::list[child::item], for instance, can be defined by the following nested regular expression:
where ⊤ is like a wildcard accepting any hedge in
without
x, i.e., for some recursion variable
:
Note that the nested regular expression for the XPath filter
[self::list/child::item] in
can be obtained from that of the XPath query
self::list[child::item] by removing the
x. In addition, for any monadic query on hedges, we have to restrict the annotated hedges to be recognized by the language of the query such that exactly one position in each hedge is annotated by
x. This can be done by intersection with the following nested regular expression
:
The previous schema
for pattern matching has now to be changed from to
, where
is like
but over the signature
. For the regular XPath query
self::list[child::item] we obtain the d
Sha in
Figure 22. The transition rules with the special letter
x indicate, where positions may be bound to the selection variable:
x can be bound only in state 2, that is in subtrees starting with a
element. The same automaton can be used to define the schema
by using the schema-final states
instead of the final states
. Indeed, we constructed the automaton by intersecting the automaton obtained from the XPath expression with the automata computed from the regular expressions
and
for the schema. Note that 20 is a final state for the query (and the schema), while 14 is a sink state for the query (but not for the schema).
10.2. Earliest Query Answering with Subhedge Projection
A naive query evaluation algorithm for a hedge automaton with signature receives an input hedge , guesses all possible positions of where to insert x into h, and tests for all of them whether the annotated hedge obtained is accepted by A. In a less naive manner, one inserts x only at those positions where the current state has an outgoing x-transition. All this can be improved if the hedge automaton under consideration supports selection, rejection, or projection states. In all cases, the algorithm has to buffer all binding candidates that are not yet in a selection or rejection state. In the worst case, it has to buffer the bindings of all positions. Therefore, the state space of the evaluator of monadic queries is no longer finite.
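The naive algorithm can be sketched over words, simplifying hedges to flat sequences; all names and the dictionary encoding of the automaton are ours. For each position, a candidate run on the x-annotated input is attempted, and the position is selected iff that run accepts:

```python
def answer_monadic(delta, q0, final, word):
    """Naive monadic query answering over a word (a flat simplification
    of hedges): at each position where the current state can bind x, fork
    a candidate run on the x-annotated word; a position is selected iff
    its candidate run ends in a final state."""
    answers = []
    for i in range(len(word)):
        q = q0
        for a in word[:i]:            # run on the un-annotated prefix
            q = delta[(q, a)]
        if (q, 'x') not in delta:
            continue                  # x cannot be bound at this position
        q = delta[(q, 'x')]           # bind the selection variable here
        for a in word[i:]:            # continue on the annotated suffix
            q = delta[(q, a)]
        if q in final:
            answers.append(i + 1)     # positions are 1-based integers
    return answers
```

The quadratic re-running of prefixes is exactly what the less naive candidate-buffering evaluator avoids; the sketch only illustrates the semantics of selecting positions via x-annotations.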
Selection and rejection states can be identified by adapting the earliest query answering algorithm from [
21]. This can be made such that one obtains an
Sha↓ with signature
depending on the schema. It should be noticed that automata used in [
21] are d
Nwas instead of
Sha↓s, and that schemas are ignored there. The needed adaptations are therefore discussed in
Appendix A. How to obtain an efficient query answering algorithm in streaming mode from
is the main topic of [
21]. This is the base algorithm that we will extend with subhedge projection for our experiments.
In order to add congruence projection to the earliest query answering algorithm, we have to run the automaton
. The earliest congruence projection for the XPath query
self::list[child::item] is shown in
Figure 23. It binds
x in state
2D1 after having read a top-level
-element and then checks for the existence of a child element
, going into selection state
1D3 once it is found. All other children of the top-level
element are projected in state
2D3. For adding earliest query answering to self-no-change projection, we have to run the automaton
instead of
. We do not give the details, but claim for the present query that the projection power of both automata is the same.
Fortunately, the soundness and completeness theorems for earliest membership testing with subhedge projection only affect the constructions of these Sha↓s, but not how they are to be evaluated. However, there is one additional issue: nested word prefixes cannot be considered as irrelevant for the subsequent subhedge if the selection variable x can be bound there, even if acceptance does not depend on where it will be bound. This is because the position of the binding is to be returned by any query answering algorithm.
Definition 15. We call a prefix v binding irrelevant for L and if there does not exist any hedge h containing x and a suffix w such that and .
Our subhedge projecting evaluator can project only at subhedge irrelevant prefixes that are also binding irrelevant. Given a dSha A and a set of schema-final states , one can decide whether a prefix is binding relevant: this is the case if the current state q does not itself bind x but can access some other state that is not safe for rejection and permits x-bindings. In ground Datalog, we can distinguish binding irrelevant states q by imposing the following rules for all :
binding_irrelev(q) :- state(q), not binding_relev(q).
binding_relev(q) :- not bind_x(q), acc(q,q’), bind_x(q’), not rej(q’).
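These two rules admit a direct bottom-up reading. The following Python rendering is ours, with the accessibility relation acc computed as a reflexive-transitive closure of the transition edges:

```python
def accessible(states, edges):
    """Reflexive-transitive closure of the transition edges, computed as a
    least fixed point (mirroring bottom-up Datalog evaluation)."""
    acc = {(q, q) for q in states} | set(edges)
    changed = True
    while changed:
        changed = False
        for (p, q) in list(acc):
            for (q2, r) in list(acc):
                if q == q2 and (p, r) not in acc:
                    acc.add((p, r))
                    changed = True
    return acc

def binding_irrelevant(states, acc, bind_x, rej):
    """Direct reading of the two Datalog rules above: a state is binding
    relevant iff it does not itself bind x but can access a state that
    binds x and is not safe for rejection; all other states are binding
    irrelevant."""
    relev = set()
    for q in states:
        if q in bind_x:
            continue                              # not bind_x(q) fails
        if any((q, q2) in acc and q2 in bind_x and q2 not in rej
               for q2 in states):
            relev.add(q)
    return states - relev
```

In practice, the fixed points are computed by the Datalog engine itself; the sketch only makes the intended semantics of the rules explicit.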
Note in our example of the XPath query
self::list[child::item] that state
2D3 of the d
Sha↓ in
Figure 23 is binding irrelevant, since
x must be bound on the top-level
element in state
2D1, so it cannot be bound on any
element below.
11. Experiments with Streaming XPath Evaluation
We experimentally compare three streaming evaluators for dShas performing earliest monadic query answering: without subhedge projection, with safe-no-change projection, and with congruence projection.
11.1. Benchmark dSha for XPath Queries
We start from d
Sha defining monadic queries for the regular XPath queries A1-A8 from the XPathMark benchmark [
7] that are given in
Table 1. These XPath queries show most of the features of the regular fragment of XPath. In
Table 2, we added 14 further XPath queries that we found useful for testing.
Deterministic
Shas for A1–A8 were provided earlier in [
21]. We compiled the other XPath queries to dShas via nested regular expressions.
In order to produce the input d
Shas for our evaluators, we intersect these automata with a d
Sha for the schema
where
is the schema we use for hedge encodings of real-world
Xml documents with
x annotations. Thereby, we could identify the schema final states
. We minimized and completed the result.
Table 3 reports the size of the input d
Shas for our evaluators obtained by the above procedure. For each d
Sha A, the size is given by two numbers size(n), the first for the overall size and the second for the number of states n.
For the input d
Sha A of each query, we statically computed the whole d
Sha↓s
while using the necessary parts of the determinization algorithm from [
16]. The size of the d
Sha↓s obtained and the number
d of difference relations are reported in
Table 3 too. The biggest size is 2091(504) for the input d
Sha for A8. The largest number of difference relations d = 24 is also obtained for this query. So, indeed, the size of these automata is much smaller than one might expect from a construction that is highly exponential in the worst case. However, the only query for which we could not obtain the corresponding deterministic earliest congruence projection
is A5.
Computing the earliest congruence projection took between 2.3 and 26 s.
We note that we have not yet computed the complete dSha↓s statically for safe-no-change projection , pure congruence projection , and earliest query answering . We believe that these automata may become bigger than , which we constructed in a single shot.
11.2. Streaming Evaluation Tool: AStream
We are using and developing the AStream tool for answering monadic d
Sha queries on
Xml streams with schema restrictions. Version 1.01 of AStream, presented in [
21], supports earliest query answering without projection. Given a d
Sha A defining a monadic query, a set of schema final states
and an
Xml stream
w, it constructs on-the-fly the needed part of the
Sha↓ while parsing
w and evaluating
A on the hedge encoding
w.
For the FCT conference version of the present paper [
24], we enhanced AStream with safe-no-change projection. This leads to version AStream 2.01. It constructs the earliest safe-no-change projection d
Sha↓ on the fly while evaluating the monadic query defined by
A on the hedge encoding the
Xml stream
w. Subhedge projection can be switched on or off in order to compare both versions without further implementation differences.
For this present journal version, we added earliest congruence projection and integrated it into AStream leading to AStream 3.0.
AStream 3.0 is different from the two previous versions in that the dSha↓ is constructed statically and entirely, independently of the input hedge. We then use a generic earliest streaming evaluator for Sha↓s that rejects in rejection states, selects in selection states, and projects in all subhedge projection states. This evaluator could also be run with or , as long as these do not grow too big.
It should be noticed that the dSha↓ for earliest congruence projection turned out to be nicely small for our whole benchmark, while the other dSha↓s risk becoming bigger. As stated earlier, we did not construct these dSha↓s so far for our benchmark queries. This is why we continued to run earliest streaming with safe-no-change projection with AStream 2.01, while we used AStream 3.0 for earliest streaming with congruence projection.
All AStream versions rely on the Java library AbcDatalog for computing the least fixed points of Datalog programs in a bottom-up manner. Datalog programs are needed for all logical reasoning: for computing the subsets of states that are safe for rejection, selection, or subhedge projection on all levels. The difference relations are also computed with Datalog. Note that earliest on-the-fly query evaluation requires running Datalog during query evaluation, while with a static approach that constructs the dSha↓s entirely, Datalog is only needed at preprocessing time during the automaton construction.
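The bottom-up least-fixed-point computation delegated to Datalog can be illustrated with a minimal sketch. The sketch below is not the actual AbcDatalog interface; the rules and relations (a state is "safe" if it is final or has a transition to a safe state) are simplified stand-ins for the safety analyses mentioned above.

```python
# Naive bottom-up Datalog evaluation: apply all rules repeatedly
# until no new facts are derivable (the least fixed point).
final_states = {3}                               # final(q) facts
transitions = {(0, 1), (1, 2), (2, 3), (4, 4)}   # trans(q, p) facts

def least_fixed_point(facts, step):
    """Iterate the rule-application operator until saturation."""
    while True:
        new = step(facts) - facts
        if not new:
            return facts
        facts |= new

def step(safe):
    # safe(q) :- final(q).
    # safe(q) :- trans(q, p), safe(p).
    derived = set(final_states)
    derived |= {q for (q, p) in transitions if p in safe}
    return derived

safe = least_fixed_point(set(), step)
print(sorted(safe))  # [0, 1, 2, 3]: states reaching a final state
```

The same saturation scheme applies to the richer rules used for rejection, selection, and subhedge-projection safety; only the `step` operator changes.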
11.3. Evaluation Measures
We want to measure the time for running the three streaming query evaluators for all dShas obtained from the XPath expressions in the collection in Table 1 and Table 2.
One advantage of the XPathMark benchmark is that it comes with a generator of Xml documents to which the queries can be applied and which can be scaled in size. We created an Xml document of size 1.1 GB, which is sufficiently large to make streaming mandatory. Our experiments show that the time of query evaluation grows linearly with the size of the non-projected part of the Xml document. This holds for each of our three evaluators. Therefore, measuring the time of the evaluators on Xml documents of other sizes would not yield any new insights, except that memory consumption remains moderate too.
The time gain of projection for a query Q is the percentage of time saved by projection during query evaluation when ignoring the pure document parsing time of 15 s:

time-gain(Q) = 100 · (1 − (time_with-proj(Q) − 15) / (time_without-proj(Q) − 15))
We can measure the time gain for earliest safe-no-change projection by AStream 2.01 and for earliest congruence projection by AStream 3.0, since projection can be switched off in both of them. The disadvantage of the time gain is that it depends on the details of the implementation.
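As a concrete illustration, the time gain can be computed as follows; the timings in the example are hypothetical, and the 15 s parsing time is the figure assumed throughout this section.

```python
def time_gain(time_no_proj: float, time_proj: float,
              parse_time: float = 15.0) -> float:
    """Percentage of evaluation time saved by projection,
    ignoring the pure document parsing time (assumed 15 s)."""
    eval_no_proj = time_no_proj - parse_time
    eval_proj = time_proj - parse_time
    return 100.0 * (1.0 - eval_proj / eval_no_proj)

# Hypothetical timings: 815 s without projection, 31 s with projection.
print(round(time_gain(815.0, 31.0), 1))  # 98.0
```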
The event gain is a better measure of the projection power. The parser sends a stream of needed events to the Sha↓, while ignoring parts of the input document that can be projected away. For this, it must always be informed by the automaton about which kinds of events can be projected in the current state. The event gain of projection is then the percentage of events that are recognized to be irrelevant and thus no longer created by a projecting evaluator for query Q. One might expect that the time gain is always smaller than the event gain.
Indeed, this will be confirmed by our experiments. We believe that the discrepancy between these two measures indicates how much room remains for optimizing the implementation of an evaluator. In this way, we can observe that some room for further optimization of AStream 3.0 still remains.
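The event gain is computed analogously from event counts rather than times; the counts in the example below are hypothetical.

```python
def event_gain(events_no_proj: int, events_proj: int) -> float:
    """Percentage of parser events recognized as irrelevant and thus
    no longer created by a projecting evaluator."""
    return 100.0 * (1.0 - events_proj / events_no_proj)

# Hypothetical counts: 50 million events without projection,
# 900 thousand with projection.
print(round(event_gain(50_000_000, 900_000), 1))  # 98.2
```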
11.4. Experiments
We used Java's streaming parser interface to parse and stream Xml files in Scala. Parsing the 1.1 GB Xml document required 15 s, while querying it without projection took between 600 and 900 s, depending on the query. The average processing time for 1 MB was thus 0.72 s. Once we knew this, we predicted the expected evaluation time from the size of the non-projected part of the Xml document as follows:

predicted-time(Q) = 15 s + 0.72 s/MB · size of the non-projected part of the document in MB
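This prediction can be sketched as a one-line function, using the 0.72 s/MB average processing rate and a 15 s parsing time; the 20 MB figure in the example is hypothetical.

```python
def predicted_time(non_projected_mb: float,
                   rate_s_per_mb: float = 0.72,
                   parse_s: float = 15.0) -> float:
    """Predict the evaluation time in seconds from the size (in MB)
    of the non-projected part of the document."""
    return parse_s + rate_s_per_mb * non_projected_mb

# E.g., if only 20 MB of the 1.1 GB document must be inspected:
print(round(predicted_time(20.0), 1))  # 29.4
```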
Without projection, the parser generates the full stream of events for the 1.1 GB Xml document. This is the baseline for computing the event gain for all queries Q.
We ran AStream with Scala v2.13.3 on a machine with the operating system Ubuntu 20.04.06 LTS (64-bit). The machine is a Dell Precision-7750 machine (Round Rock, TX, USA), equipped with an Intel® Core™ i7-10875H CPU @ 2.30GHz × 16 and 32 GB of RAM (Santa Clara, CA, USA).
11.4.1. Earliest Congruence Projection
We present the measures of our evaluator with earliest congruence projection in Table 4. The event gain is high, varying between 75.7% for A1_0c and 100% for A0, A1_0a, A1_0b, A1_4, A1_5, and A4_1. The event gain for queries without descendant axis is above 98.2%. For these queries, all subtrees that are not on the main path of the query can be projected away. In particular, only subtrees up to a fixed depth have to be inspected. These are the queries A1, A4, A6–A8, and A4_0, as well as the queries with event gain 100% listed above. The time gain for all these queries is a little smaller than the event gain but still above 98.2%.
We next consider the second type of queries, having the descendant axis, specifically A2, A1_0c, A1_1a, A1_1d, A1_2, A1_6, and A2_1. Intuitively, a lower event gain is expected in these cases because subhedge projection with the descendant axis cannot directly exclude an entire subtree under a given element; all descendant elements must be inspected. For instance, the query A1_0c must select all attributes of all elements below the root element, requiring the inspection of every Xml element in the document for attribute presence, except for the root element itself.
The lower event gains reported for the aforementioned queries align with the above expectation, showing percentages ranging from 75.7% to 81.1%. These queries also exhibit a larger gap between time gain and event gain, which ranges from 2.2% to 5.7% and will be subject to future optimization. It is worth noting that some of these queries combine multiple features at once, like queries A1_1a and A1_1d, which have a descendant axis, filter with attributes, and compare strings.
The only exceptions to this trend among the queries having a descendant axis are A3 and A1_3, with respective event gains of 97.8% and 99.8%. The reason behind this high gain is that these queries start from the root with a child axis path of depth 3 before using the descendant axis. This initial path allows most elements that do not fit the specified path to be excluded and projected away together with their entire subtrees.
11.4.2. Earliest Safe-No-Change Projection
We compare earliest safe-no-change projection with earliest congruence projection in Table 5. For this, we restrict ourselves to the queries A1–A8 from the XPathMark in Table 1.
The XPath queries A1, A4, A6, A7, and A8 do not have descendant axes. The time gain for safe-no-change projection on these queries is consistently high. For earliest congruence projection, the time gain was better for A1, A4, and A6, with at least 96.6%. For A7 and A8, the time gain of safe-no-change projection is close to the event gain of congruence projection, but the time gain of congruence projection is 1.5% lower. So, we hope that an optimized version of congruence projection could outperform safe-no-change projection in all cases.
The XPath queries with descendant axis are A2, A3, and A5. For A2 and A3 in particular, the time gain of safe-no-change projection is very low, not even reaching 10.9%, while congruence projection yields at least 77.9%. We believe that this is a bug in our current implementation of safe-no-change projection, which may be prohibiting the projection of attributes and text nodes for these queries (but not for the others). Congruence projection on A2 gains 77.9% of time and 81.1% of events. This is much better, but still far from the projection power for queries without descendant axis. For A3, congruence projection gains 97.4% of time. This is much better than for A2, since the descendant axis in A3 is at the end of the path, while for A2 it is at the beginning. But even for A2, congruence projection is very helpful.
11.4.3. Comparison to External Tools
QuiXPath was shown to be the most efficient large-coverage external tool for regular XPath evaluation on Xml streams [4]. Therefore, we compare the efficiency of congruence projection to QuiXPath in Table 6, and thereby, by transitivity, to the many alternative tools. We note that QuiXPath performs early query answering by compilation to possibly nondeterministic Nwas with selection states, without always being earliest. Apart from this, it bases its efficiency on both subtree projection and descendant projection.
The experiments with QuiXPath in [4] use the parse-free time. Therefore, we do the same for congruence projection by reducing the measured overall time by the 15 s of parsing time.
Compared on the parse-free times for the XPath queries without descendant axis, earliest congruence projection is on average only slower than QuiXPath by a factor of 1.3–2.9. This shows that our implementation is already close to being competitive with existing Xml streaming tools, while being the only one guaranteeing earliest query answering.
For query A2 with a descendant axis, congruence projection is slower than QuiXPath by a factor of 13.8. The reason is that QuiXPath supports descendant projection, while congruence projection concerns only subhedges. For query A3, with a descendant axis at the end, congruence projection is only behind by a factor of 1.6, so descendant projection seems less relevant here.
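The slowdown factors reported above are computed on parse-free times, i.e., our measured total time minus the 15 s of parsing, divided by QuiXPath's parse-free time. A small sketch with hypothetical timings:

```python
def slowdown(our_total_s: float, quixpath_parse_free_s: float,
             parse_s: float = 15.0) -> float:
    """Factor by which congruence projection is slower than QuiXPath,
    comparing parse-free times (our total minus 15 s of parsing)."""
    return (our_total_s - parse_s) / quixpath_parse_free_s

# Hypothetical times: 42 s total for us vs. 18 s parse-free for QuiXPath.
print(round(slowdown(42.0, 18.0), 1))  # 1.5
```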