In our research, we have developed a number of approaches for adapting the algorithms from the previous section to work on general trees, and we review these approaches in this section.
4.1. Dynamic Programming Approaches
Problems with generalizing the dynamic programming algorithm for binary trees to general trees are two-fold. First, using edges to claim quartets is only meaningful for resolved quartets, and secondly, the resolved quartets claimed by an edge can be distributed over a large number of trees. The first problem can possibly be dealt with by using nodes to claim unresolved quartets, but the second problem is more serious.
We can still compute the tables of the intersections of trees, |F ∩ G|. If nodes can have a degree up to d, then in the recursion for building tables |F ∩ G|, we need to combine counts for O(d2) pairs of trees, but the total sum of degrees for each tree is bounded by O(n). Therefore, the product of the pairs of degrees is bounded by O(n2). Worse, however, is computing A(e1,e2) where we need to consider all ways of picking trees F2 and F3 and pairing them with choices G2 and G3, with a worst-case performance of O(d4) for each pair of edges.
In a series of papers, we have developed different algorithms for computing the quartet distance between general trees efficiently by avoiding explicitly having to deal with choosing pairs of trees for inner nodes. Common for these is that we also avoid explicitly handling unresolved quartets, but only consider the resolved topologies and handle the unresolved quartets implicitly.
Christiansen
et al. [
20] developed two algorithms for computing the quartet distance between general trees. The idea in the first algorithm, which runs in time
O(
n3), is to consider all triplets, {
i,
j,
k}, and count for each fourth leaf-label,
x, how many of the quartets, {
i,
j,
k,
x}, have the same topology in the two trees; whether this be resolved topologies,
ix|
jk,
ij|
xk or
ik|
jx counting
A, or unresolved, (
ijkx) counting
E. Counting this way, each shared quartet topology will be counted twelve times; so, the algorithm actually computes 12 · (
A +
E), and we must divide by twelve at the end. We explicitly iterate through all
O(
n3) triplets, and the crux of the algorithm is counting the shared quartet topologies in constant time.
The approach is as follows: For each triplet, {
i,
j,
k}, we identify the “center”,
C, of the induced topology in both trees. Let
Fi,
Fj and
Fk be the trees in
T1 connected to
C containing leaves
i,
j and
k, respectively, and let
Gi,
Gj and
Gk be the corresponding trees in
T2. A resolved quartet,
ix|
jk, is shared between the two trees if
x is in
Fi and in
Gi (see
Figure 7), so the number of such quartets is |
Fi ∩
Gi| − 1 (minus one because we otherwise would also count
ii|
jk). Therefore, given
i,
j and
k, the number of shared resolved quartets is:
which can be computed in constant time from the tables we build in preprocessing using dynamic programming, assuming we know the centers for the triplet.
Figure 7.
Computing quartets between high-degree nodes.
Figure 7.
Computing quartets between high-degree nodes.
Now, let
F−ijk denote all the leaves in
T1, except those in
Fi,
Fj and
Fk, and similarly, let
G−ijk denote the set of leaves in
T2 not in
Gi,
Gj and
Gk. The number of unresolved quartets (
ijkx) is then given by |
F−ijk ∩
G−ijk|. We cannot directly build a table of all |
F−ijk ∩
G−ijk| in
O(
n2), since just from the number of indices (
i,
j and
k), we would need a table of size
O(
n3). Instead, we can compute:
where:
(we can tabulate |
Gl| for all
l in
O(
n) in preprocessing) and:
for
l =
i,
j,
k; so, we can also count how many unresolved topologies we have for (
ijkx) if we have built tables in preprocessing and we know the center nodes.
To get the running time of O(n3), we thus only need to find the center nodes in constant time for each triplet, {i, j, k}. We achieve this by finding a linear number of centers, in linear time, for a linear number of triplets. For pairs, {i, j}, we iterate through all k and their corresponding centers in linear time.
The idea is as follows: for each pair,
i and
j, we identify the path between them, which is easily done in
O(
n). Along this path, we then consider all inner nodes and the trees branching off. We can explicitly iterate through all
k in such a subtree, and the corresponding center node is the node where the subtree branches off the path from
i to
j (see
Figure 8). This way, for each pair, {
i,
j}—of which there are
O(
n2)—we iterate through leaves
k in
O(
n) and count the number of shared quartets, {
i,
j,
k,
x}, giving us a total running time of
O(
n3).
The second algorithm in Christiansen
et al. [
20] completely avoids dealing with unresolved quartet topologies by only counting the resolved topologies that are either shared,
A, or that differ,
B, between the two trees. Recall that the quartet distance between trees
T1 and
T2 are given by
B+
C+
D in
Figure 2. If we have functions A(
T1,
T2) and B(
T1,
T2) that compute
A and
B for the two trees, respectively, then we can compute the quartet distance without considering unresolved topologies explicitly, since
A +
B +
C = A(
T1,
T1),
A +
B +
D = A(
T2,
T2), and:
Computing the row and column sums,
A +
B +
C and
A +
B +
D, can also be done faster than using the A function; in fact, it can be computed in linear time for both triplets and quartets using dynamic programming (see either Christiansen
et al. [
21] or Brodal
et al. [
5] for details).
Figure 8.
Handling all trees hanging off the path from i to j.
Figure 8.
Handling all trees hanging off the path from i to j.
Should one want to compute the parameterized quartet distance instead, it can be done a little more cumbersomely, but still using only the A and B counts:
although that is not likely to be an efficient way of doing this. We will not consider it in more detail for now, however.
The second algorithm from Christiansen
et al. [
20] is based on counting the resolved quartets, whether case
A or
B, using oriented edges,
, and the quartets they claim. Consider oriented edges
e1 and
e2 in
T1 and
T2, respectively, and let
v1 and
v2 denote the destination nodes of the edges. If
v1 or
v2 are high-degree nodes, then the edges do not correspond to unique claims,
and
, since the trees,
F2,
F3,
G2 and
G3, are not uniquely defined (see
Figure 9a). To get around this, the algorithm transforms the trees by expanding (arbitrarily) each high-degree node into a binary tree, tagging edges, so we can distinguish between original edges and edges caused by the expansion (see
Figure 9b). We then define
extended claims,
, consisting of two edges,
e1 and
, such that
e1 is one of the original edges,
is a result of expanding nodes and such that the oriented path from
e1 to
only goes through expansion edges. The extended claims play the role that claims do in the original algorithm, and each resolved quartet in the original tree is claimed by exactly two extended claims in the expanded tree.
The algorithm then implements functions A and B by explicitly iterating through all pairs of extended claims. Each of the original O(n) edges are expanded into at most O(nd) extended claims, where d is the maximal degree of the trees, and by counting the equal or different quartet topologies in constant time for each pair of extended claims, the total running time is O(n2d2).
Figure 9.
Expanding high-degree nodes.
Figure 9.
Expanding high-degree nodes.
The A function counts the number of equal topologies from the table of |
F ∩
G| counts exactly as the algorithm for binary trees:
and:
The B function, counting the number of resolved quartets, {
i,
j,
k,
l}, with different topologies, also uses the tables, but with a slightly different expression. Since we have resolved, but different, quartet topologies whenever
claims
ij →
kl (
i.e.,
i,
j ∈
F1,
k ∈
F2 and
l ∈
F3) and
claims
ik →
jl,
il →
jk,
jk →
il or
jl →
ik, we can count as:
Because of symmetries and counting each quartet in two claims in each tree, this will over-count by a factor of four, so the number of resolved, but different, quartet topologies between the two trees is counted as:
Christiansen
et al. [
21] improved the running time to
O(
n2d) by changing the counting schemes for the A and B functions. The gist of the ideas in the improved counting scheme is to extend the table of shared leaf counts, |
F ∩
G|, with values |
F ∩
G|, |
F ∩
G| and |
F ∩
G|, where
F denotes the set of leaves
not in subtree
F. Using these extra counts, the algorithm builds additional tables, over-counting the number of quartets claimed and, then, adjusting the over-count. For details on this, rather involved, counting scheme, we refer to the original paper [
21].
Using the same basic idea, but with yet another counting scheme and another set of tables counting sets of shared leaves, Nielsen
et al. [
22] developed an
O(
n2+α) algorithm, where
α = (
ω − 1)/2 and
O(
nω) is the time it takes to multiply two
n ×
n matrices. Using the Coppersmith-Winograd algorithm [
23], where
ω = 2.376, this yields a running time of
O(
n2.688) and was the first guaranteed sub-cubic time algorithm for computing the quartet distance between general trees. Here, the underlying idea is to reduce an explicit iteration over
O(
d2) pairs of claims to a matrix-matrix multiplication as part of the counting iteration. Again, we refer to the original paper for details [
22].
4.2. Tree Coloring
The coloring approach of Brodal
et al. [
14,
15] for binary trees, described in
Section 3.2, has been extended to general trees, first by Stissing
et al. [
24] and recently by Brodal
et al. [
5].
The result of [
24] is an
O(
d9n log
n) time algorithm for computing the quartet distance between two trees,
T1 and
T2, where
d = max{
d1,
d2} for
di, the maximal degree of a node in tree
Ti. This was the first algorithm for general trees allowing for sub-quadratic time. The approach of [
24] stays rather close to that of [
14,
15], but with a decoration scheme adapted to general trees. The dependency,
d9, on d stems from the HDT data structure having a
d factor in its balance bound when applied to general trees and from the applied decoration scheme for the HDT requiring
O(
d8) time for the update of the decoration of a node based on its children after a color change.
The results of [
5] are an
O(
n log
n) time algorithm for computing the triplet distance and an
O(
dn log
n) time algorithm for computing the quartet distance, both for general trees, with
d defined as above. The improved dependency on
d stems from a new definition of the HDT data structure achieving good balance for general trees and from a much more elaborate decoration scheme, with a large system of auxiliary counters assisting in the calculation of the main counters.
Johansen and Holt [
25] have later shown how to improve the algorithm, such that
d = min{
d1,
d2}, which is of significance if only one of the trees has a high degree.
The new HDT definition is based on the four types of components shown in
Figure 10. Here, the base components, constituting the leaves of the HDT, are
L components containing a leaf from
T2 and
I components containing a single internal node from
T2. The internal nodes of the HDT are formed by
C components containing a connected set of nodes with at most two external edges to other components and
G components containing the subtrees of some of the siblings of a node in
T2. During construction of the HDT,
C and
G components are created by the transformations and compositions shown in
Figure 11. The construction proceeds in rounds, with each round performing a set of non-overlapping compositions. It is shown in [
5] that in
O(log
n) rounds and
O(
n) time, a 1/ log(10/9)-locally balanced HDT tree is formed. This plays the role of the HDT described in
Section 3.2, but now, for general trees.
Figure 10.
The four different types of components.
Figure 10.
The four different types of components.
Figure 11.
The two types of transformations (top) and the three types of compositions (bottom).
Figure 11.
The two types of transformations (top) and the three types of compositions (bottom).
Additional changes include the use of
d + 1 colors, denoted 0, 1, 2,...,
d, and new definitions of which node a triplet/quartet is associated with. The node in question is denoted the
anchor, and the choice of anchors can be seen in
Figure 12. This choice forms the basis for the definition of the decoration system, which, as said, is quite involved and for which we refer to [
5] for details (as well as to [
25] for minor corrections and for a trimmed variant of the system).
Figure 12.
The anchors (white nodes) of resolved and unresolved triplets and quartets. Edges in the figures represents paths in the tree.
Figure 12.
The anchors (white nodes) of resolved and unresolved triplets and quartets. Edges in the figures represents paths in the tree.
The remaining parts follow the lines of [
14,
15] (see
Section 3.2). The basic algorithm, corresponding to
Figure 4 for binary trees, for traversing
T1 recursively can be seen in
Figure 13. It maintains the following invariants: (1) When entering a node,
v, all leaves in the subtree of
v have the color, one, and all leaves not in the subtree of
v have the color, zero; (2) When exiting
v, all leaves in
T1 have the color, zero. To initialize Invariant (1), all leaves are colored one at the start. As in
Section 3.2, the operations, Contract and Extract, are then added (analogously to
Figure 6) for exploiting a variant of the “extended smaller-half trick”, in order to arrive at the stated running times.
Figure 13.
The main algorithm performing a recursive traversal of T1.
Figure 13.
The main algorithm performing a recursive traversal of T1.