Article

Polynomial-Time Constrained Message Passing for Exact MAP Inference on Discrete Models with Global Dependencies

by Alexander Bauer 1,2,*, Shinichi Nakajima 1,3,4 and Klaus-Robert Müller 1,3,5,6,*
1 Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
2 BASLEARN—TU Berlin/BASF Joint Lab for Machine Learning, Technische Universität Berlin, 10587 Berlin, Germany
3 Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany
4 RIKEN Center for AIP, Tokyo 103-0027, Japan
5 Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Republic of Korea
6 Max-Planck-Institut für Informatik, Saarland Informatics Campus E1 4, 66123 Saarbrücken, Germany
* Authors to whom correspondence should be addressed.
Mathematics 2023, 11(12), 2628; https://doi.org/10.3390/math11122628
Submission received: 7 April 2023 / Revised: 1 June 2023 / Accepted: 6 June 2023 / Published: 8 June 2023
(This article belongs to the Special Issue Dynamic Programming)

Abstract:
Considering the worst-case scenario, the junction-tree algorithm remains the most general solution for exact MAP inference with polynomial run-time guarantees. Unfortunately, its main tractability assumption requires the treewidth of a corresponding MRF to be bounded, strongly limiting the range of admissible applications. In fact, many practical problems in the area of structured prediction require modeling global dependencies by either directly introducing global factors or enforcing global constraints on the prediction variables. However, this always results in a fully-connected graph, making exact inference by means of this algorithm intractable. Previous works focusing on the problem of loss-augmented inference have demonstrated how efficient inference can be performed on models with specific global factors representing non-decomposable loss functions within the training regime of SSVMs. Making the observation that the same fundamental idea can be applied to solve a broader class of computational problems, in this paper, we adjust the framework for efficient exact inference to allow much finer interactions between the energy of the core model and the sufficient statistics of the global terms. As a result, we greatly increase the range of admissible applications and strongly improve upon the theoretical guarantees of computational efficiency. We illustrate the applicability of our method in several use cases, including one that is not covered by the previous problem formulation. Furthermore, we propose a new graph transformation technique via node cloning, which ensures a polynomial run-time for solving our target problem. In particular, the overall computational complexity of our constrained message-passing algorithm depends only on form-independent quantities such as the treewidth of a corresponding graph (without global connections) and the image size of the sufficient statistics of the global terms.
MSC:
68T99

1. Introduction

Many practical tasks can be effectively formulated as discrete optimization problems within the framework of graphical models such as Markov Random Fields (MRFs) [1,2,3] by representing the constraints and objective function in a factorized form. Finding the corresponding solution refers to the task of maximum a posteriori (MAP) inference, which is known to be NP-hard in general. Although there are plenty of existing approximation algorithms [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23], several problems (described below) require finding an optimal solution. Existing exact algorithms [24,25,26,27,28,29,30,31,32,33,34], on the other hand, either make specific assumptions about the energy function or do not provide polynomial run-time guarantees for the worst case. Assuming the worst-case scenario, the junction (or clique)-tree algorithm [1,35], therefore, remains the most efficient and general solution for exact MAP inference. Unfortunately, its main tractability assumption requires the treewidth [36,37] of a corresponding MRF to be bounded, strongly limiting the range of admissible applications by excluding models with global interactions. Many problems in the area of structured prediction, however, require modeling of global dependencies by either directly introducing global factors or enforcing global constraints on the prediction variables. Among the most popular use cases are (a) learning using non-decomposable (or high-order) loss functions and training via slack scaling formulation within the framework of a structural support vector machine (SSVM) [38,39,40,41,42,43,44,45,46,47], (b) evaluating generalization bounds in structured prediction [14,15,44,48,49,50,51,52,53,54,55,56,57,58,59], and (c) performing MAP inference on otherwise tractable models subject to global constraints [19,60,61]. The latter covers various search problems, including the special task of (diverse) k-best MAP inference [62,63]. Learning with non-decomposable loss functions, in particular, benefits from finding an optimal solution, as all of the theoretical guarantees of training with SSVMs assume exact inference during optimization [39,64,65,66,67,68].
Previous works [45,46,47,69] focusing on the problem of loss-augmented inference (use case (a)) have demonstrated how efficient computation can be performed on models with specific global factors by leveraging a dynamic programming approach based on constrained message passing. The proposed idea models non-decomposable functions as a kind of multivariate cardinality potential $\mathbf{y} \mapsto \eta(\mathbf{G}(\mathbf{y}))$, where $\eta \colon \mathbb{R}^P \rightarrow \mathbb{R}_+$ is some function and $\mathbf{G}$ denotes the sufficient statistics of the global term. Although able to model popular performance measures, the objective of a corresponding inference problem is rather restricted to simple interactions between the energy of the core model $F$ and the sufficient statistics $\mathbf{G}$ according to $F \circ \eta(\mathbf{G})$, where $\circ \colon \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ is either a summation or a multiplication operation. Although the same framework can be applied to use case (c) by modeling global constraints via an indicator function, it cannot handle a range of other problems in use case (b) that introduce more subtle dependencies between $F$ and $\mathbf{G}$.
In this paper, we extend the framework for an efficient exact inference proposed in [69] by allowing much finer interactions between the energy of the core model and the sufficient statistics of the global terms. The extended framework covers all the previous cases and applies to new problems, including the evaluation of generalization bounds in structured learning, which cannot be handled by the previous approach. At the same time, the generalization comes with no additional costs, preserving all the run-time guarantees. In fact, the resulting performance is identical to that of the previous formulation, as the corresponding modifications do not change the computational core idea of the previously proposed message passing constrained via auxiliary variables but only affect the final evaluation step (line 8 in Algorithm 1) of the resulting inference algorithm after all the required statistics have been computed. We accordingly adjust the formal statements given in [69] to ensure the correctness of the algorithmic procedure for the extended case.
Furthermore, we propose an additional graph transformation technique via node cloning that greatly improves the theoretical guarantees on the asymptotic upper bound for the computational complexity. In particular, the above-mentioned work only guarantees polynomial run-time in the case where the core model can be represented by a tree-shaped factor graph, excluding problems with cyclic dependencies. The corresponding complexity estimation for clique trees (see Theorem 2 in [69]), however, requires the maximal node degree ν to be bounded by a graph-independent constant; otherwise, it results in a term that depends exponentially on the graph size. Here, we first provide an intuition that ν tends to take on small values (see Proposition 2) and then present an additional graph transformation that reduces this parameter to a constant ν = 3 (see Corollary 1). Furthermore, we analyze how the maximal number of states of auxiliary variables R, which greatly affects the resulting run-time, grows relative to the graph size (see Theorem 2).
The rest of the paper is organized as follows. In Section 2, we formally introduce the class of problems we tackle in this paper, and in Section 3, we present the constrained message-passing algorithm for finding a corresponding optimal solution based on a general representation with clique trees. Additionally, we propose a graph transformation technique via node cloning, which ensures an overall polynomial running time for the computational procedure. In Section 4, we demonstrate the expressivity of our problem formulation on several examples. For an important use case of loss-augmented inference with SSVMs, we show in Section 5 how to represent different dissimilarity measures as global cardinality potentials to align with our problem formulation. In order to validate the guarantees for the computational complexity of Algorithm 1, we present the experimental run-time measurements in Section 6. In Section 7, we provide a summary of our contributions, emphasizing the differences between our results and those in [69]. In Section 8, we broadly discuss the previous works, which is followed by the conclusions in Section 9.

2. Problem Setting

Given an MRF [1,70] over a set of discrete variables, the goal of the maximum a posteriori (MAP) problem is to find a joint variable assignment with the highest probability. This problem is equivalent to minimizing the energy of the model, which describes the corresponding (unnormalized) probability distribution over the variables. In the context of structured prediction, it is equivalent to maximizing a score or compatibility function. To avoid ambiguity, we now refer to the MAP problem as the maximization of an objective function $F \colon \mathbb{R}^M \rightarrow \mathbb{R}$ defined over a set of discrete variables $\mathbf{y} = (y_1, \ldots, y_M)$. More precisely, we associate each function $F$ with an MRF, where each variable $y_m$ represents a node in the corresponding graph. Furthermore, we assume, without loss of generality, that the function $F$ factorizes over maximal cliques $\mathbf{y}_{C_t}$, $C_t \subseteq \{1, \ldots, M\}$, of the corresponding MRF according to
$F(\mathbf{y}) = \sum_{t=1}^{T} f_t(\mathbf{y}_{C_t}).$
We now use the concept of the treewidth of a graph [36] to define the complexity of a corresponding function with respect to the MAP inference, as follows.
Definition 1
($\tau$-decomposability). We say that a function $F \colon \mathcal{D} \subseteq \mathbb{R}^M \rightarrow \mathbb{R}$ is τ-decomposable if the (unnormalized) probability $\exp(F(\mathbf{y}))$ factorizes over an MRF with a bounded treewidth τ.
Informally, the treewidth describes the tree-likeness of a graph, that is, how well the graph structure resembles the form of a tree. In an MRF with no cycles going over the individual cliques, the treewidth is equal to the maximal size of a clique minus 1, that is, $\tau = \max_t |C_t| - 1$. Furthermore, the treewidth of a graph is considered bounded if it does not depend on the size of the graph in the following sense. If it is possible to increase the graph size by replicating the individual parts, the treewidth should not be affected by the number of variables in the resulting graph. One simple example is a Markov chain. Increasing the length of the chain does not affect the treewidth, which remains equal to the Markov order of that chain.
The treewidth is defined as the minimum width over all possible clique trees of a graph and can be computed algorithmically after transforming the corresponding MRF into a data structure called a junction tree or clique tree. Although the problem of constructing a clique tree with a minimum width is NP-hard in general, there are several efficient techniques [1] that can achieve good results with a width close to the treewidth.
In the following, let M be the total number of nodes in an MRF over the variables $\{y_m\}_{m=1}^{M}$, and let N be the maximum number of possible values each variable $y_m$ can take on. Assuming that the maximization step dominates the time for creating a clique tree, we obtain the following known result [71]:
Proposition 1.
The computational time complexity for maximizing a τ-decomposable function is upper bounded by $O(M \cdot N^{\tau+1})$.
The notion of τ -decomposability for real-valued functions naturally extends to mappings with multivariate outputs for which we now define joint decomposability.
Definition 2
(Joint $\tau$-decomposability). We say two mappings $\mathbf{G} \colon \mathcal{D} \subseteq \mathbb{R}^M \rightarrow \mathbb{R}^P$ and $\mathbf{G}' \colon \mathcal{D} \subseteq \mathbb{R}^M \rightarrow \mathbb{R}^{P'}$ are jointly τ-decomposable if they factorize over a common MRF with a bounded treewidth τ.
Definition 2 ensures the existence of a common clique tree with nodes $\{C_t\}_{t=1}^{T}$ and the corresponding potentials $\{\mathbf{g}_t, \mathbf{g}'_t\}_{t=1}^{T}$, where $\max_t |C_t| - 1 = \tau$, and
$\mathbf{G}(\mathbf{y}) = \sum_{t=1}^{T} \mathbf{g}_t(\mathbf{y}_{C_t}), \qquad \mathbf{G}'(\mathbf{y}) = \sum_{t=1}^{T} \mathbf{g}'_t(\mathbf{y}_{C_t}).$
Note that the individual factor functions are allowed to have fewer variables in their scope than in a corresponding clique, that is, $\mathrm{scope}(\mathbf{g}_t), \mathrm{scope}(\mathbf{g}'_t) \subseteq C_t$.
Building on the above definitions we now formally introduce a class of problem instances of MAP inference for which we later provide an exact message-passing algorithm.
Problem 1.
For $F \colon \mathcal{Y} \rightarrow \mathbb{R}$, $\mathbf{G} \colon \mathcal{Y} \rightarrow \mathbb{R}^P$, $H \colon \mathbb{R} \times \mathbb{R}^P \rightarrow \mathbb{R}$ with $\mathcal{Y} \subseteq \mathbb{R}^M$, $|\mathcal{Y}| \leq N^M$ and $P, M, N \in \mathbb{N}_+$, we consider the following discrete optimization problem:
$\underset{\mathbf{y} \in \mathcal{Y}}{\text{maximize}} \quad H(F(\mathbf{y}), \mathbf{G}(\mathbf{y}))$
where we assume that (1) $F$ and $\mathbf{G}$ are jointly τ-decomposable, and (2) $H$ is non-decreasing in the first argument.
In the next section, we present multiple examples of practical problems that align with the above-mentioned abstract formulation. As our working example, here, we consider the problem of loss-augmented inference within the framework of SSVMs. This framework includes two different formulations known as margin and slack scaling, both of which require solving a combinatorial optimization problem during training. This optimization problem is either used to compute the subgradient of a corresponding objective function or find the configuration of prediction variables that violates the problem constraints the most. For example, in the case of the slack scaling formulation, we can define $H(F(\mathbf{y}), \mathbf{G}(\mathbf{y})) = F(\mathbf{y}) \cdot \eta(\mathbf{G}(\mathbf{y}))$ for some $\eta \colon \mathbb{R}^P \rightarrow \mathbb{R}_+$, where $F(\mathbf{y}) = \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y})$ corresponds to the compatibility score given by the inner product between a joint feature map $\Psi(\mathbf{x}, \mathbf{y})$ and a vector of trainable weights $\mathbf{w}$ (see [39] for more details), and $\eta(\mathbf{G}(\mathbf{y})) = \Delta(\mathbf{y}^*, \mathbf{y})$ describes the corresponding loss function for a prediction $\mathbf{y}$ and a ground-truth output $\mathbf{y}^*$. In fact, a considerable number of popular loss functions used in structured prediction can generally be represented in this form, that is, as a multivariate cardinality-based potential that depends on counts of different label statistics.

3. Exact Inference for Problem 1

In this section, we derive a polynomial-time message-passing algorithm that always finds an optimal solution for Problem 1. The corresponding results can be seen as a direct extension of the well-known junction-tree algorithm.

3.1. Algorithmic Core Idea for a Simple Chain Graph

We begin by providing an intuition of why efficient inference is possible for Problem 1 using our working example of loss-augmented inference for SSVMs. For margin scaling, in the case of a linear η , the objective F ( y ) + η ( G ( y ) ) inherits the τ -decomposability directly from F and G and, therefore, can be efficiently maximized according to Proposition 1.
The main source of difficulty for slack scaling lies in the multiplication operation between F ( y ) and η ( G ( y ) ) , which results in a fully-connected MRF regardless of the form of the function η . Moreover, many popular loss functions used in structured learning require η to be non-linear, preventing efficient inference even for the margin scaling. Nevertheless, efficient inference is possible for a considerable number of practical cases, as shown below. Specifically, the global interactions between a jointly decomposable F and G can be controlled by using auxiliary variables at a polynomial cost. We now illustrate this with a simple example.
Figure 1. Factor graph representation for the margin-scaling objective, with a decomposable loss $G$ (on the left), and a (non-decomposable) high-order loss $\eta(G)$ (on the right).
Consider a (Markov) chain of nodes with a 1-decomposable F and 0-decomposable G (e.g., Hamming distance), that is,
$F(\mathbf{y}) = \sum_{m=1}^{M-1} f_m(y_m, y_{m+1}) \quad \text{and} \quad G(\mathbf{y}) = \sum_{m=1}^{M} g_m(y_m).$
We aim at maximizing an objective $F(\mathbf{y}) \circ \eta(G(\mathbf{y}))$, where $\circ \colon \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ is a placeholder for either a summation or a multiplication operation.
The case for margin scaling with a decomposable loss $\eta(G) = G$ is illustrated by the leftmost factor graph in Figure 1. Here, the corresponding factors $f_m$ and $g_m$ can be folded together, enabling an efficient inference according to Proposition 1. The non-linearity of $\eta$, however, can result (in the worst case!) in a global dependency between all the variable nodes, leading to a high-order potential $\eta(G)$, as illustrated by the rightmost factor graph in Figure 1. In slack scaling, even for a linear $\eta$, after multiplying the individual factors, we can see that the resulting model has an edge for every pair of variables, resulting in a fully-connected graph. Thus, for the last two cases, exact inference using the junction-tree algorithm is generally not feasible. The key idea of our approach is to relax the dense connections in these graphs by introducing auxiliary variables $\mathbf{L} = (\mathbf{l}_1, \ldots, \mathbf{l}_M) \in \mathbb{R}^{P \times M}$ subject to the constraints
$\mathbf{l}_m = \sum_{k=1}^{m} \mathbf{g}_k(y_k), \quad \forall m \in \{1, \ldots, M\}.$
More precisely, for $H(F(\mathbf{y}), \mathbf{G}(\mathbf{y})) = F(\mathbf{y}) \circ \eta(\mathbf{G}(\mathbf{y}))$, Problem 1 is equivalent to the following constrained optimization problem in the sense that both have the same optimal value and the same set of optimal solutions with respect to $\mathbf{y}$:
$\underset{\mathbf{y}, \mathbf{L}}{\text{maximize}} \quad F(\mathbf{y}) \circ \eta(\mathbf{l}_M) \qquad \text{subject to} \quad \mathbf{l}_{m+1} = \mathbf{l}_m + \mathbf{g}_{m+1}(y_{m+1}) \;\; \forall m \in \{1, \ldots, M-1\}, \qquad \mathbf{l}_1 = \mathbf{g}_1(y_1),$
where the new objective involves no global dependencies and is 1-decomposable if we regard η ( l M ) as a constant. We can make the local dependency structure of the new formulation more explicit by taking the constraints directly into the objective as follows:
$Q(\mathbf{y}, \mathbf{L}) = F(\mathbf{y}) \circ \eta(\mathbf{l}_M) + \mathbb{1}_{-\infty}[\mathbf{l}_1 \neq \mathbf{g}_1(y_1)] + \sum_{m=1}^{M-1} \mathbb{1}_{-\infty}[\mathbf{l}_{m+1} \neq \mathbf{l}_m + \mathbf{g}_{m+1}(y_{m+1})].$
Here, $\mathbb{1}_{\alpha}[\cdot]$ denotes the indicator function such that $\mathbb{1}_{\alpha}[\cdot] = \alpha$ if the argument in $[\cdot]$ is true and $\mathbb{1}_{\alpha}[\cdot] = 0$ otherwise; setting $\alpha = -\infty$, the indicator functions rule out the configurations that do not satisfy Equation (4) when maximization is performed. A corresponding factor graph for margin scaling is illustrated by the leftmost graph in Figure 2. We can see that our new augmented objective (6) shows only local dependencies and is, in fact, 2-decomposable.
Applying the same scheme for slack scaling yields a much more sparsely connected graph (see the rightmost graph in Figure 2). This is achieved by forcing the majority of connections to go through a single node l 4 , which we call a hub node. Actually, Q ( y , L ) becomes 2-decomposable if we fix the value of l 4 , which then can be multiplied into the corresponding factors of F. In this way, we can effectively reduce the overall treewidth at the expense of increased polynomial computation time (compared to a chain without the global factor), provided that the maximal number R of different states of each auxiliary variable l m is polynomially bounded in M, which represents the number of nodes in the original graph. In the context of training SSVMs, for example, the majority of the popular loss functions satisfy this condition (see Table 1 in Section 5 for an overview).
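The following minimal sketch (a toy chain with hypothetical factor values; names such as f_pair and g_unary are illustrative only, not the notation of an existing implementation) carries out the constrained dynamic program described above for the slack-scaling objective $F(\mathbf{y}) \cdot \eta(G(\mathbf{y}))$ and verifies the result against brute-force enumeration.

```python
# Toy sketch of the constrained chain DP from Section 3.1 (hypothetical factor
# values). The auxiliary variable l_m carries the running sum of the g_k, so the
# slack-scaling objective F(y) * eta(G(y)) is maximized exactly in O(M * N^2 * R)
# time instead of enumerating all N**M labelings.
import itertools

M, N = 5, 3                                   # chain length and number of states
y_star = (0, 1, 2, 1, 0)                      # hypothetical ground-truth labeling
f_pair = lambda m, a, b: 0.3 * (a == b) + 0.1 * ((a + b + m) % N)   # factors of F
g_unary = lambda m, a: int(a != y_star[m])    # factors of G (Hamming terms)
eta = lambda g: 1.0 + g ** 2                  # non-linear cardinality potential
H = lambda f, g: f * eta(g)                   # slack scaling; non-decreasing in f here

# dp maps (state of y_m, auxiliary value l_m) -> best partial score of F
dp = {(a, g_unary(0, a)): 0.0 for a in range(N)}
for m in range(M - 1):
    new_dp = {}
    for (a, l), score in dp.items():
        for b in range(N):
            key = (b, l + g_unary(m + 1, b))
            val = score + f_pair(m, a, b)
            if val > new_dp.get(key, float("-inf")):
                new_dp[key] = val
    dp = new_dp

mu = {}                                       # mu[l] = best F(y) among y with G(y) = l
for (a, l), score in dp.items():
    mu[l] = max(mu.get(l, float("-inf")), score)
best_dp = max(H(f, l) for l, f in mu.items())

# brute-force verification over all N**M labelings
F = lambda y: sum(f_pair(m, y[m], y[m + 1]) for m in range(M - 1))
G = lambda y: sum(g_unary(m, y[m]) for m in range(M))
best_bf = max(H(F(y), G(y)) for y in itertools.product(range(N), repeat=M))
assert abs(best_dp - best_bf) < 1e-9
print(best_dp)
```

For margin scaling with a non-linear $\eta$, the same table mu suffices; only the final combination step changes to $\max_l (\mu(l) + \eta(l))$.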

3.2. Constrained Message-Passing Algorithm on Clique Trees

The idea presented in the previous section is intuitive and enables the reuse of existing software. Specifically, we can use the conventional junction-tree algorithm for graphical models by extending the original graph with nodes corresponding to the auxiliary variables. Alternatively, instead of performing an explicit graph transformation, we can modify the message-passing protocol, which is asymptotically at least one order of magnitude faster. Therefore, we do not explicitly introduce auxiliary variables as graph nodes before constructing the clique tree but rather use them to condition the message-passing rules. In the following, we derive an algorithm for solving an instance of Problem 1 via constrained message passing on clique trees. The resulting computational scheme can be seen as a direct extension of the junction-tree algorithm to models with specific global factors. Note that the form of tractable global dependencies is constrained according to the definition of Problem 1.
First, similar to the conventional junction-tree algorithm, we need to construct a clique tree for a given set of factors that preserves the family structure and has the running intersection property. Note that we ignore the global term during this process. The corresponding energy is given by the function F according to the definition of Problem 1. There are two equivalent approaches for constructing the clique tree [1,70]. The first is based on variable elimination and the second is based on graph triangulation, with an upper bound of $O(M \cdot N^{\tau+1})$. Figure 3 illustrates the intermediate steps of the corresponding construction process for a given factor graph using the triangulation approach. The resulting clique tree example defines the starting point for our message-passing algorithm.
Assume that a clique tree with cliques C 1 , , C K is given, where C i denotes a set of indices of variables contained in the i-th clique. We denote a corresponding set of variables by y C i . Furthermore, we use the notations { f C i } i = 1 K and { g C i } i = 1 K to denote the clique potentials (or factors) related to the mappings F and G in the definition of Problem 1, respectively. Additionally, we denote by C r a clique chosen to be the root of the clique tree. Finally, we use the notation n e ( C i ) for the indices of the neighbors of the clique C i . We can now compute the optimal value of the objective in Problem 1 as follows. Starting at the leaves of the clique tree, we iteratively send messages toward the root according to the following message-passing protocol. A clique C i can send a message to its parent clique C j if it received all messages from the rest of its neighbors C k for k n e ( C i ) { j } . In that case, we say that C i is ready.
For each configuration of the variables $\mathbf{y}_{C_i \cap C_j}$ and parameters $\mathbf{l}_i \in \mathbb{R}^P$ (encoding the state of an auxiliary variable associated with the current clique $C_i$), a corresponding message from a clique $C_i$ to a clique $C_j$ can be computed according to the following equation:
$\mu_{C_i \rightarrow C_j}^{\mathbf{l}_i}(\mathbf{y}_{C_i \cap C_j}) = \max_{\mathbf{y}_{C_i \setminus C_j},\, \{\mathbf{l}_k\}} \left[ f_{C_i}(\mathbf{y}_{C_i}) + \sum_{k \in ne(C_i) \setminus \{j\}} \mu_{C_k \rightarrow C_i}^{\mathbf{l}_k}(\mathbf{y}_{C_k \cap C_i}) \right]$
where we maximize over all configurations of the variables $\mathbf{y}_{C_i \setminus C_j}$ and over all parameters $\{\mathbf{l}_k\} = \{\mathbf{l}_k \colon k \in ne(C_i) \setminus \{j\}\}$ subject to the following constraint
$\sum_{k \in ne(C_i) \setminus \{j\}} \mathbf{l}_k = \mathbf{l}_i - \mathbf{g}_{C_i}(\mathbf{y}_{C_i}).$
This means that each clique C i is assigned exactly one (multivariate) auxiliary variable l i and the range of possible values l i can take on is implicitly defined by Equation (8). After resolving the recursion in the above equation, we can see that the variable l i corresponds to a sum of the potentials g C k ( y C k ) for each previously processed clique C k in a subtree of the graph of which C i forms the root. We refer to Equation (4) in the previous subsection for comparison.
Algorithm 1 Constrained Message Passing on a Clique Tree
Require: clique tree {C_i}_i; Output: optimal assignment y*
1: while root clique C_r did not receive all messages do
2:   if a clique C_i is ready then
3:     for all y_{C_i ∩ C_j} and l_i do
4:       send a message μ^{l_i}_{C_i → C_j}(y_{C_i ∩ C_j}) to a parent clique C_j according to Equation (7); save the maximizing arguments λ^{l_i}_{C_i → C_j}(y_{C_i ∩ C_j}) := [y*_{C_i \ C_j}; {l_k}*]
5:     end for
6:   end if
7: end while
8: l* ← argmax_l H(μ(l), l), where μ(l) is defined by Equation (9)
9: Let y*_{C_r} be a maximizing argument for μ(l*) in Equation (9); starting with values l* and y*_{C_r}, recursively reconstruct an optimal configuration y* from λ according to Equation (7).
The algorithm terminates if the designated root clique C r received all messages from its neighbors. We then compute the values
$\mu(\mathbf{l}) = \max_{\mathbf{y}_{C_r},\, \{\mathbf{l}_k\}} \left[ f_{C_r}(\mathbf{y}_{C_r}) + \sum_{k \in ne(C_r)} \mu_{C_k \rightarrow C_r}^{\mathbf{l}_k}(\mathbf{y}_{C_k \cap C_r}) \right]$
maximizing over all configurations of $\mathbf{y}_{C_r}$ and $\{\mathbf{l}_k\} = \{\mathbf{l}_k \colon k \in ne(C_r)\}$ subject to the constraint $\sum_k \mathbf{l}_k = \mathbf{l} - \mathbf{g}_{C_r}(\mathbf{y}_{C_r})$, which we use to obtain the optimal value $p^*$ of Problem 1 according to
$p^* = \max_{\mathbf{l}} H(\mu(\mathbf{l}), \mathbf{l}).$
The corresponding optimal solution of Problem 1 can be obtained by backtracking the additional variables λ , saving optimal decisions in intermediate steps. The complete algorithm is summarized in Algorithm 1. As an alternative, we provide an additional flowchart diagram illustrating the algorithmic steps in Appendix B (see Figure A1 for further details). We underline this important result by the following theorem, for which we provide the proof in Appendix A. It should be noted that this theorem refers to the more general target objective defined in Problem 1 and should replace the corresponding statement in Theorem 2 in [69].
Theorem 1.
Algorithm 1 always finds an optimal solution to Problem 1. The computational complexity is of the order $O(M \cdot N^{\tau+1} \cdot R^{\nu-1})$, where R denotes an upper bound on the number of states of each auxiliary variable, and ν is defined as the maximal number of neighbors of a node in a corresponding clique tree.
Besides the treewidth τ , the value of the parameter ν also appears to be crucial for the resulting running time of Algorithm 1 since the corresponding complexity is also exponential in ν . The following proposition suggests that among all possible cluster graphs for a given MRF, there always exists a clique tree for which ν tends to take on small values (provided τ is small) and effectively does not depend on the size of the corresponding MRF. We provide the proof in Appendix C.
Proposition 2.
For any MRF with treewidth τ, there is a clique tree for which the maximal number of neighbors of each node is upper bounded according to $\nu \leq 2^{\tau+2} - 4$.
To support the above proposition, we consider the following extreme example, which is illustrated in Figure 4. We are given an MRF with a star-like shape (on the left) with M = 7 variables and treewidth τ = 1. One valid clique tree for this MRF is shown in the middle. In particular, the clique containing the variables $y_1, y_2$ has $\nu = M - 1$ neighbors. Therefore, running Algorithm 1 on that clique tree results in a computational time exponential in the graph size M. However, it is easy to modify that clique tree to have a small number of neighbors for each node (shown on the right), upper bounded by $\nu = \tau + 1 = 2$.
Although Proposition 2 assures the existence of a clique tree with a small ν , the actual upper bound on ν is still very pessimistic (exponential in the treewidth). In fact, by allowing a simple graph modification, we can always reduce the ν -parameter to a small constant ( ν = 3 ). Specifically, we can clone each cluster node with more than three neighbors multiple times so that each clone only carries one of the original neighbors. We then connect the clones by a chain that preserves the running intersection property. To ensure that the new cluster graph describes the same set of potentials we set the potentials for each copy of a cluster node C i to zero: f C i ( y C i ) = 0 and g C i ( y C i ) = 0 . The complete modification procedure is illustrated in Figure 5. We summarize this result in the following corollary.
Corollary 1.
Provided a given clique tree is modified according to the presented procedure for reducing the number of neighbors for each cluster node, the overall computational complexity of running Algorithm 1 (including time for graph modification) is of the order $O(M \cdot N^{\tau+1} \cdot R^2)$.
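A minimal sketch of the cloning step behind Corollary 1 is given below (hypothetical data structures, not the authors' implementation): the clique tree is represented as an undirected adjacency map, every over-connected cluster node is expanded into a chain of clones that each keep exactly one original neighbor, and every clone receives zero potentials so that the represented energy is unchanged. The toy usage mirrors a star-shaped clique tree similar to the one in Figure 4.

```python
# Sketch of the node-cloning transformation (hypothetical adjacency-map encoding of
# a clique tree). Every cluster node with more than three neighbors is replaced by a
# chain of clones, each clone keeping one original neighbor, so the maximal degree
# drops to nu = 3; clones carry zero potentials, leaving the energy unchanged.
from collections import defaultdict

def reduce_degree(tree, f_pot, g_pot):
    """tree: dict node -> set of neighbors; f_pot, g_pot: dict node -> potential."""
    tree = {v: set(nbrs) for v, nbrs in tree.items()}
    for v in list(tree):
        nbrs = sorted(tree[v])
        if len(nbrs) <= 3:
            continue
        prev = v
        tree[v] = {nbrs[0]}                     # v keeps only its first neighbor
        for i, u in enumerate(nbrs[1:], start=1):
            clone = f"{v}'{i}"                  # clone has the same variable scope as v
            f_pot[clone] = 0.0                  # zero potentials on each copy
            g_pot[clone] = 0.0
            tree[clone] = {prev, u}             # chain link plus one original neighbor
            tree[prev].add(clone)
            tree[u].discard(v)
            tree[u].add(clone)
            prev = clone
    return tree

# usage on a star-shaped clique tree (cliques named by their variable scopes)
star = {"y1y2": {"y1y3", "y1y4", "y1y5", "y1y6", "y1y7"},
        "y1y3": {"y1y2"}, "y1y4": {"y1y2"}, "y1y5": {"y1y2"},
        "y1y6": {"y1y2"}, "y1y7": {"y1y2"}}
f_pot, g_pot = defaultdict(float), defaultdict(float)
reduced = reduce_degree(star, f_pot, g_pot)
print(max(len(nbrs) for nbrs in reduced.values()))   # -> 3
```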
Note that in the case where the corresponding clique tree is a chain, the resulting complexity reduces to $O(M \cdot N^{\tau+1} \cdot R)$. At this point, we would like to provide an alternative perspective of the computational complexity in this case (with $\tau = 1$, $R \leq M^2$), which shows the connection to the conventional junction-tree algorithm. Specifically, the constrained message-passing algorithm (Algorithm 1) can be seen as conventional message passing on a clique tree (for the mapping F in Problem 1) without auxiliary variables, but with an increased size of the state space for each variable $y_i$, from N to $N \cdot M$. Then, Proposition 1 guarantees an exact inference in time of the order $O(M \cdot (N \cdot M)^{\tau+1})$. The summation constraints with respect to the auxiliary variables can be ensured by extending the corresponding potential functions $f_{C_i}$ to take on $-\infty$, forbidding inconsistent state transitions between individual variables. The same observation holds for message passing on factor graphs. To summarize, by introducing auxiliary variables, we can remove the global dependencies imposed by the mapping H in Problem 1, thereby reducing the overall treewidth. However, this comes at the cost of the label space becoming a function of the graph size (R is usually dependent on M).
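To make this equivalence concrete, the sketch below (the same toy chain as in Section 3.1, with hypothetical numbers) builds extended states of the form (label, running statistic), assigns $-\infty$ to inconsistent transitions, and runs an ordinary max-product (Viterbi) recursion over the $N \cdot R$ extended states; this is the "reuse of existing software" view, as opposed to the modified message-passing protocol.

```python
import numpy as np

# Extended-state-space view of the toy chain from Section 3.1: a state is a pair
# (label a, running statistic l); transitions violating l' = l + g(y') get -inf, so
# a standard Viterbi/max-product pass over N*R states reproduces the constrained DP.
M, N = 5, 3
y_star = (0, 1, 2, 1, 0)
R = M + 1                                            # the Hamming statistic ranges over 0..M
f_pair = lambda m, a, b: 0.3 * (a == b) + 0.1 * ((a + b + m) % N)
g_unary = lambda m, a: int(a != y_star[m])
eta = lambda l: 1.0 + l ** 2

idx = lambda a, l: a * R + l                         # flatten (label, statistic) pairs

delta = np.full(N * R, -np.inf)                      # scores of the extended start states
for a in range(N):
    delta[idx(a, g_unary(0, a))] = 0.0

for m in range(M - 1):
    trans = np.full((N * R, N * R), -np.inf)         # extended pairwise potentials
    for a in range(N):
        for l in range(R):
            for b in range(N):
                l_new = l + g_unary(m + 1, b)
                if l_new < R:                        # consistent transitions only
                    trans[idx(a, l), idx(b, l_new)] = f_pair(m, a, b)
    delta = (delta[:, None] + trans).max(axis=0)     # ordinary max-product step

best = max(delta[idx(a, l)] * eta(l)                 # final slack-scaling evaluation
           for a in range(N) for l in range(R) if np.isfinite(delta[idx(a, l)]))
print(best)                                          # equals the value from the Section 3.1 sketch
```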
We conclude our discussion by analyzing the relation between the maximal number of states (of the auxiliary variables) R and the number of variables M in the original MRF. In the worst case, R can grow exponentially with the graph size M. This happens, for example, when the values that the individual factors g C k can take on are scattered across a very large range that grows much faster relative to the graph size. For practical cases, however, we can assume that the individual factor functions g C k take values in an integer interval, which is either fixed or grows polynomially with the graph size. In that case, R is always a polynomial in M, rendering the overall complexity of Algorithm 1 a polynomial in the graph size, as we demonstrate with several examples in Section 6. We summarize this in the following theorem. The corresponding proof is given in Appendix D.
Theorem 2.
Consider an instance of Problem 1 given by a clique tree with M variables. Let $T \in \mathbb{N}$ be a number that grows polynomially with M. Provided each factor $\mathbf{g}_{C_k}$ in a decomposition of $\mathbf{G}$ assumes values from a discrete set of integers $[-T, T] \cap \mathbb{Z}$, the number R grows polynomially with M according to $R = O(T \cdot M)$.

4. Application Use Cases

In this section, we demonstrate the expressivity of Problem 1 by showcasing the diversity of existing (and potentially new) applications that align with our target objective. In particular, practitioners can gain insight into specific examples to verify whether a given task is an instance of Problem 1. The research conducted within the scope of this paper has been motivated by the following use cases:
  • Learning with High-Order Loss Functions
    SSVM enables building complex and accurate models for structured prediction by directly integrating the desired performance measure into the training objective. However, its applicability relies on the availability of efficient inference algorithms. In the state-of-the-art training algorithms, such as cutting planes [64,65], bundle methods [67,68], subgradient methods [18], and Frank–Wolfe optimization [66], inference is repeatedly performed either to compute a subgradient or find the most violating configuration. In the literature, the corresponding computational task is generally referred to as the loss-augmented inference, which is the main computational bottleneck during training.
  • Enabling Training of Slack Scaling Formulation for SSVMs
    The maximum-margin framework of SSVMs includes two loss-sensitive formulations known as margin scaling and slack scaling. Since the original paper on SSVMs [39], there has been much speculation about the differences in training using either of these two formulations. In particular, training via slack scaling has been conjectured to be more accurate and beneficial than margin scaling. Nevertheless, it has rarely been used in practice due to the lack of known efficient inference algorithms.
  • Evaluating Generalization Bounds for Structured Prediction
    The purpose of generalization bounds is to provide useful theoretical insights into the behavior and stability of a learning algorithm by upper bounding the expected loss or the risk of a prediction function. Evaluating such a bound could provide certain guarantees on how a system trained on some finite data will perform in the future on unseen examples. Unlike in standard regression or classification tasks with univariate real-valued outputs, in structured prediction, evaluating generalization bounds requires solving a combinatorial optimization problem, thereby limiting its use in practice [48].
  • Globally Constrained MAP Inference
    In many cases, evaluating a prediction function with structured outputs technically corresponds to performing MAP inference on a discrete graphical model, including Markov random fields (MRFs) [1], probabilistic context-free grammars (PCFGs) [72,73,74], hidden Markov models (HMMs) [75], conditional random fields (CRFs) [3], probabilistic relational models (PRMs) [76,77], and Markov logic networks (MLNs) [78]. In practice, we might want to modify the prediction function by imposing additional (global) constraints on its output. For example, we could perform a corresponding MAP inference subject to the constraints on the label counts specifying the size of the output or the distribution of the resulting labels, which is a common approach in applications such as sequence tagging and image segmentation. Alternatively, we might want to generate the best output with a score from a specific range that can provide deeper insights into the energy function of a corresponding model. Finally, we might want to restrict the set of possible outputs directly by excluding specific label configurations. The latter is closely related to the computational task known as (diverse) k-best MAP inference [62,63].
In the following, we provide technical details about how the generic tasks listed above can be addressed using our framework.

4.1. Loss-Augmented Inference with High-Order Loss Functions

As already mentioned, Problem 1 covers as a special case the task of loss-augmented inference (for margin and slack scaling) within the framework of SSVM [39,46]. In order to match the generic representation given in (2), we can define $F(\mathbf{y}) = \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}) + \mathrm{const}$, and $\eta(\mathbf{G}(\mathbf{y})) = \Delta(\mathbf{y}^*, \mathbf{y})$ for suitable $\mathbf{G} \colon \mathcal{Y} \rightarrow \mathbb{R}^P$ and $\eta \colon \mathbb{R}^P \rightarrow \mathbb{R}_+$. Here, $\Psi \colon \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$, $d \in \mathbb{N}$, denotes a joint feature map on an input–output pair $(\mathbf{x}, \mathbf{y})$, $\mathbf{w} \in \mathbb{R}^d$ is a trainable weight vector, and $\Delta \colon \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ is a dissimilarity measure between a prediction $\mathbf{y}$ and a true output $\mathbf{y}^*$. Given this notation, our target objective can be written as follows:
$H(F(\mathbf{y}), \mathbf{G}(\mathbf{y})) = F(\mathbf{y}) \circ \eta(\mathbf{G}(\mathbf{y})), \qquad \circ \in \{+, \cdot\}.$
We note that a considerable number of non-decomposable (or high-order) loss functions in structured prediction can be represented as multivariate cardinality-based potentials $\mathbf{y} \mapsto \eta(\mathbf{G}(\mathbf{y}))$, where the mapping $\mathbf{G}$ encodes the label statistics, e.g., the number of true or false positives with respect to the ground truth. Furthermore, the maximal number of states R for the corresponding auxiliary variables related to $\mathbf{G}$ is polynomially bounded in the number of variables M (see Table 1 in Section 5 for an overview of existing loss functions and the resulting values for R). For the specific case of a chain graph with $F_\beta$-loss, for example, the resulting complexity $O(M^3 \cdot N^2)$ of Algorithm 1 is cubic in the graph size.
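As an illustration, the following sketch (a toy binary tagging chain with hypothetical scores, not the authors' code) runs the constrained DP with the bivariate statistic (TP, FP), so that $R = O(M^2)$, and evaluates the slack-scaling objective $F(\mathbf{y}) \cdot \Delta_{F_\beta}(\mathbf{y}^*, \mathbf{y})$ in the final step.

```python
# Toy sketch of slack-scaling loss-augmented inference with the F_beta loss on a
# binary tagging chain (hypothetical scores). The auxiliary statistic is the pair
# (TP, FP), giving R = O(M^2) and an overall cost matching O(M^3 * N^2).
M, beta = 6, 1.0
y_star = (1, 0, 1, 1, 0, 1)                      # hypothetical ground truth, labels in {0, 1}
P = sum(y_star)                                  # |y*| = TP + FN is known in advance
unary = lambda m, a: (0.4 - 0.2 * (m % 2)) * a   # toy compatibility scores (non-negative)
pair = lambda a, b: 0.1 * (a == b)

def loss_fbeta(tp, fp):                          # F_beta loss from the two statistics
    return 1.0 - (1 + beta ** 2) * tp / (beta ** 2 * P + tp + fp)

# dp maps (label of y_m, TP so far, FP so far) -> best partial score of F
dp = {(a, int(a and y_star[0]), int(a and not y_star[0])): unary(0, a) for a in (0, 1)}
for m in range(1, M):
    new_dp = {}
    for (a, tp, fp), score in dp.items():
        for b in (0, 1):
            key = (b, tp + int(b and y_star[m]), fp + int(b and not y_star[m]))
            val = score + unary(m, b) + pair(a, b)
            if val > new_dp.get(key, float("-inf")):
                new_dp[key] = val
    dp = new_dp

mu = {}                                          # mu[(TP, FP)] = best F with these counts
for (a, tp, fp), score in dp.items():
    mu[(tp, fp)] = max(mu.get((tp, fp), float("-inf")), score)
print(max(f * loss_fbeta(tp, fp) for (tp, fp), f in mu.items()))
```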

4.2. Evaluating Generalization Bounds in Structured Prediction

In the following, we demonstrate how our algorithmic idea can be used to evaluate the PAC-Bayesian generalization bounds for max-margin structured prediction. As a working example, we consider the following generalization theorem, as stated in [48]:
Theorem 3.
Assume that $0 \leq \Delta(\mathbf{y}^*, \mathbf{y}) \leq 1$. With a probability of at least $1 - \delta$ over the draw of the training set $S = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$ of size $n \in \mathbb{N}$, the following holds simultaneously for all weight vectors $\mathbf{w}$:
$\mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \rho}\left[\Delta(\mathbf{y}, h_{\mathbf{w}}(\mathbf{x}))\right] \leq \frac{\|\mathbf{w}\|^2}{n} + \sqrt{\frac{\|\mathbf{w}\|^2 \ln\left(\frac{2 d n}{\|\mathbf{w}\|^2}\right) + \ln\left(\frac{n}{\delta}\right)}{2(n-1)}} + \frac{1}{n} \sum_{i=1}^{n} \max_{\hat{\mathbf{y}}}\left[1 - \frac{\mathbf{w}^\top\left(\Psi(\mathbf{x}_i, \mathbf{y}_i) - \Psi(\mathbf{x}_i, \hat{\mathbf{y}})\right)}{\Delta_{\mathrm{HD}}(\mathbf{y}_i, \hat{\mathbf{y}})}\right] \cdot \Delta(\mathbf{y}_i, \hat{\mathbf{y}})$
where $h_{\mathbf{w}}(\mathbf{x}) = \operatorname{argmax}_{\hat{\mathbf{y}}} \mathbf{w}^\top \Psi(\mathbf{x}, \hat{\mathbf{y}})$ denotes the corresponding prediction function.
Evaluating the second term on the right-hand side of the inequality in (12) involves a maximization over y Y for each data point ( x , y * ) according to
$\max_{\mathbf{y} \in \mathcal{Y}}\left[1 - \frac{\mathbf{w}^\top\left(\Psi(\mathbf{x}, \mathbf{y}^*) - \Psi(\mathbf{x}, \mathbf{y})\right)}{\Delta_{\mathrm{HD}}(\mathbf{y}^*, \mathbf{y})}\right] \cdot \Delta(\mathbf{y}^*, \mathbf{y}).$
We now show that this maximization term is an instance of Problem 1. More precisely, we consider an example with $\tau > 1$ and an $F_1$-loss (see Table 1). Next, we define $F(\mathbf{y}) = \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y})$, $\eta(\mathbf{G}(\mathbf{y})) = \Delta_{F_1}(\mathbf{y}^*, \mathbf{y})$, and set
$H(F(\mathbf{y}), \mathbf{G}(\mathbf{y})) = \left[1 - \frac{\mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}^*) - F(\mathbf{y})}{|\mathbf{y}^*| - G_1(\mathbf{y}) + G_2(\mathbf{y})}\right] \cdot \eta(\mathbf{G}(\mathbf{y}))$
where we use $\Delta_{\mathrm{HD}}(\mathbf{y}^*, \mathbf{y}) = FP + FN$ and $FN = |\mathbf{y}^*| - TP$, which removes the need for additional auxiliary variables for the Hamming distance, reducing the resulting computational cost. Here, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively, and $|\mathbf{y}^*|$ denotes the size of the output $\mathbf{y}^*$. Both $|\mathbf{y}^*|$ and $\mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}^*)$ are constant with respect to the maximization over $\mathbf{y}$. Note also that H is non-decreasing in $F(\mathbf{y})$. Furthermore, the number of states of the auxiliary variables is upper bounded by $R = M^2$ (see Table 1). Therefore, here, the computational complexity of Algorithm 1 (according to Corollary 1) is given by $O(M^5 \cdot N^{\tau+1})$.
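Only the final evaluation step (line 8 of Algorithm 1) differs from the loss-augmented case. The sketch below (hypothetical message table and toy constants) shows that step for the bound term above, with $G_1 = TP$ and $G_2 = FP$; the exact match $\mathbf{y} = \mathbf{y}^*$ is treated as contributing 0, since its loss factor vanishes.

```python
# Sketch of the final evaluation (line 8 of Algorithm 1) for the bound term above,
# with G_1 = TP and G_2 = FP. The message table mu is assumed to map each reachable
# (TP, FP) value to the best score F(y) = w^T Psi(x, y); score_star = w^T Psi(x, y*)
# and size_star = |y*| are constants known in advance (toy numbers below).
def bound_term(mu, score_star, size_star, beta=1.0):
    best = 0.0                                   # convention: the y = y* configuration contributes 0
    for (tp, fp), f in mu.items():
        hd = size_star - tp + fp                 # Hamming distance = FN + FP
        if hd == 0:                              # exact match: the loss factor is 0 anyway
            continue
        f_loss = 1.0 - (1 + beta ** 2) * tp / (beta ** 2 * size_star + tp + fp)
        best = max(best, (1.0 - (score_star - f) / hd) * f_loss)
    return best

# hypothetical message table produced by the constrained message passing
mu = {(3, 0): 2.1, (2, 1): 2.4, (1, 2): 2.6, (0, 3): 1.0}
print(bound_term(mu, score_star=2.8, size_star=3))
```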
As a final remark, we note that training an SSVM corresponds to solving a convex problem but is not consistent. It fails to converge to the optimal predictor even in the limit of infinite training data (see [60] for more details). However, minimizing the (non-convex) generalization bound is consistent. Algorithm 1 provides an effective evaluation tool that could potentially be used for the development of new training algorithms based on the direct minimization of such bounds. We leave the corresponding investigation for future work.

4.3. Globally-Constrained MAP Inference

Another common use case is performing MAP inference on graphical models (such as MRFs) subject to additional constraints on the variables or the range of the corresponding objective, including various tasks such as image segmentation in computer vision, sequence tagging in computational biology or natural language processing, and signal denoising in information theory. We note that an important tractability assumption in the definition of Problem 1 is the τ-decomposability of F and $\mathbf{G}$ with a reasonably small treewidth τ. In areas such as computer vision, we usually encounter models (e.g., the Ising grid model) where the treewidth is a function of the graph size given by $\tau = \sqrt{M}$. In this case, we can use Algorithm 1 by leveraging the technique of dual decomposition [5,16,79]. More precisely, we decompose the original graph into (multiple) trees, where the global factor is attached to exactly one of the trees in our decomposition. We also note that from a technical perspective, the problem of MAP inference subject to some global constraints on the statistics $\mathbf{G}(\mathbf{y})$ is equivalent to the MAP problem augmented with a global cardinality-based potential $\eta(\mathbf{G}(\mathbf{y}))$. Specifically, we can define η as an indicator function $\mathbb{1}_{-\infty}[\cdot]$, which returns $-\infty$ if the corresponding constraint on $\mathbf{G}(\mathbf{y})$ is violated. Furthermore, the form of η does not affect the message passing of the presented algorithm. We can always check the validity of a corresponding constraint after all the necessary statistics have been computed.

4.3.1. Constraints on Label Counts

As a simple example, consider the binary-sequence tagging experiment, that is, every output y Y is a sequence, and each site in the sequence can be either 0 or 1. Given some prior information on the number b of positive labels, we can improve the quality of the results by imposing a corresponding constraint on the outputs:
$\underset{\mathbf{y} \in \mathcal{Y}}{\text{maximize}} \;\; \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}) \qquad \text{subject to} \;\; \sum_{i=1}^{M} y_i = b$
We can write this as an instance of Problem 1 by setting
$H(F(\mathbf{y}), G(\mathbf{y})) = F(\mathbf{y}) + \mathbb{1}_{-\infty}[G(\mathbf{y}) \neq b],$
where $F(\mathbf{y}) = \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y})$, and $G(\mathbf{y}) = \sum_{i=1}^{M} y_i$. Since all the variables in $\mathbf{y}$ are binary, the number of states R of the corresponding auxiliary variables $l_m = \sum_{i=1}^{m} y_i$ is upper bounded by M. In addition, because the output graph is a sequence, we have $\tau = 1$, $\nu = 2$. Therefore, here, the computational complexity of Algorithm 1 is of the order $O(M^2 \cdot N^2)$.
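A minimal sketch of this constrained inference is given below (toy unary and pairwise scores with hypothetical values): the auxiliary variable simply counts the positive labels seen so far, and only configurations that hit the prescribed count b survive the final evaluation.

```python
# Toy sketch of MAP inference under a label-count constraint: the auxiliary variable
# counts the 1-labels seen so far, and the final step keeps only count == b, which is
# exactly the indicator form of H above. Scores are hypothetical placeholders.
M, b = 6, 2
unary = lambda m, a: (0.5 if m % 3 == 0 else -0.1) * a   # toy unary scores
pair = lambda a, c: 0.2 * (a == c)                       # toy pairwise scores

# dp maps (label of y_m, number of 1-labels so far) -> best partial score of F
dp = {(a, a): unary(0, a) for a in (0, 1)}
for m in range(1, M):
    new_dp = {}
    for (a, cnt), score in dp.items():
        for lab in (0, 1):
            key = (lab, cnt + lab)
            val = score + unary(m, lab) + pair(a, lab)
            if val > new_dp.get(key, float("-inf")):
                new_dp[key] = val
    dp = new_dp

# final evaluation: H(F, G) = F + indicator(G = b), i.e., keep only count b
print(max(score for (a, cnt), score in dp.items() if cnt == b))
```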

4.3.2. Constraints on Objective Value

We continue with the binary-sequence tagging example (with pairwise interactions). To force constraints on the score to be in a specific range, as in
$\underset{\mathbf{y} \in \mathcal{Y}}{\text{maximize}} \;\; \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}) \qquad \text{subject to} \;\; a \leq \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}) \leq b$
we first rewrite the prediction function in terms of its sufficient statistics according to
$\mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}) = \sum_{o, s} w_{o,s} \underbrace{\sum_{t=1}^{M} \mathbb{1}_1[x_t = o \wedge y_t = s]}_{=:\, G_{o,s}(\mathbf{y})} + \sum_{s_1, s_2} w_{s_1, s_2} \underbrace{\sum_{t=2}^{M} \mathbb{1}_1[y_{t-1} = s_1 \wedge y_t = s_2]}_{=:\, G_{s_1, s_2}(\mathbf{y})},$
with $\mathbf{w} = (\ldots, w_{o,s}, \ldots, w_{s_1,s_2}, \ldots)$ and $\mathbf{G} = (\ldots, G_{o,s}, \ldots, G_{s_1,s_2}, \ldots)$, and define
$H(F(\mathbf{y}), \mathbf{G}(\mathbf{y})) = F(\mathbf{y}) + \mathbb{1}_{-\infty}[\mathbf{w}^\top \mathbf{G}(\mathbf{y}) \notin [a, b]].$
Note that $\mathbf{G}$ contains all the sufficient statistics of F such that $F(\mathbf{y}) = \mathbf{w}^\top \mathbf{G}(\mathbf{y})$. Here, we could replace $a \leq \mathbf{w}^\top \Psi(\mathbf{x}, \mathbf{y}) \leq b$ with any (non-linear) constraint on the sufficient statistics of the joint feature map $\Psi(\mathbf{x}, \mathbf{y})$.
The corresponding computational complexity can be derived by considering an urn problem: one with D · N and one with N 2 distinguishable urns and M indistinguishable balls. Here, D denotes the size of the dictionary for the observations x t in the input sequence x . Note that the dictionary of the input symbols can be large compared to other problem parameters. However, we can reduce D to the size of the vocabulary only occurring in the current input x . The first urn problem corresponds to the unary observation-state statistics G o , s ( y ) , and the second corresponds to the pairwise statistics for the state transition G s 1 , s 2 ( y ) . The resulting number of possible distributions of balls over the urns is given by
$\binom{M + D \cdot N - 1}{M} \cdot \binom{M + N^2 - 1}{M} \;\leq\; M^{D \cdot N} \cdot M^{N^2} \;=\; M^{D \cdot N + N^2} = R.$
Although the resulting complexity (due to $\nu = 2$) of $O(M^{D \cdot N + N^2 + 1} \cdot N^2)$ is still polynomial in the number of variables M, the degree is quite high, making it suitable only for short sequences. For practical use, we recommend the efficient approximation framework of Lagrangian relaxation and dual decomposition [5,16,79].

4.3.3. Constraints on Search Space

The constraints on the search space can be different from the constraints we can impose on the label counts. For example, we might want to exclude a set of K complete outputs $\{\mathbf{y}^1, \ldots, \mathbf{y}^K\}$ from the feasible set $\mathcal{Y}$ by using an exclusion potential $\mathbb{1}_{-\infty}[\mathbf{y} \in \{\mathbf{y}^1, \ldots, \mathbf{y}^K\}]$.
For simplicity, we again consider a sequence-tagging example with pairwise dependencies. Given a set of K patterns to exclude, we can introduce auxiliary variables $\mathbf{l}_m \in \{0, 1\}^K$, where for each pattern $\mathbf{y}^k$, we have a constraint $(\mathbf{l}_m)_k = \max\{\mathbb{1}_1[y^k_m \neq y_m], (\mathbf{l}_{m-1})_k\}$. More precisely, we modify the message computation in (7) with respect to the auxiliary variables by replacing the corresponding constraints $(\mathbf{l}_m)_k = (\mathbf{l}_{m-1})_k + \mathbb{1}_1[y^k_m \neq y_m]$ in the maximization over $\{\mathbf{l}_m\}$ with the constraints $(\mathbf{l}_m)_k = \max\{\mathbb{1}_1[y^k_m \neq y_m], (\mathbf{l}_{m-1})_k\}$. Therefore, the maximal number of states for $\mathbf{l}_m$ is given by $R = 2^K$. The resulting complexity for finding an optimal solution over $\mathbf{y} \in \mathcal{Y} \setminus \{\mathbf{y}^1, \ldots, \mathbf{y}^K\}$ is of the order $O(2^K \cdot M \cdot N^2)$.
A related problem is finding a diverse k-best solution. Here, the goal is to produce the best solutions that are sufficiently different from each other according to a diversity function, e.g., a loss function such as the Hamming distance $\Delta_{\mathrm{HD}}$. More precisely, after computing the MAP solution $\mathbf{y}^1$, we compute the second-best (diverse) output $\mathbf{y}^2$ with $\Delta_{\mathrm{HD}}(\mathbf{y}^1, \mathbf{y}^2) \geq m_1$. For the third-best solution, we then require $\Delta_{\mathrm{HD}}(\mathbf{y}^1, \mathbf{y}^3) \geq m_2$ and $\Delta_{\mathrm{HD}}(\mathbf{y}^2, \mathbf{y}^3) \geq m_2$, and so on. In other words, we search for an optimal output $\mathbf{y}^K$ such that $\Delta_{\mathrm{HD}}(\mathbf{y}^k, \mathbf{y}^K) \geq m_{K-1}$, $m_k \in \mathbb{N}$, for all $k \in \{1, \ldots, K-1\}$.
For this purpose, we define auxiliary variables $\mathbf{l}_m \in \{0, \ldots, M\}^{K-1}$, where for each pattern $\mathbf{y}^k$, we have a constraint $(\mathbf{l}_m)_k = (\mathbf{l}_{m-1})_k + \mathbb{1}_1[y^k_m \neq y_m]$, which computes the Hamming distance of a solution $\mathbf{y}$ with respect to the pattern $\mathbf{y}^k$. Therefore, we can define
$H(F(\mathbf{y}), \mathbf{G}(\mathbf{y})) = F(\mathbf{y}) + \mathbb{1}_{-\infty}[\exists\, k \in \{1, \ldots, K-1\} \colon G_k(\mathbf{y}) < m_k]$
where $\mathbf{G} = (G_1, \ldots, G_{K-1})$, and at the final stage (due to $\mathbf{G}(\mathbf{y}) = \mathbf{l}_M$), we have all the necessary information to evaluate the constraints with respect to the diversity function (here, the Hamming distance). The maximal number of states R for the auxiliary variables is upper bounded by $M^{K-1}$. Therefore, the resulting running time for finding the K-th diverse output sequence is of the order $O(M^K \cdot N^{\tau+1})$.
Finally, we note that the concept of diverse K-best solutions can also be used during the training of SSVMs to speed up the convergence of a corresponding algorithm by generating diverse cutting planes or subgradients, as described in [63]. An appealing property of Algorithm 1 is that we obtain some of the necessary information for free as a side effect of the message passing.

5. Compact Representation of Loss Functions

We now further advance the task of loss-augmented inference (see Section 4.1) by presenting a list of popular dissimilarity measures that our algorithm can handle, which are summarized in Table 1. The measures are given in a compact representation Δ ( y * , y ) = η ( G ( y ) ) based on the corresponding sufficient statistics encoded as a mapping G . Columns 2 and 3 show the form of G ( · ) and η , respectively. Column 4 provides an upper bound R on the number of possible values of auxiliary variables that affect the resulting running time of Algorithm 1 (see Corollary 1).
Table 1. Compact representation of popular dissimilarity measures based on the corresponding sufficient statistics $\mathbf{G}(\cdot)$. The upper (and lower!) bounds on the number of states of the auxiliary variables R for the presented loss functions are shown in the last column.
| Loss | $\mathbf{G}(\mathbf{y})$ | $\eta(\mathbf{G}(\cdot))$ | R |
| --- | --- | --- | --- |
| $\Delta_{0/1}$ | $(TP,\ \mathbb{1}_1[FP > 0])$ | $\mathbb{1}_1[\max\{M - G_1, G_2\} > 0]$ | $M^2$ |
| $\Delta_{\mathrm{HD}}$ | $\sum_{t=1}^{M} \mathbb{1}_1[y_t^* \neq y_t]$ | $G$ | $M$ |
| $\Delta_{\mathrm{HL}}$ | $\sum_{t=1}^{M} \mathbb{1}_1[y_t^* \neq y_t]$ | $G / M$ | $M$ |
| $\Delta_{\mathrm{WHD}}$ | $\{\#(s_1, s_2)\}_{s_1, s_2 \in \{1, \ldots, N\}}$ | $\sum_{s_1, s_2} \mathrm{weight}(s_1, s_2) \cdot G_{s_1, s_2}$ | $M^{N^2}$ |
| $\Delta_{\#\mathrm{FP}}$ | $FP$ | $G$ | $M$ |
| $\Delta_{\mathrm{R}}$ | $TP$ | $1 - G / \lvert\mathbf{y}^*\rvert$ | $M$ |
| $\Delta_{\mathrm{P}}$ | $(TP, FP)$ | $1 - \frac{G_1}{G_1 + G_2}$ | $M^2$ |
| $\Delta_{F_\beta}$ | $(TP, FP)$ | $1 - \frac{(1 + \beta^2) \cdot G_1}{\beta^2 \cdot \lvert\mathbf{y}^*\rvert + G_1 + G_2}$ | $M^2$ |
| $\Delta_{\cap/\cup}$ | $(TP, FP)$ | $1 - \frac{G_1}{\lvert\mathbf{y}^*\rvert + G_2}$ | $M^2$ |
| $\Delta_{\mathrm{LC}}$ | $\sum_{t=1}^{M} y_t$ | $\lvert G - \sum_{t=1}^{M} y_t^* \rvert$ | $M$ |
| $\Delta_{\#\mathrm{CB}}$ | $\#\mathrm{CB}$ | $G$ | $M$ |
| $\Delta_{\mathrm{CBR}}$ | $(\#\mathrm{CB}, \lvert\mathbf{y}\rvert)$ | $G_1 / G_2$ | $M^2$ |
| $\Delta_{\mathrm{BLEU}}$ | $(TP_1, FP_1, \ldots, TP_K, FP_K)$ | $1 - \mathrm{BP}(\cdot) \cdot \exp\left(\frac{1}{K} \sum_{k=1}^{K} \log p_k\right)$ | $M^{2K}$ |
| $\Delta_{\mathrm{ROUGE\text{-}K}}$ | $\{\mathrm{count}(k, X)\}_{k \in \mathrm{grams}(Ref)}$ | $1 - \frac{\sum_{S \in Ref} \sum_{k \in \mathrm{grams}(S)} \min\{\mathrm{count}(k, S),\, G_k\}}{\sum_{S \in Ref} \sum_{k \in \mathrm{grams}(S)} \mathrm{count}(k, S)}$ | $M^D$ |
| $\Delta_{\mathrm{ROUGE\text{-}LCS}}$ | $\{\mathrm{LCS}(X, S)\}_{S \in Ref}$ | $1 - \frac{1}{\lvert Ref\rvert} \sum_{S \in Ref} \frac{(1 + \beta^2)\, P(G_S) \cdot R(G_S)}{\beta^2 P(G_S) + R(G_S)}$ | $M^{2 \lvert Ref\rvert}$ |
Here, | y | = M denotes the number of nodes in the output y . TP , FP , and FN are the numbers of true positives, false positives, and false negatives, respectively. The number of true positives for a prediction y and a true output y * is defined as the number of common nodes with the same label. The number of false positives is determined by the number of nodes that are present in the output y but missing (or with another label) in the true output y * . Similarly, the number of false negatives corresponds to the number of nodes present in y * but missing (or with another label) in y . In particular, it holds that | y * |   = TP + FN .
We can see in Table 1 that each element of G ( · ) is a sum of binary variables, which significantly reduces the image size of mapping G ( · ) . This occurs despite the exponential variety of the output space Y . As a result, the image size grows only polynomially with the size of the outputs y Y , and the number R provides an upper bound on the image size of G ( · ) .
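The snippet below (toy labelings with hypothetical values) illustrates this compact representation for a few of the measures in Table 1: the statistics are plain counts of binary indicators, and each loss is a simple function of these counts together with quantities known in advance, such as $\lvert\mathbf{y}^*\rvert$.

```python
# Toy illustration of the compact representation Delta(y*, y) = eta(G(y)) for some
# of the losses in Table 1. In this node-labeling view every node of y* is present
# in y, so FP equals the number of mismatched labels and |y*| = TP + FN = M.
y_star = (1, 0, 2, 2, 0, 1)
y_pred = (1, 1, 2, 0, 0, 1)
M = len(y_star)

tp = sum(p == t for p, t in zip(y_pred, y_star))     # nodes with matching labels
fp = M - tp                                          # mismatched labels
size_star = M                                        # |y*| = TP + FN

hamming = sum(p != t for p, t in zip(y_pred, y_star))                         # Delta_HD
hamming_loss = hamming / M                                                    # Delta_HL
zero_one = int(hamming > 0)                                                   # Delta_0/1
recall_loss = 1 - tp / size_star                                              # Delta_R
precision_loss = 1 - tp / (tp + fp)                                           # Delta_P
beta = 1.0
f_beta_loss = 1 - (1 + beta ** 2) * tp / (beta ** 2 * size_star + tp + fp)    # Delta_F_beta
iou_loss = 1 - tp / (size_star + fp)                                          # intersection over union
print(hamming, f_beta_loss, iou_loss)
```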
Zero-One Loss ( Δ 0 / 1 )
This loss function takes on binary values { 0 , 1 } and is the most uninformative since it requires a prediction to match the ground truth to 100% and provides no partial quantification of the prediction quality in the opposite case. Technically, this measure is not decomposable since it requires the numbers of FP and FN to be evaluated via
$\Delta_{0/1}(\mathbf{y}^*, \mathbf{y}) = \mathbb{1}_1[\max\{FP, FN\} > 0].$
Sometimes, we cannot compute the FN (unlike FP) from the individual nodes of a prediction. Instead, we can count the TP and compute the FN using the relationship $|\mathbf{y}^*| = TP + FN$. For example, if the outputs $\mathbf{y} \in \mathcal{Y}$ are sets with no ordering indication of the individual set elements, we need to know the whole set $\mathbf{y}$ in order to be able to compute the FN. Therefore, computing FN from a partially constructed output is not possible. We note, however, that in the case of the zero-one loss function, there is a faster inference approach, which involves modifying the prediction algorithm to also compute the second-best output and selecting the best result based on the value of the objective function.
Hamming Distance/Hamming Loss ( Δ HD , Δ HL )
In the context of sequence learning, given a true output y * and a prediction y of the same length, the Hamming distance measures the number of states on which the two sequences disagree:
$\Delta_{\mathrm{HD}}(\mathbf{y}^*, \mathbf{y}) = \sum_{t=1}^{M} \mathbb{1}_1[y_t^* \neq y_t].$
By normalizing this value, we obtain the Hamming loss, which does not depend on the length of the sequences. Both measures are decomposable.
Weighted Hamming Distance ( Δ WHD )
For a given matrix $\mathrm{weight} \in \mathbb{R}^{N \times N}$, the weighted Hamming distance is defined as $\Delta_{\mathrm{WHD}}(\mathbf{y}^*, \mathbf{y}) = \sum_{t=1}^{M} \mathrm{weight}(y_t^*, y_t)$. However, keeping track of the accumulated sum of the weights until the current position t in a sequence, unlike for the Hamming distance, can be intractable. We can, however, use the following observation. It is sufficient to count the occurrences $(y_t^*, y_t)$ for each pair of states $y_t^*, y_t \in \{1, \ldots, N\}$ according to
$\sum_{t=1}^{M} \mathrm{weight}(y_t^*, y_t) = \sum_{s_1, s_2} \mathrm{weight}(s_1, s_2) \sum_{t=1}^{M} \mathbb{1}_1[y_t^* = s_1 \wedge y_t = s_2].$
In other words, each dimension of $\mathbf{G}$ (denoted as $G_{s_1, s_2}$) corresponds to
$G_{s_1, s_2}(\mathbf{y}; \mathbf{y}^*) = \sum_{t=1}^{M} \mathbb{1}_1[y_t^* = s_1 \wedge y_t = s_2].$
Here, we can upper bound the image size of $\mathbf{G}(\cdot)$ by considering an urn problem with $N^2$ distinguishable urns and M indistinguishable balls. The number of possible distributions of the balls over the urns is given by $\binom{M + N^2 - 1}{M} \leq M^{N^2}$.
False Positives/Precision/Recall ( Δ # FP , Δ P , Δ R )
False positives measure the discrepancy between outputs by counting the number of false positives in a prediction y with respect to the true output y * . This metric is often used in learning tasks such as natural language parsing due to its simplicity. Precision and recall are popular measures used in information retrieval. By subtracting the corresponding values from one, we can easily convert them to a loss function. Unlike precision given by TP / ( TP + FP ) , recall effectively depends on only one parameter. Although it is originally parameterized by two parameters given as TP / ( TP + FN ) , we can exploit the fact that the value | y * |   = TP + FN is always known in advance during the inference, rendering recall a decomposable measure.
F β -Loss ( Δ F β )
The F β = 1 -score is often used to evaluate performance in various natural language processing applications and is also suitable for many structured prediction tasks. It is originally defined as the harmonic mean of precision and recall
$F_1 = \frac{2\, TP}{2\, TP + FP + FN}.$
However, since the value $|\mathbf{y}^*| = TP + FN$ is always known in advance during the inference, the $F_\beta$-score effectively depends on only two parameters $(TP, FP)$. The corresponding loss function is defined as $\Delta_{F_\beta} = 1 - F_\beta$.
Intersection Over Union ( Δ / )
The Intersection-Over-Union loss is mostly used in image processing tasks such as image segmentation and object recognition and was used as a performance measure in the Pascal Visual Object Classes Challenge [80]. It is defined as $1 - \mathrm{area}(\mathbf{y}^* \cap \mathbf{y}) / \mathrm{area}(\mathbf{y}^* \cup \mathbf{y})$. We can easily interpret this value in cases where the outputs $\mathbf{y}^*, \mathbf{y}$ describe the bounding boxes of pixels. The more the two boxes overlap, the smaller the loss value. We note that in the case of binary image segmentation, for example, we have a different interpretation of true and false positives. In particular, it holds that $TP + FN = P$, where P is the number of positive entries in $\mathbf{y}^*$. In terms of the contingency table, this yields
$\Delta_{\cap/\cup} = 1 - \frac{TP}{TP + FP + FN}.$
Since $|\mathbf{y}^*| = TP + FN$, the value $\Delta_{\cap/\cup}$ effectively depends on only two parameters (instead of three). Moreover, unlike the $F_\beta$-loss, $\Delta_{\cap/\cup}$ defines a proper distance metric on sets.
Label-Count Loss ( Δ LC )
The Label-Count loss is a performance measure used for the task of binary image segmentation in computer vision and is given by
$\Delta_{\mathrm{LC}}(\mathbf{y}^*, \mathbf{y}) = \frac{1}{M} \left\lvert \sum_{i=1}^{M} y_i - \sum_{i=1}^{M} y_i^* \right\rvert.$
This loss function prevents assigning low energy to segmentation labelings with substantially different areas compared to the ground truth.
Number/Rate of Crossing Brackets ( Δ # C B , Δ CBR )
The number of Crossing Brackets ( # C B ) is a measure used to evaluate performance in natural language parsing. It computes the average of how many constituents in one tree y cross over constituent boundaries in the other tree y * . The normalized version (by | y | ) of this measure is called the Crossing Brackets (Recall) Rate. Since the value | y | is not known in advance, the evaluation requires a further parameter for the size of y .
Bilingual Evaluation Understudy ( Δ BLEU )
Bilingual Evaluation Understudy, or BLEU for short [81], is a measure used to evaluate the quality of machine translations. It computes the geometric mean of the precision p k = T P k / ( T P k + F P k ) of k-grams of various lengths (for k = 1 , , K ) between a hypothesis and a set of reference translations, multiplied by a factor B P ( · ) to penalize short sentences according to
$\Delta_{\mathrm{BLEU}}(\mathbf{y}^*, \mathbf{y}) = 1 - \mathrm{BP}(\mathbf{y}) \cdot \exp\left(\frac{1}{K} \sum_{k=1}^{K} \log p_k\right).$
Note that K is a constant, rendering the term $M^{2K}$ a polynomial in M.
Recall-Oriented Understudy for Gisting Evaluation ( Δ ROUGE K , Δ ROUGE LCS )
The Recall-Oriented Understudy for Gisting Evaluation, or ROUGE for short [82], is a measure used to evaluate the quality of a summary by comparing it to other summaries created by humans. More precisely, for a given set of reference summaries R e f and a summary candidate X, ROUGE-K computes the percentage of k-grams from R e f that appear in X according to
$\mathrm{ROUGE\text{-}K}(X, Ref) = \frac{\sum_{S \in Ref} \sum_{k \in k\text{-grams}(S)} \min\{\mathrm{count}(k, X),\, \mathrm{count}(k, S)\}}{\sum_{S \in Ref} \sum_{k \in k\text{-grams}(S)} \mathrm{count}(k, S)}$
where $\mathrm{count}(k, S)$ provides the number of occurrences of a k-gram k in a summary S. We can estimate an upper bound R on the image size of $\mathbf{G}(\cdot)$ similarly to the derivation for the weighted Hamming distance above as $M^D$, where $D := \lvert \mathrm{grams}(Ref) \rvert$ is the dimensionality of $\mathbf{G}(\cdot)$. This represents the number of unique k-grams occurring in the reference summaries. Note that we do not need to count grams that do not occur in the references.
Another version, ROUGE-LCS, is based on the concept of the longest common subsequence (LCS). More precisely, for two summaries $X$ and $Y$, we first compute $\mathrm{LCS}(X, Y)$, the length of their longest common subsequence. We then use this value to define the precision and recall measures $P_{\mathrm{LCS}}(X, Y) = \mathrm{LCS}(X, Y)/|X|$ and $R_{\mathrm{LCS}}(X, Y) = \mathrm{LCS}(X, Y)/|Y|$, respectively, which are combined into the corresponding F-measure:
$$ \Delta_{\mathrm{ROUGE\text{-}LCS}} = 1 - \frac{1}{|\mathit{Ref}|} \sum_{S \in \mathit{Ref}} \frac{(1 + \beta^2)\, P_{\mathrm{LCS}}(X, S) \cdot R_{\mathrm{LCS}}(X, S)}{\beta^2\, P_{\mathrm{LCS}}(X, S) + R_{\mathrm{LCS}}(X, S)}. $$
In other words, each dimension in $G(\cdot)$ is indexed by a reference summary $S \in \mathit{Ref}$. ROUGE-LCS (unlike ROUGE-K) is non-decomposable.
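For completeness, the sketch below computes the LCS length by standard dynamic programming and evaluates the resulting $\Delta_{\mathrm{ROUGE\text{-}LCS}}$; tokenized inputs and the default $\beta = 1$ are assumptions made for illustration.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence via standard dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_lcs_loss(candidate, references, beta=1.0):
    """Delta_ROUGE-LCS: 1 minus the average LCS-based F-measure over the references."""
    total = 0.0
    for ref in references:
        l = lcs_length(candidate, ref)
        if l == 0:
            continue  # zero LCS contributes an F-measure of 0
        p, r = l / len(candidate), l / len(ref)
        total += (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return 1.0 - total / len(references)

cand = "the cat sat on the mat".split()
refs = ["a cat was sitting on the mat".split()]
print(rouge_lcs_loss(cand, refs))
```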

6. Validation of Theoretical Time Complexity

To demonstrate feasibility, we evaluate the performance of our algorithm on several application tasks: part-of-speech tagging [83], base-NP chunking [84], and constituency parsing [39,46]. More precisely, we consider the task of loss-augmented inference (see Section 4.1) for margin and slack scaling with different loss functions. The run-times for the task of diverse k-best MAP inference (Section 4.3) and for evaluating the structured generalization bounds (see Section 4.2) are identical to the run-time for the loss-augmented inference with slack scaling. We omit the corresponding plots due to redundancy. In all the experiments, we used Penn English Treebank-3 [85] as a benchmark data set, which provides a large corpus of annotated sentences from the Wall Street Journal. For better visualization, we restrict our experiments to sentences containing at most 40 words. The resulting time performance is shown in Figure 6. We can see that different loss functions result in different computation costs, as indicated by the number of values for the auxiliary variables shown in the last column of Table 1. In particular, the shapes of the curves are consistent with the upper bound provided in Theorem 1, which reflects the polynomial degree of the overall dependency with respect to the graph size M. The difference between margin and slack scaling is due to the fact that in the case of a decomposable loss function G , the corresponding loss terms can be folded into the factors of the compatibility function F, allowing for the use of conventional message passing.

7. Summary of Contributions

In this paper, we provided a set of formal contributions by generalizing the idea of constrained message passing introduced in [69]. Our improved framework includes the following key elements:
  • Abstract definition of the target problem (see Problem 1);
  • Constrained message-passing algorithm on clique trees (see Algorithm 1);
  • Formal statements to ensure theoretical properties such as correctness and efficiency (see Theorem 1, Proposition 2, Corollary 1, Theorem 2).
We emphasize that the idea of using auxiliary variables to constrain message passing is not novel per se and was originally proposed in [69]. Our first contribution concerns the difference in the definition of the target problem, which broadens the range of admissible applications. More precisely, the target objective H in the previous publication relates the energy of the model F and the vector-valued mapping G , which describes the sufficient statistics of the global terms in a rather restricted form according to
$$ H(F(\cdot), G(\cdot)) := F(\cdot) \circ \eta(G(\cdot)), \qquad \circ \in \{+, \cdot\}, $$
where η is a non-negative function on G . Although this is sufficient for the purposes of [69], there are other computational problems that could benefit from the same message-passing idea but cannot be addressed by it in the above form. Therefore, we relax the form of the function H in our target problem (Problem 1) to allow much finer interactions between F and G . Specifically, we allow H to be any function on the arguments ( F , G ) , which is restricted only by the requirement that it is non-decreasing in the first argument. Without this assumption, the proposed algorithm (Algorithm 1) is not guaranteed to provide an optimal solution. Therefore, all related theorems, algorithms, and the corresponding proofs in [69] must be adjusted accordingly. Conceptually, we can reuse the computational core idea, including the introduction of auxiliary variables, without any changes. The only difference in the new version of the algorithm is how the statistics gathered during message passing are evaluated at the end to ensure the optimality of the solution according to the new requirement on the function H (see line 8 in Algorithm 1). In order to motivate the significance and practical usability of this extension, we demonstrated that mappings F and G can interact by means of the function H in a highly non-trivial way, thereby exceeding the range of problem formulations in [69]. In particular, we showed how the generalization bounds in structured prediction can be evaluated using our approach (see Section 4.2). To the best of our knowledge, there are no existing methods for finding an optimal solution for this task.
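To illustrate this final evaluation step, the following minimal Python sketch assumes that constrained message passing has already produced a table `mu` mapping each reachable value $l$ of the global statistics to the constrained maximum $\max\{F(y) : G(y) = l\}$; the table contents and the objective $H$ below are hypothetical and only serve to show why monotonicity in the first argument is needed.

```python
import math

def evaluate_final_statistics(mu, H):
    """Final evaluation of the statistics gathered during constrained message passing.

    `mu` maps each reachable value l of the global statistics G to the maximal
    energy max{F(y) : G(y) = l}; `H` is the target objective. Maximizing
    H(mu[l], l) over l yields the optimum of Problem 1 provided H is
    non-decreasing in its first argument -- otherwise a labeling with the same
    statistics but suboptimal energy could score higher and optimality is lost.
    """
    best_l = max(mu, key=lambda l: H(mu[l], l))
    return best_l, H(mu[best_l], best_l)

# Hypothetical table of constrained maxima and an objective that mixes F and l
# in a non-additive, non-multiplicative way (only covered by the relaxed form);
# tanh is non-decreasing, so the requirement on the first argument holds.
mu = {0: -1.2, 1: -0.7, 2: -2.5}
H = lambda F, l: math.tanh(F) + math.sqrt(l)
print(evaluate_final_statistics(mu, H))
```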
Our second contribution involves a new graph transformation technique via node cloning (illustrated in Figure 5), which significantly improves the asymptotic bounds on the computational complexity of Algorithm 1. Previously, in [69], a polynomial running time was guaranteed only if the maximal node degree $\nu$ in a corresponding cluster graph was bounded by a (small) constant. The proposed transformation effectively removes this limitation and ensures a polynomial running time regardless of the graph structure. We emphasize this result in Corollary 1. Note that the term $R^{\nu-1}$ in Theorem 1 in [69] has been replaced with $R^2$ in Corollary 1 of the current paper, thereby reducing the computational complexity of the resulting message-passing algorithm. This reduction is significant, as $\nu$ can depend on the graph size in the worst case (see Figure 4).
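The following schematic Python sketch illustrates the cloning transformation on an abstract adjacency structure: every clique node with more than three neighbors is replaced by a chain of clones, each attached to one original neighbor, so that the maximal degree drops to three. The node names, the dictionary representation, and the degree threshold are illustrative assumptions; the zero potentials assigned to the clones are only indicated in the comments.

```python
from collections import defaultdict

def clone_high_degree_nodes(adj, max_degree=3):
    """Reduce the maximal node degree of a tree to at most `max_degree` by node cloning.

    `adj` maps each clique node to the list of its neighbors. Every node with
    too many neighbors is replaced by a chain of clones, one per incident edge;
    the clones would carry the same variable scope and zero potentials
    (f = g = 0), so the represented objective is unchanged. Schematic sketch only.
    """
    clone_of = {}      # (node, neighbor) -> clone carrying that edge
    chain_edges = []   # edges connecting consecutive clones of the same node
    for u, nbrs in adj.items():
        if len(nbrs) <= max_degree:
            continue
        clones = [f"{u}#{i}" for i in range(len(nbrs))]
        for v, c in zip(nbrs, clones):
            clone_of[(u, v)] = c
        chain_edges += list(zip(clones, clones[1:]))

    new_adj = defaultdict(set)
    seen = set()
    for u, nbrs in adj.items():
        for v in nbrs:
            if (v, u) in seen:
                continue
            seen.add((u, v))
            a = clone_of.get((u, v), u)
            b = clone_of.get((v, u), v)
            new_adj[a].add(b)
            new_adj[b].add(a)
    for a, b in chain_edges:
        new_adj[a].add(b)
        new_adj[b].add(a)
    return new_adj

# Star-shaped clique tree: "C0" has 5 neighbors before the transformation.
adj = {"C0": ["C1", "C2", "C3", "C4", "C5"],
       "C1": ["C0"], "C2": ["C0"], "C3": ["C0"], "C4": ["C0"], "C5": ["C0"]}
new_adj = clone_high_degree_nodes(adj)
print(max(len(v) for v in new_adj.values()))  # maximal degree after cloning: 3
```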
As our third contribution, we investigate the important question of how the parameter R, which describes the maximum number of states for auxiliary variables, grows with the size of the graph, as measured by the total number of variables M in the corresponding MRF. As a result, we identify a sufficient condition on G that guarantees the overall polynomial run-time of Algorithm 1 in relation to the graph size in our target problem (Problem 1; see Theorem 2).

8. Related Works

Several previous works have addressed the problem of exact MAP inference with global factors in the context of SSVMs when optimizing for non-decomposable loss functions. Joachims [86] proposed an algorithm for a set of multivariate losses, including the F β -loss. However, the presented idea applies only to a simple case, where the corresponding mapping F in (2) decomposes into non-overlapping components (e.g., unary potentials).
Similar ideas based on the introduction of auxiliary variables have been proposed [87,88] to adapt the belief propagation algorithm to a special form of high-order potentials. Specifically, for the case of binary-valued variables $y_i \in \{0, 1\}$, the authors focus on univariate cardinality potentials $\eta(G(y)) = \eta(\sum_i y_i)$. For the tasks of sequence tagging and constituency parsing, the works in [45,46] propose an exact inference algorithm for the slack scaling formulation that focuses on the univariate dissimilarity measures $G(y) = \sum_t \mathbb{1}[y_t^* \neq y_t]$ and $G(y) = \#\mathrm{FP}(y)$ (see Table 1 for details). In [69], the authors extend this idea and provide a unified strategy for tackling multivariate and non-decomposable loss functions.
In the current paper, we build on the results in [69] and generalize the target problem (Problem 1), thereby increasing the range of admissible applications. More precisely, we replace the binary operation $\circ : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, corresponding to either summation or multiplication in the previous objective $F(y) \circ \eta(G(y))$, with a function $H : \mathbb{R} \times \mathbb{R}^P \to \mathbb{R}$, thereby allowing for more subtle interactions between the energy of the core model $F$ and the sufficient statistics $G$ according to $H(F(y), G(y))$. The increased flexibility, however, must be further restricted in order for a corresponding solution to be optimal. We have found that it is sufficient to require $H$ to be non-decreasing in its first argument. Note that for the special case of the objective in [69], this requirement automatically holds.
Furthermore, the previous works could only guarantee polynomial run-time in cases where (a) the core model can be represented by a tree-shaped factor graph, and (b) the maximal degree of a variable node in the factor graph is bounded by a constant. The former excludes problems with cyclic dependencies, and the latter rules out graphs with star-like shapes. The corresponding idea for clique trees can handle cycles but suffers from a similar restriction on the maximal node degree being bounded. Here, we solve this problem by applying the graph transformation proposed in Section 3.2, effectively reducing the maximal node degree to $\nu = 3$. We note that a similar idea can be applied to factor graphs by replicating variable nodes and introducing constant factors. As a result, we improve the guarantee on the computational complexity by reducing the potentially unbounded parameter $\nu$ in the upper bound $O(M \cdot N^{\tau+1} \cdot R^{\nu-1})$ to $\nu \le 3$.
In future work, we will consider applying our framework to the explanation task based on the technique of layer-wise relevance propagation [89] in graph neural networks, following in the spirit of [90,91].

9. Conclusions

Despite the high diversity in the range of existing applications, a considerable number of the underlying MAP problems share the unifying property that the information on the global variable interactions imposed by either a global factor or a global constraint can be locally propagated through the network by means of dynamic programming. By extending previous works, we presented a theoretical framework for efficient exact inference on models involving global factors with decomposable internal structures. At the heart of the presented framework is a constrained message-passing algorithm that always finds an optimal solution for our target problem in polynomial time. In particular, the performance of our approach does not explicitly depend on the graph form but rather on intrinsic properties such as the treewidth and the number of states of the auxiliary variables defined by the sufficient statistics of global interactions. The overall computational procedure is provably exact, and it has lower asymptotic bounds on the computational time complexity compared to previous works.

Author Contributions

Conceptualization, A.B.; Methodology, A.B.; Software, A.B.; Validation, A.B.; Formal analysis, A.B. and S.N.; Investigation, A.B.; Writing—original draft, A.B.; Writing—review & editing, A.B., S.N. and K.-R.M.; Funding acquisition, K.-R.M. All authors have read and agreed to the published version of the manuscript.

Funding

A.B. acknowledges support by the BASLEARN—TU Berlin/BASF Joint Lab for Machine Learning, cofinanced by TU Berlin and BASF SE. S.N. and K.-R.M. acknowledge support by the German Federal Ministry of Education and Research (BMBF) for BIFOLD under grants 01IS18025A and 01IS18037A. K.-R.M. was partly supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grants funded by the Korea government(MSIT) (no. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University and no. 2022-0-00984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation) and by the German Federal Ministry for Education and Research (BMBF) under grants 01IS14013B-E and 01GQ1115.

Data Availability Statement

The data used in the experimental part of this paper is available at https://catalog.ldc.upenn.edu.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
LCS: Longest Common Subsequence
MAP: Maximum A Posteriori
MRF: Markov Random Field
SSVM: Structural Support Vector Machine

Appendix A. Proof of Theorem 1

Proof. 
We now show the correctness of the presented computations. For this purpose, we first provide a semantic interpretation of the messages as follows. Let $C_i - C_j$ be an edge in a clique tree. We denote by $F(i \to j)$ the set of clique factors $f_{C_k}$ of the mapping $F$ on the $C_i$-side of the tree, and by $G(i \to j)$ the corresponding set of clique factors of the mapping $G$. Furthermore, we denote by $V(i \to j)$ the set of all variables appearing on the $C_i$-side but not in the sepset $C_i \cap C_j$. Intuitively, a message $\mu_{C_i \to C_j}^{l_i}(y_{C_i \cap C_j})$ sent from clique $C_i$ to $C_j$ corresponds to the sum of all factors contained in $F(i \to j)$, maximized (for fixed values of $y_{C_i \cap C_j}$ and $l_i$) over the variables in $V(i \to j)$ subject to the constraint $l_i = \sum_{g_{C_k} \in G(i \to j)} g_{C_k}(y_{C_k})$. In other words, we define the following induction hypothesis:
$$ \mu_{C_i \to C_j}^{l_i}(y_{C_i \cap C_j}) = \max_{V(i \to j) \,:\; l_i = \sum_{g_{C_k} \in G(i \to j)} g_{C_k}(y_{C_k})} \;\; \sum_{f_{C_k} \in F(i \to j)} f_{C_k}(y_{C_k}) \qquad \text{(A1)} $$
Now, consider an edge $(C_i - C_j)$ such that $C_i$ is not a leaf. Let $i_1, \dots, i_m$ be the neighboring cliques of $C_i$ other than $C_j$. It follows from the running intersection property that $V(i \to j)$ is the disjoint union of the $V(i_k \to i)$ for $k = 1, \dots, m$ and the variables $y_{C_i \setminus C_j}$ eliminated at $C_i$ itself. Similarly, $F(i \to j)$ is the disjoint union of the $F(i_k \to i)$ and $\{f_{C_i}\}$. Finally, $G(i \to j)$ is the disjoint union of the $G(i_k \to i)$ and $\{g_{C_i}\}$. In the following, we abbreviate the term $V(i_k \to i) : l_{i_k} = \sum_{g \in G(i_k \to i)} g$, describing a range of variables in $V(i_k \to i)$ subject to the corresponding equality constraint with respect to $l_{i_k}$, by $V(i_k \to i) : l_{i_k}$. Thus, the right-hand side of Equation (A1) is equal to
$$ \max_{y_{C_i \setminus C_j}} \; \max_{\{l_{i_k}\}_{k=1}^{m}} \; \max_{V(i_1 \to i):\, l_{i_1}} \cdots \max_{V(i_m \to i):\, l_{i_m}} \Bigg[ \sum_{f \in F(i_1 \to i)} f + \cdots + \sum_{f \in F(i_m \to i)} f + f_{C_i} \Bigg] \qquad \text{(A2)} $$
where in the second max, we maximize over all configurations of $\{l_{i_k}\}_{k=1}^{m}$ subject to the constraint $\sum_{k=1}^{m} l_{i_k} = l_i - g_{C_i}(y_{C_i})$. Since all the corresponding sets are disjoint, the term (A2) is equal to
$$ \max_{y_{C_i \setminus C_j},\, \{l_{i_k}\}_{k=1}^{m}} \Bigg[ f_{C_i} + \underbrace{\max_{V(i_1 \to i):\, l_{i_1}} \sum_{f \in F(i_1 \to i)} f}_{\mu_{C_{i_1} \to C_i}^{l_{i_1}}(y_{C_{i_1} \cap C_i})} + \cdots + \underbrace{\max_{V(i_m \to i):\, l_{i_m}} \sum_{f \in F(i_m \to i)} f}_{\mu_{C_{i_m} \to C_i}^{l_{i_m}}(y_{C_{i_m} \cap C_i})} \Bigg] $$
where, again, the maximization over $\{l_{i_k}\}_{k=1}^{m}$ is subject to the constraint $\sum_{k=1}^{m} l_{i_k} = l_i - g_{C_i}(y_{C_i})$. Using the induction hypothesis in the last expression, we obtain the right-hand side of Equation (7), thereby proving the claim in Equation (A1).
Now, consider Equation (9). By using Equation (A1) and considering that all the sets of variables and factors involved in different messages are disjoint, we can conclude that the computed values $\mu(l)$ correspond to the sum of all factors $f$ of the mapping $F$ over the variables in $y$, maximized subject to the constraint $G(y) = l$. Note that, up to this point, the proof is equivalent to that provided for Theorem 2 in the previous publication [69] because the message-passing step constrained via auxiliary variables is identical. We now use the additional requirement on $H$ to ensure the optimality of the corresponding solution. Because $H$ is non-decreasing in its first argument, performing the maximization over all values $l$ according to Equation (10) yields the optimal value of Problem 1.
By inspecting the formula for the message passing in Equation (7), we can conclude that the corresponding operations can be performed in $O(M \cdot N^{\tau+1} \cdot R^{\nu-1})$ time, where $\nu$ denotes the maximal number of neighbors of any clique node $C_i$. First, the summation in Equation (7) involves $|ne(C_i)|$ terms, resulting in $|ne(C_i)| - 1$ summation operations. Second, a maximization is performed first over the $|C_i \setminus C_j|$ variables with a cost of $N^{|C_i \setminus C_j|}$. This, however, is carried out for each configuration of $y_{C_i \cap C_j}$, where $|C_i \setminus C_j| + |C_i \cap C_j| = |C_i| \le \tau + 1$, resulting in $N^{\tau+1}$. Then, a maximization over $\{l_k\}$ costs an additional $R^{|ne(C_i)|-2}$. Together with the possible values for $l$, this yields $R^{\nu-1}$, where we upper bound $|ne(C_i)|$ by $\nu$. Therefore, sending a message for all possible configurations of $(y_{C_i \cap C_j}; l)$ on the edge $C_i - C_j$ costs $O(N^{\tau+1} \cdot (|ne(C_i)| - 1) \cdot R^{\nu-1})$ time. Finally, we need to carry out these operations for each edge $(i, j) \in E$ in the clique tree. The resulting cost can be estimated as follows: $\sum_{(i,j) \in E} N^{\tau+1} \cdot R^{\nu-1} \cdot (|ne(C_i)| - 1) = N^{\tau+1} \cdot R^{\nu-1} \sum_{(i,j) \in E} (|ne(C_i)| - 1) \le N^{\tau+1} \cdot R^{\nu-1} \cdot |E| = N^{\tau+1} \cdot R^{\nu-1} \cdot (|V| - 1) \le N^{\tau+1} \cdot R^{\nu-1} \cdot M$, where $V$ denotes the set of clique nodes in the clique tree. Therefore, the total complexity is upper bounded by $O(M \cdot N^{\tau+1} \cdot R^{\nu-1})$. □
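The following brute-force Python sketch mirrors the structure of a single constrained message update analyzed above; the nested enumerations over sepset assignments, eliminated variables, and splits of the auxiliary value among the incoming messages correspond directly to the $N^{\tau+1}$ and $R$-dependent factors in the bound. All names, the integer-valued statistics, and the callable representation of incoming messages are illustrative assumptions.

```python
from itertools import product

def send_constrained_message(f_i, g_i, clique_vars, sepset_vars, incoming,
                             states, l_values):
    """Brute-force version of one constrained message update (cf. Eq. (A1)).

    For every assignment `sep` to the sepset variables and every value `l` of
    the auxiliary variable, we maximize f_i plus the incoming messages over
    the eliminated variables and over all splits of l - g_i(y) among the
    incoming auxiliary values. `incoming` is a list of callables m(y, l_k)
    representing already-computed child messages.
    """
    elim_vars = [v for v in clique_vars if v not in sepset_vars]
    msg = {}
    for sep in product(states, repeat=len(sepset_vars)):
        for l in l_values:
            best = float("-inf")
            for elim in product(states, repeat=len(elim_vars)):
                y = dict(zip(sepset_vars, sep))
                y.update(zip(elim_vars, elim))
                rest = l - g_i(y)  # contribution required from the subtrees below
                # enumerate splits of `rest` among the incoming auxiliary values;
                # fixing the last split value via the constraint (instead of
                # enumerating it) gives exactly the R^(nu-1) factor in Theorem 1
                for split in product(l_values, repeat=len(incoming)):
                    if sum(split) != rest:
                        continue
                    val = f_i(y) + sum(m(y, lk) for m, lk in zip(incoming, split))
                    best = max(best, val)
            msg[(sep, l)] = best
    return msg

# Tiny leaf example: binary clique over ("y1", "y2"), no incoming messages,
# g counts how many of the two variables equal 1.
f = lambda y: 0.5 * y["y1"] - 0.2 * y["y2"]
g = lambda y: y["y1"] + y["y2"]
print(send_constrained_message(f, g, ["y1", "y2"], ["y2"], [],
                               states=[0, 1], l_values=[0, 1, 2]))
```

Entries equal to $-\infty$ in the returned table mark combinations of sepset assignment and auxiliary value for which no configuration of the eliminated variables satisfies the constraint.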

Appendix B. Flowchart Diagram of Algorithm 1

We present a flowchart diagram for Algorithm 1 in Figure A1.
Figure A1. Flowchart diagram for Algorithm 1 for implementing a constrained message-passing scheme on clique trees. Algorithm 1 is guaranteed to always find an optimal solution in polynomial time. Its computational complexity is upper bounded by $O(M \cdot N^{\tau+1} \cdot R^2)$.

Appendix C. Proof of Proposition 2

Proof. 
Assume we are given a clique tree with treewidth $\tau$. This means that every node in the clique tree has at most $\tau + 1$ variables. Therefore, the number of all possible sepsets for a clique, which refers to the number of all variable combinations shared between two neighbors, is given by $2^{\tau+1} - 2$, where we exclude the empty set and the set containing all the variables in the corresponding clique.
Furthermore, we can deal with sepset duplicates by iteratively rearranging the edges in the clique tree such that, for each of the $2^{\tau+1} - 2$ possible sepsets, there is at most one duplicate, providing an upper bound of $2^{\tau+2} - 4$ on the number of edges for each node. More precisely, we first choose a node with more than $2^{\tau+2} - 4$ neighbors as the root and then reshape the clique tree by propagating some of the duplicate edges (together with the corresponding subtrees) toward the leaves, as illustrated in Figure A2. The duplicate edges of nodes connected to the leaves of the clique tree can be reattached in a sequential manner similar to the example shown in Figure 4. Due to this procedure, the maximal number of duplicates for every sepset is upper bounded by 2. Multiplied by the maximal number of possible sepsets for each node, we obtain an upper bound on the number of neighbors in the reshaped clique tree given by $\nu \le 2^{\tau+2} - 4$. □
Figure A2. Illustration of the reshaping procedure for a clique tree in the case where the condition $\nu \le 2^{\tau+2} - 4$ is violated. $C_r$ is the root clique, where a sepset $a$ occurs at least twice. The number of neighbors of $C_r$ can be reduced by removing the edge between $C_1$ and $C_r$ and attaching $C_1$ to $C_2$. In this way, we can ensure that every node has at most one duplicate for every possible sepset. Furthermore, this procedure preserves the running intersection property.

Appendix D. Proof of Theorem 2

Proof. 
We provide the proof by induction. Let $C_1, \dots, C_K$ be the cliques of a corresponding instance of Problem 1. We now consider an arbitrary but fixed order of the $C_k$ for $k \in \{1, \dots, K\}$. We denote by $R_i$ the number of states of a variable $l_i$, that is, $R$ is given by $\max_i R_i$. As previously mentioned (see Equation (8)), an auxiliary variable $l_i$ corresponds to the sum of potentials $g_{C_k}(y_{C_k})$ over all $C_k$ in a subtree of which $C_i$ is the root. This means that the number $R_i$ is upper bounded by the image size of the corresponding sum function according to
$$ R_i \;\le\; \Big| \mathrm{Im}\Big( \sum_{k=1}^{i} g_{C_k}(\cdot) \Big) \Big| \;\le\; 2 \cdot i \cdot T, $$
where the corresponding values lie in the set $[-i \cdot T,\, i \cdot T] \cap \mathbb{Z}$, defining our induction hypothesis. Using the induction hypothesis and the assumption $g_{C_k}(\cdot) \in [-T, T]$, it directly follows that the values counted by $R_{i+1}$ all lie in the set $[-(i+1) \cdot T,\, (i+1) \cdot T] \cap \mathbb{Z}$. The base case for $R_1$ holds due to the assumption of the theorem. Because $K \le M$ always holds and the order of the considered cliques is arbitrary, we can conclude that $R$ is upper bounded by $2 \cdot M \cdot T$, which is a polynomial in $M$. □

References

  1. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques-Adaptive Computation and Machine Learning; The MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  2. Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn. 2008, 1, 1–305. [Google Scholar] [CrossRef] [Green Version]
  3. Lafferty, J. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June– 1 July 2001; pp. 282–289. [Google Scholar]
  4. Kappes, J.H.; Andres, B.; Hamprecht, F.A.; Schnörr, C.; Nowozin, S.; Batra, D.; Kim, S.; Kausler, B.X.; Kröger, T.; Lellmann, J.; et al. A Comparative Study of Modern Inference Techniques for Structured Discrete Energy Minimization Problems. Int. J. Comput. Vis. 2015, 115, 155–184. [Google Scholar] [CrossRef] [Green Version]
  5. Bauer, A.; Nakajima, S.; Görnitz, N.; Müller, K.R. Partial Optimality of Dual Decomposition for MAP Inference in Pairwise MRFs. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Okinawa, 16–18 April 2019; Volume 89, pp. 1696–1703. [Google Scholar]
  6. Wainwright, M.J.; Jaakkola, T.S.; Willsky, A.S. MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Trans. Inf. Theory 2005, 51, 3697–3717. [Google Scholar] [CrossRef]
  7. Kolmogorov, V. Convergent Tree-Reweighted Message Passing for Energy Minimization. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1568–1583. [Google Scholar] [CrossRef] [PubMed]
  8. Kolmogorov, V.; Wainwright, M.J. On the Optimality of Tree-reweighted Max-product Message-passing. In Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, UK, 26–29 July 2005; pp. 316–323. [Google Scholar]
  9. Sontag, D.; Meltzer, T.; Globerson, A.; Jaakkola, T.S.; Weiss, Y. Tightening LP Relaxations for MAP using Message Passing. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, Helsinki, Finland, 9–12 July 2012. [Google Scholar]
  10. Sontag, D. Approximate Inference in Graphical Models Using LP Relaxations. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA, 2010. [Google Scholar]
  11. Sontag, D.; Globerson, A.; Jaakkola, T. Introduction to Dual Decomposition for Inference. In Optimization for Machine Learning; MIT Press: Cambridge, MA, USA, 2011. [Google Scholar]
  12. Wang, J.; Yeung, S. A Compact Linear Programming Relaxation for Binary Sub-modular MRF. In Proceedings of the Energy Minimization Methods in Computer Vision and Pattern Recognition-10th International Conference, EMMCVPR 2015, Hong Kong, China, 13–16 January 2015; pp. 29–42. [Google Scholar]
  13. Iii, H.D.; Marcu, D. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the ICML, Bonn, Germany, 7–11 August 2005; pp. 169–176. [Google Scholar]
  14. Sheng, L.; Binbin, Z.; Sixian, C.; Feng, L.; Ye, Z. Approximated Slack Scaling for Structural Support Vector Machines in Scene Depth Analysis. Math. Probl. Eng. 2013, 2013, 817496. [Google Scholar]
  15. Kulesza, A.; Pereira, F. Structured Learning with Approximate Inference. In Proceedings of the 20th NIPS, Vancouver, BC, Canada, 3–6 December 2007; pp. 785–792. [Google Scholar]
  16. Rush, A.M.; Collins, M.J. A Tutorial on Dual Decomposition and Lagrangian Relaxation for Inference in Natural Language Processing. J. Artif. Intell. Res. 2012, 45, 305–362. [Google Scholar] [CrossRef]
  17. Bodenstab, N.; Dunlop, A.; Hall, K.B.; Roark, B. Beam-Width Prediction for Efficient Context-Free Parsing. In Proceedings of the 49th ACL, Portland, OR, USA, 19–24 June 2011; pp. 440–449. [Google Scholar]
  18. Ratliff, N.D.; Bagnell, J.A.; Zinkevich, M. (Approximate) Subgradient Methods for Structured Prediction. In Proceedings of the 11th AISTATS, San Juan, Puerto Rico, 21–24 March 2007; pp. 380–387. [Google Scholar]
  19. Lim, Y.; Jung, K.; Kohli, P. Efficient Energy Minimization for Enforcing Label Statistics. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1893–1899. [Google Scholar] [CrossRef]
  20. Ranjbar, M.; Vahdat, A.; Mori, G. Complex loss optimization via dual decomposition. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2304–2311. [Google Scholar]
  21. Komodakis, N.; Paragios, N. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; pp. 2985–2992. [Google Scholar]
  22. Boykov, Y.; Veksler, O. Graph Cuts in Vision and Graphics: Theories and Applications. In Handbook of Mathematical Models in Computer Vision; 2006; pp. 79–96. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=279e0f5d885110c173bb86d37c997becf198651b (accessed on 16 May 2017).
  23. Kolmogorov, V.; Zabih, R. What Energy Functions Can Be Minimized via Graph Cuts? IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 147–159. [Google Scholar] [CrossRef] [Green Version]
  24. Hurley, B.; O’Sullivan, B.; Allouche, D.; Katsirelos, G.; Schiex, T.; Zytnicki, M.; de Givry, S. Multi-language evaluation of exact solvers in graphical model discrete optimization. Constraints 2016, 21, 413–434. [Google Scholar] [CrossRef]
  25. Haller, S.; Swoboda, P.; Savchynskyy, B. Exact MAP-Inference by Confining Combinatorial Search With LP Relaxation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, LA, USA, 2–7 February 2018; pp. 6581–6588. [Google Scholar]
  26. Savchynskyy, B.; Kappes, J.H.; Swoboda, P.; Schnörr, C. Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation. In Proceedings of the Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 1950–1958. [Google Scholar]
  27. Kappes, J.H.; Speth, M.; Reinelt, G.; Schnörr, C. Towards Efficient and Exact MAP-Inference for Large Scale Discrete Computer Vision Problems via Combinatorial Optimization. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1752–1758. [Google Scholar]
  28. Forney, G.D. The Viterbi algorithm. Proc. IEEE 1973, 61, 268–278. [Google Scholar] [CrossRef]
  29. Tarlow, D.; Givoni, I.E.; Zemel, R.S. HOP-MAP: Efficient message passing with high order potentials. In Proceedings of the 13th AISTATS, Sardinia, Italy, 13–15 May 2010. [Google Scholar]
  30. McAuley, J.J.; Caetano, T.S. Faster Algorithms for Max-Product Message-Passing. J. Mach. Learn. Res. 2011, 12, 1349–1388. [Google Scholar]
  31. Younger, D.H. Recognition and Parsing of Context-Free Languages in Time n3. Inf. Control 1967, 10, 189–208. [Google Scholar] [CrossRef] [Green Version]
  32. Klein, D.; Manning, C.D. A* Parsing: Fast Exact Viterbi Parse Selection. In Proceedings of the HLT-NAACL, Edmonton, AB, Canada, 31 May 2003; pp. 119–126. [Google Scholar]
  33. Gupta, R.; Diwan, A.A.; Sarawagi, S. Efficient inference with cardinality-based clique potentials. In Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, OR, USA, 20–24 June 2007; pp. 329–336. [Google Scholar]
  34. Kolmogorov, V.; Boykov, Y.; Rother, C. Applications of parametric maxflow in computer vision. In Proceedings of the IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, 14–20 October 2007; pp. 1–8. [Google Scholar]
  35. McAuley, J.J.; Caetano, T.S. Exploiting Within-Clique Factorizations in Junction-Tree Algorithms. In Proceedings of the 13th AISTATS, Sardinia, Italy, 13–15 May 2010; pp. 525–532. [Google Scholar]
  36. Bodlaender, H. A Tourist Guide Through Treewidth. Acta Cybern. 1993, 11, 1–22. [Google Scholar]
  37. Chandrasekaran, V.; Srebro, N.; Harsha, P. Complexity of Inference in Graphical Models. In Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, 9–12 July 2008; pp. 70–78. [Google Scholar]
  38. Taskar, B.; Guestrin, C.; Koller, D. Max-Margin Markov Networks. In Proceedings of the 16th NIPS, Whistler, BC, Canada, 9–11 December 2003; pp. 25–32. [Google Scholar]
  39. Tsochantaridis, I.; Joachims, T.; Hofmann, T.; Altun, Y. Large Margin Methods for Structured and Interdependent Output Variables. J. Mach. Learn. Res. 2005, 6, 1453–1484. [Google Scholar]
  40. Joachims, T.; Hofmann, T.; Yue, Y.; Yu, C.N. Predicting Structured Objects with Support Vector Machines. Commun. ACM Res. Highlight 2009, 52, 97–104. [Google Scholar] [CrossRef] [Green Version]
  41. Sarawagi, S.; Gupta, R. Accurate max-margin training for structured output spaces. In Proceedings of the 25th ICML, Helsinki, Finland, 5–9 July 2008; pp. 888–895. [Google Scholar]
  42. Taskar, B.; Klein, D.; Collins, M.; Koller, D.; Manning, C.D. Max-Margin Parsing. In Proceedings of the EMNLP, Barcelona, Spain, 25–26 July 2004; pp. 1–8. [Google Scholar]
  43. Nam John Yu, C.; Joachims, T. Learning Structural SVMs with Latent Variables. In Proceedings of the ICML, Montreal, QC, Canada, 14–18 June 2009; pp. 1169–1176. [Google Scholar]
  44. Bakir, G.; Hoffman, T.; Schölkopf, B.; Smola, A.J.; Taskar, B.; Vishwanathan, S.V.N. Predicting Structured Data; The MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
  45. Bauer, A.; Görnitz, N.; Biegler, F.; Müller, K.R.; Kloft, M. Efficient Algorithms for Exact Inference in Sequence Labeling SVMs. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 870–881. [Google Scholar] [CrossRef]
  46. Bauer, A.; Braun, M.L.; Müller, K.R. Accurate Maximum-Margin Training for Parsing With Context-Free Grammars. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 44–56. [Google Scholar] [CrossRef]
  47. Bauer, A.; Nakajima, S.; Görnitz, N.; Müller, K.R. Optimizing for Measure of Performance in Max-Margin Parsing. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 2680–2684. [Google Scholar] [CrossRef]
  48. McAllester, D. Generalization bounds and consistency for structured labeling. In Proceedings of the Predicting Structured Data; Gökhan, B.H., Hofmanns, T., Schölkopf, B., Smola, A.J., Taskar, B., Vishwanathan, S.V.N., Eds.; The MIT Press: Cambridge, MA, USA, 2006; pp. 247–262. [Google Scholar]
  49. McAllester, D.A.; Keshet, J. Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss. In Proceedings of the 25th NIPS, Granada, Spain, 12–15 December 2011; pp. 2205–2212. [Google Scholar]
  50. London, B.; Huang, B.; Getoor, L. Stability and Generalization in Structured Prediction. J. Mach. Learn. Res. 2016, 17, 1–52. [Google Scholar]
  51. Ranjbar, M.; Mori, G.; Wang, Y. Optimizing Complex Loss Functions in Structured Prediction. In Proceedings of the 11th ECCV, Heraklion, Crete, Greece, 5–11 September 2010; pp. 580–593. [Google Scholar]
  52. Rätsch, G.; Sonnenburg, S. Large Scale Hidden Semi-Markov SVMs. In Proceedings of the 19 NIPS, Barcelona, Spain, 9 December 2007; pp. 1161–1168. [Google Scholar]
  53. Tarlow, D.; Zemel, R.S. Structured Output Learning with High Order Loss Functions. In Proceedings of the 15th AISTATS, La Palma, Canary Islands, Spain, 21–23 April 2012; pp. 1212–1220. [Google Scholar]
  54. Taskar, B.; Chatalbashev, V.; Koller, D.; Guestrin, C. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd ICML, Bonn, Germany, 7–11 August 2005; pp. 896–903. [Google Scholar]
  55. Finley, T.; Joachims, T. Training Structural SVMs when Exact Inference is Intractable. In Proceedings of the 25th ICML, Helsinki, Finland, 5–9 July 2008; pp. 304–311. [Google Scholar]
  56. Meshi, O.; Sontag, D.; Jaakkola, T.S.; Globerson, A. Learning Efficiently with Approximate Inference via Dual Losses. In Proceedings of the 27th ICML, Haifa, Israel, 21–24 June 2010; pp. 783–790. [Google Scholar]
  57. Balamurugan, P.; Shevade, S.K.; Sundararajan, S. A Simple Label Switching Algorithm for Semisupervised Structural SVMs. Neural Comput. 2015, 27, 2183–2206. [Google Scholar] [CrossRef]
  58. Shevade, S.K.; Balamurugan, P.; Sundararajan, S.; Keerthi, S.S. A Sequential Dual Method for Structural SVMs. In Proceedings of the Eleventh SIAM International Conference on Data Mining, SDM 2011, Mesa, AZ, USA, 28–30 April 2011; pp. 223–234. [Google Scholar]
  59. Taskar, B.; Lacoste-Julien, S.; Jordan, M.I. Structured Prediction, Dual Extragradient and Bregman Projections. J. Mach. Learn. Res. 2006, 7, 1627–1653. [Google Scholar]
  60. Nowozin, S.; Gehler, P.V.; Jancsary, J.; Lampert, C.H. Advanced Structured Prediction; The MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  61. Martins, A.F.T.; Figueiredo, M.A.T.; Aguiar, P.M.Q.; Smith, N.A.; Xing, E.P. An Augmented Lagrangian Approach to Constrained MAP Inference. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, WA, USA, 28 June–2 July 2011; Getoor, L., Scheffer, T., Eds.; Omnipress: Paraskevi, Greece, 2011; pp. 169–176. [Google Scholar]
  62. Batra, D.; Yadollahpour, P.; Guzmán-Rivera, A.; Shakhnarovich, G. Diverse M-Best Solutions in Markov Random Fields. In Proceedings of the Computer Vision-ECCV 2012-12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Part V. pp. 1–16. [Google Scholar]
  63. Guzmán-Rivera, A.; Kohli, P.; Batra, D. DivMCuts: Faster Training of Structural SVMs with Diverse M-Best Cutting-Planes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 316–324. [Google Scholar]
  64. Joachims, T.; Finley, T.; Yu, C.N. Cutting-Plane Training of Structural SVMs. Mach. Learn. 2009, 77, 27–59. [Google Scholar] [CrossRef] [Green Version]
  65. Kelley, J.E. The Cutting-Plane Method for Solving Convex Programs. J. Soc. Ind. Appl. Math. 1960, 8, 703–712. [Google Scholar] [CrossRef]
  66. Lacoste-Julien, S.; Jaggi, M.; Schmidt, M.W.; Pletscher, P. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. In Proceedings of the 30th ICML, Atlanta, GA, USA, 16–21 June 2013; pp. 53–61. [Google Scholar]
  67. Teo, C.H.; Vishwanathan, S.V.N.; Smola, A.J.; Le, Q.V. Bundle Methods for Regularized Risk Minimization. J. Mach. Learn. Res. 2010, 11, 311–365. [Google Scholar]
  68. Smola, A.J.; Vishwanathan, S.V.N.; Le, Q.V. Bundle Methods for Machine Learning. In Proceedings of the 21st NIPS, Vancouver, BC, Canada, 7–8 December 2007; pp. 1377–1384. [Google Scholar]
  69. Bauer, A.; Nakajima, S.; Müller, K.R. Efficient Exact Inference with Loss Augmented Objective in Structured Learning. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2566–2579. [Google Scholar] [CrossRef]
  70. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer, Inc.: Secaucus, NJ, USA, 2006. [Google Scholar]
  71. Lauritzen, S.L.; Spiegelhalter, D.J. Chapter Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. In Readings in Uncertain Reasoning; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1990; pp. 415–448. [Google Scholar]
  72. Johnson, M. PCFG Models of Linguistic Tree Representations. Comput. Linguist. 1998, 24, 613–632. [Google Scholar]
  73. Heemskerk, J.S. A Probabilistic Context-free Grammar for Disambiguation in Morphological Parsing. In Proceedings of the EACL, Utrecht, The Netherlands, 19–23 April 1993; pp. 183–192. [Google Scholar]
  74. Charniak, E. Statistical Parsing with a Context-Free Grammar and Word Statistics. In Proceedings of the 40th National Conference on Artificial Intelligence and 9th Innovative Applications of Artificial Intelligence, Providence, RI, USA, 27–31 July 1997; pp. 598–603. [Google Scholar]
  75. Rabiner, L.R. A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef] [Green Version]
  76. Koller, D. Probabilistic Relational Models. In Proceedings of the Inductive Logic Programming, 9th International Workshop, ILP-99, Bled, Slovenia, 24–27 June 1999; pp. 3–13. [Google Scholar]
  77. Taskar, B.; Abbeel, P.; Koller, D. Discriminative Probabilistic Models for Relational Data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Edmonton, AB, Canada, 11–15 July 2013. [Google Scholar]
  78. Richardson, M.; Domingos, P.M. Markov logic networks. Mach. Learn. 2006, 62, 107–136. [Google Scholar] [CrossRef] [Green Version]
  79. Komodakis, N.; Paragios, N.; Tziritas, G. MRF Energy Minimization and Beyond via Dual Decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 531–552. [Google Scholar] [CrossRef] [Green Version]
  80. Everingham, M.; Eslami, S.M.A.; Gool, L.J.V.; Williams, C.K.I.; Winn, J.M.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  81. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  82. He, T.; Chen, J.; Ma, L.; Gui, Z.; Li, F.; Shao, W.; Wang, Q. ROUGE-C: A Fully Automated Evaluation Method for Multi-document Summarization. In Proceedings of the 2008 IEEE International Conference on Granular Computing, GrC 2008, Hangzhou, China, 26–28 August 2008; pp. 269–274. [Google Scholar]
  83. Manning, C.D. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Computational Linguistics and Intelligent Text Processing, Proceedings of the 12th International Conference, CICLing 2011, Tokyo, Japan, 20–26 February 2011; Proceedings, Part I; Lecture Notes in Computer Science; Gelbukh, A.F., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6608, pp. 171–189. [Google Scholar]
  84. Tongchim, S.; Sornlertlamvanich, V.; Isahara, H. Experiments in Base-NP Chunking and Its Role in Dependency Parsing for Thai. In Proceedings of the 22nd COLING, Manchester, UK, 18–22 August 2008; pp. 123–126. [Google Scholar]
  85. Marcus, M.P.; Kim, G.; Marcinkiewicz, M.A.; MacIntyre, R.; Bies, A.; Ferguson, M.; Katz, K.; Schasberger, B. The Penn Treebank: Annotating Predicate Argument Structure. In Proceedings of the Human Language Technology, Plainsboro, NJ, USA, 8–11 March 1994. [Google Scholar]
  86. Joachims, T. A Support Vector Method for Multivariate Performance Measures. In Proceedings of the 22nd ICML, Bonn, Germany, 7–11 August 2005; pp. 377–384. [Google Scholar]
  87. Tarlow, D.; Swersky, K.; Zemel, R.S.; Adams, R.P.; Frey, B.J. Fast Exact Inference for Recursive Cardinality Models. In Proceedings of the 28th UAI, Catalina Island, CA, USA, 14–18 August 2012. [Google Scholar]
  88. Mezuman, E.; Tarlow, D.; Globerson, A.; Weiss, Y. Tighter Linear Program Relaxations for High Order Graphical Models. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), Bellevue, WA, USA, 11–15 July 2013. [Google Scholar]
  89. Kohlbrenner, M.; Bauer, A.; Nakajima, S.; Binder, A.; Samek, W.; Lapuschkin, S. Towards Best Practice in Explaining Neural Network Decisions with LRP. In Proceedings of the 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, UK, 19–24 July 2020; pp. 1–7. [Google Scholar]
  90. Schnake, T.; Eberle, O.; Lederer, J.; Nakajima, S.; Schütt, K.T.; Müller, K.; Montavon, G. Higher-Order Explanations of Graph Neural Networks via Relevant Walks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7581–7596. [Google Scholar] [CrossRef]
  91. Xiong, P.; Schnake, T.; Montavon, G.; Müller, K.R.; Nakajima, S. Efficient Computation of Higher-Order Subgraph Attribution via Message Passing. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 24478–24495. [Google Scholar]
Figure 2. Factor graph representation for an augmented objective $Q(y, L)$ for margin scaling (on the left) and for slack scaling (on the right). The auxiliary variables $L = (l_1, \dots, l_4)$ are marked in blue (except $l_4$ for slack scaling). $l_4$ is the hub node.
Figure 3. Illustration of the transformation process of a factor graph into a clique tree. The upper-left graph represents the original factor graph, which describes how the energy of the model without the global factor (see $F$ in the definition of Problem 1) factorizes over the individual variables. In the first step, the factor graph is transformed into an MRF, as shown in the upper-right graph. The lower-left graph shows the result of triangulating the MRF from the previous step. Finally, the lower-right graph shows the resulting cluster graph with cliques $C_1, \dots, C_6$ constructed from the triangulated MRF. The numbers within the cluster nodes denote the variable indices belonging to the corresponding clique (e.g., $C_5$ refers to $\{y_5, y_6, y_7\}$). The green edges indicate a valid spanning clique tree extracted from the cluster graph. The dark arrows represent one possible order of message passing if clique $C_6$ (marked blue) has been chosen as the root of the clique tree.
Figure 4. Illustration of an extreme example where $\nu$ can be linear in the graph size. The leftmost graph represents an MRF with $M = 7$ variables and treewidth $\tau = 1$. The graph in the middle shows a valid clique tree for the MRF on the left, where the clique $\{y_1, y_2\}$ has $M - 1$ neighbors, that is, $\nu$ is linear in the graph size for that clique tree. The rightmost graph represents another clique tree that has a chain form, where $\nu = \tau + 1 = 2$. The squared nodes denote the corresponding sepsets.
Figure 5. Illustration of a modification procedure to reduce the maximal number of neighbors $\nu$ for each cluster node in a given clique tree. The graph on the left represents an original clique tree. The only node with more than three neighbors is marked in red. We clone this cluster node multiple times so that each clone only carries one of the original neighbors. The clones are connected by a chain that preserves the running intersection property. The arrow in the left graph indicates the (arbitrarily chosen) order of processing the neighbor nodes. The graph resulting from this transformation is shown on the right. The clones in the new graph are marked in gray. To ensure that the new cluster graph describes the same set of potentials, we set the potentials for the copies of each cloned cluster node $C_i$ to zero: $f_{C_i}(y_{C_i}) = 0$ and $g_{C_i}(y_{C_i}) = 0$. This procedure reduces the $\nu$-parameter to a constant ($\nu = 3$), significantly reducing the computational cost.
Figure 6. Empirical evaluation of the run-time performance of loss-augmented inference for part-of-speech tagging, base-NP chunking, and constituency-parsing tasks.